Knowledge-Based Systems 183 (2019) 104858
Robust Bhattacharyya bound linear discriminant analysis through an adaptive algorithm
Chun-Na Li (a), Yuan-Hai Shao (a,*), Zhen Wang (b), Nai-Yang Deng (c), Zhi-Min Yang (d)

(a) School of Management, Hainan University, Haikou, 570228, PR China
(b) School of Mathematical Sciences, Inner Mongolia University, Hohhot, 010021, PR China
(c) College of Science, China Agricultural University, Beijing, 100083, PR China
(d) Zhijiang College, Zhejiang University of Technology, Hangzhou, 310024, PR China
(*) Corresponding author: Yuan-Hai Shao.
Article info

Article history: Received 24 January 2019; Received in revised form 12 June 2019; Accepted 17 July 2019; Available online 19 July 2019.

Keywords: Dimensionality reduction; Linear discriminant analysis; Robust linear discriminant analysis; Bhattacharyya error bound; Alternating direction method of multipliers
Abstract

In this paper, we propose novel linear discriminant analysis (LDA) criteria via Bhattacharyya error bound estimation, based on the L1-norm (L1BLDA) and the L2-norm (L2BLDA). Both L1BLDA and L2BLDA maximize the between-class scatter, measured by the weighted pairwise distances of class means, and meanwhile minimize the within-class scatter, under the L1-norm and L2-norm, respectively. The proposed models avoid the small sample size (SSS) problem and the rank limit that LDA may encounter. It is worth mentioning that the employment of the L1-norm gives L1BLDA a robust performance, and L1BLDA is solved through an effective non-greedy alternating direction method of multipliers (ADMM), where all the projection vectors are obtained at once. In addition, the weighting constants of L1BLDA and L2BLDA between the between-class and within-class terms are determined by the involved data, which makes our L1BLDA and L2BLDA more adaptive. Experimental results on benchmark data sets as well as handwritten digit databases demonstrate the effectiveness of the proposed methods.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

Linear discriminant analysis (LDA) [1,2] is a well-known supervised dimensionality reduction method, and has been extensively studied since it was proposed. LDA tries to find an optimal linear transformation by maximizing the quadratic distance between the class means while simultaneously minimizing the within-class distance in the projected space. Due to its simplicity and effectiveness, LDA is widely applied in many applications, including image recognition [3–6], gene expression [7], biological populations [8], image retrieval [9], etc.

Despite the popularity of LDA, there exist some drawbacks that restrict its applications. As we know, LDA is solved through a generalized eigenvalue problem $S_b w = \lambda S_w w$, where $S_b$ and $S_w$ are the classical between-class and within-class scatter matrices, respectively. When coping with the SSS problem, $S_w$ is not of full rank and LDA encounters singularity. Moreover, since LDA is constructed based on the L2-norm, it is sensitive to the presence of outliers. These issues make LDA non-robust. In addition, since the rank of $S_b$ is at most $c-1$, where $c$ is the number of classes, LDA can find at most $c-1$ meaningful features, which is also a limitation.

To address the non-robustness issue, many endeavors have been made from different aspects, including using the null space information [10,11], the subspace learning technique [3,9,12], the regularization technique [7,13], incorporating a model of data uncertainty in the classification and optimizing for the worst case [14], utilizing the pseudo inverse of $S_w$ [15], and using robust mean and scatter variance estimators [16,17]. For the rank limit issue, incorporating local information [18] and the recursive technique [19–22] were usually considered. Recently, the employment of the L1-norm rather than the L2-norm in LDA has been studied to cope with the non-robustness and rank limit problems. Li et al. [23] considered a rotational invariant L1-norm (R1-norm) based LDA, while the L1-norm based LDA-L1 [24–26], ILDA-L1 [27], L1-LDA [28], L1-ELDA [29], RLDA [30] and RSLDA [30] were also put forward, where their scatter matrices are measured by the R1-norm and L1-norm, respectively. The matrix based LDA-L1 was further raised and studied [31–33]. The extension to Lp-norm (p > 0) scatter covariances was also used in LDA [34,35]. However, as pointed out in [29], some of the above methods are still not robust enough.
Fig. 1. Artificial data set and its projection directions obtained by LDA, L2BLDA, and L1BLDA.
As we know, minimizing the Bhattacharyya error [36] bound is a reasonable way to establish classification [2,37]. In this paper, based on the Bhattacharyya error bound, a novel robust L1-norm linear discriminant analysis (L1BLDA) and its corresponding L2-norm criterion (L2BLDA) are proposed. Both of them can avoid the singularity and rank limit issues, and the employment of the L1-norm makes our L1BLDA more robust. In summary, the proposed L1BLDA and L2BLDA have the following characteristics:

• Both L1BLDA and L2BLDA are derived by minimizing the Bhattacharyya error bound, which ensures the rationality of the proposed methods. Specifically, we prove that the upper bound of the Bhattacharyya error can be expressed linearly by the between-class scatter and the within-class scatter, so that minimizing this upper bound leads to an optimization problem of the form $\min\; -S_{bB} + D\cdot S_{wB}$, where $S_{bB}$ is the between-class scatter, $S_{wB}$ is the within-class scatter, and $D$ weights $S_{bB}$ against $S_{wB}$. In particular, it should be pointed out that the weight $D$ is calculated based on the input data, so that our models adapt to different data sets.

• For the between-class and within-class scatters, L1BLDA uses the L1-norm (LASSO) loss $f_1(a)=|a|$, while the scatters of both L2BLDA and LDA are described by the L2-norm (square) loss $f_2(a)=|a|^2$. It is obvious that the difference between $f_2(a)$ and $f_1(a)$ becomes larger as $|a|$ gets larger, so we expect L1BLDA to be more robust than L2BLDA and LDA when the data set contains outliers. To verify the robustness of L1BLDA, we perform an experiment on a simple data set with four classes. The first class contains 120 data samples, while each of the other three classes contains 30 data samples. We apply LDA, our L1BLDA and our L2BLDA on the data set and obtain the one-dimensional projected data, as shown in Fig. 1. Then two additional outliers are added to the above data for testing. It is obvious that for our L1BLDA, the outliers have little influence on its projection direction, and the projected samples are separated well. On the contrary, LDA and L2BLDA are greatly affected by the outliers.

• Two non-greedy adaptive algorithms are proposed for the optimization problems of L1BLDA and L2BLDA, respectively: (i) L1BLDA is solved by an effective ADMM algorithm, which produces the whole projection matrix at once without recursively solving for single projection vectors. Compared with the traditional recursive algorithms for L1-norm based LDA, our ADMM approach maintains the orthogonality and normalization of the projection directions; (ii) L2BLDA is solved through a standard eigenvalue decomposition that does not involve a matrix inversion, rather than the generalized eigenvalue decomposition used in LDA.

• Our L1BLDA and L2BLDA can avoid the singularity caused by the SSS problem. Moreover, L1BLDA does not have the rank limit issue.

The notation of the paper is given as follows. All vectors are column vectors. Given the training set $T=\{x_1,x_2,\ldots,x_N\}$ with the associated class labels $y_1,y_2,\ldots,y_N$ belonging to $\{1,2,\ldots,c\}$, where $x_l\in\mathbb R^n$ for $l=1,2,\ldots,N$, denote $X=(x_1,x_2,\ldots,x_N)\in\mathbb R^{n\times N}$ as the data matrix. Assume that the $i$th class contains $N_i$ samples, so that $N=\sum_{i=1}^{c}N_i$. Let $\bar x=\frac1N\sum_{l=1}^{N}x_l$ be the mean of all samples and $\bar x_i=\frac{1}{N_i}\sum_{s=1}^{N_i}x_s^i$ be the mean of the samples in the $i$th class, $i=1,2,\ldots,c$, where $x_s^i$ is the $s$th sample of the $i$th class. For the $l$th sample $x_l$, denote $\bar x_{t_l}$ as the center of the class that $x_l$ belongs to, $l=1,2,\ldots,N$.

The paper is organized as follows. Section 2 briefly reviews LDA. Sections 3 and 4 elaborate on our L2BLDA and L1BLDA, respectively. Section 5 compares the proposed methods with their related methods. At last, concluding remarks are given in Section 6.
2. Linear discriminant analysis

The main idea of LDA is to find an optimal projection transformation matrix $W$ such that the ratio of between-class scatter to within-class scatter is maximized in the projected space of $W\in\mathbb R^{n\times d}$, $d\le n$. Specifically, LDA solves the following optimization problem

$$\max_{W}\ \frac{\operatorname{tr}(W^T S_b W)}{\operatorname{tr}(W^T S_w W)}, \tag{1}$$

where the between-class scatter matrix $S_b$ and the within-class scatter matrix $S_w$ are defined by

$$S_b=\frac1N\sum_{i=1}^{c}N_i(\bar x_i-\bar x)(\bar x_i-\bar x)^T \tag{2}$$

and

$$S_w=\frac1N\sum_{i=1}^{c}\sum_{s=1}^{N_i}(x_s^i-\bar x_i)(x_s^i-\bar x_i)^T. \tag{3}$$

The optimization problem (1) is equivalent to the generalized eigenvalue problem $S_b w=\lambda S_w w$ with $\lambda\neq0$, and its solution $W=(w_1,\ldots,w_d)$ is given by the eigenvectors corresponding to the first $d$ largest eigenvalues of $(S_w)^{-1}S_b$ in case $S_w$ is nonsingular. Since the rank of $S_b$ is at most $c-1$, the number of extracted features is less than or equal to $c-1$.
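As a concrete reference for this classical procedure, the following is a minimal sketch in Python/NumPy (the paper's experiments use Matlab; this sketch and all of its names are illustrative only), assuming a data matrix X of size n × N whose columns are the training samples and a label vector y:

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, d):
    """Classical LDA: X is n x N (columns are samples), y holds class labels."""
    n, N = X.shape
    x_bar = X.mean(axis=1, keepdims=True)             # overall mean
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[:, y == c]
        xc = Xc.mean(axis=1, keepdims=True)           # class mean
        Sb += Xc.shape[1] * (xc - x_bar) @ (xc - x_bar).T
        Sw += (Xc - xc) @ (Xc - xc).T
    Sb /= N
    Sw /= N
    # generalized eigenproblem S_b w = lambda S_w w (S_w assumed nonsingular)
    evals, evecs = eigh(Sb, Sw)
    order = np.argsort(evals)[::-1]                   # largest eigenvalues first
    return evecs[:, order[:d]]                        # n x d projection matrix
```

Note that at most c − 1 of the returned directions are meaningful, which is exactly the rank limit discussed above.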
3. L2-norm linear discriminant analysis criterion via the Bhattacharyya error bound estimation

The error probability minimization is a natural way to obtain dimensionality reduction for classification, which involves the maximization of probabilistic distance measures and probabilistic dependence measures between different classes [2,3,37–41]. Since the Bayes classifier is the best classifier which minimizes the probability of classification error, minimizing its error rate (called the Bayes error or probability of misclassification [2]) is expected to lead to a good classification model. The Bayes error is defined as

$$\epsilon=1-\int\max_{i\in\{1,2,\ldots,c\}}\{P_i\,p_i(x)\}\,dx, \tag{4}$$

where $x\in\mathbb R^n$ is a sample vector, and $P_i$ and $p_i(x)$ are the prior probability and the probability density function of the $i$th class for the data set $T$, respectively. The computation of the Bayes error is a very difficult task except in some special cases, and an alternative way of minimizing the Bayes error is to minimize its upper bound [2,42–45]. In particular, the Bhattacharyya error [36] is a widely used upper bound that provides a close bound to the Bayes error. In the following, we derive a novel L2-norm linear discriminant analysis criterion via the Bhattacharyya error bound estimation, named L2BLDA, and give its solving algorithm.
The Bhattacharyya error bound is given by

$$\epsilon_B=\sum_{i<j}^{c}\sqrt{P_iP_j}\int\sqrt{p_i(x)p_j(x)}\,dx. \tag{5}$$

We now derive an upper bound of $\epsilon_B$ under some assumptions.

Proposition 1. Assume $P_i$ and $p_i(x)$ are the prior probability and the probability density function of the $i$th class for the training data set $T$, respectively, and the data samples in each class are independent identically normally distributed. Let $p_1(x),p_2(x),\ldots,p_c(x)$ be the Gaussian functions given by $p_i(x)=\mathcal N(x\,|\,\bar x_i,\Sigma_i)$, where $\bar x_i$ and $\Sigma_i$ are the class mean and the class covariance matrix, respectively. We further suppose $\Sigma_i=\Sigma$, $i=1,2,\ldots,c$, and that $\bar x_i$ and $\Sigma_i$ can be estimated accurately from the training data set $T$. Then for an arbitrary projection vector $w\in\mathbb R^n$, the Bhattacharyya error bound $\epsilon_B$ defined by (5) on the data set $\tilde T=\{\tilde x_i\,|\,\tilde x_i=w^Tx_i\}$ satisfies the following:

$$\epsilon_B\le-\frac N8\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|w^T(\bar x_i-\bar x_j)\|_2^2+\frac N8\,\Delta\,\|w^TX-w^Tx_I\|_2^2+\sum_{i<j}^{c}\sqrt{P_iP_j}, \tag{6}$$

where $x_I=(\bar x_{t_1},\ldots,\bar x_{t_l},\ldots,\bar x_{t_N})\in\mathbb R^{n\times N}$ and $\Delta=\frac14\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|\bar x_i-\bar x_j\|_2^2$.

Proof. We first note that $p_i(\tilde x)=\mathcal N(\tilde x\,|\,\tilde{\bar x}_i,\sigma_i)$, where $\tilde{\bar x}_i=w^T\bar x_i$ is the $i$th class mean and $\sigma_i$ is the $i$th class standard deviation in the one-dimensional projected space. Then we have [2]

$$\int\sqrt{p_i(\tilde x)p_j(\tilde x)}\,d\tilde x=e^{-\frac{(\tilde{\bar x}_i-\tilde{\bar x}_j)^2}{8\sigma^2}}. \tag{7}$$

Since

$$\sigma^2=\frac1N\sum_{i=1}^{c}\sum_{k=1}^{N_i}\big(w^T(x_k^i-\bar x_i)\big)^2=\frac1N\sum_{l=1}^{N}(w^Tx_l-w^T\bar x_{t_l})^2=\frac1N\|w^TX-w^Tx_I\|_2^2, \tag{8}$$

we have

$$\begin{aligned}\epsilon_B&\le\sum_{i<j}^{c}\sqrt{P_iP_j}\,e^{-\frac18\frac{(\tilde{\bar x}_i-\tilde{\bar x}_j)^2}{\sigma^2}}=\sum_{i<j}^{c}\sqrt{P_iP_j}\,e^{-\big(\frac{\tilde{\bar x}_i-\tilde{\bar x}_j}{2\sqrt2\,\sigma}\big)^2}\\&\le\sum_{i<j}^{c}\sqrt{P_iP_j}\Big(1-\Big(\frac{\tilde{\bar x}_i-\tilde{\bar x}_j}{2\sqrt2\,\sigma}\Big)^2\Big)\\&=\sum_{i<j}^{c}\sqrt{P_iP_j}-\frac N8\sum_{i<j}^{c}\sqrt{P_iP_j}\cdot\frac{\|w^T(\bar x_i-\bar x_j)\|_2^2}{\|w^TX-w^Tx_I\|_2^2}\\&\le\sum_{i<j}^{c}\sqrt{P_iP_j}-\frac N8\sum_{i<j}^{c}\sqrt{P_iP_j}\cdot\|w^T(\bar x_i-\bar x_j)\|_2^2+\frac N8\sum_{i<j}^{c}\sqrt{P_iP_j}\cdot\Delta'_{ij}\,\|w^TX-w^Tx_I\|_2^2,\end{aligned} \tag{9}$$

where $\Delta'_{ij}\ge\frac14\|\bar x_i-\bar x_j\|_2^2$. The second inequality of (9) holds by the fact that $e^{-x^2}\le1-x^2$ for $x\ge0$. For the last inequality, since $\|w^T(\bar x_i-\bar x_j)\|_2^2\le\|w\|_2^2\cdot\|\bar x_i-\bar x_j\|_2^2=\|\bar x_i-\bar x_j\|_2^2$ and $\frac{1}{\|w^TX-w^Tx_I\|_2^2}\Big(1-\frac{1}{\|w^TX-w^Tx_I\|_2^2}\Big)\le\frac14$, we have

$$\|w^T(\bar x_i-\bar x_j)\|_2^2\cdot\frac{1}{\|w^TX-w^Tx_I\|_2^2}\Big(1-\frac{1}{\|w^TX-w^Tx_I\|_2^2}\Big)\le\frac14\|\bar x_i-\bar x_j\|_2^2, \tag{10}$$

which implies

$$\frac{\|w^T(\bar x_i-\bar x_j)\|_2^2}{\|w^TX-w^Tx_I\|_2^2}\ge\|w^T(\bar x_i-\bar x_j)\|_2^2-\frac14\|\bar x_i-\bar x_j\|_2^2\cdot\|w^TX-w^Tx_I\|_2^2\ge\|w^T(\bar x_i-\bar x_j)\|_2^2-\Delta'_{ij}\cdot\|w^TX-w^Tx_I\|_2^2, \tag{11}$$

and hence (9). By taking $\Delta=\sum_{i<j}^{c}\sqrt{P_iP_j}\,\Delta'_{ij}=\frac14\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|\bar x_i-\bar x_j\|_2^2$, we obtain (6). □
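To illustrate how the data-determined quantities in Proposition 1 can be evaluated in practice, the sketch below computes Δ and the class-center matrix x_I from a training set, estimating P_i by N_i/N. It is an illustrative Python/NumPy sketch, not the authors' code, and every name in it is an assumption:

```python
import numpy as np

def bhattacharyya_weight(X, y):
    """Compute Delta = (1/4) * sum_{i<j} sqrt(P_i P_j) ||xbar_i - xbar_j||_2^2
    and the class-center matrix x_I (same shape as X); P_i is estimated as N_i/N."""
    n, N = X.shape
    classes = list(np.unique(y))
    means = {c: X[:, y == c].mean(axis=1) for c in classes}
    priors = {c: np.sum(y == c) / N for c in classes}
    x_I = np.column_stack([means[c] for c in y])       # center of each sample's own class
    delta = 0.0
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            ci, cj = classes[a], classes[b]
            diff = means[ci] - means[cj]
            delta += np.sqrt(priors[ci] * priors[cj]) * (diff @ diff)
    return 0.25 * delta, x_I
```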
We are now ready to derive our optimization problem from Proposition 1. As stated before, we try to minimize the Bhattacharyya error bound. Therefore, by minimizing the right side of (6), neglecting the third constant term $\sum_{i<j}^{c}\sqrt{P_iP_j}$ and estimating $P_i$ by $N_i/N$, we obtain

$$\min_{w}\ -\frac1N\sum_{i<j}^{c}\sqrt{N_iN_j}\,\|w^T(\bar x_i-\bar x_j)\|_2^2+\Delta\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|w^T(x_s^i-\bar x_i)\|_2^2\quad\text{s.t. } w^Tw=1, \tag{12}$$

where $\Delta=\frac14\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|\bar x_i-\bar x_j\|_2^2$.
The above optimization problem seeks a single projection direction. For a projection matrix $W=(w_1,\ldots,w_d)$, it extends to our L2BLDA problem

$$\min_{W}\ -\frac1N\sum_{i<j}^{c}\sqrt{N_iN_j}\,\|W^T(\bar x_i-\bar x_j)\|_2^2+\Delta\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|W^T(x_s^i-\bar x_i)\|_2^2\quad\text{s.t. } W^TW=I, \tag{13}$$

where $W\in\mathbb R^{n\times d}$, $d\le n$. The geometric meaning of L2BLDA is clear. By minimizing the first term in (13), the sum of the distances in the projected space between the centroid of the $i$th class and the centroids not in the $i$th class is made as large as possible. Minimizing the second term in (13) makes every sample close to its own class centroid in the low-dimensional space. The coefficients $\frac1N\sqrt{N_iN_j}$ in the first term weight the distance pairs between the $i$th and the $j$th class, while the constant $\Delta$ in front of the second term plays the balancing role and also ensures the minimum error bound. The constraint $W^TW=I$ forces the obtained discriminant directions to be orthonormal to each other, which ensures minimum redundancy in the projected space.

L2BLDA can be solved through a simple standard eigenvalue decomposition. In fact, problem (13) can be rewritten as

$$\min_{W}\ \operatorname{tr}(W^TSW)\quad\text{s.t. } W^TW=I, \tag{14}$$

where

$$S=-\frac1N\sum_{i<j}^{c}\sqrt{N_iN_j}\,(\bar x_i-\bar x_j)(\bar x_i-\bar x_j)^T+\Delta\sum_{i=1}^{c}\sum_{s=1}^{N_i}(x_s^i-\bar x_i)(x_s^i-\bar x_i)^T. \tag{15}$$

Then $W=(w_1,w_2,\ldots,w_d)$ is obtained as the $d$ orthonormal eigenvectors corresponding to the $d$ smallest eigenvalues of $S$.
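A minimal illustrative sketch of this L2BLDA solver (build S as in (15), then take the d smallest eigenvectors) is given below; Python/NumPy is assumed, all names are illustrative, and Δ is supplied by a routine such as the one sketched after Proposition 1:

```python
import numpy as np

def l2blda(X, y, d, delta):
    """Solve (14): build S as in (15) and return the d smallest eigenvectors of S."""
    n, N = X.shape
    classes = list(np.unique(y))
    counts = {c: np.sum(y == c) for c in classes}
    means = {c: X[:, y == c].mean(axis=1, keepdims=True) for c in classes}
    S = np.zeros((n, n))
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            ci, cj = classes[a], classes[b]
            diff = means[ci] - means[cj]
            S -= np.sqrt(counts[ci] * counts[cj]) / N * (diff @ diff.T)   # between-class term
    for c in classes:
        Xc = X[:, y == c] - means[c]
        S += delta * (Xc @ Xc.T)                       # within-class term weighted by Delta
    evals, evecs = np.linalg.eigh(S)                   # eigenvalues in ascending order
    return evecs[:, :d]                                # d smallest -> orthonormal W
```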
4. L1-norm linear discriminant analysis criterion via the Bhattacharyya error bound estimation

4.1. L1BLDA Bhattacharyya error bound derivation

In this section, we derive a different upper bound of the Bhattacharyya error under the L1-norm measure, aiming to construct a robust L1-norm based Bhattacharyya bound linear discriminant analysis (L1BLDA). Similar to L2BLDA, we first give the following proposition.

Proposition 2. Assume $P_i$ and $p_i(x)$ are the prior probability and the probability density function of the $i$th class for the training data set $T$, respectively, and the data samples in each class are independent identically normally distributed. Let $p_1(x),p_2(x),\ldots,p_c(x)$ be the Gaussian functions given by $p_i(x)=\mathcal N(x\,|\,\bar x_i,\Sigma_i)$, where $\bar x_i$ and $\Sigma_i$ are the class mean and the class covariance matrix, respectively. We further suppose $\Sigma_i=\Sigma$, $i=1,2,\ldots,c$, and that $\bar x_i$ and $\Sigma_i$ can be estimated accurately from the training data set $T$. Then for an arbitrary projection vector $w\in\mathbb R^n$, there exist constants $B$ and $C$ such that the Bhattacharyya error bound $\epsilon_B$ defined by (5) on the data set $\tilde T=\{\tilde x_i\,|\,\tilde x_i=w^Tx_i\}$ satisfies the following:

$$\epsilon_B\le-B\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|w^T(\bar x_i-\bar x_j)\|_1+B\,\Omega\,\|w^TX-w^Tx_I\|_1+C\sum_{i<j}^{c}\sqrt{P_iP_j}, \tag{16}$$

where $\Omega=\frac{\sqrt n}{4}\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|\bar x_i-\bar x_j\|_1$.
Proof. From (7), we have

$$\begin{aligned}\epsilon_B&\le\sum_{i<j}^{c}\sqrt{P_iP_j}\,e^{-\frac18\frac{(\tilde{\bar x}_i-\tilde{\bar x}_j)^2}{\sigma^2}}\le\sum_{i<j}^{c}\sqrt{P_iP_j}\,e^{-\frac18\frac{(w^T\bar x_i-w^T\bar x_j)^2}{\frac1N\sum_{l=1}^{N}(w^Tx_l-w^T\bar x_{t_l})^2}}\\&=\sum_{i<j}^{c}\sqrt{P_iP_j}\,e^{-\frac N8\big(\frac{\|w^T(\bar x_i-\bar x_j)\|_2}{\|w^TX-w^Tx_I\|_2}\big)^2}\le\sum_{i<j}^{c}\sqrt{P_iP_j}\,e^{-\frac N8\big(\frac{\|w^T(\bar x_i-\bar x_j)\|_1}{\|w^TX-w^Tx_I\|_1}\big)^2}.\end{aligned} \tag{17}$$

Here the last inequality of (17) follows from the fact that

$$\frac{\|w^T(\bar x_i-\bar x_j)\|_2}{\|w^TX-w^Tx_I\|_2}=\frac{\|w^T(\bar x_i-\bar x_j)\|_1}{\|w^TX-w^Tx_I\|_2}\ge\frac{\|w^T(\bar x_i-\bar x_j)\|_1}{\|w^TX-w^Tx_I\|_1}. \tag{18}$$
We now derive a linear upper bound $g(x)$ of the right side of (17). Denote $h(x)=e^{-bx^2}$, $0\le x\le a$, $b>0$. It is easy to see that $h(x)$ is concave when $0\le x\le\frac1{\sqrt{2b}}$ and convex when $x\ge\frac1{\sqrt{2b}}$. Therefore, when $x\ge\frac1{\sqrt{2b}}$, the linear function passing through $\big(\frac1{\sqrt{2b}},h(\frac1{\sqrt{2b}})\big)$ and $(a,h(a))$ is the tightest linear upper bound of $h(x)$. When $0\le x\le\frac1{\sqrt{2b}}$, it is obvious that there exists some constant $s>0$ such that this line shifted up by $s$ is tangent to $h(x)$ and is also a linear upper bound of $h(x)$.

In summary, if we define

$$g(x)=\begin{cases}-Ex+C,&\text{if }x\ge\frac1{\sqrt{2b}},\\-Ex+C+s,&\text{if }0\le x<\frac1{\sqrt{2b}},\end{cases} \tag{19}$$

where $E=\dfrac{e^{-\frac1{2b}}-e^{-ba^2}}{a-\frac1{\sqrt{2b}}}$, $C=e^{-\frac1{2b}}+\dfrac1{\sqrt{2b}}\cdot\dfrac{e^{-\frac1{2b}}-e^{-ba^2}}{a-\frac1{\sqrt{2b}}}$ and $b=\dfrac N8$, then by combining (17) we have

$$\begin{aligned}\epsilon_B&\le\sum_{i<j}^{c}\sqrt{P_iP_j}\;g\!\left(\frac{\|w^T(\bar x_i-\bar x_j)\|_1}{\|w^TX-w^Tx_I\|_1}\right)=\sum_{i<j}^{c}\sqrt{P_iP_j}\left(-B\,\frac{\|w^T(\bar x_i-\bar x_j)\|_1}{\|w^TX-w^Tx_I\|_1}+C\right)\\&\le-B\sum_{i<j}^{c}\sqrt{P_iP_j}\cdot\|w^T(\bar x_i-\bar x_j)\|_1+B\,\Omega\,\|w^TX-w^Tx_I\|_1+C\sum_{i<j}^{c}\sqrt{P_iP_j}, \end{aligned} \tag{20}$$

where $B=E$, $\Omega=\frac{\sqrt n}{4}\sum_{i<j}^{c}\sqrt{P_iP_j}\,\|\bar x_i-\bar x_j\|_1$, and the last inequality holds similarly as in Proposition 1. This establishes (16). □

Therefore, by minimizing the upper bound (16) of the Bhattacharyya error, we get the formulation of L1BLDA as

$$\min_{W}\ -\frac1N\sum_{i<j}^{c}\sqrt{N_iN_j}\,\|W^T(\bar x_i-\bar x_j)\|_1+\Omega\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|W^T(x_s^i-\bar x_i)\|_1\quad\text{s.t. } W^TW=I, \tag{21}$$

where $W\in\mathbb R^{n\times d}$, $d\le n$.

Problem (21) expresses an idea similar to L2BLDA, but with the L2-norm terms in L2BLDA replaced by L1-norm ones and with a different weighting constant. Therefore, minimizing (21) makes the sum of the distances in the projected space between the centroid of the $i$th class and the centroids not in the $i$th class, measured under the L1-norm, as large as possible, and also ensures that each sample is close to its own class centroid. As in L2BLDA, the coefficients in the first term of L1BLDA weight the distances between different classes, and $\Omega$ weights the between-class and the within-class scatters. The orthonormal constraint $W^TW=I$ again guarantees minimum redundancy in the projected space.

4.2. L1BLDA algorithm

We now give the solving algorithm of L1BLDA. As we see, the non-smoothness and non-convexity of the objective of L1BLDA (21), together with the orthonormal constraint, make L1BLDA difficult to solve by traditional optimization techniques. Here we give an ADMM algorithm to solve it. To apply ADMM, we first transfer (21) into the following ADMM form

$$\begin{aligned}\min_{W,\,B_{ij},\,Z_{is},\,D}\ &-\sum_{i<j}^{c}\|B_{ij}\|_1+\Omega\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|Z_{is}\|_1+\ell(W)\\\text{s.t. }\ &\tfrac1N\sqrt{N_iN_j}\,W^T(\bar x_i-\bar x_j)-B_{ij}=0,\quad i<j,\\&W^T(x_s^i-\bar x_i)-Z_{is}=0,\quad i=1,\ldots,c,\ s=1,\ldots,N_i,\\&D-W=0,\end{aligned} \tag{22}$$

where $B_{ij},Z_{is}\in\mathbb R^d$, $W,D\in\mathbb R^{n\times d}$, and

$$\ell(W)=\begin{cases}0,&W^TW=I,\\+\infty,&\text{otherwise.}\end{cases}$$

Then the augmented Lagrangian is given by

$$\begin{aligned}L_\rho(W,B_{ij},Z_{is},D;u_{ij},v_{is},Q)=&-\sum_{i<j}^{c}\|B_{ij}\|_1+\Omega\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|Z_{is}\|_1+\ell(W)\\&+\sum_{i<j}^{c}u_{ij}^T\Big(\tfrac1N\sqrt{N_iN_j}\,W^T(\bar x_i-\bar x_j)-B_{ij}\Big)+\sum_{i=1}^{c}\sum_{s=1}^{N_i}v_{is}^T\big(W^T(x_s^i-\bar x_i)-Z_{is}\big)+\langle Q,\,D-W\rangle\\&+\frac\rho2\sum_{i<j}^{c}\Big\|\tfrac1N\sqrt{N_iN_j}\,W^T(\bar x_i-\bar x_j)-B_{ij}\Big\|_2^2+\frac\rho2\sum_{i=1}^{c}\sum_{s=1}^{N_i}\big\|W^T(x_s^i-\bar x_i)-Z_{is}\big\|_2^2+\frac\rho2\|D-W\|_F^2,\end{aligned} \tag{23}$$

where $u_{ij},v_{is}\in\mathbb R^d$ and $Q\in\mathbb R^{n\times d}$ are dual variables for $i=1,\ldots,c$, $s=1,\ldots,N_i$, and $\rho>0$ is the penalty parameter. $\|\cdot\|_F^2$ is the square of the Frobenius norm of a matrix, defined as the sum of the squares of its elements. $\langle\cdot,\cdot\rangle$ denotes the inner product, where for two matrices $P=(p_{ij})$ and $H=(h_{ij})$ of the same size, $\langle P,H\rangle=\sum_{i,j}p_{ij}h_{ij}$. By letting $\alpha_{ij}=u_{ij}/\rho$, $\beta_{is}=v_{is}/\rho$ and $\Gamma=Q/\rho$, the scaled form of the Lagrangian (23) can be written as (dropping the terms that only contain $\alpha_{ij}$, $\beta_{is}$ or $\Gamma$)

$$\begin{aligned}L_\rho(W,B_{ij},Z_{is},D;\alpha_{ij},\beta_{is},\Gamma)=&-\sum_{i<j}^{c}\|B_{ij}\|_1+\Omega\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|Z_{is}\|_1+\ell(W)\\&+\frac\rho2\sum_{i<j}^{c}\Big\|\tfrac1N\sqrt{N_iN_j}\,W^T(\bar x_i-\bar x_j)-B_{ij}+\alpha_{ij}\Big\|_2^2\\&+\frac\rho2\sum_{i=1}^{c}\sum_{s=1}^{N_i}\big\|W^T(x_s^i-\bar x_i)-Z_{is}+\beta_{is}\big\|_2^2+\frac\rho2\|D-W+\Gamma\|_F^2.\end{aligned} \tag{24}$$
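For concreteness, the sketch below assembles the scaled augmented Lagrangian (24) for given primal and scaled dual variables. It is an illustrative Python/NumPy sketch, not the authors' implementation; storing B, Z and the scaled duals in dictionaries keyed by class pairs and by (class, sample) indices is an assumption made only for readability:

```python
import numpy as np

def scaled_lagrangian(X, y, W, B, Z, D, alpha, beta, Gamma, omega, rho):
    """Evaluate (24); B/alpha are dicts keyed by class pairs (ci, cj),
    Z/beta are dicts keyed by (class, sample-index-within-class)."""
    n, N = X.shape
    classes = list(np.unique(y))
    means = {c: X[:, y == c].mean(axis=1) for c in classes}
    counts = {c: np.sum(y == c) for c in classes}
    # l(W): 0 if W has orthonormal columns, +inf otherwise
    val = 0.0 if np.allclose(W.T @ W, np.eye(W.shape[1]), atol=1e-6) else np.inf
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            ci, cj = classes[a], classes[b]
            r = np.sqrt(counts[ci] * counts[cj]) / N * (W.T @ (means[ci] - means[cj]))
            val += -np.sum(np.abs(B[(ci, cj)]))                          # -||B_ij||_1
            val += rho / 2 * np.sum((r - B[(ci, cj)] + alpha[(ci, cj)]) ** 2)
    for c in classes:
        Xc = X[:, y == c]
        for s in range(Xc.shape[1]):
            r = W.T @ (Xc[:, s] - means[c])
            val += omega * np.sum(np.abs(Z[(c, s)]))                     # Omega*||Z_is||_1
            val += rho / 2 * np.sum((r - Z[(c, s)] + beta[(c, s)]) ** 2)
    val += rho / 2 * np.sum((D - W + Gamma) ** 2)                        # (rho/2)||D-W+Gamma||_F^2
    return val
```

The ADMM iteration then alternately minimizes this function over W, B, Z and D and updates the scaled duals, as summarized in Algorithm 1 below.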
Table 1. Computational complexity of LDA-related methods.

Method       Computing complexity
LDA          O((N + c + 1)n)
LDA-L1       O(T(N + c)n)
L1-LDA       O(T(n^3 + n^2(N + c) + nc))
RLDA/RSLDA   O(n^3 + T_0 N c^2)
L2BLDA       O(n^3)
L1BLDA       O(n^3) or O(n^3 + T_0 n^2 d)

Then the ADMM algorithm for problem (22) can be presented as Algorithm 1.

Algorithm 1 Scaled ADMM algorithm for problem (22).
Input: Data set $T=\{(x_1,y_1),\ldots,(x_N,y_N)\}$; positive tolerances $\epsilon^{pri}$ and $\epsilon^{dual}$; maximum iteration number ItMax. Set the iteration number $k=0$ and initialize $D^0\in\mathbb R^{n\times d}$, $B_{ij}^0\in\mathbb R^d$, $Z_{is}^0\in\mathbb R^d$, $\alpha_{ij}^0,\beta_{is}^0\in\mathbb R^d$ and $\Gamma^0\in\mathbb R^{n\times d}$, $i=1,\ldots,c$, $s=1,\ldots,N_i$.
Process: while $k<$ ItMax
(a) $W^{k+1}=\arg\min_{W}L_\rho(W,B_{ij}^k,Z_{is}^k,D^k;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)$;
(b) $B_{ij}^{k+1}=\arg\min_{B_{ij}}L_\rho(W^{k+1},B_{ij},Z_{is}^k,D^k;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)$;
(c) $Z_{is}^{k+1}=\arg\min_{Z_{is}}L_\rho(W^{k+1},B_{ij}^{k+1},Z_{is},D^k;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)$;
(d) $D^{k+1}=\arg\min_{D}L_\rho(W^{k+1},B_{ij}^{k+1},Z_{is}^{k+1},D;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)$;
(e) $\alpha_{ij}^{k+1}=\alpha_{ij}^k+\big(\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)-B_{ij}^{k+1}\big)$;
(f) $\beta_{is}^{k+1}=\beta_{is}^k+\big((W^{k+1})^T(x_s^i-\bar x_i)-Z_{is}^{k+1}\big)$;
(g) $\Gamma^{k+1}=\Gamma^k+(D^{k+1}-W^{k+1})$.
Until $\|r^{k+1}\|=\max\{\|\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)-B_{ij}^{k+1}\|_2,\ \|(W^{k+1})^T(x_s^i-\bar x_i)-Z_{is}^{k+1}\|_2,\ \|W^{k+1}-D^{k+1}\|_2\}\le\epsilon^{pri}$ and $\|s^{k+1}\|=\max\{\|\rho(\bar x_i-\bar x_j)(B_{ij}^{k+1}-B_{ij}^k)^T\|_2,\ \|\rho(x_s^i-\bar x_i)(Z_{is}^{k+1}-Z_{is}^k)^T\|_2,\ \|\rho(D^{k+1}-D^k)\|_F\}\le\epsilon^{dual}$.
Output: $W^*=W^{k+1}$.

From Algorithm 1, we see that we need to solve optimization problems in steps (a)–(d). In the following, we give specific solutions to these four subproblems.

For step (a) in Algorithm 1, we need to solve

$$\begin{aligned}W^{k+1}&=\arg\min_{W}L_\rho(W,B_{ij}^k,Z_{is}^k,D^k;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)\\&=\arg\min_{W}\ \ell(W)+\frac\rho2\sum_{i<j}^{c}\Big\|\tfrac1N\sqrt{N_iN_j}\,W^T(\bar x_i-\bar x_j)-B_{ij}^k+\alpha_{ij}^k\Big\|_2^2+\frac\rho2\sum_{i=1}^{c}\sum_{s=1}^{N_i}\big\|W^T(x_s^i-\bar x_i)-Z_{is}^k+\beta_{is}^k\big\|_2^2+\frac\rho2\|D^k-W+\Gamma^k\|_F^2\\&=\arg\min_{W}\ \ell(W)+\frac\rho2\operatorname{tr}(W^TGW)+\rho\operatorname{tr}\big((A^k)^TW\big),\end{aligned} \tag{25}$$

where $G=\sum_{i<j}^{c}\frac{N_iN_j}{N^2}(\bar x_i-\bar x_j)(\bar x_i-\bar x_j)^T+\sum_{i=1}^{c}\sum_{s=1}^{N_i}(x_s^i-\bar x_i)(x_s^i-\bar x_i)^T+I\in\mathbb R^{n\times n}$ and $(A^k)^T=-\frac1N\sum_{i<j}^{c}\sqrt{N_iN_j}\,(B_{ij}^k-\alpha_{ij}^k)(\bar x_i-\bar x_j)^T-\sum_{i=1}^{c}\sum_{s=1}^{N_i}(Z_{is}^k-\beta_{is}^k)(x_s^i-\bar x_i)^T+(D^k+\Gamma^k)^T\in\mathbb R^{d\times n}$. The above problem is equivalent to

$$\arg\min_{W}\ \frac\rho2\big[\operatorname{tr}(W^TGW)+2\operatorname{tr}\big((A^k)^TW\big)\big]\quad\text{s.t. } W^TW=I. \tag{26}$$

Now we solve problem (26) in two cases.

Case 1: $d=n$. Since $G$ is positive definite, we can write $G=G_0(G_0)^T$ by the Cholesky decomposition, where $G_0$ is an invertible lower triangular matrix. Then problem (25) is equivalent to

$$\arg\min_{W}\ \frac\rho2\big\|G_0^TW-(G_0)^{-1}A^k\big\|_F^2\quad\text{s.t. } W^TW=I. \tag{27}$$

In this situation, problem (27) is a balanced Procrustes problem [46], and can be solved as $W^{k+1}=P_1^k(P_2^k)^T$, where $P_1^k$ and $P_2^k$ are the orthogonal matrices from the SVD $A^k=P_1^k\Sigma^k(P_2^k)^T$.

Case 2: $d<n$. In this situation, problem (26) is an unbalanced Procrustes problem [47], and there is no analytic solution. We adopt a recently proposed convergent algorithm studied in [48] to solve (26). Specifically, we use Algorithm 2.

Algorithm 2 Algorithm for problem (26) when $d<n$.
(a) Compute the dominant eigenvalue $a$ of $G$.
(b) Randomly initialize $W\in\mathbb R^{n\times d}$ such that $W^TW=I$.
(c) Update $M\leftarrow2(aI-G)W-2A^k$.
(d) Calculate $W$ by solving the problem $\max_{W^TW=I}\operatorname{tr}(W^TM)$.
(e) Iteratively perform steps (c) and (d) until convergence.
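The following illustrative Python/NumPy sketch of the W-update covers both cases, assuming G and A^k have already been assembled as above; the generalized-power-iteration loop follows Algorithm 2 with a fixed iteration cap, and the sign handling in Case 1 is chosen to match the minimization in (26) as written. All names are illustrative:

```python
import numpy as np

def update_W(G, A, d, n_iter=50):
    """Sketch of step (a): min_{W^T W = I} tr(W^T G W) + 2 tr(A^T W), W in R^{n x d}."""
    n = G.shape[0]
    if d == n:
        # Case 1: balanced Procrustes; tr(W^T G W) is constant over orthogonal W,
        # so only the linear term matters; sign chosen so that tr(A^T W) is minimized
        P1, _, P2t = np.linalg.svd(-A)
        return P1 @ P2t
    # Case 2: generalized power iteration (Algorithm 2)
    a = np.linalg.eigvalsh(G).max()                    # dominant eigenvalue of G
    W = np.linalg.qr(np.random.randn(n, d))[0]         # random orthonormal start
    for _ in range(n_iter):
        M = 2 * (a * np.eye(n) - G) @ W - 2 * A
        U, _, Vt = np.linalg.svd(M, full_matrices=False)
        W = U @ Vt                                     # argmax_{W^T W = I} tr(W^T M)
    return W
```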
Fig. 2. The mean accuracies for various methods on the UCI data sets.
Table 2. UCI data sets information.

Data set        Sample no.  Feature no.  Class no.
Australian      690         14           2
BUPA            345         6            2
Car             1782        6            4
Credit          690         15           2
Diabetics       768         8            2
Echocardiogram  131         10           2
Ecoli           336         7            8
German          1000        20           2
Haberman        306         3            2
Hourse          300         2            2
House_votes     435         16           2
Iris             150         4            3
Monks3          432         6            2
Musk1           476         166          2
Libras          360         90           15
Sonar           208         60           2
Spect           267         44           2
TicTacToe       958         27           2
Titanic         2201        3            2
Waveform        5000        21           2
WPBC            198         34           2
Table 3. Classification results, Acc % (Dim), on the original UCI data sets. Each row below lists one method's results on the data sets in the order: Australian, BUPA, Car, Credit, Diabetics, Echocardiogram, Ecoli, German, Haberman, Hourse, House votes, Iris, Monk3, Musk1, Libras, Sonar, Spect, TicTacToe, Titanic, Waveform, WPBC.

PCA: 81.16 (7), 60.19 (6), 93.63 (3), 81.64 (5), 71.74 (4), 87.18 (9), 78.22 (7), 73.67 (18), 71.43 (3), 84.44 (11), 88.46 (18), 100.00 (1), 63.08 (3), 63.08 (41), 52.38 (15), 56.45 (9), 83.75 (5), 97.57 (14), 67.73 (1), 86.23 (3), 67.80 (1)
PCA-L1: 80.19 (6), 61.17 (2), 76.25 (6), 82.61 (3), 70.87 (8), 87.18 (8), 80.20 (4), 73.67 (17), 71.43 (3), 80.00 (11), 87.69 (16), 100.00 (3), 60.77 (3), 83.92 (44), 52.38 (15), 56.45 (9), 78.75 (4), 96.88 (15), 70.30 (1), 86.13 (3), 67.80 (1)
LDA: 80.19 (1), 56.31 (1), 50.00 (3), 80.68 (1), 68.70 (1), 87.18 (1), 77.23 (7), 69.33 (1), 51.65 (1), 64.44 (1), 91.54 (1), 100.00 (2), 60.00 (1), 76.92 (1), 44.76 (14), 46.77 (1), 71.25 (1), 94.44 (1), 67.73 (1), 81.40 (1), 66.10 (1)
LDA-L1: 82.31 (8), 63.11 (6), 75.87 (5), 82.61 (8), 73.48 (7), 87.18 (10), 77.23 (7), 74.33 (15), 68.13 (2), 82.22 (14), 90.77 (7), 100.00 (3), 70.00 (1), 83.92 (45), 52.38 (11), 62.90 (7), 78.75 (2), 95.83 (15), 70.30 (2), 86.60 (4), 67.80 (2)
L1-LDA: 76.33 (1), 63.11 (1), 75.87 (1), 81.64 (1), 66.96 (1), 87.18 (1), 78.22 (7), 70.00 (13), 57.14 (1), 65.56 (1), 91.54 (1), 100.00 (1), 60.00 (1), 65.73 (1), 44.76 (1), 53.23 (12), 80.00 (12), 94.44 (12), 67.73 (1), 79.53 (1), 79.66 (1)
RLDA: 85.99 (2), 67.96 (5), 84.17 (2), 83.09 (2), 74.35 (7), 89.74 (1), 79.21 (5), 74.67 (16), 74.73 (1), 82.22 (22), 94.62 (3), 100.00 (4), 73.85 (3), 83.92 (100), 60.95 (31), 74.19 (2), 80.00 (26), 94.44 (1), 70.30 (1), 80.93 (12), 76.27 (2)
RSLDA: 80.19 (9), 71.84 (4), 93.24 (6), 80.68 (15), 69.57 (5), 89.74 (3), 75.25 (3), 72.33 (8), 76.92 (2), 82.22 (25), 93.85 (1), 100.00 (4), 80.00 (3), 83.22 (150), 50.48 (74), 72.58 (2), 80.00 (22), 94.44 (2), 70.30 (1), 66.87 (12), 76.27 (27)
L2BLDA: 82.31 (2), 65.05 (3), 93.82 (5), 81.16 (6), 70.87 (8), 87.18 (9), 78.22 (5), 73.67 (18), 73.63 (2), 83.33 (11), 90.77 (16), 100.00 (4), 56.15 (5), 83.92 (45), 52.38 (15), 56.45 (9), 81.25 (5), 99.31 (14), 67.73 (1), 86.53 (3), 71.19 (7)
L1BLDA: 83.57 (3), 70.87 (4), 92.28 (5), 82.61 (5), 72.17 (5), 87.18 (9), 80.20 (6), 74.67 (3), 74.73 (2), 84.44 (13), 91.54 (16), 100.00 (3), 73.85 (1), 83.92 (44), 52.38 (16), 62.90 (5), 83.75 (6), 93.75 (15), 70.30 (1), 85.80 (9), 72.88 (1)
For step (b) in Algorithm 1, we need to solve

$$B_{ij}^{k+1}=\arg\min_{B_{ij}}L_\rho(W^{k+1},B_{ij},Z_{is}^k,D^k;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)=\arg\min_{B_{ij}}\ -\sum_{i<j}^{c}\|B_{ij}\|_1+\frac\rho2\sum_{i<j}^{c}\Big\|\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)-B_{ij}+\alpha_{ij}^k\Big\|_2^2. \tag{28}$$

By direct computation, its solution can be given elementwise as

$$B_{ij}^{k+1}=\begin{cases}\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)+\alpha_{ij}^k+1/\rho,&\text{if }\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)+\alpha_{ij}^k\ge0,\\[2pt]\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)+\alpha_{ij}^k-1/\rho,&\text{if }\tfrac1N\sqrt{N_iN_j}\,(W^{k+1})^T(\bar x_i-\bar x_j)+\alpha_{ij}^k<0,\end{cases}$$

for $i<j$, $i,j=1,2,\ldots,c$.
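An illustrative Python/NumPy sketch of this closed-form update for a single pair (i, j); all names are assumptions:

```python
import numpy as np

def update_B_ij(W, xbar_i, xbar_j, N_i, N_j, N, alpha_ij, rho):
    """Elementwise minimizer of -||B||_1 + (rho/2)||v - B||_2^2 with
    v = (1/N) sqrt(N_i N_j) W^T (xbar_i - xbar_j) + alpha_ij."""
    v = np.sqrt(N_i * N_j) / N * (W.T @ (xbar_i - xbar_j)) + alpha_ij
    # the negative L1 term pushes each entry 1/rho further away from zero
    return np.where(v >= 0, v + 1.0 / rho, v - 1.0 / rho)
```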
Table 4. Classification results, Acc % (Dim), on the UCI data sets with 30% of the features contaminated by Gaussian noise. Each row lists one method's accuracies on the data sets in the order: Australian, BUPA, Car, Credit, Diabetics, Echocardiogram, Ecoli, German, Haberman, Hourse, House votes, Iris, Monk3, Musk1, Libras, Sonar, Spect, TicTacToe, Titanic, Waveform, WPBC. For methods whose row does not include the reduced dimensions, these are listed afterwards in the same order.

PCA: 81.64, 56.31, 81.66, 83.57, 73.04, 76.92, 77.23, 71.67, 75.82, 80.00, 89.23, 88.89, 65.38, 79.72, 60.00, 45.16, 75.00, 90.97, 70.30, 81.67, 72.88
PCA-L1: 81.64, 57.28, 70.08, 84.06, 73.04, 82.05, 75.25, 73.00, 75.82, 77.78, 87.69, 91.11, 47.69, 78.32, 57.14, 50.00, 82.50, 91.67, 70.30, 81.87, 67.80
LDA: 78.74, 57.28, 45.56, 78.26, 63.04, 74.36, 75.25, 67.33, 61.54, 65.56, 90.77, 97.78, 51.54, 65.73, 39.05, 54.84, 60.00, 45.14, 70.30, 80.86, 59.32
LDA-L1: 84.54, 60.19, 67.57, 86.47, 74.35, 84.62, 75.25, 72.33, 72.53, 85.56, 90.00, 93.33, 53.08, 79.72, 61.90, 54.84, 81.25, 92.71, 70.30, 83.60, 76.27
L1-LDA: 84.06, 55.34, 56.18, 80.19, 68.70, 82.05, 71.29, 70.00, 56.04, 67.78, 90.77, 97.78, 52.31, 69.93, 57.14, 54.84, 80.00, 65.28, 70.30, 77.40, 76.27
RLDA: 82.13 (3), 68.93 (3), 72.39 (5), 85.99 (14), 73.91 (1), 84.62 (4), 74.26 (7), 76.00 (18), 76.92 (3), 80.00 (21), 94.62 (8), 100.00 (2), 63.85 (3), 80.42 (164), 56.19 (56), 72.58 (4), 81.25 (2), 94.44 (3), 70.30 (1), 80.40 (1), 81.36 (21)
RSLDA: 81.16 (4), 73.79 (4), 82.43 (6), 83.09 (11), 74.78 (8), 84.62 (10), 73.27 (4), 78.00 (16), 78.02 (2), 75.56 (6), 93.08 (6), 100.00 (2), 74.62 (6), 86.01 (83), 44.76 (80), 69.35 (6), 80.00 (28), 82.29 (22), 70.30 (1), 73.27 (19), 79.66 (8)
L2BLDA: 81.64, 56.31, 81.80, 83.57, 73.04, 79.49, 75.25, 72.00, 75.82, 78.89, 87.69, 93.33, 74.62, 79.72, 57.14, 43.55, 73.75, 91.32, 70.30, 81.73, 71.19
L1BLDA: 84.54, 68.93, 84.56, 85.99, 75.65, 84.62, 78.22, 73.33, 75.82, 82.22, 92.31, 97.78, 76.92, 83.22, 56.19, 67.74, 80.00, 92.71, 70.30, 84.27, 76.27

Reduced dimensions (same data-set order): PCA: (9) (6) (4) (15) (8) (1) (4) (8) (3) (9) (11) (2) (2) (27) (24) (3) (9) (6) (1) (8) (15); PCA-L1: (2) (1) (6) (9) (8) (6) (7) (8) (3) (8) (16) (3) (1) (21) (37) (3) (18) (18) (1) (12) (7); LDA: (1) (1) (3) (1) (1) (1) (5) (1) (1) (1) (1) (2) (5) (1) (13) (1) (1) (1) (1) (1) (1); LDA-L1: (7) (2) (2) (12) (3) (2) (5) (15) (1) (8) (10) (3) (1) (11) (34) (1) (6) (16) (1) (17) (25); L1-LDA: (1) (1) (1) (1) (1) (1) (2) (13) (1) (1) (1) (1) (1) (1) (1) (1) (13) (13) (1) (1) (12); L2BLDA: (2) (3) (3) (15) (8) (8) (6) (5) (3) (10) (16) (3) (2) (48) (23) (6) (9) (7) (1) (12) (3); L1BLDA: (7) (1) (5) (7) (7) (2) (4) (6) (3) (8) (3) (2) (5) (12) (24) (1) (7) (15) (1) (6) (4).
Fig. 3. The contaminated samples of 30% noise from the MNIST database.

For step (c) in Algorithm 1, we need to solve

$$Z_{is}^{k+1}=\arg\min_{Z_{is}}L_\rho(W^{k+1},B_{ij}^{k+1},Z_{is},D^k;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)=\arg\min_{Z_{is}}\ \Omega\sum_{i=1}^{c}\sum_{s=1}^{N_i}\|Z_{is}\|_1+\frac\rho2\sum_{i=1}^{c}\sum_{s=1}^{N_i}\big\|(W^{k+1})^T(x_s^i-\bar x_i)-Z_{is}+\beta_{is}^k\big\|_2^2. \tag{29}$$

Its solution can be given through the soft-thresholding function:

$$Z_{is}^{k+1}=\Phi_{\Omega/\rho}\big[(W^{k+1})^T(x_s^i-\bar x_i)+\beta_{is}^k\big]$$

for $i=1,2,\ldots,c$ and $s=1,2,\ldots,N_i$, where

$$\Phi_\kappa(a)=\begin{cases}a-\kappa,&a>\kappa,\\0,&|a|\le\kappa,\\a+\kappa,&a<-\kappa.\end{cases}$$

For step (d) in Algorithm 1, we need to solve

$$D^{k+1}=\arg\min_{D}L_\rho(W^{k+1},B_{ij}^{k+1},Z_{is}^{k+1},D;\alpha_{ij}^k,\beta_{is}^k,\Gamma^k)=\arg\min_{D}\ \frac\rho2\|D-W^{k+1}+\Gamma^k\|_F^2, \tag{30}$$

whose solution is given componentwise by $D^{k+1}=W^{k+1}-\Gamma^k$.
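An illustrative Python/NumPy sketch of the soft-thresholding Z-update (29) and the D-update (30); all names are assumptions:

```python
import numpy as np

def soft_threshold(a, kappa):
    """Phi_kappa applied elementwise: shrink a toward zero by kappa."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)

def update_Z_is(W, x_is, xbar_i, beta_is, omega, rho):
    """Step (c): Z_is = Phi_{Omega/rho}[ W^T (x_is - xbar_i) + beta_is ]."""
    return soft_threshold(W.T @ (x_is - xbar_i) + beta_is, omega / rho)

def update_D(W, Gamma):
    """Step (d): the minimizer of (rho/2)||D - W + Gamma||_F^2 is D = W - Gamma."""
    return W - Gamma
```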
4.3. Analysis of computational complexity

We now present the computational cost of the proposed L2BLDA and L1BLDA, as well as of the LDA-based methods closely related to ours. The main computational cost of L2BLDA is forming the matrix $S$ and performing its standard eigenvalue decomposition, so its computational complexity is $O(n^3)$. For L1BLDA, the computational complexity of Step (a) in Algorithm 1 is $O(n^3)$, or $O(n^3+T_0n^2d)$ when $d<n$, where $d$ is the reduced dimension and $T_0$ is the iteration number of Algorithm 2. The computational complexity of both Step (b) and Step (c) is $O(dn)$, and that of Step (d) is $O(n)$. Therefore, the computational complexity of L1BLDA is $O(n^3)$ or $O(n^3+T_0n^2d)$. When one discriminant direction is learned, the time complexity of LDA-L1 is $O(T(N+c)n)$ [24], where $T$ is the iteration number. For L1-LDA, the time complexity is $O(T(n^3+n^2(N+c)+nc))$ [28]. For LDA, the time complexity is $O((N+c+1)n)$. The time complexity of RLDA and RSLDA is $O(n^3+T_0Nc^2)$ [30], where $T_0$ is their iteration number. The summary is listed in Table 1.
5. Experiments

In this section, we perform experiments to test the proposed methods on several UCI benchmark data sets and two handwritten digit databases. Several related dimensionality reduction methods, including PCA [49], PCA-L1 [50], LDA [2], LDA-L1 [24], L1-LDA [28], RLDA [30] and RSLDA [30], are used for comparison. The learning rate parameter of LDA-L1 is chosen from the set $\{10^{-4},10^{-3},\ldots,10^{1}\}$, the parameters $\lambda$ and $\rho$ of RLDA and RSLDA are selected from $\{0.1, 0.5, 1, 5, 50, 100, 1000\}$, and $\delta$ is selected from $\{0.01, 0.05, 0.1, 0.5, 1, 5, 10\}$. To test the discriminant ability of the various methods, the nearest neighbor classification accuracy (%) in the projected space is used as the indicator, where the projected space is obtained by applying each dimensionality reduction method on the training data. All the methods are carried out on a PC with a P4 2 GHz CPU and 2 GB RAM in Matlab 2017b.

5.1. The UCI data sets

We apply the proposed methods on 21 benchmark data sets. All the data sets are normalized to the interval [0, 1], and their information is listed in Table 2. A random 70% of the data is used for training, and the rest forms the test set. To test the robustness of our L1BLDA, we also add random Gaussian noise to the training data. Specifically, 30% and 50% of the features are selected and contaminated with random Gaussian noise of mean zero and variance 0.1, respectively. The classification results on the original data sets and on the noise data sets are listed in Tables 3, 4 and 5, respectively.
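An illustrative sketch of this evaluation protocol (projection with the learned W followed by 1-nearest-neighbor classification) and of the feature-noise injection, written in Python with NumPy and scikit-learn rather than the Matlab environment used in the paper; function and parameter names are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nn_accuracy(W, X_train, y_train, X_test, y_test):
    """1-NN classification accuracy in the projected space (columns are samples)."""
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit((W.T @ X_train).T, y_train)
    return clf.score((W.T @ X_test).T, y_test)

def add_feature_noise(X, frac=0.3, var=0.1, seed=0):
    """Add zero-mean Gaussian noise (variance `var`) to a random fraction of the features."""
    rng = np.random.default_rng(seed)
    Xn = X.copy()
    idx = rng.choice(X.shape[0], size=int(frac * X.shape[0]), replace=False)
    Xn[idx, :] += rng.normal(0.0, np.sqrt(var), size=(len(idx), X.shape[1]))
    return Xn
```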
Fig. 4. The variation of accuracies along different dimensions on the MNIST database.
From the tables, we have the following observations: (i) When noise is added, the performance of all the methods degenerates on almost all the data sets; however, our L1BLDA is less affected than the other methods. (ii) The L1-norm based methods generally perform better than the L2-norm ones, especially on the noise data sets, which shows the robustness of L1-norm based LDAs. (iii) Our L1BLDA and RLDA perform the best among the compared methods, while our L1BLDA outperforms RLDA on noise data. (iv) Our methods behave better on the relatively large UCI data sets. To see the results more clearly, we also depict the mean accuracies of each method over all the data sets in Fig. 2, and list their average ranks in Table 6. One can see that although RLDA behaves a little better than L1BLDA on the original data, our L1BLDA outperforms the others on noise data.

5.2. Handwritten digit databases

In this subsection, the behavior of the various methods is investigated on two handwritten digit databases, the MNIST database and the USPS database.
Table 5. Classification results, Acc % (Dim), on the UCI data sets with 50% of the features contaminated by Gaussian noise. Each accuracy row lists one method's results on the data sets in the order: Australian, BUPA, Car, Credit, Diabetics, Echocardiogram, Ecoli, German, Haberman, Hourse, House votes, Iris, Monk3, Musk1, Libras, Sonar, Spect, TicTacToe, Titanic, Waveform, WPBC; the corresponding reduced dimensions follow in the same method and data-set order.

PCA: 82.61, 58.25, 78.19, 79.71, 66.52, 84.62, 78.22, 70.33, 70.33, 83.33, 90.00, 88.89, 51.54, 77.62, 59.05, 51.61, 77.50, 94.44, 67.73, 76.73, 76.26
PCA-L1: 83.57, 57.28, 70.08, 76.81, 69.57, 79.49, 78.22, 70.67, 71.43, 82.22, 92.31, 84.44, 65.38, 77.62, 61.90, 56.45, 78.75, 92.36, 67.73, 76.67, 74.58
LDA: 78.74, 62.14, 44.21, 78.26, 62.17, 79.49, 75.25, 68.33, 63.74, 76.67, 91.54, 82.22, 48.46, 62.24, 42.86, 48.39, 70.00, 58.68, 67.73, 75.27, 59.32
LDA-L1: 83.09, 59.22, 68.73, 81.64, 67.83, 84.62, 73.27, 73.67, 70.33, 82.22, 92.31, 86.67, 72.31, 78.33, 60.95, 59.68, 80.00, 93.40, 67.73, 77.27, 74.58
L1-LDA: 76.33, 59.22, 47.88, 77.29, 66.52, 71.79, 63.37, 70.00, 67.03, 68.89, 89.23, 80.00, 50.00, 62.24, 61.90, 53.23, 80.00, 65.28, 67.73, 66.87, 76.27
RLDA: 83.57, 68.93, 67.76, 85.51, 66.96, 87.18, 65.35, 74.67, 72.53, 80.00, 94.62, 93.33, 70.77, 80.42, 61.90, 69.35, 82.50, 85.42, 67.73, 76.13, 79.66
RSLDA: 82.13, 64.08, 75.87, 80.19, 66.96, 89.74, 63.37, 71.67, 74.73, 78.89, 93.08, 95.56, 72.31, 82.52, 30.48, 66.13, 83.75, 77.78, 67.73, 73.47, 81.36
L2BLDA: 83.09, 60.19, 77.03, 76.81, 66.52, 82.50, 78.22, 69.33, 70.33, 84.44, 90.00, 84.44, 49.23, 77.62, 55.24, 53.23, 78.75, 94.44, 67.73, 76.73, 77.97
L1BLDA: 84.06, 67.96, 78.57, 81.64, 69.57, 84.62, 78.22, 74.67, 74.73, 85.56, 92.31, 91.11, 70.77, 78.33, 59.05, 59.68, 81.25, 89.93, 70.30, 78.00, 79.66

Reduced dimensions: PCA: (7) (2) (5) (4) (8) (2) (7) (5) (3) (17) (7) (1) (1) (35) (48) (6) (7) (9) (1) (11) (8); PCA-L1: (3) (4) (6) (6) (4) (4) (7) (10) (3) (23) (12) (2) (3) (129) (54) (1) (8) (15) (1) (19) (7); LDA: (5) (2) (4) (6) (6) (8) (5) (8) (3) (26) (15) (3) (2) (93) (60) (3) (7) (16) (1) (16) (27); LDA-L1: (5) (2) (4) (6) (6) (8) (5) (8) (3) (26) (15) (3) (2) (93) (60) (3) (7) (16) (1) (16) (27); L1-LDA: (1) (1) (1) (1) (1) (1) (1) (12) (1) (1) (1) (1) (1) (1) (1) (1) (13) (12) (1) (13) (12); RLDA: (4) (4) (1) (13) (8) (9) (5) (15) (1) (16) (8) (3) (6) (125) (32) (5) (10) (21) (1) (2) (14); RSLDA: (3) (2) (6) (13) (5) (6) (7) (20) (3) (11) (2) (2) (2) (82) (89) (5) (3) (26) (1) (16) (18); L2BLDA: (7) (2) (5) (15) (8) (77) (7) (14) (3) (15) (16) (4) (5) (35) (39) (7) (7) (9) (1) (20) (9); L1BLDA: (6) (4) (4) (4) (3) (3) (7) (3) (2) (18) (5) (2) (2) (38) (29) (7) (8) (15) (1) (19) (6).
Table 6. Average ranks of the various methods for the accuracies on the UCI data sets. Each row lists one method's ranks on the data sets in the order: Australian, BUPA, Car, Credit, Diabetics, Echocardiogram, Ecoli, German, Haberman, Hourse, House votes, Iris, Monk3, Musk1, Libras, Sonar, Spect, TicTacToe, Titanic, Waveform, WPBC.

PCA: 5.67, 7.83, 2.67, 5.33, 5.67, 6.00, 3.17, 6.00, 5.33, 2.67, 7.50, 6.00, 5.00, 6.67, 3.83, 7.33, 5.50, 3.17, 6.00, 3.83, 6.33
PCA-L1: 5.17, 7.17, 5.67, 5.17, 4.33, 6.33, 2.83, 4.67, 4.67, 5.83, 7.17, 6.50, 6.67, 5.33, 3.33, 6.00, 5.00, 3.67, 4.50, 4.00, 7.50
LDA: 8.00, 6.17, 9.00, 7.83, 8.67, 7.50, 5.67, 9.00, 8.67, 8.67, 4.83, 5.67, 8.17, 8.17, 8.50, 7.67, 9.00, 8.17, 6.00, 6.33, 9.00
LDA-L1: 3.17, 5.33, 6.83, 2.17, 2.67, 4.17, 6.00, 3.67, 6.67, 3.50, 5.50, 5.50, 3.83, 3.83, 3.00, 4.00, 4.83, 3.17, 4.50, 1.67, 6.17
L1-LDA: 7.00, 7.00, 7.83, 6.83, 8.00, 6.83, 7.50, 7.67, 8.33, 8.33, 5.83, 6.00, 7.17, 8.17, 4.83, 6.50, 4.83, 7.50, 6.00, 8.33, 3.33
RLDA: 2.50, 2.17, 5.67, 1.50, 3.17, 2.00, 5.67, 1.67, 2.50, 4.83, 1.00, 2.83, 3.67, 2.67, 3.17, 1.00, 3.17, 4.50, 4.50, 6.67, 2.00
RSLDA: 7.33, 1.67, 3.00, 6.50, 4.50, 1.67, 8.50, 4.00, 1.17, 6.33, 2.00, 2.50, 1.67, 2.67, 8.00, 2.00, 3.67, 6.83, 4.50, 8.67, 1.83
L2BLDA: 4.67, 5.50, 2.33, 7.00, 6.17, 6.33, 4.00, 6.33, 4.83, 3.33, 7.50, 6.00, 6.50, 4.67, 5.00, 7.17, 5.83, 2.50, 6.00, 3.17, 5.33
L1BLDA: 1.50, 2.17, 2.00, 2.67, 1.83, 4.17, 1.67, 2.00, 2.83, 1.50, 3.67, 4.00, 2.33, 2.83, 5.33, 3.33, 3.17, 5.50, 3.00, 2.33, 3.50

Average rank: PCA 5.31, PCA-L1 5.31, LDA 7.65, LDA-L1 4.29, L1-LDA 6.85, RLDA 3.18, RSLDA 4.24, L2BLDA 5.25, L1BLDA 2.92.
Fig. 5. The contaminated samples of 30% noise from the USPS database.
5.2.1. The MNIST database

The MNIST database contains 70,000 digit images of size 28 × 28 belonging to 10 classes. We resize the images to 16 × 16 and reshape them into vectors of length 256. 30% of the data from each class is randomly selected for training, while the rest is used for testing. Further, Gaussian noise with mean 0 and variance 0.05 is added to the training data, where the noise covers a random rectangular area amounting to 30% or 50% of each image. The contaminated digit images are displayed in Fig. 3. All the methods are then applied on the original training data and the contaminated training data. We show the classification results in Table 7. To see how the reduced dimension affects each method, we depict the variation of accuracy along the dimensions, as shown in Fig. 4. The results show that: (i) On the original data, all the methods behave similarly except for LDA. However, when the samples are contaminated, PCA, PCA-L1, LDA and L1-LDA are greatly influenced by the noise, while LDA-L1, RLDA, RSLDA and our L2BLDA and L1BLDA are less affected. In addition, our L2BLDA and L1BLDA are both better than the other methods, and our L1BLDA has the best performance. (ii) For our L2BLDA and L1BLDA, as the number of reduced dimensions increases, the accuracy becomes higher and then levels off. The influence of the reduced dimension on PCA and PCA-L1 has a pattern similar to that of our methods; however, they are not as robust as our methods. (iii) The optimal reduced dimensions of L2BLDA and L1BLDA are in general not too large. LDA-L1, RLDA and RSLDA usually need more reduced dimensions. L1-LDA mostly achieves its optimal accuracy when the reduced dimension is very low, and then its behavior degenerates significantly.
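An illustrative Python/NumPy sketch of this rectangular-patch corruption for a flattened 16 × 16 image; the patch-size heuristic and all names are assumptions, not the authors' implementation:

```python
import numpy as np

def add_block_noise(img_flat, frac=0.3, var=0.05, side=16, seed=0):
    """Overlay zero-mean Gaussian noise on a random rectangle covering ~frac of the image."""
    rng = np.random.default_rng(seed)
    img = img_flat.reshape(side, side).copy()
    h = max(1, int(np.sqrt(frac) * side))                       # rectangle height
    w = min(side, max(1, int(frac * side * side / h)))          # width so that h*w ~ frac*side^2
    r = rng.integers(0, side - h + 1)
    c = rng.integers(0, side - w + 1)
    img[r:r + h, c:c + w] += rng.normal(0.0, np.sqrt(var), size=(h, w))
    return img.reshape(-1)
```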
Fig. 6. The variation of accuracies along different dimensions on the USPS database.

Table 7. Classification results, Acc % (Dim), on the MNIST database.

Data        PCA          PCA-L1       LDA         LDA-L1        L1-LDA      RLDA          RSLDA         L2BLDA        L1BLDA
Original    96.79 (53)   96.81 (66)   86.24 (9)   96.50 (246)   85.31 (2)   96.50 (52)    96.50 (179)   96.81 (56)    96.79 (57)
30% Noise   71.90 (225)  72.27 (256)  77.57 (9)   91.41 (166)   75.60 (2)   90.91 (182)   89.33 (241)   92.28 (151)   92.31 (156)
50% Noise   70.85 (53)   71.90 (58)   77.10 (9)   87.82 (246)   70.85 (2)   87.72 (219)   82.34 (256)   90.00 (56)    91.41 (75)
Table 8. Classification results, Acc % (Dim), on the USPS database.

Data        PCA          PCA-L1        LDA         LDA-L1        L1-LDA      RLDA          RSLDA         L2BLDA       L1BLDA
Original    96.42 (40)   96.27 (60)    91.52 (9)   96.85 (139)   90.91 (2)   93.82 (238)   96.24 (232)   96.42 (41)   96.85 (41)
20% noise   88.79 (46)   88.88 (116)   79.88 (8)   88.91 (150)   80.33 (2)   88.79 (256)   88.79 (235)   89.39 (95)   89.79 (42)
30% noise   85.55 (92)   84.64 (126)   74.67 (9)   83.39 (182)   72.85 (2)   85.91 (179)   85.00 (218)   85.61 (95)   86.67 (64)
5.2.2. The USPS database

The USPS database contains 11,000 samples of dimension 256 belonging to 10 classes, and each sample corresponds to a digit. We randomly select 70% of the samples from each class for training, while the rest is used for testing. To test the robustness of the proposed method, we further add a black block to each training sample, where the block covers a random rectangular area amounting to 20% or 30% of each image, as shown in Fig. 5. As before, all the methods are applied on the original training data and the contaminated training data, and the corresponding results are given in Table 8 and Fig. 6. The results reveal a pattern similar to that of the MNIST database. As on MNIST, with the increase of the number of reduced dimensions, the accuracies of our L2BLDA and L1BLDA first grow and then become stable, and their optimal reduced dimensions are in general not too large. On both the original data and the noise data, our L1BLDA performs the best, which shows its robustness.

6. Conclusion

This paper proposed two novel L1-norm and L2-norm based linear discriminant analysis methods (L1BLDA and L2BLDA), both derived from upper bounds of the Bhattacharyya error, which places them in a theoretically grounded optimality framework. Different from the existing LDAs: (i) L1BLDA and L2BLDA maximize a novel weighted pairwise between-class scatter and meanwhile minimize the within-class scatter, under the L1-norm and L2-norm, respectively; (ii) the meaningful weighting constants between the between-class scatter and the within-class scatter are determined by the involved data set, thereby saving tedious parameter tuning; (iii) for both L1BLDA and L2BLDA, all the projection directions are obtained at once, by an efficient ADMM algorithm and a standard eigenvalue decomposition, respectively; (iv) experimental results on an artificial data set, UCI data sets and handwritten image data sets support the superiority of the proposed methods. Our Matlab code can be downloaded from http://www.optimal-group.org/Resources/Code/L1BLDA.html.

Acknowledgments

The authors thank the editor and the referees for their valuable comments, which have largely improved the presentation of this work. This work is supported by the Natural Science Foundation of Hainan Province, China (No. 118QN181), the National Natural Science Foundation of China (Nos. 61703370, 11871183, 61866010, 11501310 and 61603338), and the Natural Science Foundation of Zhejiang Province, China (Nos. LQ17F030003, LY18G010018 and LY16A010020).

References

[1] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7 (2) (1936) 179–188.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, New York, 1991.
[3] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.
[4] S. Sun, X. Xie, M. Yang, Multiview uncorrelated discriminant analysis, IEEE Trans. Cybern. 46 (12) (2016) 3272–3284. [5] T. Luo, C. Hou, F. Nie, et al., Dimension reduction for non-gaussian data by adaptive discriminative analysis, IEEE Trans. Cybern. 49 (3) (2019) 933–946. [6] C.N. Li, Y.H. Shao, W.J. Chen, et al., Generalized two-dimensional linear discriminant analysis with regularization, 2018, arXiv:1801.07426. [7] Y. Guo, T. Hastie, R. Tibshirani, Regularized linear discriminant analysis and its application in microarrays, Biostatistics 8 (1) (2006) 86–100. [8] T. Jombart, S. Devillard, F. Balloux, Discriminant analysis of principal components: a new method for the analysis of genetically structured populations, BMC Genet. 11 (1) (2010) 94. [9] D.L. Swets, J.J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831–836. [10] L.F. Chen, H.Y.M. Liao, M.T. Ko, et al., A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognit. 33 (10) (2000) 1713–1726. [11] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional datawith application to face recognition, Pattern Recognit. 34 (10) (2001) 2067–2070. [12] Z. Lai, D. Mo, W.K. Wong, et al., Robust discriminant regression for feature extraction, IEEE Trans. Cybern. 48 (8) (2018) 2472–2484. [13] J.H. Friedman, Regularized discriminant analysis, J. Amer. Statist. Assoc. 84 (405) (1989) 165–175. [14] S.J. Kim, A. Magnani, S. Boyd, Robust fisher discriminant analysis, in: Advances in Neural Information Processing Systems, 2006, pp. 659–666. [15] Q. Tian, M. Barbero, Z.H. Gu, et al., Image classification by the Foley–Sammon transform, Opt. Eng. 25 (7) (1986) 257834. [16] C. Croux, C. Dehon, Robust linear discriminant analysis using S-estimators, Canad. J. Statist. 29 (3) (2001) 473–493. [17] M. Hubert, K. Van Driessen, Fast and robust discriminant analysis, Comput. Statist. Data Anal. 45 (2) (2004) 301–320. [18] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis, J. Mach. Learn. Res. 8 (2007) 1027–1061. [19] C. Xiang, X.A. Fan, T.H. Lee, Face recognition using recursive Fisher linear discriminant, IEEE Trans. Image Process. 15 (8) (2006) 2097–2105. [20] Q.L. Ye, C.X. Zhao, H.F. Zhang, et al., Recursive ‘‘concave-convex’’ Fisher linear discriminant with applications to face, handwritten digit and terrain recognition, Pattern Recognit. 45 (2012) 54–65. [21] X. Chen, J. Yang, Q. Mao, et al., Regularized least squares fisher linear discriminant with applications to image recognition, Neurocomputing 122 (2013) 521–534. [22] C.N. Li, Z.R. Zheng, M.Z. Liu, et al., Robust recursive absolute value inequalities discriminant analysis with sparseness, Neural Netw. 93 (2017) 205–218. [23] X. Li, W. Hua, H. Wang, et al., Linear discriminant analysis using rotational invariant L1 norm, Neurocomputing 73 (13–15) (2010) 2571–2579. [24] F. Zhong, J. Zhang, Linear discriminant analysis based on L1-norm maximization, IEEE Trans. Image Process. 22 (8) (2013) 3018–3027. [25] H. Wang, X. Lu, Z. Hu, et al., Fisher discriminant analysis with L1-norm, IEEE Trans. Cybern. 44 (6) (2014) 828–842. [26] Y. Liu, Q. Gao, S. Miao, et al., A non-greedy algorithm for L1-norm LDA, IEEE Trans. Image Process. 26 (2) (2017) 684–695. [27] X. Chen, J. Yang, Z. Jin, An improved linear discriminant analysis with L1-norm for robust feature extraction, in: The 22nd IEEE International Conference on Pattern Recognition, 2014, pp. 
1585–1590. [28] W. Zheng, Z. Lin, H. Wang, L1-norm kernel discriminant analysis via Bayes error bound optimization for robust feature extraction, IEEE Trans. Neural Netw. Learn. Syst. 25 (4) (2014) 793–805. [29] Q. Ye, J. Yang, F. Liu, et al., L1-norm distance linear discriminant analysis based on an effective iterative algorithm, IEEE Trans. Circuits Syst. Video Technol. 28 (1) (2018) 114–129. [30] C.N. Li, Y.H. Shao, W. Yin, et al., Robust and sparse linear discriminant analysis via an alternating direction method of multipliers, IEEE Trans. Neural Netw. Learn. Syst. (2019) http://dx.doi.org/10.1109/TNNLS.2019. 2910991. [31] C.N. Li, Y.H. Shao, N.Y. Deng, Robust L1-norm two-dimensional linear discriminant analysis, Neural Netw. 65 (2015) 92–104. [32] S.B. Chen, D.R. Chen, B. Luo, L1-norm based two-dimensional linear discriminant analysis (In Chinese), J. Electron. Inf. Technol. 37 (6) (2015) 1372–1377.
[33] M. Li, J. Wang, Q. Wang, et al., Trace ratio 2DLDA with L1-norm optimization, Neurocomputing 266 (29) (2017) 216–225. [34] J.H. Oh, N. Kwak, Generalization of linear discriminant analysis using Lp-norm, Pattern Recognit. Lett. 34 (6) (2013) 679–685. [35] L.L. An, H.J. Xing, Linear discriminant analysis based on Lp-norm maximization, in: The 2nd International Conference on Information Technology and Electronic Commerce, ICITEC, 2014, pp. 88–92. [36] A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distribution, Bull. Calcutta Math. Soc. (1943). [37] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice/Hall International, 1982. [38] J. Tou, R. Heydorn, Some approaches to optimum feature selection, in: Computer and Information Sciences II, 1967, pp. 57–89. [39] R.P.W. Duin, M. Loog, Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Trans. Pattern Anal. Mach. Intell. 26 (6) (2004) 732–739. [40] F. De la Torre, T. Kanade, Multimodal oriented discriminant analysis, in: Proceedings of the 22nd ICML, ACM, 2005, pp. 177–184. [41] K.T. Abou-Moustafa, F. De La Torre, F.P. Ferrie, Pareto discriminant analysis, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2010, pp. 3602–3609.
[42] G. Saon, M. Padmanabhan, Minimum Bayes error feature selection, in: Proceeding of NIPS, 2002, pp. 800–806. [43] F. Nielsen, Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means, Pattern Recognit. Lett. 42 (2014) 25–34. [44] B.J. Oommen, Q. Wang, Computing the Bhattacharyya error bound in classifiers using direct bias correction bootstrap methods, 2001. [45] L. Rueda, M. Herrera, Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space, Pattern Recognit. 41 (10) (2008) 3138–3152. [46] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins Univ. Press, Baltimore, MD, USA, 1996. [47] S.A. Mulaik, The Foundations of Factor Analysis, McGraw-Hill, New York, 1972. [48] F. Nie, R. Zhang, X. Li, A generalized power iteration method for solving quadratic problem on the Stiefel manifold, Sci. China Inf. Sci. 60 (11) (2017) 112101:1-112101-10. [49] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cogn. Neurosci. 3 (1) (1991) 71–86. [50] N. Kwak, Principal component analysis based on L1-norm maximization, IEEE Trans. Pattern Anal. Mach. Intell. 30 (9) (2008) 1672–1680.