Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis

Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis

Accepted Manuscript Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis Qiaolin Ye, Liyong Fu, Zhao Zhang, Henghao Zhao, Meem Naiem PI...

1MB Sizes 0 Downloads 54 Views

Accepted Manuscript Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis Qiaolin Ye, Liyong Fu, Zhao Zhang, Henghao Zhao, Meem Naiem

PII: DOI: Reference:

S0893-6080(18)30182-5 https://doi.org/10.1016/j.neunet.2018.05.020 NN 3966

To appear in:

Neural Networks

Received date : 19 September 2017 Revised date : 19 March 2018 Accepted date : 28 May 2018 Please cite this article as: Ye, Q., Fu, L., Zhang, Z., Zhao, H., Naiem, M., Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis. Neural Networks (2018), https://doi.org/10.1016/j.neunet.2018.05.020 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

*Title Page (With all author details listed)

Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis Qiaolin Ye Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, P. R. China. College of Information Science and Technology, Nanjing Forestry University, Nanjing, Jiangsu 210037, P. R. China.

Liyong Fu Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, P. R. China.

Zhao Zhang College of Computer Science and Technology, Soochow University, Nanjing, Jiangsu 215006, P. R. China.

Henghao Zhao College of Information Science and Technology, Nanjing Forestry University, Nanjing, Jiangsu 210037, P. R. China.

Meem Naiem College of Information Science and Technology, Nanjing Forestry University, Nanjing, Jiangsu 210037, P. R. China.

*Manuscript Click here to view linked References

Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis Qiaolin Ye1,2, Liyong Fu1*, Zhao Zhang3, Henghao Zhao2, and Meem Naiem2 1. Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing 100091, P. R. China. 2. College of Information Science and Technology, Nanjing Forestry University, Nanjing, Jiangsu 210037, P. R. China. 3. College of Computer Science and Technology, Soochow University, Nanjing, Jiangsu 215006, P. R. China.

*Corresponding author Abstract—Recently, L1-norm distance measure based Linear Discriminant Analysis (LDA) techniques have been shown to be robust against outliers. However, these methods have no guarantee of obtaining a satisfactory-enough performance due to the insufficient robustness of L1-norm measure. To mitigate this problem, inspired by recent works on Lp-norm based learning, this paper proposes a new discriminant method, called Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis (FLDA-Lsp). The proposed method achieves robustness by replacing the L2-norm within- and between-class distances in conventional LDA with Lp- and Ls-norm ones. By specifying the values of p and s, many of previous efforts can be naturally expressed by our objective. The requirement of simultaneously maximizing and minimizing a number of Lp- and Ls-norm terms results in a difficulty to the optimization of the formulated objective. As one of the important contributions of this paper, we design an efficient iterative algorithm to address this problem, and also conduct some insightful analysis on the existence of local minimum and the convergence of the proposed algorithm. Theoretical insights of our method are further supported by promising experimental results on several images databases.

Index Terms—linear discriminant analysis, Lp-norm, Ls-norm, robustness.

I. INTRODUCTION DIMENSION reduction (DR), as an effective data analysis tool, is typically used to reduce the dimensionality of given high-dimensional data. Two of the most representative methods are Principal Component Analysis (PCA) [1] and Linear Discriminant Analysis (LDA) [2]. Applications of PCA to high-dimensional data, such as face images and text categorization quickly run up against practical performance limit, mainly because of the low recognition rate. Many extensions to PCA have recently been presented to deal with this problem [3], [4]. However, PCA and its variants do not utilize of the labels of data, impairing the class discrimination. In this case, supervised DR methods would be a better choice for the extraction of discriminant features. LDA is a typical supervised method, which finds a transformation matrix maximizing the ratio of the between-class scatter to the within-class scatter. To date, there have been a large number of extensions to LDA [5]-[16]. Among them, many works focus on the avoidance of singularity resulting from the small-sample problem [17], while Local Fisher Discriminant Analysis (LFDA) [6] is one of the most representative ones, which can be viewed as a supervised variant of Laplacian Eigenmaps (LE) [18] or Locality Preserving Projection (LPP) [19]. LFDA, LE, and LPP are essentially categorized as manifold learning [20]. To the best of our knowledge, how to determine the weighting parameter and the neighborhood size has been an open problem in manifold learning. By considering imposing L1-norm on the projection, many researchers proposed sparse LDA methods [21]-[23], which, however, in

fact conduct feature selection. In [24], the authors reconstructed the original data by employing low-rank representation (LRR) [25] or sparse representation (SR) [26] and simultaneously performed LDA on the reconstructed data. We know that LRR is unsupervised and overlooks the class information, which assumes the (linear and affine) subspaces are independent [25]. Only when the assumption holds, the robustness of reconstruction coefficient matrices can be guaranteed. In real worlds, however, it is difficult to estimate the real distribution of data such that the assumption may not hold, and there is no theoretical guarantee that the reconstructed data is clean. This idea has been extended for manifold learning [27]-[29]. A common practice of both conventional PCA and LDA is to introduce the L2-norm distance metric to the objective. Hereafter, conventional PCA and LDA are referred to as PCA-L2 and LDA-L2, for convenience of description. Despite the success for many problems, all the methods aforementioned are prone to outliers due to the use of L2-norm distances, making the projection vectors drift from the desirable direction [30]-[32]. This facilitates much research on robust DR using L1-norm distance measure. By applying L1-norm distance on the objective of PCA, a family of robust PCA algorithms [12]-[16] have been presented. Two representative ones are R1-PCA [33] and PCA-L1 [30], [34]. PCA-L1 has demonstrated the most promising result. Following the basic idea of PCA-L1, both papers [35] and [36] extended PCA-L1 to two-dimension and tensor cases. By imposing L1-norm distance on both the objective and the constraint of PCA, the research [37] proposed a robust sparse PCA (RSPCA) that possesses the merits of sparseness and robustness. Consider two facts. First, L1-norm distance lacks of satisfactory robustness, especially when the number of outliers and noisy data are large [38]. Second, L1- and L2-norms are special cases of Lp-norm. In this way, a newly-proposed method called PCA-Lp [39] formulates the objective using Lp-norm distance. However, existing L1-norm or Lp-norm distance related PCA algorithms cannot guarantee the sparsity of solution, while traditional sparse PCA algorithms, like MSPCA, have no guarantee of obtaining sufficient robustness [40]. To mitigate the problem, the research [40] proposed Robust and Sparse PCA (RSPCA) by applying L1-norm in the objective and the solution of PCA-L2. As concluded in [40], RSPCA can improve the robustness of traditional sparse PCA algorithms. Inspired by the ideas of 2DPCA and Lp-norm Generalized PCA [41], Wang et al. applied Lp-norm in the objective and the constraint function of 2DPCA [42]. Recently, Lp-norm distance has been introduced to LPP [43]. We can also see the successful use of Lp-norm in machine learning, such as [38], [44], [10], in which the objectives involve only Lp-norm minimization terms. Illuminated by the work on L1-norm robust PCA, papers [10] and [32] extended LDA-L2 to L1-norm based LDA (LDA-L1) by replacing the L2-norm distances in LDA-L2 with L1-norm ones. To solve the objective of LDA-L1, a gradient ascending (GA) iterative algorithm is proposed. The idea has been extended for Discriminant LPP (DLPP) [45] and 2-DLDA [46] which, like LDA-L2, involve many L2-norm distance maximization-minimization terms. However, the GA algorithm may not guarantee the optimality of solution, due to the introduction of a non-convex surrogate function [39], [47]. Attempting to address the problem, [47] developed an alternating optimization (AO) algorithm, which, however, does not focus on the optimization of the initial model for LDA-L1 such that the solution may deviate the original objective value. We should recognize the successes earned by recent PCA-Lp algorithms. In the literature, however, the study on Lp-norm distance based LDA is rare. Only Kwak et al. [48] had a simple attempt [39]. One of the main reasons is that solving the objectives involving Lp-norm distance maximization-minimization problems is by far more challenging. In this paper, we propose a robust linear discriminant analysis via simultaneous Ls–norm distance maximization and Lp–norm distance minimization (FLDA-Lsp), which utilizes Ls- and Lp-norm to respectively measure the between- and within-class distances. Although there exist a few algorithms to solve the objectives involving Lp-norm maximization or minimization problems [22], [38], [39], [42], [44], the resulted objective problem of FLDA-Lsp is difficult to solve. Besides different norms measured on the between-class and within-class distances, FLDA-Lsp differs from GLDA [48], mainly by designing a more effective iterative algorithm for the solution to the objective. We prove the convergence and the existence of the local minimum solution of the algorithm. Our

formulation is flexible, since it is shown that LDA-L2 and its robust variants can be explained as special cases. Our formulation is general, since its idea can also be applied to other methods involving a simultaneous L2/L1-norm distance maximization and L2/L1-norm distance minimization problem, such as recent sparse LDA techniques [9] and SRRS [11]. Theoretical studies and extensive experimental results on several image databases verify the effectiveness and applicability of our method. II. PROBLEM FORMULATION In this section, we propose a novel discriminant analysis algorithm via Ls-norm distance maximization and Lp-norm distance minimization, called Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis (FLDA-Lsp). A. Notation Let X  {x1, x2 ,..., xn}  R d n be a data matrix and Y  { y1,..., yn }  R n be the corresponding label, where yi {1,..., c} is the label of the point x i , and n and d are the number and dimensionality of data respectively. The number of samples of the i -th class is represented by ni . Define three matrices B  [n1b1 ,..., ncbc ]  R d c , T  [t1 ,..., t n ]  R d n , and F  [f1 ,..., fn ] R dn , where t j  x j  m ( j  1,..., n ), bi  mi  m ( i  1,..., c ), and fi  xi  m yi in which mi  1/ ni  ki1 xk denotes the average vector of samples n

of the i-th class and m  1/ n k 1 xk the mean of the total training set. Let sgn() be a sign function with sgn()  1 if () is a n

negative value and

sgn()  1 otherwise.

Define the linear transformation matrix as W   w1 ,..., w r   R d  r . Given p  0 , the Lp-norm

of a vector v  Rd is defined as || v || p  ( i 1| vi | p )1/ p . Note that p is a variable, which can be specified by an arbitrary character, d

such as s. B. Our formulation and solution algorithm Conventional LDA-L2 can be formulated as finding r discriminant vectors that maximize the distances between the means of different classes and the total mean and simultaneously minimize the distances between the data points of the same classes. Specifically, the optimal discriminant vector W can be found by the following problem

  c

max J 2 (W)= W

n ||WT bi ||22

i 1 i n i 1

||WT fi ||22

(1)

.

LDA-L1 [29], [30], as a robust data analysis tool, emerges by replacing the L2-norm distances in LDA-L2 with L1-norm ones

 n ||W b ||    J (W)=  ||W f ||   c

max T

W ,W WI

T

i 1 i n

1

i 1

T

i 1

i

c

r

i 1 n

j 1 i r

i 1

1

n | wTj bi| | wTj fi |

(2)

.

j 1

LDA-L1 emphasizes the L1-norm distance measure in the model. We are inspired by the Lp-norm distance related PCA algorithms [39], [41], [42], and consider that the degree of outlier effects on the within-class and between-class dispersions may be different, we replace the L1-norm within- and between-class distances in LDA-L1 with Ls-norm and Lp-norm ones, respectively. This results in the following formulation

  c

max T

W ,W WI

n ||WT bi ||ss

i 1 i n i 1

||WT fi || pp



    c

r

i 1 n

j 1 i r

i 1

n | wTj bi|s | wTj fi |p

.

(3)

j 1

The problem maximizes the ratio of the sum of the overall Ls-norm between-class distances to the sum of the overall Lp -norm within-class distances. As a result, when 0  s  2 and 0  p  2 , robustness is conferred on the objective. Although the formulation of FLDA-Lsp in (3) is simple, it is meaningful, since: 1) Unlike traditional LDA-L2, we propose to

formulate the between- and within-class distances using Ls- and Lp-norms, respectively. By setting smaller p and s, the negative effects resulting from outliers can be reduced, which is not true in LDA-L2; 2) Since the objective is not absolutely convex and contains many Lp- and Ls-norm distance maximization-minimization terms, it is hard to solve. We will propose a novel algorithm for addressing this problem; 3) Clearly, existing LDA-L2 and its L1-norm distance based variants are all special cases of our formulation. If we set p  2 and s  2 in (3), our objective function (3) becomes (1). This shows that LDA-L2 is a special case of FLDA-Lsp. When setting s  1 and p  1 in (3), we obtain the formulation of LDA-L1 in (2). These show that our method is very flexible. Despite the applicability of LDA-L1, it cannot handle heavy outliers, attributing to the utilization of L1-norm distance measure [38]. On the contrary, Lp-norm distance is an effective measure to overcome this problem [38]. Lp-norm has been widely introduced in PCA, SVM, and dictionary learning [22], [38], [39], [42], [44]. Although our method is inspired by these works, it has a more complex formulation. So, one of the contributions of this paper is the optimization of the objective problem; 4) Our idea can be incorporated into other methods involving L1- or L2-norm maximization-minimization problems, which will be explored in future work. Motivated by [30]-[33], for the solution to the problem in (3), we can convert it into a series of r=1 problems by a deflating procedure, where each step produces only a discriminant vector. If setting r=1, (3) is transformed to the following problem

  c

max J p (w )= w

n ||wT bi ||ss

i 1 i n i 1

T

||w fi ||

p p

 n |w b | .  |w f | c



T

i 1 i n

s

i

T

i 1

(4)

p

i

The extension to multiple discriminant vectors is introduced as follows. Suppose we have obtained the first k  1 discriminant vectors w1,...,wk 1 ( 1  k  r ), represented by matrix Wk 1 . To find the k -th discriminant vector w k , the information represented by the previous k  1 discriminant vectors is first discarded from the original data as follows: X  X  Wk 1WkT1X . Then, w k can be computed by optimizing the following problem

 n |w b | .  |w f | c

max J p (w k )= wk

i 1 i n i 1

T k

s

i

T k i

(5)

p

We next focus on the optimization of the problem. Most of existing solution algorithms resulting from the recent research progress [39], [41]-[43] that optimize the objectives merely involving Lp-norm related terms are inapplicable for resolving this problem. As a bright spot of this paper, we will propose an efficient algorithm to solve (4) and (5). Clearly, (4) can be viewed as a special case of (5), since the latter becomes the former when setting k  1 . Therefore, we limit our attention to the solution of (5). As for (4), we can apply the same algorithm for the derivation of solution. Our idea is to directly optimize the objective of (5). What we first need to do is to assume that λ(t )  i1 ni| w (kt ) bi|s / i1| w (kt ) fi |p is the objective value of c

T

n

T

(t ) (5) after the t -th iteration, where w is the optimal discriminant vector solved at iteration t . Then, we turn to solve the following

problem to obtain the solution of the (t  1)-th iteration w(kt +1)  argmax i1 ni| wTk bi|s  λ(t ) i1| wTk fi |p c

n

(6)

wk

Taking the derivative of the objective function in (6) with respect to w k and setting it to zero, we have that



c

n s| wTk bi|s1 sgn(wTk bi )bi  λ (t ) i1 p| wTk fi|p1 sgn(wTk fi )fi  0. n

i 1 i

(7)

Since sgn(wT fi )=wT fi /| wT fi| , the problem (7) can be rewritten equivalently as



sni| wTk bi|s1 sgn(wTk b i )b i  λ (t )  i1 i 1 c

n

2 pwTk fifiT  0. 2| wTk fi|2 p

(8)

Let ui ,i  p / (2| wTk fi|2 p ) and construct the diagonal matrix U  Rnn with its i -th diagonal entry as ui ,i , and let ki  ni s| wTk bi|s1 sgn(wTk bi ) and construct the vector k  R c with its

i-th element as k . Then, we can rewrite the equation (8) as i

Bk  2λ(t )FUFT w k  0 , whose solution is the same as that of the following problem

w(kt 1 )  argmax wTk Bk  λ (t ) wTk FUFT w k ,

(9)

wk

In (9), k and U are dependent on w k , which can be solved under the iterative framework to be proposed. To be specific, we compute k and U upon the current w k solved in last iteration. Finally, we sum up the complete computation procedure of (5) in Algorithm 1. In each iteration of Algorithm 1, the solution of the problem in Step 3 is reduced to the following system of linear equations w (kt 1)  0.5 / λ (t ) (FU (t ) FT ) 1Bk (t ) .

(10)

In real problems, FU (t ) FT may be rank deficient. In practice, we compute its inverse by introducing the regularization term  I ,

where  is a small positive constant and I an identity matrix. Algorithm 1: An effective iterative algorithm to find w k by solving the Ls-norm maximization and Lp-norm minimization problem in (5). Input: B  [b1,...,bc ]Rdc and F  [f1,...,fc ]Rdn . Initialize

w k and set t  1 .

While not converge do

  c

1. Compute λ (t ) 

T

n | w (kt ) bi|s

i 1 i n

T

| w (kt ) fi |p

i 1

2. Compute the diagonal matrix U (t ) with its i-th diagonal entry as ui ,i  p / (2| w(kt ) fi|2 p ) and the T

column vector k (t ) with its i-th element as ki  ni s| wTk bi|s1 sgn(wTk bi ) . 3. Solve max w (kt 1) Bk (t )  λ (t ) w (kt 1) FU (t ) FT w (kt 1) . (t 1) T

T

wk

4. Normalize w (kt 1) by its length and set t  t  1 . end while Output: w k

.

(t ) Remark 1: From Algorithm 1, the objective of (5) is iteratively optimized. Note that ui ,i in the step 2 of Algorithm 1 cannot be

well defined if there exist some fi resulting in w (kt ) fi  0 . Naturally, a candidate for addressing the problem is to slightly move T

(t ) (t ) (t ) w (kt ) in the way of wk  (wk  Δ) / wk  Δ , as done in [30], [32], where Δ is a small random vector. Considering that this way may

(t ) require trying a large number of times, we regularize ui ,i as ui(,ti)  0.5 p / ( (w (kt ) fi )2  υ)2 p , where υ is a very small positive number. T

Also, | wTk bi|s1 in the step 2 may be not well-defined in the case of s  1 . We use the same method to overcome the problem. Remark 2: Likewise, in papers [10], [32], the deflating procedure is applied for the solving of multiple discriminant vectors of LDA-L1. In each step seeking for only a discriminant vector, the GA is proposed to solve the objective of LDA-L1, which may not obtain an optimal solution due to introduction of a non-convex surrogate function and the complication of the choice of the stepsize [39],[47]. To address this problem, [47] proposed an alternative optimization (AO) strategy. To use this strategy, we need to

transform the numerator of (5) into an equality constraint beforehand, i.e.,



c

n ||WT bi ||1  1 . As a result, the transformed objective

i 1 i

min i1| WT fi | instead of the original one is optimized. To the end, the AO cannot iteratively minimize the original objective, such n

that the solution may deviate the original objective value. It should be noted that [47] minimizes the objective function  i1| WT t i | , n



whose Bayes error bound, however, is looser than

n

| WT fi | [31]. Since the objective function of LDA-L1 is identical to the one

i1

of our FLDA-Lsp under the setting of p=1 and s=1, naturally the solution of LDA-L1 can be found by the proposed iterative framework in Algorithm 1. In such case, compared to the GA algorithm, ours does not need the introduction of the non-convex surrogate function and simultaneously overcomes the complication of the choice of stepsize. When comparing with the AO algorithm, ours escapes the transformation of the original objective in each iteration. On the contrary, we directly tackle the original problem of LDA-L1. In fact, λ (t ) in Algorithm 1 can be viewed as a factor balancing different contributions of the within-class and between-class dispersions, and its value is automatically obtained by learning. Besides, our iterative algorithm is applicable for the solution to the proposed Ls- and Lp-norm distance maximization and minimization problem. In total, the proposed Algorithm 1 is very flexible and more effective. The experiments in Section V will provide numerical results, indicating the effectiveness of our iterative algorithm in solving the LDA-L1 problem. III. ALGORITHMIC ANALYSIS The goal of this section is to first provide the theoretical poof of Algorithm 1 in the terms of convergence, then present analysis on the time complexity, and finally discuss how to initialize the solution. We still discuss the problem (5), unless otherwise stated. A. Theoretical analysis Now, we analyze the convergence and the existence of local maximum of the proposed iterative procedure in Algorithm 1. Before this, we introduce the following lemmas. Lemma 1 [49]: For any scalar   0. , the inequality 2 p / 2  p  p  2  0 Lemma 2: For any two nonzero scalars  and  , we have p||p2||2 2| | p  p||p2||2 2||p when 0  p  2 . 2 Proof: Given any scalar  , we let     0 . According to Lemma 1, we have 2||p  p 2  p  2  0 . Let    /  , we obtain that

2 || /||  p(|| /|| ) 2  p  2  0 p

 p(|| /|| ) 2||p 2||p  (p  2)| |p  p||p2||2 2||p  p||p2| |2 -2| |p

Thus the proof is completed.  Theorem 1: For s  1 , Algorithm 1 monotonically increases the objective of (5) in each iteration. Proof: According to the step 3 in Algorithm 1, for each iteration we have w (kt 1) Bk (t )  λ (t ) w (kt 1) FU (t )FT w (kt 1)  w (kt ) Bk (t )  λ (t ) w (kt ) FU (t )FT w (kt ) T

T

T

T

(11)

(t ) According to the definitions of U and k (t ) , the inequality (13) can be expressed with following equivalent formulation

s  i1 ni| w (kt ) b i|s1 sgn(w (kt ) b i )(w (kt 1) b i )  λ (t )

p n w (kt 1) fifiT w (kt 1) T  2 i1 | w (kt ) fi|2 p

 s  i1 ni| w (kt ) b i|s1 sgn(w (kt ) b i )(w (kt ) b i )  λ (t )

p n w (kt ) fifiT w (kt ) T  2 i1 | w (kt ) fi|2 p

c

T

c

T

T

 s  i1 ni| w (kt ) b i|s  λ (t ) c

T

T

T

T

T

p n w (kt ) fifiT w (kt ) T  2 i1 | w (kt ) fi|2 p

T

T

(12)

(t 1) (t ) Let   w k fi and   w k fi in Lemma 2. According to Lemma 2, we have T

T

T T p n w (kt 1) fifiT w (kt 1) p n w (t ) f f T w (t ) n n  λ (t )  i1| w (kt 1) fi|p  λ (t )  i1 k (t )Ti i 2 kp  λ (t )  i1| w (kt ) fi|p  i 1 (t ) T 2 p 2 2 | w fi| | w k fi| T

λ (t )

T

(14)

By adding the two inequalities (14) and (15) in the both sides, we achieve s  i1 ni| w (kt ) bi|s1 sgn(w (kt ) bi )(w (kt 1) bi )  λ (t )  i1| w (kt 1) fi|p c

T

T

n

T

T

(15)

 s  i1 ni| w (kt ) bi|s  λ (t )  i1| w (kt ) fi|p . c

n

T

T

Cleary, the function f (w k )   i1 ni | wTk bi |s is convex in the case of s  1 . Since for any convex function f (x) with respect to c

(t ) (t ) variable x , the inequality f (x)  f (x )+  f (x) xx(t ) (x  x ) is satisfied, where x f (x) xx(t ) denotes the gradient of f (x) at point

x(t )



c

.

Using

this

fact

and

the

n | w(kt 1) bi|s  i1 ni| w(kt ) bi|s + f (wk )|w c

T

T

i1 i



c

 f (wk )|w

equality (t ) k w k

=si1 ni| w(kt ) bi|s1 sgn(w(kt ) bi )bi c

(t ) k w k

T

T

,

one

easily

gets

( w(kt 1)  w(kt ) ) which leads to

n | w (kt 1) bi|s  si1 ni| w (kt ) bi|s1 sgn(w(kt ) bi )bTi w (kt 1)  (1  s)i1| w (kt ) bi|s c

T

T

c

T

T

i 1 i

(16)

Adding (15) and (16) into the both sides gives



n | w (kt 1) bi|s  λ(t ) i1| w(kt 1) fi|p  i1 ni| w (kt ) bi|s  λ(t ) i1| w(kt ) fi|p .

c

n

T

c

T

n

T

T

i 1 i

(17)

Substituting λ(t )  i1 ni| w (kt ) bi|s / i1| w (kt ) fi |p into the right-hand term of (17), we obtain that c

n

T



c

T

n | w (kt 1) bi|s  λ(t ) i1| w(kt 1) fi|p  i1 ni| w(kt ) bi|s  i1 ni| w(kt ) bi|s  0, n

T

c

T

c

T

T

i 1 i

resulting in

 n|w  |w c

i 1 i n i 1

(t 1) T k

(t 1) k

T

bi|s

fi|p

 |w   |w c

λ

(t )

i 1 n

i 1

(t )T k

bi|s

T

fi|p

(t ) k

(18)

Thus, the objective value of (5) monotonically increases in each iteration.  Theorem 1 makes us convince that objective function in (5) will monotonically increase in each iteration. According to the optimization theory, as long as we show that it has an upper bound, the above Algorithm 1 will converge and the equality in (18) holds at convergence. The existence of the upper bound of objective (5) can be guaranteed by the following theorem. Theorem 2: Let λop be the maximum value of the objective function (5). With Theorem 1, λop is existent. Proof: According to the definition of λop , we have

 n |w b | . )=  |w f | c

λop  max J p (w k

i 1 i n i 1

T k

T k i

s

i

(19)

p

There is no doubt that for arbitrary w k , the inequality J p (wk )  λop holds, which can be expressed with the one



c

n | wTk bi |s  λop i1| wTk fi |p  0. n

(20)

i1 i

Denote a function with respect to λ as follows J (λ)=max i1 ni| wTk bi|s  λop i1| wTk fi |p . It is observed from (20) that when the c

n

algorithm converges, we can get the maximum value 0, i.e., ic1 ni| wTk bi|s  λop in1| wTk fi |p =0. It is evident that λop is the solution of J (λ)=0 . According to the step 3 of Algorithm 1, the function value of J (λ) at λ (t ) is J (λ(t ) )=max wTk Bk (t )  λ(t ) wTk FU(t )FT wk  w(kt 1) Bk (t )  λ(t ) w(kt 1) FU(t )FT w(kt 1) T

wk

T

(21)

Taking the derivative of J (λ(t ) ) with respect to λ (t ) gives J' (λ (t ) )  w (kt 1) (FU (t )FT  ηWk 1WkT1 )w (kt 1) . Note that w (kt 1) FU (t )FT w (kt 1)  0 . T

T

Thus, J' (λ(t ) )  0 . We further conclude that at any point J (λ) is monotonically decreasing. Also, it is easy to check that J (0)=max i1 ni| wTk bi|s  0 and J (+)  0 ,and J (λ) is continuous. In this way, we have that the solution of J (λ)=0 exists and is c

unique, λop is, therefore, existent. This means that there exists an upper bound in the objective function (5).  Note that the proof of theorem 2 is motivated by [50]. Obviously, our objective function (5) is nonnegative. In this way, with existence of upper bound and increasing mononic property, Algorithm 1 will converge, which, therefore, can iterate until the condition of convergence is reached. To guarantee an iterative algorithm iterates in finite steps, in real applications, the condition of convergence is usually set as the difference between objective values of two iterations less than a small value rather than zero. Theorem 3: At convergence, Algorithm 1 will find a local optimal solution to the problem (5). Proof: The Lagrangian function of (5) is

  c

L (w k ) 

n | wTk bi|s

i 1 i n

| wTk fi |p

(22)

.

i 1

Taking the derivative with respect to w k and setting it to zero we get the KKT conditions of (22) as follows: s (i1 ni| wTk bi|s1 sgn(wTk bi )bi )(i1| wTk fi |p )  p(i1| wTk fi|p1 sgn(wTk fi )fi )(i1 ni| wTk bi|s ) c

n

n

c

(23)

Suppose that Algorithm 1 converges to a solution w* . At each iteration, we find the optimal solution of the problem in the step 3. Therefore, the converged solution w* satisfies the KKT condition of the problem. The Lagrangian function of the problem in the step 3 of Algorithm 1 is as follows: L (w (t 1) )  w (kt 1) Bk (t )  λ (t ) w (kt 1) FU (t )FT w (kt 1) ). T

T

(24)

Taking the derivative w.r.t. w (t1) and setting it to zero, we have that Bk (t )  2λ(t )FU(t )FT w(kt 1) =0. , which results in T T f f T w (t 1) c n s i1 ni| w (kt ) bi|s1 sgn(w (kt ) bi )bi  λ(t ) pi1 i (it )T k 2 p  0. | w k fi|

(25)

Using λ(t )  i1 ni| w(kt ) bi|s / i1| w (kt ) fi |p , (25) becomes c

T

n

T

T T T T f f T w (t 1) c n n c s ( i1 ni| w (kt ) bi|s1 sgn(w (kt ) bi )bi )( i1| w (kt ) fi |p )  (p  i1 i (it )T k 2 p )( i1 ni| w (kt ) bi|s ). | w k f i|

(26)

Since fifiT w (kt 1) | w (kt 1) fi| sgn(w (kt 1) fi )fi , (26) becomes T

T

s( i1 ni| w (kt ) bi|s1 sgn(w (kt ) bi )bi )( i1| w (kt ) fi |p ) c

T

T

n

T

 (p  i1| w (kt 1) fi|| w (kt ) fi|p2 sgn(w (kt 1) fi )fi )( i1 ni| w (kt ) bi|s ). n

T

T

T

c

T

(27)

We can see that (23) and (27) are the same when Algorithm 1 is converged. This implies that the converged solution w* satisfies (23), which is the KKT condition of the problem (5). According to [34], satisfying the KKT condition indicates that the solution is a local optimum solution.  One point should be highlighted. From Theorem 1, Algorithm 1 is not shown to be monotonically increasing when s  1 , but our FLDA-Lsp performs well, even for small values of s in practice. Likewise, the problem exists in Lp-norm distance based PCA [39], [42].

B Time complexity analysis As stated in Algorithm 1, the proposed objective problem is iteratively optimized. In the first step, λ (t ) is first computed, whose time complexity is O(dc  dn) , where dc and dn stand for the time complexities computing



c

T

n | w (kt ) bi|s , and

i 1 i



n

T

| w (kt ) fi |p ,

i1

respectively. In the second step, we need to compute U (t ) and k (t ) . U (t ) is computed with the cost of O (dn ) , while the complexity to obtain k (t ) is O(dc  c2 ) . Thus, the time complexity of the second step is O(dc  c2 +dn) . According to (10), the cost of the third step is determined by two parts: matrix multiplication and inversion computation. Let H  FU(t ) FT . Since the cost for computing FU (t )FT is O(dn2  d 2n) , it needs O(dn2  d 2n) to obtain the matrix H . With H obtained, its inversion cost is O(d 3 ) . In the end, the

time complexity of computing H -1 is O(dn2  d 2n+d 3 ) . We still need the computational complexity of O(d 2c  dc) , which is determined

by

the

matrix

multiplication H-1Bk (t ) .

Thus

the

total

time

complexity

of

our

method

is

O(T (d (c  n  n2 )  c2  d 2 (n  c)+d 3 )) in which T is the number of iterations.

C Initiation of FLDA-Lsp It should be highlighted that each of exiting robust L1- or Lp-norm techniques in pattern recognition and machine learning generally requires resorting to an iterative algorithm to determine the projection. Undoubtedly, there is a possibility that the obtained solution may be not global optimal [13]. So, the choice of initialization significantly affects the derivation of optimal solution. Although our FLDA-Lsp is essentially different from these methods in solution algorithm, it suffers from the same problem. There are two common practices for solution initialization of FLDA-Lsp. One is the multistart method [21]. That is, multiple initial solutions are obtained by trying random initializations of multiple times and the one resulting in the maximal objective value is finally used. This practice, however, is rarely used. We can only see its use in L1-norm related PCA methods [17]. Furthermore, the optimal times of random initializations is difficult to be given by an exact value. Given a robust L1- or Lp- norm method, the second practice is to set the initial solution as the solution of its L2-norm counterpart, which is widely used in more recent works, such as [39], [41], [10], [45], [48]. Considering that LDA-L2 is a L2-norm version of FLDA-Lsp, we directly use the discriminant vectors of LDA-L2 as the initial solutions of FLDA-Lsp. In this way, the iterative algorithm of FLDA-Lsp plays an important role to iteratively update the solution of LDA-L2, such that the projection direction of LDA-L2 that drifts from the desired direction can be well corrected. Furthermore, the relationships between FLDA-Lsp and previous LDA methods, such as LDA-L2, LDA-L1 [32], [47], and GLDA [48], are fully utilized. IV. EXPERIMENTAL VERIFICATION In this section, we conduct extensive experiments on several image databases to further evaluate the discriminant performance of the proposed method and further verify the previous theoretical analysis made by us. The same experiments are also conducted using several popular methods. A. Database description For a fair comparison, we conduct experiments on the BA, AR, and YALE databases used in [10], [32], [47]. Additionally, we also evaluate our algorithm on ALOI [51] and German Traffic Sign Detection Benchmark (GTSDB) [52] to verify its effectiveness in suppressing the negative effects of the noise, such varies on illumination directions, illumination temperatures, object orientations, perspective, and lighting condition. The sample images of each database are shown in Fig.1. In the BA database, there are a number of 390 digit images of 10 classes, each being represented by a 20×16 dimensional vector

in image space. The ALOI database contains 72000 images of 1000 general object categories taken at different illumination directions, illumination temperatures, object orientations, and wide-baseline stereo angles. We selected the images of the first 80 classes under different illumination directions, and resized each image to 32×32 pixels in the experiment. The GTSDB database contains 900 outdoor training images of 1360×800 pixels, each containing zero to six traffic signs. There are 42 kinds of traffic signs, which may appear in every perspective and under every lighting condition. The sizes of the traffic signs vary from 16×16 to 128×128. There are at least three images in each kind of signs. For conveniently conducting experiments, we removed those classes in which the number of samples are smaller than 15. The traffic signs portion of each original image was automatically cropped, and the cropped image was resized to 32×32 pixels.

(a) AB

(b) GTSDB

(c) ALOI

(d)YALE

(e) AR Fig.1 Some images of each database.

The YALE database was constructed by the Yale Center for Computational Vision and Control. It contains 165 grayscale images of 15 individuals. The images demonstrate variation in lighting condition (center light, left light, and right light), facial expression (normal, happy, sad, sleepy, surprised, and winking). In our experiments, the facial portion of each original image was automatically cropped based on the location of eyes and mouth, and the cropped image was resized to 32× 32 pixels. The AR database consists of over 4000 face images of 126 individuals, including frontal views of faces with different facial expressions, lighting conditions, and occlusions. The images of 120 individuals are taken in in two sessions and each section has 13 color pictures. The pictures in the first session are selected and used in our experiments. The face portion of each image is manually cropped and then resized to 80×60 pixels.

B. Experimental settings

We compare FLDA-Lsp with several popular LDA methods: LDA-L2, OLDA [53], LDA-L1 [29], L1-LDA [47], and GLDA [48]. We also evaluate the effectiveness of the AO proposed in [47] by applying it to the objective of LDA-L1, leading to a newly-defined method called L1-BOLDA. As discussed, our FLD-Lsp is a flexible method, whose solution algorithm in Algorithm 1 can be generalized to solve the objective of LDA-L1. To indicate the merit of the algorithm in solving the objective of LDA-L1, we add one method in the experiment, called FLDA-L1. In this way, there are two methods for us, i.e., FLDA-L1 and FLDA-Lsp. Simply speaking, FLDA-L1 is the L1-norm version of FLDA-Lsp. Finally, there are eight methods for comparisons, i.e., LDA-L2, OLDA, LDA-L1, L1-LDA, GLDA, L1-BOLDA, FLDA-L1, and FLDA-Lsp. After the optimal subspaces are found, many classifiers can be finally used for predicting the unseen data, such as Radial basis probabilistic neural networks [54], [55]. However, we use a nearest neighbor classifier, followed the researches [29], [47], [48]. We implement all algorithms in MATLAB 2014 equipped in a PC with Intel(R) Xeon(R) 2CPU processor (2.4 GHz) 32-GB RAM. For the BA data set, we would like to investigate the performance of each method using random K images (K=7, 9, and 11) per kind of digits for training and the remaining images for testing. For each K, we implement each method on the training set contaminated by outliers. The contamination levels vary from 0% to 50%. Following [29], [47], the outliers are simulated by adding occlusion with black or white dots rectangle noise, whose size and location in the image are at least 3×3 and random. We run the system 10 times and report the mean optimal results. The pictures in GTSDB are easy to be corrupted by outliers, since they are collected from real outdoor environments. Thus, for GTSDB, we implement all the methods on the original dataset. To be specific, 5, 10, and 15 images per class are randomly selected to from the training set and the remaining images are used for testing. Finally, the recognition rate is recorded by averaging the 10 rounds of repetitions. For the ALOI dataset, we use the first K (K=5, 7, and 10) images per object for training and the other images for testing. For the YALE face database, the first K=5 and 7 face images per person are used for training and the remaining 11-K images for testing. For the AR face database, K=2 and 3 images per person are used as training set and the remaining images as testing set. For each K, the average recognition rate over 5 runs is reported. Same as the settings in the BA dataset, for each K of ALOI,YALE, and AR, 0%, 10%, 20%, 30%, 40%, and 50% of the training images are selected to add black or white dots rectangle noise, whose size and location in image are at least 20×20 and random.

(a)

(b)

Fig.2. The optimal recognition rates of FLDA-Lsp versus the variations of (a) the norm parameter p and (b) the norm parameter s on the ALOI database.

LDA-L1, L1-LDA, L1-BOLDA, GLDA, FLDA-L1, and FLDA-Lsp are iterative methods. For each of them, the maximum iterative number and the stopping criterion are set as 30 and the difference between objective values of two iterations less than 0.001, respectively. For overcoming the singularity and simultaneously keeping the most discriminant information of LDA-L2, PCA-L2 is applied as a preprocessing step by throwing away the components corresponding to zero eigenvalues. In using the eight methods, we need to set the initial discriminant vectors. In the experiments, the initial discriminant vector of each of the six

methods is set as the solution of its L2-norm counterpart. The optimal dimensions are ranged in [1, 2*c] on BA, YALE, and GTSDB, and in [5,100] on AR, and ALOI, with step 5. LDA-L1 has a single parameter: stepsize. For GLDA and FLDA-Lsp, there are two parameters: stepsize and norm parameter p for GLDA, and norm parameters p and s for FLDA-Lsp. Since simultaneously determining these parameters is difficult, we adopt a stepwise selection strategy here [56]. Fig.2 plots the variations of the parameters versus the recognition rates of our FLDA-Lsp on ALOI with the contamination levels of 20%, 40%, and 50%, when the dimensions are 50 and K=7. It can be seen that when the norm parameters p and s are smaller, FLDA-Lsp generally achieves good performances. Similar phenomenon exists on other databases. C. Experimental results and analysis In this subsection, we use Figs.3-8 and Table 1 to show the recognition rates of each method versus the variation of dimensions and its optimal value under different levels of contamination. Considering the page limit, we only plot the recognition rate of each method versus the variation of dimensions on ALOI, BA, GTSDB, and AR when the contamination level is 20%. It is observed from the reported results that OLDA, as an orthogonal extension of LDA-L2, generally obtains higher recognition rate compared to LDA-L2. This attributes to the orthogonal discriminant vectors generated. Like LDA-L1, L1-LDA embeds the L1-norm distance measure in the LDA-L2 objective problem. However, L1-LDA obtains by far lower recognition rates than LDA-L1 and OLDA in all cases. We also see that in many cases, the recognition rate of L1-LDA is worse than LDA-L2. As analyzed previously, there are two facts making L1-LDA inferior to LDA-L1. Firstly, the AO algorithm of L1-LDA does not iteratively minimize the original objective function. Secondly, the defined objective function of L1-LDA cannot guarantee the derivation of optimal Bayes solution. The second argument can be further verified by the experimental results of L1-BOLDA. The objective of L1-BOLDA is the same as that of LDA-L1, which, however, is also solved by the AO algorithm. As shown, in all cases, L1-BOLDA outperforms LDA-L2 and L1-LDA, in the terms of recognition rate. Although L1-BOLDA is not a consistent winner when comparing it with OLDA, it obtains satisfactory results. To be specific, on YALE, and AR, L1-BOLDA has higher recognition rates than OLDA, whereas on ALOI, L1-BOLDA generally has better results when the contamination level is greater than 40%. Despite the exhibited good results, still L1-BOLDA has great improvement space compared to LDA-L1. As can be seen from the results, LDA-L1 significantly improves the performance of L1-BOLDA in most cases, although it achieves better performances on AR. This indicates that the AO algorithm is not applicable enough for the solving of L1-norm discriminant objectives.

(a) K=5

(b) K=7

(c) K=10

Fig.3. Recognition rate of each method versus the variation of dimensions (the second row) and its optimal value (the first row) under different levels of contamination on the ALOI database.

(a) K=7

(b) K=9

(c) K=11

Fig.4.Recognition rate of each method versus the variation of dimensions (the second row) and its optimal value (the first row) under different levels of contamination on the BA database.

(a) K=5

(b) K=10

(c) K=15

Fig.5. Recognition rate of each method versus the variation of dimensions (the second row) and its optimal value (the first row) on the GTSDB database.

(a) K=2

(b) K=3 Fig.6. Recognition rate of each method versus the variation of dimensions and its optimal value under different levels of contamination on the AR database.

GLDA can be selected as a more satisfactory robust feature extraction tool compared to LDA-L2, OLDA, LDA-L1, L1-BOLDA, and L1-LDA, which can be observed from the results reported. This convinces us that the Lp-norm distance is more beneficial for the performance improvement of discriminant analysis. We should point out such a point that GLDA is a generalization of LDA-L1, which shares the same objective of LDA-L1 under the setting of p=1 and s=1. This makes GLDA inherit the shortcoming of LDA-L1 that cannot guarantee the derivation of the optimality solution, such that the performance advantage of GLDA is absent when comparing it with our FLDA-L1 and FLDA-Lsp. In most cases, FLDA-L1 obtains higher recognition rates, which shares the same objective of LDA-L1 but applies the algorithm in Algorithm 1. The higher recognition rate of FLDA-L1 also indicates that our algorithm in Algorithm 1 can also be effectively applied to the solution to the LDA-L1 objective. Among the eight methods, FLDA-Lsp achieves the best recognition rate, regardless of whether or not the data includes outliers. FLDA-L1 and FLDA-Lsp are two methods for us, and the former is the L1-norm version of the latter. The higher recognition rate of FLDA-Lsp over FLDA-L1 indicates that applying the Ls-norm and Lp-norm to respectively measure the between-class and within-class distances can better promote the robustness. In addition, we can see from the figures, the recognition rate of each method increases with the increase of the training samples. When the contamination percent is increasing, the recognition rate is decreasing, for each method. Despite this, our FLDA-Lsp still obtain satisfactory results. We see from the plots of recognition rate versus the variation of dimensions that FLDA-Lsp always outperforms other methods at any dimension. The special case of

FLDA-Lsp, FLDA-L1, basically exhibits the same phenomenon when compared to LDA-L2, OLDA, L1-BOLDA, L1-LDA, LDA-L1, and GLDA.

(a) K=5

(b) K=7

Fig.7. Optimal recognition rate of each method under different levels of contamination on the YALE database.

Finally, we focus on the computing time comparison shown in Table 1. We conduct the experiment on the two larger image databases, i.e., ALOI and AR databases. The numbers of training samples of each class are set as 10 and 3 (the settings have been used in the previous experiments), respectively. LDA-L1, L1-LDA, L1-BOLDA, GLDA, FLDA-L1, and FLDA-Lsp adopt a deflating procedure and thus their computational costs are influenced by the dimensions. To explore this, for each database, we records the computing time of each method when the dimensions are set to 50 and 100, respectively. As seen, among the five methods, LDA-L1 and GLDA are the fastest for no inverse computation and no need to solve a linear system of equations. Our FLDA-L1 and FLDA-Lsp’s computational costs are the highest, since they require the computation of matrix inverse except for a number of matrix multiplications. Despite this, all the computations associated with FLDA-L1 and FLDA-Lsp are done in the learning phase. After the features are extracted, there will be no extra cost in classification. Furthermore, they can obtain better recognition performances as indicated in our experiments. TABLE I THE COMPUTATIONAL TIME OF EACH METHOD ON ALOI AND AR UNDER DIFFERENT DIMENSIONS. Method ALOI

AR

LDA-L2

OLDA

L1-LDA

L1-BOLDA

LDA-L1

GLDA

FLDA-L1

FLDA-Lsp

Dim=50

0.7851

0.7683

88.783

296.25

2.2733

1.9821

589.75

468.27

Dim=100

0.8004

0.7335

181.95

445.62

3.6168

3.3989

1089.59

866.22

Dim=50

0.2086

0.2683

5.1883

36.723

0.5409

0.6720

189.91

151.04

Dim=100

0.2091

0.2335

10.395

66.687

1.0237

1.1016

378.27

229.28

V. CONCLUSIONS In this paper, we proposed a new robust linear discriminant analysis method based on Ls- and Lp-norm distances, called FLDA-Lsp. The proposed method formulated the objective using Ls-norm between- class and Lp-norm within-class distances. Our FLDA-Lsp is very flexible, which can be explained as many of previous efforts by specifying s and p. Despite the complexity of model, an iterative algorithm was proposed to solve the optimal discriminant vectors of FLDA-Lsp. We also applied the algorithm for solving the LDA-L1 objective. Theoretically, we conducted the proofs of the existence of local minimum and the convergence of the iterative algorithm. The arguments made were further verified by empirical evaluations on several image databases. Our

method is general, whose idea can be extended for other dimension reduction methods involving a simultaneous L2/L1-norm distance maximization-minimization problem. ACKNOWLEDGEMENTS The work is supported in part by the Central Public-interest Scientific Institution Basal Research Fund under Grant CAFYBB2016SZ003, the National Science Foundation of China under Grants 61401214 and 61773210, and the Natural Science Foundation of Jiangsu Province under Grant BK20171453. REFERENCES [1] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991. [2] A. M. Martinez and A. C. Kak, “PCA versus LDA,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 23, no. 2, pp. 228–233, Feb. 2001. [3] J. Yang, D. Zhang, A.F. Frangi, and J. Y. Yang, “Two-dimensional PCA: a new approach to face representation and recognition,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 26, no. 1, pp. 131-137, Jan. 2004. [4] Z. H. Lai, W. K. Wong, Y. Xu, J. Yang, and D. Zhang, “Approximate Orthogonal Sparse Embedding for Dimensionality Reduction,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no.4, pp.723 – 735, 2016. [5] S. Yan, D. Xu, B. Zhang, and H.-J. Zhang, “Graph embedding: a general framework for dimensionality reduction,” in Proc. IEEE Conf. CVPR, 2005, pp. 830-837. [6] M. Sugiyama, “Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis,” J. Mach. Learn. Res., vol. 8, pp. 1027–1061, May 2007. [7] Q. X. Gao, J. J. Ma, H. L. Zhang, and X. B. Gao, “Stable orthogonal local discriminant embedding for linear dimensionality reduction,” IEEE Trans. Image Process., vol. 22, no.7, pp.2521-2531, 2013. [8] F. Dornaika and Y. Traboulsi,“Learning Flexible Graph-based Semi-supervised Embedding,” IEEE Trans. Cybern., vol.45, no.6, pp. 206-218, 2015. [9] Y. Wu, D.P. Wipf, and J. M. Yun, “Understanding and Evaluating Sparse Linear Discriminant Analysis,” in Proc. 18th Int. Conf. Artif. Intell. Statist, vol. 38, San Diego, CA, USA, 2015. [10] F.J. Zhong and J. S. Zhang, “Linear discriminant analysis based on L1-norm maximization,” IEEE Trans. Image Process., vol. 22, no. 8, pp.3018-3027, 2013. [11] S. Li and Y. Fu, “Learning Robust and Discriminative Subspace with Low-Rank Constraints,” IEEE Trans. Neural Netw. Learning Syst., vol.27, no.11, pp. 2160 -217, 2016. [12] L. P. Xie, D. C. Tao, H. K. Wei, “Joint Structured Sparsity Regularized Multi-view Dimension Reduction for Video-based Facial Expression Recognition,” ACM Transactions on Intelligent Systems and Technology, vol. no. 2017. [13] F. Z. Chelali , A. Djeradi , and R. Djeradi, “Linear discriminant analysis for face recognition,” International conference on multimedia computing and systems, vol. 2, no.2, pp 1-10, 2009. [14] Y. Guo, T. Hastie, and R. Tibshirani, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol.8, no.1, pp.86-100, 2007. [15] Z. Q. Zhao, D. S. Huang and B. Y. Sun, "Human face recognition based on multiple features using neural networks committee," Pattern Recognition Letters, vol.25,no.12, pp.1351-1358, 2004. [16] B. Li and D. S. Huang, Locally linear discriminant embedding: An efficient method for face recognition," Pattern Recognition, vol.41, no.12, pp. 3813-3821, 2008.

[17] P. R, Rosenbaum, D. B. Rubin, “The central role of the propensity score in observational studies for causal effects,” Biometrika, vol. 70. no.1, pp. 41-55, 1983. [18] M. Belkin, P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol.15, 1373 – 1396, 2003. [19] X. He, S. Yan, Y. Hu, P. Niyogi, and H. J. Zhang, “Face recognition using Laplacianfaces,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 27, no. 3, pp. 328-340, Mar. 2005. [20] Q. L. Ye, J. Yang, T. M. Yin, and Z. Zhang, “Can the Virtual Labels Obtained by Traditional LP Approaches Be Well Encoded in WLR?” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no.7, 1591-1598, 2016. [21] Y. Wu, D.P. Wipf, and J. M. Yun, “Understanding and Evaluating Sparse Linear Discriminant Analysis,” in Proc. 18th Int. Conf. Artif. Intell. Statist, vol. 38, San Diego, CA, USA, 2015. [22] H. Tao, C. Hou, F. Nie, Y. Jiao and D. Yi. “Effective Discriminative Feature Selection with Non-trivial Solutions,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 4, pp. 796-806, 2016. [23] L. P. Xie, D. C. Tao, H. K. Wei, “Joint Structured Sparsity Regularized Multi-view Dimension Reduction for Video-based Facial Expression Recognition,” ACM Transactions on Intelligent Systems and Technology, 2016, to appear. [24] S. Li and Y. Fu, “Learning Robust and Discriminative Subspace with Low-Rank Constraints,” IEEE Trans. Neural Netw. Learning Syst., 2016, to appear. [25] G. C. Liu, Z. C. Lin, and S.C Yan, “Robust Recovery of Subspace Structures by Low-Rank Representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 171-184, 2013. [26] J. Wright, A. Y. Yang, A. Ganesh, et al., “Robust Face Recognition via Sparse Representation,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vo1.31, no.2, pp.210-227, 2009. [27] Y. Yu, Z. H, Lai, Y. Xu, X. Li, and D. Zhang, “Low-Rank Preserving Projections,” IEEE Trans. Cybern., vol.46, no. 8 pp. 1900-1913, 2016. [28] M. Shao, D. Kit, and Y. Fu , “Generalized Transfer Subspace Learning through Low-Rank Constraint,” International Journal of Computer Vision, vol.109, no.1-2, pp 74-93, 2014. [29] Y. Xu, X. Fang, J. Wu, X. L. Li, D. Zhang,” Discriminative Transfer Subspace Learning via Low-Rank and Sparse Representation,” IEEE Trans. Image Processing, vol. 25, no.2) , pp.850-86,2016. [30] N. Kwak, “Principal component analysis based on L1-norm maximization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 9, pp. 1672–1680, 2008. [31] Q. L. Ye, J. Yang, F. Liu, C. X. Zhao, and N. Ye, “L1-Norm Distance Linear Discriminant Analysis Based on an Effective Iterative Algorithm,” 2016, to appear. [32] H. X. Wang, X.S. Lu, Z.L. Hu, and W.M. Zheng, “Fisher discriminant analysis with L1-norm,” IEEE Trans. Cybern., vol. 44, no. 6, pp. 828 – 842, June 2014. [33] C. Ding, D. Zhou, X. He, and H. Zha, “R1-PCA: Rotational invariant L1- norm principal component analysis for robust subspace factorization,” in Proc. Int. Conf. Mach. Learn., 2006, pp. 281–288. [34] F. Nie, H. Huang, C. Ding, D.J Luo, and H. Wang, “Robust principal component analysis with non-greedy L1-norm maximization,” in Proc. Int. Joint Conf. Artif. Intell., 2011. [35] X. Li, Y. Pang, and Y. Yuan, “L1-norm-based 2DPCA,” IEEE Trans. Syst. Man. Cybern. B, Cybern., vol. 40, no. 4, pp. 1170–1175, Aug. 2009. [36] Y. Pang, X. Li, and Y. Yuan, “Robust tensor analysis with L1-norm,” IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 2, pp. 172–178, Feb. 2010.

[37] D. Meng, Q. Zhao, and Z. Xu, “Improve robustness of sparse PCA by L1-norm maximization,” Pattern Recognit., vol. 45, no. 1, pp. 487–497, 2012. [38] H. Wang, F. P. Nie, W. D. Cai, and H. Huang, “Semi-Supervised Robust Dictionary Learning via Efficient ℓ2,0+-Norms Minimization,” in Proc. IEEE Conf. CVPR, 2013. [39] N. Kwak, “Principal component analysis by Lp-norm maximization,” EEE Trans. Cybern., vol. 44, no. 5, pp. 594–609, May 2014. [40] D. Meng, Q. Zhao, and Z. Xu, “Improve robustness of sparse PCA by L1-norm maximization,” Pattern Recognit., vol. 45, no. 1, pp. 487–497, 2012. [41] Z. Liang, S. Xia, Y. Zhou, L. Zhang, and Y. Li, “Feature extraction based on Lp-norm generalized principal component analysis,” Pattern Recognit. Lett., vol. 34, no. 9, pp. 1037–1045, 2013. [42] J. Wang, “Generalized 2-D Principal Component Analysis by Lp-Norm for Image Analysis,” IEEE Trans. Cybern., vol.46, no.3, pp.792-803, 2016. [43] H. Wang, F. P.Nie, “Learning Robust Locality Preserving Discriminant via p-Order Minimization,” in Proc. Int. Conf. Artif. Intell., 2015. [44] F. P. Nie, Y. Z. Huang, X. Q. Wang, and H. Huang, “New Primal SVM Solver with Linear Computational Cost for Big Data Classifications,” in Proc. Int. Conf. Mach. Learn., 2014. [45] F. J. Zhong and J. S. Zhang, “Discriminant, Locality Preserving Discriminants Based on L1-Norm Maximization,” IEEE Trans. Neural Netw. Learning Syst., vo.25, no.11, pp.2065- 2074, 2014 [46] C. N. Li, Y. H. Shao, and N.Y. Deng, “Robust L1-norm two-dimensional linear discriminant analysis,” Neural Netw., vol. 65, pp.92-104, 2015. [47] W. M. Zheng, Z. C. Lin, and H. X. Wang, “L1-norm kernel discriminant analysis via Bayes error bound optimization for Robust Feature Extraction,” IEEE Trans. Neural Netw., vol.25, no.4, pp.793-805, 2014. [48] J. H. Oh, N. Kwak, “Generalization of linear discriminant analysis using Lp-norm,” Pattern Recognition Letters, vol.34, pp.679-685, 2013. [49] F. P. Nie, H. Huang, and C. Ding, “Low-Rank Matrix Recovery via Efficient Schatten p-Norm Minimization,” in Proc. Int. Conf. Artif. Intell. 2012. [50] Y. Huang, “Semi-Supervised Dimension Reduction Using Trace Ratio Criterion, “ IEEE Trans. Neural Netw. Learning Syst. vol. 23, no. 3, pp.519-526, 2012. [51] J.-M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders,”The Amsterdam library of object images,” Int. J. Comput. Vis., vol. 61, no. 1, pp. 103-112, 2005. [52] S Houben,J Stallkamp,J Salmen,M Schlipsing,and C Igel, “Detection of Traffic Signs in Real-World Images: The German Traffic Sign Detection Benchmark,” in Proc. Int. Conf. Neural Netw, 2013, pp1-8. [53] J. Ye, “Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems,” J. Mach. Learn. Res., vol. 6, pp. 483-502, Apr. 2005. [54] D. S. Huang, "Radial basis probabilistic neural networks: Model and application," International Journal of Pattern Recognition and Artificial Intelligence, vol.13, no. 7, pp.1083-1101, 1999. [55] S. Li, D. S. Huang, J. X. Du, and C. H. Zheng, “Palmprint recognition using FastICA algorithm and radial basis probabilistic neural network,” Neurocomputing, vol.69, no.13-15, pp. 1782-1786, 2006. [56] J. Yang and D. Zhang, “Globally Maximizing, Locally Minimizing: Unsupervised Discriminant Discriminant with Applications to Face and Palm Biometrics,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 29, no. 4, pp.650-664, Apr. 2007.