High Order Discriminant Analysis Based on Riemannian Optimization

Wanguang Yin^a, Zhengming Ma^b,*

^a School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, Guangdong, 510006, China
^b School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, Guangdong, 510006, China

*Corresponding author. Email addresses: [email protected] (Wanguang Yin), [email protected] (Zhengming Ma)

Abstract

Linear discriminant analysis is a well-known supervised learning algorithm in machine learning, but most discriminant-analysis algorithms generally fail to discover nonlinear structures during dimensionality reduction. To address this problem, we propose a novel method for the dimensionality reduction of high-dimensional datasets, named manifold-based high order discriminant analysis (MHODA). By transforming the optimization problem from a constrained Euclidean space to the restricted search space of a Riemannian manifold and exploiting the underlying geometry of nonlinear structures, MHODA takes advantage of the fact that a matrix manifold is a low-dimensional object embedded in the ambient space. More concretely, we update the projection matrices by optimizing over the Stiefel manifold and exploit the second-order geometry of the trust-region method. Moreover, to validate the efficiency and accuracy of the proposed algorithm, we conduct clustering and classification experiments on six benchmark datasets. The numerical results demonstrate that MHODA is superior to most state-of-the-art methods.

Keywords: Classification, clustering, dimensionality reduction, discriminant analysis, product manifold, Riemannian optimization, Stiefel manifold.

1. Introduction

With the advancement of big data technology, high-dimensional data representation and dimensionality reduction frequently appear in fields such as neuroscience, text mining, signal processing, and machine learning. Most traditional methods, such as principal component analysis (PCA) and linear discriminant analysis (LDA), transform high-dimensional data into vectors as a prerequisite step before the algorithmic model is applied. In contrast, using a tensor representation for high-dimensional data avoids destroying the spatial coherency and consistency structures of the original data. Many models for tensor decomposition or factorization have been proposed, such as Canonical Polyadic (CP) [16], Tucker [1], and Tensor Train (TT) [27]. Tucker decomposition is one of the most well-known models for mapping high-dimensional data onto a low-dimensional subspace, and it is widely applied in machine learning and many other fields. Examples include high order discriminant analysis (HODA), which obtains the training features by maximizing the Fisher ratio between the core tensors, solving the learning objective in an alternating fashion by fixing all factors but one. In practice, alternating least squares (ALS) is a special case of the block coordinate descent method. Previous studies show that ALS searches for the discriminant bases frequently converge to spurious local minima and yield poor solutions. To address this problem, we employ the second-order geometry of the trust-region method to solve the traditional high order discriminant analysis (HODA) model.

The main contributions of this paper can be outlined as follows. First, we propose a novel method for dimensionality reduction, named manifold-based high order discriminant analysis (MHODA). Experimental results show that MHODA provides a mapping from a high-dimensional space to a low-dimensional subspace and can effectively cope with data sampled from a nonlinear manifold. Second, our proposed algorithm outperforms most existing dimension reduction methods and demonstrates a promising performance in both clustering and classification. Third, the framework of the trust-region method provides a new viewpoint from which to explain and understand many dimension reduction techniques in the area of Riemannian optimization.

The rest of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we present basic notations and operations in the field of tensor decomposition. Section 4 is divided into three subsections, covering high order discriminant analysis (HODA), the trust-region method, and the basic ingredients used in Riemannian optimization. Section 5 reports numerical experiments on six benchmark datasets for clustering and classification. Finally, Section 6 draws conclusions and outlines prospects for future work.

2. Related Work

In this section we briefly review related work in the areas of supervised, semi-supervised, and unsupervised learning, as well as the optimization problems relevant to dimensionality reduction. Depending on the application, supervised learning uses labeled data for better classification and requires a large number of labeled samples to be known [29], whereas semi-supervised learning needs only a limited number of labeled training samples, and unsupervised learning exploits unlabeled data rather than labeled samples. Previous studies show that supervised methods, such as linear discriminant analysis (LDA) and discriminant locality alignment (DLA), are generally superior to unsupervised methods, such as principal component analysis (PCA) [31] and higher order singular value decomposition (HOSVD) [2]. As the generalization of LDA to the higher-order case, high order discriminant analysis (HODA) exploits the discriminant information within the Tucker decomposition and projects the core tensors onto the discriminant bases [3]. Many other algorithms are relevant to discriminant analysis, including general supervised tensor discriminant analysis for gait recognition [23] and discriminant analysis with tensor representation (DATER) for the classification of tensor data [21], which show that a tensor representation for dimensionality reduction can help to enhance performance in small-sample-size settings. However, DATER is not able to converge to a stationary point during its iterations. Lu et al. [22] proposed uncorrelated multilinear discriminant analysis (UMLDA), which extracts uncorrelated discriminative features directly through a tensor-to-vector projection; the vectorization transformation, however, still appears in its dimensionality reduction. To preserve discriminative information for classification tasks and exploit locality information for hyperspectral image classification, Zhang et al. [19] proposed tensor discriminative locality alignment (TDLA) for subspace learning; because it computes over each point in high dimension, it suffers from high running time and storage consumption, which limits its application to high-dimensional data. Moreover, in order to seek an optimal tensor-to-tensor projection for discrimination in a lower-dimensional tensor subspace, Li et al. [20] proposed constrained multilinear discriminant analysis (CMDA) and direct general tensor discriminant analysis (DGTDA); when the between-class scatter matrix is singular, however, CMDA and DGTDA usually fail on such datasets. In many real-world applications, such as medical imaging, obtaining labeled samples is much more expensive than obtaining unlabeled ones; semi-supervised learning is designed to cope with data in which only a portion of the samples is labeled while a large number remain unlabeled. Examples include Huang et al. [28], who proposed trace ratio based flexible semi-supervised discriminant analysis (TR-FSDA) and developed an iterative method to simultaneously solve for the low-dimensional projection and representation, and Xu and Yan [30], who developed semi-supervised bilinear subspace learning for a collection of matrix objects.
In addition, a great number of unsupervised methods that require no labeled samples have been proposed over the past few years, including higher order orthogonal iteration (HOOI), nonnegative Tucker decomposition (NTD), and higher-order singular value decomposition (HOSVD). Assuming that each input image is a matrix embedded in a tensor space and exploiting the higher-order singular value decomposition, Savas and Eldén [24] proposed an algorithm for image data classification. Although this algorithm is simple and efficient, the number of classes must be provided before the model is applied, whereas the automatic discovery of clusters is very important in many real-world applications. On the other hand, imposing low-rank and orthogonality constraints on the projection matrices corresponding to the components of a tensor decomposition is a promising approach for extracting principal features in dimensionality reduction. Recently, using the Rayleigh quotient and the alternating direction method of multipliers (ADMM) to solve the cost function, Zhang et al. [25] proposed low-rank regularized heterogeneous tensor decomposition (LRRHTD) for image data clustering, which can be viewed as an integration of principal component analysis (PCA) and low-rank representation (LRR) [32].

Besides the methods mentioned above, the optimization problem used to solve the objective function plays an important role in dimensionality reduction. The target of an optimization algorithm is to isolate an extreme point of the objective function, and the choice of optimizer for large-scale datasets has a critical impact on the performance of the algorithmic model. For the general form of equality constraints, Riemannian optimization and sequential quadratic programming (SQP) are popular techniques for solving nonlinear optimization problems [5, 9, 15, 18]. In particular, Riemannian optimization can efficiently exploit the underlying geometry of the search space, which makes it competitive with alternative approaches such as alternating least squares and convex relaxations. Formally speaking, the core idea behind Riemannian optimization is to formulate the optimization problem on a curved manifold instead of the standard Euclidean space [6]. A smooth manifold equipped with a Riemannian metric is called a Riemannian manifold; the metric determines the notions of distance, angle, and curvature on the manifold [8]. From the viewpoint of curvature properties, the selection of the Riemannian metric, as well as the constraint conditions, has a deep impact on the performance of the algorithmic model [5, 9, 14]. Over the past decades, Riemannian optimization and SQP have been investigated in depth for solving optimization problems, including multi-way clustering [7], tensor completion [4, 11], blind signal separation [8], and industrial control [13, 14, 18]. Examples include Sun et al. [7], who proposed heterogeneous Tucker decomposition on the multinomial manifold (HTD-Multinomial), and Bouchard et al. [8], who gave a thorough investigation of Riemannian optimization algorithms with diagonality constraints in the cost function for blind signal separation. It is also worth noting that Gharaei et al. have investigated supply chain problems using sequential quadratic programming and obtained notable results in recent years [9, 12, 14, 15]. Due to its unified framework and good results, we focus on Riemannian optimization for solving large-scale problems in dimensionality reduction.

3. Notations and Basic Operations

In this section we briefly describe the basic notations and operations that will be used throughout the paper. A tensor is regarded as a multi-index numerical array, and the order of a tensor is the number of its dimensions or modes [1, 16]. A vector is a 1-order tensor, e.g., x; a matrix is a 2-order tensor, e.g., X; and tensors with more than two modes are written in calligraphic letters, e.g., X. The Khatri-Rao product is indicated by the symbol ⊙ and the Kronecker product by the symbol ⊗. The nabla symbol ∇ is used to indicate the Levi-Civita connection. The notation tr(·) denotes the trace operator, and the transpose of a matrix X is denoted by X^T. The inner product of two same-sized matrices is the sum of the products of their entries, denoted by ⟨X, Y⟩ = tr(Y^T X).

Given two matrices X = [x_1, x_2, ..., x_M] ∈ R^{L_1×M} and Y = [y_1, y_2, ..., y_M] ∈ R^{L_2×M} with the same number of columns, the Khatri-Rao product X ⊙ Y is the column-wise Kronecker product of the two matrices, i.e.,

X ⊙ Y = [x_1 ⊗ y_1, x_2 ⊗ y_2, ..., x_M ⊗ y_M] ∈ R^{L_1 L_2 × M}    (1)

Meanwhile, the Kronecker product X ⊗ Y is the Kronecker product of each column of one matrix with the whole of the other matrix, i.e.,

X ⊗ Y = [x_1 ⊗ Y, x_2 ⊗ Y, ..., x_M ⊗ Y] ∈ R^{L_1 L_2 × M^2}    (2)

Hence the Khatri-Rao product X ⊙ Y consists of a subset of the columns of the Kronecker product X ⊗ Y; moreover, the product of a set of orthonormal matrices is again an orthonormal matrix and belongs to the Stiefel manifold [1]. Tucker decomposition is one of the most well-known decomposition models in machine learning; it expresses a tensor as a smaller core tensor G ∈ R^{J_1×···×J_N} multiplied in each mode by a series of projection matrices U_n ∈ R^{L_n×J_n}, n = 1, ..., N [10], i.e.,

X = G ×_1 U_1 ×_2 U_2 ··· ×_N U_N    (3)
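To make these two products concrete, the following minimal NumPy sketch (an added illustration that is not part of the original derivation; all sizes and names are arbitrary) builds the Khatri-Rao product column by column and checks that its columns form a subset of the columns of the full Kronecker product, as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
L1, L2, M = 4, 3, 5                       # arbitrary illustrative sizes
X = rng.standard_normal((L1, M))
Y = rng.standard_normal((L2, M))

# Khatri-Rao product: column-wise Kronecker product, Eq. (1)
khatri_rao = np.column_stack([np.kron(X[:, m], Y[:, m]) for m in range(M)])
print(khatri_rao.shape)                   # (L1*L2, M)

# Full Kronecker product, Eq. (2)
kron = np.kron(X, Y)
print(kron.shape)                         # (L1*L2, M*M)

# Column m of the Khatri-Rao product equals column m*M + m of the Kronecker
# product, i.e., the Khatri-Rao product selects a subset of Kronecker columns.
for m in range(M):
    assert np.allclose(khatri_rao[:, m], kron[:, m * M + m])
```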


The mode unfolding of the Tucker decomposition along the last mode can be expressed through the Kronecker product of the projection matrices, X^(N) = G^(N) (U_1 ⊗ ··· ⊗ U_{N-1})^T; a detailed justification of this property can be found in [13]. If we introduce a set of orthonormal matrices O_n ∈ R^{J_n×J_n}, n = 1, ..., N, the Tucker decomposition remains unchanged under the following transformation:

X = (G ×_1 O_1^T ··· ×_N O_N^T) ×_1 (U_1 O_1) ··· ×_N (U_N O_N)    (4)

Wherein O_n ∈ R^{J_n×J_n}, n = 1, ..., N, denote a set of orthogonal matrices. This invariance is the key step in establishing the equivalence relation of the Riemannian quotient manifold; the non-uniqueness of the Tucker decomposition is investigated in depth in [4, 11].
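The following NumPy sketch (an added illustration with arbitrary sizes; the helper mode_n_product is a hypothetical name, not from the original text) implements the mode-n product and verifies the rotation invariance of Eq. (4) on a small random 3-order tensor.

```python
import numpy as np

def mode_n_product(T, U, n):
    """Multiply tensor T by matrix U along mode n (T x_n U)."""
    Tn = np.tensordot(U, T, axes=(1, n))   # contracted mode becomes axis 0
    return np.moveaxis(Tn, 0, n)           # move it back to position n

rng = np.random.default_rng(1)
J1, J2, J3 = 2, 3, 4
L1, L2, L3 = 5, 6, 7
G = rng.standard_normal((J1, J2, J3))                      # core tensor
U = [rng.standard_normal((L, J)) for L, J in [(L1, J1), (L2, J2), (L3, J3)]]

# Tucker reconstruction X = G x_1 U1 x_2 U2 x_3 U3, Eq. (3)
X = G
for n, Un in enumerate(U):
    X = mode_n_product(X, Un, n)

# Rotate mode 1: core multiplied by O^T, factor multiplied by O, as in Eq. (4)
O, _ = np.linalg.qr(rng.standard_normal((J1, J1)))         # random orthogonal matrix
G_rot = mode_n_product(G, O.T, 0)
U_rot = [U[0] @ O, U[1], U[2]]

X_rot = G_rot
for n, Un in enumerate(U_rot):
    X_rot = mode_n_product(X_rot, Un, n)

print(np.allclose(X, X_rot))   # True: the reconstruction is unchanged
```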

4. Manifold-Based High Order Discriminant Analysis

Constructing an efficient algorithm depends on properly exploiting the structure of the constraints and the cost function. To this end, we focus on high order discriminant analysis (HODA), the trust-region method, and the orthogonal Stiefel manifold of Riemannian optimization. In this section we first briefly review high order discriminant analysis (HODA) and the trust-region method, and then we derive several basic ingredients from the area of Riemannian optimization.

4.1. High Order Discriminant Analysis (HODA)

High order discriminant analysis (HODA) is an extension of linear discriminant analysis (LDA) to the higher-order case. It employs the discriminant information within the Tucker decomposition and directly projects the high-dimensional data onto the discriminant bases, so that a large number of labeled samples can help to improve accuracy in dimensionality reduction. Specifically, for a given tensor dataset X ∈ R^{L_1×···×L_{N-1}×M}, the first N − 1 modes are associated with feature dimensions, while the last mode indexes the samples. The target of HODA is to minimize the following objective function in the low-dimensional representation of the original data:

min_{U_1,...,U_{N-1}} tr(S_w − S_b) = Σ_{m=1}^{M} ‖G_m − G_{c_m}‖_F^2 − Σ_{c=1}^{C} n_c ‖G_c − Ḡ‖_F^2    (5)

Here S_w denotes the within-class scatter matrix and S_b denotes the between-class scatter matrix. M is the number of samples, n_c is the number of samples in the cth class, and M = Σ_{c=1}^{C} n_c. G_m ∈ R^{J_1×···×J_{N-1}} is the core tensor of the mth sample, and the class mean indexed by each sample is denoted by G_{c_m} ∈ R^{J_1×···×J_{N-1}}. In addition, G_c ∈ R^{J_1×···×J_{N-1}} is the class mean associated with the collection of samples of the cth class, and Ḡ ∈ R^{J_1×···×J_{N-1}} is the mean associated with the total samples, i.e., Ḡ = (1/M) Σ_m G_m. Specifically, the between-class scatter matrix S_b can be expressed as



S_b = Σ_{c=1}^{C} n_c ‖G_c − Ḡ‖_F^2
    = Σ_{c=1}^{C} n_c ‖(X_c − X̄) ×_1 U_1^T ··· ×_{N-1} U_{N-1}^T‖_F^2
    = Σ_{c=1}^{C} n_c ‖(X_c − X̄)^{(N)} (U_1^T ⊗ ··· ⊗ U_{N-1}^T)^T‖_F^2
    = Σ_{c=1}^{C} n_c ‖U_n^T (X_c ×_{i,(n,N)} U_i^T − X̄ ×_{i,(n,N)} U_i^T)^{(n)}‖_F^2
    = ‖U_n^T (X_C ×_{i,(n,N)} U_i^T − X̄ ×_{i,(n,N)} U_i^T)^{(n)}‖_F^2    (6)

Wherein X_C ∈ R^{L_1×···×L_{N-1}×C}, X̄ ∈ R^{L_1×···×L_{N-1}×C}, and we use the shorthand notation ×_{i,(n,N)} U_i^T for a series of tensor-matrix products in all modes except modes n and N, i.e.,

X ×_{i,(n,N)} U_i^T = X ×_1 U_1^T ··· ×_{n-1} U_{n-1}^T ×_{n+1} U_{n+1}^T ··· ×_{N-1} U_{N-1}^T    (7)
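As an added illustration (all names are hypothetical), the shorthand ×_{i,(n,N)} U_i^T of Eq. (7) can be implemented as a loop of mode products that skips mode n and the last (sample) mode:

```python
import numpy as np

def mode_n_product(T, U, n):
    """Multiply tensor T by matrix U along mode n."""
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

def multi_mode_product(X, Us, skip):
    """Eq. (7): apply U_i^T in every mode except mode `skip` and the last mode."""
    out = X
    for i, U in enumerate(Us):              # Us holds U_1, ..., U_{N-1}
        if i != skip:
            out = mode_n_product(out, U.T, i)
    return out

# toy usage for a 3-order data tensor (modes: features x features x samples)
rng = np.random.default_rng(5)
X = rng.standard_normal((8, 9, 20))
Us = [rng.standard_normal((8, 3)), rng.standard_normal((9, 4))]
print(multi_mode_product(X, Us, skip=0).shape)   # (8, 4, 20): mode 0 and sample mode untouched
```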

In a similar way, we can obtain the within-class scatter matrix S_w as

S_w = Σ_{m=1}^{M} ‖G_m − G_{c_m}‖_F^2
    = Σ_{m=1}^{M} ‖U_n^T (X_m ×_{i,(n,N)} U_i^T − X_{c_m} ×_{i,(n,N)} U_i^T)^{(n)}‖_F^2
    = ‖U_n^T (X ×_{i,(n,N)} U_i^T − X_{c_M} ×_{i,(n,N)} U_i^T)^{(n)}‖_F^2    (8)

Wherein X ∈ R^{L_1×···×L_{N-1}×M} and X_{c_M} ∈ R^{L_1×···×L_{N-1}×M} (the tensor that stacks the class mean of each sample). The usual way to seek the projection matrices is to fix all but one U_n and learn the objective function in Euclidean space.
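As an added illustration (not from the original text; all names and sizes are hypothetical), the following NumPy sketch evaluates the HODA cost of Eq. (5) for 3-order data X ∈ R^{L_1×L_2×M} with given projection matrices U_1, U_2.

```python
import numpy as np

def hoda_cost(X, y, U1, U2):
    """tr(S_w - S_b) of Eq. (5) for a 3-order data tensor X of shape (L1, L2, M)."""
    M = X.shape[2]
    # core tensors G_m = X_m x_1 U1^T x_2 U2^T, i.e., U1^T X_m U2 for matrix slices
    G = np.stack([U1.T @ X[:, :, m] @ U2 for m in range(M)], axis=2)
    G_bar = G.mean(axis=2)                                    # overall mean
    classes = np.unique(y)
    class_means = {c: G[:, :, y == c].mean(axis=2) for c in classes}

    S_w = sum(np.linalg.norm(G[:, :, m] - class_means[y[m]], 'fro') ** 2
              for m in range(M))
    S_b = sum((y == c).sum() * np.linalg.norm(class_means[c] - G_bar, 'fro') ** 2
              for c in classes)
    return S_w - S_b

# toy usage with random orthonormal projections
rng = np.random.default_rng(2)
L1, L2, M, J1, J2 = 8, 8, 30, 3, 3
X = rng.standard_normal((L1, L2, M))
y = rng.integers(0, 3, size=M)
U1, _ = np.linalg.qr(rng.standard_normal((L1, J1)))
U2, _ = np.linalg.qr(rng.standard_normal((L2, J2)))
print(hoda_cost(X, y, U1, U2))
```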

4.2. The Trust-Region Method

The optimization problem of MHODA can be solved within the framework of the trust-region method, in which the model is optimized on the tangent space. Each iteration consists of two steps: 1) solving the trust-region subproblem; 2) computing the update by retracting the resulting tangent vector from the tangent space back onto the manifold. The trust-region subproblem for a function f around the iterate U is the second-order Taylor model in the tangent vector ξ_U ∈ T_U M under the chosen Riemannian metric:

min_{ξ_U ∈ T_U M} f(U) + g_U(grad f(U), ξ_U) + (1/2) g_U(hess f(U)[ξ_U], ξ_U)    (9)

This model is considered valid on a ball of radius Δ in the tangent space T_U M, and it requires the Riemannian gradient and Riemannian Hessian of the function. The candidate next iterate is obtained by the retraction R_U(ξ_U). The candidate is then evaluated to check whether the trust-region model at ξ_U is valid: it is accepted, and the radius Δ may even be increased, only if the model agreement is very good; otherwise the candidate is rejected and the radius Δ is decreased. The stopping criterion is that the norm of the Riemannian gradient falls below a tolerance. The Riemannian gradient grad f(U) and Riemannian Hessian hess f(U)[ξ_U] of a function on the manifold can be obtained by projecting the Euclidean gradient Grad f(U) and Euclidean Hessian Hess f(U)[ξ_U] onto the horizontal space of the Riemannian quotient manifold [11]. Figure 1 illustrates a schematic view of Riemannian optimization.

Figure 1: Schematic illustration of Riemannian optimization: (a) the left geometric object denotes the total (computational) space; (b) the right one is the Riemannian quotient manifold [M].

As shown in Fig. 1, the Riemannian quotient space is a subspace of the computational space; the optimization is executed on the tangent space, and the next iterate requires a retraction from the tangent space back onto the manifold. Note that the set of points equivalent under the Tucker rotations forms an equivalence class, whose tangent directions constitute the vertical space of the Riemannian quotient manifold, which is denoted by

M/∼ := M / (O(J_1) × ··· × O(J_{N-1}))    (10)

M := St(J_1, L_1) × ··· × St(J_{N-1}, L_{N-1})    (11)

Wherein M is the product manifold of the computational space; the product of Stiefel manifolds is itself a matrix manifold.
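The accept/reject and radius-update rule described above can be summarized by the following toy sketch (an added illustration on a plain Euclidean quadratic with hypothetical names; on a manifold the candidate step would additionally be retracted as discussed around Eq. (9)).

```python
import numpy as np

def trust_region_quadratic(A, b, x, Delta=1.0, Delta_max=10.0, tol=1e-8):
    """Toy trust-region loop for f(x) = 0.5 x'Ax - b'x using the Cauchy step."""
    f = lambda x: 0.5 * x @ A @ x - b @ x
    for _ in range(200):
        g = A @ x - b
        if np.linalg.norm(g) < tol:
            break
        # Cauchy point: minimize the quadratic model along -g inside the radius
        t = min(g @ g / (g @ A @ g), Delta / np.linalg.norm(g))
        step = -t * g
        pred = -(g @ step + 0.5 * step @ A @ step)     # predicted decrease
        rho = (f(x) - f(x + step)) / pred              # actual vs. predicted decrease
        if rho < 0.25:
            Delta *= 0.25                              # poor model fit: shrink radius
        elif rho > 0.75 and np.isclose(np.linalg.norm(step), Delta):
            Delta = min(2 * Delta, Delta_max)          # very good fit at boundary: expand
        if rho > 0.1:
            x = x + step                               # accept the candidate
    return x

A = np.diag([1.0, 10.0]); b = np.array([1.0, 1.0])
print(trust_region_quadratic(A, b, np.zeros(2)))        # approx A^{-1} b = [1, 0.1]
```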

Due to the non-uniqueness of the Tucker decomposition, the local minima of (5) on M are difficult to isolate, but they become isolated on the horizontal space of the quotient manifold.

4.3. Riemannian Optimization on Stiefel Manifold

In mathematical terms, an embedded matrix manifold St(J_n, L_n) endowed with a Riemannian metric is called a Riemannian manifold; in other words, it is a smooth subset of a vector space contained in the set of matrices U_n ∈ R^{L_n×J_n}, n = 1, ..., N − 1 [33]. Depending on the choice of constraint conditions, Riemannian optimization can be regarded as optimization over a matrix manifold whose special structure matches the requirements of the problem. In this paper, we treat the orthonormality constraint on the projection matrices as a Stiefel manifold embedded in Euclidean space, i.e.,

St(J_n, L_n) := { U_n ∈ R^{L_n×J_n} : U_n^T U_n = Γ_n }    (12)

Using the general framework of Riemannian optimization with the trust-region method, we can solve our objective function (5) under the constraint (12) given four basic ingredients: the projection map, the Riemannian metric, the Riemannian connection, and the retraction. The main idea is to decompose the total (computational) space into a normal space and a tangent space; the tangent space is further decomposed into a vertical space and a horizontal space, where the vertical space is the tangent space of the equivalence class. In other words, the normal space is the orthogonal complement of the tangent space, and the horizontal space is the orthogonal complement of the vertical space within the tangent space, in the sense of the metric. Let Z be a vector in the embedding (ambient) space. The projection map Π(Z), which projects a point of the embedding space onto the tangent space, is obtained by subtracting the component in the orthogonal complement of the tangent space (the normal space), which is parametrized by a symmetric matrix, i.e.,

Π(Z) = Z − U_n sym(U_n^T Z)    (13)

Likewise, the projection map Ψ(η_{U_n}), which projects a point of the tangent space onto the horizontal space, is obtained by removing the component along the vertical space and is parameterized by skew-symmetric matrices [34]:

Ψ(η_{U_n}) = η_{U_n} − U_n skew(U_n^T η_{U_n})    (14)

Wherein sym(X) = (X + X^T)/2 denotes the symmetric part of a square matrix X and skew(X) = (X − X^T)/2 its skew-symmetric part. Each tangent space requires a bilinear, symmetric positive-definite form as the Riemannian metric [33, 34]. In this paper we define the canonical inner product g_{U_n} : T_{U_n}M × T_{U_n}M → R as the metric used for measuring distances and directions on the manifold, i.e.,

g(ξ_{U_n}, η_{U_n}) = tr(ξ_{U_n}^T η_{U_n})    (15)

The notion of connection is intimately related to the notion of vector transport, which allows moving from one tangent space to another. Note that the Levi-Civita connection is the unique affine connection used to define the Riemannian Hessian of a function. Concretely, ∇_ξ η is the covariant derivative of η along ξ on the tangent space of the manifold, where η is a vector field on the manifold. The Riemannian connection is obtained by mapping the Levi-Civita connection of the Euclidean space onto the horizontal space, i.e.,

∇_ξ η = Ψ_{U_n}(Π_{U_n}(Dη(U_n)[ξ]))    (16)

Wherein Dη(U_n)[ξ] is the directional derivative of η at U_n along the direction ξ. To ensure that each update of the Riemannian optimization remains on the manifold, we need a mapping from the tangent space back onto the manifold. The exponential map is the canonical choice but is computationally expensive, while a first-order retraction provides a cheaper approximation for a Riemannian manifold [6]. Here we adopt the qf(·) operator as the retraction, i.e.,

U_n^{k+1} = qf(U_n^k + t ξ^k)    (17)

Here qf(X) denotes the Q factor of the QR decomposition X = QR, wherein Q belongs to M and R is an upper triangular matrix with strictly positive diagonal elements.
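The four ingredients above are straightforward to implement. The following NumPy sketch (an added illustration with hypothetical function names) realizes the projections of Eqs. (13)-(14), the metric of Eq. (15), and the qf retraction of Eq. (17) for a single Stiefel factor.

```python
import numpy as np

sym  = lambda A: (A + A.T) / 2
skew = lambda A: (A - A.T) / 2

def proj_tangent(U, Z):
    """Eq. (13): project an ambient matrix Z onto the tangent space at U."""
    return Z - U @ sym(U.T @ Z)

def proj_horizontal(U, eta):
    """Eq. (14): project a tangent vector eta onto the horizontal space at U."""
    return eta - U @ skew(U.T @ eta)

def metric(xi, eta):
    """Eq. (15): canonical inner product on the tangent space."""
    return np.trace(xi.T @ eta)

def retract_qf(U, xi, t=1.0):
    """Eq. (17): qf retraction, the Q factor with positive diagonal of R."""
    Q, R = np.linalg.qr(U + t * xi)
    return Q * np.sign(np.diag(R))          # enforce a positive diagonal of R

# quick check on a random Stiefel point
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((8, 3)))
Z = rng.standard_normal((8, 3))
xi = proj_horizontal(U, proj_tangent(U, Z))
U_new = retract_qf(U, xi, t=0.1)
print(np.allclose(U_new.T @ U_new, np.eye(3)))   # True: still on the Stiefel manifold
```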

Based on the derivations provided in subsection 4.1, the objective function can be reformulated as

f(U_n) = ‖U_n^T (X ×_{i,(n,N)} U_i^T − X_{c_M} ×_{i,(n,N)} U_i^T)^{(n)}‖_F^2 − ‖U_n^T (X_C ×_{i,(n,N)} U_i^T − X̄ ×_{i,(n,N)} U_i^T)^{(n)}‖_F^2    (18)

Wherein the projection matrices are constrained to be orthonormal, i.e., U_n^T U_n = Γ_n, n = 1, ..., N − 1. Moreover, by using standard matrix properties, the Euclidean gradient of the objective function can be computed directly as follows:

  

Grad f(U_n) = 2 ‖(X − X_{c_M}) ×_{i,(n,N)} U_i^T‖_F^2 U_n − 2 ‖(X_C − X̄) ×_{i,(n,N)} U_i^T‖_F^2 U_n    (19)

Note that the Euclidean Hessian Hess f(U_n)[ξ_{U_n}] is the covariant derivative of the gradient Grad f(U_n) in the direction ξ_{U_n} ∈ T_{U_n}M; it can therefore be obtained as the directional derivative of the Euclidean gradient, i.e.,

  

Hess f(U_n)[ξ_{U_n}] = 2 ‖(X − X_{c_M}) ×_{i,(n,N)} U_i^T‖_F^2 ξ_{U_n} − 2 ‖(X_C − X̄) ×_{i,(n,N)} U_i^T‖_F^2 ξ_{U_n}    (20)

According to the notation of the abstract quotient manifold, the Riemannian gradient can be expressed as the projection of the Euclidean gradient onto the horizontal space, i.e.,

grad f(U_n) = Ψ_{U_n}(Π_{U_n}(Grad f(U_n)))    (21)

Likewise, the Riemannian Hessian hess f(U_n)[ξ_{U_n}] is obtained by projecting the Euclidean Hessian onto the horizontal space, i.e.,

hess f(U_n)[ξ_{U_n}] = Ψ_{U_n}(Π_{U_n}(∇_{ξ_{U_n}} grad f(U_n)))
                     = Ψ_{U_n}(Π_{U_n}(Hess f(U_n)[ξ_{U_n}] − ξ_{U_n} sym(U_n^T Grad f(U_n))))    (22)

Once the expressions of the Riemannian gradient (21) and Riemannian Hessian (22) are obtained, we can run our proposed MHODA within the trust-region framework provided by the Manopt toolbox [17]; the proof of convergence is provided in [7, 34]. The full steps of MHODA are given in Algorithm 1.
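As an illustration of how such a mode-wise update might be set up in practice, the following sketch uses pymanopt, a Python port of the Manopt toolbox referenced above. It is an assumption-laden example rather than the authors' implementation: the names Stiefel, TrustRegions, and pymanopt.function.autograd follow the pymanopt 2.x interface and may differ in other versions, and the matrices A and B merely stand in for the mode-n unfoldings appearing in Eq. (18).

```python
# Hypothetical sketch: optimize one projection matrix U_n of Eq. (18) with a
# Riemannian trust-region solver, assuming the pymanopt >= 2.0 interface.
import numpy as np
import autograd.numpy as anp
import pymanopt
from pymanopt.manifolds import Stiefel
from pymanopt.optimizers import TrustRegions

L_n, J_n = 32, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((L_n, 100))   # stand-in for the within-class unfolding in Eq. (18)
B = rng.standard_normal((L_n, 100))   # stand-in for the between-class unfolding in Eq. (18)

manifold = Stiefel(L_n, J_n)

@pymanopt.function.autograd(manifold)
def cost(U):
    # f(U) = ||U^T A||_F^2 - ||U^T B||_F^2, the mode-n objective of Eq. (18)
    return anp.sum((U.T @ A) ** 2) - anp.sum((U.T @ B) ** 2)

problem = pymanopt.Problem(manifold, cost)
result = TrustRegions().run(problem)   # second-order Riemannian solver
U_n = result.point
print(U_n.shape)                       # (32, 8), an orthonormal basis
```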

Algorithm 1 Manifold-Based High Order Discriminant Analysis (MHODA)
Input: tensor object X ∈ R^{L_1×···×L_{N-1}×M}, sample labels Y ∈ R^{M×1}
1: Initialize the matrices U_n, n = 1, ..., N − 1, as random orthonormal matrices; set the gradient-norm tolerance ε_1 = 10^{-5} and the maximum iteration number maxit = 200. Let 0 < c < 1, β^1 = 0, ξ^0 = 0.
2: for n = 1 to N − 1 do
3:    Compute the Hessian in Euclidean space by Eq. (20).
4:    Compute the Hessian in Riemannian space by Eq. (22).
5:    Compute the weighting value β^k = tr(η^{kT} η^k) / tr(η^{(k−1)T} η^{k−1}).
6:    Compute the transport direction T_{U_n^{k−1}→U_n^k}(ξ^{k−1}) = P_{U_n^k}(ξ^{k−1}).
7:    Compute the conjugate direction ξ^k = −grad f(U_n^k) + β^k T_{U_n^{k−1}→U_n^k}(ξ^{k−1}).
8:    Compute the Armijo step size α^k using backtracking such that f(R_{U_n^k}(α^k ξ^k)) ≥ f(U_n^k) + c α^k tr(η^{kT} ξ^k).
9:    Terminate and output U_n^{k+1} if one of the stopping conditions ‖η^{k+1}‖_F^2 ≤ ε_1 or iteration number k ≥ maxit is met.
10: end for
11: Output: U_n, n = 1, ..., N − 1.

5. Numerical Experiments

In this section we investigate the efficiency and accuracy of the proposed method. In the following experiments, we first examine the effectiveness of the proposed method for clustering and then compare its performance in classification tasks. We compare our method with four state-of-the-art methods: high order discriminant analysis (HODA) [3], direct general tensor discriminant analysis (DGTDA) [20], low-rank regularized heterogeneous tensor decomposition (LRRHTD) [25], and heterogeneous tensor decomposition (HTD-Multinomial) [7]. All the numerical experiments are conducted on a desktop with an Intel Core i5-5200U CPU at 2.20 GHz and 8.00 GB of RAM, and each experiment is repeated 10 times with randomly selected images each time.


5.1. Datasets Description


In this subsection we use 3-order tensors for our numerical experiments, where the first two modes correspond to image pixels and the last mode indexes the images, although our algorithm and implementation have no such restriction. The numerical experiments are conducted on six benchmark image datasets: COIL20 Object, ETH80 Object, ORL Faces, Olivetti Faces, USPS Digits, and CMU PIE Faces. Figure 2 shows some sample images from these datasets. The COIL20 dataset contains 1440 grayscale images of 20 objects (72 images per object) with a wide variety of complex geometric and reflectance characteristics; each image was down-sampled to 32 × 32 grayscale (0-255). The ETH80 dataset is a multi-view image dataset for object categorization; it includes eight categories corresponding to apple, car, cow, cup, dog, horse, pear, and tomato. Each category contains ten objects, and each object is represented by 41 images from different views. The original image resolution is 128 × 128; we resized each image to 32 × 32 pixels, for a total of 3280 images. The ORL dataset consists of ten different images for each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement); each image was resized to 32 × 32 pixels. The Olivetti faces dataset consists of images of 40 individuals with small variations in viewpoint, large variations in expression, and occasional addition of glasses. The dataset consists of 400 images (10 per individual) of size 64 × 64 = 4096 pixels, labeled according to identity.


Figure 2: Some sample images from different datasets. (a) Olivetti dataset. (b) COIL20 dataset. (c) ETH80 dataset. (d) ORL dataset. (e) CMU PIE dataset. (f) USPS dataset.


The CMU PIE dataset is a 32 × 32 grayscale face dataset that includes 68 individuals, each with 141 facial images under different lighting and illumination conditions; we extract a subset of 50 individuals with 50 facial images per person, for a total of 2500 images. USPS is a handwritten digits dataset containing a total of 11000 images of size 16 × 16 = 256 pixels. Considering computational efficiency, we do not use all the data for the experiments and randomly select 2000 of the images (200 images per class). Note that each dataset has ground-truth class labels. Considering the computational efficiency and accuracy of the numerical experiments, we preprocess the data as follows: all the data are randomly shuffled and the gray values of the pixels are normalized. For evaluation, we first perform the dimensionality reduction of the tensor data and then cluster the results with the k-means algorithm. Table 1 gives a summary of these datasets, where the attributes of each dataset are the number of samples, the image size, the reduced dimension, and the number of classes.

Table 1: Description of the datasets.

Dataset  | Sample number | Size  | Reduced dimension | Classes
COIL20   | 1440          | 32*32 | 8*8               | 20
ETH80    | 3280          | 32*32 | 8*8               | 8
ORL      | 400           | 32*32 | 6*6               | 40
Olivetti | 400           | 64*64 | 8*8               | 40
USPS     | 2000          | 16*16 | 7*8               | 10
CMU PIE  | 2500          | 32*32 | 8*8               | 50

5.2. Clustering Results

In this subsection we present the numerical results of the cluster analysis. We first perform clustering with the k-means algorithm, running k-means with random initialization 10 times and reporting the average as the clustering result. We use clustering accuracy (ACC) and normalized mutual information (NMI) as the evaluation metrics.
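For reference, the following sketch (an added illustration; function and variable names are hypothetical) shows how the two evaluation metrics can be computed with scikit-learn and SciPy: NMI directly, and ACC by matching predicted clusters to ground-truth labels with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Best accuracy over all one-to-one matchings of clusters to classes."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    w = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-w)          # Hungarian matching (maximize overlap)
    return w[row, col].sum() / y_true.size

def evaluate_clustering(features, y_true, n_classes, runs=10, seed=0):
    """Average ACC and NMI of k-means over several random initializations."""
    accs, nmis = [], []
    for r in range(runs):
        y_pred = KMeans(n_clusters=n_classes, n_init=10,
                        random_state=seed + r).fit_predict(features)
        accs.append(clustering_acc(y_true, y_pred))
        nmis.append(normalized_mutual_info_score(y_true, y_pred))
    return np.mean(accs), np.mean(nmis)

# usage: `features` would be the vectorized core tensors after the MHODA projection
# acc, nmi = evaluate_clustering(features, labels, n_classes=20)
```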

Table 2: Comparison of clustering results using the k-means algorithm on different datasets, measured by ACC and NMI.

Dataset  | Metric | MHODA         | HODA          | DGTDA         | LRRHTD        | HTD
COIL20   | ACC    | 0.7244±0.0345 | 0.6144±0.0216 | 0.6563±0.0324 | 0.6633±0.0296 | 0.6337±0.0178
COIL20   | NMI    | 0.8133±0.0139 | 0.7388±0.0118 | 0.7637±0.0093 | 0.7675±0.0116 | 0.7334±0.0144
ETH80    | ACC    | 0.5098±0      | 0.4750±0.0039 | 0.4852±0.0108 | 0.4994±0.0062 | 0.4714±0.0219
ETH80    | NMI    | 0.4691±0      | 0.4523±0.0050 | 0.4598±0.0102 | 0.4764±0.0065 | 0.4155±0.0180
ORL      | ACC    | 0.5817±0.0262 | 0.4437±0.0213 | 0.4390±0.0199 | 0.5215±0.0252 | 0.4690±0.0273
ORL      | NMI    | 0.7871±0.0114 | 0.6769±0.0089 | 0.6713±0.0149 | 0.7339±0.0127 | 0.6538±0.0194
Olivetti | ACC    | 0.6627±0.0372 | 0.4900±0.0324 | 0.5045±0.0292 | 0.5300±0.0309 | 0.5727±0.0404
Olivetti | NMI    | 0.8251±0.0154 | 0.7044±0.0152 | 0.7155±0.0151 | 0.7347±0.0166 | 0.7470±0.0255
USPS     | ACC    | 0.5074±0.0673 | 0.4580±0.0339 | 0.3377±0.0152 | 0.4625±0.0089 | 0.4912±0.0570
USPS     | NMI    | 0.4621±0.0718 | 0.4368±0.0289 | 0.2752±0.0142 | 0.4699±0.0064 | 0.4607±0.0447
CMU PIE  | ACC    | 0.5927±0.0193 | 0.1546±0.0034 | 0.1206±0.0042 | 0.1477±0.0041 | 0.3764±0.0299
CMU PIE  | NMI    | 0.7472±0.0073 | 0.3686±0.0078 | 0.3014±0.0040 | 0.3521±0.0063 | 0.5690±0.0238

Table 2 shows the clustering results. Fig. 3 and Fig. 4 show the accuracy of our proposed algorithm compared with the state-of-the-art methods on the USPS and CMU PIE datasets; the clustering experiments on USPS vary from 2 classes to 10 classes, while those on CMU PIE vary from 10 classes to 50 classes. Because the nonlinear structures and label information are effectively exploited in the dimensionality reduction, the numerical results show that our proposed MHODA is superior to the traditional HODA method optimized in Euclidean space, as well as to the unsupervised algorithms such as LRRHTD and HTD-Multinomial, even though HTD-Multinomial is also optimized on a Riemannian manifold. It is worth noting that MHODA is especially good at dealing with the multi-class and comprehensive CMU PIE dataset.

5.3. Classification Results

In this subsection we investigate the performance on classifying the image datasets, assuming that all the data samples are uniformly distributed. After obtaining the projection matrices U_n, n = 1, ..., N − 1, from the training samples X_train, we use them to map new samples to the low-dimensional representation of the original data, which can be calculated from the following equality:

G_test = X_test ×_1 U_1^T ×_2 U_2^T ··· ×_{N-1} U_{N-1}^T    (23)
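A minimal sketch of this test-time pipeline is given below (an added illustration; the 1-nearest-neighbour setting and all names are assumptions, since the paper only states that a kNN classifier is used). For 3-order image data, Eq. (23) reduces to projecting each image slice with U_1^T and U_2.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def project_cores(X, U1, U2):
    """Eq. (23) for 3-order data: G[:, :, m] = U1^T X[:, :, m] U2, vectorized per sample."""
    M = X.shape[2]
    cores = np.stack([U1.T @ X[:, :, m] @ U2 for m in range(M)], axis=2)
    return cores.reshape(-1, M).T          # one feature row per sample

# toy data standing in for the learned projections and the image tensors
rng = np.random.default_rng(4)
L1, L2, J1, J2 = 16, 16, 7, 8
U1, _ = np.linalg.qr(rng.standard_normal((L1, J1)))
U2, _ = np.linalg.qr(rng.standard_normal((L2, J2)))
X_train = rng.standard_normal((L1, L2, 100)); y_train = rng.integers(0, 10, 100)
X_test  = rng.standard_normal((L1, L2, 20));  y_test  = rng.integers(0, 10, 20)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(project_cores(X_train, U1, U2), y_train)
print(knn.score(project_cores(X_test, U1, U2), y_test))   # classification accuracy
```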

Using Eq. (23), we conduct classification experiments on four benchmark datasets: COIL20, ETH80, USPS, and CMU PIE. We use 3-fold cross validation on the training datasets and 5-fold cross validation on the testing datasets. In order to compare the algorithmic performance as objectively as possible, we adopt the kNN classifier as the evaluation method. Table 3 shows the effectiveness and accuracy of MHODA on the classification task.

Table 3: Comparison of classification results on different datasets.

Dataset | Metric | MHODA         | HODA          | DGTDA         | LRRHTD        | HTD
COIL20  | ACC    | 0.6973±0.0236 | 0.6155±0.0349 | 0.6551±0.0404 | 0.6513±0.0289 | 0.6474±0.0230
COIL20  | NMI    | 0.8058±0.0135 | 0.7402±0.0167 | 0.7623±0.0188 | 0.7670±0.0169 | 0.7374±0.0174
COIL20  | kNN    | 0.8385±0.0516 | 0.6729±0.0492 | 0.7198±0.0449 | 0.7312±0.0447 | 0.7125±0.0461
ETH80   | ACC    | 0.5058±0.0065 | 0.4793±0.0172 | 0.4836±0.0141 | 0.4967±0.0173 | 0.4898±0.0235
ETH80   | NMI    | 0.4692±0.0049 | 0.4492±0.0155 | 0.4599±0.0116 | 0.4675±0.0170 | 0.4086±0.0101
ETH80   | kNN    | 0.6856±0.0209 | 0.7777±0.0175 | 0.8087±0.0348 | 0.8014±0.0282 | 0.7923±0.0291
USPS    | ACC    | 0.5266±0.0264 | 0.4554±0.0301 | 0.3389±0.0197 | 0.4649±0.0362 | 0.4360±0.0236
USPS    | NMI    | 0.4889±0.0098 | 0.4363±0.0313 | 0.2806±0.0233 | 0.4653±0.0231 | 0.4133±0.0202
USPS    | kNN    | 0.5952±0.0454 | 0.7878±0.0425 | 0.7177±0.0316 | 0.8364±0.0337 | 0.8498±0.0244
CMU PIE | ACC    | 0.5866±0.0238 | 0.1708±0.0049 | 0.1395±0.0067 | 0.1655±0.0033 | 0.3382±0.0188
CMU PIE | NMI    | 0.7435±0.0142 | 0.3848±0.0075 | 0.3269±0.0111 | 0.3735±0.0060 | 0.5414±0.0263
CMU PIE | kNN    | 0.5261±0.0598 | 0.2147±0.0337 | 0.1430±0.0312 | 0.1911±0.0246 | 0.1933±0.0415

Figure 3: Experiments on the USPS dataset: clustering results of MHODA are better than those of existing methods.

Figure 4: Experiments on the CMU PIE dataset: clustering results of MHODA are better than those of existing methods.

6. Conclusions and Future Work

In this paper we proposed manifold-based high order discriminant analysis with Riemannian optimization. The numerical experiments for the proposed method, together with the comparison algorithms under the ACC and NMI evaluation metrics, demonstrate that MHODA outperforms many other methods that are optimized in Euclidean space. The presented Riemannian optimization framework is a promising and useful technique in applications such as model reduction, classification, and clustering. Using the second-order geometry of the trust-region method helps to boost the performance of the traditional HODA, and the optimal solution can be isolated via the equivalence classes of the vertical space. In dealing with multi-class and comprehensive datasets such as CMU PIE, our proposed algorithm clearly has an advantage over most of the compared algorithms, demonstrating that Riemannian optimization is a promising technique in many real-world applications. As shown in Table 2 and Table 3, we can conclude that supervised methods are generally better than unsupervised methods in most applications with training data, which is consistent with our theoretical analysis.

To deepen our understanding of Riemannian optimization, many open problems remain to be investigated in future work, such as low-rank tensor decomposition with Riemannian optimization. As is known, many previous studies show that the fixed-rank Tucker decomposition is actually an embedded submanifold of Euclidean space, which is investigated in depth in [4, 11]. Besides, the initialization of the factor matrices and the choice of the Riemannian metric, which play very important roles in Riemannian optimization, also determine the performance of the algorithmic model.

References

[1] Sidiropoulos, N. D., et al., Tensor Decomposition for Signal Processing and Machine Learning, IEEE Transactions on Signal Processing 65 (13) (2017) 3551-3582.
[2] Lathauwer, L. D., et al., A Multilinear Singular Value Decomposition, SIAM J. Matrix Anal. Appl. 21 (4) (2000) 1253-1278.
[3] Phan, A. H. and A. Cichocki, Tensor decompositions for feature extraction and classification of high dimensional datasets, Nonlinear Theory and Its Applications IEICE 1 (1) (2010) 37-68.
[4] Mishra, B. and R. Sepulchre, Riemannian Preconditioning, SIAM Journal on Optimization 26 (1) (2016) 635-660.
[5] Duan, C., et al., Selective maintenance scheduling under stochastic maintenance quality with multiple maintenance actions, International Journal of Production Research 56 (23) (2018) 7160-7178.
[6] Absil, P. A. and I. V. Oseledets, Low-rank retractions: a survey and new results, Computational Optimization and Applications 62 (1) (2014) 5-29.
[7] Sun, Y., et al., Heterogeneous Tensor Decomposition for Clustering via Manifold Optimization, IEEE Trans Pattern Anal Mach Intell 38 (3) (2016) 476-489.
[8] Bouchard, F., et al., Riemannian Optimization and Approximate Joint Diagonalization for Blind Source Separation, IEEE Transactions on Signal Processing 66 (8) (2018) 2041-2054.
[9] Gharaei, A., et al., Modelling and optimal lot-sizing of the replenishments in constrained, multi-product and bi-objective EPQ models with defective products: Generalised Cross Decomposition, International Journal of Systems Science: Operations and Logistics (2019) 1-13.
[10] Kolda, T. G. and B. W. Bader, Tensor Decompositions and Applications, SIAM Review 51 (3) (2009) 455-500.
[11] Kasai, H. and B. Mishra, Low-rank tensor completion: a Riemannian manifold preconditioning approach, in ICML 48 (2016) 1012-1021.
[12] Gharaei, A., et al., An integrated multi-product, multi-buyer supply chain under penalty, green, and quality control polices and a vendor managed inventory with consignment stock agreement: The outer approximation with equality relaxation and augmented penalty algorithm, Applied Mathematical Modelling 69 (2019) 223-254.
[13] Yin, W. and Z. Ma, LE and LLE Regularized Nonnegative Tucker Decomposition for clustering of high dimensional datasets, Neurocomputing 364 (2019) 77-94.
[14] Gharaei, A., et al., Joint Economic Lot-sizing in Multi-product Multi-level Integrated Supply Chains: Generalized Benders Decomposition, International Journal of Systems Science: Operations and Logistics (2019) 1-17.
[15] Gharaei, A., et al., Optimization of rewards in single machine scheduling in the rewards-driven systems, Management Science Letters 5 (6) (2015) 629-638.
[16] Cichocki, A., et al., Tensor Decompositions for Signal Processing Applications: From two-way to multiway component analysis, IEEE Signal Processing Magazine 32 (2) (2015) 145-163.
[17] Boumal, N., et al., Manopt, a Matlab toolbox for optimization on manifolds, J. Mach. Learn. Res. 15 (1) (2014) 1455-1459.
[18] Hoseini Shekarabi, S. A., et al., Modelling and optimal lot-sizing of integrated multi-level multi-wholesaler supply chains under the shortage and limited warehouse space: generalised outer approximation, International Journal of Systems Science: Operations and Logistics 6 (3) (2018) 237-257.
[19] Zhang, L., et al., Tensor Discriminative Locality Alignment for Hyperspectral Image Spectral-Spatial Feature Extraction, IEEE Transactions on Geoscience and Remote Sensing 51 (1) (2013) 242-256.
[20] Li, Q. and D. Schonfeld, Multilinear Discriminant Analysis for Higher-Order Tensor Data Classification, IEEE Trans Pattern Anal Mach Intell 36 (12) (2014) 2524-2537.
[21] Yan, S., et al., Discriminant Analysis with Tensor Representation, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2005).
[22] Lu, H., et al., Uncorrelated multilinear discriminant analysis with regularization and aggregation for tensor object recognition, IEEE Trans Neural Netw 20 (1) (2009) 103-123.
[23] Tao, D., et al., General tensor discriminant analysis and Gabor features for gait recognition, IEEE Trans Pattern Anal Mach Intell 29 (10) (2007) 1700-1715.
[24] Savas, B. and L. Eldén, Handwritten digit classification using higher order singular value decomposition, Pattern Recognition 40 (3) (2007) 993-1003.
[25] Zhang, J., et al., Low-Rank Regularized Heterogeneous Tensor Decomposition for Subspace Clustering, IEEE Signal Processing Letters 25 (3) (2018) 333-337.
[26] Kim, Y. and S. Choi, Nonnegative Tucker Decomposition, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (2007).
[27] Holtz, S., et al., The Alternating Linear Scheme for Tensor Optimization in the Tensor Train Format, SIAM Journal on Scientific Computing 34 (2) (2012) A683-A713.
[28] Huang, Y., et al., Semi-supervised dimension reduction using trace ratio criterion, IEEE Trans Neural Netw Learn Syst 23 (3) (2012) 519-526.
[29] Nie, F., et al., Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans Image Process 19 (7) (2010) 1921-1932.
[30] Xu, D. and S. Yan, Semi-supervised bilinear subspace learning, IEEE Trans Image Process 18 (7) (2009) 1671-1676.
[31] Zhang, T., et al., Patch Alignment for Dimensionality Reduction, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1299-1313.
[32] Liu, G., et al., Robust recovery of subspace structures by low-rank representation, IEEE Trans Pattern Anal Mach Intell 35 (1) (2013) 171-184.
[33] Douik, A. and B. Hassibi, Manifold Optimization Over the Set of Doubly Stochastic Matrices: A Second-Order Geometry, IEEE Transactions on Signal Processing 67 (22) (2019) 5761-5774.
[34] Absil, P.-A., R. Mahony, and R. Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, Princeton, NJ, USA (2008).

Declaration of Interest Statement

Article Title: High Order Discriminant Analysis Based on Riemannian Optimization

Authors: Zhengming Ma

I confirm that all authors of this manuscript have directly participated in the planning, execution, and analysis of this study, and that the contents of this manuscript have not been copyrighted or published previously. Moreover, the contents of the manuscript are not now under consideration for publication elsewhere. Additionally, there are no directly related manuscripts or abstracts, published or unpublished, by any authors of this manuscript. I am one author signing on behalf of all co-authors of this manuscript, attesting to the above.

Signature: Zhengming Ma    Date: 2020-1-3

Printed Name: Zhengming Ma

Institution: School of Electronics and Information Technology, Sun Yat-sen University

Credit Author Statement

Author contributions:

Zhengming Ma: Conceptualization, Methodology, Software, Investigation, Writing - Original Draft, Supervision.

Wanguang Yin: Validation, Visualization, Software, Writing & Editing.