Empirical kernel map-based multilayer extreme learning machines for representation learning

Empirical kernel map-based multilayer extreme learning machines for representation learning

Accepted Manuscript Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning Chi-Man Vong , Chuangquan Chen , Pak-...

1MB Sizes 0 Downloads 89 Views

Accepted Manuscript

Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning Chi-Man Vong , Chuangquan Chen , Pak-Kin Wong PII: DOI: Reference:

S0925-2312(18)30584-8 10.1016/j.neucom.2018.05.032 NEUCOM 19585

To appear in:

Neurocomputing

Received date: Revised date: Accepted date:

22 January 2018 8 May 2018 9 May 2018

Please cite this article as: Chi-Man Vong , Chuangquan Chen , Pak-Kin Wong , Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning, Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.032

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning 1 Chi-Man Vong ,Chuangquan Chen1, Pak-Kin Wong2 1

Department of Computer and Information Science, University of Macau, Macau 2 Department of Electromechanical Engineering, University of Macau, Macau

Abstract

AC

CE

PT

ED

M

AN US

CR IP T

Recently, multilayer extreme learning machine (ML-ELM) and hierarchical extreme learning machine (H-ELM) were developed for representation learning whose training time can be reduced from hours to seconds compared to traditional stacked autoencoder (SAE). However, there are three practical issues in ML-ELM and H-ELM: 1) the random projection in every layer leads to unstable and suboptimal performance; 2) the manual tuning of the number of hidden nodes in every layer is time-consuming; and 3) under large hidden layer, the training time becomes relatively slow and a large storage is necessary. More recently, issues (1) and (2) have been resolved by kernel method, namely, multilayer kernel ELM (ML-KELM), which encodes the hidden layer in form of a kernel matrix (computed by using kernel function on the input data), but the storage and computation issues for kernel matrix pose a big challenge in large-scale application. In this paper, we empirically show that these issues can be alleviated by encoding the hidden layer in the form of an approximate empirical kernel map (EKM) computed from low-rank approximation of the kernel matrix. This proposed method is called ML-EKM-ELM, whose contributions are: 1) stable and better performance is achieved under no random projection mechanism; 2) the exhaustive manual tuning on the number of hidden nodes in every layer is eliminated; 3) EKM is scalable and produces a much smaller hidden layer for fast training and low memory storage, thereby suitable for large-scale problems. Experimental results on benchmark datasets demonstrated the effectiveness of the proposed ML-EKM-ELM. As an illustrative example, on the NORB dataset, ML-EKM-ELM can be respectively up to 16 times and 37 times faster than ML-KELM for training and testing with a little loss of accuracy of 0.35%, while the memory storage can be reduced up to 1/9.

Index Terms: Kernel learning, Multilayer extreme learning machine (ML-ELM), Empirical kernel map (EKM), Representation learning, stacked autoencoder (SAE).

1. Introduction Autoencoder (AE) is an unsupervised neural network whose input layer is equal to โœ‰ Corresponding author: Chi-Man Vong ([email protected])

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

output layer [1, 2]. AE offers an alternative way for traditional noise reduction and traditional feature extraction [3-5], and automatically extracts effective representation from the raw data. Several AEs can be used as building blocks to form a stacked AE (SAE) [1],which is capable of extracting different levels of abstract features from raw data so that it is suitable for applications of dimension reduction [6, 7] and transfer learning [8, 9]. However, the iterative training procedure in SAE based on backpropagation is extremely time-consuming. For this reason, multilayer extreme learning machine (ML-ELM) [10] was proposed (as shown in Fig.1), where multiple ELM-based AEs (ELM-AE) can be stacked for representation learning, followed by a final layer for classification. In ELM-AE [Fig.1(a)], the training mechanism randomly chooses the input weights ๐š(๐‘–) and bias ๐‘ (๐‘–) for the input layer of AE and then analytically compute the transformation matrix ๐šช (๐‘–) . ELM-AE outputs ๐šช (๐‘–) to learn a new input representation [Fig.1(b)]. After the representation learning is done [Fig.1(d)], the final data representation ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is obtained and used as hidden layer to calculate the output weight ๐œท for classification using regularized least squares. Without any iteration, ML-ELM can reduce the training time from hours to seconds compared to traditional SAE.

Fig. 1. The architecture of ML-ELM [11]: (a) The transformation ๐šช (1) is obtained for representation T

learning in ELM-AE; (b) The new input representation ๐ฑ (2) is computed by ๐‘” (๐ฑ (1) โˆ™ (๐šช (1) ) ), where ๐‘” is an activation function. (c) ๐ฑ (2) is the input to ELM-AE for another representation learning. (d) After the unsupervised representation learning is finished, ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is obtained and used as input to calculate the output weight ๐œท for classification using regularized least squares.

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

More recently, a variant of ML-ELM called hierarchical ELM (H-ELM) [12] was developed. Compared to ML-ELM that directly uses the final data representation ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ as hidden layer to calculate the output weight ๐œท, H-ELM uses ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ as input to an individual ELM for classification (i.e., first randomly maps ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ into hidden layer), thereby maintaining the universal approximation capability of ELM. Both ML-ELM and H-ELM are very attractive due to their fast training speed. However, there are several practical issues of ML-ELM and H-ELM: 1) Suboptimal model generalization: the random input weight ๐š(๐‘–) and bias ๐‘ (๐‘–) in every layer lead to unstable and suboptimal performance. In some cases, poorly generated ๐š(๐‘–) and ๐‘ (๐‘–) may hinder the network from high generalization and hence numerous trials on ๐š(๐‘–) and ๐‘ (๐‘–) are necessary. 2) Exhaustive tuning: The accuracy of ML-ELM and H-ELM are drastically influenced by the number of hidden nodes ๐ฟ๐‘– that requires exhaustive tuning ranging from several hundreds to tens of thousands subject to the complexity of the application. Hence, numerous tedious trials are needed to determine the optimal ๐ฟ๐‘– . 3) Relatively slow training time and high memory requirement under large ๐ฟ๐‘– : In large-scale application, an accurate ML-ELM model can easily contain thousands or tens of thousands of hidden neurons. For example, ๐ฟ๐‘– can be up to 15000 for image dataset NORB[12]. Practically, many devices such as mobile phones are only with a limited amount of expensive memory so that it may be insufficient to store and run large model. It is therefore crucial to run a compact model on limited resources while maintaining satisfactory accuracy. Analytically, the memory requirement and training time of ML-ELM in the i-th layer are respectively O(๐‘›๐ฟ๐‘– ) and O(๐ฟ๐‘– 2 ๐‘›) for n training samples, which pose a challenge to train a model for large ๐ฟ๐‘– , and hence a compact model with faster solution is of interest. From the literature [13], kernel learning is well known for its optimal performance without any random input parameters. Under this inspiration, a kernel version of ML-ELM was proposed, namely, multilayer kernel ELM (ML-KELM) [11], which encodes the hidden layers in form of kernel matrix. Without tuning the parameter ๐ฟ๐‘– , ๐š(๐‘–) and ๐‘ (๐‘–) , ML-KELM successfully addresses the first two issues of ML-ELM and H-ELM, but worsen the third issue about high memory storage and very slow training time. The reason is that the memory storage and computation issues for kernel matrix ๐›€(๐‘–) โˆˆ R๐‘›ร—๐‘› in each layer pose a big challenge in large-scale application. Specifically, its takes a memory of O(๐‘›2 ) to store ๐›€(๐‘–) , and a time of O(๐‘›3 ) to find the inverse of ๐›€(๐‘–) . Thus, both memory and computation cost will grow exponentially along the training data size ๐‘›, which restrict the application of ML-KELM in large-scale problems especially on mobile devices. ฬƒ๐† ฬƒT , From the literature [14-16], a kernel matrix ๐›€ can be decomposed by ๐›€ โ‰ˆ ๐† ฬƒ. ๐† ฬƒ contains most obtaining an approximate empirical kernel map (EKM) ๐† discriminant information of ๐›€ but with much smaller size, which is very efficient for learning over large-scale data [14-16]. 
Under this inspiration, a multilayer empirical kernel map ELM (ML-EKM-ELM) is proposed in this paper which addresses the aforementioned issues of ML-ELM and H-ELM by encoding every hidden layer in

ACCEPTED MANUSCRIPT

CE

PT

ED

M

AN US

CR IP T

ฬƒ (๐‘–) with ๐‘™๐‘– dimensions (where ๐‘™๐‘– โ‰ช ๐‘›) computed form of an approximate EKM ๐† from the low-rank approximation of ๐›€(๐‘–) . Moreover, Nystrรถm method [17, 18] is ฬƒ (๐‘–) due to its efficiency in many large-scale adopted in our work to generate ๐† machine learning problems, which does not need to calculate the entire kernel matrix ฬƒ (๐‘–) . ๐›€(๐‘–) and only operates on a small subset of ๐›€(๐‘–) to generate ๐† Compared to ML-ELM and H-ELM, the contributions of ML-EKM-ELM are: 1) Benefited from kernel learning, the constructed kernel matrix does not require any randomly generated parameter (input ๐š(๐‘–) and bias ๐‘ (๐‘–) ) in each layer so that a stable and theoretically optimal performance is always achieved. EKM efficiently approximates the kernel matrix and shares its stable and optimal performance. 2) The exhaustive tuning of ๐ฟ๐‘– is eliminated. Only a few trials of ๐‘™๐‘– is necessary in ML-EKM-ELM, where ๐‘™๐‘– = 0.01๐‘›, 0.05๐‘›, 0.1๐‘›, 0.2๐‘› typically (to be detailed in Section 4). For simplicity, ๐‘™๐‘– can be equally set in every layer (i.e., ๐‘™1 = ๐‘™2 = ๐‘™3 = โ‹ฏ), while maintaining a satisfactory performance. 3) EKM is scalable (i.e., ๐‘™๐‘– can be set with different values) and easy for storage and computation because it can be a very small matrix for hidden representation (i.e., setting ๐‘™๐‘– to a small value is sufficient for practical application), thereby suitable for large-scale problems and mobile devices. Compared to ML-KELM, the benefits of ML-EKM-ELM are: 1) ML-KELM takes O(๐‘›2 ) to store the matrix ๐›€(๐‘–) for the i-th layer while ฬƒ (๐‘–) , which is linear in the data ML-EKM-ELM takes O(๐‘›๐‘™๐‘– ) to store the matrix ๐† size ๐‘› for ๐‘™๐‘– โ‰ช ๐‘›. 2) ML-KELM requires O(๐‘›3 ) time to find the inverse of ๐›€(๐‘–) while ฬƒ (๐‘–) , which saves ML-EKM-ELM only takes O(๐‘›๐‘™๐‘– 2 ) time to find the inverse of ๐† substantial computing resources. 3) Benefited from the above two advantages, much smaller matrices can be constructed for feasible storage and execution in mobile devices. The rest of this paper is organized as follows. Section 2 introduces the related works including EKM, ML-ELM, and ML-KELM. Section 3 introduces ML-EKM-ELM. In Section 4, ML-EKM-ELM is compared with ML-KELM and other existing deep neural networks over benchmark data. Finally, conclusions are given in Section 5.

AC

2. Related works In this section, we briefly describe EKM, ML-ELM, and ML-KELM. All techniques are necessary to develop our proposed ML-EKM-ELM. 2.1 Empirical kernel map (EKM) Let ๐— = [๐ฑ1 , โ€ฆ , ๐ฑ ๐‘› ]T โˆˆ R๐‘›ร—๐‘‘ be an input data matrix that contains ๐‘› data points in R๐‘‘ as its rows. In kernel learning, the inner products in feature space can be computed by adopting a kernel function ฮบ(. , . , . ) on the input data space: ๐›€

,

= ฮบ(๐ฑ , ๐ฑ , ) = โŒฉ (๐ฑ ), (๐ฑ )โŒช,

, = 1โ€ฆ,๐‘›

(1)

where is an adjustable parameter, ๐›€ โˆˆ ๐‘… ๐‘›ร—๐‘› is the kernel matrix and (๐ฑ) is the kernel-induced feature map [19]. The kernel matrix ๐›€ can be decomposed by

ACCEPTED MANUSCRIPT

R๐‘›ร—๐‘™ and ๐›€๐‘™๐‘™ โˆˆ R๐‘™ร—๐‘™ , where (๐›€๐‘›๐‘™ )

,

CR IP T

๐›€ = ๐†๐†T ,where ๐† = [ (๐ฑ1 ), (๐ฑ 2 ), โ€ฆ , (๐ฑ ๐‘› )]T โˆˆ ๐‘… ๐‘›ร—๐ท (where ๐ท is the rank of ๐›€) is called empirical kernel map (EKM), which is a matrix containing ๐‘› data points in R๐ท as its row. Since the dimensionality ๐ท of ๐† is usually large, an approximate ฬƒ โˆˆ ๐‘… ๐‘›ร—๐‘™ is practically computed from the low-rank (and much smaller) EKM ๐† approximation of ๐›€ using Nystrรถm method, where ๐‘™ โ‰ช ๐‘› and specifically, ฬƒ๐† ฬƒT . The Nystrรถm method works by selecting a small subset of the training data ๐›€โ‰ˆ๐† (e.g., randomly select l samples) referred to as landmark points and constructing a small subset of ๐›€ by computing the kernel similarities between the input data points and landmark points. The Nystrรถm method then operates on this small subset of ๐›€ ฬƒ of ๐‘™ dimensions. and determines an appropriate EKM ๐† T ๐‘™ร—๐‘‘ Let ๐• = [๐ฏ1 , โ€ฆ , ๐ฏ๐‘™ ] โˆˆ R be the set of l landmark points in R๐‘‘ , which is selected from ๐—. The Nystrรถm method first generates two small matrices ๐›€๐‘›๐‘™ โˆˆ = ฮบ(๐ฑ , ๐ฏ , ) , and (๐›€๐‘™๐‘™ )

,

= ฮบ(๐ฏ , ๐ฏ , ) .

M

AN US

Both ๐›€๐‘›๐‘™ and ๐›€๐‘™๐‘™ are the sub-matrices of ๐›€. Next, both ๐›€๐‘›๐‘™ and ๐›€๐‘™๐‘™ are used to approximate the kernel matrix ๐›€: ฬƒ = ๐›€๐‘›๐‘™ ๐›€๐‘™๐‘™ ๐›€T๐‘›๐‘™ (2) ๐›€โ‰ˆ๐›€ where ๐›€๐‘™๐‘™ represents the pseudoinverse of ๐›€๐‘™๐‘™ . By applying eigen-decomposition on ๐›€๐‘™๐‘™ , ๐›€๐‘™๐‘™ = ๐”๐‘™ ๐šฒ๐‘™ ๐”๐‘™T is obtained, where ๐šฒ๐‘™ โˆˆ R๐‘™ร—๐‘™ and ๐”๐‘™ โˆˆ R๐‘™ร—๐‘™ contain the ๐‘™ eigenvalues and the corresponding eigenvectors of ๐›€๐‘™๐‘™ , respectively. Then the approximation of (2) can be expressed as: ฬƒ=๐† ฬƒ๐† ฬƒT (3) ๐›€โ‰ˆ๐›€ and 1

ฬƒ = ๐›€๐‘›๐‘™ ๐”๐‘™ ๐šฒ 2 โˆˆ R๐‘›ร—๐‘™ ๐† ๐‘™

(4)

1/2

ฬƒ is only linear in the . For ๐‘™ โ‰ช ๐‘›, the computational cost to construct ๐†

CE

๐›€๐‘›๐‘™ ๐”๐‘™ ๐šฒ๐‘™

PT

ED

ฬƒ serves as an approximate EKM, and the rows of ๐† ฬƒ are known as virtual where ๐† samples [19]. ฬƒ. Specifically, The Nystrรถm method takes O(๐‘‘๐‘›๐‘™ + ๐‘™ 3 + ๐‘›๐‘™ 2 ) time to generate ๐† it takes O(๐‘‘๐‘›๐‘™) of time to form the matrices ๐›€๐‘›๐‘™ and ๐›€๐‘™๐‘™ , and takes O(๐‘™ 3 ) to apply eigen-decomposition on ๐›€๐‘™๐‘™ , and takes O(๐‘›๐‘™ 2 ) time to calculate the product of

AC

data set size ๐‘›. 2.2 Multilayer Extreme learning machines (ML-ELM) (๐‘–)

(๐‘–) T

(๐‘–)

Referring to Fig. 1, let ๐— (๐‘–) = [๐ฑ1 , โ€ฆ . , ๐ฑ ๐‘› ] , where x

is the i-th data

representation for input x ,for k =1 to n. Let ๐‡ (๐‘–) be the i-th hidden layer output matrix with respect to ๐— (๐‘–) . Then the i-th transformation matrix ๐šช (๐‘–) can be learned by (5) ๐‡ (๐‘–) ๐šช (๐‘–) = ๐— (๐‘–) where

ACCEPTED MANUSCRIPT

๐‡ (๐‘–) = [

1

1 (๐‘–)

and

(๐‘–)

(๐‘–)

(๐‘–)

(๐š1 (๐‘–) , ๐‘1 (๐‘–) , x1 ) โ€ฆ (๐‘–)

(๐š1 (๐‘–) , ๐‘1 (๐‘–) , x๐‘› ) โ€ฆ

(๐‘–)

(๐‘–)

(๐š (๐š

(๐‘–)

(๐‘–)

,๐‘ ,๐‘

(๐‘–)

(๐‘–)

(๐‘–)

, x1 ) (๐‘–)

]

(6)

, x๐‘› )

(๐š(๐‘–) , ๐‘ (๐‘–) , x(๐‘–) ) = ๐‘”๐‘– (๐š(๐‘–) x(๐‘–) + ๐‘ (๐‘–) ), where ๐‘”๐‘– is the activation function in

the i-th layer, and both input ๐š(๐‘–) and bias ๐‘ (๐‘–) are randomly generated in the i-th layer. ๐šช (๐‘–) can be calculated by

๐šช { where

and

(๐‘–)

๐‘›

๐‘› ๐‘–

+ ๐‡ (๐‘–) (๐‡ (๐‘–) )T ) 1 ๐— (๐‘–)

๐‘–

๐ฟ๐‘– (7)

(๐‘–) T

=(

,๐‘›

CR IP T

๐šช (๐‘–) = (๐‡ (๐‘–) )T (

(๐‘–)

1

(๐‘–) T (๐‘–)

+ (๐‡ ) ๐‡ ) (๐‡ ) ๐—

,๐‘›

๐ฟ๐‘–

represent an identity matrix of dimension ๐ฟ๐‘– and n, respectively,

AN US

and the user-specified ๐‘– is for regularization used in the i-th layer. In (7), ๐šช (๐‘–) is used for representation learning and by multiplying ๐— (๐‘–) with ๐šช (๐‘–) , a new data representation ๐— ๐‘–+1 is obtained as shown in Fig. 1(b): (8) ๐— ๐‘–+1 = ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T ) The final data representation of ๐— (1) , namely ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , is obtained after the learning procedure in Fig.1(d) is done. Then ML-ELM directly uses ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ as hidden layer to calculate the output weight ๐›ƒ (9) ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐›ƒ =

M

where = [๐ญ1 , โ€ฆ . , ๐ญ ๐‘› ]T , and ๐ญ โˆˆ ๐‘… ๐‘ is a one-hot output vector, and c is the number of classes. The weight matrix ๐›ƒ can be solved by T

๐‘“๐‘–๐‘›๐‘Ž๐‘™

T

๐‘“๐‘–๐‘›๐‘Ž๐‘™

T

+ ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) )

+ (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) 1 (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )

PT

๐›ƒ=( {

๐‘›

ED

๐›ƒ = (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) (

1

T

,๐‘›

๐ฟ๐‘“๐‘–๐‘›๐‘Ž๐‘™

,๐‘›

๐ฟ๐‘“๐‘–๐‘›๐‘Ž๐‘™

(10)

AC

CE

2.3 Multilayer Kernel Extreme learning machines (ML-KELM) In [11], ML-KELM was developed by integrating kernel learning into ML-ELM to achieve high generalization with less user intervention. ML-KELM contains two steps: 1) Unsupervised representation learning by stacking kernel version of ELM-AEs, namely, KELM-AE; 2) Supervised feature classification using a kernel version of ELM (i.e., K-ELM).

ACCEPTED MANUSCRIPT

Input ๐ฑ ๐‘”

Output

Hidden layer

(๐‘–)

๐Š

๐ฑ (๐‘–)

(๐‘–)

(๐‘–) ๐‘ฅ1

(๐‘–)

(๐‘–)

๐šช (๐‘–)

K1 (๐‘–)

(๐‘–)

โ€ฆ

๐‘ฅ2

๐‘ฅ2

โ€ฆ

โ€ฆ

(๐‘–)

CR IP T

K๐‘›

(๐‘–)

๐‘ฅ๐‘‘

(๐‘–)

๐‘ฅ๐‘‘

๐— (๐‘–)

๐— (๐‘–)

๐›€(๐‘–)

AN US

For all ๐ฑ (๐‘–) :

๐‘ฅ1

Fig. 2. The architecture of the i-th KELM-AE [11], in which hidden layer is encoded in form of a kernel matrix ๐›€(๐‘–)

Identical to ELM-AE, KELM-AE learns the transformation

๐šช (๐‘–) from hidden

layer to output layer. From Fig.2, kernel matrix ๐›€(๐‘–) = [๐Š(๐‘–) (๐ฑ1 (๐‘–) ), โ€ฆ , ๐Š(๐‘–) (๐ฑ ๐‘› (๐‘–) )]T

M

is first obtained by using kernel function ฮบ(๐‘–) (๐ฑ

(๐‘–)

,๐ฑ

(๐‘–)

,

๐‘–)

on the input matrix

ED

๐— (๐‘–) , and then the i-th transformation matrix ๐šช (๐‘–) in KELM-AE is learned similar to ELM-AE in (5) (11) ๐›€(๐‘–) ๐šช (๐‘–) = ๐— (๐‘–)

PT

Similar to (7), ๐šช (๐‘–) in (11) is obtained by ๐šช( ) = (

๐‘› ๐‘–

+ ๐›€(๐‘–) ) 1 ๐— (๐‘–)

CE

The data representation ๐— (๐‘–+1) is calculated similar to (8) ๐— (๐‘–+1) = ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T )

(12) (13)

AC

After the unsupervised representation learning procedure is finished, the final data representation ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is obtained and used as input to train a K-ELM classification (14) ๐›€ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐œท = where ๐›€ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is the kernel matrix defined on ๐— final . The weight matrix ๐œท can be solved by: ๐œท=(

๐‘›

+ ๐›€ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )

๐‘“๐‘–๐‘›๐‘Ž๐‘™ (1)

(15)

Given a set of m test samples ๐™ = [๐ณ1 (1) , โ€ฆ , ๐ณ๐‘š (1) ]T โˆˆ R๐‘šร—๐‘‘ . In the stage of representation learning, the data representation ๐™(๐‘–+1) is obtained by multiplying the i-th transformation matrix ๐šช (๐‘–) (16) ๐™(๐‘–+1) = ๐‘”๐‘– (๐™(๐‘–) (๐šช (๐‘–) )T ) Then, the final data representation ๐™final = [๐ณ1 ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , โ€ฆ , ๐ณ๐‘š ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ]T is obtained and

ACCEPTED MANUSCRIPT used to calculate the test kernel matrix ๐›€๐‘ โˆˆ R๐‘šร—๐‘› (๐›€๐‘ ) where ๐ฑ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,

ฮบ(๐ณ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,๐ฑ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,

(17)

๐‘“๐‘–๐‘›๐‘Ž๐‘™ )

ฬƒ is is the j-th data point from ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ . Finally, the networkโ€™s output ๐˜

given by

ฬƒ = ๐›€๐‘ ๐›ƒ ๐˜

(18)

Remark: For special case, when linear piecewise activation ๐‘”๐‘– was applied to all

๐šช (๐‘–

1)

CR IP T

๐šช (๐‘–) , ๐šช (๐‘–) can be unified into a single matrix ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ (i.e., ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ = ๐šช (๐‘–) โˆ™ โ‹ฏ ๐šช (1) ). For execution, ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ = ๐™(1) (๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ )T is directly computed so that

both issues of memory storage and execution time caused by deep neural network can be alleviated.

AN US

3 Proposed ML-EKM-ELM

Input

(๐‘–)

(๐‘–)

๐œ™1

(๐‘–)

๐œ™๐‘™๐‘–

(๐‘–)

๐‘ฅ๐‘‘

๐— (๐‘–)

๐šช (๐‘–)

๐‘ฅ1

(๐‘–)

๐‘ฅ2

โ€ฆ

โ€ฆ

CE

ฬƒ (๐‘–) ๐“

โ€ฆ

(๐‘–)

AC

๐ฑ (๐‘–)

(๐‘–)

๐‘ฅ2

For all ๐ฑ (๐‘–) :

Output

Hidden layer

๐‘ฅ1

PT

๐‘”2 ๐‘”

ED

๐ฑ

(๐‘–)

M

The proposed ML-EKM-ELM is developed by replacing the randomly generated ฬƒ (๐‘–) computed from low-rank hidden layer in ML-ELM into an approximate EKM ๐† approximation of ๐›€(๐‘–) . In ML-EKM-ELM, EKM version of ELM-AEs (EKM-AE) are stacked for representation learning, followed by a final layer of EKM version of ELM for classification.

(๐‘–)

๐‘ฅ๐‘‘ ฬƒ (๐‘–) ๐†

๐— (๐‘–)

Fig. 3. The architecture of the i-th EKM -AE, in which the hidden layer is encoded in form of ฬƒ (๐‘–) . an approximate EKM ๐†

3.1 EKM-AE In this section, the details of EKM-AE are discussed. As shown in Fig. 3, the input ฬƒ (๐‘–) , where matrix ๐— (๐‘–) is first mapped into empirical kernel map ๐†

ACCEPTED MANUSCRIPT

ฬƒ (๐‘–) = [ ฬƒ (๐‘–) (๐ฑ1 (๐‘–) ), โ€ฆ , ฬƒ (๐‘–) (๐ฑ ๐‘› (๐‘–) )]T . The ๐† ฬƒ (๐‘–) with ๐‘™๐‘– -dimension is calculated (to be ๐† detailed in Section 2.1) by first generating two small matrices ๐›€๐‘›๐‘™ (๐‘–) โˆˆ R๐‘›ร—๐‘™ , ๐›€๐‘™ ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™

using

the

randomly

๐‘™๐‘–

selected

landmark

points

๐• (๐‘–) = [๐ฏ1 (๐‘–) , โ€ฆ , ๐ฏ๐‘™ (๐‘–) ]T from ๐— (๐‘–) such that ,

= ฮบ(๐ฑ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

(๐›€๐‘™ ๐‘™ (๐‘–) )

,

= ฮบ(๐ฏ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

and

(19)

CR IP T

(๐›€๐‘›๐‘™ (๐‘–) )

(20)

Then, ๐”๐‘™ (๐‘–) and ๐šฒ๐‘™ (๐‘–) are obtained by applying eigen-decomposition on ๐›€๐‘™ ๐‘™ (๐‘–)

where ๐šฒ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™

and

(21)

AN US

๐›€๐‘™ ๐‘™ (๐‘–) = ๐”๐‘™ (๐‘–) ๐šฒ๐‘™ (๐‘–) ( ๐”๐‘™ (๐‘–) )T ๐”๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™

contain the ๐‘™๐‘– eigenvalues and the

corresponding eigenvectors of ๐›€๐‘™ ๐‘™ (๐‘–) , respectively. Next, ๐”๐‘™ (๐‘–) , ๐šฒ๐‘™ (๐‘–) and ๐›€๐‘›๐‘™ (๐‘–)

ฬƒ (๐‘–) = (๐›€๐‘›๐‘™ (๐‘–) ) ๐† (๐‘–)

๐‘›ร—๐‘™

โˆˆR

(๐‘–)

,

= ( ๐”๐‘™

(๐‘–)

)(๐šฒ๐‘™

(๐‘–)

)

1 2

(22)

โˆˆ R๐‘™ ร—๐‘™

is the mapping matrix of the i-th layer. Then, the rank-๐‘™๐‘– approximation

ED

where

(๐‘–)

M

ฬƒ (๐‘–) : are used to construct ๐†

ฬƒ (๐‘–) ๐† ฬƒ (๐‘–) T . Practically, it is preferred to replace the ๐›€(๐‘–) can be expressed as ๐›€(๐‘–) โ‰ˆ ๐†

CE

PT

ฬƒ (๐‘–) rather than ๐›€(๐‘–) because ๐† ฬƒ (๐‘–) is much smaller hidden layer in ELM-AE by ๐† than ๐›€(๐‘–) while maintaining most of the discriminant information of ๐›€(๐‘–) . EKM is detailed in Algorithm 1.

AC

Algorithm 1: EKM for the i-th layer Input: Input matrix ๐— (๐‘–) ,kernel function ฮบ(. , . , . ) , kernel parameter landmark set size ๐‘™๐‘– ฬƒ (๐‘–) Output: Mapping matrix (๐‘–) , empirical kernel map ๐‘ฎ

๐‘–

Step 1: Randomly select ๐‘™๐‘– landmark points ๐• (๐‘–) = [๐ฏ1 (๐‘–) , โ€ฆ , ๐ฏ๐‘™ (๐‘–) ]T from ๐— (๐‘–) Step 2: Generate the kernel matrix ๐›€๐‘›๐‘™ (๐‘–) โˆˆ R๐‘›ร—๐‘™ , ๐›€๐‘™ ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™ (๐›€๐‘›๐‘™ (๐‘–) )

,

ฮบ(๐ฑ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

(๐›€๐‘™ ๐‘™ (๐‘–) )

,

ฮบ(๐ฏ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

and

and

ACCEPTED MANUSCRIPT

Step 3: Calculate

๐”๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™ and ๐šฒ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™ by applying

eigen-decomposition on ๐›€๐‘™ ๐‘™ (๐‘–) = ๐”๐‘™ (๐‘–) ๐šฒ๐‘™ (๐‘–) ( ๐”๐‘™ (๐‘–) )T Step 4: Calculate the mapping matrix (๐‘–)

(๐‘–)

= ( ๐”๐‘™

(๐‘–)

)(๐šฒ๐‘™ (๐‘–) )

1 2

ฬƒ (๐‘–) ๐† Return

(๐‘–)

(๐›€๐‘›๐‘™

ฬƒ (๐‘–) ,๐†

(๐‘–)

)

(๐‘–)

CR IP T

ฬƒ (๐‘–) Step 5: Calculate the empirical kernel map ๐‘ฎ

AN US

Next, the i-th transformation matrix ๐šช (๐‘–) in EKM-AE is learned similar to ELM-AE in (5) ฬƒ (๐‘–) ๐šช (๐‘–) = ๐— (๐‘–) (23) ๐† ๐šช (๐‘–) can be solved by ๐šช (๐‘–) = (

๐‘™ ๐‘–

ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) ) 1 (๐† ฬƒ (๐‘–) )T ๐— (๐‘–) + (๐†

(24)

M

Compared to (12) in which KELM-AE needs to invert a matrix ๐›€(๐‘–) of size ๐‘› ร— ๐‘›, ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) of size just ๐‘™๐‘– ร— ๐‘™๐‘– , EKM-AE only needs to invert a much smaller matrix (๐† ๐‘™๐‘– โ‰ช ๐‘›. The data representation ๐— (๐‘–+1) is calculated similar to (8) (25) ๐— ๐‘–+1 = ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T )

ED

EKM-AE is detailed in algorithm 2.

AC

CE

PT

Algorithm 2: EKM-AE for the i-th Layer Input: Input matrix ๐— (๐‘–) , regularization ๐‘– , kernel function ฮบ(. , . , . ) ,kernel parameter ๐‘– , activation function ๐‘”๐‘– and landmark set size ๐‘™๐‘– Output: New data representation ๐— (๐‘–+1), mapping matrix (๐‘–) , empirical kernel map ฬƒ (๐‘–) , and transformation matrix ๐šช (๐‘–) ๐† ฬƒ (๐‘–) Step 1: Calculate (๐‘–) , ๐† EKM(๐— (๐‘–) , ฮบ(. , . , . ), ๐‘– , ๐‘™๐‘– ) Step 2: Estimate the transformation matrix ๐šช (๐‘–) ๐šช (๐‘–)

(

๐‘™ ๐‘–

ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) ) 1 (๐† ฬƒ (๐‘–) )T ๐— (๐‘–) + (๐†

Step3: Calculate new data representation ๐— (๐‘–+1) โ† ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T ) ฬƒ (๐‘–) and ๐šช (๐‘–) Return ๐— (๐‘–+1) , (๐‘–) , ๐† 3.2 Proposed ML-EKM-ELM ML-EKM-ELM follows the two separate learning procedures of ML-KELM [11]. In the stage of unsupervised representation learning, each ๐šช (๐‘–) and ๐— (๐‘–) (for i-th EKM-AE) is obtained using (24) and (25), respectively. In the stage of supervised ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ โˆˆ ๐‘… ๐‘›ร—๐‘™ feature classification, the final EKM ๐† with respect to ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is conveyed as input for training:

ACCEPTED MANUSCRIPT ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐œท = ๐†

(26)

The output weight ๐œท in (26) is solved by ๐œท=(

๐‘™ ๐‘“๐‘–๐‘›๐‘Ž๐‘™

ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T ๐‘ฎ ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) 1 (๐‘ฎ ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T + (๐‘ฎ

(27)

For the test stage, the data representation ๐™(๐‘–+1) of test data ๐™ (1) is obtained by multiplying the i-th transformation matrix ๐šช (๐‘–) (28) ๐™(๐‘–+1) = ๐‘”๐‘– (๐™(๐‘–) (๐šช (๐‘–) )T )

CR IP T

The final data representation ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ = [๐ณ1 ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , โ€ฆ , ๐ณ๐‘š ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ]T is obtained and used to ฬƒ ๐‘ โˆˆ R๐‘šร—๐‘™ calculate the approximate test kernel matrix ๐›€ (29) ฬƒ ๐‘) , (๐›€ ฮบ(๐ณ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , ๐ฏ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) where ๐ฏ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is the j-th selected landmark point from ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ in the training phase. ๐‘“๐‘–๐‘›๐‘Ž๐‘™

๐œท

AN US

Finally, the model output is given by ฬƒ๐‘ ฬƒ=๐›€ ๐˜

(30)

The procedure of ML-EKM-ELM is detailed in Algorithm 3.

AC

CE

PT

ED

M

Algorithm 3: Proposed ML-EKM-ELM Training Stage: Input: Input matrix ๐— (1) , Output matrix , regularization ๐‘– , kernel function ฮบ(. , . , . )๏ผŒ kernel parameter ๐‘– , the number of layer s, activation function ๐‘”๐‘– and landmark set size ๐‘™๐‘– Output: landmark points ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ (selected from ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ), ๐‘  โˆ’ 1 transformation matrix ๐šช (๐‘–) , output weight ๐œท and final mapping matrix ๐‘“๐‘–๐‘›๐‘Ž๐‘™ Step 1: for i=1: ๐‘  โˆ’ 1 do Calculate ๐— (๐‘–+1) , ๐šช (๐‘–) EKM โˆ’ AE(๐— (๐‘–) , ๐‘– , ฮบ(. , . , . ), ๐‘– , ๐‘”๐‘– , ๐‘™๐‘– ) Step 2: i ๐‘  ฬƒ (๐‘–) Step 3: Calculate (๐‘–) , ๐† EKM(๐— (๐‘–) , ฮบ(. , . , . ), ๐‘– , ๐‘™๐‘– ) (๐‘–) ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ฬƒ (๐‘–) Step 4: ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐— (๐‘–) , ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ,๐† ๐† Step 5: Calculate the output weight ๐œท

(

๐‘™ ๐‘“๐‘–๐‘›๐‘Ž๐‘™

ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T ๐† ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) (๐† ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T + (๐†

Return ๐šช (๐‘–) (๐‘– = 1,2, โ€ฆ , ๐‘  โˆ’ 1), ๐œท, ๐‘“๐‘–๐‘›๐‘Ž๐‘™ and ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ Prediction Stage Input: test data ๐™(1) , landmark points ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ , mapping matrix transformation matrix ๐šช (๐‘–) (๐‘– = 1,2, โ€ฆ ๐‘  โˆ’ 1), and output weight ๐œท ฬƒ Output: ๐˜ Step 1: for i=1: ๐‘  โˆ’ 1 do Calculate ๐™(๐‘–+1) = ๐‘”๐‘– (๐™(๐‘–) (๐šช (๐‘–) )T ) Step 2: i ๐‘ ; ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐™(๐‘–)

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,

ACCEPTED MANUSCRIPT ฬƒ ๐‘ โˆˆ R๐‘šร—๐‘™ using ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ and ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ Step 3: Calculate ๐›€ ฬƒ ๐’) (๐›€

,

ฮบ(๐ณ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

ฬƒ๐‘ ฬƒ=๐›€ Step 4: Calculate the networkโ€™s output ๐˜ ฬƒ Return ๐˜

, ๐ฏ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , ๐‘“๐‘–๐‘›๐‘Ž๐‘™

๐‘–)

๐œท

ED

M

AN US

CR IP T

3.3 Memory and Computational complexity In this section, we take an example of three hidden layers to analyze the memory and time complexity of ML-EKM-ELM. A๏ผŽMemory Complexity Compared to ML-KELM that needs O(๐‘›2 ) memory to store ๐›€(๐‘–) and ๐— (๐‘–) for the i-th layer during the training stage, ML-EKM-ELM takes O(๐‘›๐‘™๐‘– ) to store the ฬƒ (๐‘–) and ๐— (๐‘–) for the i-th layer which is only linear in the data size ๐‘›. Hence, matrix ๐† the memory requirement is significantly reduced from quadratic to linear. For execution, with ๐‘‘ features and c classes in the training data and m test samples, ML-KELM needs to store i) O(๐‘š๐‘›) memory for test kernel matrix ๐›€๐’› ; ii) O(๐‘‘๐‘›) and O(๐‘›2 ) memory for transformation matrix ๐šช (1) and ๐šช (2) respectively; iii) O(๐‘›๐‘) memory for ๐œท, resulting in a total of O(๐‘›2 + ๐‘›(๐‘ + ๐‘‘ + ๐‘š)) memory. For ฬƒ ๐‘ ; ii) ML-EKM-ELM, we need to store i) O(๐‘š๐‘™3 ) memory for test kernel matrix ๐›€ O(๐‘‘๐‘™1 ) and O(๐‘™1 ๐‘™2 ) memeory for transformation matrix ๐šช (1) and ๐šช (2) ; iii) O(๐‘™3 2 ) memory for ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ; iv) O(๐‘™3 ๐‘) memory for ๐œท; resulting in a total of ๐‘‚(๐‘š๐‘™3 + ๐‘‘๐‘™1 + ๐‘™1 ๐‘™2 + ๐‘™3 2 + ๐‘™3 ๐‘) memory. Assume ๐‘™1 = ๐‘™2 = ๐‘™3 = ๐‘›๐‘, where ๐‘ โ‰ช 1 . Then the memory complexity of ML-EKM-ELM is equal to O(๐‘š๐‘›๐‘ + ๐‘‘๐‘›๐‘ + 2๐‘›2 ๐‘2 + ๐‘›๐‘๐‘). Since ๐‘, ๐‘‘ and ๐‘š << ๐‘› in large-scale application, by keeping only the largest complexity values, the complexities of ML-KELM and ML-EKM-ELM become O(๐‘›2 ) and O(๐‘›2 ๐‘2 ) , respectively. For special case, when linear piecewise activation ๐‘”๐‘– was applied to all transformation matrix ๐šช (๐‘–) , only one single

PT

transformation matrix ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘

(i. e. , ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ = ๐šช (2) โˆ™ ๐šช (1) ) is necessary for

CE

execution (detailed in Section 2.3). In this case, the complexities of ML-KELM and ML-EKM-ELM are O(๐‘›) and O(๐‘›๐‘) respectively.

AC

B๏ผŽComputational complexity For training stage, the computational complexities of ML-KELM and ML-EKM-ELM are respectively shown in Table 1 and Table 2. In total, ML-KELM takes O(7๐‘›3 + 3๐‘›2 ๐‘‘ + ๐‘›2 ๐‘) while ML-KEM-ELM takes O ((3๐‘™1 2 + 3๐‘™2 2 + 3๐‘™3 2 + 3๐‘™1 ๐‘™2 +๐‘™2 ๐‘™3 + 3๐‘‘๐‘™1 + ๐‘™3 c)๐‘› + 3(๐‘™1 3 + ๐‘™2 3 + ๐‘™3 3 )) .

For

test

stage,

the

time

complexities of ML-KELM and ML-EKM-ELM are shown in Table 3 and Table 4 respectively. In total, the time complexity of the ML-KELM is O(๐‘š(๐‘›๐‘‘ + 2๐‘›2 + ๐‘›๐‘)) and the time complexity of the ML-KEM-ELM is O(๐‘š(๐‘‘๐‘™1 + ๐‘™1 ๐‘™2 + ๐‘™2 ๐‘™3 + ๐‘™3 2 + ๐‘™3 ๐‘)).

ACCEPTED MANUSCRIPT Assume ๐‘™1 = ๐‘™1 = ๐‘™3 = ๐‘›๐‘, where ๐‘ โ‰ช 1 , and keeping only the largest complexity values, and then, for training stage, the computational complexity of ML-KELM and ML-EKM-ELM are O(๐‘›3 ) and O(๐‘2 ๐‘›3 ) , respectively. Similarly, for test stage, the computational complexity of ML-KELM and ML-EKM-ELM are O(๐‘›2 ) and O(๐‘2 ๐‘›2 ) , respectively. Table 1 The computational complexity of ML -KELM in the training stage 1-st layer

2-nd layer

2

O(๐‘› ๐‘‘) O(๐‘›3 ) O(๐‘›2 ๐‘‘) ---

O(๐‘›3 ) O(๐‘›3 ) -O(๐‘›3 )

O(๐‘› O(๐‘›3 ) O(๐‘›3 ) O(๐‘›2 ๐‘‘) --

O(๐‘›2๐‘)

AN US

Construct ๐›€ Invert ๐›€(๐‘–) Calculate ๐šช (๐‘–) Calculate ๐— (๐‘–) Calculate ๐œท

3-rd layer

3)

CR IP T

(๐‘–)

Table 2 The computational complexity of ML-EKM-ELM in the training stage 2-nd layer

3-rd layer

O(๐‘›๐‘™1 ๐‘‘)

O(๐‘›๐‘™1 ๐‘™2 )

๐‘‚(๐‘›๐‘™2 ๐‘™3 )

O(๐‘™1 )

O(๐‘™2 )

O(๐‘™3 3 )

O(๐‘™2 3 ) O(๐‘›๐‘™2 2 ) O(๐‘›๐‘™2 2 ) O(๐‘™2 3 ) O(๐‘›๐‘™2 2 ) + O(๐‘›๐‘™2 ๐‘™1 ) O(๐‘›๐‘‘๐‘™1 )

O(๐‘™3 3 ) O(๐‘›๐‘™3 2 ) O(๐‘›๐‘™3 2 ) O(๐‘™3 3 ) -O(๐‘›๐‘™1 ๐‘™2 )

3

O(๐‘™1 3 ) O(๐‘›๐‘™1 2 ) O(๐‘›๐‘™1 2 ) O(๐‘™1 3 ) O(๐‘›๐‘™1 2 ) + O(๐‘›๐‘™1 ๐‘‘) ---

3

--

PT

Calculate ๐šช (๐‘–) Calculate ๐— (๐‘–) Calculate ๐œท

1-st layer

M

Construct ๐›€๐‘›๐‘™ and ๐›€๐‘™ ๐‘™ Decompose ๐›€๐‘™ ๐‘™ (๐‘–) Calculate (๐‘–) ฬƒ (๐‘–) Calculate ๐† ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) Calculate (๐† ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) Inverse (๐†

(๐‘–)

ED

(๐‘–)

O(๐‘›๐‘™3 2 ) + O(๐‘›๐‘™3 c)

CE

Table 3 The computational complexity of ML -KELM in the test stage

(๐‘–)

AC

Calculate ๐™ Calculate ๐›€๐‘ ฬƒ Calculate ๐˜

1-st layer

2-nd layer

----

O(๐‘š๐‘›๐‘‘) ---

Table 4 The computational complexity of ML-EKM-ELM in the test stage 1-st layer

Calculate Calculate Calculate

๐™ (๐‘–) ฬƒ๐’ ๐›€ ฬƒ ๐˜

3-rd layer O(๐‘š๐‘›2 ) O(๐‘š๐‘›2 ) O(๐‘š๐‘›๐‘)

----

4 Experiments and results

2-nd layer O(๐‘š๐‘‘๐‘™1 ) ---

3-rd layer O(๐‘š๐‘™1 ๐‘™2 ) O(๐‘š๐‘™2 ๐‘™3 )

O(๐‘š๐‘™3 2 ) + O(๐‘š๐‘™3 ๐‘)

ACCEPTED MANUSCRIPT

CR IP T

Extensive experiments were conducted to evaluate the accuracy, time complexity and memory requirements of the proposed ML-EKM-ELM, which are further compared with ML-KELM and other relevant state-of-the-art methods. 4.1 Comparison Between ML-EKM-ELM and ML-KELM ML-EKM-ELM and ML-KELM are evaluated over 12 publicly available benchmark datasets from UCI Machine Learning repository [20] and http://openml.org [21]. All data sets are described in Table 5, nine of which are originally used in [11] and the results in [11] have shown that ML-KELM outperforms ML-ELM and H-ELM on average by 2.3% and 4.1% respectively. Table 5 Properties of benchmark data sets

1473 1484 2600 3279 5000 5974 7000 7400 9298 10299 14395 19282

1178 1187 2080 2623 4000 4779 5600 5920 7438 8239 11516 15426

295 297 520 656 1000 1195 1400 1480 1860 2060 2879 3856

AN US

CMC Yeast Madelon Isolet Waveform HAR1 Gisette TwoNorm USPS HAR2 Sylva Adult

Numbers of training Numbers of test samples samples

M

Instances

Features

Classes

9 8 500 617 21 561 5000 20 254 561 212 122

3 10 2 26 2 6 2 2 10 6 2 2

ED

Dataset

PT

A. Experiments Setup The experiments were carried out in Matlab 2015a under MacOS Sierra 10.12 with an Intel Core i5 of 3.4 GHz and 24 GB RAM. Fivefold cross validation is used for all experiments for a fair comparison. Following the practice in [11], RBF

CE

kernel ฮบ(๐‘–) (๐ฑ , ๐ฑ ,

๐‘– ) = exp(โˆ’

โˆฅ๐ฑ๐‘˜ ๐ฑ๐‘— โˆฅ22 ) 2๐œŽ 2

was adopted for all experiments, activation

AC

function ๐‘”๐‘– was chosen as linear piecewise activation, and the number of hidden layers is 3. ML-KELM has an automatically determined structure of ๐‘‘ โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘. The kernel parameter ๐œŽ๐‘– 2 (i=1 to 3) in each layer is respectively set as (๐‘–)

๐ฑ

โˆ’๐ฑ

(๐‘–)

๐›ฝ ๐‘›2

โˆ‘๐‘›,

=1

โˆฅ

โˆฅ22 , where ๐›ฝ๐‘– โˆˆ {10 2 , 10 1 , 100 , 101 , 102 }. The regularization parameter

in each layer is set as 225 because the experiments indicated that a sufficiently large ๐‘– can result in satisfactory performance on most datasets. Therefore, ML-KELM results in a total of 125 combinations of parameters for (๐œŽ1 , ๐œŽ2 , ๐œŽ3 , 225 , 225 , 225 ). The network structure of ML-EKM-ELM is ๐‘‘ โˆ’ ๐‘™1 โˆ’ ๐‘™2 โˆ’ ๐‘™3 โˆ’ ๐‘. For the sake of ๐‘–

ACCEPTED MANUSCRIPT simplicity, we simply set ๐‘™1 = ๐‘™2 = ๐‘™3 = ๐‘๐‘›, but not to find out the optimal network structure of ML-EKM-ELM because our aim is just to show the effectiveness of ML-EKM-ELM over relevant state-of-the-art methods. Hence, the network structure of ML-EKM-ELM becomes ๐‘‘ โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘. The kernel parameter ๐œŽ๐‘– (i=1 to 3) in each layer is respectively set as

๐›ฝ ๐‘›โˆ™๐‘™

โˆ‘๐‘›=1 โˆ‘๐‘™ =1 โˆฅ ๐ฑ

(๐‘–)

โˆ’ ๐ฏ (๐‘–) โˆฅ22 , where

CR IP T

๐›ฝ๐‘– โˆˆ {10 2 , 10 1 , 100 , 101 , 102 }. The regularization parameter ๐‘– in each layer is simply set as 225 . Similarly, a total of 125 combinations of (๐œŽ1 , ๐œŽ2 , ๐œŽ3 , 225 , 225 , 225 ) are resulted, which is consistent with ML-KELM. The parameter ๐‘ is only selected from {0.01, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3} because our later experiments show that a large ๐‘ (> 0.2) cannot result in significant increase of test accuracy but leads to very heavy computation burden.

AC

CE

PT

ED

M

AN US

B. Evaluation and Performance Analysis The training time and the test accuracy of ML-EKM-ELM both depend on the selected value of p. The training time and test accuracy is plotted as a function of p in Figure 4 for Isolet, USPS, HAR2 and Sylva, which are selected from Table 5.

Fig. 4 Training time and test accuracy with different p (ML-EKM-ELM)

ACCEPTED MANUSCRIPT

M

AN US

CR IP T

Fig.4 shows several properties of the proposed work: i) The test accuracy is improved as p grows (i.e., adopting more landmark points). The reason is that ML-EKM-ELM can preserve more discriminant information when more training samples are selected as landmark points to construct the EKM. ii) The value p to achieve satisfactory accuracy depends on the application dataset. For some dataset (e.g., Syala), a small p is sufficient to achieve satisfactory accuracy, while on some difficult tasks (e.g., Isolet), lager p is needed for a good accuracy. iii) The test accuracy grows slowly when the value of p reaches a certain threshold, which exposes that selecting a certain number of training samples as landmark points is sufficient to preserve most discriminant information in EKM. iv) After a certain threshold, increasing of p can only slightly improve the accuracy, but incurs a sharp increase of training time. From the experiments, ML-EKM-ELM with ๐‘ = 0.1 can already achieve satisfactory accuracy in most adopted datasets, which shows that selecting a small part of training samples as landmark points should be sufficient to construct EKM for learning a robust classifier. In practice, the selection of p is a tradeoff between accuracy and the limited resources (i.e., time and memory requirements). As a rule of thumb, we can first set ๐‘ = 0.1 to test the model performance of ML-EKM-ELM. If a device (e.g., mobile device) cannot provide enough resources to store and run such model, we can try a smaller ๐‘ (e.g., ๐‘ = 0.05, 0.01). If the accuracy does not satisfy the demand of the application, we can set a larger one (e.g., ๐‘ = 0.2, 0.3). The time complexities and memory requirements of ML-EKM-ELM and ML-KELM are listed in Table 6 and Table 7 for comparison. For ML-EKM-ELM, the selected value of p is with relative error of less than 1% in terms of accuracy |TA1 TA2| TA1

ED

compared to ML-KELM (i.e.,

ร— 100%

1%, where TA1 and TA2 are the

PT

corresponding test accuracies of ML-KELM and ML-EKM-ELM). In Table 6, ML-EKM-ELM achieves a substantial reduction of training time to get the comparable accuracy of ML-KELM. For adult dataset, our proposed ML-EKM-ELM runs 279 times faster than ML-KELM only with a little loss of 0.11%. The reason is

CE

ฬƒ (๐‘–) and invert ๐† ฬƒ (๐‘–) T ๐† ฬƒ (๐‘–) , that ML-EKM-ELM only takes O(๐‘›๐‘™๐‘– 2 ) time to form ๐†

AC

while ML-KELM requires O(๐‘›3 ) time to form and invert ๐›€(๐‘–) . In addition, the test ฬƒ ๐‘ , ๐šช (๐‘–) and ๐›ƒ are with much smaller time is also substantially reduced because ๐›€ size in the execution stage. For adult dataset, our proposed ML-EKM-ELM runs 315 times faster than ML-KELM. In Table 7, ML-EKM-ELM achieves a substantial reduction of training memory as ฬƒ (๐‘–) while well because ML-EKM-ELM requires a storage of ๐‘› ร— ๐‘™๐‘– matrix ๐† ML-KELM requires a storage of ๐‘› ร— ๐‘› kernel matrix ๐›€(๐‘–) . For adult dataset, the training memory requirement of ML-EKM-ELM is 34.9 MB while ML-KELM takes 3348 MB. In addition, the test memory requirement of ML-EKM-ELM is also ฬƒ ๐‘ , ๐šช (๐‘–) and substantially reduced because only some much smaller matrices (such as ๐›€ ๐›ƒ ) are necessary in the execution stage. For adult dataset, the test memory requirement of ML-EKM-ELM is 4.69 MB while ML-KELM takes up to 432 MB. In

ACCEPTED MANUSCRIPT

both analyses, a conclusion can be drawn that ML-EKM-ELM is of much lower computational complexity and memory requirements than ML-KELM. Table 6 The time complexities of ML-EKM-ELM and ML-KELM Test Accuracy (%) Data Set

Training time (in seconds)

Test time (in seconds)

p TA1

TA2

Tr1

Tr2

Tr1/ Tr2

Te1

Te2

Te1/Te2

0.05

55.48 2.5

55.29 2.5

0.261

0.014

18.6

0.040

0.0016

25.0

Yeast

0.03

59.96 2.

59.39 2.5

0.273

0.010

27.3

0.038

0.0013

29.2

Madelon

0.2

79.13 1.

79.46 2.1

1.213

0.372

3.3

0.162

0.0227

7.1

Isolet

0.2

94.76 0.

94.00 0.

2.056

0.648

3.2

0.239

0.0347

6.9

Waveform

0.01

86.84 1.0

86.72 1.1

5.113

0.025

204.5

0.554

0.0025

221.6

HAR1

0.1

98.75 0.2

97.94 0.3

8.527

0.924

9.2

0.803

0.0564

14.2

Gisette

0.05

97.89 0.5

97.47 0.5

266.6

19.87

13.4

23.77

0.8512

27.9

TwoNorm

0.01

97.80 0.

97.82 0.3

13.44

0.050

268.8

1.212

0.0043

281.9

2.312

0.1928

12.0

3.057

0.1444

21.2

CR IP T

CMC

0.1

98.37 0.3

97.53 0.

26.32

2.400

11.0

0.07

99.11 0.2

98.23 0.3

35.51

1.881

18.9

Sylva

0.01

99.40 0.1

98.47 0.2

83.69

0.354

236.4

6.844

0.0258

265.3

Adult

0.01

84.77 0.5

84.66 0.

212.9

0.763

279.0

15.50

0.0492

315.0

AN US

USPS HAR2

*TA1: the test accuracy of ML-KELM; TA2: the test accuracy of ML-EKM-ELM; Tr1: the training time of ML-KELM; Tr2: the training time of ML-EKM-ELM; Te1: the test time of ML-KELM; Te2: the test time of ML-EKM-ELM

Test Accuracy (%) Data Set

p TA2

ED

TA1

M

Table 7 The memory requirements of ML-EKM-ELM and ML-KELM Training memory (in MB)

Test memory (in MB)

TrM1

TrM2

TrM1/ TrM2

TeM1

TeM2

TeM1/TeM2

0.05

55.48 2.5

55.29 2.5

19.4

1.03

18.8

2.52

0.18

14.0

Yeast

0.03

59.96 2.

59.39 2.5

20.2

0.62

32.6

2.67

0.10

26.7

Madelon

0.2

79.13 1.

79.46 2.1

67.7

16.6

4.1

15.2

5.59

2.7

Isolet

0.2

Waveform

0.01

HAR1

0.1

PT

CMC

111

Gisette

0.05

97.89 0.5

97.47 0.5

650

33.8

19.2

259

14.1

18.4

TwoNorm

0.01

97.80 0.

97.82 0.3

503

5.16

97.5

63.8

0.69

92.5

USPS

0.1

98.37 0.3

97.53 0.

807

84.6

9.5

113

19.4

5.8

HAR2

0.07

99.11 0.2

98.23 0.3

998

71.2

14.0

154

15.7

9.8

Sylva

0.01

99.40 0.1

98.47 0.2

1922

19.7

97.6

256

2.72

94.1

Adult

0.01

84.77 0.5

84.66 0.

3348

34.9

95.9

432

4.69

92.1

94.00 0.

26.0

4.3

24.7

8.94

2.8

86.84 1.0

86.72 1.1

229

2.38

96.2

29.4

0.33

89.1

98.75 0.2

97.94 0.3

340

36.3

9.4

59.5

9.31

6.4

AC

CE

94.76 0.

*TA1: the test accuracy of ML-KELM; TA2: the test accuracy of ML-EKM-ELM; TrM1: the training memory of ML-KELM; TrM2: the training memory of ML-EKM-ELM; TeM1: the test memory of ML-KELM; TeM2: the test memory of ML-EKM-ELM

C.

Evaluation on the Effect of Multiple Hidden Layers To further illustrate the effect of multiple hidden layers of ML-EKM-ELM, a highly nonlinear data set Madelon is employed, which has 2600 samples with 500 features. Madelon was originally proposed in the NIPS 2003 feature selection

ACCEPTED MANUSCRIPT

AN US

CR IP T

challenge[22]. Fivefold cross-validation is used in the following experiment, and the input attributes are normalized into the range [-1,1]. In highly nonlinear application, ML-EKM-ELM may require more hidden layers to obtain satisfactory performance. In this experiment, the number of hidden layers is set from 2 to 5. The test accuracy of ML-EKM-ELM under different hidden layers is shown in Table 8, and ML-KELM is also included to make a comparison. From Table 8, it is evident that: i) Under 2 layers, the accuracy of ML-KELM outperforms ML-EKM-ELM. However, the accuracy of ML-EKM-ELM is improved as p grows up. The reason is that shallow model needs to map the highly nonlinear input data to a very high dimensional space which is linearly separable. Therefore, it usually requires large number of hidden neurons to obtain better accuracy for highly nonlinear problem. ii) When p is large enough (p 0.05 for madelon), ELM-EKM-ELM with sufficient number of the hidden layers can perform as close as ML-KELM in terms of accuracy. iii) When the number of hidden layers reaches a certain threshold (4 hidden layers on madelon), the accuracy keeps almost constant or even starts to decline, which reveals that kernel-based model may not need a very deep structure to achieve satisfactory performance due to its strong approximation ability for nonlinear problem. Otherwise, overfitting may even occur. Table 8 Evaluation on the effect of multiple hidden layers over Madelon 2 layers 72.38 2.1

ML-EKM-ELM(p=0.2)

68.77 2.2

ML-EKM-ELM(p=0.1) ML-EKM-ELM(p=0.05)

5 layers

79.73

.๐Ÿ”

78.55 1.7

79.46 2.1

79.92 ๐Ÿ. ๐Ÿ

79.42 2.1

67.86 1.

75.43 2.1

80.51 ๐Ÿ.

80.47 1.8

65.90 2.

74.68 1.

79.56 1.

79.62 2.4

64.11 2.0

66.06 3.0

67.35 2.

67.37 ๐Ÿ. ๐Ÿ

ED

ML-EKM-ELM(p=0.01)

4 layers

79.13 1.

M

ML-KELM

3 layers

PT

The best result of each row is shown in bold.

AC

CE

4.2 Comparison with State-of-the-Art methods on NORB dataset In this experiment, a more complicated data set called NYC Object Recognition Benchmark (NORB) [23] was evaluated to further confirm the effectiveness of the proposed ML-EKM-ELM. NORB contains 48,600 images of different 3D toy objects of five distinct categories: animals, humans, airplanes, trucks, and cars, as shown in Fig. 5. Each image is captured from different viewpoints and under various lighting conditions. The training set contains 24,300 stereo training images of size 2 ร— 32 ร— 32 (i.e., 2048 dimensions), and another set of 24,300 images for test. We adopted the pre-fixed train/test data and the preprocessing method used in [12] to produce the comparable performance of our proposed work and the state-of-the-art relevant algorithms.

CR IP T

ACCEPTED MANUSCRIPT

(a)

(b)

Fig.5 a Examples for training figures. b Examples for testing figures [23]

AC

CE

PT

ED

M

AN US

Existing mainstream methods are compared to verify the effectiveness of the proposed ML-EKM-ELM, including BP-based algorithms (Stacked Auto Encoders (SAE) [2], Stacked Denoising Autoencoder (SDA) [24], Deep Belief Networks (DBN) [25], Deep Boltzmann Machines (DBM) [26], and multilayer perceptron (MLP) [27]), and ELM-based training algorithms (ML-ELM[10] and H-ELM[12]). For BP-based algorithms (SAE, SDA, BDN, DBM, and MLP), the initial learning rate is set as 0.1 and the decay rate is set as 0.95 for each learning epoch. For ELM-based training algorithms (ML-ELM and H-ELM with 3 layers), the regularization parameter ๐‘– of ML-ELM in each layer is respectively set as 10 1 , 103 and 108 ,while the regularization parameter of H-ELM is respectively set as 1,1 and 230 . More detailed parameters setting could be checked in [12]. Due to the huge computational cost, the experimental results on SAE, SDA, BDN, DBM, MLP, and ML-ELM are directly cited from [12]. Only the network structure of H-ELM with 20 โˆ’ 3000 โˆ’ 3000 โˆ’ 15000 โˆ’ 5 is tested in our experiment because [12] has demonstrated that such network structure can obtain optimal performance in NORB dataset. ML-KELM has an automatically determined structure of ๐‘‘ โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘, and for the sake of simplicity, ML-EKM-ELM was tested with network structure of ๐‘‘ โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘. Experiment in section 4.1 showed that adopting 10% of training samples may be sufficient to construct EKM for learning a robust ML-EKM-ELM, therefore ๐‘ = 0.1 was adopted in our experiments and ๐‘ = 0.01, 0.05 are included as well for comparison. The kernel parameter ๐œŽ๐‘– (i=1 to 3) in each layer is respectively set as

๐›ฝ ๐‘›โˆ™๐‘™

โˆ‘๐‘›=1 โˆ‘๐‘™ =1 โˆฅ ๐ฑ

(๐‘–)

โˆ’ ๐ฏ (๐‘–) โˆฅ22 , where ๐›ฝ๐‘– โˆˆ

{10 2 , 10 1 , 100 , 101 , 102 }, the regularization parameter ๐‘– in the first two layer is set as 225 , and in the final layer is selected from {25 , 210 , 215 , 220 , 225 }. A total of 625 combinations of (๐œŽ1 , ๐œŽ2 , ๐œŽ3 , 225 , 225 , 3 ) are tested in our experiments. All the experiments are repeated 100 times under the best parameters and the average accuracy and maximum accuracy are reported. The average training time for each model with fixed parameters is shown in Table 9. For ML-KELM, practically, it

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

may take several weeks to thoroughly try out the numerous combinations of parameters for the best model. In addition to substantial training time, the test time of ML-KELM is 427.48s and the test memory requirement is 4,577 MB, which restricts the application of ML-KELM on many devices with only limited resources such as mobile phones. The proposed ML-EKM-ELM can overcome the time and memory limitation of ML-KELM while comparable accuracy is maintained. Under the comparable accuracy of H-ELM (with structure of 20 โˆ’ 3000 โˆ’ 3000 โˆ’ 15000 โˆ’ 5), the training time and test time of ML-EKM-ELM with ๐‘ = 0.05(the corresponding structure is 20 โˆ’ 1215 โˆ’ 1215 โˆ’ 1215 โˆ’ 5) , are surprisingly fast, i.e., 17.46s and 4.96s respectively, while the training memory and test training memory are only 453MB and 250MB respectively. The reason is that ML-EKM-ELM with ๐‘ = 0.05 employs a much smaller EKM, which is easy for storage and computation. ML-EKM-ELM with ๐‘ = 0.1 (the corresponding structure is 20 โˆ’ 2 30 โˆ’ 2 30 โˆ’ 2 30 โˆ’ 5) performs better than ML-ELM and H-ELM in terms of accuracy, up to 4% and 2% with a smaller variance, respectively. ML-ELM and H-ELM are suboptimal because the input weights ๐š(๐‘–) and the bias ๐‘ (๐‘–) in every layer are randomly generated. In some cases, poorly generated ๐š(๐‘–) and ๐‘ (๐‘–) deteriorate the generalization of ML-ELM and H-ELM. Although large number of hidden nodes can improve the stability (i.e., lower standard deviation (STD)) of the accuracy in H-ELM, the STD of the accuracy in H-ELM with 15000 hidden nodes is still 0.40% as shown in Table 9. Instead, ML-EKM-ELM uses training samples as landmark points (in practice, adopting a small portion of training samples is sufficient) to generate features, which contains more discriminant information than the randomly generated features [28] and thereby its accuracy is higher and more stable. Although only 2,430 features ( ๐‘ = 0.1 means to select 2,430 training samples as landmark points from 24,300) are used in ML-EKM-ELM, the STD of ML-EKM-ELM drops to 0.24% compared to 0.40% in H-ELM with 15,000 features. Note that the accuracy of ML-EKM-ELM can be further improved and can become more stable as p grows. Moreover, ML-ELM and H-ELM need to exhaustively try out the values of ๐ฟ๐‘– from several hundreds to tens of thousands because their accuracies are very sensitive to ๐ฟ๐‘– . In contrast, ML-EKM-ELM only need a few trials of ๐‘ in our experiment. Therefore, the exhaustive and tedious tuning over ๐ฟ๐‘– is eliminated in ML-EKM-ELM. Table 9 Comparison on NORB dataset Accuracy (%)

Training

Test time(s)

time(s)

Training

Test

Memory(MB)

Memory(MB)

SAEa [2]

86.28

60504.34

N/A

N/A

N/A

a

SDA [24]

87.62

65747.69

N/A

N/A

N/A

a

DBN [25]

88.47

87280.42

N/A

N/A

N/A

a

89.65

182183.53

N/A

N/A

N/A

84.20

34005.47

N/A

N/A

N/A

DBM [26] a

MLP-BP [27]

ACCEPTED MANUSCRIPT

a

HELM [12]

88.91

775.29

N/A

N/A

N/A

91.28

432.19

N/A

N/A

N/A

155.19

26.48

3604

3082

3.07

1.35

91

46

17.46

4.96

453

250

60.33

11.62

932

545

965.66

427.48

8785

4577

Maximum: 91.56

HELMb [12]

Mean: 90.47 0. 0

ML-EKM-ELMb

Maximum: 87.98

(p=0.01)

Mean:86.51 0.7

ML-EKM-ELMb

Maximum: 92.18

(p=0.05)

Mean: 91.50 0.30

ML-EKM-ELMb

Maximum:93.17

(p=0.1)

Mean: 92.38 0.2

ML-KELMb

93.52*

CR IP T

ML-ELMa [10]

a. Results from[12] ; machine configuration: Laptop, Intel-i7 2.4G CPU, 16G DDR3 RAM, Windows 7, MATLAB R2013b

b. Results from our computer; machine configuration: macOS Sierra 10.12 with an Intel Core i5 of 3.4 GHz and 24

AN US

GB RAM, MATLAB R2015a

N/A. Results from[12] and [12] didnโ€™t report the results of test time, training memory and test memory. * The standard deviation (STD) of ML-KELM is not reported due to ML-KELM does not have any random input parameters (๐š(๐‘–) and ๐‘ (๐‘–) ) and does not need to randomly select the training sample as landmark points.

AC

CE

PT

ED

M

4.3 Comparison with State-of-the-Art methods on 20Newsgroups dataset To test the performance of ML-EKM-ELM on text categorization, 20Newsgroups2 was used in our experiment. which contains 18846 documents with 26214 distinct words. This dataset has 20 categories, each with around 1000 documents. In our experiment, 11314 documents (60%) is adopted for training data and 7532 documents (40%) for test data. Due to the huge dimensionality of 20Newsgroups (up to 26214 features), large number of hidden nodes is need for autoencoder to keep enough discriminant information. Therefore, for ML-ELM and H-ELM, the numbers of hidden nodes ๐ฟ๐‘– (๐‘– = 1 to 3) are, respectively, set as 500 ร— ๐‘š {๐‘š = 1,2,3, โ€ฆ ,10}. For the proposed ML-EKM-ELM, we simply set ๐‘™1 = ๐‘™2 = ๐‘™3 = ๐‘๐‘› , and ๐‘ = {0.1,0.2,0.3 }. For ML-ELM,ML-KELM and ML-EKM-ELM, the regularization parameter ๐‘– in each layer is simply set as 225 , while the regularization parameter of H-ELM is respectively set as 1,1 and 230 as recommended by [12].All the experiments are repeated 100 times under the best parameters and the average accuracy and maximum accuracy are reported in Table 10. Similar to the NORB dataset, ML-EKM-ELM (p=0.2) takes fewer hidden nodes (2262 vs. 3500) in each layer to obtain the comparable accuracy of ML-ELM and H-ELM because ML-EKM-ELM contains more discriminant information in the hidden layer than the randomly generated hidden layer in ML-ELM and H-ELM. Therefore, ML-EKM-ELM outperforms ML-EKM and H-ELM both in computation time and memory requirement under the similar accuracy. Furthermore, ML-EKM-ELM (p=0.2) significantly outperforms ML-ELM, 2

20Newsgroups is download at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html

ACCEPTED MANUSCRIPT

CR IP T

ML-KELM, and H-ELM in terms of training time (40.24 < 186.69 < 196.01 < 316.95). The reason is that dealing with high dimensional features (up to 26,214 input features and thousands of hidden nodes in each layer for 20Newsgroups) poses a big challenge for ML-ELM, ML-KELM and H-ELM. In details, ML-ELM needs to ensure orthogonality of the large randomly generated matrix (i.e., input weights and biases) in each layer for better generalization ability. ML-KELM needs to calculate the entire kernel matrix ๐›€(๐‘–) in each layer from the big input matrix ๐— (๐‘–) . H-ELM needs to apply fast iterative shrinkage-thresholding algorithm (FISTA) to the big input matrices ๐— (๐‘–) and ๐‡ (๐‘–) for sparse autoenocoder. In contrast, ML-EKM-ELM can avoid these above-mentioned operations, maintaining a very attractive training time. Table 10 Comparison on 20Newsgroups Accuracy (%)

Table 10 Comparison on 20Newsgroups

| Method | Accuracy (%) | Network Structure | Training time (s) | Test time (s) | Training Memory (MB) | Test Memory (MB) |
|---|---|---|---|---|---|---|
| ML-ELM | Maximum: 85.38; Mean: 84.85 ± 0.1 | 26214-3500-3500-3500-20 | 186.69 | 5.08 | 1413 | 1024 |
| H-ELM | Maximum: 85.48; Mean: 84.93 ± 0.20 | 26214-3500-3500-4000-20 | 316.95 | 6.55 | 1454 | 1065 |
| ML-EKM-ELM (p=0.1)** | Maximum: 84.05; Mean: 83.47 ± 0.25 | 26214-1131-1131-1131-20 | 10.14 | 1.74 | 402 | 290 |
| ML-EKM-ELM (p=0.2)** | Maximum: 85.52; Mean: 85.04 ± 0.18 | 26214-2262-2262-2262-20 | 40.24 | 4.31 | 844 | 617 |
| ML-EKM-ELM (p=0.3)** | Maximum: 85.95; Mean: 85.48 ± 0.17 | 26214-3394-3394-3394-20 | 104.10 | 7.16 | 1320 | 982 |
| ML-KELM | 86.47* | 26214-11314-11314-11314-20 | 196.01 | 30.27 | 3946 | 2734 |

* The standard deviation (STD) of ML-KELM is not reported because ML-KELM has no random input parameters (a^(i) and b^(i)) and does not need to randomly select training samples as landmark points.
** ML-EKM-ELM with p = 0.1 results in 1131 (i.e., pn = 0.1 × 11314 ≈ 1131) hidden nodes in every layer; ML-EKM-ELM with p = 0.2 results in 2262 (i.e., pn = 0.2 × 11314 ≈ 2262) hidden nodes in every layer; ML-EKM-ELM with p = 0.3 results in 3394 (i.e., pn = 0.3 × 11314 ≈ 3394) hidden nodes in every layer.

5 Conclusion

An efficient kernel-based deep learning algorithm called ML-EKM-ELM is proposed, which replaces the randomly generated hidden layer in ML-ELM with an approximate empirical kernel map (EKM) of l_i dimensions (where l_i = pn, p ∈ {0.01, 0.05, 0.1, 0.2} typically, and n is the data size). In this way, ML-EKM-ELM resolves the three practical issues of ML-ELM and H-ELM. 1) ML-EKM-ELM is based on kernel learning, so no random projection is necessary. As a result, optimal and stable performance can be achieved under a fixed set of parameters; in the experiments, ML-EKM-ELM with p = 0.1 is respectively up to 2% and 4% more accurate than H-ELM and ML-ELM on NORB. 2) ML-EKM-ELM does not need to exhaustively tune the parameters L_i for all layers as ML-ELM and H-ELM do; only a few trials of l_i are necessary. For simplicity, l_i can be set equal in every layer (i.e., l_1 = l_2 = l_3 = ...) while maintaining satisfactory performance. 3) Both computation time and memory requirements can be significantly reduced in ML-EKM-ELM. For NORB, at accuracy comparable to H-ELM, ML-EKM-ELM with p = 0.05 is respectively up to 9 times and 5 times faster than H-ELM for training and testing, while the training memory storage and test memory storage can be reduced to as little as 1/8 and 1/12, respectively. Furthermore, ML-EKM-ELM overcomes the memory storage and computation issues of ML-KELM, producing a much smaller hidden layer for fast training and low memory storage. For NORB, ML-EKM-ELM with p = 0.1 is respectively up to 16 times and 37 times faster than ML-KELM for training and testing, with a small accuracy loss of 0.35%, while the memory storage can be reduced to as little as 1/9.

To summarize, we empirically show that the hidden layers of a multilayer neural network can be encoded in the form of an EKM. In the future, more advanced low-rank approximation methods, such as the clustered Nyström method [18] and the ensemble Nyström method [29], may be tested to obtain a more effective EKM and further improve the performance of the proposed ML-EKM-ELM. Finally, ML-EKM-ELM has the following limitation: when p > 0.3 is chosen, ML-EKM-ELM may not outperform ML-KELM in terms of training time for some applications. The reason is that ML-EKM-ELM applies a standard SVD (singular value decomposition) to a submatrix of the kernel matrix in the training stage, and the size of this submatrix is proportional to p. When p > 0.3, the SVD step dominates the computation and becomes prohibitive. Therefore, in the future, an approximate and fast SVD method [30] is needed to overcome this problem.
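As an illustration of that remedy, the sketch below shows how a randomized range-finder SVD (in the spirit of the fast approximate methods referred to by [30]) could replace the exact SVD of the l × l landmark submatrix when p is large; the target rank, oversampling amount, and NumPy implementation are our own assumptions, not part of the proposed algorithm.

```python
import numpy as np

def randomized_svd(W, k, oversample=10, seed=0):
    """Approximate top-k SVD of the (symmetric PSD) landmark kernel block W.
    Cost is roughly O(l^2 (k + oversample)) instead of the O(l^3) of a full SVD."""
    rng = np.random.default_rng(seed)
    l = W.shape[0]
    # Random range finder: sketch the range of W with a thin Gaussian test matrix.
    Omega = rng.standard_normal((l, k + oversample))
    Q, _ = np.linalg.qr(W @ Omega)             # orthonormal basis for the sketched range
    B = Q.T @ W                                # small (k + oversample) x l matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                                 # lift the factors back to the original space
    return U[:, :k], s[:k], Vt[:k]

# Inside approximate_ekm (see the sketch in Section 4.3), the exact decomposition
#   U, s, _ = np.linalg.svd(W)
# could then be swapped for
#   U, s, _ = randomized_svd(W, k=500)
# which keeps only the top k components, so the resulting EKM has k dimensions.
```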

Acknowledgments

The work is supported by the University of Macau under project codes MYRG2018-00138-FST and MYRG2016-00134-FST.

References

[1] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2 (2009) 1-127.
[2] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 313 (2006) 504-507.
[3] J. Cao, W. Wang, J. Wang, R. Wang, Excavation equipment recognition based on novel acoustic statistical features, IEEE Transactions on Cybernetics, 47 (2017) 4392-4404.
[4] J. Cao, K. Zhang, M. Luo, C. Yin, X. Lai, Extreme learning machine and adaptive sparse representation for image classification, Neural Networks, 81 (2016) 91-102.
[5] L. Xi, B. Chen, H. Zhao, J. Qin, J. Cao, Maximum correntropy Kalman filter with state constraints, IEEE Access, 5 (2017) 25846-25853.
[6] W. Wang, Y. Huang, Y. Wang, L. Wang, Generalized autoencoder: a neural network framework for dimensionality reduction, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 490-497.
[7] L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative review, Journal of Machine Learning Research, 10 (2009) 66-71.
[8] J. Deng, Z. Zhang, E. Marchi, B. Schuller, Sparse autoencoder-based feature transfer learning for speech emotion recognition, in: Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 511-516.
[9] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 37-49.
[10] L.L.C. Kasun, H. Zhou, G.B. Huang, C.M. Vong, Representational learning with ELMs for big data, IEEE Intelligent Systems, 28 (6) (2013) 31-34.
[11] C.M. Wong, C.M. Vong, P.K. Wong, J. Cao, Kernel-based multilayer extreme learning machines for representation learning, IEEE Transactions on Neural Networks and Learning Systems, 29 (3) (2018) 757-762.
[12] J. Tang, C. Deng, G.B. Huang, Extreme learning machine for multilayer perceptron, IEEE Transactions on Neural Networks and Learning Systems, 27 (2016) 809-821.
[13] G.B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42 (2012) 513-529.
[14] K. Zhang, L. Lan, Z. Wang, F. Moerchen, Scaling up kernel SVM on limited resources: a low-rank linearization approach, in: JMLR Proceedings Track, vol. 22, 2012, pp. 1425-1434.
[15] A. Golts, M. Elad, Linearized kernel dictionary learning, IEEE Journal of Selected Topics in Signal Processing, 10 (2016) 726-739.
[16] F. Pourkamali-Anaraki, S. Becker, A randomized approach to efficient kernel clustering, in: IEEE Global Conference on Signal and Information Processing, 2016, pp. 207-211.
[17] A. Gittens, M.W. Mahoney, Revisiting the Nyström method for improved large-scale machine learning, Journal of Machine Learning Research, 28 (2013) 567-575.
[18] F. Pourkamali-Anaraki, S. Becker, Randomized clustered Nyström for large-scale kernel machines, arXiv preprint arXiv:1612.06470, 2016.
[19] B. Schölkopf, S. Mika, C.J. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, A.J. Smola, Input space versus feature space in kernel-based methods, IEEE Transactions on Neural Networks, 10 (1999) 1000-1017.
[20] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Sciences, Irvine, CA, 2013. http://archive.ics.uci.edu/ml.
[21] J. Vanschoren, J.N. Van Rijn, B. Bischl, L. Torgo, OpenML: networked science in machine learning, ACM SIGKDD Explorations Newsletter, 15 (2014) 49-60.
[22] I. Guyon, S. Gunn, A. Ben-Hur, G. Dror, Result analysis of the NIPS 2003 feature selection challenge, Advances in Neural Information Processing Systems, 2005, pp. 545-552.
[23] Y. LeCun, F.J. Huang, L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 97-104.
[24] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096-1103.
[25] G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18 (2006) 1527-1554.
[26] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in: Proceedings of Artificial Intelligence and Statistics, 2009, pp. 448-455.
[27] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[28] T. Yang, Y.F. Li, M. Mahdavi, R. Jin, Z.H. Zhou, Nyström method vs random Fourier features: a theoretical and empirical comparison, Advances in Neural Information Processing Systems, 2012, pp. 476-484.
[29] S. Kumar, M. Mohri, A. Talwalkar, Ensemble Nyström method, Advances in Neural Information Processing Systems, 2009, pp. 1060-1068.
[30] M. Li, J.T. Kwok, B. Lu, Making large-scale Nyström approximation possible, in: Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 631-638.


Chi-Man VONG received the M.S. and Ph.D. degrees in software engineering from the University of Macau, Macau, in 2000 and 2005, respectively. He is currently an Associate Professor with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau. His research interests include machine learning methods and intelligent systems.


Chuangquan Chen received the B.S. and M.S. degrees in mathematics from South China Agriculture University, Guangzhou, China, in 2010 and 2013, respectively. He is currently pursuing the Ph.D. degree in Computer Science, University of Macau, Macau, China. His current research interests include machine learning and data mining.


Pak-Kin Wong received the Ph.D. degree in Mechanical Engineering from The Hong Kong Polytechnic University, Hong Kong, in 1997. He is currently a Professor in the Department of Electromechanical Engineering and Associate Dean (Academic Affairs), Faculty of Science and Technology, University of Macau. His research interests include automotive engineering, fluid transmission and control, artificial intelligence, mechanical vibration, and manufacturing technology for biomedical applications. He has published over 200 scientific papers in refereed journals, book chapters, and conference proceedings.