Empirical kernel map-based multilayer extreme learning machines for representation learning

Empirical kernel map-based multilayer extreme learning machines for representation learning

Accepted Manuscript Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning Chi-Man Vong , Chuangquan Chen , Pak-...

1MB Sizes 0 Downloads 89 Views

Accepted Manuscript

Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning Chi-Man Vong , Chuangquan Chen , Pak-Kin Wong PII: DOI: Reference:

S0925-2312(18)30584-8 10.1016/j.neucom.2018.05.032 NEUCOM 19585

To appear in:

Neurocomputing

Received date: Revised date: Accepted date:

22 January 2018 8 May 2018 9 May 2018

Please cite this article as: Chi-Man Vong , Chuangquan Chen , Pak-Kin Wong , Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning, Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.032

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning 1 Chi-Man Vong ,Chuangquan Chen1, Pak-Kin Wong2 1

Department of Computer and Information Science, University of Macau, Macau 2 Department of Electromechanical Engineering, University of Macau, Macau

Abstract

AC

CE

PT

ED

M

AN US

CR IP T

Recently, multilayer extreme learning machine (ML-ELM) and hierarchical extreme learning machine (H-ELM) were developed for representation learning whose training time can be reduced from hours to seconds compared to traditional stacked autoencoder (SAE). However, there are three practical issues in ML-ELM and H-ELM: 1) the random projection in every layer leads to unstable and suboptimal performance; 2) the manual tuning of the number of hidden nodes in every layer is time-consuming; and 3) under large hidden layer, the training time becomes relatively slow and a large storage is necessary. More recently, issues (1) and (2) have been resolved by kernel method, namely, multilayer kernel ELM (ML-KELM), which encodes the hidden layer in form of a kernel matrix (computed by using kernel function on the input data), but the storage and computation issues for kernel matrix pose a big challenge in large-scale application. In this paper, we empirically show that these issues can be alleviated by encoding the hidden layer in the form of an approximate empirical kernel map (EKM) computed from low-rank approximation of the kernel matrix. This proposed method is called ML-EKM-ELM, whose contributions are: 1) stable and better performance is achieved under no random projection mechanism; 2) the exhaustive manual tuning on the number of hidden nodes in every layer is eliminated; 3) EKM is scalable and produces a much smaller hidden layer for fast training and low memory storage, thereby suitable for large-scale problems. Experimental results on benchmark datasets demonstrated the effectiveness of the proposed ML-EKM-ELM. As an illustrative example, on the NORB dataset, ML-EKM-ELM can be respectively up to 16 times and 37 times faster than ML-KELM for training and testing with a little loss of accuracy of 0.35%, while the memory storage can be reduced up to 1/9.

Index Terms: Kernel learning, Multilayer extreme learning machine (ML-ELM), Empirical kernel map (EKM), Representation learning, stacked autoencoder (SAE).

1. Introduction Autoencoder (AE) is an unsupervised neural network whose input layer is equal to โœ‰ Corresponding author: Chi-Man Vong ([email protected])

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

output layer [1, 2]. AE offers an alternative way for traditional noise reduction and traditional feature extraction [3-5], and automatically extracts effective representation from the raw data. Several AEs can be used as building blocks to form a stacked AE (SAE) [1],which is capable of extracting different levels of abstract features from raw data so that it is suitable for applications of dimension reduction [6, 7] and transfer learning [8, 9]. However, the iterative training procedure in SAE based on backpropagation is extremely time-consuming. For this reason, multilayer extreme learning machine (ML-ELM) [10] was proposed (as shown in Fig.1), where multiple ELM-based AEs (ELM-AE) can be stacked for representation learning, followed by a final layer for classification. In ELM-AE [Fig.1(a)], the training mechanism randomly chooses the input weights ๐š(๐‘–) and bias ๐‘ (๐‘–) for the input layer of AE and then analytically compute the transformation matrix ๐šช (๐‘–) . ELM-AE outputs ๐šช (๐‘–) to learn a new input representation [Fig.1(b)]. After the representation learning is done [Fig.1(d)], the final data representation ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is obtained and used as hidden layer to calculate the output weight ๐œท for classification using regularized least squares. Without any iteration, ML-ELM can reduce the training time from hours to seconds compared to traditional SAE.

Fig. 1. The architecture of ML-ELM [11]: (a) The transformation ๐šช (1) is obtained for representation T

learning in ELM-AE; (b) The new input representation ๐ฑ (2) is computed by ๐‘” (๐ฑ (1) โˆ™ (๐šช (1) ) ), where ๐‘” is an activation function. (c) ๐ฑ (2) is the input to ELM-AE for another representation learning. (d) After the unsupervised representation learning is finished, ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is obtained and used as input to calculate the output weight ๐œท for classification using regularized least squares.

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

More recently, a variant of ML-ELM called hierarchical ELM (H-ELM) [12] was developed. Compared to ML-ELM that directly uses the final data representation ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ as hidden layer to calculate the output weight ๐œท, H-ELM uses ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ as input to an individual ELM for classification (i.e., first randomly maps ๐ฑ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ into hidden layer), thereby maintaining the universal approximation capability of ELM. Both ML-ELM and H-ELM are very attractive due to their fast training speed. However, there are several practical issues of ML-ELM and H-ELM: 1) Suboptimal model generalization: the random input weight ๐š(๐‘–) and bias ๐‘ (๐‘–) in every layer lead to unstable and suboptimal performance. In some cases, poorly generated ๐š(๐‘–) and ๐‘ (๐‘–) may hinder the network from high generalization and hence numerous trials on ๐š(๐‘–) and ๐‘ (๐‘–) are necessary. 2) Exhaustive tuning: The accuracy of ML-ELM and H-ELM are drastically influenced by the number of hidden nodes ๐ฟ๐‘– that requires exhaustive tuning ranging from several hundreds to tens of thousands subject to the complexity of the application. Hence, numerous tedious trials are needed to determine the optimal ๐ฟ๐‘– . 3) Relatively slow training time and high memory requirement under large ๐ฟ๐‘– : In large-scale application, an accurate ML-ELM model can easily contain thousands or tens of thousands of hidden neurons. For example, ๐ฟ๐‘– can be up to 15000 for image dataset NORB[12]. Practically, many devices such as mobile phones are only with a limited amount of expensive memory so that it may be insufficient to store and run large model. It is therefore crucial to run a compact model on limited resources while maintaining satisfactory accuracy. Analytically, the memory requirement and training time of ML-ELM in the i-th layer are respectively O(๐‘›๐ฟ๐‘– ) and O(๐ฟ๐‘– 2 ๐‘›) for n training samples, which pose a challenge to train a model for large ๐ฟ๐‘– , and hence a compact model with faster solution is of interest. From the literature [13], kernel learning is well known for its optimal performance without any random input parameters. Under this inspiration, a kernel version of ML-ELM was proposed, namely, multilayer kernel ELM (ML-KELM) [11], which encodes the hidden layers in form of kernel matrix. Without tuning the parameter ๐ฟ๐‘– , ๐š(๐‘–) and ๐‘ (๐‘–) , ML-KELM successfully addresses the first two issues of ML-ELM and H-ELM, but worsen the third issue about high memory storage and very slow training time. The reason is that the memory storage and computation issues for kernel matrix ๐›€(๐‘–) โˆˆ R๐‘›ร—๐‘› in each layer pose a big challenge in large-scale application. Specifically, its takes a memory of O(๐‘›2 ) to store ๐›€(๐‘–) , and a time of O(๐‘›3 ) to find the inverse of ๐›€(๐‘–) . Thus, both memory and computation cost will grow exponentially along the training data size ๐‘›, which restrict the application of ML-KELM in large-scale problems especially on mobile devices. ฬƒ๐† ฬƒT , From the literature [14-16], a kernel matrix ๐›€ can be decomposed by ๐›€ โ‰ˆ ๐† ฬƒ. ๐† ฬƒ contains most obtaining an approximate empirical kernel map (EKM) ๐† discriminant information of ๐›€ but with much smaller size, which is very efficient for learning over large-scale data [14-16]. 
Under this inspiration, a multilayer empirical kernel map ELM (ML-EKM-ELM) is proposed in this paper which addresses the aforementioned issues of ML-ELM and H-ELM by encoding every hidden layer in

ACCEPTED MANUSCRIPT

CE

PT

ED

M

AN US

CR IP T

ฬƒ (๐‘–) with ๐‘™๐‘– dimensions (where ๐‘™๐‘– โ‰ช ๐‘›) computed form of an approximate EKM ๐† from the low-rank approximation of ๐›€(๐‘–) . Moreover, Nystrรถm method [17, 18] is ฬƒ (๐‘–) due to its efficiency in many large-scale adopted in our work to generate ๐† machine learning problems, which does not need to calculate the entire kernel matrix ฬƒ (๐‘–) . ๐›€(๐‘–) and only operates on a small subset of ๐›€(๐‘–) to generate ๐† Compared to ML-ELM and H-ELM, the contributions of ML-EKM-ELM are: 1) Benefited from kernel learning, the constructed kernel matrix does not require any randomly generated parameter (input ๐š(๐‘–) and bias ๐‘ (๐‘–) ) in each layer so that a stable and theoretically optimal performance is always achieved. EKM efficiently approximates the kernel matrix and shares its stable and optimal performance. 2) The exhaustive tuning of ๐ฟ๐‘– is eliminated. Only a few trials of ๐‘™๐‘– is necessary in ML-EKM-ELM, where ๐‘™๐‘– = 0.01๐‘›, 0.05๐‘›, 0.1๐‘›, 0.2๐‘› typically (to be detailed in Section 4). For simplicity, ๐‘™๐‘– can be equally set in every layer (i.e., ๐‘™1 = ๐‘™2 = ๐‘™3 = โ‹ฏ), while maintaining a satisfactory performance. 3) EKM is scalable (i.e., ๐‘™๐‘– can be set with different values) and easy for storage and computation because it can be a very small matrix for hidden representation (i.e., setting ๐‘™๐‘– to a small value is sufficient for practical application), thereby suitable for large-scale problems and mobile devices. Compared to ML-KELM, the benefits of ML-EKM-ELM are: 1) ML-KELM takes O(๐‘›2 ) to store the matrix ๐›€(๐‘–) for the i-th layer while ฬƒ (๐‘–) , which is linear in the data ML-EKM-ELM takes O(๐‘›๐‘™๐‘– ) to store the matrix ๐† size ๐‘› for ๐‘™๐‘– โ‰ช ๐‘›. 2) ML-KELM requires O(๐‘›3 ) time to find the inverse of ๐›€(๐‘–) while ฬƒ (๐‘–) , which saves ML-EKM-ELM only takes O(๐‘›๐‘™๐‘– 2 ) time to find the inverse of ๐† substantial computing resources. 3) Benefited from the above two advantages, much smaller matrices can be constructed for feasible storage and execution in mobile devices. The rest of this paper is organized as follows. Section 2 introduces the related works including EKM, ML-ELM, and ML-KELM. Section 3 introduces ML-EKM-ELM. In Section 4, ML-EKM-ELM is compared with ML-KELM and other existing deep neural networks over benchmark data. Finally, conclusions are given in Section 5.

AC

2. Related works In this section, we briefly describe EKM, ML-ELM, and ML-KELM. All techniques are necessary to develop our proposed ML-EKM-ELM. 2.1 Empirical kernel map (EKM) Let ๐— = [๐ฑ1 , โ€ฆ , ๐ฑ ๐‘› ]T โˆˆ R๐‘›ร—๐‘‘ be an input data matrix that contains ๐‘› data points in R๐‘‘ as its rows. In kernel learning, the inner products in feature space can be computed by adopting a kernel function ฮบ(. , . , . ) on the input data space: ๐›€

,

= ฮบ(๐ฑ , ๐ฑ , ) = โŒฉ (๐ฑ ), (๐ฑ )โŒช,

, = 1โ€ฆ,๐‘›

(1)

where is an adjustable parameter, ๐›€ โˆˆ ๐‘… ๐‘›ร—๐‘› is the kernel matrix and (๐ฑ) is the kernel-induced feature map [19]. The kernel matrix ๐›€ can be decomposed by

ACCEPTED MANUSCRIPT

R๐‘›ร—๐‘™ and ๐›€๐‘™๐‘™ โˆˆ R๐‘™ร—๐‘™ , where (๐›€๐‘›๐‘™ )

,

CR IP T

๐›€ = ๐†๐†T ,where ๐† = [ (๐ฑ1 ), (๐ฑ 2 ), โ€ฆ , (๐ฑ ๐‘› )]T โˆˆ ๐‘… ๐‘›ร—๐ท (where ๐ท is the rank of ๐›€) is called empirical kernel map (EKM), which is a matrix containing ๐‘› data points in R๐ท as its row. Since the dimensionality ๐ท of ๐† is usually large, an approximate ฬƒ โˆˆ ๐‘… ๐‘›ร—๐‘™ is practically computed from the low-rank (and much smaller) EKM ๐† approximation of ๐›€ using Nystrรถm method, where ๐‘™ โ‰ช ๐‘› and specifically, ฬƒ๐† ฬƒT . The Nystrรถm method works by selecting a small subset of the training data ๐›€โ‰ˆ๐† (e.g., randomly select l samples) referred to as landmark points and constructing a small subset of ๐›€ by computing the kernel similarities between the input data points and landmark points. The Nystrรถm method then operates on this small subset of ๐›€ ฬƒ of ๐‘™ dimensions. and determines an appropriate EKM ๐† T ๐‘™ร—๐‘‘ Let ๐• = [๐ฏ1 , โ€ฆ , ๐ฏ๐‘™ ] โˆˆ R be the set of l landmark points in R๐‘‘ , which is selected from ๐—. The Nystrรถm method first generates two small matrices ๐›€๐‘›๐‘™ โˆˆ = ฮบ(๐ฑ , ๐ฏ , ) , and (๐›€๐‘™๐‘™ )

,

= ฮบ(๐ฏ , ๐ฏ , ) .

M

AN US

Both ๐›€๐‘›๐‘™ and ๐›€๐‘™๐‘™ are the sub-matrices of ๐›€. Next, both ๐›€๐‘›๐‘™ and ๐›€๐‘™๐‘™ are used to approximate the kernel matrix ๐›€: ฬƒ = ๐›€๐‘›๐‘™ ๐›€๐‘™๐‘™ ๐›€T๐‘›๐‘™ (2) ๐›€โ‰ˆ๐›€ where ๐›€๐‘™๐‘™ represents the pseudoinverse of ๐›€๐‘™๐‘™ . By applying eigen-decomposition on ๐›€๐‘™๐‘™ , ๐›€๐‘™๐‘™ = ๐”๐‘™ ๐šฒ๐‘™ ๐”๐‘™T is obtained, where ๐šฒ๐‘™ โˆˆ R๐‘™ร—๐‘™ and ๐”๐‘™ โˆˆ R๐‘™ร—๐‘™ contain the ๐‘™ eigenvalues and the corresponding eigenvectors of ๐›€๐‘™๐‘™ , respectively. Then the approximation of (2) can be expressed as: ฬƒ=๐† ฬƒ๐† ฬƒT (3) ๐›€โ‰ˆ๐›€ and 1

ฬƒ = ๐›€๐‘›๐‘™ ๐”๐‘™ ๐šฒ 2 โˆˆ R๐‘›ร—๐‘™ ๐† ๐‘™

(4)

1/2

ฬƒ is only linear in the . For ๐‘™ โ‰ช ๐‘›, the computational cost to construct ๐†

CE

๐›€๐‘›๐‘™ ๐”๐‘™ ๐šฒ๐‘™

PT

ED

ฬƒ serves as an approximate EKM, and the rows of ๐† ฬƒ are known as virtual where ๐† samples [19]. ฬƒ. Specifically, The Nystrรถm method takes O(๐‘‘๐‘›๐‘™ + ๐‘™ 3 + ๐‘›๐‘™ 2 ) time to generate ๐† it takes O(๐‘‘๐‘›๐‘™) of time to form the matrices ๐›€๐‘›๐‘™ and ๐›€๐‘™๐‘™ , and takes O(๐‘™ 3 ) to apply eigen-decomposition on ๐›€๐‘™๐‘™ , and takes O(๐‘›๐‘™ 2 ) time to calculate the product of

AC

data set size ๐‘›. 2.2 Multilayer Extreme learning machines (ML-ELM) (๐‘–)

(๐‘–) T

(๐‘–)

Referring to Fig. 1, let ๐— (๐‘–) = [๐ฑ1 , โ€ฆ . , ๐ฑ ๐‘› ] , where x

is the i-th data

representation for input x ,for k =1 to n. Let ๐‡ (๐‘–) be the i-th hidden layer output matrix with respect to ๐— (๐‘–) . Then the i-th transformation matrix ๐šช (๐‘–) can be learned by (5) ๐‡ (๐‘–) ๐šช (๐‘–) = ๐— (๐‘–) where

ACCEPTED MANUSCRIPT

๐‡ (๐‘–) = [

1

1 (๐‘–)

and

(๐‘–)

(๐‘–)

(๐‘–)

(๐š1 (๐‘–) , ๐‘1 (๐‘–) , x1 ) โ€ฆ (๐‘–)

(๐š1 (๐‘–) , ๐‘1 (๐‘–) , x๐‘› ) โ€ฆ

(๐‘–)

(๐‘–)

(๐š (๐š

(๐‘–)

(๐‘–)

,๐‘ ,๐‘

(๐‘–)

(๐‘–)

(๐‘–)

, x1 ) (๐‘–)

]

(6)

, x๐‘› )

(๐š(๐‘–) , ๐‘ (๐‘–) , x(๐‘–) ) = ๐‘”๐‘– (๐š(๐‘–) x(๐‘–) + ๐‘ (๐‘–) ), where ๐‘”๐‘– is the activation function in

the i-th layer, and both input ๐š(๐‘–) and bias ๐‘ (๐‘–) are randomly generated in the i-th layer. ๐šช (๐‘–) can be calculated by

๐šช { where

and

(๐‘–)

๐‘›

๐‘› ๐‘–

+ ๐‡ (๐‘–) (๐‡ (๐‘–) )T ) 1 ๐— (๐‘–)

๐‘–

๐ฟ๐‘– (7)

(๐‘–) T

=(

,๐‘›

CR IP T

๐šช (๐‘–) = (๐‡ (๐‘–) )T (

(๐‘–)

1

(๐‘–) T (๐‘–)

+ (๐‡ ) ๐‡ ) (๐‡ ) ๐—

,๐‘›

๐ฟ๐‘–

represent an identity matrix of dimension ๐ฟ๐‘– and n, respectively,

AN US

and the user-specified ๐‘– is for regularization used in the i-th layer. In (7), ๐šช (๐‘–) is used for representation learning and by multiplying ๐— (๐‘–) with ๐šช (๐‘–) , a new data representation ๐— ๐‘–+1 is obtained as shown in Fig. 1(b): (8) ๐— ๐‘–+1 = ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T ) The final data representation of ๐— (1) , namely ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , is obtained after the learning procedure in Fig.1(d) is done. Then ML-ELM directly uses ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ as hidden layer to calculate the output weight ๐›ƒ (9) ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐›ƒ =

M

where = [๐ญ1 , โ€ฆ . , ๐ญ ๐‘› ]T , and ๐ญ โˆˆ ๐‘… ๐‘ is a one-hot output vector, and c is the number of classes. The weight matrix ๐›ƒ can be solved by T

๐‘“๐‘–๐‘›๐‘Ž๐‘™

T

๐‘“๐‘–๐‘›๐‘Ž๐‘™

T

+ ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) )

+ (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) 1 (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )

PT

๐›ƒ=( {

๐‘›

ED

๐›ƒ = (๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) (

1

T

,๐‘›

๐ฟ๐‘“๐‘–๐‘›๐‘Ž๐‘™

,๐‘›

๐ฟ๐‘“๐‘–๐‘›๐‘Ž๐‘™

(10)

AC

CE

2.3 Multilayer Kernel Extreme learning machines (ML-KELM) In [11], ML-KELM was developed by integrating kernel learning into ML-ELM to achieve high generalization with less user intervention. ML-KELM contains two steps: 1) Unsupervised representation learning by stacking kernel version of ELM-AEs, namely, KELM-AE; 2) Supervised feature classification using a kernel version of ELM (i.e., K-ELM).

ACCEPTED MANUSCRIPT

Input ๐ฑ ๐‘”

Output

Hidden layer

(๐‘–)

๐Š

๐ฑ (๐‘–)

(๐‘–)

(๐‘–) ๐‘ฅ1

(๐‘–)

(๐‘–)

๐šช (๐‘–)

K1 (๐‘–)

(๐‘–)

โ€ฆ

๐‘ฅ2

๐‘ฅ2

โ€ฆ

โ€ฆ

(๐‘–)

CR IP T

K๐‘›

(๐‘–)

๐‘ฅ๐‘‘

(๐‘–)

๐‘ฅ๐‘‘

๐— (๐‘–)

๐— (๐‘–)

๐›€(๐‘–)

AN US

For all ๐ฑ (๐‘–) :

๐‘ฅ1

Fig. 2. The architecture of the i-th KELM-AE [11], in which hidden layer is encoded in form of a kernel matrix ๐›€(๐‘–)

Identical to ELM-AE, KELM-AE learns the transformation

๐šช (๐‘–) from hidden

layer to output layer. From Fig.2, kernel matrix ๐›€(๐‘–) = [๐Š(๐‘–) (๐ฑ1 (๐‘–) ), โ€ฆ , ๐Š(๐‘–) (๐ฑ ๐‘› (๐‘–) )]T

M

is first obtained by using kernel function ฮบ(๐‘–) (๐ฑ

(๐‘–)

,๐ฑ

(๐‘–)

,

๐‘–)

on the input matrix

ED

๐— (๐‘–) , and then the i-th transformation matrix ๐šช (๐‘–) in KELM-AE is learned similar to ELM-AE in (5) (11) ๐›€(๐‘–) ๐šช (๐‘–) = ๐— (๐‘–)

PT

Similar to (7), ๐šช (๐‘–) in (11) is obtained by ๐šช( ) = (

๐‘› ๐‘–

+ ๐›€(๐‘–) ) 1 ๐— (๐‘–)

CE

The data representation ๐— (๐‘–+1) is calculated similar to (8) ๐— (๐‘–+1) = ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T )

(12) (13)

AC

After the unsupervised representation learning procedure is finished, the final data representation ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is obtained and used as input to train a K-ELM classification (14) ๐›€ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐œท = where ๐›€ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is the kernel matrix defined on ๐— final . The weight matrix ๐œท can be solved by: ๐œท=(

๐‘›

+ ๐›€ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )

๐‘“๐‘–๐‘›๐‘Ž๐‘™ (1)

(15)

Given a set of m test samples ๐™ = [๐ณ1 (1) , โ€ฆ , ๐ณ๐‘š (1) ]T โˆˆ R๐‘šร—๐‘‘ . In the stage of representation learning, the data representation ๐™(๐‘–+1) is obtained by multiplying the i-th transformation matrix ๐šช (๐‘–) (16) ๐™(๐‘–+1) = ๐‘”๐‘– (๐™(๐‘–) (๐šช (๐‘–) )T ) Then, the final data representation ๐™final = [๐ณ1 ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , โ€ฆ , ๐ณ๐‘š ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ]T is obtained and

ACCEPTED MANUSCRIPT used to calculate the test kernel matrix ๐›€๐‘ โˆˆ R๐‘šร—๐‘› (๐›€๐‘ ) where ๐ฑ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,

ฮบ(๐ณ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,๐ฑ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,

(17)

๐‘“๐‘–๐‘›๐‘Ž๐‘™ )

ฬƒ is is the j-th data point from ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ . Finally, the networkโ€™s output ๐˜

given by

ฬƒ = ๐›€๐‘ ๐›ƒ ๐˜

(18)

Remark: For special case, when linear piecewise activation ๐‘”๐‘– was applied to all

๐šช (๐‘–

1)

CR IP T

๐šช (๐‘–) , ๐šช (๐‘–) can be unified into a single matrix ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ (i.e., ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ = ๐šช (๐‘–) โˆ™ โ‹ฏ ๐šช (1) ). For execution, ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ = ๐™(1) (๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ )T is directly computed so that

both issues of memory storage and execution time caused by deep neural network can be alleviated.

AN US

3 Proposed ML-EKM-ELM

Input

(๐‘–)

(๐‘–)

๐œ™1

(๐‘–)

๐œ™๐‘™๐‘–

(๐‘–)

๐‘ฅ๐‘‘

๐— (๐‘–)

๐šช (๐‘–)

๐‘ฅ1

(๐‘–)

๐‘ฅ2

โ€ฆ

โ€ฆ

CE

ฬƒ (๐‘–) ๐“

โ€ฆ

(๐‘–)

AC

๐ฑ (๐‘–)

(๐‘–)

๐‘ฅ2

For all ๐ฑ (๐‘–) :

Output

Hidden layer

๐‘ฅ1

PT

๐‘”2 ๐‘”

ED

๐ฑ

(๐‘–)

M

The proposed ML-EKM-ELM is developed by replacing the randomly generated ฬƒ (๐‘–) computed from low-rank hidden layer in ML-ELM into an approximate EKM ๐† approximation of ๐›€(๐‘–) . In ML-EKM-ELM, EKM version of ELM-AEs (EKM-AE) are stacked for representation learning, followed by a final layer of EKM version of ELM for classification.

(๐‘–)

๐‘ฅ๐‘‘ ฬƒ (๐‘–) ๐†

๐— (๐‘–)

Fig. 3. The architecture of the i-th EKM -AE, in which the hidden layer is encoded in form of ฬƒ (๐‘–) . an approximate EKM ๐†

3.1 EKM-AE In this section, the details of EKM-AE are discussed. As shown in Fig. 3, the input ฬƒ (๐‘–) , where matrix ๐— (๐‘–) is first mapped into empirical kernel map ๐†

ACCEPTED MANUSCRIPT

ฬƒ (๐‘–) = [ ฬƒ (๐‘–) (๐ฑ1 (๐‘–) ), โ€ฆ , ฬƒ (๐‘–) (๐ฑ ๐‘› (๐‘–) )]T . The ๐† ฬƒ (๐‘–) with ๐‘™๐‘– -dimension is calculated (to be ๐† detailed in Section 2.1) by first generating two small matrices ๐›€๐‘›๐‘™ (๐‘–) โˆˆ R๐‘›ร—๐‘™ , ๐›€๐‘™ ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™

using

the

randomly

๐‘™๐‘–

selected

landmark

points

๐• (๐‘–) = [๐ฏ1 (๐‘–) , โ€ฆ , ๐ฏ๐‘™ (๐‘–) ]T from ๐— (๐‘–) such that ,

= ฮบ(๐ฑ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

(๐›€๐‘™ ๐‘™ (๐‘–) )

,

= ฮบ(๐ฏ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

and

(19)

CR IP T

(๐›€๐‘›๐‘™ (๐‘–) )

(20)

Then, ๐”๐‘™ (๐‘–) and ๐šฒ๐‘™ (๐‘–) are obtained by applying eigen-decomposition on ๐›€๐‘™ ๐‘™ (๐‘–)

where ๐šฒ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™

and

(21)

AN US

๐›€๐‘™ ๐‘™ (๐‘–) = ๐”๐‘™ (๐‘–) ๐šฒ๐‘™ (๐‘–) ( ๐”๐‘™ (๐‘–) )T ๐”๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™

contain the ๐‘™๐‘– eigenvalues and the

corresponding eigenvectors of ๐›€๐‘™ ๐‘™ (๐‘–) , respectively. Next, ๐”๐‘™ (๐‘–) , ๐šฒ๐‘™ (๐‘–) and ๐›€๐‘›๐‘™ (๐‘–)

ฬƒ (๐‘–) = (๐›€๐‘›๐‘™ (๐‘–) ) ๐† (๐‘–)

๐‘›ร—๐‘™

โˆˆR

(๐‘–)

,

= ( ๐”๐‘™

(๐‘–)

)(๐šฒ๐‘™

(๐‘–)

)

1 2

(22)

โˆˆ R๐‘™ ร—๐‘™

is the mapping matrix of the i-th layer. Then, the rank-๐‘™๐‘– approximation

ED

where

(๐‘–)

M

ฬƒ (๐‘–) : are used to construct ๐†

ฬƒ (๐‘–) ๐† ฬƒ (๐‘–) T . Practically, it is preferred to replace the ๐›€(๐‘–) can be expressed as ๐›€(๐‘–) โ‰ˆ ๐†

CE

PT

ฬƒ (๐‘–) rather than ๐›€(๐‘–) because ๐† ฬƒ (๐‘–) is much smaller hidden layer in ELM-AE by ๐† than ๐›€(๐‘–) while maintaining most of the discriminant information of ๐›€(๐‘–) . EKM is detailed in Algorithm 1.

AC

Algorithm 1: EKM for the i-th layer Input: Input matrix ๐— (๐‘–) ,kernel function ฮบ(. , . , . ) , kernel parameter landmark set size ๐‘™๐‘– ฬƒ (๐‘–) Output: Mapping matrix (๐‘–) , empirical kernel map ๐‘ฎ

๐‘–

Step 1: Randomly select ๐‘™๐‘– landmark points ๐• (๐‘–) = [๐ฏ1 (๐‘–) , โ€ฆ , ๐ฏ๐‘™ (๐‘–) ]T from ๐— (๐‘–) Step 2: Generate the kernel matrix ๐›€๐‘›๐‘™ (๐‘–) โˆˆ R๐‘›ร—๐‘™ , ๐›€๐‘™ ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™ (๐›€๐‘›๐‘™ (๐‘–) )

,

ฮบ(๐ฑ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

(๐›€๐‘™ ๐‘™ (๐‘–) )

,

ฮบ(๐ฏ

(๐‘–)

, ๐ฏ (๐‘–) ,

๐‘–)

and

and

ACCEPTED MANUSCRIPT

Step 3: Calculate

๐”๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™ and ๐šฒ๐‘™ (๐‘–) โˆˆ R๐‘™ ร—๐‘™ by applying

eigen-decomposition on ๐›€๐‘™ ๐‘™ (๐‘–) = ๐”๐‘™ (๐‘–) ๐šฒ๐‘™ (๐‘–) ( ๐”๐‘™ (๐‘–) )T Step 4: Calculate the mapping matrix (๐‘–)

(๐‘–)

= ( ๐”๐‘™

(๐‘–)

)(๐šฒ๐‘™ (๐‘–) )

1 2

ฬƒ (๐‘–) ๐† Return

(๐‘–)

(๐›€๐‘›๐‘™

ฬƒ (๐‘–) ,๐†

(๐‘–)

)

(๐‘–)

CR IP T

ฬƒ (๐‘–) Step 5: Calculate the empirical kernel map ๐‘ฎ

AN US

Next, the i-th transformation matrix ๐šช (๐‘–) in EKM-AE is learned similar to ELM-AE in (5) ฬƒ (๐‘–) ๐šช (๐‘–) = ๐— (๐‘–) (23) ๐† ๐šช (๐‘–) can be solved by ๐šช (๐‘–) = (

๐‘™ ๐‘–

ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) ) 1 (๐† ฬƒ (๐‘–) )T ๐— (๐‘–) + (๐†

(24)

M

Compared to (12) in which KELM-AE needs to invert a matrix ๐›€(๐‘–) of size ๐‘› ร— ๐‘›, ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) of size just ๐‘™๐‘– ร— ๐‘™๐‘– , EKM-AE only needs to invert a much smaller matrix (๐† ๐‘™๐‘– โ‰ช ๐‘›. The data representation ๐— (๐‘–+1) is calculated similar to (8) (25) ๐— ๐‘–+1 = ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T )

ED

EKM-AE is detailed in algorithm 2.

AC

CE

PT

Algorithm 2: EKM-AE for the i-th Layer Input: Input matrix ๐— (๐‘–) , regularization ๐‘– , kernel function ฮบ(. , . , . ) ,kernel parameter ๐‘– , activation function ๐‘”๐‘– and landmark set size ๐‘™๐‘– Output: New data representation ๐— (๐‘–+1), mapping matrix (๐‘–) , empirical kernel map ฬƒ (๐‘–) , and transformation matrix ๐šช (๐‘–) ๐† ฬƒ (๐‘–) Step 1: Calculate (๐‘–) , ๐† EKM(๐— (๐‘–) , ฮบ(. , . , . ), ๐‘– , ๐‘™๐‘– ) Step 2: Estimate the transformation matrix ๐šช (๐‘–) ๐šช (๐‘–)

(

๐‘™ ๐‘–

ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) ) 1 (๐† ฬƒ (๐‘–) )T ๐— (๐‘–) + (๐†

Step3: Calculate new data representation ๐— (๐‘–+1) โ† ๐‘”๐‘– (๐— (๐‘–) (๐šช (๐‘–) )T ) ฬƒ (๐‘–) and ๐šช (๐‘–) Return ๐— (๐‘–+1) , (๐‘–) , ๐† 3.2 Proposed ML-EKM-ELM ML-EKM-ELM follows the two separate learning procedures of ML-KELM [11]. In the stage of unsupervised representation learning, each ๐šช (๐‘–) and ๐— (๐‘–) (for i-th EKM-AE) is obtained using (24) and (25), respectively. In the stage of supervised ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ โˆˆ ๐‘… ๐‘›ร—๐‘™ feature classification, the final EKM ๐† with respect to ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is conveyed as input for training:

ACCEPTED MANUSCRIPT ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐œท = ๐†

(26)

The output weight ๐œท in (26) is solved by ๐œท=(

๐‘™ ๐‘“๐‘–๐‘›๐‘Ž๐‘™

ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T ๐‘ฎ ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) 1 (๐‘ฎ ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T + (๐‘ฎ

(27)

For the test stage, the data representation ๐™(๐‘–+1) of test data ๐™ (1) is obtained by multiplying the i-th transformation matrix ๐šช (๐‘–) (28) ๐™(๐‘–+1) = ๐‘”๐‘– (๐™(๐‘–) (๐šช (๐‘–) )T )

CR IP T

The final data representation ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ = [๐ณ1 ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , โ€ฆ , ๐ณ๐‘š ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ]T is obtained and used to ฬƒ ๐‘ โˆˆ R๐‘šร—๐‘™ calculate the approximate test kernel matrix ๐›€ (29) ฬƒ ๐‘) , (๐›€ ฮบ(๐ณ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , ๐ฏ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) where ๐ฏ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ is the j-th selected landmark point from ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ in the training phase. ๐‘“๐‘–๐‘›๐‘Ž๐‘™

๐œท

AN US

Finally, the model output is given by ฬƒ๐‘ ฬƒ=๐›€ ๐˜

(30)

The procedure of ML-EKM-ELM is detailed in Algorithm 3.

AC

CE

PT

ED

M

Algorithm 3: Proposed ML-EKM-ELM Training Stage: Input: Input matrix ๐— (1) , Output matrix , regularization ๐‘– , kernel function ฮบ(. , . , . )๏ผŒ kernel parameter ๐‘– , the number of layer s, activation function ๐‘”๐‘– and landmark set size ๐‘™๐‘– Output: landmark points ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ (selected from ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ), ๐‘  โˆ’ 1 transformation matrix ๐šช (๐‘–) , output weight ๐œท and final mapping matrix ๐‘“๐‘–๐‘›๐‘Ž๐‘™ Step 1: for i=1: ๐‘  โˆ’ 1 do Calculate ๐— (๐‘–+1) , ๐šช (๐‘–) EKM โˆ’ AE(๐— (๐‘–) , ๐‘– , ฮบ(. , . , . ), ๐‘– , ๐‘”๐‘– , ๐‘™๐‘– ) Step 2: i ๐‘  ฬƒ (๐‘–) Step 3: Calculate (๐‘–) , ๐† EKM(๐— (๐‘–) , ฮบ(. , . , . ), ๐‘– , ๐‘™๐‘– ) (๐‘–) ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ฬƒ (๐‘–) Step 4: ๐— ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐— (๐‘–) , ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ,๐† ๐† Step 5: Calculate the output weight ๐œท

(

๐‘™ ๐‘“๐‘–๐‘›๐‘Ž๐‘™

ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T ๐† ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ) (๐† ฬƒ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ )T + (๐†

Return ๐šช (๐‘–) (๐‘– = 1,2, โ€ฆ , ๐‘  โˆ’ 1), ๐œท, ๐‘“๐‘–๐‘›๐‘Ž๐‘™ and ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ Prediction Stage Input: test data ๐™(1) , landmark points ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ , mapping matrix transformation matrix ๐šช (๐‘–) (๐‘– = 1,2, โ€ฆ ๐‘  โˆ’ 1), and output weight ๐œท ฬƒ Output: ๐˜ Step 1: for i=1: ๐‘  โˆ’ 1 do Calculate ๐™(๐‘–+1) = ๐‘”๐‘– (๐™(๐‘–) (๐šช (๐‘–) )T ) Step 2: i ๐‘ ; ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ ๐™(๐‘–)

๐‘“๐‘–๐‘›๐‘Ž๐‘™

,

ACCEPTED MANUSCRIPT ฬƒ ๐‘ โˆˆ R๐‘šร—๐‘™ using ๐™๐‘“๐‘–๐‘›๐‘Ž๐‘™ and ๐•๐‘“๐‘–๐‘›๐‘Ž๐‘™ Step 3: Calculate ๐›€ ฬƒ ๐’) (๐›€

,

ฮบ(๐ณ

๐‘“๐‘–๐‘›๐‘Ž๐‘™

ฬƒ๐‘ ฬƒ=๐›€ Step 4: Calculate the networkโ€™s output ๐˜ ฬƒ Return ๐˜

, ๐ฏ ๐‘“๐‘–๐‘›๐‘Ž๐‘™ , ๐‘“๐‘–๐‘›๐‘Ž๐‘™

๐‘–)

๐œท

ED

M

AN US

CR IP T

3.3 Memory and Computational complexity In this section, we take an example of three hidden layers to analyze the memory and time complexity of ML-EKM-ELM. A๏ผŽMemory Complexity Compared to ML-KELM that needs O(๐‘›2 ) memory to store ๐›€(๐‘–) and ๐— (๐‘–) for the i-th layer during the training stage, ML-EKM-ELM takes O(๐‘›๐‘™๐‘– ) to store the ฬƒ (๐‘–) and ๐— (๐‘–) for the i-th layer which is only linear in the data size ๐‘›. Hence, matrix ๐† the memory requirement is significantly reduced from quadratic to linear. For execution, with ๐‘‘ features and c classes in the training data and m test samples, ML-KELM needs to store i) O(๐‘š๐‘›) memory for test kernel matrix ๐›€๐’› ; ii) O(๐‘‘๐‘›) and O(๐‘›2 ) memory for transformation matrix ๐šช (1) and ๐šช (2) respectively; iii) O(๐‘›๐‘) memory for ๐œท, resulting in a total of O(๐‘›2 + ๐‘›(๐‘ + ๐‘‘ + ๐‘š)) memory. For ฬƒ ๐‘ ; ii) ML-EKM-ELM, we need to store i) O(๐‘š๐‘™3 ) memory for test kernel matrix ๐›€ O(๐‘‘๐‘™1 ) and O(๐‘™1 ๐‘™2 ) memeory for transformation matrix ๐šช (1) and ๐šช (2) ; iii) O(๐‘™3 2 ) memory for ๐‘“๐‘–๐‘›๐‘Ž๐‘™ ; iv) O(๐‘™3 ๐‘) memory for ๐œท; resulting in a total of ๐‘‚(๐‘š๐‘™3 + ๐‘‘๐‘™1 + ๐‘™1 ๐‘™2 + ๐‘™3 2 + ๐‘™3 ๐‘) memory. Assume ๐‘™1 = ๐‘™2 = ๐‘™3 = ๐‘›๐‘, where ๐‘ โ‰ช 1 . Then the memory complexity of ML-EKM-ELM is equal to O(๐‘š๐‘›๐‘ + ๐‘‘๐‘›๐‘ + 2๐‘›2 ๐‘2 + ๐‘›๐‘๐‘). Since ๐‘, ๐‘‘ and ๐‘š << ๐‘› in large-scale application, by keeping only the largest complexity values, the complexities of ML-KELM and ML-EKM-ELM become O(๐‘›2 ) and O(๐‘›2 ๐‘2 ) , respectively. For special case, when linear piecewise activation ๐‘”๐‘– was applied to all transformation matrix ๐šช (๐‘–) , only one single

PT

transformation matrix ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘

(i. e. , ๐šช๐‘ข๐‘›๐‘–๐‘“๐‘–๐‘’๐‘‘ = ๐šช (2) โˆ™ ๐šช (1) ) is necessary for

CE

execution (detailed in Section 2.3). In this case, the complexities of ML-KELM and ML-EKM-ELM are O(๐‘›) and O(๐‘›๐‘) respectively.

AC

B๏ผŽComputational complexity For training stage, the computational complexities of ML-KELM and ML-EKM-ELM are respectively shown in Table 1 and Table 2. In total, ML-KELM takes O(7๐‘›3 + 3๐‘›2 ๐‘‘ + ๐‘›2 ๐‘) while ML-KEM-ELM takes O ((3๐‘™1 2 + 3๐‘™2 2 + 3๐‘™3 2 + 3๐‘™1 ๐‘™2 +๐‘™2 ๐‘™3 + 3๐‘‘๐‘™1 + ๐‘™3 c)๐‘› + 3(๐‘™1 3 + ๐‘™2 3 + ๐‘™3 3 )) .

For

test

stage,

the

time

complexities of ML-KELM and ML-EKM-ELM are shown in Table 3 and Table 4 respectively. In total, the time complexity of the ML-KELM is O(๐‘š(๐‘›๐‘‘ + 2๐‘›2 + ๐‘›๐‘)) and the time complexity of the ML-KEM-ELM is O(๐‘š(๐‘‘๐‘™1 + ๐‘™1 ๐‘™2 + ๐‘™2 ๐‘™3 + ๐‘™3 2 + ๐‘™3 ๐‘)).

ACCEPTED MANUSCRIPT Assume ๐‘™1 = ๐‘™1 = ๐‘™3 = ๐‘›๐‘, where ๐‘ โ‰ช 1 , and keeping only the largest complexity values, and then, for training stage, the computational complexity of ML-KELM and ML-EKM-ELM are O(๐‘›3 ) and O(๐‘2 ๐‘›3 ) , respectively. Similarly, for test stage, the computational complexity of ML-KELM and ML-EKM-ELM are O(๐‘›2 ) and O(๐‘2 ๐‘›2 ) , respectively. Table 1 The computational complexity of ML -KELM in the training stage 1-st layer

2-nd layer

2

O(๐‘› ๐‘‘) O(๐‘›3 ) O(๐‘›2 ๐‘‘) ---

O(๐‘›3 ) O(๐‘›3 ) -O(๐‘›3 )

O(๐‘› O(๐‘›3 ) O(๐‘›3 ) O(๐‘›2 ๐‘‘) --

O(๐‘›2๐‘)

AN US

Construct ๐›€ Invert ๐›€(๐‘–) Calculate ๐šช (๐‘–) Calculate ๐— (๐‘–) Calculate ๐œท

3-rd layer

3)

CR IP T

(๐‘–)

Table 2 The computational complexity of ML-EKM-ELM in the training stage 2-nd layer

3-rd layer

O(๐‘›๐‘™1 ๐‘‘)

O(๐‘›๐‘™1 ๐‘™2 )

๐‘‚(๐‘›๐‘™2 ๐‘™3 )

O(๐‘™1 )

O(๐‘™2 )

O(๐‘™3 3 )

O(๐‘™2 3 ) O(๐‘›๐‘™2 2 ) O(๐‘›๐‘™2 2 ) O(๐‘™2 3 ) O(๐‘›๐‘™2 2 ) + O(๐‘›๐‘™2 ๐‘™1 ) O(๐‘›๐‘‘๐‘™1 )

O(๐‘™3 3 ) O(๐‘›๐‘™3 2 ) O(๐‘›๐‘™3 2 ) O(๐‘™3 3 ) -O(๐‘›๐‘™1 ๐‘™2 )

3

O(๐‘™1 3 ) O(๐‘›๐‘™1 2 ) O(๐‘›๐‘™1 2 ) O(๐‘™1 3 ) O(๐‘›๐‘™1 2 ) + O(๐‘›๐‘™1 ๐‘‘) ---

3

--

PT

Calculate ๐šช (๐‘–) Calculate ๐— (๐‘–) Calculate ๐œท

1-st layer

M

Construct ๐›€๐‘›๐‘™ and ๐›€๐‘™ ๐‘™ Decompose ๐›€๐‘™ ๐‘™ (๐‘–) Calculate (๐‘–) ฬƒ (๐‘–) Calculate ๐† ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) Calculate (๐† ฬƒ (๐‘–) )T ๐† ฬƒ (๐‘–) Inverse (๐†

(๐‘–)

ED

(๐‘–)

O(๐‘›๐‘™3 2 ) + O(๐‘›๐‘™3 c)

CE

Table 3 The computational complexity of ML -KELM in the test stage

(๐‘–)

AC

Calculate ๐™ Calculate ๐›€๐‘ ฬƒ Calculate ๐˜

1-st layer

2-nd layer

----

O(๐‘š๐‘›๐‘‘) ---

Table 4 The computational complexity of ML-EKM-ELM in the test stage 1-st layer

Calculate Calculate Calculate

๐™ (๐‘–) ฬƒ๐’ ๐›€ ฬƒ ๐˜

3-rd layer O(๐‘š๐‘›2 ) O(๐‘š๐‘›2 ) O(๐‘š๐‘›๐‘)

----

4 Experiments and results

2-nd layer O(๐‘š๐‘‘๐‘™1 ) ---

3-rd layer O(๐‘š๐‘™1 ๐‘™2 ) O(๐‘š๐‘™2 ๐‘™3 )

O(๐‘š๐‘™3 2 ) + O(๐‘š๐‘™3 ๐‘)

ACCEPTED MANUSCRIPT

CR IP T

Extensive experiments were conducted to evaluate the accuracy, time complexity and memory requirements of the proposed ML-EKM-ELM, which are further compared with ML-KELM and other relevant state-of-the-art methods. 4.1 Comparison Between ML-EKM-ELM and ML-KELM ML-EKM-ELM and ML-KELM are evaluated over 12 publicly available benchmark datasets from UCI Machine Learning repository [20] and http://openml.org [21]. All data sets are described in Table 5, nine of which are originally used in [11] and the results in [11] have shown that ML-KELM outperforms ML-ELM and H-ELM on average by 2.3% and 4.1% respectively. Table 5 Properties of benchmark data sets

1473 1484 2600 3279 5000 5974 7000 7400 9298 10299 14395 19282

1178 1187 2080 2623 4000 4779 5600 5920 7438 8239 11516 15426

295 297 520 656 1000 1195 1400 1480 1860 2060 2879 3856

AN US

CMC Yeast Madelon Isolet Waveform HAR1 Gisette TwoNorm USPS HAR2 Sylva Adult

Numbers of training Numbers of test samples samples

M

Instances

Features

Classes

9 8 500 617 21 561 5000 20 254 561 212 122

3 10 2 26 2 6 2 2 10 6 2 2

ED

Dataset

PT

A. Experiments Setup The experiments were carried out in Matlab 2015a under MacOS Sierra 10.12 with an Intel Core i5 of 3.4 GHz and 24 GB RAM. Fivefold cross validation is used for all experiments for a fair comparison. Following the practice in [11], RBF

CE

kernel ฮบ(๐‘–) (๐ฑ , ๐ฑ ,

๐‘– ) = exp(โˆ’

โˆฅ๐ฑ๐‘˜ ๐ฑ๐‘— โˆฅ22 ) 2๐œŽ 2

was adopted for all experiments, activation

AC

function ๐‘”๐‘– was chosen as linear piecewise activation, and the number of hidden layers is 3. ML-KELM has an automatically determined structure of ๐‘‘ โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘. The kernel parameter ๐œŽ๐‘– 2 (i=1 to 3) in each layer is respectively set as (๐‘–)

๐ฑ

โˆ’๐ฑ

(๐‘–)

๐›ฝ ๐‘›2

โˆ‘๐‘›,

=1

โˆฅ

โˆฅ22 , where ๐›ฝ๐‘– โˆˆ {10 2 , 10 1 , 100 , 101 , 102 }. The regularization parameter

in each layer is set as 225 because the experiments indicated that a sufficiently large ๐‘– can result in satisfactory performance on most datasets. Therefore, ML-KELM results in a total of 125 combinations of parameters for (๐œŽ1 , ๐œŽ2 , ๐œŽ3 , 225 , 225 , 225 ). The network structure of ML-EKM-ELM is ๐‘‘ โˆ’ ๐‘™1 โˆ’ ๐‘™2 โˆ’ ๐‘™3 โˆ’ ๐‘. For the sake of ๐‘–

ACCEPTED MANUSCRIPT simplicity, we simply set ๐‘™1 = ๐‘™2 = ๐‘™3 = ๐‘๐‘›, but not to find out the optimal network structure of ML-EKM-ELM because our aim is just to show the effectiveness of ML-EKM-ELM over relevant state-of-the-art methods. Hence, the network structure of ML-EKM-ELM becomes ๐‘‘ โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘. The kernel parameter ๐œŽ๐‘– (i=1 to 3) in each layer is respectively set as

๐›ฝ ๐‘›โˆ™๐‘™

โˆ‘๐‘›=1 โˆ‘๐‘™ =1 โˆฅ ๐ฑ

(๐‘–)

โˆ’ ๐ฏ (๐‘–) โˆฅ22 , where

CR IP T

๐›ฝ๐‘– โˆˆ {10 2 , 10 1 , 100 , 101 , 102 }. The regularization parameter ๐‘– in each layer is simply set as 225 . Similarly, a total of 125 combinations of (๐œŽ1 , ๐œŽ2 , ๐œŽ3 , 225 , 225 , 225 ) are resulted, which is consistent with ML-KELM. The parameter ๐‘ is only selected from {0.01, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3} because our later experiments show that a large ๐‘ (> 0.2) cannot result in significant increase of test accuracy but leads to very heavy computation burden.

AC

CE

PT

ED

M

AN US

B. Evaluation and Performance Analysis The training time and the test accuracy of ML-EKM-ELM both depend on the selected value of p. The training time and test accuracy is plotted as a function of p in Figure 4 for Isolet, USPS, HAR2 and Sylva, which are selected from Table 5.

Fig. 4 Training time and test accuracy with different p (ML-EKM-ELM)

ACCEPTED MANUSCRIPT

M

AN US

CR IP T

Fig.4 shows several properties of the proposed work: i) The test accuracy is improved as p grows (i.e., adopting more landmark points). The reason is that ML-EKM-ELM can preserve more discriminant information when more training samples are selected as landmark points to construct the EKM. ii) The value p to achieve satisfactory accuracy depends on the application dataset. For some dataset (e.g., Syala), a small p is sufficient to achieve satisfactory accuracy, while on some difficult tasks (e.g., Isolet), lager p is needed for a good accuracy. iii) The test accuracy grows slowly when the value of p reaches a certain threshold, which exposes that selecting a certain number of training samples as landmark points is sufficient to preserve most discriminant information in EKM. iv) After a certain threshold, increasing of p can only slightly improve the accuracy, but incurs a sharp increase of training time. From the experiments, ML-EKM-ELM with ๐‘ = 0.1 can already achieve satisfactory accuracy in most adopted datasets, which shows that selecting a small part of training samples as landmark points should be sufficient to construct EKM for learning a robust classifier. In practice, the selection of p is a tradeoff between accuracy and the limited resources (i.e., time and memory requirements). As a rule of thumb, we can first set ๐‘ = 0.1 to test the model performance of ML-EKM-ELM. If a device (e.g., mobile device) cannot provide enough resources to store and run such model, we can try a smaller ๐‘ (e.g., ๐‘ = 0.05, 0.01). If the accuracy does not satisfy the demand of the application, we can set a larger one (e.g., ๐‘ = 0.2, 0.3). The time complexities and memory requirements of ML-EKM-ELM and ML-KELM are listed in Table 6 and Table 7 for comparison. For ML-EKM-ELM, the selected value of p is with relative error of less than 1% in terms of accuracy |TA1 TA2| TA1

ED

compared to ML-KELM (i.e.,

ร— 100%

1%, where TA1 and TA2 are the

PT

corresponding test accuracies of ML-KELM and ML-EKM-ELM). In Table 6, ML-EKM-ELM achieves a substantial reduction of training time to get the comparable accuracy of ML-KELM. For adult dataset, our proposed ML-EKM-ELM runs 279 times faster than ML-KELM only with a little loss of 0.11%. The reason is

CE

ฬƒ (๐‘–) and invert ๐† ฬƒ (๐‘–) T ๐† ฬƒ (๐‘–) , that ML-EKM-ELM only takes O(๐‘›๐‘™๐‘– 2 ) time to form ๐†

AC

while ML-KELM requires O(๐‘›3 ) time to form and invert ๐›€(๐‘–) . In addition, the test ฬƒ ๐‘ , ๐šช (๐‘–) and ๐›ƒ are with much smaller time is also substantially reduced because ๐›€ size in the execution stage. For adult dataset, our proposed ML-EKM-ELM runs 315 times faster than ML-KELM. In Table 7, ML-EKM-ELM achieves a substantial reduction of training memory as ฬƒ (๐‘–) while well because ML-EKM-ELM requires a storage of ๐‘› ร— ๐‘™๐‘– matrix ๐† ML-KELM requires a storage of ๐‘› ร— ๐‘› kernel matrix ๐›€(๐‘–) . For adult dataset, the training memory requirement of ML-EKM-ELM is 34.9 MB while ML-KELM takes 3348 MB. In addition, the test memory requirement of ML-EKM-ELM is also ฬƒ ๐‘ , ๐šช (๐‘–) and substantially reduced because only some much smaller matrices (such as ๐›€ ๐›ƒ ) are necessary in the execution stage. For adult dataset, the test memory requirement of ML-EKM-ELM is 4.69 MB while ML-KELM takes up to 432 MB. In

ACCEPTED MANUSCRIPT

both analyses, a conclusion can be drawn that ML-EKM-ELM is of much lower computational complexity and memory requirements than ML-KELM. Table 6 The time complexities of ML-EKM-ELM and ML-KELM Test Accuracy (%) Data Set

Training time (in seconds)

Test time (in seconds)

p TA1

TA2

Tr1

Tr2

Tr1/ Tr2

Te1

Te2

Te1/Te2

0.05

55.48 2.5

55.29 2.5

0.261

0.014

18.6

0.040

0.0016

25.0

Yeast

0.03

59.96 2.

59.39 2.5

0.273

0.010

27.3

0.038

0.0013

29.2

Madelon

0.2

79.13 1.

79.46 2.1

1.213

0.372

3.3

0.162

0.0227

7.1

Isolet

0.2

94.76 0.

94.00 0.

2.056

0.648

3.2

0.239

0.0347

6.9

Waveform

0.01

86.84 1.0

86.72 1.1

5.113

0.025

204.5

0.554

0.0025

221.6

HAR1

0.1

98.75 0.2

97.94 0.3

8.527

0.924

9.2

0.803

0.0564

14.2

Gisette

0.05

97.89 0.5

97.47 0.5

266.6

19.87

13.4

23.77

0.8512

27.9

TwoNorm

0.01

97.80 0.

97.82 0.3

13.44

0.050

268.8

1.212

0.0043

281.9

2.312

0.1928

12.0

3.057

0.1444

21.2

CR IP T

CMC

0.1

98.37 0.3

97.53 0.

26.32

2.400

11.0

0.07

99.11 0.2

98.23 0.3

35.51

1.881

18.9

Sylva

0.01

99.40 0.1

98.47 0.2

83.69

0.354

236.4

6.844

0.0258

265.3

Adult

0.01

84.77 0.5

84.66 0.

212.9

0.763

279.0

15.50

0.0492

315.0

AN US

USPS HAR2

*TA1: the test accuracy of ML-KELM; TA2: the test accuracy of ML-EKM-ELM; Tr1: the training time of ML-KELM; Tr2: the training time of ML-EKM-ELM; Te1: the test time of ML-KELM; Te2: the test time of ML-EKM-ELM

Test Accuracy (%) Data Set

p TA2

ED

TA1

M

Table 7 The memory requirements of ML-EKM-ELM and ML-KELM Training memory (in MB)

Test memory (in MB)

TrM1

TrM2

TrM1/ TrM2

TeM1

TeM2

TeM1/TeM2

0.05

55.48 2.5

55.29 2.5

19.4

1.03

18.8

2.52

0.18

14.0

Yeast

0.03

59.96 2.

59.39 2.5

20.2

0.62

32.6

2.67

0.10

26.7

Madelon

0.2

79.13 1.

79.46 2.1

67.7

16.6

4.1

15.2

5.59

2.7

Isolet

0.2

Waveform

0.01

HAR1

0.1

PT

CMC

111

Gisette

0.05

97.89 0.5

97.47 0.5

650

33.8

19.2

259

14.1

18.4

TwoNorm

0.01

97.80 0.

97.82 0.3

503

5.16

97.5

63.8

0.69

92.5

USPS

0.1

98.37 0.3

97.53 0.

807

84.6

9.5

113

19.4

5.8

HAR2

0.07

99.11 0.2

98.23 0.3

998

71.2

14.0

154

15.7

9.8

Sylva

0.01

99.40 0.1

98.47 0.2

1922

19.7

97.6

256

2.72

94.1

Adult

0.01

84.77 0.5

84.66 0.

3348

34.9

95.9

432

4.69

92.1

94.00 0.

26.0

4.3

24.7

8.94

2.8

86.84 1.0

86.72 1.1

229

2.38

96.2

29.4

0.33

89.1

98.75 0.2

97.94 0.3

340

36.3

9.4

59.5

9.31

6.4

AC

CE

94.76 0.

*TA1: the test accuracy of ML-KELM; TA2: the test accuracy of ML-EKM-ELM; TrM1: the training memory of ML-KELM; TrM2: the training memory of ML-EKM-ELM; TeM1: the test memory of ML-KELM; TeM2: the test memory of ML-EKM-ELM

C.

Evaluation on the Effect of Multiple Hidden Layers To further illustrate the effect of multiple hidden layers of ML-EKM-ELM, a highly nonlinear data set Madelon is employed, which has 2600 samples with 500 features. Madelon was originally proposed in the NIPS 2003 feature selection

ACCEPTED MANUSCRIPT

AN US

CR IP T

challenge[22]. Fivefold cross-validation is used in the following experiment, and the input attributes are normalized into the range [-1,1]. In highly nonlinear application, ML-EKM-ELM may require more hidden layers to obtain satisfactory performance. In this experiment, the number of hidden layers is set from 2 to 5. The test accuracy of ML-EKM-ELM under different hidden layers is shown in Table 8, and ML-KELM is also included to make a comparison. From Table 8, it is evident that: i) Under 2 layers, the accuracy of ML-KELM outperforms ML-EKM-ELM. However, the accuracy of ML-EKM-ELM is improved as p grows up. The reason is that shallow model needs to map the highly nonlinear input data to a very high dimensional space which is linearly separable. Therefore, it usually requires large number of hidden neurons to obtain better accuracy for highly nonlinear problem. ii) When p is large enough (p 0.05 for madelon), ELM-EKM-ELM with sufficient number of the hidden layers can perform as close as ML-KELM in terms of accuracy. iii) When the number of hidden layers reaches a certain threshold (4 hidden layers on madelon), the accuracy keeps almost constant or even starts to decline, which reveals that kernel-based model may not need a very deep structure to achieve satisfactory performance due to its strong approximation ability for nonlinear problem. Otherwise, overfitting may even occur. Table 8 Evaluation on the effect of multiple hidden layers over Madelon 2 layers 72.38 2.1

ML-EKM-ELM(p=0.2)

68.77 2.2

ML-EKM-ELM(p=0.1) ML-EKM-ELM(p=0.05)

5 layers

79.73

.๐Ÿ”

78.55 1.7

79.46 2.1

79.92 ๐Ÿ. ๐Ÿ

79.42 2.1

67.86 1.

75.43 2.1

80.51 ๐Ÿ.

80.47 1.8

65.90 2.

74.68 1.

79.56 1.

79.62 2.4

64.11 2.0

66.06 3.0

67.35 2.

67.37 ๐Ÿ. ๐Ÿ

ED

ML-EKM-ELM(p=0.01)

4 layers

79.13 1.

M

ML-KELM

3 layers

PT

The best result of each row is shown in bold.

AC

CE

4.2 Comparison with State-of-the-Art methods on NORB dataset In this experiment, a more complicated data set called NYC Object Recognition Benchmark (NORB) [23] was evaluated to further confirm the effectiveness of the proposed ML-EKM-ELM. NORB contains 48,600 images of different 3D toy objects of five distinct categories: animals, humans, airplanes, trucks, and cars, as shown in Fig. 5. Each image is captured from different viewpoints and under various lighting conditions. The training set contains 24,300 stereo training images of size 2 ร— 32 ร— 32 (i.e., 2048 dimensions), and another set of 24,300 images for test. We adopted the pre-fixed train/test data and the preprocessing method used in [12] to produce the comparable performance of our proposed work and the state-of-the-art relevant algorithms.

CR IP T

ACCEPTED MANUSCRIPT

(a)

(b)

Fig.5 a Examples for training figures. b Examples for testing figures [23]

AC

CE

PT

ED

M

AN US

Existing mainstream methods are compared to verify the effectiveness of the proposed ML-EKM-ELM, including BP-based algorithms (Stacked Auto Encoders (SAE) [2], Stacked Denoising Autoencoder (SDA) [24], Deep Belief Networks (DBN) [25], Deep Boltzmann Machines (DBM) [26], and multilayer perceptron (MLP) [27]), and ELM-based training algorithms (ML-ELM[10] and H-ELM[12]). For BP-based algorithms (SAE, SDA, BDN, DBM, and MLP), the initial learning rate is set as 0.1 and the decay rate is set as 0.95 for each learning epoch. For ELM-based training algorithms (ML-ELM and H-ELM with 3 layers), the regularization parameter ๐‘– of ML-ELM in each layer is respectively set as 10 1 , 103 and 108 ,while the regularization parameter of H-ELM is respectively set as 1,1 and 230 . More detailed parameters setting could be checked in [12]. Due to the huge computational cost, the experimental results on SAE, SDA, BDN, DBM, MLP, and ML-ELM are directly cited from [12]. Only the network structure of H-ELM with 20 โˆ’ 3000 โˆ’ 3000 โˆ’ 15000 โˆ’ 5 is tested in our experiment because [12] has demonstrated that such network structure can obtain optimal performance in NORB dataset. ML-KELM has an automatically determined structure of ๐‘‘ โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘› โˆ’ ๐‘, and for the sake of simplicity, ML-EKM-ELM was tested with network structure of ๐‘‘ โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘๐‘› โˆ’ ๐‘. Experiment in section 4.1 showed that adopting 10% of training samples may be sufficient to construct EKM for learning a robust ML-EKM-ELM, therefore ๐‘ = 0.1 was adopted in our experiments and ๐‘ = 0.01, 0.05 are included as well for comparison. The kernel parameter ๐œŽ๐‘– (i=1 to 3) in each layer is respectively set as

๐›ฝ ๐‘›โˆ™๐‘™

โˆ‘๐‘›=1 โˆ‘๐‘™ =1 โˆฅ ๐ฑ

(๐‘–)

โˆ’ ๐ฏ (๐‘–) โˆฅ22 , where ๐›ฝ๐‘– โˆˆ

{10 2 , 10 1 , 100 , 101 , 102 }, the regularization parameter ๐‘– in the first two layer is set as 225 , and in the final layer is selected from {25 , 210 , 215 , 220 , 225 }. A total of 625 combinations of (๐œŽ1 , ๐œŽ2 , ๐œŽ3 , 225 , 225 , 3 ) are tested in our experiments. All the experiments are repeated 100 times under the best parameters and the average accuracy and maximum accuracy are reported. The average training time for each model with fixed parameters is shown in Table 9. For ML-KELM, practically, it

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

may take several weeks to thoroughly try out the numerous combinations of parameters for the best model. In addition to substantial training time, the test time of ML-KELM is 427.48s and the test memory requirement is 4,577 MB, which restricts the application of ML-KELM on many devices with only limited resources such as mobile phones. The proposed ML-EKM-ELM can overcome the time and memory limitation of ML-KELM while comparable accuracy is maintained. Under the comparable accuracy of H-ELM (with structure of 20 โˆ’ 3000 โˆ’ 3000 โˆ’ 15000 โˆ’ 5), the training time and test time of ML-EKM-ELM with ๐‘ = 0.05(the corresponding structure is 20 โˆ’ 1215 โˆ’ 1215 โˆ’ 1215 โˆ’ 5) , are surprisingly fast, i.e., 17.46s and 4.96s respectively, while the training memory and test training memory are only 453MB and 250MB respectively. The reason is that ML-EKM-ELM with ๐‘ = 0.05 employs a much smaller EKM, which is easy for storage and computation. ML-EKM-ELM with ๐‘ = 0.1 (the corresponding structure is 20 โˆ’ 2 30 โˆ’ 2 30 โˆ’ 2 30 โˆ’ 5) performs better than ML-ELM and H-ELM in terms of accuracy, up to 4% and 2% with a smaller variance, respectively. ML-ELM and H-ELM are suboptimal because the input weights ๐š(๐‘–) and the bias ๐‘ (๐‘–) in every layer are randomly generated. In some cases, poorly generated ๐š(๐‘–) and ๐‘ (๐‘–) deteriorate the generalization of ML-ELM and H-ELM. Although large number of hidden nodes can improve the stability (i.e., lower standard deviation (STD)) of the accuracy in H-ELM, the STD of the accuracy in H-ELM with 15000 hidden nodes is still 0.40% as shown in Table 9. Instead, ML-EKM-ELM uses training samples as landmark points (in practice, adopting a small portion of training samples is sufficient) to generate features, which contains more discriminant information than the randomly generated features [28] and thereby its accuracy is higher and more stable. Although only 2,430 features ( ๐‘ = 0.1 means to select 2,430 training samples as landmark points from 24,300) are used in ML-EKM-ELM, the STD of ML-EKM-ELM drops to 0.24% compared to 0.40% in H-ELM with 15,000 features. Note that the accuracy of ML-EKM-ELM can be further improved and can become more stable as p grows. Moreover, ML-ELM and H-ELM need to exhaustively try out the values of ๐ฟ๐‘– from several hundreds to tens of thousands because their accuracies are very sensitive to ๐ฟ๐‘– . In contrast, ML-EKM-ELM only need a few trials of ๐‘ in our experiment. Therefore, the exhaustive and tedious tuning over ๐ฟ๐‘– is eliminated in ML-EKM-ELM. Table 9 Comparison on NORB dataset Accuracy (%)

Training

Test time(s)

time(s)

Training

Test

Memory(MB)

Memory(MB)

SAEa [2]

86.28

60504.34

N/A

N/A

N/A

a

SDA [24]

87.62

65747.69

N/A

N/A

N/A

a

DBN [25]

88.47

87280.42

N/A

N/A

N/A

a

89.65

182183.53

N/A

N/A

N/A

84.20

34005.47

N/A

N/A

N/A

DBM [26] a

MLP-BP [27]

ACCEPTED MANUSCRIPT

a

HELM [12]

88.91

775.29

N/A

N/A

N/A

91.28

432.19

N/A

N/A

N/A

155.19

26.48

3604

3082

3.07

1.35

91

46

17.46

4.96

453

250

60.33

11.62

932

545

965.66

427.48

8785

4577

Maximum: 91.56

HELMb [12]

Mean: 90.47 0. 0

ML-EKM-ELMb

Maximum: 87.98

(p=0.01)

Mean:86.51 0.7

ML-EKM-ELMb

Maximum: 92.18

(p=0.05)

Mean: 91.50 0.30

ML-EKM-ELMb

Maximum:93.17

(p=0.1)

Mean: 92.38 0.2

ML-KELMb

93.52*

CR IP T

ML-ELMa [10]

a. Results from[12] ; machine configuration: Laptop, Intel-i7 2.4G CPU, 16G DDR3 RAM, Windows 7, MATLAB R2013b

b. Results from our computer; machine configuration: macOS Sierra 10.12 with an Intel Core i5 of 3.4 GHz and 24

AN US

GB RAM, MATLAB R2015a

N/A. Results from[12] and [12] didnโ€™t report the results of test time, training memory and test memory. * The standard deviation (STD) of ML-KELM is not reported due to ML-KELM does not have any random input parameters (๐š(๐‘–) and ๐‘ (๐‘–) ) and does not need to randomly select the training sample as landmark points.

AC

CE

PT

ED

M

4.3 Comparison with State-of-the-Art methods on 20Newsgroups dataset To test the performance of ML-EKM-ELM on text categorization, 20Newsgroups2 was used in our experiment. which contains 18846 documents with 26214 distinct words. This dataset has 20 categories, each with around 1000 documents. In our experiment, 11314 documents (60%) is adopted for training data and 7532 documents (40%) for test data. Due to the huge dimensionality of 20Newsgroups (up to 26214 features), large number of hidden nodes is need for autoencoder to keep enough discriminant information. Therefore, for ML-ELM and H-ELM, the numbers of hidden nodes ๐ฟ๐‘– (๐‘– = 1 to 3) are, respectively, set as 500 ร— ๐‘š {๐‘š = 1,2,3, โ€ฆ ,10}. For the proposed ML-EKM-ELM, we simply set ๐‘™1 = ๐‘™2 = ๐‘™3 = ๐‘๐‘› , and ๐‘ = {0.1,0.2,0.3 }. For ML-ELM,ML-KELM and ML-EKM-ELM, the regularization parameter ๐‘– in each layer is simply set as 225 , while the regularization parameter of H-ELM is respectively set as 1,1 and 230 as recommended by [12].All the experiments are repeated 100 times under the best parameters and the average accuracy and maximum accuracy are reported in Table 10. Similar to the NORB dataset, ML-EKM-ELM (p=0.2) takes fewer hidden nodes (2262 vs. 3500) in each layer to obtain the comparable accuracy of ML-ELM and H-ELM because ML-EKM-ELM contains more discriminant information in the hidden layer than the randomly generated hidden layer in ML-ELM and H-ELM. Therefore, ML-EKM-ELM outperforms ML-EKM and H-ELM both in computation time and memory requirement under the similar accuracy. Furthermore, ML-EKM-ELM (p=0.2) significantly outperforms ML-ELM, 2

20Newsgroups is download at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html

ACCEPTED MANUSCRIPT

CR IP T

ML-KELM, and H-ELM in terms of training time (40.24 < 186.69 < 196.01 < 316.95). The reason is that dealing with high dimensional features (up to 26,214 input features and thousands of hidden nodes in each layer for 20Newsgroups) poses a big challenge for ML-ELM, ML-KELM and H-ELM. In details, ML-ELM needs to ensure orthogonality of the large randomly generated matrix (i.e., input weights and biases) in each layer for better generalization ability. ML-KELM needs to calculate the entire kernel matrix ๐›€(๐‘–) in each layer from the big input matrix ๐— (๐‘–) . H-ELM needs to apply fast iterative shrinkage-thresholding algorithm (FISTA) to the big input matrices ๐— (๐‘–) and ๐‡ (๐‘–) for sparse autoenocoder. In contrast, ML-EKM-ELM can avoid these above-mentioned operations, maintaining a very attractive training time. Table 10 Comparison on 20Newsgroups Accuracy (%)

Table 10 Comparison on 20Newsgroups

| Method | Accuracy (%) | Network Structure | Training time (s) | Test time (s) | Training Memory (MB) | Test Memory (MB) |
|---|---|---|---|---|---|---|
| ML-ELM | Maximum: 85.38; Mean: 84.85 ± 0.1 | 26214-3500-3500-3500-20 | 186.69 | 5.08 | 1413 | 1024 |
| H-ELM | Maximum: 85.48; Mean: 84.93 ± 0.20 | 26214-3500-3500-4000-20 | 316.95 | 6.55 | 1454 | 1065 |
| ML-EKM-ELM (p=0.1)** | Maximum: 84.05; Mean: 83.47 ± 0.25 | 26214-1131-1131-1131-20 | 10.14 | 1.74 | 402 | 290 |
| ML-EKM-ELM (p=0.2)** | Maximum: 85.52; Mean: 85.04 ± 0.18 | 26214-2262-2262-2262-20 | 40.24 | 4.31 | 844 | 617 |
| ML-EKM-ELM (p=0.3)** | Maximum: 85.95; Mean: 85.48 ± 0.17 | 26214-3394-3394-3394-20 | 104.10 | 7.16 | 1320 | 982 |
| ML-KELM | 86.47* | 26214-11314-11314-11314-20 | 196.01 | 30.27 | 3946 | 2734 |

* The standard deviation (STD) of ML-KELM is not reported because ML-KELM has no random input parameters (a^(i) and b^(i)) and does not need to randomly select training samples as landmark points.
** ML-EKM-ELM with p = 0.1 results in 1131 (i.e., pn = 0.1 × 11314 ≈ 1131) hidden nodes in every layer; ML-EKM-ELM with p = 0.2 results in 2262 (i.e., pn = 0.2 × 11314 ≈ 2262) hidden nodes in every layer; ML-EKM-ELM with p = 0.3 results in 3394 (i.e., pn = 0.3 × 11314 ≈ 3394) hidden nodes in every layer.

5 Conclusion

An efficient kernel-based deep learning algorithm called ML-EKM-ELM is proposed, which replaces the randomly generated hidden layer in ML-ELM with an approximate empirical kernel map (EKM) of l_i dimensions (where l_i = pn, p ∈ {0.01, 0.05, 0.1, 0.2} typically, and n is the data size). In this way, ML-EKM-ELM resolves the three practical issues of ML-ELM and H-ELM. 1) ML-EKM-ELM is based on kernel learning, so no random projection is necessary. As a result, optimal and stable performance can be achieved under a fixed set of parameters; in the experiments, ML-EKM-ELM with p = 0.1 is respectively up to 2% and 4% more accurate than H-ELM and ML-ELM on NORB. 2) ML-EKM-ELM does not need to exhaustively tune the parameters L_i for all layers as ML-ELM and H-ELM do; only a few trials of l_i are necessary. For simplicity, l_i can be set equal in every layer (i.e., l_1 = l_2 = l_3 = ...) while maintaining satisfactory performance. 3) Both computation time and memory requirements can be significantly reduced in ML-EKM-ELM. For NORB, at accuracy comparable to H-ELM, ML-EKM-ELM with p = 0.05 is respectively up to 9 times and 5 times faster than H-ELM for training and testing, while the training memory storage and test memory storage can be reduced to as little as 1/8 and 1/12, respectively. Furthermore, ML-EKM-ELM overcomes the memory storage and computation issues of ML-KELM, producing a much smaller hidden layer for fast training and low memory storage. For NORB, ML-EKM-ELM with p = 0.1 is respectively up to 16 times and 37 times faster than ML-KELM for training and testing, with a small accuracy loss of 0.35%, while the memory storage can be reduced to as little as 1/9.

To summarize, we empirically show that the hidden layers of a multilayer neural network can be encoded in the form of an EKM. In the future, more advanced low-rank approximation methods, such as the clustered Nyström method [18] and the ensemble Nyström method [29], may be tested to obtain a more effective EKM and further improve the performance of the proposed ML-EKM-ELM. Finally, ML-EKM-ELM has the following limitation: when p > 0.3 is chosen, ML-EKM-ELM may not outperform ML-KELM in terms of training time for some applications. The reason is that ML-EKM-ELM applies a standard SVD (singular value decomposition) to a submatrix of the kernel matrix in the training stage, and the size of this submatrix is proportional to p. When p > 0.3, the SVD step dominates the computation and becomes prohibitive. Therefore, in the future, an approximate and fast SVD method [30] is needed to overcome this problem.
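As an illustration of that remedy, the sketch below shows how a randomized range-finder SVD (in the spirit of the fast approximate methods referred to by [30]) could replace the exact SVD of the l × l landmark submatrix when p is large; the target rank, oversampling amount, and NumPy implementation are our own assumptions, not part of the proposed algorithm.

```python
import numpy as np

def randomized_svd(W, k, oversample=10, seed=0):
    """Approximate top-k SVD of the (symmetric PSD) landmark kernel block W.
    Cost is roughly O(l^2 (k + oversample)) instead of the O(l^3) of a full SVD."""
    rng = np.random.default_rng(seed)
    l = W.shape[0]
    # Random range finder: sketch the range of W with a thin Gaussian test matrix.
    Omega = rng.standard_normal((l, k + oversample))
    Q, _ = np.linalg.qr(W @ Omega)             # orthonormal basis for the sketched range
    B = Q.T @ W                                # small (k + oversample) x l matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                                 # lift the factors back to the original space
    return U[:, :k], s[:k], Vt[:k]

# Inside approximate_ekm (see the sketch in Section 4.3), the exact decomposition
#   U, s, _ = np.linalg.svd(W)
# could then be swapped for
#   U, s, _ = randomized_svd(W, k=500)
# which keeps only the top k components, so the resulting EKM has k dimensions.
```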

Acknowledgments

The work is supported by the University of Macau under project codes MYRG2018-00138-FST and MYRG2016-00134-FST.

References

[1] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2 (2009) 1-127.
[2] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 313 (2006) 504-507.
[3] J. Cao, W. Wang, J. Wang, R. Wang, Excavation equipment recognition based on novel acoustic statistical features, IEEE Transactions on Cybernetics, 47 (2017) 4392-4404.
[4] J. Cao, K. Zhang, M. Luo, C. Yin, X. Lai, Extreme learning machine and adaptive sparse representation for image classification, Neural Networks, 81 (2016) 91-102.
[5] L. Xi, B. Chen, H. Zhao, J. Qin, J. Cao, Maximum correntropy Kalman filter with state constraints, IEEE Access, 5 (2017) 25846-25853.
[6] W. Wang, Y. Huang, Y. Wang, L. Wang, Generalized autoencoder: a neural network framework for dimensionality reduction, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 490-497.
[7] L. Van Der Maaten, E. Postma, J. Van den Herik, Dimensionality reduction: a comparative review, Journal of Machine Learning Research, 10 (2009) 66-71.
[8] J. Deng, Z. Zhang, E. Marchi, B. Schuller, Sparse autoencoder-based feature transfer learning for speech emotion recognition, in: Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 511-516.
[9] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 37-49.
[10] L.L.C. Kasun, H. Zhou, G.B. Huang, C.M. Vong, Representational learning with ELMs for big data, IEEE Intelligent Systems, 28 (6) (2013) 31-34.
[11] C.M. Wong, C.M. Vong, P.K. Wong, J. Cao, Kernel-based multilayer extreme learning machines for representation learning, IEEE Transactions on Neural Networks and Learning Systems, 29 (3) (2018) 757-762.
[12] J. Tang, C. Deng, G.B. Huang, Extreme learning machine for multilayer perceptron, IEEE Transactions on Neural Networks and Learning Systems, 27 (2016) 809-821.
[13] G.B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42 (2012) 513-529.
[14] K. Zhang, L. Lan, Z. Wang, F. Moerchen, Scaling up kernel SVM on limited resources: a low-rank linearization approach, in: JMLR Proceedings Track, vol. 22, 2012, pp. 1425-1434.
[15] A. Golts, M. Elad, Linearized kernel dictionary learning, IEEE Journal of Selected Topics in Signal Processing, 10 (2016) 726-739.
[16] F. Pourkamali-Anaraki, S. Becker, A randomized approach to efficient kernel clustering, in: IEEE Global Conference on Signal and Information Processing, 2016, pp. 207-211.
[17] A. Gittens, M.W. Mahoney, Revisiting the Nyström method for improved large-scale machine learning, Journal of Machine Learning Research, 28 (2013) 567-575.
[18] F. Pourkamali-Anaraki, S. Becker, Randomized clustered Nyström for large-scale kernel machines, arXiv preprint arXiv:1612.06470, 2016.
[19] B. Schölkopf, S. Mika, C.J. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, A.J. Smola, Input space versus feature space in kernel-based methods, IEEE Transactions on Neural Networks, 10 (1999) 1000-1017.
[20] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Sciences, Irvine, CA, 2013. http://archive.ics.uci.edu/ml.
[21] J. Vanschoren, J.N. Van Rijn, B. Bischl, L. Torgo, OpenML: networked science in machine learning, ACM SIGKDD Explorations Newsletter, 15 (2014) 49-60.
[22] I. Guyon, S. Gunn, A. Ben-Hur, G. Dror, Result analysis of the NIPS 2003 feature selection challenge, Advances in Neural Information Processing Systems, 2005, pp. 545-552.
[23] Y. LeCun, F.J. Huang, L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 97-104.
[24] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096-1103.
[25] G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18 (2006) 1527-1554.
[26] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in: Proceedings of Artificial Intelligence and Statistics, 2009, pp. 448-455.
[27] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[28] T. Yang, Y.F. Li, M. Mahdavi, R. Jin, Z.H. Zhou, Nyström method vs random Fourier features: a theoretical and empirical comparison, Advances in Neural Information Processing Systems, 2012, pp. 476-484.
[29] S. Kumar, M. Mohri, A. Talwalkar, Ensemble Nyström method, Advances in Neural Information Processing Systems, 2009, pp. 1060-1068.
[30] M. Li, J.T. Kwok, B. Lu, Making large-scale Nyström approximation possible, in: Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 631-638.


Chi-Man VONG received the M.S. and Ph.D. degrees in software engineering from the University of Macau, Macau, in 2000 and 2005, respectively. He is currently an Associate Professor with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau. His research interests include machine learning methods and intelligent systems.


Chuangquan Chen received the B.S. and M.S. degrees in mathematics from South China Agriculture University, Guangzhou, China, in 2010 and 2013, respectively. He is currently pursuing the Ph.D. degree in Computer Science, University of Macau, Macau, China. His current research interests include machine learning and data mining.


Pak-Kin Wong received the Ph.D. degree in Mechanical Engineering from The Hong Kong Polytechnic University, Hong Kong, in 1997. He is currently a Professor in the Department of Electromechanical Engineering and Associate Dean (Academic Affairs), Faculty of Science and Technology, University of Macau. His research interests include automotive engineering, fluid transmission and control, artificial intelligence, mechanical vibration, and manufacturing technology for biomedical applications. He has published over 200 scientific papers in refereed journals, book chapters, and conference proceedings.