Accepted Manuscript
Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning
Chi-Man Vong, Chuangquan Chen, Pak-Kin Wong
PII: S0925-2312(18)30584-8
DOI: 10.1016/j.neucom.2018.05.032
Reference: NEUCOM 19585
To appear in: Neurocomputing
Received date: 22 January 2018
Revised date: 8 May 2018
Accepted date: 9 May 2018
Please cite this article as: Chi-Man Vong, Chuangquan Chen, Pak-Kin Wong, Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning, Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.032
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Empirical Kernel Map-Based Multilayer Extreme Learning Machines for Representation Learning
Chi-Man Vong¹, Chuangquan Chen¹, Pak-Kin Wong²
¹ Department of Computer and Information Science, University of Macau, Macau
² Department of Electromechanical Engineering, University of Macau, Macau
Abstract
Recently, multilayer extreme learning machine (ML-ELM) and hierarchical extreme learning machine (H-ELM) were developed for representation learning, reducing training time from hours to seconds compared to the traditional stacked autoencoder (SAE). However, there are three practical issues in ML-ELM and H-ELM: 1) the random projection in every layer leads to unstable and suboptimal performance; 2) the manual tuning of the number of hidden nodes in every layer is time-consuming; and 3) with a large hidden layer, training becomes relatively slow and large storage is necessary. More recently, issues (1) and (2) have been resolved by a kernel method, namely multilayer kernel ELM (ML-KELM), which encodes the hidden layer in the form of a kernel matrix (computed by applying a kernel function to the input data), but the storage and computation of the kernel matrix pose a big challenge in large-scale applications. In this paper, we empirically show that these issues can be alleviated by encoding the hidden layer in the form of an approximate empirical kernel map (EKM) computed from a low-rank approximation of the kernel matrix. The proposed method is called ML-EKM-ELM, whose contributions are: 1) stable and better performance is achieved without any random projection mechanism; 2) the exhaustive manual tuning of the number of hidden nodes in every layer is eliminated; 3) EKM is scalable and produces a much smaller hidden layer for fast training and low memory storage, and is thereby suitable for large-scale problems. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed ML-EKM-ELM. As an illustrative example, on the NORB dataset, ML-EKM-ELM is respectively up to 16 times and 37 times faster than ML-KELM for training and testing, with only a small accuracy loss of 0.35%, while the memory storage can be reduced to as little as 1/9.
Index Terms: Kernel learning, multilayer extreme learning machine (ML-ELM), empirical kernel map (EKM), representation learning, stacked autoencoder (SAE).
1. Introduction

† Corresponding author: Chi-Man Vong ([email protected])

Autoencoder (AE) is an unsupervised neural network whose input layer is equal to the
output layer [1, 2]. AE offers an alternative to traditional noise reduction and feature extraction [3-5], automatically extracting an effective representation from the raw data. Several AEs can be used as building blocks to form a stacked AE (SAE) [1], which is capable of extracting different levels of abstract features from raw data, making it suitable for applications of dimension reduction [6, 7] and transfer learning [8, 9]. However, the iterative backpropagation-based training procedure of SAE is extremely time-consuming. For this reason, multilayer extreme learning machine (ML-ELM) [10] was proposed (as shown in Fig. 1), in which multiple ELM-based AEs (ELM-AE) are stacked for representation learning, followed by a final layer for classification. In ELM-AE [Fig. 1(a)], the training mechanism randomly chooses the input weights a^(i) and biases b^(i) for the input layer of the AE and then analytically computes the transformation matrix Q^(i). ELM-AE outputs Q^(i) to learn a new input representation [Fig. 1(b)]. After the representation learning is done [Fig. 1(d)], the final data representation x_final is obtained and used as the hidden layer to calculate the output weight β for classification using regularized least squares. Without any iteration, ML-ELM reduces the training time from hours to seconds compared to traditional SAE.
Fig. 1. The architecture of ML-ELM [11]: (a) the transformation Q^(1) is obtained for representation learning in ELM-AE; (b) the new input representation x^(2) is computed by g(x^(1)·(Q^(1))^T), where g is an activation function; (c) x^(2) is the input to ELM-AE for another round of representation learning; (d) after the unsupervised representation learning is finished, x_final is obtained and used as input to calculate the output weight β for classification using regularized least squares.
More recently, a variant of ML-ELM called hierarchical ELM (H-ELM) [12] was developed. Compared to ML-ELM, which directly uses the final data representation x_final as the hidden layer to calculate the output weight β, H-ELM uses x_final as input to an individual ELM for classification (i.e., it first randomly maps x_final into a hidden layer), thereby maintaining the universal approximation capability of ELM. Both ML-ELM and H-ELM are very attractive due to their fast training speed. However, there are several practical issues with ML-ELM and H-ELM:
1) Suboptimal model generalization: the random input weights a^(i) and biases b^(i) in every layer lead to unstable and suboptimal performance. In some cases, poorly generated a^(i) and b^(i) may prevent the network from generalizing well, and hence numerous trials on a^(i) and b^(i) are necessary.
2) Exhaustive tuning: the accuracies of ML-ELM and H-ELM are drastically influenced by the number of hidden nodes L_i, which requires exhaustive tuning ranging from several hundred to tens of thousands depending on the complexity of the application. Hence, numerous tedious trials are needed to determine the optimal L_i.
3) Relatively slow training and high memory requirement under large L_i: in large-scale applications, an accurate ML-ELM model can easily contain thousands or tens of thousands of hidden neurons. For example, L_i can be up to 15000 for the image dataset NORB [12]. Practically, many devices such as mobile phones have only a limited amount of expensive memory, which may be insufficient to store and run a large model. It is therefore crucial to run a compact model on limited resources while maintaining satisfactory accuracy. Analytically, the memory requirement and training time of ML-ELM in the i-th layer are O(n·L_i) and O(L_i²·n), respectively, for n training samples, which poses a challenge when training a model with large L_i; hence a compact model with a faster solution is of interest.
From the literature [13], kernel learning is well known for its optimal performance without any random input parameters. Under this inspiration, a kernel version of ML-ELM was proposed, namely multilayer kernel ELM (ML-KELM) [11], which encodes the hidden layers in the form of kernel matrices. Without tuning the parameters L_i, a^(i) and b^(i), ML-KELM successfully addresses the first two issues of ML-ELM and H-ELM, but worsens the third issue of high memory storage and very slow training time. The reason is that the storage and computation of the kernel matrix K^(i) ∈ R^{n×n} in each layer pose a big challenge in large-scale applications. Specifically, it takes O(n²) memory to store K^(i), and O(n³) time to find the inverse of K^(i). Thus, both memory and computation costs grow rapidly (quadratically and cubically, respectively) with the training data size n, which restricts the application of ML-KELM to large-scale problems, especially on mobile devices.

From the literature [14-16], a kernel matrix K can be decomposed as K ≈ G̃G̃^T, yielding an approximate empirical kernel map (EKM) G̃. G̃ retains most of the discriminant information of K but with a much smaller size, which is very efficient for learning over large-scale data [14-16]. Under this inspiration, a multilayer empirical kernel map ELM (ML-EKM-ELM) is proposed in this paper, which addresses the aforementioned issues of ML-ELM and H-ELM by encoding every hidden layer in
form of an approximate EKM G̃^(i) with l_i dimensions (where l_i ≪ n) computed from the low-rank approximation of K^(i). Moreover, the Nyström method [17, 18] is adopted in our work to generate G̃^(i) due to its efficiency in many large-scale machine learning problems; it does not need to calculate the entire kernel matrix K^(i) and only operates on a small subset of K^(i) to generate G̃^(i).

Compared to ML-ELM and H-ELM, the contributions of ML-EKM-ELM are:
1) Benefiting from kernel learning, the constructed kernel matrix does not require any randomly generated parameters (input weights a^(i) and biases b^(i)) in each layer, so a stable and theoretically optimal performance is always achieved. EKM efficiently approximates the kernel matrix and shares its stable and optimal performance.
2) The exhaustive tuning of L_i is eliminated. Only a few trials of l_i are necessary in ML-EKM-ELM, where typically l_i = 0.01n, 0.05n, 0.1n or 0.2n (to be detailed in Section 4). For simplicity, l_i can be set equally in every layer (i.e., l_1 = l_2 = l_3 = ⋯) while maintaining satisfactory performance.
3) EKM is scalable (i.e., l_i can be set to different values) and easy to store and compute, because it can be a very small matrix for the hidden representation (i.e., setting l_i to a small value is sufficient for practical applications), thereby suiting large-scale problems and mobile devices.

Compared to ML-KELM, the benefits of ML-EKM-ELM are:
1) ML-KELM takes O(n²) memory to store the matrix K^(i) for the i-th layer, while ML-EKM-ELM takes O(n·l_i) to store the matrix G̃^(i), which is linear in the data size n for l_i ≪ n.
2) ML-KELM requires O(n³) time to find the inverse of K^(i), while ML-EKM-ELM takes only O(n·l_i²) time for the inversion related to G̃^(i), which saves substantial computing resources.
3) Benefiting from the above two advantages, much smaller matrices can be constructed for feasible storage and execution on mobile devices.

The rest of this paper is organized as follows.
Section 2 reviews the related works, including EKM, ML-ELM, and ML-KELM. Section 3 presents the proposed ML-EKM-ELM. In Section 4, ML-EKM-ELM is compared with ML-KELM and other existing deep neural networks over benchmark data. Finally, conclusions are given in Section 5.
2. Related works

In this section, we briefly describe EKM, ML-ELM, and ML-KELM. All three techniques are necessary for developing our proposed ML-EKM-ELM.

2.1 Empirical kernel map (EKM)

Let X = [x_1, …, x_n]^T ∈ R^{n×d} be an input data matrix that contains n data points in R^d as its rows. In kernel learning, the inner products in feature space can be computed by applying a kernel function κ(·, ·, σ) to the input data:

K_{u,v} = κ(x_u, x_v, σ) = ⟨φ(x_u), φ(x_v)⟩,  u, v = 1, …, n    (1)

where σ is an adjustable kernel parameter, K ∈ R^{n×n} is the kernel matrix, and φ(x) is the kernel-induced feature map [19].
The kernel matrix K can be decomposed as K = GG^T, where G = [φ(x_1), φ(x_2), …, φ(x_n)]^T ∈ R^{n×D} (D being the rank of K) is called the empirical kernel map (EKM), a matrix containing the n data points in R^D as its rows. Since the dimensionality D of G is usually large, an approximate (and much smaller) EKM G̃ ∈ R^{n×l} is practically computed from a low-rank approximation of K using the Nyström method, where l ≪ n; specifically, K ≈ G̃G̃^T. The Nyström method works by selecting a small subset of the training data (e.g., l randomly selected samples), referred to as landmark points, and constructing a small subset of K by computing the kernel similarities between the input data points and the landmark points. The Nyström method then operates on this small subset of K and determines an appropriate EKM G̃ of l dimensions.

Let V = [v_1, …, v_l]^T ∈ R^{l×d} be the set of l landmark points in R^d selected from X. The Nyström method first generates two small matrices K_nl ∈ R^{n×l} and K_ll ∈ R^{l×l}, where (K_nl)_{u,v} = κ(x_u, v_v, σ) and (K_ll)_{u,v} = κ(v_u, v_v, σ). Both K_nl and K_ll are sub-matrices of K. Next, K_nl and K_ll are used to approximate the kernel matrix K:

K ≈ K̃ = K_nl K_ll^+ K_nl^T    (2)

where K_ll^+ denotes the pseudoinverse of K_ll. Applying eigen-decomposition to K_ll gives K_ll = U_l Σ_l U_l^T, where Σ_l ∈ R^{l×l} and U_l ∈ R^{l×l} contain the l eigenvalues and the corresponding eigenvectors of K_ll, respectively. Then the approximation in (2) can be expressed as:

K ≈ K̃ = G̃G̃^T    (3)

and

G̃ = K_nl U_l Σ_l^{−1/2} ∈ R^{n×l}    (4)

where G̃ serves as an approximate EKM, and the rows of G̃ are known as virtual samples [19].

The Nyström method takes O(nld + l³ + nl²) time to generate G̃. Specifically, it takes O(nld) time to form the matrices K_nl and K_ll, O(l³) time to apply eigen-decomposition to K_ll, and O(nl²) time to compute the product K_nl U_l Σ_l^{−1/2}. For l ≪ n, the computational cost to construct G̃ is only linear in the
data set size n.

2.2 Multilayer extreme learning machine (ML-ELM)
Referring to Fig. 1, let X^(i) = [x_1^(i), …, x_n^(i)]^T, where x_k^(i) is the i-th data representation of input x_k, for k = 1 to n. Let H^(i) be the i-th hidden-layer output matrix with respect to X^(i). Then the i-th transformation matrix Q^(i) can be learned from

H^(i) Q^(i) = X^(i)    (5)

where
H^(i) = [ g(a_1^(i), b_1^(i), x_1^(i))   ⋯   g(a_{L_i}^(i), b_{L_i}^(i), x_1^(i))
              ⋮                                    ⋮
          g(a_1^(i), b_1^(i), x_n^(i))   ⋯   g(a_{L_i}^(i), b_{L_i}^(i), x_n^(i)) ]    (6)

and g(a^(i), b^(i), x^(i)) = g_i(a^(i)·x^(i) + b^(i)), where g_i is the activation function of the i-th layer, and both the input weights a^(i) and biases b^(i) are randomly generated in the i-th layer. Q^(i) can be calculated by

Q^(i) = (H^(i))^T (I_n/C + H^(i)(H^(i))^T)^{−1} X^(i),            if n ≤ L_i
Q^(i) = (I_{L_i}/C + (H^(i))^T H^(i))^{−1} (H^(i))^T X^(i),       if n > L_i    (7)

where I_{L_i} and I_n represent identity matrices of dimension L_i and n, respectively, and the user-specified C is the regularization parameter used in the i-th layer. In (7), Q^(i) is used for representation learning: by multiplying X^(i) with Q^(i), a new data representation X^(i+1) is obtained, as shown in Fig. 1(b):

X^(i+1) = g_i(X^(i)(Q^(i))^T)    (8)

The final data representation of X^(1), namely X_final, is obtained after the learning procedure in Fig. 1(d) is done. Then ML-ELM directly uses X_final as the hidden layer to calculate the output weight β:

X_final β = T    (9)

where T = [t_1, …, t_n]^T, t_k ∈ R^c is a one-hot output vector, and c is the number of classes. The weight matrix β can be solved by

β = (X_final)^T (I_n/C + X_final(X_final)^T)^{−1} T,              if n ≤ L_final
β = (I_{L_final}/C + (X_final)^T X_final)^{−1} (X_final)^T T,     if n > L_final    (10)
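As a minimal sketch, one ELM-AE step using the n > L branch of (7) and the update (8) can be written directly in NumPy. The tanh activation and all names here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def elm_ae_transform(X, L, C=2**25, rng=np.random.default_rng(0)):
    """One ELM-AE step: random hidden layer H (Eq. (6)), then Q (Eq. (7), n > L)."""
    n, d = X.shape
    A = rng.standard_normal((d, L))          # random input weights a^(i)
    b = rng.standard_normal(L)               # random biases b^(i)
    H = np.tanh(X @ A + b)                   # hidden-layer output matrix
    # Q = (I_L / C + H^T H)^{-1} H^T X  -- ridge-style regularized least squares
    Q = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ X)
    X_next = np.tanh(X @ Q.T)                # new data representation, Eq. (8)
    return X_next, Q

# Stacking two ELM-AEs, as in ML-ELM's representation learning
X = np.random.default_rng(1).standard_normal((300, 10))
X2, Q1 = elm_ae_transform(X, L=20)
X3, Q2 = elm_ae_transform(X2, L=15)
```

Note that each layer only requires solving an L × L linear system here, which is exactly why the choice of L drives both accuracy and cost.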
2.3 Multilayer kernel extreme learning machine (ML-KELM)

In [11], ML-KELM was developed by integrating kernel learning into ML-ELM to achieve high generalization with less user intervention. ML-KELM involves two steps: 1) unsupervised representation learning by stacking kernel versions of ELM-AE, namely KELM-AE; 2) supervised feature classification using a kernel version of ELM (i.e., K-ELM).
Fig. 2. The architecture of the i-th KELM-AE [11], in which the hidden layer is encoded in the form of a kernel matrix K^(i).
Identical to ELM-AE, KELM-AE learns the transformation Q^(i) from the hidden layer to the output layer. As shown in Fig. 2, the kernel matrix K^(i) = [k^(i)(x_1^(i)), …, k^(i)(x_n^(i))]^T is first obtained by applying the kernel function κ^(i)(x_u^(i), x_v^(i), σ_i) to the input matrix X^(i), and then the i-th transformation matrix Q^(i) in KELM-AE is learned similarly to ELM-AE in (5):

K^(i) Q^(i) = X^(i)    (11)

Similar to (7), Q^(i) in (11) is obtained by

Q^(i) = (I_n/C + K^(i))^{−1} X^(i)    (12)

The data representation X^(i+1) is calculated similarly to (8):

X^(i+1) = g_i(X^(i)(Q^(i))^T)    (13)
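A minimal sketch of one KELM-AE step per (12)-(13), assuming an RBF kernel and an illustrative `rbf_kernel` helper (the kernel choice is not fixed by the equations themselves):

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    # Pairwise RBF similarities: exp(-||a - b||^2 / (2 * sigma2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kelm_ae_step(X, sigma2, C=2**25):
    """One KELM-AE layer: Q = (I_n/C + K)^{-1} X (Eq. (12)), then Eq. (13)."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma2)                  # n x n kernel matrix
    Q = np.linalg.solve(np.eye(n) / C + K, X)     # n x d transformation matrix
    X_next = X @ Q.T                              # linear piecewise activation
    return X_next, Q

X = np.random.default_rng(0).standard_normal((150, 8))
X2, Q = kelm_ae_step(X, sigma2=4.0)
```

Note that the new representation has n columns and the solve involves an n × n system: this is precisely the quadratic-memory, cubic-time bottleneck that ML-EKM-ELM targets.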
After the unsupervised representation learning procedure is finished, the final data representation X_final is obtained and used as input to train a K-ELM classifier:

K_final β = T    (14)

where K_final is the kernel matrix defined on X_final. The weight matrix β can be solved by:

β = (I_n/C + K_final)^{−1} T    (15)

Given a set of m test samples Z^(1) = [z_1^(1), …, z_m^(1)]^T ∈ R^{m×d}, in the representation learning stage the data representation Z^(i+1) is obtained by multiplying with the i-th transformation matrix Q^(i):

Z^(i+1) = g_i(Z^(i)(Q^(i))^T)    (16)

Then, the final data representation Z_final = [z_1^final, …, z_m^final]^T is obtained and
used to calculate the test kernel matrix K_m ∈ R^{m×n}:

(K_m)_{u,v} = κ(z_u^final, x_v^final, σ_final)    (17)

where x_v^final is the v-th data point of X_final. Finally, the network's output T̂ is given by

T̂ = K_m β    (18)
Remark: As a special case, when a linear piecewise activation g_i is applied, all the transformation matrices Q^(i) can be unified into a single matrix Q_unified (i.e., Q_unified = Q^(i) ⋯ Q^(1)). For execution, Z_final = Z^(1)(Q_unified)^T is computed directly, so that both the memory storage and execution time issues caused by the deep network can be alleviated.
3 Proposed ML-EKM-ELM
The proposed ML-EKM-ELM is developed by replacing the randomly generated hidden layer in ML-ELM with an approximate EKM G̃^(i) computed from a low-rank approximation of K^(i). In ML-EKM-ELM, EKM versions of ELM-AE (EKM-AE) are stacked for representation learning, followed by a final EKM-based ELM layer for classification.

Fig. 3. The architecture of the i-th EKM-AE, in which the hidden layer is encoded in the form of an approximate EKM G̃^(i).
3.1 EKM-AE

In this section, the details of EKM-AE are discussed. As shown in Fig. 3, the input matrix X^(i) is first mapped into the empirical kernel map G̃^(i), where
G̃^(i) = [g̃^(i)(x_1^(i)), …, g̃^(i)(x_n^(i))]^T. The l_i-dimensional G̃^(i) is calculated (as detailed in Section 2.1) by first generating two small matrices K_nl^(i) ∈ R^{n×l_i} and K_ll^(i) ∈ R^{l_i×l_i} using the randomly selected landmark points V^(i) = [v_1^(i), …, v_{l_i}^(i)]^T from X^(i), such that

(K_nl^(i))_{u,v} = κ(x_u^(i), v_v^(i), σ_i)    (19)

and

(K_ll^(i))_{u,v} = κ(v_u^(i), v_v^(i), σ_i)    (20)

Then, U_l^(i) and Σ_l^(i) are obtained by applying eigen-decomposition to K_ll^(i):

K_ll^(i) = U_l^(i) Σ_l^(i) (U_l^(i))^T    (21)

where Σ_l^(i) ∈ R^{l_i×l_i} and U_l^(i) ∈ R^{l_i×l_i} contain the l_i eigenvalues and the corresponding eigenvectors of K_ll^(i), respectively. Next, U_l^(i), Σ_l^(i) and K_nl^(i) are used to construct G̃^(i):

G̃^(i) = K_nl^(i) P^(i) ∈ R^{n×l_i}    (22)

where P^(i) = U_l^(i) (Σ_l^(i))^{−1/2} ∈ R^{l_i×l_i} is the mapping matrix of the i-th layer. Then, the rank-l_i approximation of K^(i) can be expressed as K^(i) ≈ G̃^(i)(G̃^(i))^T. In practice, it is preferable to replace the hidden layer in ELM-AE by G̃^(i) rather than K^(i), because G̃^(i) is much smaller than K^(i) while retaining most of its discriminant information. EKM is detailed in Algorithm 1.

Algorithm 1: EKM for the i-th layer
Input: input matrix X^(i), kernel function κ(·,·,·), kernel parameter σ_i, landmark set size l_i
Output: mapping matrix P^(i), empirical kernel map G̃^(i)
Step 1: Randomly select l_i landmark points V^(i) = [v_1^(i), …, v_{l_i}^(i)]^T from X^(i)
Step 2: Generate the kernel matrices K_nl^(i) ∈ R^{n×l_i} and K_ll^(i) ∈ R^{l_i×l_i}, where (K_nl^(i))_{u,v} = κ(x_u^(i), v_v^(i), σ_i) and (K_ll^(i))_{u,v} = κ(v_u^(i), v_v^(i), σ_i)
Step 3: Calculate U_l^(i) ∈ R^{l_i×l_i} and Σ_l^(i) ∈ R^{l_i×l_i} by applying eigen-decomposition: K_ll^(i) = U_l^(i) Σ_l^(i) (U_l^(i))^T
Step 4: Calculate the mapping matrix P^(i) = U_l^(i) (Σ_l^(i))^{−1/2}
Step 5: Calculate the empirical kernel map G̃^(i) = K_nl^(i) P^(i)
Return P^(i), G̃^(i)

Next, the i-th transformation matrix Q^(i) in EKM-AE is learned similarly to ELM-AE in (5):

G̃^(i) Q^(i) = X^(i)    (23)

Q^(i) can be solved by

Q^(i) = (I_{l_i}/C + (G̃^(i))^T G̃^(i))^{−1} (G̃^(i))^T X^(i)    (24)

Compared to (12), in which KELM-AE needs to invert a matrix K^(i) of size n × n, EKM-AE only needs to invert a much smaller matrix (G̃^(i))^T G̃^(i) of size just l_i × l_i, with l_i ≪ n. The data representation X^(i+1) is calculated similarly to (8):

X^(i+1) = g_i(X^(i)(Q^(i))^T)    (25)

EKM-AE is detailed in Algorithm 2.
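Combining the EKM construction with (24)-(25), one EKM-AE step can be sketched as follows. The `rbf_kernel` and `nystrom_ekm` helpers and the linear activation are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma2))

def nystrom_ekm(X, l, sigma2, rng):
    idx = rng.choice(X.shape[0], size=l, replace=False)
    V = X[idx]                                   # landmark points V^(i)
    K_nl = rbf_kernel(X, V, sigma2)
    w, U = np.linalg.eigh(rbf_kernel(V, V, sigma2))
    P = U / np.sqrt(np.maximum(w, 1e-12))        # mapping matrix P^(i)
    return K_nl @ P, P, V                        # G_tilde, P, landmarks

def ekm_ae_step(X, l, sigma2, C=2**25, rng=np.random.default_rng(0)):
    """One EKM-AE layer: Eq. (24) then Eq. (25) with a linear activation."""
    G, P, V = nystrom_ekm(X, l, sigma2, rng)
    # Q = (I_l/C + G^T G)^{-1} G^T X  -- inverts only an l x l matrix
    Q = np.linalg.solve(np.eye(l) / C + G.T @ G, G.T @ X)
    X_next = X @ Q.T
    return X_next, Q, P, V

X = np.random.default_rng(1).standard_normal((200, 6))
X2, Q, P, V = ekm_ae_step(X, l=20, sigma2=3.0)
```

The key contrast with the KELM-AE sketch earlier is the size of the system being solved: l × l here versus n × n there.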
Algorithm 2: EKM-AE for the i-th layer
Input: input matrix X^(i), regularization C, kernel function κ(·,·,·), kernel parameter σ_i, activation function g_i, and landmark set size l_i
Output: new data representation X^(i+1), mapping matrix P^(i), empirical kernel map G̃^(i), and transformation matrix Q^(i)
Step 1: Calculate P^(i), G̃^(i) ← EKM(X^(i), κ(·,·,·), σ_i, l_i)
Step 2: Estimate the transformation matrix Q^(i) ← (I_{l_i}/C + (G̃^(i))^T G̃^(i))^{−1} (G̃^(i))^T X^(i)
Step 3: Calculate the new data representation X^(i+1) ← g_i(X^(i)(Q^(i))^T)
Return X^(i+1), P^(i), G̃^(i), Q^(i)

3.2 Proposed ML-EKM-ELM

ML-EKM-ELM follows the two separate learning procedures of ML-KELM [11]. In the unsupervised representation learning stage, each Q^(i) and X^(i+1) (for the i-th EKM-AE) is obtained using (24) and (25), respectively. In the supervised feature classification stage, the final EKM G̃_final ∈ R^{n×l_s} with respect to X_final is conveyed as input for training:
G̃_final β = T    (26)

The output weight β in (26) is solved by

β = (I_{l_s}/C + (G̃_final)^T G̃_final)^{−1} (G̃_final)^T T    (27)
For the test stage, the data representation Z^(i+1) of the test data Z^(1) is obtained by multiplying with the i-th transformation matrix Q^(i):

Z^(i+1) = g_i(Z^(i)(Q^(i))^T)    (28)
The final data representation Z_final = [z_1^final, …, z_m^final]^T is obtained and used to calculate the approximate test kernel matrix G̃_m ∈ R^{m×l_s}:

G̃_m = K̃_m P_final,  where (K̃_m)_{u,v} = κ(z_u^final, v_v^final, σ_final)    (29)

and v_v^final is the v-th landmark point selected from X_final in the training phase. Finally, the model output is given by

T̂ = G̃_m β    (30)

The procedure of ML-EKM-ELM is detailed in Algorithm 3.
Algorithm 3: Proposed ML-EKM-ELM

Training stage:
Input: input matrix X^(1), output matrix T, regularization C, kernel function κ(·,·,·), kernel parameters σ_i, the number of layers s, activation function g_i, and landmark set sizes l_i
Output: landmark points V_final (selected from X_final), s − 1 transformation matrices Q^(i), output weight β, and final mapping matrix P_final
Step 1: for i = 1 : s − 1 do
    Calculate X^(i+1), Q^(i) ← EKM-AE(X^(i), C, κ(·,·,·), σ_i, g_i, l_i)
Step 2: i ← s
Step 3: Calculate P^(i), G̃^(i) ← EKM(X^(i), κ(·,·,·), σ_i, l_i)
Step 4: X_final ← X^(i), P_final ← P^(i), G̃_final ← G̃^(i)
Step 5: Calculate the output weight β ← (I_{l_s}/C + (G̃_final)^T G̃_final)^{−1} (G̃_final)^T T
Return Q^(i) (i = 1, 2, …, s − 1), β, P_final and V_final

Prediction stage:
Input: test data Z^(1), landmark points V_final, mapping matrix P_final, transformation matrices Q^(i) (i = 1, 2, …, s − 1), and output weight β
Output: T̂
Step 1: for i = 1 : s − 1 do
    Calculate Z^(i+1) = g_i(Z^(i)(Q^(i))^T)
Step 2: i ← s; Z_final ← Z^(i)
Step 3: Calculate G̃_m ∈ R^{m×l_s} using V_final and P_final as in (29)
Step 4: Calculate the network's output T̂ = G̃_m β
Return T̂
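Algorithm 3 can be condensed into a short training/prediction sketch. This is a minimal illustration under several assumptions not fixed by the paper: an RBF kernel, a linear activation, equal landmark sizes per layer, and illustrative function names:

```python
import numpy as np

def rbf_kernel(A, B, s2):
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * s2))

def train_ml_ekm_elm(X, T, l, s2, C=2**25, layers=3, rng=np.random.default_rng(0)):
    Qs = []
    for _ in range(layers - 1):                      # stacked EKM-AEs (Step 1)
        V = X[rng.choice(X.shape[0], l, replace=False)]
        w, U = np.linalg.eigh(rbf_kernel(V, V, s2))
        G = rbf_kernel(X, V, s2) @ (U / np.sqrt(np.maximum(w, 1e-12)))
        Q = np.linalg.solve(np.eye(l) / C + G.T @ G, G.T @ X)   # Eq. (24)
        Qs.append(Q)
        X = X @ Q.T                                  # Eq. (25), linear activation
    # Final supervised layer (Steps 3-5, Eqs. (26)-(27))
    V_final = X[rng.choice(X.shape[0], l, replace=False)]
    w, U = np.linalg.eigh(rbf_kernel(V_final, V_final, s2))
    P_final = U / np.sqrt(np.maximum(w, 1e-12))
    G = rbf_kernel(X, V_final, s2) @ P_final
    beta = np.linalg.solve(np.eye(l) / C + G.T @ G, G.T @ T)
    return Qs, V_final, P_final, beta

def predict_ml_ekm_elm(Z, Qs, V_final, P_final, beta, s2):
    for Q in Qs:                                     # prediction Step 1
        Z = Z @ Q.T
    G_m = rbf_kernel(Z, V_final, s2) @ P_final       # Eq. (29)
    return G_m @ beta                                # Eq. (30)

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 5))
y = (X[:, 0] > 0).astype(int)
T = np.eye(2)[y]                                     # one-hot targets
model = train_ml_ekm_elm(X, T, l=15, s2=4.0)
T_hat = predict_ml_ekm_elm(X, *model, s2=4.0)
```

Note that only the s − 1 small matrices Q^(i), the l landmark points, P_final and β need to be retained for prediction.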
3.3 Memory and computational complexity

In this section, we take an example of three hidden layers to analyze the memory and time complexity of ML-EKM-ELM.

A) Memory complexity

Compared to ML-KELM, which needs O(n²) memory to store K^(i) and Q^(i) for the i-th layer during the training stage, ML-EKM-ELM takes O(n·l_i) memory to store G̃^(i) and Q^(i) for the i-th layer, which is only linear in the data size n. Hence, the memory requirement is significantly reduced from quadratic to linear. For execution, with d features and c classes in the training data and m test samples, ML-KELM needs to store: i) O(mn) memory for the test kernel matrix K_m; ii) O(nd) and O(n²) memory for the transformation matrices Q^(1) and Q^(2), respectively; iii) O(nc) memory for β, resulting in a total of O(n² + n(m + d + c)) memory. For ML-EKM-ELM, we need to store: i) O(m·l_3) memory for the test kernel matrix G̃_m; ii) O(d·l_1) and O(l_1·l_2) memory for the transformation matrices Q^(1) and Q^(2); iii) O(l_3²) memory for P_final; iv) O(l_3·c) memory for β, resulting in a total of O(m·l_3 + d·l_1 + l_1·l_2 + l_3² + l_3·c) memory. Assume l_1 = l_2 = l_3 = pn, where p ≪ 1. Then the memory complexity of ML-EKM-ELM equals O(mpn + dpn + 2p²n² + pnc). Since m, d and c ≪ n in large-scale applications, keeping only the largest complexity terms, the complexities of ML-KELM and ML-EKM-ELM become O(n²) and O(p²n²), respectively. As a special case, when a linear piecewise activation g_i is applied to all transformation matrices Q^(i), only one single
transformation matrix Q_unified (i.e., Q_unified = Q^(2)·Q^(1)) is necessary for execution (detailed in Section 2.3). In this case, the complexities of ML-KELM and ML-EKM-ELM are O(n) and O(pn), respectively.
B) Computational complexity

For the training stage, the computational complexities of ML-KELM and ML-EKM-ELM are shown in Table 1 and Table 2, respectively. In total, ML-KELM takes O(7n³ + 3n²d + n²c), while ML-EKM-ELM takes O((3l_1² + 3l_2² + 3l_3² + 3l_1·l_2 + l_2·l_3 + 3d·l_1 + l_3·c)n + 3(l_1³ + l_2³ + l_3³)). For the test stage, the time complexities of ML-KELM and ML-EKM-ELM are shown in Table 3 and Table 4, respectively. In total, the time complexity of ML-KELM is O(m(nd + 2n² + nc)) and that of ML-EKM-ELM is O(m(d·l_1 + l_1·l_2 + l_2·l_3 + l_3² + l_3·c)). Assume l_1 = l_2 = l_3 = pn, where p ≪ 1; keeping only the largest complexity terms, the training-stage computational complexities of ML-KELM and ML-EKM-ELM are O(n³) and O(p²n³), respectively. Similarly, for the test stage, the computational complexities of ML-KELM and ML-EKM-ELM are O(n²) and O(p²n²), respectively.

Table 1 The computational complexity of ML-KELM in the training stage

                       1st layer    2nd layer    3rd layer
Construct K^(i)        O(n²d)       O(n³)        O(n³)
Invert K^(i)           O(n³)        O(n³)        O(n³)
Calculate Q^(i)        O(n²d)       O(n³)        --
Calculate X^(i+1)      O(n²d)       O(n³)        --
Calculate β            --           --           O(n²c)
Table 2 The computational complexity of ML-EKM-ELM in the training stage

                                  1st layer                 2nd layer                  3rd layer
Construct K_nl^(i) and K_ll^(i)   O(n·l_1·d)                O(n·l_1·l_2)               O(n·l_2·l_3)
Decompose K_ll^(i)                O(l_1³)                   O(l_2³)                    O(l_3³)
Calculate P^(i)                   O(l_1³)                   O(l_2³)                    O(l_3³)
Calculate G̃^(i)                   O(n·l_1²)                 O(n·l_2²)                  O(n·l_3²)
Calculate (G̃^(i))^T G̃^(i)         O(n·l_1²)                 O(n·l_2²)                  O(n·l_3²)
Invert (G̃^(i))^T G̃^(i)            O(l_1³)                   O(l_2³)                    O(l_3³)
Calculate Q^(i)                   O(n·l_1²) + O(n·l_1·d)    O(n·l_2²) + O(n·l_2·l_1)   --
Calculate X^(i+1)                 O(n·d·l_1)                O(n·l_1·l_2)               --
Calculate β                       --                        --                         O(n·l_3²) + O(n·l_3·c)

Table 3 The computational complexity of ML-KELM in the test stage

                   1st layer    2nd layer    3rd layer
Calculate Z^(i)    --           O(m·n·d)     O(m·n²)
Calculate K_m      --           --           O(m·n²)
Calculate T̂        --           --           O(m·n·c)

Table 4 The computational complexity of ML-EKM-ELM in the test stage

                   1st layer    2nd layer     3rd layer
Calculate Z^(i)    --           O(m·d·l_1)    O(m·l_1·l_2)
Calculate G̃_m      --           --            O(m·l_2·l_3)
Calculate T̂        --           --            O(m·l_3²) + O(m·l_3·c)

4 Experiments and results
Extensive experiments were conducted to evaluate the accuracy, time complexity and memory requirements of the proposed ML-EKM-ELM, which is further compared with ML-KELM and other relevant state-of-the-art methods.

4.1 Comparison between ML-EKM-ELM and ML-KELM

ML-EKM-ELM and ML-KELM are evaluated over 12 publicly available benchmark datasets from the UCI Machine Learning Repository [20] and http://openml.org [21]. All datasets are described in Table 5; nine of them were originally used in [11], where the results show that ML-KELM outperforms ML-ELM and H-ELM on average by 2.3% and 4.1%, respectively.

Table 5 Properties of benchmark datasets

Dataset    Instances   Training samples   Test samples   Features   Classes
CMC        1473        1178               295            9          3
Yeast      1484        1187               297            8          10
Madelon    2600        2080               520            500        2
Isolet     3279        2623               656            617        26
Waveform   5000        4000               1000           21         2
HAR1       5974        4779               1195           561        6
Gisette    7000        5600               1400           5000       2
TwoNorm    7400        5920               1480           20         2
USPS       9298        7438               1860           254        10
HAR2       10299       8239               2060           561        6
Sylva      14395       11516              2879           212        2
Adult      19282       15426              3856           122        2
A. Experimental setup

The experiments were carried out in Matlab 2015a under MacOS Sierra 10.12 with an Intel Core i5 at 3.4 GHz and 24 GB RAM. Fivefold cross-validation was used in all experiments for a fair comparison. Following the practice in [11], the RBF kernel κ^(i)(x_u, x_v, σ_i) = exp(−‖x_u − x_v‖₂² / (2σ_i²)) was adopted in all experiments, the activation function g_i was chosen as linear piecewise activation, and the number of hidden layers is 3. ML-KELM has an automatically determined structure of d − n − n − n − c. The kernel parameter σ_i² (i = 1 to 3) in each layer is respectively set as
σ_i² = (β_i / n²) Σ_{u,v=1}^{n} ‖x_u^(i) − x_v^(i)‖₂², where β_i ∈ {10⁻², 10⁻¹, 10⁰, 10¹, 10²}. The regularization parameter C in each layer is set as 2²⁵, because the experiments indicated that a sufficiently large C results in satisfactory performance on most datasets. Therefore, ML-KELM results in a total of 125 parameter combinations (β_1, β_2, β_3, 2²⁵, 2²⁵, 2²⁵). The network structure of ML-EKM-ELM is d − l_1 − l_2 − l_3 − c. For the sake of
simplicity, we simply set l_1 = l_2 = l_3 = pn; we do not attempt to find the optimal network structure of ML-EKM-ELM, because our aim is only to show the effectiveness of ML-EKM-ELM against relevant state-of-the-art methods. Hence, the network structure of ML-EKM-ELM becomes d − pn − pn − pn − c. The kernel parameter σ_i² (i = 1 to 3) in each layer is respectively set as
σ_i² = (β_i / (n·l_i)) Σ_{u=1}^{n} Σ_{v=1}^{l_i} ‖x_u^(i) − v_v^(i)‖₂², where
β_i ∈ {10⁻², 10⁻¹, 10⁰, 10¹, 10²}. The regularization parameter C in each layer is simply set as 2²⁵. Similarly, a total of 125 combinations of (β_1, β_2, β_3, 2²⁵, 2²⁵, 2²⁵) result, which is consistent with ML-KELM. The parameter p is selected only from {0.01, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3}, because our later experiments show that a large p (> 0.2) does not significantly increase test accuracy but leads to a very heavy computation burden.
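The kernel-width heuristics above amount to scaling the mean squared pairwise (or sample-to-landmark) distance by β_i. A small sketch of both variants (function names are illustrative):

```python
import numpy as np

def sigma2_ml_kelm(X, beta):
    # sigma_i^2 = (beta / n^2) * sum over all u,v of ||x_u - x_v||^2
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return beta * d2.sum() / (X.shape[0] ** 2)

def sigma2_ml_ekm_elm(X, V, beta):
    # sigma_i^2 = (beta / (n*l)) * sum over u=1..n, v=1..l of ||x_u - v_v||^2
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    return beta * d2.sum() / (X.shape[0] * V.shape[0])

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
V = X[rng.choice(100, 10, replace=False)]       # landmark points
s_full = sigma2_ml_kelm(X, beta=1.0)
s_land = sigma2_ml_ekm_elm(X, V, beta=1.0)
```

Note the landmark-based variant only touches n·l distances, matching the Nyström setting where the full n × n distance matrix is never formed.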
B. Evaluation and performance analysis

The training time and the test accuracy of ML-EKM-ELM both depend on the selected value of p. For Isolet, USPS, HAR2 and Sylva (selected from Table 5), the training time and test accuracy are plotted as functions of p in Fig. 4.
Fig. 4 Training time and test accuracy with different p (ML-EKM-ELM)
Fig. 4 shows several properties of the proposed work: i) The test accuracy improves as p grows (i.e., as more landmark points are adopted). The reason is that ML-EKM-ELM can preserve more discriminant information when more training samples are selected as landmark points to construct the EKM. ii) The value of p needed to achieve satisfactory accuracy depends on the application dataset. For some datasets (e.g., Sylva), a small p is sufficient, while on some difficult tasks (e.g., Isolet), a larger p is needed for good accuracy. iii) The test accuracy grows slowly once p reaches a certain threshold, which indicates that selecting a certain number of training samples as landmark points is sufficient to preserve most discriminant information in the EKM. iv) Beyond a certain threshold, increasing p only slightly improves the accuracy but incurs a sharp increase in training time. From the experiments, ML-EKM-ELM with p = 0.1 can already achieve satisfactory accuracy on most adopted datasets, which shows that selecting a small portion of the training samples as landmark points should be sufficient to construct an EKM for learning a robust classifier. In practice, the selection of p is a tradeoff between accuracy and the limited resources (i.e., time and memory requirements). As a rule of thumb, we can first set p = 0.1 to test the model performance of ML-EKM-ELM. If a device (e.g., a mobile device) cannot provide enough resources to store and run such a model, we can try a smaller p (e.g., p = 0.05, 0.01). If the accuracy does not satisfy the demand of the application, we can set a larger one (e.g., p = 0.2, 0.3).

The time complexities and memory requirements of ML-EKM-ELM and ML-KELM are listed in Table 6 and Table 7 for comparison. For ML-EKM-ELM, the selected value of p gives a relative accuracy error of less than 1% compared to ML-KELM, i.e., |TA1 − TA2|/TA1 × 100% ≤ 1%, where TA1 and TA2 are the corresponding test accuracies of ML-KELM and ML-EKM-ELM. In Table 6, ML-EKM-ELM achieves a substantial reduction of training time while obtaining accuracy comparable to ML-KELM. For the Adult dataset, the proposed ML-EKM-ELM runs 279 times faster than ML-KELM with only a small accuracy loss of 0.11%. The reason is
CE
ฬ (๐) and invert ๐ ฬ (๐) T ๐ ฬ (๐) , that ML-EKM-ELM only takes O(๐๐๐ 2 ) time to form ๐
AC
while ML-KELM requires O(๐3 ) time to form and invert ๐(๐) . In addition, the test ฬ ๐ , ๐ช (๐) and ๐ are with much smaller time is also substantially reduced because ๐ size in the execution stage. For adult dataset, our proposed ML-EKM-ELM runs 315 times faster than ML-KELM. In Table 7, ML-EKM-ELM achieves a substantial reduction of training memory as ฬ (๐) while well because ML-EKM-ELM requires a storage of ๐ ร ๐๐ matrix ๐ ML-KELM requires a storage of ๐ ร ๐ kernel matrix ๐(๐) . For adult dataset, the training memory requirement of ML-EKM-ELM is 34.9 MB while ML-KELM takes 3348 MB. In addition, the test memory requirement of ML-EKM-ELM is also ฬ ๐ , ๐ช (๐) and substantially reduced because only some much smaller matrices (such as ๐ ๐ ) are necessary in the execution stage. For adult dataset, the test memory requirement of ML-EKM-ELM is 4.69 MB while ML-KELM takes up to 432 MB. In
both analyses, a conclusion can be drawn that ML-EKM-ELM has much lower computational complexity and memory requirements than ML-KELM.

Table 6 The time complexities of ML-EKM-ELM and ML-KELM

| Data Set | p | TA1 (%) | TA2 (%) | Tr1 (s) | Tr2 (s) | Tr1/Tr2 | Te1 (s) | Te2 (s) | Te1/Te2 |
|---|---|---|---|---|---|---|---|---|---|
| CMC | 0.05 | 55.48±2.5 | 55.29±2.5 | 0.261 | 0.014 | 18.6 | 0.040 | 0.0016 | 25.0 |
| Yeast | 0.03 | 59.96±2. | 59.39±2.5 | 0.273 | 0.010 | 27.3 | 0.038 | 0.0013 | 29.2 |
| Madelon | 0.2 | 79.13±1. | 79.46±2.1 | 1.213 | 0.372 | 3.3 | 0.162 | 0.0227 | 7.1 |
| Isolet | 0.2 | 94.76±0. | 94.00±0. | 2.056 | 0.648 | 3.2 | 0.239 | 0.0347 | 6.9 |
| Waveform | 0.01 | 86.84±1.0 | 86.72±1.1 | 5.113 | 0.025 | 204.5 | 0.554 | 0.0025 | 221.6 |
| HAR1 | 0.1 | 98.75±0.2 | 97.94±0.3 | 8.527 | 0.924 | 9.2 | 0.803 | 0.0564 | 14.2 |
| Gisette | 0.05 | 97.89±0.5 | 97.47±0.5 | 266.6 | 19.87 | 13.4 | 23.77 | 0.8512 | 27.9 |
| TwoNorm | 0.01 | 97.80±0. | 97.82±0.3 | 13.44 | 0.050 | 268.8 | 1.212 | 0.0043 | 281.9 |
| USPS | 0.1 | 98.37±0.3 | 97.53±0. | 26.32 | 2.400 | 11.0 | 2.312 | 0.1928 | 12.0 |
| HAR2 | 0.07 | 99.11±0.2 | 98.23±0.3 | 35.51 | 1.881 | 18.9 | 3.057 | 0.1444 | 21.2 |
| Sylva | 0.01 | 99.40±0.1 | 98.47±0.2 | 83.69 | 0.354 | 236.4 | 6.844 | 0.0258 | 265.3 |
| Adult | 0.01 | 84.77±0.5 | 84.66±0. | 212.9 | 0.763 | 279.0 | 15.50 | 0.0492 | 315.0 |

*TA1: the test accuracy of ML-KELM; TA2: the test accuracy of ML-EKM-ELM; Tr1: the training time of ML-KELM; Tr2: the training time of ML-EKM-ELM; Te1: the test time of ML-KELM; Te2: the test time of ML-EKM-ELM
Table 7 The memory requirements of ML-EKM-ELM and ML-KELM

| Data Set | p | TA1 (%) | TA2 (%) | TrM1 (MB) | TrM2 (MB) | TrM1/TrM2 | TeM1 (MB) | TeM2 (MB) | TeM1/TeM2 |
|---|---|---|---|---|---|---|---|---|---|
| CMC | 0.05 | 55.48±2.5 | 55.29±2.5 | 19.4 | 1.03 | 18.8 | 2.52 | 0.18 | 14.0 |
| Yeast | 0.03 | 59.96±2. | 59.39±2.5 | 20.2 | 0.62 | 32.6 | 2.67 | 0.10 | 26.7 |
| Madelon | 0.2 | 79.13±1. | 79.46±2.1 | 67.7 | 16.6 | 4.1 | 15.2 | 5.59 | 2.7 |
| Isolet | 0.2 | 94.76±0. | 94.00±0. | 111 | 26.0 | 4.3 | 24.7 | 8.94 | 2.8 |
| Waveform | 0.01 | 86.84±1.0 | 86.72±1.1 | 229 | 2.38 | 96.2 | 29.4 | 0.33 | 89.1 |
| HAR1 | 0.1 | 98.75±0.2 | 97.94±0.3 | 340 | 36.3 | 9.4 | 59.5 | 9.31 | 6.4 |
| Gisette | 0.05 | 97.89±0.5 | 97.47±0.5 | 650 | 33.8 | 19.2 | 259 | 14.1 | 18.4 |
| TwoNorm | 0.01 | 97.80±0. | 97.82±0.3 | 503 | 5.16 | 97.5 | 63.8 | 0.69 | 92.5 |
| USPS | 0.1 | 98.37±0.3 | 97.53±0. | 807 | 84.6 | 9.5 | 113 | 19.4 | 5.8 |
| HAR2 | 0.07 | 99.11±0.2 | 98.23±0.3 | 998 | 71.2 | 14.0 | 154 | 15.7 | 9.8 |
| Sylva | 0.01 | 99.40±0.1 | 98.47±0.2 | 1922 | 19.7 | 97.6 | 256 | 2.72 | 94.1 |
| Adult | 0.01 | 84.77±0.5 | 84.66±0. | 3348 | 34.9 | 95.9 | 432 | 4.69 | 92.1 |

*TA1: the test accuracy of ML-KELM; TA2: the test accuracy of ML-EKM-ELM; TrM1: the training memory of ML-KELM; TrM2: the training memory of ML-EKM-ELM; TeM1: the test memory of ML-KELM; TeM2: the test memory of ML-EKM-ELM
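To make the source of these time and memory savings concrete, the EKM construction and the small ridge solve it enables can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: it assumes an RBF kernel, uniform landmark sampling, and toy data, and the names (`rbf`, `ekm`) are ours.

```python
import numpy as np

def rbf(A, B, sigma):
    # Gaussian kernel built from pairwise squared distances
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def ekm(X, landmarks, sigma):
    """Approximate empirical kernel map: Phi = K(X, L) U S^(-1/2) (Nystrom-style)."""
    S, U = np.linalg.eigh(rbf(landmarks, landmarks, sigma))  # decompose the small m x m block
    keep = S > 1e-10                                         # drop near-zero eigenvalues
    return rbf(X, landmarks, sigma) @ U[:, keep] / np.sqrt(S[keep])

rng = np.random.default_rng(0)
n, d, p, C = 1000, 20, 0.1, 2.0 ** 25
X = rng.standard_normal((n, d))
T = np.eye(2)[rng.integers(0, 2, n)]            # one-hot targets for a toy 2-class task
m = int(p * n)                                  # np = p * n landmark points
L = X[rng.choice(n, m, replace=False)]          # uniform landmark sampling
Phi = ekm(X, L, sigma=2.0 * d)                  # n x np hidden layer instead of n x n kernel

# Ridge solution beta = (Phi^T Phi + I/C)^(-1) Phi^T T: costs O(n * np^2), not O(n^3)
beta = np.linalg.solve(Phi.T @ Phi + np.eye(Phi.shape[1]) / C, Phi.T @ T)
print(Phi.shape, beta.shape)
```

Only the np × np matrix Φ^T Φ is ever inverted, and only Φ (n × np) is stored, which is where the ratios in Tables 6 and 7 come from.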
C. Evaluation on the Effect of Multiple Hidden Layers

To further illustrate the effect of multiple hidden layers in ML-EKM-ELM, the highly nonlinear dataset Madelon is employed, which has 2600 samples with 500 features. Madelon was originally proposed in the NIPS 2003 feature selection
challenge [22]. Fivefold cross-validation is used in the following experiment, and the input attributes are normalized into the range [-1, 1]. In highly nonlinear applications, ML-EKM-ELM may require more hidden layers to obtain satisfactory performance. In this experiment, the number of hidden layers is varied from 2 to 5. The test accuracy of ML-EKM-ELM under different numbers of hidden layers is shown in Table 8, with ML-KELM included for comparison. From Table 8, it is evident that: i) With 2 layers, ML-KELM outperforms ML-EKM-ELM in accuracy; however, the accuracy of ML-EKM-ELM improves as p grows. The reason is that a shallow model needs to map the highly nonlinear input data to a very high dimensional space in which the data become linearly separable, and therefore usually requires a large number of hidden neurons to obtain good accuracy on a highly nonlinear problem. ii) When p is large enough (p ≥ 0.05 for Madelon), ML-EKM-ELM with a sufficient number of hidden layers performs comparably to ML-KELM in terms of accuracy. iii) When the number of hidden layers reaches a certain threshold (4 hidden layers on Madelon), the accuracy remains almost constant or even starts to decline, which reveals that a kernel-based model may not need a very deep structure to achieve satisfactory performance, owing to its strong approximation ability for nonlinear problems; otherwise, overfitting may even occur.

Table 8 Evaluation on the effect of multiple hidden layers over Madelon

| Method | 2 layers | 3 layers | 4 layers | 5 layers |
|---|---|---|---|---|
| ML-KELM | 72.38±2.1 | 79.13±1. | **79.73±.7** | 78.55±1.7 |
| ML-EKM-ELM (p=0.2) | 68.77±2.2 | 79.46±2.1 | **79.92±1.9** | 79.42±2.1 |
| ML-EKM-ELM (p=0.1) | 67.86±1. | 75.43±2.1 | **80.51±2.** | 80.47±1.8 |
| ML-EKM-ELM (p=0.05) | 65.90±2. | 74.68±1. | 79.56±1. | **79.62±2.4** |
| ML-EKM-ELM (p=0.01) | 64.11±2.0 | 66.06±3.0 | 67.35±2. | **67.37±2.8** |

The best result of each row is shown in bold.
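The layer-stacking behaviour examined in Table 8 can be sketched as follows: each hidden layer rebuilds an EKM on the previous layer's output. This is an illustrative NumPy sketch under an assumed RBF kernel and uniform landmark sampling, not the authors' implementation; the kernel width per layer is an arbitrary placeholder.

```python
import numpy as np

def rbf(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def ekm_layer(H, p, sigma, rng):
    """Build one EKM hidden layer from the previous layer's output H."""
    m = max(1, int(p * H.shape[0]))
    L = H[rng.choice(H.shape[0], m, replace=False)]   # this layer's landmark points
    S, U = np.linalg.eigh(rbf(L, L, sigma))
    keep = S > 1e-10
    return rbf(H, L, sigma) @ U[:, keep] / np.sqrt(S[keep])

rng = np.random.default_rng(0)
H = rng.standard_normal((600, 30))     # toy input data
for _ in range(3):                     # stack 3 EKM layers (Table 8 tries 2 to 5)
    H = ekm_layer(H, p=0.1, sigma=2.0 * H.shape[1], rng=rng)
print(H.shape)
```

Each layer keeps the sample count fixed while its width stays bounded by the number of landmarks, so depth adds nonlinearity without the width blow-up a shallow model would need.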
4.2 Comparison with State-of-the-Art Methods on NORB Dataset

In this experiment, a more complicated dataset, the NYU Object Recognition Benchmark (NORB) [23], was evaluated to further confirm the effectiveness of the proposed ML-EKM-ELM. NORB contains 48,600 images of 3D toy objects from five distinct categories: animals, humans, airplanes, trucks, and cars, as shown in Fig. 5. Each image is captured from different viewpoints and under various lighting conditions. The training set contains 24,300 stereo images of size 2 × 32 × 32 (i.e., 2048 dimensions), and another 24,300 images are used for testing. We adopted the pre-fixed train/test split and the preprocessing method used in [12] to make the performance of our proposed work comparable to the state-of-the-art relevant algorithms.
Fig. 5 (a) Examples of training images; (b) examples of test images [23]
Existing mainstream methods are compared to verify the effectiveness of the proposed ML-EKM-ELM, including BP-based algorithms (Stacked Autoencoders (SAE) [2], Stacked Denoising Autoencoders (SDA) [24], Deep Belief Networks (DBN) [25], Deep Boltzmann Machines (DBM) [26], and the multilayer perceptron (MLP) [27]) and ELM-based training algorithms (ML-ELM [10] and H-ELM [12]). For the BP-based algorithms (SAE, SDA, DBN, DBM, and MLP), the initial learning rate is set to 0.1 and the decay rate is set to 0.95 for each learning epoch. For the ELM-based training algorithms (ML-ELM and H-ELM with 3 layers), the regularization parameter C of ML-ELM in each layer is respectively set to 10^-1, 10^3, and 10^8, while the regularization parameters of H-ELM are respectively set to 1, 1, and 2^30. More detailed parameter settings can be found in [12]. Due to the huge computational cost, the experimental results of SAE, SDA, DBN, DBM, MLP, and ML-ELM are directly cited from [12]. Only the H-ELM network structure 2048-3000-3000-15000-5 is tested in our experiment, because [12] has demonstrated that this structure obtains optimal performance on the NORB dataset. ML-KELM has an automatically determined structure of 2048-n-n-n-5 (with n = 24300), and for the sake of simplicity, ML-EKM-ELM was tested with the network structure 2048-np-np-np-5. The experiment in Section 4.1 showed that adopting 10% of the training samples may be sufficient to construct an EKM for learning a robust ML-EKM-ELM; therefore p = 0.1 was adopted in our experiments, and p = 0.01 and 0.05 are included as well for comparison. The kernel parameter σ_i (i = 1 to 3) in each layer is respectively set as σ_i = (β_i/(n·m)) Σ_{j=1}^{n} Σ_{k=1}^{m} ||x_j^(i) − v_k^(i)||_2^2, where β_i ∈ {10^-2, 10^-1, 10^0, 10^1, 10^2} and the v_k^(i) are the m landmark points. The regularization parameter C in the first two layers is set to 2^25, and in the final layer it is selected from {2^5, 2^10, 2^15, 2^20, 2^25}. A total of 625 combinations of (σ_1, σ_2, σ_3, 2^25, 2^25, C_3) are tested in our experiments. All the experiments are repeated 100 times under the best parameters, and the average accuracy and maximum accuracy are reported. The average training time for each model with fixed parameters is shown in Table 9. For ML-KELM, practically, it may take several weeks to thoroughly try out the numerous parameter combinations for the best model. In addition to the substantial training time, the test time of ML-KELM is 427.48 s and its test memory requirement is 4,577 MB, which restricts the application of ML-KELM on devices with only limited resources such as mobile phones.

The proposed ML-EKM-ELM overcomes the time and memory limitations of ML-KELM while maintaining comparable accuracy. Under accuracy comparable to H-ELM (with structure 2048-3000-3000-15000-5), the training time and test time of ML-EKM-ELM with p = 0.05 (corresponding structure 2048-1215-1215-1215-5) are surprisingly fast, i.e., 17.46 s and 4.96 s respectively, while the training memory and test memory are only 453 MB and 250 MB respectively. The reason is that ML-EKM-ELM with p = 0.05 employs a much smaller EKM, which is easy to store and compute. ML-EKM-ELM with p = 0.1 (corresponding structure 2048-2430-2430-2430-5) performs better than ML-ELM and H-ELM in terms of accuracy, by up to 4% and 2% respectively, with a smaller variance. ML-ELM and H-ELM are suboptimal because the input weights W^(i) and biases b^(i) in every layer are randomly generated; in some cases, poorly generated W^(i) and b^(i) degrade their generalization. Although a large number of hidden nodes can improve the stability (i.e., lower the standard deviation (STD)) of the accuracy of H-ELM, the STD of H-ELM with 15000 hidden nodes is still 0.40%, as shown in Table 9. Instead, ML-EKM-ELM uses training samples as landmark points (in practice, adopting a small portion of the training samples is sufficient) to generate features, which contain more discriminant information than randomly generated features [28], and thereby its accuracy is higher and more stable.
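The kernel-width heuristic above can be computed directly. The snippet below is an illustrative reconstruction: it assumes the (garbled) expression denotes the mean squared distance between the n samples and the m landmark points, scaled by β_i, and all names are ours.

```python
import numpy as np

def kernel_width(X, V, beta):
    """sigma = beta/(n*m) * sum_j sum_k ||x_j - v_k||^2, i.e. beta times the
    mean squared sample-to-landmark distance (assumed reading of the heuristic)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1)   # n x m squared distances
    return beta * d2.mean()                               # mean == sum / (n*m)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 8))
V = X[rng.choice(500, 50, replace=False)]                 # m = 50 landmark points
for beta in (1e-2, 1e-1, 1e0, 1e1, 1e2):                  # the beta grid used above
    print(beta, kernel_width(X, V, beta))
```

Sweeping β over the grid then scales one data-driven base width up or down, which is cheaper than searching σ over an unbounded range.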
Although only 2,430 features are used in ML-EKM-ELM (p = 0.1 means selecting 2,430 of the 24,300 training samples as landmark points), the STD of ML-EKM-ELM drops to 0.24%, compared to 0.40% for H-ELM with 15,000 features. Note that the accuracy of ML-EKM-ELM can be further improved, and becomes more stable, as p grows. Moreover, ML-ELM and H-ELM need to exhaustively try values of L_i from several hundred to tens of thousands because their accuracies are very sensitive to L_i. In contrast, ML-EKM-ELM needs only a few trials of p in our experiment; the exhaustive and tedious tuning of L_i is therefore eliminated in ML-EKM-ELM.

Table 9 Comparison on NORB dataset

| Method | Accuracy (%) | Training time (s) | Test time (s) | Training memory (MB) | Test memory (MB) |
|---|---|---|---|---|---|
| SAE^a [2] | 86.28 | 60504.34 | N/A | N/A | N/A |
| SDA^a [24] | 87.62 | 65747.69 | N/A | N/A | N/A |
| DBN^a [25] | 88.47 | 87280.42 | N/A | N/A | N/A |
| DBM^a [26] | 89.65 | 182183.53 | N/A | N/A | N/A |
| MLP-BP^a [27] | 84.20 | 34005.47 | N/A | N/A | N/A |
| ML-ELM^a [10] | 88.91 | 775.29 | N/A | N/A | N/A |
| H-ELM^a [12] | 91.28 | 432.19 | N/A | N/A | N/A |
| H-ELM^b [12] | Maximum: 91.56; Mean: 90.47±0.40 | 155.19 | 26.48 | 3604 | 3082 |
| ML-EKM-ELM^b (p=0.01) | Maximum: 87.98; Mean: 86.51±0.7 | 3.07 | 1.35 | 91 | 46 |
| ML-EKM-ELM^b (p=0.05) | Maximum: 92.18; Mean: 91.50±0.30 | 17.46 | 4.96 | 453 | 250 |
| ML-EKM-ELM^b (p=0.1) | Maximum: 93.17; Mean: 92.38±0.24 | 60.33 | 11.62 | 932 | 545 |
| ML-KELM^b | 93.52* | 965.66 | 427.48 | 8785 | 4577 |

a. Results from [12]; machine configuration: laptop, Intel i7 2.4 GHz CPU, 16 GB DDR3 RAM, Windows 7, MATLAB R2013b.
b. Results from our computer; machine configuration: macOS Sierra 10.12, Intel Core i5 3.4 GHz, 24 GB RAM, MATLAB R2015a.
N/A: [12] did not report test time, training memory, or test memory.
* The standard deviation (STD) of ML-KELM is not reported because ML-KELM has no random input parameters (W^(i) and b^(i)) and does not randomly select training samples as landmark points.
4.3 Comparison with State-of-the-Art Methods on 20Newsgroups Dataset

To test the performance of ML-EKM-ELM on text categorization, 20Newsgroups² was used in our experiment, which contains 18846 documents with 26214 distinct words. This dataset has 20 categories, each with around 1000 documents. In our experiment, 11314 documents (60%) are used as training data and 7532 documents (40%) as test data. Due to the huge dimensionality of 20Newsgroups (up to 26214 features), a large number of hidden nodes is needed for the autoencoder to keep enough discriminant information. Therefore, for ML-ELM and H-ELM, the numbers of hidden nodes L_i (i = 1 to 3) are respectively set to 500k (k = 1, 2, 3, …, 10). For the proposed ML-EKM-ELM, we simply set p_1 = p_2 = p_3 = p with p ∈ {0.1, 0.2, 0.3}. For ML-ELM, ML-KELM, and ML-EKM-ELM, the regularization parameter C in each layer is simply set to 2^25, while the regularization parameters of H-ELM are respectively set to 1, 1, and 2^30 as recommended by [12]. All the experiments are repeated 100 times under the best parameters, and the average accuracy and maximum accuracy are reported in Table 10.

Similar to the NORB dataset, ML-EKM-ELM (p = 0.2) takes fewer hidden nodes (2262 vs. 3500) in each layer to obtain accuracy comparable to ML-ELM and H-ELM, because the ML-EKM-ELM hidden layer contains more discriminant information than the randomly generated hidden layers of ML-ELM and H-ELM. Therefore, ML-EKM-ELM outperforms ML-ELM and H-ELM in both computation time and memory requirement at similar accuracy. Furthermore, ML-EKM-ELM (p = 0.2) significantly outperforms ML-ELM,
² 20Newsgroups can be downloaded at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html
ML-KELM, and H-ELM in terms of training time (40.24 < 186.69 < 196.01 < 316.95). The reason is that dealing with high-dimensional features (up to 26,214 input features and thousands of hidden nodes in each layer for 20Newsgroups) poses a big challenge for ML-ELM, ML-KELM, and H-ELM. In detail, ML-ELM needs to ensure the orthogonality of the large randomly generated matrices (i.e., input weights and biases) in each layer for better generalization ability; ML-KELM needs to calculate the entire kernel matrix Ω^(i) in each layer from the big input matrix X^(i); and H-ELM needs to apply the fast iterative shrinkage-thresholding algorithm (FISTA) to the big input matrices X^(i) and Y^(i) for its sparse autoencoder. In contrast, ML-EKM-ELM avoids all of these operations, maintaining a very attractive training time.

Table 10 Comparison on 20Newsgroups

| Method | Network Structure | Accuracy (%) | Training time (s) | Test time (s) | Training memory (MB) | Test memory (MB) |
|---|---|---|---|---|---|---|
| ML-ELM | 26214-3500-3500-3500-20 | Maximum: 85.38; Mean: 84.85±0.1 | 186.69 | 5.08 | 1413 | 1024 |
| H-ELM | 26214-3500-3500-4000-20 | Maximum: 84.05; Mean: 83.47±0.25 | 316.95 | 6.55 | 1454 | 1065 |
| ML-EKM-ELM (p=0.1)** | 26214-1131-1131-1131-20 | Maximum: 85.48; Mean: 84.93±0.20 | 10.14 | 1.74 | 402 | 290 |
| ML-EKM-ELM (p=0.2)** | 26214-2262-2262-2262-20 | Maximum: 85.52; Mean: 85.04±0.18 | 40.24 | 4.31 | 844 | 617 |
| ML-EKM-ELM (p=0.3)** | 26214-3394-3394-3394-20 | Maximum: 85.95; Mean: 85.48±0.17 | 104.10 | 7.16 | 1320 | 982 |
| ML-KELM | 26214-11314-11314-11314-20 | 86.47* | 196.01 | 30.27 | 3946 | 2734 |

* The standard deviation (STD) of ML-KELM is not reported because ML-KELM has no random input parameters (W^(i) and b^(i)) and does not randomly select training samples as landmark points.
** ML-EKM-ELM with p = 0.1 yields 1131 (i.e., np = 0.1 × 11314 ≈ 1131) hidden nodes in every layer; with p = 0.2 it yields 2262 (np = 0.2 × 11314 ≈ 2262); with p = 0.3 it yields 3394 (np = 0.3 × 11314 ≈ 3394).

5 Conclusion

An efficient kernel-based deep learning algorithm called ML-EKM-ELM is proposed, which replaces the randomly generated hidden layer in ML-ELM with an approximate empirical kernel map (EKM) of np dimensions (where np = pn, p ∈ {0.01, 0.05, 0.1, 0.2} typically, and n is the data size). In this way, ML-EKM-ELM resolves the three practical issues of ML-ELM and H-ELM. 1) ML-EKM-ELM is based on kernel learning, so no random projection is necessary; as a result, optimal and stable performance can be achieved under a fixed set of parameters. In the experiments, ML-EKM-ELM with p = 0.1 is respectively up to 2% and 4% more accurate than H-ELM and ML-ELM on NORB. 2) ML-EKM-ELM does not need to exhaustively tune the parameters L_i for all layers as in ML-ELM and H-ELM. Only a few trials of p_i are necessary in
ML-EKM-ELM. For simplicity, p_i can be set equal in every layer (i.e., p_1 = p_2 = p_3 = ⋯) while maintaining satisfactory performance. 3) Both the computation time and the memory requirements can be significantly reduced in ML-EKM-ELM. For NORB, under accuracy comparable to H-ELM, ML-EKM-ELM with p = 0.05 is respectively up to 9 times and 5 times faster than H-ELM for training and testing, while the training memory and test memory can be reduced to 1/8 and 1/12 respectively. Furthermore, ML-EKM-ELM overcomes the memory storage and computation issues of ML-KELM, producing a much smaller hidden layer for fast training and low memory storage. For NORB, ML-EKM-ELM with p = 0.1 is respectively up to 16 times and 37 times faster than ML-KELM for training and testing with a small accuracy loss of 0.35%, while the memory storage can be reduced to 1/9.

To summarize, we empirically show that the hidden layers in a multilayer neural network can be encoded in the form of an EKM. In the future, more advanced low-rank approximation methods such as the clustered Nyström method [18] and the ensemble Nyström method [29] may be tested for a more effective EKM and to further improve the performance of the proposed ML-EKM-ELM. Finally, ML-EKM-ELM has the following limitation: when p > 0.3 is chosen, ML-EKM-ELM may not outperform ML-KELM in terms of training time for some applications. The reason is that ML-EKM-ELM applies a standard SVD (singular value decomposition) to a submatrix of the kernel matrix in the training stage, whose size is proportional to the value of p; when p > 0.3, the SVD step dominates the computation and becomes computationally prohibitive. Therefore, in the future, we need to find an approximate and fast SVD method [30] to overcome this problem.
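One generic candidate for such an approximate SVD is a randomized (Halko-style) algorithm, sketched below. This illustrates the general idea only and is not the specific method of [30]; the target rank `k` and oversampling amount are illustrative choices.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Randomized SVD: project A onto a random low-dimensional range,
    then run a small exact SVD on the projected matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                             # orthonormal range basis
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)    # small SVD
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 400))  # exactly rank-50
U, s, Vt = randomized_svd(A, k=50)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(rel_err)  # near machine precision for an exactly low-rank matrix
```

Because only the small sketched matrix is decomposed exactly, the cost scales with the target rank rather than with the full matrix dimension, which is precisely what the large-p regime would need.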
Acknowledgments

The work is supported by the University of Macau under projects MYRG2018-00138-FST and MYRG2016-00134-FST.
References

[1] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2 (2009) 1-127.
[2] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 313 (2006) 504-507.
[3] J. Cao, W. Wang, J. Wang, R. Wang, Excavation equipment recognition based on novel acoustic statistical features, IEEE Transactions on Cybernetics, 47 (2017) 4392-4404.
[4] J. Cao, K. Zhang, M. Luo, C. Yin, X. Lai, Extreme learning machine and adaptive sparse representation for image classification, Neural Networks, 81 (2016) 91-102.
[5] L. Xi, B. Chen, H. Zhao, J. Qin, J. Cao, Maximum correntropy Kalman filter with state constraints, IEEE Access, 5 (2017) 25846-25853.
[6] W. Wang, Y. Huang, Y. Wang, L. Wang, Generalized autoencoder: a neural network framework for dimensionality reduction, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 490-497.
[7] L. van der Maaten, E. Postma, J. van den Herik, Dimensionality reduction: a comparative review, Journal of Machine Learning Research, 10 (2009) 66-71.
[8] J. Deng, Z. Zhang, E. Marchi, B. Schuller, Sparse autoencoder-based feature transfer learning for speech emotion recognition, in: Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 511-516.
[9] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, 2012, pp. 37-49.
[10] L.L.C. Kasun, H. Zhou, G.-B. Huang, C.M. Vong, Representational learning with ELMs for big data, IEEE Intelligent Systems, 28 (6) (2013) 31-34.
[11] C.M. Wong, C.M. Vong, P.K. Wong, J. Cao, Kernel-based multilayer extreme learning machines for representation learning, IEEE Transactions on Neural Networks and Learning Systems, 29 (3) (2018) 757-762.
[12] J. Tang, C. Deng, G.-B. Huang, Extreme learning machine for multilayer perceptron, IEEE Transactions on Neural Networks and Learning Systems, 27 (2016) 809-821.
[13] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42 (2012) 513-529.
[14] K. Zhang, L. Lan, Z. Wang, F. Moerchen, Scaling up kernel SVM on limited resources: a low-rank linearization approach, in: JMLR Proceedings Track, vol. 22, 2012, pp. 1425-1434.
[15] A. Golts, M. Elad, Linearized kernel dictionary learning, IEEE Journal of Selected Topics in Signal Processing, 10 (2016) 726-739.
[16] F. Pourkamali-Anaraki, S. Becker, A randomized approach to efficient kernel clustering, in: IEEE Global Conference on Signal and Information Processing, 2016, pp. 207-211.
[17] A. Gittens, M.W. Mahoney, Revisiting the Nyström method for improved large-scale machine learning, Journal of Machine Learning Research, 28 (2013) 567-575.
[18] F. Pourkamali-Anaraki, S. Becker, Randomized clustered Nyström for large-scale kernel machines, arXiv preprint arXiv:1612.06470, 2016.
[19] B. Schölkopf, S. Mika, C.J. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, A.J. Smola, Input space versus feature space in kernel-based methods, IEEE Transactions on Neural Networks, 10 (1999) 1000-1017.
[20] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Sciences, Irvine, CA, 2013. http://archive.ics.uci.edu/ml.
[21] J. Vanschoren, J.N. van Rijn, B. Bischl, L. Torgo, OpenML: networked science in machine learning, ACM SIGKDD Explorations Newsletter, 15 (2014) 49-60.
[22] I. Guyon, S. Gunn, A. Ben-Hur, G. Dror, Result analysis of the NIPS 2003 feature selection challenge, in: Advances in Neural Information Processing Systems, 2005, pp. 545-552.
[23] Y. LeCun, F.J. Huang, L. Bottou, Learning methods for generic object recognition with invariance to pose and lighting, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, vol. 2, pp. 97-104.
[24] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096-1103.
[25] G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation, 18 (2006) 1527-1554.
[26] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in: Proceedings of Artificial Intelligence and Statistics, 2009, pp. 448-455.
[27] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[28] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, Z.-H. Zhou, Nyström method vs random Fourier features: a theoretical and empirical comparison, in: Advances in Neural Information Processing Systems, 2012, pp. 476-484.
[29] S. Kumar, M. Mohri, A. Talwalkar, Ensemble Nyström method, in: Advances in Neural Information Processing Systems, 2009, pp. 1060-1068.
[30] M. Li, J.T. Kwok, B. Lu, Making large-scale Nyström approximation possible, in: Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 631-638.
Chi-Man VONG received the M.S. and Ph.D. degrees in software engineering from the University of Macau, Macau, in 2000 and 2005, respectively. He is currently an Associate Professor with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau. His research interests include machine learning methods and intelligent systems.
Chuangquan Chen received the B.S. and M.S. degrees in mathematics from South China Agriculture University, Guangzhou, China, in 2010 and 2013, respectively. He is currently pursuing the Ph.D. degree in Computer Science, University of Macau, Macau, China. His current research interests include machine learning and data mining.
Pak-Kin Wong received the Ph.D. degree in Mechanical Engineering from The Hong Kong Polytechnic University, Hong Kong, in 1997. He is currently a Professor in the Department of Electromechanical Engineering and Associate Dean (Academic Affairs), Faculty of Science and Technology, University of Macau. His research interests include automotive engineering, fluid transmission and control, artificial intelligence, mechanical vibration, and manufacturing technology for biomedical applications. He has published over 200 scientific papers in refereed journals, book chapters, and conference proceedings.