Heterogeneous blocked CPU-GPU accelerate scheme for large scale extreme learning machine


Shijie Li, Xin Niu, Yong Dou, Qi Lv, Yueqing Wang

PII: S0925-2312(17)30195-9
DOI: 10.1016/j.neucom.2016.05.112
Reference: NEUCOM 17994
To appear in: Neurocomputing
Received date: 17 September 2015
Revised date: 6 May 2016
Accepted date: 13 May 2016

Please cite this article as: Shijie Li, Xin Niu, Yong Dou, Qi Lv, Yueqing Wang, Heterogeneous Blocked CPU-GPU Accelerate Scheme for Large Scale Extreme Learning Machine, Neurocomputing (2017), doi: 10.1016/j.neucom.2016.05.112

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Shijie Li∗, Xin Niu, Yong Dou, Qi Lv, and Yueqing Wang
Science and Technology on Parallel and Distributed Processing Laboratory, School of Computer, National University of Defense Technology, China
[email protected], {xinniu,yongdou,lvqi,yueqingwang}@nudt.edu.cn

Abstract. Extreme learning machine (ELM) has been intensively studied during the last decade due to its high efficiency, effectiveness and ease of implementation. Recently, a variant of ELM named local receptive fields based ELM (ELM-LRF) has been proposed, which reduces the global connections and introduces local receptive fields to the input layer. However, an ELM-LRF model with a large number of hidden neurons spends a great deal of time solving a large scale Moore-Penrose Matrix Inversion (MPMI) problem, which has a heavy computational cost and needs much more runtime memory. Moreover, this procedure cannot be directly accelerated on GPU platforms due to the limited memory of GPU devices. In this paper, we propose three efficient approaches to perform ELM-LRF on GPU platforms. First, we propose a novel blocked LU decomposition algorithm, which overcomes the limitation of global memory size so that ELM-LRF models of any size can be trained. Furthermore, an efficient blocked Cholesky decomposition algorithm is presented to accelerate the blocked LU decomposition algorithm according to the matrix characteristics of the ELM-LRF model. Finally, we present a heterogeneous blocked CPU-GPU parallel algorithm that fully exploits the resources of a GPU node to further accelerate the blocked Cholesky decomposition algorithm in the ELM-LRF model.

Keywords: ELM-LRF, GPU, Blocked CPU-GPU accelerate algorithm.

1 Introduction

Machine learning and big data analysis have drawn great attention from researchers in multi-disciplinary fields in recent years [1,2,3,4,5]. Most traditional training methods designed for neural networks, such as the back-propagation (BP) algorithm [18], involve numerous gradient descent searching steps and suffer from problems including slow convergence, local minima, and intensive human intervention. Extreme learning machine (ELM) [12] aims to overcome these drawbacks and limitations faced by conventional learning theories and techniques. Recently, a variant named local receptive fields based extreme learning machine (ELM-LRF) was proposed by Huang et al. [13].


According to ELM theories, there exists a close relationship between local receptive fields and random hidden neurons. The local connections between input and hidden nodes allow the network to consider local structures and further introduce translational invariance into the network. Input weights are randomly generated and then orthogonalized in order to encourage a more complete set of features. The output weights are calculated analytically in ELM-LRF, providing a simple deterministic solution.
Though ELM has achieved great success on both theoretical and practical aspects, it cannot efficiently handle large-scale learning tasks on a normal CPU platform, due to the limitation of memory and the intensive computational cost of the multiplication and inversion of large matrices, although some parallel algorithms have been presented [26]. To solve this problem, GPGPU has been introduced to help accelerate ELM training [14][15]. However, in these works the number of hidden neurons is rather small (with upper bounds from 950 to 2000), which is entirely insufficient for complex tasks such as ImageNet classification [9]. In the latest work [23], the authors implemented a Python toolbox for big-data ELM training; however, they focus on optimizing the I/O operations for big data to make the whole training process faster rather than on optimizing the calculation itself, and they do not discuss how to handle ELM models with large numbers of hidden neurons. The toolbox can therefore only train ELM with big data, not with big models.
The reason why the number of hidden neurons cannot be larger on a GPU lies in the limited size of global memory. When training the ELM-LRF model or other ELM models with a GPU, the number of neurons in the hidden layer must be bounded by the size of global memory. Consequently, a small-scale ELM model can only handle simple image recognition challenges such as MNIST; when tasks become complex, it does not work well. According to ELM theory, the more hidden neurons, the better representation ability ELM obtains, so training an ELM with a large number of hidden neurons on GPUs is meaningful for achieving better performance on more complex challenges.
In this paper, we focus on ELM-LRF and address the challenge that traditional ELM-LRF cannot solve the large scale MPMI problem on a GPU device, first by presenting a novel blocked LU decomposition algorithm. Second, we accelerate the MPMI solving by replacing the LU decomposition with the more efficient blocked Cholesky decomposition, which makes the ELM-LRF training process faster. Furthermore, we present a heterogeneous blocked CPU-GPU accelerate algorithm that makes full use of the resources of a multi-GPU node to further accelerate the Cholesky decomposition in the ELM-LRF model. Experimental results show that the performance of the heterogeneous strategy is 5%-10% higher than that of the second scheme.

2 Related Work

This section gives a brief review of extreme learning machine (ELM) and local receptive fields based extreme learning machine (ELM-LRF). ELM is well known and widely used for its high efficiency and superior accuracy [16]. ELM-LRF is a new biologically inspired ELM framework, implemented by introducing the local receptive field (LRF) concept from neuroscience.

2.1 Extreme Learning Machine

The ELM algorithm was proposed by Huang et al. in [16]. ELM is a kind of single-hidden-layer feedforward neural network (SLFN) in which the input weights are randomly initialized. The main concepts of ELM, as presented in [16], are reviewed as follows. As shown in Fig. 1, the input layer is first transformed into the hidden layer through ELM feature mapping. Then the output is generated by ELM learning, which can be used for classification, regression, clustering, etc.

Figure 1. Schematic diagram of ELM

The output function of ELM for generalized SLFNs is:

f(x) = \sum_{i=1}^{L} \beta_i h_i(x) = h(x)\beta    (1)

where β = [β_1, ..., β_L]^T denotes the vector of output weights between the L nodes of the hidden layer and the m nodes of the output layer, and h(x) = [h_1(x), ..., h_L(x)] is the output vector of the hidden layer. Different activation functions can be used in ELM. In particular, in real applications h_i(x) can be

h_i(x) = G(a_i, b_i, x),  a_i ∈ R^d, b_i ∈ R    (2)

where G(a, b, x) is a nonlinear piecewise continuous function (e.g., the sigmoid or Fourier function) and a_i, b_i are the parameters of the i-th hidden node.

From the learning perspective, ELM theories emphasize that the hidden neurons need not be adjusted, unlike other traditional learning algorithms [18]. ELM solutions aim to simultaneously reach the smallest training error and the smallest norm of output weights, i.e.,

\arg\min_{\beta} \|\beta\|_p^{\sigma_1} + C \|H\beta - T\|_q^{\sigma_2}    (3)

where σ_1 > 0, σ_2 > 0, p, q = 0, 1, 2, ..., +∞, and C is the parameter controlling the trade-off between these two terms. Given a set of training samples (x_i, t_i), i = 1, ..., N, where x is the training data and t is the corresponding label, H is the nonlinearly transformed randomized hidden layer output matrix:

H = [h(x_1); ...; h(x_N)] =
    [ h_1(x_1) ... h_L(x_1) ]
    [   ...          ...    ]    (4)
    [ h_1(x_N) ... h_L(x_N) ]

and T is the training data target matrix:

T = [t_1^T; ...; t_N^T] =
    [ t_11 ... t_1m ]
    [  ...      ... ]    (5)
    [ t_N1 ... t_Nm ]

The output weights β can be calculated by numerous efficient methods, including but not limited to orthogonal projection methods, iterative methods, and singular value decomposition (SVD). A popular and efficient closed-form solution for ELM with σ_1 = σ_2 = p = q = 2 is:

β = H^T (I/C + H H^T)^{-1} T,        if N ≤ L
β = (I/C + H^T H)^{-1} H^T T,        if N > L    (6)

where H^T (I/C + H H^T)^{-1} is a special case of the Moore-Penrose generalized inverse of the matrix H [19]. Overall, the learning procedure of ELM is summarized in Algorithm 1.

Algorithm 1: The learning procedure of ELM
Input: a training set (x_i, y_i) ∈ R^d × R, an activation function f, and the number of hidden nodes N.
Output: weight matrix β
1. Randomly assign input weights w_i, i ∈ [[1, N]];
2. Calculate the hidden layer output matrix H;
3. Calculate the output weight matrix β.
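For concreteness, the closed-form solution of Eq. (6) for the case N > L can be written in a few lines of NumPy. This is only a minimal sketch under our own naming (elm_output_weights and the toy data are not from the paper), and it uses a linear solve rather than an explicit inverse:

    import numpy as np

    def elm_output_weights(H, T, C=0.01):
        """Eq. (6) for N > L: beta = (I/C + H^T H)^{-1} H^T T,
        computed as a linear solve instead of an explicit inverse."""
        L = H.shape[1]
        A = np.eye(L) / C + H.T @ H        # regularized Gram matrix
        B = H.T @ T                        # right-hand side
        return np.linalg.solve(A, B)       # beta, shape (L, m)

    # Toy usage: random hidden-layer outputs and one-hot targets
    rng = np.random.default_rng(0)
    H = rng.standard_normal((1000, 200))   # N = 1000 samples, L = 200 hidden nodes
    T = np.eye(10)[rng.integers(0, 10, 1000)]
    beta = elm_output_weights(H, T)
    f = H @ beta                           # network outputs, Eq. (1)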

2.2 Local Receptive Fields Based Extreme Learning Machine

ELM theories have proved that hidden layer nodes can be generated randomly according to any probability distribution. However, these works focus only on random weights, while ignoring the attribute of random connections. For natural images and languages, the strong local correlations may make full connections less appropriate. To overcome this problem, Huang et al. [13] propose to locally connect hidden nodes with the input ones. According to ELM theories, the connections are randomly sampled based on certain probability distributions, which are denser around some input nodes and sparser farther away. Huang et al. use a simple step probability function as the sampling distribution and the square root pooling structure to form the combinatorial node. The receptive field of each hidden node is thus composed of input nodes within a predetermined distance to the center. Furthermore, simply sharing the input weights among different hidden nodes directly leads to the convolution operation and can be easily implemented. In this way, a specific case of the general ELM-LRF is shown in Fig. 2.

Figure 2. The implementation network of ELM-LRF with K maps

In Fig. 2, the hidden layer is composed of random convolutional nodes. The input weights within the same feature map are shared, while being distinct from those of other maps. The input weight to the k-th feature map is a_k ∈ R^{r×r}. The convolutional node (i, j) in the k-th feature map, c_{i,j,k}, is calculated as:

c_{i,j,k}(x) = \sum_{m=1}^{r} \sum_{n=1}^{r} x_{i+m-1, j+n-1} · a_{m,n,k},   i, j = 1, ..., (d − r + 1)    (7)

where d is the input size, r is the receptive field size, and the size of each feature map is (d − r + 1) × (d − r + 1). The square root pooling structure is used here; for one input sample x, the value of a combinatorial node is obtained by calculating:

h_{p,q,k} = \sqrt{ \sum_{i=p-e}^{p+e} \sum_{j=q-e}^{q+e} c_{i,j,k}^2 },   p, q = 1, ..., (d − r + 1)    (8)

where e is the pooling size and c_{i,j,k} is the convolutional node calculated above; if (i, j) is out of bound, c_{i,j,k} = 0. Simply concatenating the values of all combinatorial nodes into a row vector and stacking the rows of the N input samples, the combinatorial layer matrix H ∈ R^{N × K·(d−r+1)^2} is obtained. The output weights are then:

1) if N ≤ K·(d−r+1)^2:    β = H^T (I/C + H H^T)^{-1} T    (9)

2) if N > K·(d−r+1)^2:    β = (I/C + H^T H)^{-1} H^T T    (10)
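As an illustration of the feature mapping in Eqs. (7)-(8), the following NumPy sketch computes the convolutional nodes and the square-root pooled combinatorial nodes for one square input image. It is a naive reference implementation with our own function name and zero-padding handling, not the GPU code used by the authors:

    import numpy as np

    def elm_lrf_features(x, a, e):
        """x: (d, d) input image; a: (K, r, r) random input weights;
        e: pooling size. Returns pooled features of shape (K, s, s),
        s = d - r + 1, following Eqs. (7)-(8) (0-based indices)."""
        K, r, _ = a.shape
        d = x.shape[0]
        s = d - r + 1
        # Eq. (7): valid convolution (correlation) for each feature map
        c = np.zeros((K, s, s))
        for k in range(K):
            for i in range(s):
                for j in range(s):
                    c[k, i, j] = np.sum(x[i:i + r, j:j + r] * a[k])
        # Eq. (8): square-root pooling over a (2e+1) x (2e+1) window,
        # with out-of-bound convolutional nodes treated as zero
        cp = np.pad(c, ((0, 0), (e, e), (e, e)))
        h = np.zeros_like(c)
        for k in range(K):
            for p in range(s):
                for q in range(s):
                    win = cp[k, p:p + 2 * e + 1, q:q + 2 * e + 1]
                    h[k, p, q] = np.sqrt(np.sum(win ** 2))
        return h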

3 Scheme For Large Scale ELM

In this section, we first present a novel blocked GPU-based LU decomposition strategy with which an arbitrarily large ELM-LRF model can be trained with the acceleration of GPUs. After addressing the limitation of GPU global memory size, we go a step further and present an efficient blocked Cholesky decomposition algorithm that accelerates the blocked LU decomposition according to the matrix characteristics of the ELM-LRF model. Finally, in order to gain higher performance and make full use of the resources in a GPU node, we present a heterogeneous blocked CPU-GPU accelerate algorithm.

3.1 Blocked LU Decomposition MPMI Algorithm

In the ELM-LRF model, the main computing parts are the convolution operation and the Moore-Penrose matrix inversion (MPMI). In this paper, we do not focus on the convolution operation, since there are many popular GPU-based convolution toolboxes, such as Caffe, cuDNN [20] and MatConvNet [21]. Convolution operations on images can be performed in small batches, so no matter how big the dataset is, GPUs work well. On the other hand, when the number of hidden neurons increases and the scale of H^T H exceeds the size of the global memory on a GPU device, the training process cannot be accelerated by GPUs. To solve the Moore-Penrose matrix inversion problem, i.e., to evaluate (I/C + H^T H)^{-1} H^T T, the inverse of H^T H is in principle needed.
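The point that the explicit inverse is unnecessary can be illustrated with a small NumPy sketch; the matrix sizes, names and the comparison below are our own illustrative choices, not the paper's experimental settings:

    import numpy as np

    rng = np.random.default_rng(1)
    H = rng.standard_normal((2000, 500))
    T = rng.standard_normal((2000, 10))
    C = 0.01

    A = np.eye(H.shape[1]) / C + H.T @ H     # A = I/C + H^T H
    B = H.T @ T                              # B = H^T T

    beta_inv = np.linalg.inv(A) @ B          # explicit inverse, as in [15]
    beta_solve = np.linalg.solve(A, B)       # linear solve AX = B, as used here
    assert np.allclose(beta_inv, beta_solve) # same beta, but the solve is cheaper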


The blocked matrix multiplication algorithm is well known, but here a large scale blocked inversion algorithm is needed for ELM-LRF. In the related work [15], the authors calculate the inverse of H^T H and then perform a matrix multiplication, but inverting a big matrix costs a lot of time and the procedure cannot be easily parallelized. In fact, there is no need to perform the inversion operation: the expression for β can be treated as solving a general linear system AX = B, where A is (I/C + H^T H), B is H^T T, and X is β. Here we assume N > K·(d−r+1)^2 as the general case, the number of classes in the dataset is p, and the total number of hidden neurons is M = K·(d−r+1)^2. Although the algorithmic complexity of the two approaches is the same, solving a general linear system can be parallelized more easily. We therefore focus on LU decomposition and back substitution for big matrices. Assuming the M × M matrix A can be divided into (M/Mb) × (M/Mb) smaller sub-matrices of size Mb × Mb, we must make sure that the size of each block does not exceed the global memory size of the GPU device. As depicted in Fig. 3, A_{1,1} is first decomposed into L_{1,1} × U_{1,1}; then L_{1,1} and U_{1,1} are used to calculate L_{2,1}, L_{3,1}, ..., L_{Mb,1} and U_{1,2}, U_{1,3}, ..., U_{1,Mb}, which can be done in parallel on GPUs. Afterwards, the remaining matrices are updated, i.e., A_{m,n} = A_{m,n} − L_{m,1} × U_{1,n}, where A_{m,n} represents the sub-matrix in block row m and block column n. This process can also be done in parallel. We denote the updated remaining matrix as A′; then A′ undergoes the same operations as A until the remaining matrix is small enough to be computed on the GPU. The whole process is shown in Algorithm 2.

Figure 3. Blocked LU decomposition MPMI algorithm

After decomposing the big matrix A into blocked L and U, we want to obtain the weight matrix β, which is the X in AX = B. To achieve this, back substitution must be performed twice.

Algorithm 2: Blocked LU decomposition MPMI algorithm
Input: an M × M matrix A and the block number Mb along the x and y dimensions.
Output: L and U
1  for each i ∈ {1, 2, ..., Mb} do
2      calculate the LU decomposition of A_{i,i}, obtaining L_{i,i} and U_{i,i};
3      for each j ∈ {i+1, ..., Mb} parallel do
4          calculate L_{j,i} by A_{j,i} / U_{i,i};
5          calculate U_{i,j} by A_{i,j} / L_{i,i};
6      update A':
7      for each m, n ∈ {i+1, ..., Mb} parallel do
8          update the (m, n)-th A block by A_{m,n} = A_{m,n} − L_{m,i} × U_{i,n};
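A sequential NumPy sketch of Algorithm 2 is given below; the "parallel do" loops are written as ordinary Python loops, pivoting is omitted (the target matrix I/C + H^T H is symmetric positive definite), and the helper names are our own:

    import numpy as np
    from scipy.linalg import solve_triangular

    def lu_nopivot(a):
        # Unblocked Doolittle LU without pivoting (assumed to exist here).
        a = a.copy()
        n = a.shape[0]
        for k in range(n - 1):
            a[k + 1:, k] /= a[k, k]
            a[k + 1:, k + 1:] -= np.outer(a[k + 1:, k], a[k, k + 1:])
        return np.tril(a, -1) + np.eye(n), np.triu(a)

    def blocked_lu(A, Mb):
        # A: (M, M) matrix viewed as Mb x Mb blocks of size (M/Mb, M/Mb).
        A = A.copy()
        b = A.shape[0] // Mb
        def blk(i, j):                                   # view of block (i, j)
            return A[i * b:(i + 1) * b, j * b:(j + 1) * b]
        for i in range(Mb):
            Lii, Uii = lu_nopivot(blk(i, i))
            blk(i, i)[:] = np.tril(Lii, -1) + Uii        # store packed L\U factors
            for j in range(i + 1, Mb):                   # panel: L_{j,i} and U_{i,j}
                blk(j, i)[:] = solve_triangular(Uii, blk(j, i).T, trans='T').T
                blk(i, j)[:] = solve_triangular(Lii, blk(i, j), lower=True)
            for m in range(i + 1, Mb):                   # trailing update A'
                for n in range(i + 1, Mb):
                    blk(m, n)[:] -= blk(m, i) @ blk(i, n)
        return A                                         # packed blocked L and U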

The procedure is described as follows. First, A is decomposed through a lower-upper triangular decomposition, i.e., A = L × U. Substituting this into AX = B gives L × U × X = B. Denoting Y = U × X, Y can easily be obtained by solving L × Y = B since L is a lower triangular matrix. After working out Y, X is obtained similarly from Y = U × X, since U is an upper triangular matrix. Note that L, U and B are all divided into small blocks. In the first back substitution, Y_1 is calculated first by B_1 / L_{1,1}; then the remaining B blocks are updated in parallel by B_i − L_{i,1} × Y_1. After this first update, Y_2 can be solved just like Y_1 by B_2 / L_{2,2}. The routine goes on until all blocked Y's have been worked out. In the second back substitution, however, the direction of calculation is contrary to that of the first: we calculate X_{Mb} first by Y_{Mb} / U_{Mb,Mb} instead of X_1, then update the remaining blocked Y's just as in the first back substitution, until all X's have been worked out. The whole process is illustrated in Fig. 4, and the pseudocode of the twice blocked back substitution algorithm is shown in Algorithm 3. Here we assume that the column block number of B is 1; if it is greater than 1, the algorithm can be performed column by column, and since the data in different columns are independent, the columns can be calculated in parallel. Once we obtain X, which represents β, the training process is finished, and we can use it for tasks such as recognition and classification.

3.2 Blocked Cholesky Decomposition Algorithm

In Section 3.1 we addressed the limitation of GPU global memory size; as a result, we can train an arbitrarily large ELM-LRF model with GPU acceleration provided the main memory is big enough. Since H^T H is a symmetric positive semi-definite matrix, Cholesky decomposition cannot always be performed directly. If we have checked that H^T H is positive definite, the decomposition of A can be done in a more efficient way; otherwise, we still use the LU decomposition algorithm instead. In the LU algorithm, both the L and U matrices must be calculated.
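The fallback described above can be sketched with SciPy's dense factorizations (a simplified, single-node stand-in for the blocked GPU versions that follow; the function name is our own):

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve, lu_factor, lu_solve

    def solve_output_weights(A, B):
        # Use the cheaper Cholesky path when A = I/C + H^T H is positive
        # definite, and fall back to an LU solve otherwise.
        try:
            c, low = cho_factor(A, lower=True)   # raises if A is not positive definite
            return cho_solve((c, low), B)
        except np.linalg.LinAlgError:
            lu, piv = lu_factor(A)
            return lu_solve((lu, piv), B)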

Figure 4. The process of twice blocked back substitution algorithm

Algorithm 3: Twice blocked back substitution algorithm
Input: a blocked lower triangular matrix L, a blocked upper triangular matrix U, a blocked matrix B, the row and column block number Mb of L and U, and the same Mb as the row block number of B.
Output: X
1  for each i ∈ {1, 2, ..., Mb} do
2      calculate Y_i by Y_i = B_i / L_{i,i};
3      update B:
4      for each j ∈ {i+1, ..., Mb} parallel do
5          update the j-th B block by B_j = B_j − L_{j,i} × Y_i;
6      synchronize all processes;
7  for each i ∈ {Mb, Mb−1, ..., 2, 1} do
8      calculate X_i by X_i = Y_i / U_{i,i};
9      update Y:
10     for each j ∈ {i−1, ..., 2, 1} parallel do
11         update the j-th Y block by Y_j = Y_j − U_{j,i} × X_i;
12     synchronize all processes;
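The two back substitutions of Algorithm 3 can be sketched sequentially as follows (one right-hand-side block column, plain loops in place of the parallel updates, names ours):

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_two_stage_substitution(L, U, B, Mb):
        # Solve L U X = B, processing Mb x Mb blocks as in Algorithm 3.
        b = L.shape[0] // Mb
        s = lambda i: slice(i * b, (i + 1) * b)
        Y = B.copy()
        for i in range(Mb):                      # forward pass: L Y = B
            Y[s(i)] = solve_triangular(L[s(i), s(i)], Y[s(i)], lower=True)
            for j in range(i + 1, Mb):           # update the remaining blocks
                Y[s(j)] -= L[s(j), s(i)] @ Y[s(i)]
        X = Y.copy()
        for i in reversed(range(Mb)):            # backward pass: U X = Y
            X[s(i)] = solve_triangular(U[s(i), s(i)], X[s(i)], lower=False)
            for j in range(i):                   # update the blocks above
                X[s(j)] -= U[s(j), s(i)] @ X[s(i)]
        return X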

However, according to matrix theory, any symmetric positive definite matrix can be decomposed into a lower triangular matrix L and its transpose L^T, which is called the Cholesky decomposition. The Moore-Penrose matrix inversion in ELM is often exactly this case. For the Cholesky decomposition, only the L matrix needs to be calculated and stored, so it runs in a shorter time and needs only half the memory space compared to the LU decomposition.


Assuming the M × M matrix A can be divided into (M/Mb) × (M/Mb) smaller sub-matrices of size Mb × Mb, we must again make sure that each block does not exceed the global memory size of the GPU device. As depicted in Fig. 5, A_{1,1} is first decomposed into L_{1,1} × L_{1,1}^T; then L_{1,1} is used to calculate L_{2,1}, L_{3,1}, ..., L_{Mb,1}, which can be done in parallel on GPUs. Afterwards, the remaining matrices are updated, i.e., A_{m,n} = A_{m,n} − L_{m,1} × L_{n,1}^T, where A_{m,n} represents the sub-matrix in block row m and block column n. This process can also be done in parallel. We denote the updated remaining matrix as A′; then A′ undergoes the same operations as A until the remaining matrix is small enough to be computed on the GPU.

Figure 5. The process of blocked Cholesky decomposition algorithm

The pseudocode of the blocked Cholesky decomposition algorithm is shown in Algorithm 4. Clearly, the Cholesky decomposition does not compute U, and only the lower triangular part of A′ needs to be updated. Thus, the Cholesky decomposition runs much faster than the LU decomposition with less computation. The twice back substitution proceeds in the same way as in the LU method described in the previous section; we only need to replace the U blocks with L^T blocks.

Algorithm 4: Blocked Cholesky decomposition MPMI algorithm
Input: a checked symmetric positive definite M × M matrix A and the block number Mb along the x and y dimensions.
Output: L and L^T
1  for each i ∈ {1, 2, ..., Mb} do
2      calculate the Cholesky decomposition of A_{i,i}, obtaining L_{i,i} and L_{i,i}^T;
3      for each j ∈ {i+1, ..., Mb} parallel do
4          calculate L_{j,i} by A_{j,i} / L_{i,i}^T;
5      update A':
6      for each m, n ∈ {i+1, ..., Mb} parallel do
7          update the (m, n)-th A block by A_{m,n} = A_{m,n} − L_{m,i} × L_{n,i}^T;
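A compact sequential sketch of Algorithm 4 in NumPy is shown below; the panel solves and the trailing update are the parts that the paper distributes across GPUs, and the function name and block handling are our own:

    import numpy as np
    from scipy.linalg import solve_triangular

    def blocked_cholesky(A, Mb):
        # Lower-triangular factor of a symmetric positive definite A,
        # computed block column by block column.
        A = A.copy()
        b = A.shape[0] // Mb
        s = lambda i: slice(i * b, (i + 1) * b)
        for i in range(Mb):
            A[s(i), s(i)] = np.linalg.cholesky(A[s(i), s(i)])          # L_{i,i}
            Lii = A[s(i), s(i)]
            for j in range(i + 1, Mb):                                 # panel: L_{j,i} = A_{j,i} L_{i,i}^{-T}
                A[s(j), s(i)] = solve_triangular(Lii, A[s(j), s(i)].T, lower=True).T
            for m in range(i + 1, Mb):                                 # trailing update, lower part only
                for n in range(i + 1, m + 1):
                    A[s(m), s(n)] -= A[s(m), s(i)] @ A[s(n), s(i)].T
        return np.tril(A)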

3.3 Heterogeneous Blocked CPU-GPU Accelerate Algorithm

With the two aforementioned schemes, we can successfully train a big ELM-LRF model with multiple GPUs. However, a GPU node also contains CPU cores besides the GPU devices. We therefore present a novel heterogeneous blocked CPU-GPU accelerate algorithm for ELM-LRF to further improve performance and make full use of the resources in a single GPU node.

Since the big matrix has been divided into small blocks, each block is a computing unit, and an allocation strategy over all these blocks must be designed for the CPUs and GPUs. Before allocating the workload, the performance of the GPUs and CPUs should be measured first to provide a basis for the allocation strategy in a single GPU node.

After calculating L_{i,i}, the other blocks in the same column can be worked out in parallel, but the next column's calculation needs the results of the previous column, so a synchronization is required after finishing each column. For this reason, the allocation strategy works within a single column, that is, the blocks in the same column can be calculated by both the GPUs and the CPUs. Due to the "cask effect", the time to finish a column depends on the slowest process; making the GPUs and CPUs finish their workload as simultaneously as possible therefore reduces the total time to the minimum. The heterogeneous CPU-GPU allocation strategy is depicted in Fig. 6. Assume there are N blocks to be calculated in a certain column. The first block must be calculated first, so we allocate it to a GPU device (in general a GPU device runs faster than the multi-core CPUs in a GPU node) to make sure this bottleneck is finished as quickly as possible. The remaining (N − 1) blocks are then assigned to either the CPUs or the GPUs: we try different ratios of blocks executed on CPU/GPU and choose the most time-saving scheme as the final allocation strategy.
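A toy model of this column-wise allocation decision is sketched below; best_column_split, the per-block times and the ceiling-division load model are our own simplifications of the strategy in Fig. 6, not the authors' scheduler:

    def best_column_split(n_blocks, n_gpus, t_gpu, t_cpu):
        # Pick how many of a column's blocks to give to the CPU so that the
        # GPUs and the CPU finish as close to simultaneously as possible
        # (t_gpu and t_cpu are measured per-block times).
        best = (float('inf'), 0)
        for k_cpu in range(n_blocks):                 # blocks assigned to the CPU
            k_gpu = n_blocks - k_cpu                  # blocks assigned to the GPUs
            gpu_time = -(-k_gpu // n_gpus) * t_gpu    # ceil-divide across the GPUs
            cpu_time = k_cpu * t_cpu
            makespan = max(gpu_time, cpu_time)        # "cask effect": slowest side wins
            if makespan < best[0]:
                best = (makespan, k_cpu)
        return best                                   # (estimated column time, CPU share)

    # Example with the paper's observation that one GPU is roughly 2x the CPU:
    print(best_column_split(n_blocks=5, n_gpus=2, t_gpu=1.0, t_cpu=2.0))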

Figure 6. The heterogeneous CPU-GPU allocation strategy

4 Experiments and Results

This section presents the experimental results obtained by the ELM-LRF classifier on GPU and heterogeneous CPU-GPU architectures with the blocked LU decomposition algorithm, the modified blocked Cholesky decomposition algorithm, and the heterogeneous CPU-GPU algorithm. First, the experimental environment is described. Then, the classification accuracies of the CPU and GPU versions are compared to verify the algorithms' validity. Finally, we compare the speedup of the different algorithms.

4.1 Datasets and Experimental Environment


The proposed algorithms were evaluated on a platform equipped with an Intel E5-2650 at 2.0 GHz and 512 GB of RAM. The CPU code was compiled using mpich 3.2 with OpenBLAS 0.2.14 on Linux RedHat 6.5 and run under MATLAB 2014a with 16 cores and 32 threads. Regarding the GPU implementation, the GPU code was compiled using nvcc 7.0 with MAGMA 1.6.1 [22] and run on four NVIDIA Tesla K20c cards with 13 SMXs and 2496 CUDA cores each. The datasets used in the experiments include: 1) MNIST, a database of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples; it is a subset of a larger set available from NIST, and the digits have been size-normalized and centered in a fixed-size image. 2) NORB, a 3D object database containing images of 50 toys in 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. 3) COIL-100, a database of color images of 100 objects; the objects were placed on a motorized turntable against a black background and rotated through 360 degrees to vary the object pose with respect to a fixed color camera. 4) Caltech-256, a challenging set of 256 object categories containing a total of 30607 images.


Table 1. Specifications of different datasets

Name          Attributes   Classes   Training data   Test data   Kernel size   Pooling size
MNIST         784          10        50000           10000       4             4
NORB          1024         5         40000           10000       6             4
CIFAR-10      1024         10        50000           10000       6             4
SVHN          1024         10        50000           10000       6             4
COIL-100      2500         9         60000           10000       11            4
Caltech-256   2500         256       50000           10000       11            4


M

AN US

5) CIFAR-10, which consists of 60000 32x32 colour images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. 6) SVHN, a real-world image dataset [24] for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting; it contains 73257 digits for training and 26032 digits for testing in 10 classes. Because the image sizes differ across these datasets, in order to compare performance we reshape the original images so that the numbers of hidden layer nodes are as similar as possible. The details of the datasets are shown in Table 1. To verify the effect of different numbers of hidden neurons on ELM-LRF, we vary the number of feature maps from 100 to 600 in steps of 100 for MNIST, NORB, CIFAR-10 and SVHN, and from 50 to 250 in steps of 50 for COIL-100 and Caltech-256.

4.2 Classification Accuracies

In this paper, we focus on making the GPU algorithms outperform the CPU version with faster speed and the same results, rather than on tuning parameters to obtain higher recognition accuracy. Thus, in the following experiments we do not try different parameter configurations and fix the trade-off coefficient C to 0.01. The relationship between block number and the number of hidden neurons is shown in Table 2. When the hidden-neuron matrix is too large it costs more main memory, and we need to divide it into more blocks. Nevertheless, the number of hidden neurons is not unlimited: although more memory chips can be added to a single computing node, the processing power of the GPUs is limited. Considering the actual limitations of main memory size and running time, we keep the hidden-neuron matrix below 30,000 × 30,000 elements in this paper. Accuracies of the CPU version are shown in Table 3, while the accuracies of the blocked LU decomposition based GPU version are shown in Table 4. As shown in these two tables, the test accuracies of the GPU version are on par with those of the CPU version. The experiments also show that the classification accuracy increases as the number of feature maps grows, so it is meaningful to address the challenge of large scale ELM-LRF training with GPUs.


Table 2. Relationship between block number and hidden neurons number

MNIST, CIFAR-10, NORB, SVHN:
  #Hidden Neurons   3600   7200   10800   14400   18000   21600
  #Blocks           4x4    6x6    9x9     9x9     9x9     9x9

COIL-100, Caltech-256:
  #Hidden Neurons   5000   10000   15000   20000   25000
  #Blocks           5x5    10x10   10x10   10x10   10x10

Table 3. Accuracies (%) of the CPU version for different datasets

#Feature Maps   50      100     150     200     250     300     400     500     600
MNIST           -       98.38   -       98.52   -       98.67   98.68   98.69   98.65
NORB            -       87.39   -       88.96   -       88.90   91.14   89.28   90.47
CIFAR-10        -       54.17   -       56.36   -       57.19   58.65   59.63   59.64
SVHN            -       86.58   -       87.89   -       87.99   88.59   88.73   88.99
COIL-100        98.33   98.54   99.10   99.34   99.76   -       -       -       -
Caltech-256     68.16   84.71   87.15   87.41   87.37   -       -       -       -

4.3 Accelerate Performance of Different Blocked GPU Algorithms

We also evaluated the percentage of the total training time spent on the Moore-Penrose matrix inversion operation on the CPU platform; the results are shown in Fig. 7. The MPMI takes a greater share of the whole training time as the number of hidden neurons increases, and in a CIFAR-10 ELM training experiment with 25000 hidden neurons the LU decomposition takes 30% of the total run time, which indicates that accelerating the MPMI procedure is both necessary and effective for large scale ELM-LRF training.

First, we compare our GPU accelerated ELM-LRF with state-of-the-art deep learning algorithms on the NORB object recognition dataset. We could not rerun all the experiments, so some results come from [13]. In our experiment, we use the simplest blocked LU decomposition algorithm with the following parameters: receptive field size 6x6, number of feature maps 60, pooling size 4x4, and C = 0.01. The other experimental parameters are kept the same as those described in [13]. The training times are shown in Table 5. Some details should be explained here: in this experiment, DBN's training time is used as the baseline for computing the speedups. Its architecture is 2048(input)-500-2000-500-5(output) with 500 epochs, since the convergence of the training error is slow on the NORB dataset.


Table 4. Accuracies (%) of the GPU version for different datasets

#Feature Maps   50      100     150     200     250     300     400     500     600
MNIST           -       98.35   -       98.49   -       98.68   98.67   98.68   98.70
NORB            -       87.37   -       88.94   -       88.95   90.32   90.87   90.54
CIFAR-10        -       54.23   -       56.48   -       57.14   58.32   59.77   59.59
SVHN            -       86.43   -       87.85   -       87.91   88.64   88.79   89.02
COIL-100        98.32   98.51   99.15   99.37   99.75   -       -       -       -
Caltech-256     68.43   84.62   87.29   87.37   87.43   -       -       -       -

Table 5. Training Time Comparison

ALGORITHMS   TRAINING TIME (s)   SPEEDUP TIMES
Ours         107.32              798.70
ELM-LRF      394.16              217.47
TILED CNN    15104.55            5.67
CNN          53378.16            1.61
DBN          85717.14            1


The TILED CNN used the same architecture as ELM-LRF, with the maximum number of pretraining iterations set to 20. The CNN's architecture is provided in [25] and its number of epochs is 50. In the Moore-Penrose matrix inversion of ELM-LRF, a large scale matrix multiplication is performed to obtain H^T H. Then a linear system (I/C + H^T H)X = B needs to be solved by LU decomposition, Cholesky decomposition or other algorithms. We first evaluated the performance of the blocked GPU based LU decomposition algorithm. The speedup of the proposed blocked GPU based LU algorithm over the CPU version of the MPMI run time is reported in Table 6. As shown in the table, our proposed blocked GPU based LU decomposition algorithm is much more efficient: in the MNIST experiment, when the number of hidden neurons rises to 21600, the MPMI run time is 463.17 s for the CPU version and 61.29 s for the GPU version, i.e., a speedup of 7.55x. It can also be observed that the speedup increases as the number of hidden neurons rises. We then studied the relationship between performance and partition strategy. We ran four experiments with one, two, three and four GPUs; in each experiment, we measured the calculation time of the LU decomposition for 11 different numbers of hidden neurons (3600-25000) and 7 partition strategies.

Figure 7. The percentage of MPMI time in the total training time for different numbers of hidden neurons


The partition strategies were 5x5, 10x10, 15x15, 20x20, 25x25, 30x30 and 40x40; the results are provided in Fig. 8(a)-(d). In Fig. 8, the calculation time with 5x5 blocks is used as the baseline (value 1), and the performance ratio is obtained by dividing this baseline time by the calculation time of each partition strategy. From all subfigures in Fig. 8 we find that the performance decreases as the block number increases, because more blocks bring more GPU kernel launching time, more synchronization time and more data transfer time. Each subfigure also shows that the performance loss is smaller for larger numbers of neurons than for smaller ones. For example, in Fig. 8(d) the performance for 21600 neurons decreases much more slowly than that for 3600 neurons: when the block number reaches 40x40, the performance for 3600 neurons drops to nearly 0.2, whereas the performance for 21600 neurons is still above 0.4. The reason is that more GPUs reduce the total number of iterations, which shortens the average synchronization and data transfer times, while the GPU kernel launching time is divided among the GPUs. Furthermore, we evaluated performance against the number of GPUs. In this experiment we first fixed four partition strategies: 6x6, 15x15, 25x25 and 40x40; we then tested the LU decomposition performance for different numbers of neurons (from 3600 to 25000) with different numbers of GPUs (from 1 to 4). The results are illustrated in Fig. 9(a)-(d).

Figure 8. Partition Strategy and Performance: performance ratio (relative to 5x5 blocks) versus block number for (a) 1 GPU, (b) 2 GPUs, (c) 3 GPUs, and (d) 4 GPUs.

Figure 9. GPU Number and Performance: speedup versus number of GPUs for (a) 6x6, (b) 15x15, (c) 25x25, and (d) 40x40 blocks.


Table 6. The speedup of the proposed blocked LU algorithm vs. the CPU version

#Hidden Neurons  3600   5000   7200   10000   10800   14400   15000   18000   20000   21600   25000
MNIST            4.42   -      5.52   -       6.2     6.61    -       7.22    -       7.55    -
NORB             3.37   -      4.75   -       4.44    5.61    -       6.53    -       6.86    -
CIFAR-10         3.32   -      5.08   -       4.56    5.76    -       6.06    -       6.73    -
SVHN             3.34   -      4.87   -       4.52    5.72    -       6.33    -       6.81    -
COIL-100         -      4.61   -      5.03    -       -       5.21    -       5.46    -       5.98
Caltech-256      -      4.38   -      3.98    -       -       4.97    -       5.7     -       6.04


Several conclusions can be drawn from Fig. 9. First, the speedup increases as the number of GPUs grows, but not linearly, because of the overhead of partitioning and synchronization. Second, more blocks bring more speedup, since more blocks provide more chances to distribute the work evenly across the GPUs, which reduces the synchronization time. We then examined the relationship between calculation performance and the number of hidden neurons, using the experimental results of Fig. 9 to draw the plots in Fig. 10. As shown in Fig. 10, for a given number of GPUs the speedup remains nearly the same across different numbers of hidden neurons, which means that no matter how big the ELM training model is, our algorithms scale well. Next we evaluated the performance of the blocked Cholesky decomposition algorithm and the heterogeneous blocked CPU-GPU algorithm. In a complete Moore-Penrose matrix inversion, the matrix multiplication is the same for the three aforementioned blocked GPU algorithms, so we pay more attention to the decomposition process. With the parameters in Table 2, we obtain feature maps of size 6x6 for MNIST and NORB, and 10x10 for COIL-100 and Caltech-256. In general there are only two groups of feature maps, one of size 6x6 with the number of feature maps varying from 100 to 600, the other of size 10x10 with the number varying from 50 to 250. Therefore, we only compare the performance of the three blocked GPU algorithms on two datasets, MNIST and Caltech-256. In this experiment we use two GPUs and all the CPUs to evaluate the acceleration performance. The task allocation results are shown in Fig. 11, (a) for two GPUs and (b) for four GPUs, and the acceleration performance is illustrated in Fig. 12. From Fig. 11(a) we can see that the actual amount of work assigned to the CPUs (white blocks) is very small. The CPUs get tasks only when there are many tasks in a column; when the number of tasks in each column decreases, the CPUs get no tasks because of their lower performance compared with a single GPU (a single GPU is more than 2x faster than the CPUs). Consider the third column as an example: GPU 0 calculates the first block, then GPU 0 and GPU 1 calculate the remaining four blocks in two groups.

Figure 10. Hidden Neurons Number and Performance: speedup versus number of hidden neurons for 1-4 GPUs with (a) 6x6, (b) 15x15, (c) 25x25, and (d) 40x40 blocks.

Figure 11. Task Allocation Results among two GPUs and CPUs: (a) allocation with two GPUs, (b) allocation with four GPUs.



Assume the time for one GPU to finish a block is t; the time for the CPUs is then about 2t. If the last two blocks are allocated to the GPUs, the total time is t + t + t = 3t; otherwise it is t + t + 2t = 4t. In this situation the task allocation program judges that the GPUs can finish these tasks faster than the CPUs. For Fig. 11(b), since there are four GPUs, their performance is far beyond that of the CPUs: in each column the four GPUs finish the job faster on their own, so no tasks are assigned to the CPUs.


Figure 12. Performance of three blocked GPU algorithms in MNIST and Caltech-256

Several conclusions can be drawn from Fig. 12. First, the blocked Cholesky decomposition algorithm saves nearly half of the decomposition time compared with the blocked LU decomposition algorithm. Second, the blocked heterogeneous CPU-GPU algorithm performs slightly better than the blocked Cholesky decomposition algorithm, by 5%-10%. It can be expected that with a more powerful CPU, a larger matrix, or a larger number of blocks, the blocked heterogeneous CPU-GPU algorithm will bring even more benefit to accelerating the MPMI solving in ELM-LRF.

5 Conclusion

In this paper, three blocked GPU algorithms, i.e., a blocked LU decomposition algorithm, a blocked Cholesky decomposition algorithm and a blocked heterogeneous CPU-GPU algorithm, have been proposed. Our work can be summarized in the following aspects.

First, these three algorithms address the challenge that traditional ELM-LRF cannot solve the large scale Moore-Penrose Matrix Inversion (MPMI) problem because of the limited global memory size of a GPU device. Second, an efficient blocked Cholesky decomposition algorithm is presented to accelerate the MPMI according to the matrix characteristics of the ELM-LRF model (when the H^T H matrix is positive definite); experimental results indicate that the blocked Cholesky decomposition algorithm achieves about a 2x speedup compared with the blocked LU decomposition algorithm. Third, a heterogeneous blocked CPU-GPU accelerate algorithm is presented to make full use of the resources of a GPU node to accelerate the MPMI; experimental results show that the performance of this approach is 5%-10% higher than that of the blocked Cholesky decomposition algorithm. We believe that the proposed algorithms fit not only ELM-LRF but also other variants of ELM. They can also be easily transplanted to a distributed system by dividing the large matrix into blocks on distributed nodes, which is a promising direction for future work.


Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grants 61125201, U1435219, 61402507.

References

1. Chen C L P, Zhang C Y.: Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275, 314-347 (2014)
2. Z.-H. Zhou, N. Chawla, Y. Jin, G. Williams.: Big data opportunities and challenges: Discussions from data analytics perspectives. IEEE Computational Intelligence Magazine 9, 62-74 (2014)
3. Y. Zhai, Y.-S. Ong, I. Tsang.: The emerging big dimensionality. IEEE Computational Intelligence Magazine 9, 14-26 (2014)
4. I. Arnaldo, K. Veeramachaneni, A. Song, U. O'Reilly.: Bring your own learner: A cloud-based, data-parallel commons for machine learning. IEEE Computational Intelligence Magazine 10, 20-32 (2015)
5. L. L. C. Kasun, H. Zhou, G.-B. Huang, C. M. Vong.: Representational learning with extreme learning machine for big data. IEEE Intelligent Systems 28, 31-34 (2013)
6. NVidia CUDA Zone, http://www.nvidia.com/object/cuda_home.html
7. B. Catanzaro, N. Sundaram, K. Keutzer.: Fast support vector machine training and classification on graphics processors. In: Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pp. 104-111. ACM, Finland (2008)
8. Jia Y, Shelhamer E, Donahue J, et al.: Caffe: Convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 675-678. ACM (2014)
9. Deng J, Dong W, Socher R, et al.: ImageNet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition (CVPR 2009), pp. 248-255. IEEE (2009)
10. Griffin G, Holub A, Perona P.: Caltech-256 Object Category Dataset. California Institute of Technology (2007)


11. D. E. Rumelhart, G. E. Hinton, and R. J. Williams.: Learning representations by back-propagating errors. Nature 323, 533-536 (1986)
12. G.-B. Huang, Q.-Y. Zhu, C.-K. Siew.: Extreme learning machine: A new learning scheme of feedforward neural networks. In: Proceedings of the International Joint Conference on Neural Networks, pp. 985-990 (2004)
13. Huang G B, Bai Z, Kasun L L C, et al.: Local receptive fields based extreme learning machine. IEEE Computational Intelligence Magazine 10, 18-29 (2015)
14. van Heeswijk M, Miche Y, Oja E, et al.: GPU-accelerated and parallelized ELM ensembles for large-scale regression. Neurocomputing 74, 2430-2437 (2011)
15. Lopez-Fandino J, Quesada-Barriuso P, Heras D B, et al.: Efficient ELM-based techniques for the classification of hyperspectral remote sensing images on commodity GPUs. IEEE Journal of Selected Topics in Applied Earth Observations & Remote Sensing (2015)
16. Huang G B, Zhu Q Y, Siew C K.: Extreme learning machine: Theory and applications. Neurocomputing 70, 489-501 (2006)
17. Lawrence S, Giles C L, Tsoi A C, et al.: Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks 8, 98-113 (1997)
18. Rumelhart D E, Hinton G E, Williams R J.: Learning representations by back-propagating errors. Nature 323, 533-536 (1986)
19. Courrieu P.: Fast computation of Moore-Penrose inverse matrices. Neural Information Processing - Letters and Reviews 8, 25 (2008)
20. Chetlur S, Woolley C, Vandermersch P, et al.: cuDNN: Efficient primitives for deep learning. arXiv:1410.0759 (2014)
21. Vedaldi A, Lenc K.: MatConvNet - Convolutional neural networks for MATLAB. arXiv:1412.4564 (2014)
22. S. Tomov, R. Nath, P. Du, and J. Dongarra.: MAGMA Users' Guide. ICL, UTK (2011). Available: http://icl.cs.utk.edu/magma/
23. A. Akusok, K. Bjork, Y. Miche, and A. Lendasse.: High-performance extreme learning machines: A complete toolbox for big data applications. IEEE Access, vol. 3 (2015)
24. Netzer Y, Wang T, Coates A, et al.: Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2 (2011)
25. Y. LeCun, F. J. Huang, and L. Bottou.: Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 97-104 (2004)
26. Wang Y, Dou Y, Liu X, et al.: PR-ELM: Parallel regularized extreme learning machine based on cluster. Neurocomputing 173, 1073-1081 (2016)


Biography


Shijie Li was born in 1989. He received his B.S. degree in Computer Science and Technology in Tsinghua University, in 2012, and received his M.S. degree in Computer Science and Technology at National University of Defense Technology in 2014, and now he is a Ph.D. candidate at National University of Defense Technology. His research interests include high performance computer architecture, parallel computing, and machine learning.


Xin Niu was born in 1983. He received his B.S. degree in computer science and technology from the National University of Defense Technology, China, in 2006, and his Ph.D. degree in geodesy and geoinformatics from the Royal Institute of Technology (KTH), Sweden, in 2012. He has been a Research Assistant with the PDL Laboratory, National University of Defense Technology, China, since 2013. His research interests include machine learning, pattern recognition and remote sensing image processing.


Yong Dou was born in 1966. He is a professor, Ph.D. supervisor, and senior member of the China Computer Federation (E200009248). He received his B.S., M.S., and Ph.D. degrees in Computer Science and Technology from the National University of Defense Technology in 1995. His research interests include high performance computer architecture, high performance embedded microprocessors, reconfigurable computing, and bioinformatics. He is a member of the IEEE and the ACM.


Qi Lv was born in 1987. He received his B.S. degree in Computer Science and Technology in Tsinghua University, in 2009, and received his M.S. degree in Computer Science and Technology at National University of Defense Technology in 2011, and now he is a Ph.D. candidate at National University of Defense Technology. His research interests include remote sensing and machine learning.


Yueqing Wang was born in 1988. He received his B.S. degree in Computer Science and Technology in Tsinghua University, in 2010, and received his M.S. degree in Computer Science and Technology at National University of Defense Technology in 2012, and now he is a Ph.D. candidate at National University of Defense Technology. His research interests include high performance computer architecture, parallel computing, and machine learning.