Towards adaptive learning with improved convergence of deep belief networks on graphics processing units

Pattern Recognition 47 (2014) 114–127
http://dx.doi.org/10.1016/j.patcog.2013.06.029
Noel Lopes a,b,*, Bernardete Ribeiro a,c

a CISUC – Center for Informatics and Systems of University of Coimbra, Portugal
b UDI/IPG – Research Unit, Polytechnic of Guarda, Portugal
c Department of Informatics Engineering, University of Coimbra, Portugal

* Corresponding author at: UDI/IPG – Research Unit, Polytechnic of Guarda, Portugal. Tel.: +351 271222690; fax: +351 271220100. E-mail addresses: [email protected] (N. Lopes), [email protected] (B. Ribeiro).

Article info

Available online 4 July 2013

Keywords: Deep learning; Deep belief networks; Restricted Boltzmann machines; Contrastive divergence; Adaptive step size; GPU computing

Abstract

In this paper we focus on two complementary approaches to significantly decrease the pre-training time of a deep belief network (DBN). First, we propose an adaptive step size technique to enhance the convergence of the contrastive divergence (CD) algorithm, thereby reducing the number of epochs needed to train the restricted Boltzmann machines (RBMs) that support the DBN infrastructure. Second, we present a highly scalable graphics processing unit (GPU) parallel implementation of the CD-k algorithm, which notably boosts the training speed. Additionally, extensive experiments are conducted on the MNIST and the HHreco databases. The results suggest that the maximum useful depth of a DBN is related to the number and quality of the training samples. Moreover, it was found that the lower-level layer plays a fundamental role in building successful DBN models. Furthermore, the results contradict the preconceived idea that all the layers should be pre-trained. Finally, it is shown that by incorporating multiple back-propagation (MBP) layers, the DBNs' generalization capability is remarkably improved. © 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Recent advances in deep learning methods have led to widespread enthusiasm among pattern recognition and machine learning (ML) researchers [1,2]. Inspired by the depth structure of the brain, deep learning architectures hold the promise of revolutionizing and widening the range of tasks performed by computers [1]. In recent months, deep learning applications have been growing both in number and accuracy [1]. Moreover, just a few months ago, a team of graduate students of Geoffrey E. Hinton won the top prize in a contest aimed at finding molecules that might lead to new drugs. This was a particularly impressive achievement, because never before had a system based on a deep learning architecture won a similar competition, and the software was designed with no prior knowledge of how the molecules bind to their targets, using only a relatively small dataset [1].

Deep models reflect many levels of composition of non-linear operations in their outputs [2–4]. The idea is to have feature detector units at each layer (level) that gradually extract more sophisticated and invariant features from the original raw input signals. Lower layers aim at extracting simple features that are then clamped into higher layers, which in turn detect more complex features [5]. In contrast, shallow models (e.g. two-layer neural networks (NNs), support vector machines (SVMs)) present very few layers that map the original input features into a problem-specific feature space [2,6].

Deep architectures can be exponentially more efficient than shallow ones [7]. The latter may require a huge number of elements to represent highly varying functions [2–4]. Deep architectures, on the other hand, can represent these functions efficiently, in particular when their Kolmogorov complexity is small [2]. Since each element of the architecture is learned using examples, the number of computational elements one can afford is limited by the number of training samples [4]. Thus, the depth of the architecture can be very important from the point of view of statistical efficiency, and using shallow architectures may result in poor generalization models [4]. As a result, deep models tend to outperform shallow models such as SVMs [2]. Moreover, theoretical results suggest that deep architectures are fundamental for learning the kind of complex functions that can represent high-level abstractions (e.g. vision, language) [4], characterized by many factors of variation that interact in non-linear ways, making the learning process difficult [2].

However, the challenge of training deep NNs remained elusive for a long time [4], until the development of DBNs [8], which were successfully applied to several domains including classification, regression, dimensionality reduction, object segmentation, information retrieval, language processing, robotics, speech, audio, and collaborative filtering [2–4,9,6], thus demonstrating their ability to often outperform state-of-the-art algorithms in these areas [4].

Nevertheless, training a DBN is a computationally expensive task that involves independently training several RBMs and requires a considerable amount of time and effort [10,11]. Moreover, the proper choice of the learning parameters is a fundamental aspect that considerably affects the network's convergence [10]. Recently, there has been renewed interest in accelerating the training of NNs [12]. In particular, concerning RBMs, several approaches relying on customized hardware (field-programmable gate arrays (FPGAs)) [13,12] and GPUs [14,15] have been proposed. In our view, the GPU represents the most compelling option, since dedicated hardware fails to meet expectations: it is typically expensive, unreliable, poorly documented, of reduced flexibility, and obsolete within a few years [16]. Additionally, FPGA implementations cannot be shared and validated by other researchers, who probably do not have access to the hardware. GPUs, on the other hand, are widely available and relatively inexpensive [16–18].

In this paper we present two complementary approaches to speed up the training of RBMs and DBNs. First, an adaptive step size technique that solves the difficulty of choosing adequate learning rate and momentum terms, while enhancing the training convergence, is presented. Second, we rely on a multi-core GPU parallel implementation of the CD algorithm to speed up the training process. The resulting implementation is unique in that it incorporates the proposed adaptive step size technique. Moreover, unlike other implementations, we have made our code open source so that others can readily use and improve it. Finally, we use the resulting tool to analyze the effects of varying the number of layers and neurons of a DBN.

This paper is structured as follows. Section 2 details both the DBN and RBM networks. Section 3 presents the proposed adaptive step size technique. Section 4 describes the GPU parallel implementation. Section 5 asserts the validity of both approaches in speeding up the training process and analyzes the effects of varying the number of layers and neurons in a DBN. Finally, Section 6 draws the conclusions and points out future work.

2. Deep belief network

A DBN is composed of several RBM layers. Each RBM receives the inputs of the previous layer and feeds the RBM in the next layer. Hence, training a DBN consists of independently training each one of the RBMs, starting with the lower-level RBM and progressively moving up in the hierarchy.

2.1. Restricted Boltzmann machine

An RBM is an energy-based generative model that consists of a layer of I binary visible units (observed variables), v ∈ {0,1}^I, and a layer of J binary hidden units (explanatory factors), h ∈ {0,1}^J, with bidirectional weighted connections [19], as depicted in Fig. 1. RBMs follow the encoder–decoder paradigm [20], where both the encoded representation and the (decoded) reconstruction are stochastic by nature. The encoder–decoder architecture is useful because: (i) after training, the feature vector can be computed in a very fast way and (ii) by reconstructing the input we can assess how well the model was able to capture the relevant information from the data [20].

Fig. 1. Schematic representation of a restricted Boltzmann machine (RBM).

Given an observed state, the energy of the joint configuration of the visible and hidden units (v,h) is given by (1):

E(v,h) = -c^\top v - b^\top h - v^\top W h = -\sum_{i=1}^{I} c_i v_i - \sum_{j=1}^{J} b_j h_j - \sum_{j=1}^{J} \sum_{i=1}^{I} W_{ji} v_i h_j,   (1)

where c ∈ R^I represents the bias of the visible units, b ∈ R^J the bias of the hidden units and W ∈ R^{J×I} a matrix containing the RBM connection weights. In order to break symmetry, the weights are typically initialized with small random values (e.g. between -0.01 and 0.01) [19]. The hidden bias, b_j, can be initialized with a large negative value (e.g. -4) in order to encourage sparsity, and the visible units bias, c_i, to log(p̂_i/(1 - p̂_i)), where p̂_i is the proportion of training vectors in which v_i = 1 [19]. Fig. 2 shows the advantages of initializing c_i in this manner.

Fig. 2. Reconstruction of the MNIST digits made by a newly initialized restricted Boltzmann machine (RBM) (p̂_i is the proportion of vectors in which the pixel i is on).

The RBM assigns a probability to each configuration (v,h), using (2):

p(v,h) = \frac{e^{-E(v,h)}}{Z},   (2)

where Z is a normalization constant, called the partition function by analogy with physical systems, given by the sum over all energy configurations [4,19,21]:

Z = \sum_{v,h} e^{-E(v,h)}.   (3)

Since there are no connections between any two units within the same layer, given a particular random input configuration, v, all the hidden units are independent of each other and the probability of h given v becomes

p(h|v) = \prod_j p(h_j = 1|v),   (4)

where

p(h_j = 1|v) = s\left(b_j + \sum_{i=1}^{I} v_i W_{ji}\right),   (5)

and s(x) is the sigmoid function 1/(1 + e^{-x}). For implementation purposes, h_j is set to 1 when p(h_j = 1|v) is greater than a given random number (uniformly distributed between 0 and 1) and 0 otherwise. Similarly, given a specific hidden state, h, the probability of v given h is given by (6):

p(v|h) = \prod_i p(v_i = 1|h),   (6)

where

p(v_i = 1|h) = s\left(c_i + \sum_{j=1}^{J} h_j W_{ji}\right).   (7)
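To make (4)–(7) concrete, the following minimal CPU-side sketch samples the hidden layer given a visible vector. It assumes the row-major weight layout later adopted in Section 4 (W[j*I + i] holding W_ji); the function names and layout are our own illustration, not code from the paper's implementation.

```cpp
#include <cmath>
#include <random>
#include <vector>

// s(x) = 1 / (1 + e^{-x}), the sigmoid of Eq. (5).
static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Sample h ~ p(h|v) using Eqs. (4)-(5): each hidden unit is set to 1 when
// p(h_j = 1|v) exceeds a uniform random number, and to 0 otherwise.
void sampleHiddenGivenVisible(const std::vector<int>& v, std::vector<int>& h,
                              const std::vector<double>& W,  // J*I, row-major
                              const std::vector<double>& b,  // hidden biases
                              int I, int J, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    for (int j = 0; j < J; ++j) {
        double activation = b[j];
        for (int i = 0; i < I; ++i) activation += v[i] * W[j * I + i];
        h[j] = sigmoid(activation) > uniform(rng) ? 1 : 0;
    }
}
```

Sampling the visible units given h (Eqs. (6)–(7)) is symmetric, with the roles of I and J exchanged and the visible biases c used instead of b.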


When using (7) in order to reconstruct the input vector, it is vital to force the hidden states to be binary. Using the actual probabilities would seriously violate the information bottleneck, which acts as a strong regularizer [19]. The marginal probability assigned to a visible vector, v, is given by (8):

p(v) = \sum_h p(v,h) = \frac{1}{Z} \sum_h e^{-E(v,h)}.   (8)

Hence, given a specific training vector v, its probability can be raised by adjusting (optimizing) the weights and the biases of the network in order to lower the energy of that particular vector while raising the energy of all the others. To this end, we can perform a stochastic gradient ascent on the log-likelihood manifold obtained from the training data vectors, by computing the derivative of the log probability with respect to the network parameters θ ∈ {b_j, c_i, W_ji} (see Appendix A), which is given by (9):

\frac{\partial \log p(v)}{\partial \theta} = -\sum_h p(h|v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \theta},   (9)

where the first term is the positive phase and the second the negative phase.

As in the maximum likelihood learning procedure, we aim at finding the set of network parameters for which the probability of the (observed) training dataset is maximized. Computing ∂E(v,h)/∂θ is straightforward. Thus, in order to obtain an unbiased stochastic estimator of the log-likelihood gradient, we need a procedure to sample from p(h|v) and another to sample from p(v,h) [4]. In the so-called positive phase, v is clamped to the observed input vector, x, and h is sampled from v, while in the negative phase, both v and h are ideally sampled from the model [4]. Sampling can be done by setting up a Markov chain Monte Carlo (MCMC) method using alternating Gibbs sampling (AGS) [19,4]. Each iteration of the AGS consists of updating all of the hidden units using (5), followed by updating all of the visible units using (7) [19].

This process is represented in Fig. 3. Using this procedure we can rewrite (9) as (10):

\frac{\partial \log p(v)}{\partial \theta} = -\left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_0 + \left\langle \frac{\partial E(v,h)}{\partial \theta} \right\rangle_\infty,   (10)

where ⟨·⟩_0 denotes the expectations under the data distribution (p_0 = p(h|v) = p(h|x)) and ⟨·⟩_∞ denotes the expectations under the model distribution (p_∞(v,h) = p(v,h)) [3,21]. Unfortunately, computing ⟨v_i h_j⟩_∞ is intractable, as it requires performing AGS for a very long time [19,4] in order to draw unbiased samples from the model distribution [9]. To solve this problem, Hinton proposed a much faster learning procedure: the contrastive divergence (CD-k) algorithm [22,19], whereby ⟨·⟩_∞ is replaced by ⟨·⟩_k for small values of k [3]. Changing (10) accordingly, we obtain the following update rules:

\Delta W_{ji} = \gamma (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k),   (11)

\Delta b_j = \gamma (\langle h_j \rangle_0 - \langle h_j \rangle_k),   (12)

\Delta c_i = \gamma (\langle v_i \rangle_0 - \langle v_i \rangle_k),   (13)

where γ represents the learning rate. Algorithm 1 describes the main steps of the CD-k algorithm.

Algorithm 1. CD-k algorithm.
1: v^(0) ← x  ▹ x is an input vector of the training dataset
2: Compute the binary states of the hidden units, h^(0), using v^(0) and Eq. (5)
3: for n ← 1 to k do
4:   Compute the "reconstruction" states of the visible units, v^(n), using h^(n-1) and Eq. (7)
5:   Compute the binary states of the hidden units, h^(n), using v^(n) and Eq. (5)
6: end for
7: Update the weights and biases, using Eqs. (11)–(13)
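Algorithm 1 can be transcribed almost line by line. The sketch below performs one CD-k update for a single training vector, reusing sampleHiddenGivenVisible from Section 2.1 and assuming an analogous sampleVisibleGivenHidden for Eq. (7); batching and the adaptive step sizes of Section 3 are deliberately left out.

```cpp
// Analogous to sampleHiddenGivenVisible, but implementing Eq. (7).
void sampleVisibleGivenHidden(const std::vector<int>& h, std::vector<int>& v,
                              const std::vector<double>& W,
                              const std::vector<double>& c,
                              int I, int J, std::mt19937& rng);

// One CD-k update for a single training vector x (Algorithm 1).
void cdkUpdate(const std::vector<int>& x, std::vector<double>& W,
               std::vector<double>& b, std::vector<double>& c,
               int I, int J, int k, double gamma, std::mt19937& rng) {
    std::vector<int> v0 = x, h0(J), vn(I), hn(J);
    sampleHiddenGivenVisible(v0, h0, W, b, I, J, rng);      // step 2
    hn = h0;
    for (int n = 1; n <= k; ++n) {                          // steps 3-6
        sampleVisibleGivenHidden(hn, vn, W, c, I, J, rng);  // "reconstruction"
        sampleHiddenGivenVisible(vn, hn, W, b, I, J, rng);
    }
    for (int j = 0; j < J; ++j) {                           // step 7, Eqs. (11)-(13)
        for (int i = 0; i < I; ++i)
            W[j * I + i] += gamma * (v0[i] * h0[j] - vn[i] * hn[j]);
        b[j] += gamma * (h0[j] - hn[j]);
    }
    for (int i = 0; i < I; ++i)
        c[i] += gamma * (v0[i] - vn[i]);
}
```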

Fig. 3. Markov chain Monte Carlo using alternating Gibbs sampling in a restricted Boltzmann machine (RBM). The chain is initialized with the data input vector, x.

Fig. 4. DBN training with one input layer, x, and three hidden layers h1, h2, h3. From left to right, purple represents layers already trained, while cyan represents the RBM being trained. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)


Fig. 5. NVIDIA (GPU) device architecture.

2.2. Deep belief network architecture and training

DBNs were recently proposed by Hinton et al., along with an unsupervised greedy learning algorithm for constructing the network one layer at a time [8]. As described earlier, the underlying idea (see Fig. 4) consists of using an RBM for each layer, which is trained independently to encode the statistical dependencies of the units within the previous layer [5]. The training process, also called pre-training [2], is unsupervised by nature, making it possible to learn complex non-linear functions that map the input to the output directly from data [4]. However, the output of the top layer can easily be fed to a conventional supervised classifier [19,20]. Alternatively, it is also possible to create a classification model by adding an additional layer to the unsupervised pre-trained DBN, upon which the resulting network is fine-tuned using the back-propagation (BP) algorithm. Moreover, the BP algorithm will barely change the weights, and therefore most of the performance derives from the unsupervised pre-training phase [9].
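A minimal sketch of this greedy layer-wise procedure, built on the cdkUpdate sketch above; the RbmParams structure and function names are our own, not the paper's:

```cpp
// Parameters of one RBM layer: weights (J*I, row-major) and both bias vectors.
// (The weight initialization scheme of Section 2.1 is omitted for brevity.)
struct RbmParams { std::vector<double> W, b, c; int I, J; };

// Greedy layer-wise pre-training (Fig. 4): each RBM is trained in isolation
// and its hidden activations become the training data for the next layer.
std::vector<RbmParams> pretrainDbn(const std::vector<int>& sizes,
                                   std::vector<std::vector<int>> data,
                                   int epochs, int k, double gamma,
                                   std::mt19937& rng) {
    std::vector<RbmParams> dbn;
    for (size_t l = 0; l + 1 < sizes.size(); ++l) {
        int I = sizes[l], J = sizes[l + 1];
        RbmParams r{std::vector<double>(J * I), std::vector<double>(J),
                    std::vector<double>(I), I, J};
        for (int e = 0; e < epochs; ++e)            // train this RBM alone
            for (const auto& x : data)
                cdkUpdate(x, r.W, r.b, r.c, I, J, k, gamma, rng);
        for (auto& x : data) {                      // clamp activations upward
            std::vector<int> h(J);
            sampleHiddenGivenVisible(x, h, r.W, r.b, I, J, rng);
            x = h;
        }
        dbn.push_back(std::move(r));
    }
    return dbn;
}
```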

3. Adaptive step size technique

BP is one of the simplest and most widely used algorithms for training NNs [23]. Nevertheless, it is usually associated with long training times, especially for challenging problems involving large datasets [24]. Hence, numerous techniques have been proposed for accelerating its convergence [25]. Among these, the adaptive step size technique, proposed by Silva and Almeida [26,27], consists of using an individual learning rate (step size) parameter, γ_ji, for each weight connection, W_ji, instead of a global learning rate.


Fig. 6. Sequence of GPU kernel calls per epoch that implement the CD-k algorithm.

Adapting this idea to the case of an RBM, at each CD-k iteration the step sizes are adjusted according to the sign changes:

\gamma_{ji} = \begin{cases} u \, \gamma_{ji}^{old} & \text{if } (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)(\langle v_i h_j \rangle_0^{old} - \langle v_i h_j \rangle_k^{old}) > 0 \\ d \, \gamma_{ji}^{old} & \text{if } (\langle v_i h_j \rangle_0 - \langle v_i h_j \rangle_k)(\langle v_i h_j \rangle_0^{old} - \langle v_i h_j \rangle_k^{old}) < 0 \end{cases}   (14)

where u > 1 (up) represents the increment factor for the step size and d < 1 (down) the decrement factor. When two consecutive updates have the same direction, the step size of that particular weight is increased. For updates with opposite directions, the step size is decreased, thus avoiding oscillations in the learning process due to excessive learning rates [10]. The underlying idea of this procedure consists of finding near-optimal step sizes that allow bypassing ravines on the error surface. This technique is especially effective for ravines that are parallel (or almost parallel) to some axis [27]. In addition, it makes sense to use a different momentum term for each connection, α_ji = γ_ji α, proportional to a global momentum configuration, α, and to the step sizes, in order to further decrease the oscillations in the training process. According to our tests, it is advantageous to clamp α_ji such that 0.1 ≤ α_ji ≤ 0.9.
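A sketch of rule (14) for the weight matrix, including one plausible way to apply the clamped per-connection momentum α_ji = γ_ji·α; the flattened storage and variable names are our own.

```cpp
#include <algorithm>  // std::clamp (C++17)
#include <vector>

// Per-connection step size adaptation, Eq. (14), applied to the flattened
// weight matrix. grad[n] holds the current <v_i h_j>_0 - <v_i h_j>_k and
// lastGrad[n] the one from the previous epoch.
void adaptiveStepUpdate(std::vector<double>& W, std::vector<double>& stepSize,
                        std::vector<double>& lastGrad,
                        const std::vector<double>& grad,
                        std::vector<double>& velocity,
                        double u /* > 1 */, double d /* < 1 */, double alpha) {
    for (size_t n = 0; n < W.size(); ++n) {
        if (grad[n] * lastGrad[n] > 0.0)      stepSize[n] *= u;  // same direction
        else if (grad[n] * lastGrad[n] < 0.0) stepSize[n] *= d;  // oscillation
        // Per-connection momentum, clamped to [0.1, 0.9] as suggested above.
        double momentum = std::clamp(stepSize[n] * alpha, 0.1, 0.9);
        velocity[n] = momentum * velocity[n] + stepSize[n] * grad[n];
        W[n] += velocity[n];
        lastGrad[n] = grad[n];
    }
}
```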

4. GPU parallel implementation

Our GPU implementation of the RBMs and DBNs is built on top of the Compute Unified Device Architecture (CUDA) platform, as part of the open-source GPU Machine Learning Library (GPUMLib), whose source code and documentation are available at http://gpumlib.sourceforge.net/ [28,29].

4.1. Compute Unified Device Architecture (CUDA)

The GPUs in today's mainstream computing systems are powerful, highly parallel and programmable devices that can be used for general-purpose computing applications, potentially delivering enormous performance gains for computationally intensive applications [30]. In particular, GPUs make it possible to manipulate and explore massive datasets [31], enabling the speedups that are essential in pattern recognition applications. To this end, one important component is the programming model. In this context, CUDA represented a major step toward the simplification of the GPU computing model, by providing support for accessible interfaces and industry-standard languages, such as C and C++. Moreover, this technology is widely adopted as compared to alternatives (e.g. OpenCL, AMD Stream).

In CUDA, computations are expressed as so-called kernel functions, which are executed in parallel by different threads. Threads must be organized into blocks, which in turn form a grid. Both blocks and grids may have up to three dimensions. Figs. 8 and 11 in Section 4.2 show examples of a kernel grid and/or a thread block. Blocks are required to execute independently: it must be possible to execute them in any arbitrary order, either in parallel or in series. This requirement makes it possible to schedule the set of thread blocks (grid) in any order across any number of cores, and to write code that scales with the number of cores present on the device.

The CUDA programming model is supported by an architecture built around a scalable array of multi-threaded streaming multiprocessors (SMs), as depicted in Fig. 5. Each SM contains several scalar processor (SP) cores (also referred to as CUDA cores). While a typical x86 processor has only a few cores, each usually running two threads, a CUDA GPU is able to run thousands of threads, fast-switching among them at every clock cycle [32]. When a program on the host invokes a kernel grid, its blocks are enumerated and distributed to the SMs with available execution capacity.

4.2. RBMs and DBNs CUDA parallel implementation

The RBM weights are not updated after each sample is presented, but rather in a batch or mini-batch process. Hence, given a training dataset containing N samples, we shall assume that the visible unit vectors, v, form a matrix V ∈ R^{N×I}, in which each row corresponds to a visible units vector, i.e. V = [v_1, v_2, …, v_N]^⊤. Similarly, we shall assume that the hidden unit vectors, h, form a matrix H ∈ R^{N×J}, in which each row corresponds to a hidden units vector, i.e. H = [h_1, h_2, …, h_N]^⊤.


Fig. 7. Implications of storing the connection weights using row-major order.

In order to implement Algorithm 1 we devised three CUDA kernels: a kernel to compute the binary states of the hidden units, named ComputeStatusHiddenUnits, which is used to implement steps 2 and 5; a kernel to compute the "reconstruction" states of the visible units, named ComputeStatusVisibleUnits, which is used to implement step 4; and finally a kernel to update the weights and biases, named CorrectWeights, which is used to implement step 7. The latter also adjusts the step sizes of each connection. Fig. 6 shows the sequence of kernel calls (per epoch) needed to implement the CD-k algorithm.

CUDA requires a much larger number of threads (typically millions) than those needed by traditional multi-core systems in order to hide the global memory latency. Thus, one needs to define threads at a much finer granularity to take full advantage of the GPU's high number of cores [33]. Hence, instead of considering the neuron as the smallest unit of computation (thread) for the ComputeStatusHiddenUnits and ComputeStatusVisibleUnits kernels, we considered instead a connection between two neurons. This decision took advantage of our previous work on the computation of the hidden unit states in a BP layer [24], whose process is similar. Although conceptually the decision may seem odd, the rationale behind it is to think of a connection as performing a simple function that multiplies the clamped input by its weight. In this case, each block represents a neuron, and we can take advantage of the fast shared memory to sum up the values computed by each thread, using a reduction process, and then compute the output of the neuron for the active sample. A sketch of this strategy is given below.

The order in which the weights of matrix W are stored in memory (row-major or column-major) affects both the ComputeStatusHiddenUnits and ComputeStatusVisibleUnits kernels. Essentially, one of the kernels will be able to access the weights in a coalesced manner, thus speeding up its execution, while the other will not. Since the kernel ComputeStatusHiddenUnits needs to be called more times (see Fig. 6), we decided to store W in row-major order, thus improving its performance to the detriment of the ComputeStatusVisibleUnits kernel. Fig. 7 shows the effects of this decision.

The bulk of the work carried out by the CorrectWeights kernel consists of aggregating the values of ΔW_ji, Δb_j and Δc_i (see (11)–(13)), needed to update the weights. Our first approach to implementing this kernel consisted of creating a block for each connection, in which each thread gathers and sums the values of one or more samples, depending on the actual number of samples (N). Then a reduction process takes place in order to calculate the deltas, upon which the weights and biases are updated. Fig. 8 illustrates the resulting grid and block structure.

In order to evaluate this first GPU implementation, we conducted preliminary tests using the MNIST dataset (described later in Section 5.1). The left column of Fig. 9 shows the proportion of time spent in each kernel, for I = 784, J = 400 and N = 1000 (these values correspond to the worst GPU speedup, according to the test results presented in the next section). Note that, despite ComputeStatusHiddenUnits being called twice (see Fig. 6), the overall time used by this kernel is still inferior to the time consumed by ComputeStatusVisibleUnits.
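The following CUDA sketch illustrates the connection-per-thread strategy for the hidden units: each block computes one hidden unit for one sample, with threads striding over the visible units and a shared-memory reduction producing the activation. It is our simplified reconstruction of the idea behind ComputeStatusHiddenUnits (assuming pre-generated uniform random numbers and a power-of-two block size), not the GPUMLib source.

```cuda
// Grid: (J, N); block: e.g. 256 threads; shared memory: blockDim.x floats.
__global__ void computeHiddenStates(const float* V, const float* W,
                                    const float* b, float* H,
                                    const float* random, int I, int J) {
    extern __shared__ float partial[];
    int j = blockIdx.x;       // hidden unit (one block per neuron)
    int sample = blockIdx.y;  // training sample
    float sum = 0.0f;
    for (int i = threadIdx.x; i < I; i += blockDim.x)
        sum += V[sample * I + i] * W[j * I + i];  // row-major W: coalesced
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {  // reduction
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) {   // Eq. (5) plus stochastic binarization
        float p = 1.0f / (1.0f + expf(-(b[j] + partial[0])));
        H[sample * J + j] = (p > random[sample * J + j]) ? 1.0f : 0.0f;
    }
}
```

A launch such as computeHiddenStates<<<dim3(J, N), 256, 256 * sizeof(float)>>>(...) would then process all the samples and hidden units in a single call.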

Fig. 8. Grid and block structure used by the first approach of the kernel CorrectWeights.

Fig. 9. Proportion of time spent, per epoch, in each kernel (in a GTX 280 device).

This is due to the advantage of accessing the memory in a coalesced manner. Moreover, in this approach, the CorrectWeights kernel consumes almost 3/4 of the total training time. Nevertheless, the preliminary tests show that, overall, the GPU implementation presented speedups of one order of magnitude relative to the Central Processing Unit (CPU) version.

We identified two main problems in the first approach of the CorrectWeights kernel, both related to memory accesses to the V^(0), H^(0), V^(n) and H^(n) matrices: first, the accesses were not being done in a coalesced manner, and second, many blocks were trying to access the same memory addresses, which could potentially lead to memory conflicts. Fig. 10 illustrates the latter problem: for any given hidden unit, h_j, there are I + 1 connections that need to access the h_j value in order to update their weights. Hence, they all need to access the same elements of H^(0) and H^(n). Similarly, for any given visible unit, v_i, there are J + 1 connections that need to access the v_i value in order to update their weights. Thus, they all need to access the same elements of V^(0) and V^(n). To avoid these problems we decided to use a different approach and rewrite the kernel from scratch.


Fig. 10. Connections to the hidden unit j.

Fig. 11. Block structure of the improved approach of the kernel CorrectWeights.

The rationale consists of avoiding memory conflicts and uncoalesced accesses, while taking advantage of the shared memory to reduce global memory accesses. To this end, in our new and improved approach, each block processes several adjacent connections that require, to some degree, access to the same elements of V^(0), H^(0), V^(n) and H^(n). Fig. 11 shows the new block structure of the CorrectWeights kernel. The number of threads per block was set to 16 × 16 = 256, since this consistently yielded the best results among the several configurations tested on the MNIST dataset, using different values of J and N. Each thread within a block must now process all the samples. For each sample, the block starts by copying the portions of V^(0), H^(0), V^(n) and H^(n) required by all the threads within the block to the shared memory, which is much faster than the global memory and can be used simultaneously by several threads within the block. Note that, for threads with the same index i, there will be 16 threads (each with a different j) that use the same values of V^(0) and V^(n). Similarly, for threads with the same index j, there will be 16 threads that use the same values of H^(0) and H^(n). Moreover, since the required portions of the matrices V^(0), H^(0), V^(n) and H^(n) are gathered for the same sample, the global memory accesses are now coalesced. Although the new approach has a much smaller number of blocks and threads, it is over 18 times faster than the original one (see Fig. 9), and this discrepancy grows even bigger for greater values of N and J. A sketch of this tiled strategy is given below. In terms of computation accuracy, although there are differences between the GPU and the CPU, these are irrelevant due to the stochastic nature of the CD-k algorithm.
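A sketch of the tiled strategy (mini-batch averaged, with the step size logic of Section 3 omitted); the staging of the V and H slices in shared memory is the point being illustrated, and the details are ours rather than the GPUMLib kernel.

```cuda
#define TILE 16  // 16 x 16 = 256 threads per block, as in the text

__global__ void correctWeightsTiled(const float* V0, const float* H0,
                                    const float* Vn, const float* Hn,
                                    float* W, int I, int J, int N, float gamma) {
    __shared__ float v0[TILE], vn[TILE], h0[TILE], hn[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;  // visible unit of this connection
    int j = blockIdx.y * TILE + threadIdx.y;  // hidden unit of this connection
    float delta = 0.0f;
    for (int s = 0; s < N; ++s) {
        // Stage the slices shared by the whole tile (coalesced global reads).
        if (threadIdx.y == 0 && i < I) {
            v0[threadIdx.x] = V0[s * I + i];
            vn[threadIdx.x] = Vn[s * I + i];
        }
        int jj = blockIdx.y * TILE + threadIdx.x;
        if (threadIdx.y == 1 && jj < J) {
            h0[threadIdx.x] = H0[s * J + jj];
            hn[threadIdx.x] = Hn[s * J + jj];
        }
        __syncthreads();
        if (i < I && j < J)  // accumulate <v_i h_j>_0 - <v_i h_j>_k, Eq. (11)
            delta += v0[threadIdx.x] * h0[threadIdx.y]
                   - vn[threadIdx.x] * hn[threadIdx.y];
        __syncthreads();     // slices are overwritten in the next iteration
    }
    if (i < I && j < J)
        W[j * I + i] += gamma * delta / N;  // mini-batch averaged update
}
```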

5. Results and discussion

5.1. Experimental setup

In our testbed experiments we used two datasets: the MNIST database of hand-written digits and the HHreco multi-stroke symbol database. The MNIST database is available at http://yann.lecun.com/exdb/mnist/ and contains a total of 70,000 samples (60,000 train samples and 10,000 test samples). Each sample consists of a 28 × 28 = 784 pixel image of a hand-written digit.

Fig. 12. Examples of the MNIST hand-written digits. Each column contains a different digit, starting with 0 in the left-most column and ending in 9 in the right-most column.

Fig. 12 presents examples of the MNIST images. Note that all the images were binarized. The HHreco database, available at http://embedded.eecs.berkeley.edu/research/hhreco/, contains a total of 7791 samples generated by 19 different persons. Overall, the database contains a total of 13 different symbol classes. Each user created at least 30 multi-stroke images per class, which means that for each symbol there are at least 19 × 30 = 570 samples. Fig. 13 presents examples of the HHreco images. We converted the original HHreco vector strokes into 28 × 28 = 784 raster pixel images, maintaining the aspect ratio of the images. Moreover, as in the MNIST dataset, we binarized the resulting images. No further pre-processing was done: since we are interested in evaluating the capacity of DBNs for extracting information from the original raw (image) data, we discard both the number of strokes and the time span information.

Since the resulting datasets have an equal number of inputs, the tests for evaluating the multi-core GPU parallel implementation were carried out exclusively on the MNIST dataset. Moreover, since DBNs are composed of stacked RBMs, which are individually trained, we concentrate our efforts on testing the algorithms' performance for training RBMs. To this end, we trained several RBMs, varying the number of training samples and the number of hidden neurons, J. In order to evaluate the convergence performance of the adaptive step size method, we compared the proposed method with several typical fixed learning rate and momentum settings. In this case, the study was also confined to the MNIST dataset. Finally, in order to analyze the effects of varying the number of layers and neurons of a DBN in terms of classification performance, we trained hundreds of networks on both datasets. The final training step of the DBNs was made using a GPU implementation of the BP and MBP algorithms, described in Lopes and Ribeiro [24]. Furthermore, the macro-average F-measure metric was used to compare the classification performance of the resulting models.
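For reference, the macro-average F-measure is the unweighted mean of the per-class F-measures; with C classes and per-class precision P_c and recall R_c, this is (our formulation of the standard definition):

```latex
F_{\text{macro}} = \frac{1}{C} \sum_{c=1}^{C} \frac{2 \, P_c R_c}{P_c + R_c}
```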


Table 1
Main characteristics of the NVIDIA GeForce devices used in this study.

                                GTX 280    GTX 460
Compute capability              1.3        2.1
Number of SMs                   30         7
Number of cores                 240        336
Peak performance (GFLOPS)       933.12     940.8
Device memory (GB)              1          1
Shared memory per block (KB)    16         48
Maximum threads per block       512        1024
Memory bandwidth (GB/s)         141.7      112.5
Shading clock speed (GHz)       1.3        1.4

Fig. 13. Examples of the HHreco multi-stroke images. Each column contains a symbol, while each row contains the images drawn by one of the users.

The performance of the CUDA parallel implementation was benchmarked against the counterpart CPU version, using an Intel Dual-Core i5-2410M (2.7 GHz) CPU with 8 GB of memory and an NVIDIA GeForce GTX 460 device. Moreover, the remaining tests were executed on an Intel Core 2 Quad Q9300 (2.5 GHz) CPU with 4 GB of memory and a GeForce GTX 280. Table 1 presents the main characteristics of the aforementioned GPUs.

5.2. Benchmark results and discussion

For comparing the RBM GPU and CPU implementations, we varied the number of samples, N, and hidden units, J. For each configuration 30 tests were performed. Fig. 14 presents the average time required to train an RBM for one epoch, as well as the GPU speedups, which range from approximately 22 to 46 times. It is worth noting that for N = 60,000 and J = 800 the CPU version takes over 40 min to train a single epoch, while the GPU version takes approximately 53 s [11]. Moreover, there seems to be a direct correlation between the speedup and the number of samples. This was anticipated, since the GPU scales better than the CPU when facing large volumes of data that can be processed in parallel, due to its high number of cores. Although not so pronounced, we can observe a similar trend correlating the speedup and the number of hidden units.

In order to evaluate the impact of the adaptive step size technique, we compared it with three different fixed learning rate settings (γ = 0.1, γ = 0.4 and γ = 0.7), each combined with three distinct momentum terms (α = 0.1, α = 0.4 and α = 0.7).

For the adaptive step size technique we set the initial step sizes to 0.1, the increment u to 1.2, and the decrement d to 0.8. Altogether, 12 configuration settings (three adaptive step size and nine fixed learning rate configurations) were used. For statistical significance, we conducted 30 tests per configuration, using an RBM with 784 inputs and 100 outputs. Each test starts with a different set of weights but, for fairness, all the configurations use the same weight settings. Due to the high number of tests, we decided to limit the size of the training dataset to 1000 samples.

Fig. 15 shows the evolution of the root mean square error (RMSE) of the reconstruction, according to the learning parameter settings. As expected, the adaptive step size technique outperforms all the fixed learning rate configurations. The discrepancy is quite significant (2.51%, 9.39% and 14.20% relative to the best fixed learning rate solution, respectively for α = 0.1, α = 0.4 and α = 0.7) and demonstrates the usefulness of the proposed technique and its robustness to an inadequate choice of the momentum [10]. Moreover, in order to achieve better results than those obtained after 1000 epochs with a fixed learning rate, we would only require 207, 68 and 48 epochs, respectively for α = 0.1, α = 0.4 and α = 0.7 [10]. Fig. 16 shows the quality of the reconstruction of the original images in the database, for both the best network trained with a fixed learning rate (γ = 0.4, α = 0.1) and the best network trained with the adaptive step size technique. Furthermore, Fig. 17 shows the receptive fields of the aforementioned networks and Fig. 18 their excitatory and inhibitory response zones.

Training a network for 1000 epochs using the adaptive step size takes on average 76.63 ± 0.09 s on an NVIDIA GTX 280, while training the same network with a fixed learning rate takes 76.12 ± 0.05 s. Thus, the overhead of this method is not significant, while the convergence of the network is considerably enhanced. Additionally, by using the adaptive step size technique, we are no longer required to search for a suitable γ parameter. Moreover, the step size method can easily recover from a bad choice of the initial learning rate [27], and the parameters u and d are easily tuned [10].

In order to analyze the effects of varying the number of layers and neurons, we pre-trained several three-layer DBNs using combinations of 100, 500 and 1000 neurons in each layer. Hence, for each dataset a total of 3^3 = 27 DBNs were trained. Since the DBNs have a modular architecture, we also included the networks obtained by considering only the first and the first two hidden layers. Thus, we end up with a total of 27 × 3 = 81 networks per dataset. Furthermore, we decided to test not only the "traditional" approach of adding an additional layer to the unsupervised pre-trained DBN, but also the effects of adding two layers (one hidden layer with 30 neurons and one output layer). Moreover, we also tested the effects of adding MBP layers instead of the standard BP ones. The MBP algorithm is a generalization of the well-known BP algorithm, which can be used for training networks with selective actuation neurons, whose contribution to the network output depends on the actual space localization of the samples [34].


Fig. 14. MNIST average training time per epoch (GPU speedups are indicated).

Fig. 15. Average reconstruction RMSE according to the learning parameters.

The MBP algorithm can be used for training networks where standard neurons coexist with selective actuation neurons, and it typically exhibits better generalization characteristics than the BP algorithm. Overall, for each one of the original pre-trained DBNs, four different classifier models were constructed. Thus, a total of 81 × 4 = 324 networks were trained for each dataset.

Given the large number of networks to be trained, and since, as said before, the BP algorithm hardly changes the weights learned in the greedy stage, we decided to freeze the weights of the pre-trained networks, changing only the weights of the appended classification layers. Additionally, we decided to use a small number of training samples for each dataset.


Fig. 16. Impact of the step size technique on the convergence of an RBM (α = 0.1).

Hence, in the case of the MNIST database, we used 1000 samples (100 of each digit) for the training dataset and the remaining 69,000 samples for the test dataset. Similarly, for the HHreco database, we used 650 samples (50 of each symbol) for the training dataset and the remaining 7141 for the test dataset. During the pre-training phase, the RBMs composing the DBNs were trained for a maximum of 1000 epochs. Moreover, in the discriminative phase, the resulting networks were trained for a maximum of 100,000 epochs.

Fig. 19(a) shows the classification performance according to the number of layers of the pre-trained DBNs. Moreover, Tables 2 and 3 present the top 10 best networks achieved, respectively, for the MNIST and HHreco datasets. Surprisingly, the average F-measure is inversely proportional to the number of layers (see Fig. 19(a)). Nevertheless, in the case of the MNIST dataset, most of the best DBNs contain four layers, not including the input layer (see Table 2). Although these results could probably be improved by fine-tuning all the weights of the network, we believe that the reduced number of samples used prevents the higher-order layers of the DBNs from extracting useful features that provide real discriminative gains, even though they may present reduced error rates. Intuitively, for these layers to be able to capture the underlying regularities of the data, the universe of training samples needs to contain evidence of such regularities. Naturally, the more samples we have, the more likely (probable) it is for the training data to exhibit evidence of more and more complex regularities. Hence, in order to create models that can actually extract complex and useful features from the raw data, the depth of the network must take into consideration not only the number of training samples but also their diversity. In practice, however, since a DBN is a modular system, it is possible to add new layers, increasing the network depth, and test whether the new features improve the overall system.

To corroborate this idea, we performed some preliminary tests using all the 60,000 training samples of the MNIST database. The amount of time required for training a DBN model using such a volume of samples is substantially large, involving several hours of training for both the pre-training and the training phases, thus making it difficult to carry out more exhaustive tests. Nevertheless, we were able to achieve far better results than the ones presented in Table 2 for all of the DBN models constructed. The best DBN, which presented an F-measure of 95.01%, is a four-layer network (784-600-400-20-10). The pre-training of the original 784-600-400 DBN (each RBM was trained for 300 epochs) took approximately 3.34 h. Then two MBP layers were added to the network, and the resulting network was trained for 10,000 epochs, taking approximately 3.41 h. Note that the pre-trained weights were frozen. Table 4 presents the confusion matrix of this network.

It is important to point out that the classification performance of the networks presented in Table 2 (measured over the 69,000 samples in the test dataset) is actually better than the corresponding performance measured over the standard test dataset (with 10,000 samples), making the results obtained with the full 60,000 training samples even better. Nevertheless, we are confident that these can be improved, namely through the fine-tuning of all the network weights and through the execution of additional experiments.

Fig. 19(b) shows the classification performance according to the number of neurons in the first layer of the pre-trained DBNs. Note that in this case the average classification performance of the networks improves as the number of units in the hidden layer grows. Moreover, all of the best networks, presented in Tables 2 and 3, have at least 500 neurons in the first hidden layer. Note also that, in both datasets, there is a marked discrepancy between the best network containing 100 neurons in the first hidden layer (which presents an F-measure of 81.01% and 76.74%, respectively, for the MNIST and HHreco datasets) and the remaining networks presented in Tables 2 and 3. Overall, these results indicate that it is fundamental to extract a significant number of characteristics from the original data right away in the lower-level layer, because these are the key for the next layers to extract additional refined features. The results obtained suggest that the more features (neurons) the first hidden layer comprises the better, although we would need additional tests with more hidden units to confirm this trend.

Fig. 19(c) presents the DBN classification performance depending on the topology (BP or MBP) of the additional layers. On average, in the case of the MNIST dataset, both topologies perform similarly, with a slight advantage for the BP topology. However, it is important to point out that all of the top 10 best networks, with no exception, have the MBP topology. In the case of the HHreco dataset, on average the MBP networks perform much better than the BP ones, and most of the networks presented in Table 3 have the MBP topology. Overall, these results confirm that it is possible to enhance the performance of DBNs by including MBP layers in their architecture.

Fig. 19(d) exhibits the DBN classification performance depending on whether an additional hidden layer with 30 neurons was added to the pre-trained networks. On average, in the HHreco dataset, having such an additional layer with randomly initialized weights turns out to be beneficial, since the classification performance is greatly enhanced. However, in the case of the MNIST dataset, the networks with the additional layer yielded slightly worse results. Nevertheless, odd as it may seem, all of the top 10 best networks in the MNIST dataset have this additional layer and none of the HHreco networks has it. Altogether, these results show that the DBNs' performance can be improved by adding an additional hidden layer with randomly initialized weights.


Fig. 17. Receptive fields of the best networks trained either with the adaptive step size or with a fixed learning rate.

Fig. 18. Receptive fields excitatory (red) and inhibitory (blue) response zones for the best networks trained either with the adaptive step size or with a fixed learning rate. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

6. Conclusions and future work

The methodology to design deep belief models is well understood, although highly time consuming. In particular, the pre-training phase consists of training several modular RBMs, which are progressively stacked on top of each other. To ease this process, by significantly reducing the training time of each RBM and therefore the overall training time of the DBNs, we have presented two complementary approaches: an adaptive step size technique that improves the RBMs' convergence, and a multi-core GPU parallel implementation of the CD-k algorithm, which drastically reduces the pre-training time. The experiments performed using the MNIST dataset demonstrate the efficiency of both approaches, as shown in Figs. 14–16. The careful design of the CUDA kernels supporting the GPU parallel implementation was vital to obtain speedups of up to 46×. In addition, the proposed adaptive step size technique further reduces the training time by decreasing the number of epochs needed for the networks to converge.

The resulting tool was used to analyze the effects of varying the number of layers and neurons, as well as the effects of adding new layers with randomly initialized weights to the pre-trained networks.


Fig. 19. DBN classification performance depending on several factors. (a) Pre-training layers, (b) First hidden-layer neurons, (c) Topology of the additional classification layers and (d) Networks with an additional random hidden layer.

The influence of MBP layers with selective actuation neurons was also studied. To this end, hundreds of DBNs were trained on both the MNIST and the HHreco datasets.

One of the findings of this study is that the number and diversity of training samples are highly correlated with the quality of the DBN models. Nevertheless, it is possible to build quality models even with few training samples (see Tables 2 and 3). By increasing the number of training samples we can build better models that are able to capture the underlying regularities of the data, thereby improving the overall system's discriminative capacity.


Table 2
Top 10 DBNs with the best classification performance for the MNIST dataset. The topology column refers to the topology of the added classification layers.

Topology   Pre-trained DBN layers   DBN layers            F-measure
MBP        784-1000-1000            784-1000-1000-30-10   82.92
MBP        784-500                  784-500-30-10         82.77
MBP        784-500-1000             784-500-1000-30-10    82.72
MBP        784-500                  784-500-30-10         82.49
MBP        784-500-1000             784-500-1000-30-10    82.38
MBP        784-500                  784-500-30-10         82.37
MBP        784-1000-1000            784-1000-1000-30-10   82.36
MBP        784-1000-500             784-1000-500-30-10    82.19
MBP        784-500-100              784-500-100-30-10     82.11
MBP        784-500-1000             784-500-1000-30-10    82.04

Table 3
Top 10 DBNs with the best classification performance for the HHreco dataset. The topology column refers to the topology of the added classification layers.

Topology   Pre-trained DBN layers   DBN layers         F-measure
BP         784-1000                 784-1000-13        80.37
MBP        784-1000-500             784-1000-500-13    80.25
MBP        784-500-500              784-500-500-13     80.13
MBP        784-1000-500             784-1000-500-13    80.04
MBP        784-1000-500             784-1000-500-13    79.95
MBP        784-1000                 784-1000-13        79.79
MBP        784-1000                 784-1000-13        79.78
MBP        784-1000                 784-1000-13        79.63
BP         784-500                  784-500-13         79.61
BP         784-500                  784-500-13         79.44

Table 4
Confusion matrix of the best MNIST DBN (trained with 60,000 samples). Rows: actual class; columns: predicted class.

Actual     0     1     2     3     4     5     6     7     8     9
0        959     0     1     2     1     3     6     2     3     3
1          0  1120     5     3     0     2     1     2     2     0
2          6     2   979     9     7     1     8    10     9     1
3          1     1    10   947     2    18     2     9    12     8
4          2     4     6     0   927     1     8     8     6    20
5          4     2     0    23     6   830    10     1    13     3
6          8     4     3     0     7     5   926     1     3     1
7          0     5    18    10     4     0     1   964     3    23
8          5     3     7    12    10     7     3     4   918     5
9          6     4     0     7    20     6     1    13    16   936

Moreover, based on the results, we believe that there is a relation between the number and diversity of training samples and the maximum useful (in the sense of improving the classification performance) depth of a DBN. The rationale is that a DBN can only find the regularities that are actually present in the data, and increasing the depth of a DBN serves no purpose when the data itself does not exhibit the type of complex regularities that would require additional layers. Therefore, increasing the number of samples increases the probability of the data samples presenting the same regularities as their true distribution.

Another finding is that the lower-level layer plays a fundamental role within a DBN structure. It is vital to extract a significant number of characteristics from the original data right away in this layer. Failure to do so may compromise the ability of the network to extract more complex and useful features in the next layers. In fact, the results obtained (see Fig. 19(b)) suggest that the more features (neurons) the first hidden layer comprises the better, although additional experiments are required to confirm this trend.

We also find that, contrary to the preconceived idea that all the layers within the DBN should be pre-trained, adding an additional hidden layer with randomly initialized weights on top of the pre-trained hidden layers of a DBN can actually improve its classification performance, by allowing the resulting network to further refine its discriminative capacity (see Fig. 19(d)). Finally, we have shown that by adding MBP layers with selective actuation neurons to a DBN we could also improve its classification performance, since these neurons provide the means for better generalization by seamlessly partitioning the feature input space (see Fig. 19(c)).

Future work will cover the design and execution of additional experiments on larger datasets. Additionally, more theoretical work is needed to study the relation between the generalization capabilities and the depth of DBNs.

Conflict of interest

None declared.

Acknowledgments

We would like to express our gratitude to the anonymous reviewers for their comments and suggestions. FCT (Fundação para a Ciência e Tecnologia) is acknowledged for funding project PEst-OE/EGE/UI4056/2011.

Appendix A. Derivative of the log probability with respect to the network parameters (θ)

\frac{\partial \log p(v)}{\partial \theta}
  = \frac{\partial \log \frac{1}{Z} \sum_h e^{-E(v,h)}}{\partial \theta}
  = \frac{\partial \log \sum_h e^{-E(v,h)}}{\partial \theta} - \frac{\partial \log Z}{\partial \theta}
  = \frac{\partial \log \sum_h e^{-E(v,h)}}{\partial \theta} - \frac{\partial \log \sum_{v,h} e^{-E(v,h)}}{\partial \theta}
  = -\frac{\sum_h e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta}}{\sum_h e^{-E(v,h)}} + \frac{\sum_{v,h} e^{-E(v,h)} \frac{\partial E(v,h)}{\partial \theta}}{\sum_{v,h} e^{-E(v,h)}}   (A.1)

where the first term corresponds to the positive phase and the second to the negative phase. Using (2) and taking into consideration (A.2),

p(h|v) = \frac{p(h,v)}{p(v)} = \frac{p(h,v)}{\sum_h p(h,v)} = \frac{e^{-E(v,h)}/Z}{\sum_h e^{-E(v,h)}/Z} = \frac{e^{-E(v,h)}}{\sum_h e^{-E(v,h)}},   (A.2)

we can rewrite (A.1) as (A.3):

\frac{\partial \log p(v)}{\partial \theta} = -\sum_h p(h|v) \frac{\partial E(v,h)}{\partial \theta} + \sum_{v,h} p(v,h) \frac{\partial E(v,h)}{\partial \theta},   (A.3)

where, again, the first term is the positive phase and the second the negative phase.

Appendix B. Notation

b       bias of the hidden units
c       bias of the visible units
d       step size decrement factor
I       number of visible units
J       number of hidden units
v       visible units
h       hidden units
N       number of samples
⊤       transpose
u       step size increment factor
W       weights matrix
x       input vector
Z       energy partition function
α       momentum term
γ       learning rate
s(·)    sigmoid function
Δ       change (variation) of a given variable; e.g. ΔW_ji represents the weight change
θ       network parameter
R       set of real numbers

References

[1] J. Markoff, Giant steps in teaching computers to think like us: 'neural nets' mimic the ways human minds listen, see and execute, International Herald Tribune, 24–25 November 2012, pp. 1–8.
[2] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 473–480.
[3] N.L. Roux, Y. Bengio, Representational power of restricted Boltzmann machines and deep belief networks, Neural Computation 20 (6) (2008) 1631–1649.
[4] Y. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning 2 (1) (2009) 1–127.
[5] H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 609–616.
[6] D. Yu, L. Deng, Deep learning and its applications to signal and information processing, IEEE Signal Processing Magazine 28 (1) (2011) 145–154.
[7] N.L. Roux, Y. Bengio, Deep belief networks are compact universal approximators, Neural Computation 22 (8) (2010) 2192–2207.
[8] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (7) (2006) 1527–1554.
[9] K. Swersky, B. Chen, B. Marlin, N. de Freitas, A tutorial on stochastic approximation algorithms for training restricted Boltzmann machines and deep belief nets, in: Information Theory and Applications Workshop, 2010, pp. 1–10.
[10] N. Lopes, B. Ribeiro, Improving convergence of restricted Boltzmann machines via a learning adaptive step size, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Lecture Notes in Computer Science, vol. 7441, Springer, Berlin/Heidelberg, 2012, pp. 511–518.
[11] N. Lopes, B. Ribeiro, J. Gonçalves, Restricted Boltzmann machines and deep belief networks on multi-core processors, in: The 2012 International Joint Conference on Neural Networks (IJCNN), 2012.
[12] S.K. Kim, P.L. McMahon, K. Olukotun, A large-scale architecture for restricted Boltzmann machines, in: 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2010.
[13] D. Ly, P. Chow, A high-performance FPGA architecture for restricted Boltzmann machines, IEEE Transactions on Neural Networks 21 (11) (2010) 1780–1792.
[14] R. Raina, A. Madhavan, A.Y. Ng, Large-scale deep unsupervised learning using graphics processors, in: Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 873–880.
[15] D.L. Ly, V. Paprotski, D. Yen, Neural Networks on GPUs: Restricted Boltzmann Machines, Technical Report, University of Toronto, 2009.
[16] D. Steinkraus, I. Buck, P.Y. Simard, Using GPUs for machine learning algorithms, in: Proceedings of the 8th International Conference on Document Analysis and Recognition, vol. 2, 2005, pp. 1115–1120.
[17] M. Garland, D.B. Kirk, Understanding throughput-oriented architectures, Communications of the ACM 53 (11) (2010) 58–66.
[18] B. Catanzaro, N. Sundaram, K. Keutzer, Fast support vector machine training and classification on graphics processors, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 104–111.
[19] G.E. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, Technical Report, Department of Computer Science, University of Toronto, 2010.
[20] M. Ranzato, Y. Boureau, Y. LeCun, Sparse feature learning for deep belief networks, in: Advances in Neural Information Processing Systems (NIPS 2007), vol. 20, 2007, pp. 1185–1192.
[21] M.A. Carreira-Perpiñán, G.E. Hinton, On contrastive divergence learning, in: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (AISTATS 2005), 2005, pp. 33–40.
[22] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14 (8) (2002) 1771–1800.
[23] J. Srinivas, G.S. Prafulla, P. Premchand, Speaker independent vowel recognition using backpropagation neural network on master–slave architecture, International Journal of Computer Applications 48 (3) (2012) 45–49.
[24] N. Lopes, B. Ribeiro, An evaluation of multiple feed-forward networks on GPUs, International Journal of Neural Systems 21 (1) (2011) 31–47.
[25] Z. Zainuddin, N. Mahat, Y.A. Hassan, Improving the convergence of the backpropagation algorithm using local adaptive techniques, International Journal of Computational Intelligence (2005) 172–175.
[26] F.M. Silva, L.B. Almeida, Acceleration techniques for the backpropagation algorithm, in: Proceedings of the EURASIP Workshop on Neural Networks, Lecture Notes in Computer Science, vol. 412, Springer-Verlag, 1990.
[27] L.B. Almeida, Handbook of Neural Computation, Oxford University Press, 1997 (Chapter C1.2, Multilayer perceptrons, pp. C1.2:1–C1.2:30).
[28] N. Lopes, B. Ribeiro, GPUMLib: an efficient open-source GPU machine learning library, International Journal of Computer Information Systems and Industrial Management Applications 3 (2011) 355–362.
[29] N. Lopes, B. Ribeiro, R. Quintas, GPUMLib: a new library to combine machine learning algorithms with graphics processing units, in: Proceedings of the 10th International Conference on Hybrid Intelligent Systems, 2010, pp. 229–232.
[30] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, J.C. Phillips, GPU computing, Proceedings of the IEEE 96 (5) (2008) 879–899.
[31] T. Hey, S. Tansley, K. Tolle, The Fourth Paradigm: Data-Intensive Scientific Discovery, Microsoft Research, 2009.
[32] T.R. Halfhill, Looking Beyond Graphics, Technical Report, In-Stat, 2009.
[33] S. Ryoo, C.I. Rodrigues, S.S. Baghsorkhi, S.S. Stone, D.B. Kirk, W.W. Hwu, Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, in: Proceedings of the 13th ACM Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73–82.
[34] N. Lopes, B. Ribeiro, An efficient gradient-based learning algorithm applied to neural networks with selective actuation neurons, Neural, Parallel and Scientific Computations 11 (2003) 253–272.

Noel Lopes is an Assistant Professor at the Polytechnic of Guarda, Portugal. He received an MSc in Computer Science from the University of Coimbra, Portugal, and is currently pursuing his PhD at the University of Coimbra. His main areas of interest are machine learning algorithms and graphics processing unit (GPU) computing.

Bernardete Ribeiro is a Professor at the Informatics Engineering Department, Faculty of Science and Technology, University of Coimbra, Portugal. She received an MSc degree in Computer Science and a PhD in Electrical Engineering, speciality of Informatics, both from the University of Coimbra. Her main publications are in the areas of neural networks and their applications to engineering systems, pattern recognition and support vector machines. She is a member of ACM and IEEE.