4 Selected approaches to supervised learning

4.1 Backpropagation and related approaches

4.1.1 Backpropagation

This chapter begins with backpropagation (BP) and backpropagation through time (BPTT). The former technique is the foundation for most (but not all) nonlinear training methods. At the heart of backpropagation is a more careful definition of the chain rule from calculus, called the chain rule for ordered derivatives, which makes it possible to account for many layers of a neural network. The technique was developed by Werbos in his 1974 Ph.D. dissertation (Werbos, 1974); this and some related publications are republished in (Werbos, 1994). The identical technique increased in popularity in the mid-1980s, and a particularly popular formulation was published in (Hinton, McClelland, & Rumelhart, 1986; Rumelhart & McClelland, 1986). To summarize, consider a simple multilayer perceptron with one or more hidden layers, as in Fig. 4.1. It can be described as:

p_{n+1}(i) = Σ_{j=1}^{m} w_{n+1}(i, j) x_n(j) + b_{n+1}(i)    (4.1)

where p is the net input to the neuron, m is the number of inputs to the neuron, n is the current time step, w is the weight of the corresponding input, b is the neuron bias, i is the index of the neuron, and j is the index of the input. The output of neuron i is:

y_{n+1}(i) = f_{n+1}(p_{n+1}(i))    (4.2)

where y is the neuron output and f is the activation function of the ith neuron. The total error E is calculated by comparing the output of the perceptron y(t) with the desired output d(t), as in (4.3):

E = (1/2) (y(t) − d(t))^2    (4.3)

Error minimization using gradient descent is achieved by calculating the partial derivatives of E with respect to each weight in the network. The partial derivatives of the error are calculated in two stages: a forward stage, described by (4.1) and (4.2), and a backward stage.


FIG. 4.1 Backpropagation in a multilayer perceptron with two hidden layers, showing the synaptic connections between neurons in the different layers.

In the backward stage, the derivatives are backpropagated from the output layer toward the input layer, starting with the computation of ∂E/∂y for each of the output units. Differentiating (4.3) for a specific input pattern n gives (4.4) (Rumelhart & McClelland, 1986):

∂E/∂y_n = y_n − d_n    (4.4)

Applying the chain rule gives the error derivative with respect to the input, ∂E/∂x_n:

∂E/∂x_n = (∂E/∂y_n) · (dy_n/dx_n)    (4.5)

The value of dy/dx is obtained by differentiating (4.2). Substituting into (4.5) gives:

∂E/∂x_i = (∂E/∂y_i) · f′(p_i)    (4.6)

This shows how a change in the input x affects the error. Since the total input is a linear function of the weights on the connections, it is easy to calculate how the error is affected by changes in the input states and weights. The partial derivative of the error with respect to a weight is:

∂E/∂w_{ij} = (∂E/∂x_i) · (∂x_i/∂w_{ij}) = (∂E/∂x_i) · y_j    (4.7)

The output of each neuron contributes to ∂E/∂y_j; the contribution resulting from the effect of neuron j on neuron i is:

(∂E/∂x_i) · (∂x_i/∂y_j) = (∂E/∂x_i) · w_{ij}    (4.8)

From (4.8), a general formula accounting for all connections to unit j can be written:

∂E/∂y_j = Σ_i (∂E/∂x_i) · w_{ij}    (4.9)


The partial derivatives of the error with respect to the weights can be used to change the weights after every input-output pattern, which does not require dedicated memory for the derivatives themselves. This is called online training (Haykin, 2018). An alternative approach is to accumulate the error derivative ∂E/∂w over all the input-output pairs in the training set before updating the weights accordingly. This approach is known as offline (or batch) training. The simplest version of gradient descent is to change each weight by an amount proportional to the accumulated error derivative, as described in (4.10):

Δw = −ε ∂E/∂w    (4.10)

As an improvement to this method, (Rumelhart & McClelland, 1986) used an acceleration method in which the current gradient modifies the velocity of the point in weight space instead of its position:

Δw(n) = −ε ∂E/∂w + α Δw(n − 1)    (4.11)

where n is the epoch's integer index, ε is the learning rate, and α ∈ (0, 1] is a momentum (decay) factor that determines the relative contributions of the training history and the current gradient to the weight change. For details about weight initialization techniques and the challenges that may arise in BP training, readers are advised to see (Goodfellow, 2015; Haykin, 2018).
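To make these update rules concrete, the sketch below trains a tiny two-layer perceptron with the batch gradient accumulation described above and the momentum update of (4.11). It is only an illustration under assumed choices, not the chapter's repository code: the XOR data, layer sizes, learning rate, and momentum value are all hypothetical.

```python
# Minimal sketch of batch backpropagation with the momentum update of Eq. (4.11).
# Illustrative only: the network, data, and hyperparameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(p):
    return 1.0 / (1.0 + np.exp(-p))

# Toy dataset (XOR) and a network with one hidden layer of 4 units.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
vel = [np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), np.zeros_like(b2)]
eps, alpha = 0.5, 0.9          # learning rate and momentum term of Eq. (4.11)

for epoch in range(2000):
    # Forward pass: Eqs. (4.1)-(4.2) for each layer.
    p1 = X @ W1 + b1; y1 = sigmoid(p1)
    p2 = y1 @ W2 + b2; y2 = sigmoid(p2)

    # Backward pass: Eqs. (4.4)-(4.9), with batch (offline) accumulation.
    dE_dy2 = y2 - D                       # Eq. (4.4)
    dE_dp2 = dE_dy2 * y2 * (1 - y2)       # Eq. (4.6) with f'(p) = y(1 - y)
    dE_dy1 = dE_dp2 @ W2.T                # Eq. (4.9)
    dE_dp1 = dE_dy1 * y1 * (1 - y1)

    grads = [X.T @ dE_dp1, dE_dp1.sum(0), y1.T @ dE_dp2, dE_dp2.sum(0)]  # Eq. (4.7)

    # Momentum update, Eq. (4.11): delta_w(n) = -eps*dE/dw + alpha*delta_w(n-1).
    params = [W1, b1, W2, b2]
    for k in range(4):
        vel[k] = -eps * grads[k] + alpha * vel[k]
        params[k] += vel[k]

final_out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print("final squared error:", float(0.5 * np.sum((final_out - D) ** 2)))
```

Setting alpha to zero recovers the plain accumulated-gradient update of (4.10).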

4.1.2 Backpropagation through time

When dealing with time series data, a few approaches dominate. The classic approach, backpropagation through time (Werbos, 1990), is discussed in this section. Section 4.2 discusses another popular approach, the training of recurrent neural networks, and Section 4.3 discusses Long Short-Term Memory (LSTM). In BPTT, time (i.e., memory) is important, since learning or classification becomes more accurate if it considers what the network has seen previously. This is analogous to experience in biological intelligence, where memory is essential to the organism. Generally, BPTT works by unrolling the recurrent neural network (RNN) in time, creating a copy of the network for each time step. Errors are then calculated and accumulated for each time step, after which the network is rolled back up and the weights are updated. Each time step of the unrolled recurrent neural network is considered an additional layer, given the order dependence of the problem, and the internal state from the previous time step is taken as an input for the subsequent time step. Fig. 4.2 shows a generalized form of BPTT on a recurrent neural network. The BPTT algorithm can be summarized in the following steps:

1. Present input-output pairs at specific time steps.
2. Unfold the network and calculate the total network error at each time step.
3. Fold the network and update the weights.
4. Repeat.

FIG. 4.2 General diagram for a BPTT network.

The strength of BP is that it is applicable to any system, even those that depend on past calculations within the network itself. In a traditional NN, parameters are not shared across layers, so nothing needs to be summed. The key difference between BPTT and the standard BP described in the previous section is that the gradients for W are summed at each time step. The BPTT algorithm can be computationally expensive if it goes back too many steps in time, because this increases the number of computations required for a single weight update. In Fig. 4.2, every neuron can take as input the values of any other neuron at previous time steps (Werbos, 1974, 1990). However, assuming n = 2 in Fig. 4.2, (4.1) can simply be replaced with:

p_i(t) = Σ_{j=1}^{i−1} w_{ij} x_j(t) + Σ_{j=1}^{N+n} w′_{ij} x_j(t − 1) + Σ_{j=1}^{N+n} w″_{ij} x_j(t − 2)    (4.12)

The network described by (4.12) can be significantly simplified by fixing some weights to zero, particularly the w″. It can be simplified even further if all w′ are set to zero as well, except those for w_{ii}. Werbos provided two reasons for this: "parsimony" and historical reasons (Werbos, 1990). In Werbos' paper (Werbos, 1990), all neurons other than input layer neurons can take as input the output of any other neuron through a time-lagged connection; w′ and w″ are the weights on these time-lagged connections between neurons. Code for BPTT in Python and MATLAB can be found in the code repository (Al-jabery, 2019). Also, readers who wish to code this algorithm themselves will find (Brownlee, 2017; Werbos, 1990) helpful. To calculate the derivatives, backward time calculations are required. In forward calculations, an exact result requires the calculation of an entire Jacobian matrix, which is computationally expensive and sometimes infeasible in large networks. The derivative calculation, or network adaptation, for a network that goes back two steps in time is described by:

∇x_i(t) = ∇Y_{i−N}(t) + Σ_{j=i+1}^{N+n} W_{ji} ∇p_j(t) + Σ_{j=m+1}^{N+n} W′_{ji} ∇p_j(t + 1) + Σ_{j=m+1}^{N+n} W″_{ji} ∇p_j(t + 2)    (4.13)

where ∇ stands for the gradient. Setting W″ to zero eliminates the last term in (4.13).


In order to adapt this network, ∇W′_{ij} and ∇W″_{ij} should be calculated:

∇W′_{ij} = Σ_{t=1}^{T} ∇p_i(t + 1) · x_j(t)    (4.14)

∇W″_{ij} = Σ_{t=1}^{T} ∇p_i(t + 2) · x_j(t)    (4.15)

The learning rate in BPTT is usually much smaller than that of basic BP. The network weights can be initialized to zeros, ones, or random values; there are also systematic approaches for initializing weights (Haykin, 2018; Werbos, 1990).
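The unfold/accumulate/fold cycle summarized above can be written out directly for a small recurrent layer. The sketch below is an illustrative NumPy example, not the repository code in (Al-jabery, 2019); the single tanh layer, the toy sequence, the squared-error loss, and the learning rate are all assumptions, and the w″ connections of (4.12) are fixed to zero.

```python
# Minimal backpropagation-through-time sketch for a single tanh recurrent layer.
# Illustrative only; the architecture, loss, and names below are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, T = 3, 5, 10
W  = rng.normal(0, 0.3, (n_hid, n_in))    # input weights  (w  in Eq. 4.12)
Wp = rng.normal(0, 0.3, (n_hid, n_hid))   # one-step recurrent weights (w' in Eq. 4.12)
x  = rng.normal(size=(T, n_in))           # toy input sequence
d  = rng.normal(size=(T, n_hid))          # toy target sequence
lr = 0.01                                  # BPTT typically uses a small learning rate

for epoch in range(200):
    # 1. Unfold: forward pass, storing every time step's state.
    y = np.zeros((T + 1, n_hid))           # y[t] holds the state after step t; y[0] = 0
    for t in range(T):
        p = W @ x[t] + Wp @ y[t]           # net input, cf. Eq. (4.12) with w'' = 0
        y[t + 1] = np.tanh(p)

    # 2. Accumulate gradients backward through the unrolled copies.
    gW, gWp = np.zeros_like(W), np.zeros_like(Wp)
    carry = np.zeros(n_hid)                # gradient flowing back from step t+1
    for t in reversed(range(T)):
        dy = (y[t + 1] - d[t]) + carry     # error at this step plus backpropagated term
        dp = dy * (1.0 - y[t + 1] ** 2)    # through the derivative of tanh
        gW  += np.outer(dp, x[t])          # summed over time, cf. Eq. (4.14)
        gWp += np.outer(dp, y[t])
        carry = Wp.T @ dp                  # pass the gradient to the previous time step

    # 3. Fold: a single weight update for the whole sequence.
    W  -= lr * gW
    Wp -= lr * gWp
```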

4.2 Recurrent neural networks

Recurrent Neural Networks (RNN) are important when analyzing time series data because they are designed to adaptively look backward over varying lengths of time. A simple RNN is shown in Fig. 4.3. The design is similar to the neural networks described in Section 4.1, with the key difference of feedback connections. Although engineers use the term "feedback", in neuroscience this is known as recurrence, and the field of neural networks has long adopted the latter term. Recurrent connections can be inputs from a node to itself or an input from a higher-level node back to a lower one; either of these creates a feedback loop. Such systems have many challenges. Analyzing the behavior of systems with feedback is more complex, so stability theorems have been developed, particularly for real-time applications. An equally prominent issue is the increased demands of training such systems.

FIG. 4.3 In this figure, recurrent connections are enabled from higher to lower levels. A node sending its output back to its own input is a special case. The Z^{−1} notation, adopted from electrical engineering, indicates a single-step time delay. This architecture can learn the number of steps back to use for a time series analysis. Taken from Hu, X., Prokhorov, D. V., & Wunsch, D. C. (2007). Time series prediction with a weighted bidirectional multi-stream extended Kalman filter. Neurocomputing. https://doi.org/10.1016/j.neucom.2005.12.135.


The training methods of Section 4.1 can be applied with longer training times. Another successful approach is the Extended Kalman Filter (EKF). This technique treats the optimization of neural network weights as a control problem and has been particularly useful in RNNs and related models. The method can be computationally complex, but heuristics such as the node-decoupled EKF can reduce this considerably. Advances in computing power have also significantly reduced barriers to training RNNs. The rest of this section is a slightly modified and condensed version of the explanation in (Hu et al., 2007). In addition to this and the papers and books cited below, see (Haykin, 2001) for a thorough discussion. Multi-stream EKF consists of the following: (1) gradient calculation by backpropagation through time (BPTT) (Werbos, 1990); (2) weight updates based on the extended Kalman filter; and (3) data presentation using multi-stream mechanics (Feldkamp, Prokhorov, Eagen, & Yuan, 2011). See also (Anderson & Moore, 1979; Haykin, 1991; Singhal & Wu, 2003). Weights are interpreted as states of a dynamic system (Anderson & Moore, 1979), which allows for efficient Kalman training. Given a network with M weights and N_L output nodes, the weight update for a training instance at time step n of the extended Kalman filter is given by:

A(n) = [R(n) + H′(n) P(n) H(n)]^{−1}    (4.16)

K(n) = P(n) H(n) A(n)    (4.17)

W(n + 1) = W(n) + K(n) ξ(n)    (4.18)

P(n + 1) = P(n) − K(n) H′(n) P(n) + Q(n)    (4.19)

P(0) = I/η_p,  R(0) = η_r I,  Q(0) = η_q I    (4.20)

where R(n) is a diagonal N_L-by-N_L matrix whose diagonal components are equal to or slightly less than 1; H(n) is an M-by-N_L matrix containing the partial derivatives of the output node signals with respect to the weights; P(n) is an M-by-M approximate conditional error covariance matrix; A(n) is an N_L-by-N_L global scaling matrix; K(n) is an M-by-N_L Kalman gain matrix; W(n) is the length-M vector of weights; and ξ(n) is the error vector of the output layer. The use of artificial process noise Q(n) in Eq. (4.19) avoids numerical difficulties and significantly enhances performance. The decoupled EKF (DEKF) (Puskorius & Feldkamp, 1991, 1994) was implemented in (Hu, Vian, Choi, Carlson, & Wunsch, 2002) as a natural simplification of EKF that ignores the interdependence of mutually exclusive groups of weights. The advantage of EKF over backpropagation is that EKF often requires significantly fewer presentations of the training data and fewer overall training epochs (Puskorius & Feldkamp, 1991). Fig. 4.4 gives the flowchart. The multi-stream procedure (Feldkamp & Puskorius, 1994) was devised to cope with the conflicting requirements of training (Kolen & Kremer, 2001). It mitigates the risk that currently presented training data could be learned at the expense of performance on previous data, which is called the recency effect (Puskorius & Feldkamp, 1997).
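A single EKF weight update of (4.16)-(4.19), with the initialization of (4.20), amounts to a few lines of linear algebra. The sketch below is illustrative only and is not the implementation of (Hu et al., 2007); in practice H(n) and ξ(n) would come from BPTT on the actual network, whereas here they are random placeholders, and the η values are arbitrary assumptions.

```python
# Minimal sketch of one EKF weight update, Eqs. (4.16)-(4.20).
# Illustrative only; H and xi would come from BPTT on the actual network.
import numpy as np

def ekf_init(M, NL, eta_p=0.01, eta_r=1.0, eta_q=1e-4):
    """Eq. (4.20): initialize P, R, Q for M weights and NL outputs."""
    return np.eye(M) / eta_p, eta_r * np.eye(NL), eta_q * np.eye(M)

def ekf_step(W, P, R, Q, H, xi):
    """One update. H is M-by-NL (derivatives of outputs w.r.t. weights),
    xi is the length-NL output error vector."""
    A = np.linalg.inv(R + H.T @ P @ H)     # Eq. (4.16)
    K = P @ H @ A                          # Eq. (4.17)
    W = W + K @ xi                         # Eq. (4.18)
    P = P - K @ H.T @ P + Q                # Eq. (4.19); Q is artificial process noise
    return W, P

# Toy usage with random quantities standing in for BPTT-derived ones.
rng = np.random.default_rng(2)
M, NL = 20, 2                              # 20 weights, 2 output nodes
W = rng.normal(0, 0.1, M)
P, R, Q = ekf_init(M, NL)
H  = rng.normal(size=(M, NL))              # placeholder for dy/dW from BPTT
xi = rng.normal(size=NL)                   # placeholder output error
W, P = ekf_step(W, P, R, Q, H, xi)
```

A node-decoupled variant would replace the full M-by-M matrix P with one block per mutually exclusive group of weights, which is the source of DEKF's computational savings mentioned above.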

FIG. 4.4 Training a neural network using EKF. (Flowchart: initialize the network weights W and the matrices P, Q, R; until the learning iterations end, feed all the data forward through the network, compute H by BPTT, then apply the GEKF step to compute A and K and update the network weights W and P.)

Multi-stream training is based on the principle that each weight update should attempt to simultaneously satisfy the demands of multiple input-output pairs. In each cycle of training, a specified number N_S of starting points is randomly selected from a chosen set of files. Each starting point is the beginning of a stream. The multi-stream procedure consists of progressing in sequence through each stream, carrying out weight updates according to the current points. A consistent EKF update routine was devised for the multi-stream procedure: the training problem is treated as a single shared-weight network in which the number of original outputs is multiplied by the number of streams. In multi-stream training, the number of columns in H(n) is correspondingly increased to N_S × N_L. Multi-streaming is useful whether one uses EKF training or some other approach.

4.3 Long short-term memory

Long Short-Term Memory (LSTM) is a type of recurrent neural network with a strong ability to learn and predict sequential data. Research shows that standard RNNs are limited in


maintaining long-term memory. Therefore, the LSTM was invented to overcome this limitation by adding a memory structure, which can maintain its state over time, with gates that decide what to remember, what to forget, and what to output. The LSTM shows effective results in many applications that are inherently sequential, such as speech recognition, speech synthesis, language modeling and translation, and handwriting recognition. Several LSTM architectures offer major and minor changes to the standard one. The vanilla LSTM, described in (Greff, Srivastava, Koutnik, Steunebrink, & Schmidhuber, 2017), is the most commonly used LSTM variant in the literature and is considered a reference for comparisons. Fig. 4.5 shows a schematic of the vanilla LSTM block, which includes three gates (input, forget and output), an input block, an output activation function, peephole connections, and a memory block called the cell.

(A) Input Gate: The input gate receives the new information and the prior predictions as inputs and provides a vector that represents the candidate possibilities. The input information is regulated by an activation function; the logistic sigmoid is commonly used here, so the values in the vector lie between 0 and 1, and the highest value is the most likely to be predicted next. The generated vector is then combined with the viable possibilities previously stored in the cell to produce a collection of possibilities, whose values may range from less than −1 to more than 1. The information is held by the cell and manipulated by the gates; each of the three gates has its own weights and is trained to perform its designated task. The input gate is:

ī_t = W_i x_t + R_i y_{t−1} + p_i ⊙ c_{t−1} + b_i,   i_t = σ(ī_t)    (4.21)

where W_i is the input weight matrix, R_i is the recurrent weight matrix, p_i is the peephole weight vector, b_i is the bias vector, and σ is the logistic sigmoid given in (4.22).

FIG. 4.5 Schematic diagram of the LSTM (Greff et al., 2017).

σ(x) = 1 / (1 + e^{−x})    (4.22)

The symbol ⊙ denotes element-wise (pointwise) vector multiplication.

(B) Forget Gate: This gate removes information that is no longer useful from the cell state. The two inputs, the new information at a particular time and the previous prediction, are fed to the gate and multiplied by weight matrices. The result is then passed through an activation function that gives 0 when the information should be forgotten and 1 when the information must be retained for future use. This result is multiplied with the possibilities collected from the input gate and the cell, and the useful possibilities are stored in the cell:

f̄_t = W_f x_t + R_f y_{t−1} + p_f ⊙ c_{t−1} + b_f,   f_t = σ(f̄_t)    (4.23)

The cell state is updated as follows:

c_t = z_t ⊙ i_t + c_{t−1} ⊙ f_t

(C) Output Gate (selection gate): The output gate makes a selection based on the new information and the previous predictions. The final prediction results from multiplying the output of this gate with the normalized possibilities provided by the cell and the input gate. Since the collected possibilities may range from less than −1 to more than 1, the tanh activation function is used for normalization.

ō_t = W_o x_t + R_o y_{t−1} + p_o ⊙ c_t + b_o,   o_t = σ(ō_t)    (4.24)

The input block is given by:

z̄_t = W_z x_t + R_z y_{t−1} + b_z,   z_t = g(z̄_t)    (4.25)

where g is the hyperbolic tangent activation function, g(x) = tanh(x). The output block can be represented by the following formula:

y_t = h(c_t) ⊙ o_t    (4.26)

where h is the output activation function (tanh in the vanilla LSTM).
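Eqs. (4.21)-(4.26) define one forward step of the vanilla LSTM block. The sketch below writes that step out directly in NumPy, with ⊙ implemented as element-wise multiplication. It is illustrative only; the layer sizes, random initialization, and variable names are assumptions, and no training is shown.

```python
# One forward step of the vanilla LSTM block of Eqs. (4.21)-(4.26), with peepholes.
# Illustrative sketch; shapes and initialization are assumptions.
import numpy as np

def sigma(a):                      # logistic sigmoid, Eq. (4.22)
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, par):
    W, R, p, b = par["W"], par["R"], par["p"], par["b"]
    z_t = np.tanh(W["z"] @ x_t + R["z"] @ y_prev + b["z"])                  # Eq. (4.25)
    i_t = sigma(W["i"] @ x_t + R["i"] @ y_prev + p["i"] * c_prev + b["i"])  # Eq. (4.21)
    f_t = sigma(W["f"] @ x_t + R["f"] @ y_prev + p["f"] * c_prev + b["f"])  # Eq. (4.23)
    c_t = z_t * i_t + c_prev * f_t                                          # cell update
    o_t = sigma(W["o"] @ x_t + R["o"] @ y_prev + p["o"] * c_t + b["o"])     # Eq. (4.24)
    y_t = np.tanh(c_t) * o_t                                                # Eq. (4.26)
    return y_t, c_t

# Toy usage: 4 inputs, 3 LSTM units, random weights, a 5-step toy sequence.
rng = np.random.default_rng(3)
n_in, n_cell = 4, 3
gates = ("z", "i", "f", "o")
par = {"W": {g: rng.normal(0, 0.3, (n_cell, n_in)) for g in gates},
       "R": {g: rng.normal(0, 0.3, (n_cell, n_cell)) for g in gates},
       "p": {g: rng.normal(0, 0.3, n_cell) for g in ("i", "f", "o")},
       "b": {g: np.zeros(n_cell) for g in gates}}
y, c = np.zeros(n_cell), np.zeros(n_cell)
for x_t in rng.normal(size=(5, n_in)):
    y, c = lstm_step(x_t, y, c, par)
print("final output:", y)
```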

In some LSTM architectures the peephole connections are omitted, as in the simplified variant called the Gated Recurrent Unit (GRU) (Cho et al., 2014); however, (Gers & Schmidhuber, 2000) argued that adding them to the architecture can make precise timing easier to learn. (Greff et al., 2017) concluded that the forget gate and the output activation function significantly impact LSTM performance; removing either of them affects the performance negatively. They claim that the output activation function is necessary to prevent the unconstrained cell state from propagating through the network and destabilizing learning.


FIG. 4.6 A typical LSTM cell structure showing its three main components.

As shown in Fig. 4.6, the arrows run in two directions: forward and backward.

4.4 Convolutional neural networks and deep learning

Convolutional neural networks, or CNNs (LeCun & Bengio, 1998), are multilayer perceptrons designed specifically for pattern recognition in one-dimensional data (e.g., time series) or two-dimensional data matrices (e.g., images), with a high degree of invariance to translation, scaling, skewing, and other distortions. The idea behind this branch of neural networks is motivated by biology and goes back to (Hubel & Wiesel, 1962). Some give credit to Fukushima, who developed a convolutional network in 1980 and called it the Neocognitron (Fukushima, 1980). Richard Bellman, the founder of dynamic programming, observed that high-dimensional data are inherent to many applications. The main difficulty that arises, particularly in pattern classification applications, is that the learning complexity grows exponentially with a linear increase in the dimensionality of the data (Bellman, 1954). Regardless of these historical arguments, the name "convolutional neural network" indicates that the network employs a mathematical operation called convolution, described in the following section. This chapter provides an overview of two of the most important types of deep learning models, convolutional neural networks and deep belief networks (DBNs), along with their respective variations. These approaches have different strengths and weaknesses depending on the type of application. The following section shows the structure of a CNN and its training stages.


4.4.1 Structure of convolutional neural network

These neural networks use convolution in place of general matrix multiplication in at least one of their layers. They are very successful in many applications (Arbib et al., 2015; Haykin, 2018; LeCun & Bengio, 1998). Research into convolutional network architectures is growing so rapidly that a new architecture is announced every month, if not every week. Nevertheless, CNN structures consist of the following stages, according to the form of constraints that govern their structures:

1. Feature extraction. The mainstream approach to overcoming "the curse" has been to pre-process the data in a manner that reduces its dimensionality to one that can be effectively processed, for example by a classification engine. This dimensionality reduction scheme is often referred to as feature extraction. In a CNN, each neuron receives its synaptic input from a local receptive field in the previous layer, which leads to the extraction of local features. The exact position of an extracted feature loses its importance after this process as long as its position relative to the other features is preserved. This process is implemented in the input layer, where the input is usually a multidimensional array of data such as image pixels, image transformations, patterns, time series, or video signals (Ferreira & Giraldi, 2017). Some researchers have used Gabor filters as an initial pre-processing step to mimic the retinal response to visual excitation (Tivive & Bouzerdoum, 2003). Others have applied CNNs to various machine learning problems, including face detection (Tivive & Bouzerdoum, 2004), document analysis (Simard, Steinkraus, & Platt, 2003), and speech detection (Sukittanon, Surendran, Platt, Burges, & Look, 2004).

2. Feature mapping. Each computational layer of the network consists of multiple feature maps, with each map composed of a plane of individual neurons that share the same set of synaptic weights. This form of structural constraint has two advantages over the other forms: shift invariance and fewer free parameters (Haykin, 2018). The convolution process is performed in this part of the network; it is the main building block of a CNN. These layers comprise a series of filters, or kernels, with nonlinear functions that extract local features from the input, and each kernel is used to calculate a feature map, or kernel map. The first convolutional layer extracts low-level meaningful features such as edges, corners, textures, and lines (Krig, 2014). Each layer in this phase extracts higher-level features than the previous layers, and the highest-level features are extracted in the last layer (Chen, Han, Wang, Jeng, & Fan, 2006). Kernel size refers to the size of the filter that convolves around the feature map, while the amount by which the filter slides is the step size (stride); it controls how the filter convolves around the feature map. The filter thus convolves around the different layers of the input feature map by sliding one unit at a time (Arbib et al., 2015). Another essential feature of CNNs is padding, which allows the input data to be expanded.


For example, if there is a need to control the size of the output and the kernel width W independently, then zero padding is applied to the input. The following equations describe how the convolution process is performed; a direct implementation of the two-dimensional, discrete case is sketched in the code example after this list. In one dimension, the convolution between two functions is defined as:

(f ∗ g)(x) = ∫_{−∞}^{∞} f(s) g(x − s) ds    (4.27)

where f(x) and g(x) are two functions, ∗ is the convolution symbol, and s is the variable of integration. In two dimensions, the convolution between two functions is defined as:

g(x, y) = f(x, y) ∗ h(x, y) = ∫∫_{−∞}^{∞} f(s, t) h(x − s, y − t) ds dt    (4.28)

3. Detection or non-linearity. Assuming robust deep learning is achieved, a hierarchical network can be trained on a large set of observations, and signals from this network can later be fed to a relatively simple classification engine for robust pattern recognition. Robustness here refers to the ability to exhibit classification invariance to a diverse range of transformations and distortions, including noise, scale, rotation, various lighting conditions, displacement, etc. The prime purpose of convolution is to extract distinct features from the input. In this phase, the network learns complex models by passing the linear activations through nonlinear activation functions (Zheng, Liu, Chen, Ge, & Zhao, 2014). Examples of these activation functions are tanh(x), sigmoid(x), and the rectified linear unit (ReLU) (Albelwi & Mahmood, 2017). The last function, ReLU(x) = max(x, 0), increases the nonlinearity without affecting the receptive field of the convolutional layer, and it accelerates the learning process of the CNN by reducing gradient oscillation at all layers. In this stage, each layer consists of a generic multilayer network.

4. Subsampling (feature pooling). Capturing spatiotemporal dependencies, based on regularities in the observations, is viewed as a fundamental goal for deep learning systems. In this phase, the resolution and the network's computational complexity are reduced relative to the previous stages by exclusively choosing features that are robust to noise and distortion. The output of this phase is a filtered subset of the features carrying the most important or core information from the input data (Bengio, 2009; Ferreira & Giraldi, 2017). The pooling or subsampling is "tuned" by parameters in the learning process, but the basic mechanism is set by the network designer. The convolution process in this phase can be described by:

X_j^c = f( Σ_{i∈M_j} X_i^{c−1} ∗ k_{ij}^c + b_j^c )    (4.29)

where c is the convolution layer; X is the input feature; k is the kernel map; b is the bias; M_j is the subset selected from the features; and i and j index the inputs and outputs, respectively.


5. Fully connected layers. The last part of a CNN topology consists of one-dimensional layers that are fully connected to all activations in the previous layers (Bengio, 2009). These layers are usually used to train another classifier, which is typically a feedforward neural network. Training is performed using cost functions such as softmax, sigmoid cross-entropy, or Euclidean loss in order to penalize the network when it deviates from the true labels (i.e., targets) (Namatēvs, 2018; Schmidhuber, 2015; Haykin, 2018). CNNs have also been trained with a temporal coherence objective to leverage the frame-to-frame coherence found in videos (Robinson & Fallside, 1991), though this objective need not be specific to CNNs.
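The following sketch is the discrete, two-dimensional counterpart of Eq. (4.28) applied to a feature map, with the stride and zero padding described in the feature-mapping stage and the ReLU of the detection stage. As is conventional in CNN implementations, the kernel is not flipped (so strictly it is a cross-correlation). It is an unoptimized illustration; the image, kernel, stride, and padding values are arbitrary assumptions.

```python
# Direct 2-D convolution of a feature map with a kernel, with stride and zero padding;
# the discrete counterpart of Eq. (4.28).  Illustrative, unoptimized sketch.
import numpy as np

def conv2d(feature_map, kernel, stride=1, pad=0):
    fm = np.pad(feature_map, pad)                        # zero padding
    kh, kw = kernel.shape
    oh = (fm.shape[0] - kh) // stride + 1                # output height
    ow = (fm.shape[1] - kw) // stride + 1                # output width
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            patch = fm[r*stride:r*stride+kh, c*stride:c*stride+kw]
            out[r, c] = np.sum(patch * kernel)           # sum of element-wise products
    return out

# Toy usage: a 6x6 "image" and a 3x3 edge-like kernel, stride 1, zero padding 1.
image  = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)
fmap = conv2d(image, kernel, stride=1, pad=1)
print(fmap.shape)                 # (6, 6): this padding preserves the spatial size
fmap = np.maximum(fmap, 0.0)      # ReLU detection stage (step 3 above)
```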

4.4.2 Deep belief networks

Deep Belief Networks (DBNs) were invented as a solution to the problems encountered when training traditional neural networks in deeply layered configurations, such as slow learning, becoming stuck in local minima due to poor parameter selection, and requiring large training datasets. DBNs were introduced in (Larochelle, Erhan, Courville, Bergstra, & Bengio, 2007) as probabilistic generative models that provide an alternative to the discriminative nature of traditional neural nets. Generative models provide a joint probability distribution over input data and labels, facilitating the estimation of both P(x|y) and P(y|x), while discriminative models only model the latter, P(y|x). As illustrated in Fig. 4.7, DBNs consist of several stacked layers of "Boltzmann Machines", each restricted to a single visible layer and a single hidden layer.

FIG. 4.7 A deep belief network.


Restricted Boltzmann Machines (RBMs) can be considered a binary version of factor analysis: instead of many continuous factors, binary variables determine the network output. The restricted structure allows more efficient training of the generative weights of the hidden units. These hidden units are trained to capture the higher-order correlations observed in the visible units. The generative weights are obtained with an unsupervised greedy layer-by-layer method enabled by contrastive divergence (Hinton, 2002). The RBM training process uses Gibbs sampling: a vector v is presented to the visible units, which forward values to the hidden units. In the reverse direction, the visible unit values are stochastically reconstructed from the hidden activations. Finally, these new visible activations are forwarded again so that single-step reconstruction hidden unit activations, h, can be obtained.
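The Gibbs-sampling pass just described (visible to hidden, back to reconstructed visible, then to hidden again) is the heart of the single-step contrastive-divergence (CD-1) update. The sketch below is a minimal illustration for binary units; the layer sizes, learning rate, and toy data are assumptions, and it is not taken from the chapter's cited implementations.

```python
# Minimal CD-1 contrastive divergence update for a binary RBM (one Gibbs step:
# v -> h -> v_recon -> h_recon).  Illustrative sketch; sizes and rates are assumptions.
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(v0, W, b_vis, b_hid, lr=0.05):
    # Upward pass: hidden probabilities and a stochastic hidden sample.
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Downward pass: reconstruct the visible units from the hidden sample.
    pv1 = sigmoid(h0 @ W.T + b_vis)
    # Second upward pass on the reconstruction.
    ph1 = sigmoid(pv1 @ W + b_hid)
    # Contrastive divergence: difference between data and reconstruction statistics.
    W     += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)
    return W, b_vis, b_hid

# Toy usage: 8 visible units, 4 hidden units, a small random binary batch.
n_vis, n_hid = 8, 4
W = rng.normal(0, 0.1, (n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
batch = (rng.random((16, n_vis)) < 0.5).astype(float)
for _ in range(100):
    W, b_vis, b_hid = cd1_update(batch, W, b_vis, b_hid)
```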

4.4.3 Variational autoencoders

One application of deep learning is using a multilayer neural network to convert high-dimensional data to a low-dimensional representation. This neural network structure is called an autoencoder. This entire Section 4.4.3 is a paraphrased excerpt from the blog (Wu, 2019), used with permission; it, in turn, is a condensed explanation of the original contribution in (Kingma & Welling, 2013) and also benefitted from the synopsis in (Doersch, 2016). A Variational Autoencoder (VAE) can be defined as a stochastic version of a conventional autoencoder that imposes constraints on the distribution of the latent variables. The upper portion of Fig. 4.8 shows the basic concept of an autoencoder. An input is mapped to itself through several layers of a neural network, where the middle layer has a chokepoint of fewer nodes, forcing a compressed representation of the data.

FIG. 4.8 Variational Autoencoder Diagram. The target output is the same as the input. Due to the compressed hidden layer, this is necessarily an estimate rather than an exact mapping. The chokepoint created by the smaller hidden layer in the center creates a set of latent mappings. Figure adapted from (Kan, 2018; Tschannen, Bachem, & Lucic, 2018).


Certain constraints on the transfer functions of the hidden units (discussed below) account for the "variational" adjective. This section discusses the derivation of a variational autoencoder and how to implement it in TensorFlow (TensorFlow, 2015). The VAE aims to learn the underlying distribution of a dataset, which is unknown and usually complicated. To set this up, we need the common measure (not symmetric, thus strictly not a distance metric) of two distributions p(x) and q(x). This is the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951), defined by:

D_KL[p(x) || q(x)] = E_{p(x)}[ log( p(x) / q(x) ) ]    (4.30)

To model the true data distribution, the KL divergence between the true data distribution q(x) and the model distribution p_θ(x) should be minimized, where θ is the optimization parameter of the model, as described by (4.31):

D_KL[q(x) || p_θ(x)] = E_{q(x)}[ log q(x) − log p_θ(x) ] = −H[q(x)] − E_{q(x)}[ log p_θ(x) ]    (4.31)

where q(x) is the underlying and unchanging distribution of the dataset, and the entropy of q(x) is a constant; therefore

min_θ D_KL[q(x) || p_θ(x)] = max_θ E_{q(x)}[ log p_θ(x) ]    (4.32)

From (4.32), minimizing the KL divergence between the data distribution and the model distribution is equivalent to the maximum likelihood method. The VAE is a latent variable generative model that learns the distribution of the data space x ∈ X from a latent space z ∈ Z. We can define a prior over the latent space, p(z), which is usually a standard normal distribution, and model the data distribution with a complex conditional distribution p_θ(x|z), so the model data likelihood can be computed as described by (4.33):

p_θ(x) = ∫_z p_θ(x|z) p(z) dz    (4.33)

However, direct maximization of the likelihood is intractable because of the integration. The VAE instead optimizes a lower bound of p_θ(x), which can be derived using Jensen's inequality (4.34) (Jensen, 1906): if f is a convex function and X is a random variable, then

E[f(X)] ≥ f(E[X])    (4.34)

and equality holds only when X = E[X]. In our case, (4.34) and (4.33) can be combined as in (4.35)-(4.37):

log p_θ(x) = log ∫_z p_θ(x, z) dz    (4.35)

= log ∫_z q_φ(z|x) [ p_θ(x, z) / q_φ(z|x) ] dz    (4.36)

= log E_{q_φ(z|x)}[ p_θ(x, z) / q_φ(z|x) ] ≥ E_{q_φ(z|x)}[ log( p_θ(x, z) / q_φ(z|x) ) ]    (4.37)


The right-hand side of (4.37) is called the Evidence Lower Bound (ELBO), which is used frequently in variational inference. The term q_φ(z|x) is an approximate distribution of the true posterior p_θ(z|x) of the latent variable z given datapoint x; it is used to perform inference of the data in the first place. For example, given a raw datapoint x, it specifies how to infer its latent representation z, such as shape, size, or category. The posterior of the latent variables, p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x), is also intractable because p_θ(x) is intractable, so the VAE introduces a recognition model q_φ(z|x) to approximate the true posterior p_θ(z|x) by minimizing the KL divergence between them, as described by (4.38):

D_KL[q_φ(z|x) || p_θ(z|x)] = E_{q_φ(z|x)}[ log q_φ(z|x) − log p_θ(z|x) ]
= E_{q_φ(z|x)}[ log q_φ(z|x) − log p_θ(x|z) − log p(z) ] + log p_θ(x)
= −ELBO + log p_θ(x)    (4.38)

Taking log p_θ(x) out of the expectation (it does not depend on z) and rearranging (4.38) leads to (4.39):

ELBO = log p_θ(x) − D_KL[q_φ(z|x) || p_θ(z|x)]    (4.39)

Maximizing this objective simultaneously minimizes the KL divergence between q_φ(z|x) and p_θ(z|x) and maximizes log p_θ(x). The ELBO can be expanded as in (4.40)-(4.41):

ELBO = E_{q_φ(z|x)}[ log p_θ(x|z) + log p(z) − log q_φ(z|x) ]    (4.40)

= E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL[q_φ(z|x) || p(z)]    (4.41)

The first term on the right-hand side is the reconstruction error, which is the mean-square error for real-valued data or the cross-entropy for binary-valued data. The second term is the KL divergence between the approximate posterior and the prior of the latent variable z, which can be computed analytically. Two results can be concluded from (4.41):

1. The distribution of z given x can be computed using q_φ(z|x), and the distribution of x given z can be computed using p_θ(x|z). If both are implemented using neural networks, they are the encoder and the decoder of an autoencoder, respectively.
2. A VAE can generate new data, while conventional autoencoders fail at this. The deterministic implementation of the first term in (4.41) is the same as that of a conventional autoencoder, but in the second term the VAE forces the mapping from data to latent variables to be close to the prior, so whenever we sample a latent variable from the prior, the decoder knows what to generate. A conventional autoencoder, in contrast, distributes the latent variables arbitrarily, with many gaps between them, which may result in samples from the gaps that the encoder never intended.


The implementation of a VAE requires the following steps:

1. Compute the D_KL[q_φ(z|x) || p(z)] term. We assume the prior of z is a standard Gaussian, p(z) = N(0, I). This is suitable when implementing the VAE with neural networks, because the decoder network can transform the standard Gaussian distribution into the required distribution at some layer, regardless of the true prior. Therefore, the approximate posterior q_φ(z|x) also takes a Gaussian form, N(z; μ, σ²), and the parameters μ and σ are computed by the encoder. Then D_KL[q_φ(z|x) || p(z)] is computed with simple calculus:

D_KL[q_φ(z|x) || p(z)] = E_{q_φ(z|x)}[ log q_φ(z|x) − log p(z) ]
= ∫ N(z; μ, σ²) [ log N(z; μ, σ²) − log N(z; 0, I) ] dz
= (1/2) Σ_{j=1}^{J} ( −log (σ_j)² + (μ_j)² + (σ_j)² − 1 )    (4.42)

where j is the dimension index of the vector z, and μ_j and σ_j denote the jth elements of the mean and standard deviation of z, respectively.

2. The gradient of the ELBO with respect to θ is easy to compute, since the ELBO formula contains the encoder parameters φ and the decoder parameters θ, as described by (4.43):

∇_θ ELBO = ∇_θ E_{q_φ(z|x)}[ log p_θ(x|z) ] = E_{q_φ(z|x)}[ ∇_θ log p_θ(x|z) ]
≈ (1/L) Σ_{l=1}^{L} ∇_θ log p_θ(x|z_l)    (4.43)

The last line comes from Monte Carlo estimation, since z_l ∼ q_φ(z|x). However, the ELBO gradient with respect to φ needs special handling, because a common gradient estimator such as the score function estimator,

∇_φ E_{q_φ(z)}[ f(z) ] = E_{q_φ(z)}[ f(z) ∇_φ log q_φ(z) ] ≈ (1/L) Σ_{l=1}^{L} f(z) ∇_φ log q_φ(z_l)    (4.44)

is impractical due to its high variance. The VAE uses a "reparameterization trick" to derive an unbiased gradient estimator. Instead of sampling z ∼ q_φ(z|x) directly, it reparameterizes the random variable z ∼ q_φ(z|x) using a differentiable transformation g_φ(ε, x) with an auxiliary noise variable ε, as described in (4.45):

z̃ = g_φ(ε, x),  with ε ∼ p(ε)    (4.45)


In the univariate Gaussian case, z ∼ N(μ, σ²), we can sample ε ∼ N(0, 1) and use the transformation z = μ + σε. In this way, we can compute the gradient with respect to φ:

∇_φ ELBO = ∇_φ ( E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL[q_φ(z|x) || p(z)] )
≈ ∇_φ ( (1/L) Σ_{l=1}^{L} log p_θ(x|z^(l)) − D_KL[q_φ(z|x) || p(z)] )    (4.46)

where z_l = g_φ(x, ε_l) = μ + σ ⊙ ε_l and ε_l ∼ N(0, I). The implementation code of the VAE described above is listed in (Wu, 2019).
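Two short pieces make the estimator in (4.46) work: the analytic KL term of (4.42) and the reparameterization of (4.45). The NumPy sketch below shows only those two pieces; it is not the TensorFlow implementation referenced in (Wu, 2019), and the encoder outputs mu and log_var are stand-ins for values a real encoder network would produce.

```python
# The two ingredients of the VAE objective: the analytic KL term of Eq. (4.42)
# and the reparameterization trick of Eq. (4.45).  Illustrative sketch only;
# mu and log_var stand in for real encoder outputs.
import numpy as np

rng = np.random.default_rng(5)

def kl_to_standard_normal(mu, log_var):
    """Eq. (4.42): KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(-log_var + mu**2 + np.exp(log_var) - 1.0, axis=-1)

def reparameterize(mu, log_var):
    """Eq. (4.45): z = mu + sigma * eps with eps ~ N(0, I), so gradients can
    flow through mu and sigma instead of through the sampling step."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy usage: a batch of 4 datapoints with a 2-dimensional latent space.
mu = rng.normal(size=(4, 2))          # placeholder encoder mean
log_var = rng.normal(size=(4, 2))     # placeholder encoder log-variance
z = reparameterize(mu, log_var)       # latent samples to be fed to the decoder
print("KL per datapoint:", kl_to_standard_normal(mu, log_var))
```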

4.5 Random forest, classification and regression trees, and related approaches

The techniques in this section build on older methods (Breiman, Friedman, Stone, & Olshen, 1984). Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Duda, Hart, & Stork, 2000). The algorithm combines multiple decision trees to provide more accurate and stable predictions. The generalization error for forests converges to a limit as the number of trees in the forest becomes large, and it depends on the strength of the individual trees and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably (Schapire, 2013) to Adaboost (Freund & Schapire, 1997) but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are applicable to regression as well (Breiman, Friedman, Olshen, & Stone, 2017), and they continue to be popular due to their theoretical foundations, simplicity, and performance. As discussed in (Breiman et al., 1984) and elsewhere, these techniques are consistent with methods from Bayesian inference. While finding an optimal decision tree is NP-hard, an initial attempt at one can often be found very quickly. Classification and regression trees are an important set of techniques for analyzing large datasets. They tend to be fast and easily understood. However, they can morph into excessively complex trees, there can be issues when new data are introduced, and small changes in the data can produce significantly different results. Regardless, they have been, and remain, an important class of approaches. See (Loh, 2014) for a survey article; a shorter but still useful survey by the same author, with some comments about software tools, is also (Loh, 2014).


For an introduction to several related approaches, see (Duda et al., 2000), particularly the reviews of Bayesian and maximum-likelihood methods and nonparametric techniques. The foundation of Bayesian inference is Bayes' theorem: given events A and B, with P(B) ≠ 0, the conditional probability P(A|B) is given by:

P(A|B) = P(B|A) P(A) / P(B)    (4.47)

Usually one is looking at many A's and B's when using this method. It can be very useful for automated inference, but it is necessary to have a priori estimates of P(A), which are called Bayesian priors. One usually also makes assumptions about the underlying distributions. Optimality theorems exist if certain assumptions are satisfied (Brownlee, 2016; Donges, 2018; Loh, 2014). The random forest algorithm can be summarized by the following pseudo code (a runnable example follows it):

1. Define the problem type P_t.
2. Select an initial subset of features.
   2.1. Create a decision tree based on the selected subset of features.
   2.2. Update the decision tree.
   2.3. Increment the tree counter: n = n + 1.
3. Repeat step 2 for all trees in the random forest (i.e., n times).
4. If P_t == "Prediction" then go to step 4.1; else if P_t == "Classification" then go to step 5.
   P_t = "Prediction":
   4.1. For i = 1 to n: calculate the tree prediction y_i.
   4.2. Calculate the total output of the random forest: Y_t = Σ y_i / n.
   4.3. Return Y_t.
5. P_t = "Classification": for all input patterns:
   5.1. Read input pattern X_j, for j = 1 to the number of input patterns.
   5.2. For i = 1 to n (for each tree):
        5.2.1. Assign a label C_i to X_j using tree i.
   5.3. Find C_j, the most frequent C_i (majority vote).
   5.4. Assign X_j to class C_j.
   5.5. Repeat for the next input pattern.
6. Return the label vector C′ = [C_j].
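For most practical work, the pseudo code above maps directly onto an off-the-shelf implementation. The sketch below uses scikit-learn (the library discussed in (Malik, 2018)) on a synthetic dataset; the dataset, the number of trees, and the other parameter values are arbitrary choices, and this is not the repository code in (Al-jabery, 2019).

```python
# Random forest classification with scikit-learn on a synthetic dataset.
# Illustrative sketch; the dataset and hyperparameters are arbitrary choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a biomedical dataset: 500 samples, 20 features, 2 classes.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# n_estimators is the number of trees n in the pseudo code; each tree sees a
# bootstrap sample and a random subset of features at every split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)                 # majority vote across the trees
print("test accuracy:", accuracy_score(y_test, y_pred))
print("top feature importances:", forest.feature_importances_[:5])
```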


For random forest code in Python, see (Al-jabery, 2019). As with any algorithm, there are advantages and disadvantages to using it; the strengths and weaknesses of this class of algorithms are summarized below.

Strengths:
1. Unbiased: The algorithm has the power of mathematical democracy, since it relies on decisions made by multiple random decision trees and uses majority voting to determine the final output.
2. Multi-purpose: The random forest algorithm can be used for both prediction and classification, as discussed previously (Donges, 2018).
3. Stability: The appearance of a new data point in the dataset is unlikely to affect the entire algorithm; instead, it affects only one tree.
4. Overfitting immunity: If there are enough trees in the forest, the classifier will not overfit the model.
5. Robustness: The algorithm works well with categorical and numerical datasets, and it tolerates and handles missing values efficiently (Malik, 2018).

Weaknesses:
1. Complexity: Random forests require many more computational resources because of the large number of decision trees joined together. However, this is still less than the computational power required for a single unified decision tree that compensates for all of them.
2. Slow learning: Like many other algorithms, random forests suffer from the lengthy training time required to adapt to and learn the given patterns; however, this occurs only when a large number of decision trees is used. In general, these algorithms train quickly but create predictions slowly. In most real-world applications the random forest algorithm is fast enough, but there are certainly situations where run-time performance is important and other approaches would be preferred (Brownlee, 2016; Donges, 2018; Loh, 2014).

4.6 Summary

This chapter reviews some of the most popular algorithms in supervised learning. It discusses recurrent neural networks and their training algorithms, such as backpropagation and backpropagation through time, as well as Long Short-Term Memory cells, showing their architecture, concepts, and applications. Representative concepts, types, and applications of deep learning algorithms were explained, and two of the most popular network structures were discussed: convolutional neural networks and deep belief networks. The chapter also provides an overview of the random forest algorithm, detailing how it works, showing pseudo code, and listing the strengths and weaknesses of this class of algorithms.


References Al-jabery, K. (2019). ACIL group/Computational_Learning_Approaches_to_Data_Analytics_in_Biomedical_ Applications GitLab. Albelwi, S., & Mahmood, A. (2017). A framework for designing the architectures of deep Convolutional Neural Networks. Entropy, 19(6), 242. https://doi.org/10.3390/e19060242. Anderson, B. D. O., & Moore, J. B. (1979). Rcommended 2) optimal filtering. Dover Publications. https:// doi.org/10.1109/TSMC.1982.4308806. Arbib, M. A., Stephen, G., Hertz, J., Jeannerod, M., Jenkins, B. K., Kawato, M., et al. (2015). Hierarchical recurrent neural encoder for video representation with application to captioning. Compute, 0(1), 1029e1038. abs/1503 https://doi.org/10.1016/j.ins.2016.01.039. Bellman, R. (1954). The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6), 503e516. https://doi.org/10.1090/S0002-9904-1954-09848-8. Bengio, Y. (2009). Learning deep architectures for AI. In Foundations and Trends in machine learning (Vol. 2). https://doi.org/10.1561/2200000006. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. In Classification and regression trees. https://doi.org/10.1201/9781315139470. Breiman, L., Friedman, J., Stone, C., & Olshen, R. (1984). Classification and regression trees (wadsworth statistics/probability). New York: CRC Press. Brownlee, J. (2016). Master machine learning algorithms discover how they work and implement them from scratch. Machine Learning Mastery With Python. Brownlee, J. (2017). Machine learning mastery. Book. Chen, Y. N., Han, C. C., Wang, C. T., Jeng, B. S., & Fan, K. C. (2006). The application of a convolution neural network on face and license plate detection. In Proceedings - international conference on pattern recognition. https://doi.org/10.1109/ICPR.2006.1115. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Doersch, C. (2016). Tutorial on variational autoencoders. Donges, N. (2018). The random forest algorithm e towards data science. Retrieved May 29, 2019, from https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd. Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification (2nd ed.). Wiley. Feldkamp, L. A., Prokhorov, D. V., Eagen, C. F., & Yuan, F. (2011). Enhanced multi-stream kalman filter training for recurrent networks. In Nonlinear modeling. https://doi.org/10.1007/978-1-4615-5703-6_ 2. Feldkamp, L. A., & Puskorius, G. V. (1994). Training controllers for robustness: Multi-stream DEKF. Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), 4, 2377e2382. https://doi.org/10.1109/ICNN.1994.374591. Ferreira, A., & Giraldi, G. (2017). Convolutional Neural Network approaches to granite tiles classification. Expert Systems with Applications, 84, 1e11. https://doi.org/10.1016/j.eswa.2017.04.053. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119e139. https://doi.org/10. 1006/jcss.1997.1504. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193e202. https://doi.org/10. 1007/BF00344251.


Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEEINNS-ENNS international joint conference on neural networks. IJCNN 2000. Neural computing: New challenges and perspectives for the new millennium (pp. 189e194). https://doi.org/10.1109/ IJCNN.2000.861302. Goodfellow, I. (2015). Deep learning. In Nature methods (Vol. 13). https://doi.org/10.1038/nmeth.3707. Greff, K., Srivastava, R. K., Koutnik, J., Steunebrink, B. R., & Schmidhuber, J. (2017). Lstm: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222e2232. https:// doi.org/10.1109/TNNLS.2016.2582924. Haykin, S. (1991). Adaptive filter theory. Englewood Clilfs, NJ: Prentice Hall. Haykin, Simon (2001). In Simon Haykin (Ed.), Kalman filtering and neural networks (first ed.) https:// doi.org/10.1002/0471221546. Haykin, S. (2018). Neural networks and learning machines (3rd ed.). Pearson India. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771e1800. https://doi.org/10.1162/089976602760128018. Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1)Foundations. https:// doi.org/10.1146/annurev-psych-120710-100344. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1), 106e154. https://doi.org/10.1113/jphysiol. 1962.sp006837. Hu, X., Prokhorov, D. V., & Wunsch, D. C. (2007). Time series prediction with a weighted bidirectional multi-stream extended Kalman filter. Neurocomputing, 70(13e15), 2392e2399. https://doi.org/10. 1016/j.neucom.2005.12.135. Hu, X., Vian, J., Choi, J., Carlson, D., & Wunsch, D. C. (2002). Propulsion vibration analysis using neural network inverse modeling. In Proceedings of the 2002 international joint conference on neural networks. IJCNN’02 (cat. No.02CH37290) (pp. 2866e2871). https://doi.org/10.1109/IJCNN.2002.1007603. Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les ine´galite´s entre les valeurs moyennes. Acta Mathematica, 30, 175e193. https://doi.org/10.1007/BF02418571. Kan, C. E. (2018). What the heck are VAE-GANs?. Retrieved June 1, 2019, from Towards Data Science website: https://towardsdatascience.com/what-the-heck-are-vae-gans-17b86023588a. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. Kolen, J. F., & Kremer, S. C. (2001). A field guide to dynamical recurrent networks (1st ed.). Wiley-IEEE Press. Krig, S. (2014). Computer vision metrics: Survey, taxonomy, and analysis. In Computer vision metrics: Survey, taxonomy, and analysis. https://doi.org/10.1007/978-1-4302-5930-5. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79e86. https://doi.org/10.1214/aoms/1177729694. Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on machine learning - ICML ’07. https://doi.org/10.1145/1273496.1273556. LeCun, Y., & Bengio, Y. (1998). Convolution networks for images, speech, and time-series. Igarss 2014. https://doi.org/10.1007/s13398-014-0173-7.2. Loh, W. Y. (2014). Fifty years of classification and regression trees. International Statistical Review. 
https:// doi.org/10.1111/insr.12016. Malik, U. (2018). Random forest algorithm with Python and scikit-learn. Retrieved May 29, 2019, from https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/.


vs, I. (2018). Deep convolutional neural networks: Structure, feature extraction and training. Namate Information Technology and Management Science, 20(1), 40e47. https://doi.org/10.1515/itms-2017-0007. Puskorius, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman filter training of feedforward layered networks. IJCNN-91-Seattle International Joint Conference on Neural Networks, i, 771e777. https://doi.org/10.1109/IJCNN.1991.155276. Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279e297. https://doi. org/10.1109/72.279191. Puskorius, G. V., & Feldkamp, L. A. (1997). Multi-stream extended Kalman filter training for static and dynamic neural networks. In 1997 IEEE international conference on systems, man, and cybernetics. Computational cybernetics and simulation, 3, 2006e2011. https://doi.org/10.1109/ICSMC.1997.635150. Robinson, T., & Fallside, F. (1991). A recurrent error propagation network speech recognition system. Computer Speech & Language, 5(3), 259e274. https://doi.org/10.1016/0885-2308(91)90010-N. Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing. Cambridge, Mass: MIT Press. https://doi.org/10.1037//0021-9010.76.4.578. Schapire, R. E. (2013). The boosting approach to machine learning: An overview. https://doi.org/10.1007/ 978-0-387-21579-2_9. Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85e117. https://doi.org/10.1016/j.neunet.2014.09.003. Simard, P. Y., Steinkraus, D., & Platt, J. C. (2003). Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the international conference on document analysis and recognition. ICDAR. https://doi.org/10.1109/ICDAR.2003.1227801. Singhal, S., & Wu, L. (2003). Training feed-forward networks with the extended Kalman algorithm. https://doi.org/10.1109/icassp.1989.266646. Sukittanon, S., Surendran, A. C., Platt, J. C., Burges, C. J. C., & Look, B. (2004). Convolutional networks for speech detection. International Speech Communication Association (Interspeech). TensorFlow. (2015). TensorBoard: Visualizing learning. Retrieved June 1, 2019, from TensorFlow website: https://www.tensorflow.org/guide/summaries_and_tensorboard. Tivive, F. H. C., & Bouzerdoum, A. (2003). A new class of convolutional neural networks (SICoNNets) and their application of face detection. In Proceedings of the International Joint Conference on Neural Networks, 3, 2157e2162, IEEE. https://doi.org/10.1007/978-0-387-21579-2_9 Tivive, F. H. C., & Bouzerdoum, A. (2004). A new class of convolutional neural networks (SICoNNets) and their application of face detection. https://doi.org/10.1109/ijcnn.2003.1223742. Tschannen, M., Bachem, O., & Lucic, M. (2018). Recent advances in autoencoder-based representation learning. Retrieved from http://arxiv.org/abs/1812.05069. Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences (Harvard). Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550e1560. https://doi.org/10.1109/5.58337. Werbos, P. J. (1994). The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting. Wiley-Interscience. Wu, T. (2019). Variational autoencoder. Retrieved June 19, 2019, from GitHub website: https:// hustwutao.github.io/2019/06/19/variational-autoencoder/. 
Zheng, Y., Liu, Q., Chen, E., Ge, Y., & Zhao, J. L. (2014). Time series classification using multi-channels deep convolutional neural networks. In Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). https://doi.org/10.1007/978-3-319-08010-9_33.