The Self-Organizing Restricted Boltzmann Machine for Deep Representation with the Application on Classification Problems


Expert Systems With Applications 149 (2020) 113286


Saeed Pirmoradi a, Mohammad Teshnehlab b,*, Nosratollah Zarghami c, Arash Sharifi a

a Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
b Department of Control Engineering, K.N. Toosi University of Technology, Tehran, Iran
c Department of Biotechnology, Tabriz University of Medical Sciences, Tabriz, Iran

* Corresponding author at: K.N. Toosi University of Technology, Shariaty St., Tehran, Iran. E-mail addresses: [email protected] (S. Pirmoradi), [email protected] (M. Teshnehlab).

Article info

Article history: Received 15 September 2018; Revised 26 November 2019; Accepted 5 February 2020; Available online 13 February 2020.

Keywords: Deep learning; Self-organizing restricted Boltzmann machines; Separability-correlation measure; MNIST; Moore-set; Wisconsin breast cancer dataset.

Abstract

Recently, deep learning has been proliferating in the field of representation learning. A deep belief network (DBN) consists of a deep network architecture that can generate multiple features of input patterns, using restricted Boltzmann machines (RBMs) as the building block of the DBN. A deep learning model can achieve extremely high accuracy in many applications, but this accuracy depends on the model structure. However, specifying the various parameters of a deep network architecture, such as the number of hidden layers and neurons, is a difficult task even for expert designers. Besides, the number of hidden layers and neurons is typically set manually, which is costly in terms of time and computation, especially for big data. In this paper, we introduce an approach that determines the number of hidden layers and neurons of the deep network automatically during the learning process. To this end, the input vector is transformed from a feature space with a low dimension into a new feature space with a high dimension in the hidden layer of an RBM. Next, the new features are ranked according to their discrimination power between classes in the new space, using the separability-correlation measure for feature importance ranking. The algorithm uses the mean of the weights as a threshold, so the neurons whose weights exceed the threshold are retained, and the others are removed from the hidden layer. The number of retained neurons is presented as a reasonable number of neurons. The number of layers in the deep model is determined using the validation data. The proposed approach also acts as a regularization method, since the neurons whose weights are lower than the threshold are removed; thus, the RBM learns to copy its input only approximately. It also prevents over-fitting through a suitable number of hidden layers and neurons. Eventually, the DBN can determine its structure according to the input data and becomes a self-organizing model. The experimental results on benchmark datasets confirm the effectiveness of the proposed method.

1. Introduction

Deep learning has a significant effect on our lives, since it plays a remarkable role in many applications, such as cancer diagnosis, self-driving cars, and speech recognition. Deep learning research was started by Geoff Hinton's group in 2006 (G. E. Hinton & Salakhutdinov, 2006). They initially focused on stacking unsupervised representation learning algorithms to create deeper representations. Deep network architectures utilize deep learning to represent multiple features of input patterns (Yoshua Bengio, 2009). Deep learning first constructs pre-trained hierarchical blocks and then adds a fine-tuning block to achieve high classification capability.


In the deep network architecture, the lower layers generate abstract features, while the higher layers generate concrete features. The deep belief network (DBN) is a well-known method in deep learning. A DBN contains two or more restricted Boltzmann machines (RBMs); in other words, a DBN is a simple hierarchical network of RBMs that can achieve high classification capability by extracting features. The RBM consists of a visible layer and a hidden layer of neurons. The hidden neurons represent a feature vector for the input patterns. The classification accuracy of deep learning models, and especially of DBNs, depends on the network structure of the deep model. The number of hidden neurons in the RBM and the number of hidden layers in the DBN play a significant role in the classification accuracy. Thus, characterizing the deep model structure, including the number of layers and neurons, is very important; however, it is challenging to determine which structure, i.e., how many layers and how many neurons in each layer, is suitable for a specific task


(Guo et al., 2016), since trial and error for discovering a reasonable (or optimal) network structure for given input data is costly in terms of time and computation. Moreover, the deep model needs an extensive network structure to achieve high classification capability, but an extensive network may cause over-fitting, so the designer needs to balance the network structure against the classification accuracy.

Some methods have been proposed to determine the deep network architecture of RBMs and DBNs. Adaptive structure learning can be used, since it can find a suitable architecture during training; however, such adaptive structure learning methods have not been utilized for RBMs. Adaptive structure learning comes in three types: 'pruning neurons' (Islam et al., 2009; Zeng & Yeung, 2006), 'architecture determination by other clustering methods' (Bruzzone & Prieto, 1999), and 'neuron generation by monitoring the network structure' (Ichimura, Tazaki, & Yoshida, 1995). The first and second types are used in many applications of adaptive structure learning. Recently, the iRBM (Kristiansen & Gonzalvo, 2017) was introduced using a random variable; however, it cannot be considered an adaptive structure learning method, because it selects one structure from many candidate structures comprising many RBMs with different numbers of neurons, and such methods are costly in terms of time and computation, especially for big data. In the third type, the authors introduced a useful algorithm that can add new neurons to an initially small hidden layer by monitoring the training error between the output and the target during the training process (Fahlman & Lebiere, 1990). The neuron generation-annihilation algorithm (Ichimura, Oeda, Suka, Hara, et al., 2005; Ichimura, Oeda, Suka, & Yoshida, 2005; Ichimura & Yoshida, 2004) can also add a new neuron based on the variance of the weight vector and the error, and it can remove redundant neurons after neuron generation. All methods based on the error cannot be used for RBMs, because the RBM is an unsupervised deep learning algorithm.

In other studies, Hinton offered a practical guide to training RBMs and discussed how to set parameters such as the learning rate, the regularization coefficient, and the number of hidden units, as well as the initial values of the weights and biases and different types of units (G. Hinton, 2010). He used an estimate of the number of bits it would take to describe each data vector, together with the number of training cases, to determine the number of parameters of the deep model. In another study (G. E. Hinton & Salakhutdinov, 2012), the authors proposed a better way to pre-train deep Boltzmann machines and evaluated the proposed algorithm on benchmark datasets, but the structure of the deep model was determined by random search, so they gave no guidance about the number of hidden units and hidden layers, like many other studies (Abdel-Zaher & Eldeib, 2016; G. E. Hinton, 2007; G. E. Hinton & Salakhutdinov, 2012; Keyvanrad & Homayounpour, 2014, 2015; Salakhutdinov & Hinton, 2009; Tang, Salakhutdinov, & Hinton, 2012). In (Goodfellow, Bengio, Courville, & Bengio, 2016), the authors also recommended a practical design process. They proposed two basic approaches to selecting the hyper-parameters (including the number of layers and neurons): choosing them manually and choosing them automatically. The manual method requires understanding exactly what the hyper-parameters do in the model.
Automatic selection (including grid search and random search) removes the need for this understanding, but it greatly increases the time and computational cost. In a recent study (Côté & Larochelle, 2016), the authors introduced the ordered RBM (oRBM), an RBM that is extended to be sensitive to the ordering of its hidden units. In the oRBM, a penalty term is added to the energy function as a form of regularization, which forces the RBM to avoid utilizing more hidden units than needed. But the principal question remains: how many layers (hidden layers) and how many neurons (hidden units) are appropriate for a specific task?

In this research, we develop a new type of RBM with a self-organizing structure, which can be used to construct a DBN. The self-organizing RBM can build its structure automatically according to the input data. In brief, the input vector is transformed from a feature space with a low dimension into a new feature space with a high dimension in the hidden layer of the RBM. This transformation offers a convenient way to solve the classification problem: it can change a complex, nonlinear classification problem into a simpler, linear one. Increasing the dimension helps the classifier make the best decision about the classes in the new feature space. Next, each new feature is ranked according to its discrimination power between the classes. Finally, a threshold is determined, and each new feature whose rank is less than the threshold is removed from the new features, while the others are retained in the hidden layer. The number of retained neurons is presented as a reasonable number of neurons. The number of layers in the deep model is also determined, using the validation data.

The proposed algorithm can determine the number of neurons in the RBM and the number of layers in the DBN, which decreases the time spent and the computational burden and removes the need to try various structures with different numbers of neurons and layers in the learning procedure. It also prevents over-fitting with a reasonable number of neurons and layers, since we do not use the full capacity of the deep model to solve the classification problem but assign an appropriate size to the deep model according to the complexity of the classification. Moreover, the proposed approach acts as a regularization method, since the neurons whose ranks are lower than the threshold are removed. To verify the proposed method, we apply it to classification problems on well-known datasets, and the results of the experiments show that the new approach succeeds on these classification problems with high accuracy.

This paper is structured as follows. Section 2 provides the background theory, including linear algebra, the restricted Boltzmann machine, and the separability-correlation measure for feature importance ranking. Section 3 describes the proposed method. Section 4 introduces the well-known datasets used. In Section 5, we present our experimental results, where we apply the proposed method to classification problems in different domains and compare it with previous works.

2. Background theory

2.1. Linear transformation

The linear transformation is one of the most important subjects in linear algebra. It is defined by the multiplication of a matrix with a vector (Eq. 1). When a matrix A_{d' \times d} multiplies a vector X_{d \times 1}, it transforms X to another vector Y (Strang, 2016).

Y_{d' \times 1} = A_{d' \times d} X_{d \times 1}    (1)

X and Y are vectors in the vector spaces V = R^d and W = R^{d'}; V and W are called the domain and the codomain, respectively (Fig. 1a). The formula T(X) = Y denotes the linear transformation and follows the same idea as a function: it maps X from a vector space with dimension d into a new vector space with dimension d'. The transformation falls into three categories based on the relation between d and d'. When d = d', a rotation, shrinkage, or expansion happens to the vector X in the new vector space. When d > d', the vector X is transformed from a large vector space into a small one. In this category, V contains a null space, according to Eq. 2. The null space is a subspace of V such that the vectors in the null space (or kernel) are mapped to the zero point of the new vector space (Fig. 1b) (Larson, 2012).


Fig. 1. Linear transformation: (a) domain and codomain; (b) kernel (null space) and range (column space) (Larson, 2012).

\dim(V) = \dim(\mathrm{Nullspace}(A)) + \dim(\mathrm{Columnspace}(A))    (2)

In Eq. 2, dim(V) and dim(Columnspace(A)) are equal to d and to the rank of the matrix A (Rank(A) ≤ d'), respectively. As a result, dim(Nullspace(A)) will be d − d' when A is full rank. When d < d', the vector X is transformed from a small vector space into a large one. If the matrix A is full rank (Rank(A) = d), then there is no null space in V, and we do not lose any information in this transformation. When the vector is transformed from a small space into a large one, its dimensional nature does not change: it is transformed from the d-dimensional space into a subspace of the same dimension inside the new space with dimension d'.
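As a small numerical illustration of Eqs. 1 and 2 (our own sketch, not part of the original paper; the sizes d = 5 and d' = 3 are arbitrary), the following Python/NumPy snippet applies a transformation matrix and checks the rank and the null-space dimension:

import numpy as np

rng = np.random.default_rng(0)
d, d_prime = 5, 3                          # domain dimension d and codomain dimension d'
A = rng.standard_normal((d_prime, d))      # A maps R^d into R^d' (Eq. 1: Y = A X)

x = rng.standard_normal((d, 1))
y = A @ x                                  # transformed vector Y, shape (d', 1)

rank = np.linalg.matrix_rank(A)            # dimension of the column space of A
null_dim = d - rank                        # Eq. 2: dim(V) = dim(null space) + dim(column space)
print(y.shape, rank, null_dim)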

2.2. Separability-correlation measure for feature importance ranking

Feature selection plays a critical role in classification problems. Feature selection algorithms identify significant features, i.e., features according to which the distances between the classes are large. Finally, these algorithms identify a subset of features that maximizes the separability between the classes. The probability of correct classification is high when the right features are selected. We utilize the Separability-Correlation Measure (SCM) (Fu & Wang, 2003; Wang & Fu, 2006) to evaluate the importance of features. In this method, we assign to each feature a rank that is calculated by Eq. 3.

R_k = \beta \bar{S}_k + (1 - \beta) \bar{C}_k    (3)

Eq. 3 consists of two parts. Part one, \bar{S}_k, is called the class-separability measure, and part two, \bar{C}_k, is called the attribute-class correlation measure. \beta (0 \le \beta \le 1) is a user-defined parameter that weights the two terms in the final measure. These parts are discussed below.

2.2.1. Part 1: class-separability measure

The class-separability measure is calculated from two parameters, S_w and S_b. S_w is the intraclass distance (the distance of patterns within a class), and S_b is the interclass distance (the distance between patterns of different classes). The calculation of S_w and S_b is shown in Eqs. 4 and 5.

S_w = \sum_{c=1}^{C} P_c \sum_{j=1}^{n_c} \left[ (\bar{X}_{cj} - \bar{m}_c)(\bar{X}_{cj} - \bar{m}_c)^T \right]^{1/2}    (4)

S_b = \sum_{c=1}^{C} P_c \left[ (\bar{m}_c - \bar{m})(\bar{m}_c - \bar{m})^T \right]^{1/2}    (5)

Here C is the number of classes, n_c is the number of patterns in the c-th class, and P_c is the prior probability of the c-th class. \bar{X}_{cj} is the normalized data vector; X_{cj}(i), the i-th attribute, is normalized as in Eq. 6.

\bar{X}_{cj}(i) = \frac{X_{cj}(i)}{\max(X_i) - \min(X_i)}    (6)

where \max(X_i) and \min(X_i) are the maximum and the minimum of the i-th attribute in the dataset, i = 1, 2, ..., n, and n is the number of attributes. \bar{m}_c and \bar{m} are the mean of the c-th class and the mean of all patterns in the data set, respectively (as shown in Eqs. 7 and 8).

\bar{m}_c = \frac{\sum_{j=1}^{n_c} \bar{X}_{cj}}{n_c}    (7)

\bar{m} = \frac{\sum_{c=1}^{C} \sum_{j=1}^{n_c} \bar{X}_{cj}}{N}    (8)

In Eq. 8, N is the total number of patterns in the data set. Better separability of the data set is achieved when S_b is greater and S_w is smaller. Therefore, the ratio of S_w to S_b can be used to measure the discrimination between the classes. Each attribute is ranked by calculating the intraclass-to-interclass distance ratio when that attribute is omitted in turn (as shown in Eq. 9).

S_k = \frac{S_w^k}{S_b^k}    (9)

k = 1, 2, ..., n. S_w^k and S_b^k are the intraclass and interclass distances when the k-th attribute is omitted from each pattern. S_k is normalized by Eq. 10.

\bar{S}_k = \frac{S_k - \min(S_k)}{\max(S_k) - \min(S_k)}    (10)

2.2.2. Part 2: attribute-class correlation measure

Part two measures the correlation between the changes in attributes and the corresponding changes in class labels. Consider two different patterns: if variations of some attributes in the two patterns cause their class labels to differ, those attributes can be important factors for the variation of the class labels and should be weighted positively; if the class labels are the same, the variations of those attributes are irrelevant for deciding the class labels and should be weighted negatively. The correlation measure is useful when combined with the class-separability measure (part 1). It is calculated by Eq. 11.

C_k = \sum_{i \ne j} \left| X_{ki} - X_{kj} \right| \, \mathrm{magn}(y_i - y_j)    (11)


Fig. 2. Left: Boltzmann machine. Right: Restricted Boltzmann machine.

Where X_{ki} and X_{kj} are the k-th attributes of the i-th and the j-th patterns, respectively. For any y, magn(y) = 1 if |y| > 0 and magn(y) = −0.05 if |y| = 0. C_k is normalized by Eq. 12.

\bar{C}_k = \frac{C_k - \min(C_k)}{\max(C_k) - \min(C_k)}    (12)

A greater \bar{C}_k indicates a closer correlation between the class labels and the k-th attribute, which in turn indicates the greater importance of attribute k in classification problems.
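To make the SCM ranking concrete, the following Python/NumPy sketch follows Eqs. 3–12 directly. It is our own illustrative implementation (the function name scm_ranks, the integer-label assumption, and the small numerical safeguards are not from the paper), and efficiency is not a goal.

import numpy as np

def magn(v):
    # As defined after Eq. 11: magn(y) = 1 if |y| > 0, and -0.05 if |y| = 0.
    return np.where(np.abs(v) > 0, 1.0, -0.05)

def scm_ranks(X, y, beta=0.5):
    """Return the rank R_k (Eq. 3) of every attribute of X (N x n) given integer class labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)   # Eq. 6
    N, n = X.shape
    classes = np.unique(y)
    S = np.empty(n)
    for k in range(n):                                       # omit attribute k in turn (Eq. 9)
        Xk = np.delete(X, k, axis=1)
        m_all = Xk.mean(axis=0)                              # Eq. 8
        Sw = Sb = 0.0
        for c in classes:
            Xc = Xk[y == c]
            Pc = len(Xc) / N                                 # prior probability of class c
            mc = Xc.mean(axis=0)                             # Eq. 7
            Sw += Pc * np.linalg.norm(Xc - mc, axis=1).sum() # Eq. 4: intraclass distances
            Sb += Pc * np.linalg.norm(mc - m_all)            # Eq. 5: interclass distance
        S[k] = Sw / (Sb + 1e-12)
    S = (S - S.min()) / (S.max() - S.min() + 1e-12)          # Eq. 10
    C = np.empty(n)
    for k in range(n):                                       # Eq. 11: attribute-class correlation
        xk = X[:, k]
        contrib = np.abs(xk[:, None] - xk[None, :]) * magn(y[:, None] - y[None, :])
        np.fill_diagonal(contrib, 0.0)                       # sum over i != j only
        C[k] = contrib.sum()
    C = (C - C.min()) / (C.max() - C.min() + 1e-12)          # Eq. 12
    return beta * S + (1.0 - beta) * C                       # Eq. 3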

2.3. Restricted Boltzmann machines (RBMs)

2.3.1. RBMs introduction

Probabilistic graphical models (PGMs) are among the most important models in machine learning. PGMs contain two subtypes, namely directed and undirected graphical models. In undirected graphical models, also called Markov random fields (MRFs), the joint probability p(x, h) is determined through a product of unnormalized non-negative maximal clique potentials (Y. Bengio, Courville, & Vincent, 2013).

p(x, h) = \frac{1}{Z_\theta} \prod_i \psi_i(x) \prod_j \eta_j(h) \prod_k \upsilon_k(x, h)    (13)

In Eq. 13, \psi_i(x), \eta_j(h), and \upsilon_k(x, h) are the clique potentials. These terms describe the interactions between the visible elements, between the hidden variables, and between the visible and hidden variables, respectively. The distribution is normalized by the partition function Z_\theta. The Boltzmann distribution is a particular form of MRF with clique potentials constrained to be positive.

p(x, h) = \frac{1}{Z_\theta} \exp(-\varepsilon_\theta(x, h))    (14)

In Eq. 14, \varepsilon_\theta(x, h) is the energy function. The interactions are defined by the energy function and are characterized by the model parameters \theta. The Boltzmann machine (BM) is a network of binary random variables or units that are connected symmetrically, as shown in Fig. 2. These stochastic units fall into two groups: first, the visible units x \in \{0, 1\}^{d_x} that represent the data, and second, the hidden units h \in \{0, 1\}^{d_h} that mediate dependencies between the visible units through their mutual interactions. The probability distribution over (x, h) is approximated via the Boltzmann distribution with the Boltzmann machine energy function.

\varepsilon_\theta^{BM} = -\frac{1}{2} x^T U x - \frac{1}{2} h^T V h - x^T W h - b^T x - c^T h    (15)

In Eq. 15, the visible-to-visible interactions, the hidden-to-hidden interactions, the visible-to-hidden interactions, the visible self-connections, and the hidden self-connections are defined by the model parameters \theta = \{U, V, W, b, c\}, respectively. Z_\theta is calculated in Eq. 16.

Z_\theta = \sum_{x_1=0}^{1} \cdots \sum_{x_{d_x}=0}^{1} \sum_{h_1=0}^{1} \cdots \sum_{h_{d_h}=0}^{1} \exp\left(-\varepsilon_\theta^{BM}(x, h)\right)    (16)

This joint probability distribution gives rise to the two conditional probability distributions shown in Eqs. 17 and 18.

P(h_i \mid x, h_{\setminus i}) = \mathrm{sigmoid}\left( \sum_j W_{ji} x_j + \sum_{i' \ne i} V_{ii'} h_{i'} + c_i \right)    (17)

P(x_j \mid h, x_{\setminus j}) = \mathrm{sigmoid}\left( \sum_i W_{ji} h_i + \sum_{j' \ne j} U_{jj'} x_{j'} + b_j \right)    (18)

Inference in BMs is intractable, since computing the conditional probability is very difficult; for example, P(h_i \mid x, h_{\setminus i}) requires marginalizing over the rest of the hidden units. The restricted Boltzmann machine (RBM) was introduced to address this problem. The RBM is a subclass of the BM and is defined by restricting the interactions in the Boltzmann energy function, as shown in Eq. 19.

\varepsilon_\theta^{RBM} = -x^T W h - b^T x - c^T h    (19)

In Eq. 19, U and V are set to zero, since there are no interactions between visible units or between hidden variables in RBMs, as shown in Fig. 2. With this restriction, the conditional probability distribution over the hidden variables given the visible units and over the visible units given the hidden variables factorizes as in Eqs. 20–23; hence inference in RBMs is tractable.

P(h \mid x) = \prod_i P(h_i \mid x)    (20)

P(h_i = 1 \mid x) = \mathrm{sigmoid}\left( \sum_j W_{ji} x_j + c_i \right)    (21)

P(x \mid h) = \prod_j P(x_j \mid h)    (22)

P(x_j = 1 \mid h) = \mathrm{sigmoid}\left( \sum_i W_{ji} h_i + b_j \right)    (23)
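The factorized conditionals in Eqs. 20–23 translate directly into code. The sketch below is our own illustration (the sizes d_x = 6 and d_h = 4 and the zero biases are arbitrary choices); it samples h from P(h | x) and then x from P(x | h), which is one step of the alternating Gibbs sampling used later in Eq. 27.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_x, d_h = 6, 4
W = 0.1 * rng.standard_normal((d_x, d_h))      # W[j, i] couples visible unit j and hidden unit i
b = np.zeros(d_x)                              # visible biases
c = np.zeros(d_h)                              # hidden biases

x = rng.integers(0, 2, size=d_x).astype(float)    # a binary visible vector

p_h = sigmoid(x @ W + c)                       # Eq. 21: P(h_i = 1 | x)
h = (rng.random(d_h) < p_h).astype(float)      # sample h ~ P(h | x) = prod_i P(h_i | x), Eq. 20

p_x = sigmoid(h @ W.T + b)                     # Eq. 23: P(x_j = 1 | h)
x_new = (rng.random(d_x) < p_x).astype(float)  # sample x ~ P(x | h) = prod_j P(x_j | h), Eq. 22

print(p_h, h, p_x, x_new, sep="\n")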

2.3.2. RBMs parameters estimation

In the training procedure, the parameters of probabilistic models are typically adjusted to maximize the likelihood or the log-likelihood of the training data. With T training examples, the log-likelihood is calculated as in Eq. 24.

\sum_{t=1}^{T} \log P(x^t; \theta) = \sum_{t=1}^{T} \log \sum_{h} P(x^t, h; \theta)    (24)


Gradient-based optimization algorithms need the gradient of the log-likelihood; for RBMs, it is calculated according to Eq. 25.

\frac{\partial}{\partial \theta_i} \sum_{t=1}^{T} \log P(x^t; \theta) = - \sum_{t=1}^{T} E_{P(h|x^t)}\!\left[ \frac{\partial}{\partial \theta_i} \varepsilon_\theta^{RBM}(x^t, h) \right] + \sum_{t=1}^{T} E_{P(x,h)}\!\left[ \frac{\partial}{\partial \theta_i} \varepsilon_\theta^{RBM}(x, h) \right]    (25)

In Eq. 25, there are two types of expectations: first, the expectation with respect to P(h|x^t), called the positive phase or the expectation with respect to the completed data distribution, and second, the expectation with respect to P(x, h), called the negative phase or the expectation with respect to the distribution defined by the model. We will sometimes abbreviate E_{P(h|x^t)}[·] as the data-dependent expectation and E_{P(x,h)}[·] as the model's expectation (Salakhutdinov & Hinton, 2009). The data-dependent expectation is tractable, while the model's expectation is more problematic, since its computation over the joint probability distribution is not tractable. The model's expectation is approximated by using the Monte Carlo approximation in Eq. 26.



E_{P(x,h)}\!\left[ \frac{\partial}{\partial \theta_i} \varepsilon_\theta^{RBM}(x, h) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \frac{\partial}{\partial \theta_i} \varepsilon_\theta^{RBM}(x^l, h^l)    (26)

With the samples drawn by Gibbs sampling (or Markov Chain Monte Carlo (MCMC)) as shown in Eq. 27.







x^l \sim P(x \mid h^{l-1}), \qquad h^l \sim P(h \mid x^l)    (27)

The representation can be learned better by RBMs if more steps (infinitely many steps) of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, also called the model's expectation.

2.3.3. Contrastive divergence

Contrastive divergence (CD) estimates the model's expectation. The CD algorithm initializes the Gibbs chain at the training data used in the data-dependent expectation and runs only a very short Gibbs chain (often just one step). Learning using n full steps of alternating Gibbs sampling is denoted CD_n. Other algorithms extend the CD algorithm, such as persistent contrastive divergence (PCD), persistent contrastive divergence with partial smoothing (PCDPS), fast persistent contrastive divergence, and free energy in persistent contrastive divergence (FEPCD), all of which try to approximate Gibbs sampling with high accuracy (Keyvanrad & Homayounpour, 2014, 2015). The derivative of the log probability of a training vector with respect to a weight and the learning rule for updating the weight are shown in Eqs. 28 and 29, respectively.

\frac{\partial \log P(x)}{\partial W_{ij}} = \langle x_i h_j \rangle_{data} - \langle x_i h_j \rangle_{model}    (28)

\Delta W_{ij} = \eta \left( \langle x_i h_j \rangle_{data} - \langle x_i h_j \rangle_{model} \right)    (29)

In the CD algorithm, we carry out two actions: first, updating the hidden states and, second, updating the visible states, assuming all units are binary. In the first step, P_j = P(h_j = 1 | x) is the probability of turning on the j-th hidden unit, computed using the sigmoid function. Then a random number uniformly distributed between 0 and 1 is drawn, and the hidden unit turns on if this probability is greater than the random number. Only the final update of the hidden units should use the probability instead of a stochastic binary state, since nothing depends on which state is chosen, and this avoids unnecessary sampling noise (G. Hinton, 2010). In the second step, the visible units, when generating a reconstruction, are updated by stochastically picking a 1 or 0 with probability P_i = P(x_i = 1 | h). However, it is common to use the probability instead of sampling a binary value. In standard RBMs, the visible units use stochastic binary values, whereas the data is real-valued in many practical applications; in the updating steps, the visible units then use real-valued probabilities instead of stochastic binary values, as discussed above. There are two ways to collect the statistics for the connection between the visible and hidden units.

\langle P_i h_j \rangle_{data} \quad \text{or} \quad \langle P_i P_j \rangle_{data}    (30)

In Eq. 30, P_j = P(h_j = 1 | x) is a probability, and h_j is a binary state that takes the value 1 with probability P_j.

3. The proposed method for designing a reasonable structure

In each hidden layer of the RBM, the input is mapped to a new space. This mapping, in other words the 'representation' in each layer, is successful if it yields better discrimination in the new space. The principle of the proposed algorithm is to detect the best discrimination between the classes in the new spaces. The proposed method has two phases for designing a reasonable structure of the deep representation. In the first phase, the proposed algorithm estimates the appropriate number of hidden neurons in the hidden layer, and in the second phase, it estimates the suitable number of hidden layers in the deep representation.
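The RBMs trained in both phases of the proposed method use contrastive divergence, so we first sketch one CD-1 weight update implementing Eqs. 28–30. This is our own minimal Python/NumPy illustration (the function name cd1_update, the learning rate, and the batch handling are illustrative choices), and real-valued probabilities are used where Section 2.3.3 recommends them.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, b, c, x_batch, lr=0.05, rng=None):
    """One CD-1 step on a batch of visible vectors (rows of x_batch)."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: data-dependent statistics, <P_i h_j>_data style (Eq. 30).
    p_h = sigmoid(x_batch @ W + c)                       # Eq. 21
    h = (rng.random(p_h.shape) < p_h).astype(float)      # stochastic binary hidden states
    # Negative phase: one reconstruction step (a very short Gibbs chain).
    p_x = sigmoid(h @ W.T + b)                           # Eq. 23, real-valued reconstruction
    p_h_recon = sigmoid(p_x @ W + c)                     # final hidden update uses probabilities
    # Eq. 29: Delta W = eta (<x h>_data - <x h>_model), averaged over the batch.
    m = x_batch.shape[0]
    dW = (x_batch.T @ p_h - p_x.T @ p_h_recon) / m
    db = (x_batch - p_x).mean(axis=0)
    dc = (p_h - p_h_recon).mean(axis=0)
    return W + lr * dW, b + lr * db, c + lr * dc

# Illustrative usage on random binary data.
rng = np.random.default_rng(1)
d_x, d_h, m = 20, 30, 16
W = 0.01 * rng.standard_normal((d_x, d_h))
b, c = np.zeros(d_x), np.zeros(d_h)
X = (rng.random((m, d_x)) < 0.3).astype(float)
for _ in range(10):
    W, b, c = cd1_update(W, b, c, X, rng=rng)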

3.1. Estimation of the reasonable number of neurons in the hidden layer

3.1.1. Theoretical perspective

The RBM is one of the most popular models among deep learning algorithms and can be considered from another perspective: the RBM is a particular form of auto-encoder (AE), which is constructed from an encoder layer and a decoder layer. In the AE a deterministic mapping occurs, whereas in the RBM a stochastic mapping happens. Internally, the AE and the RBM have a hidden layer h, also called the encoder layer, that describes a code used to represent the input, given by h = f(x) (a deterministic function) and P_encoder(h|x) (a stochastic mapping), respectively. Both of them have a decoder layer that produces a reconstruction, r = g(h) (a deterministic function) and P_decoder(x|h) (a stochastic mapping), respectively (Goodfellow et al., 2016). This procedure is shown in Fig. 3 for both of them. First, we assume that the encoder and decoder layers are linear; thus, a linear transformation happens in both layers (Fig. 4). In the auto-encoder (specifically in the RBM), T1 transforms the input vector to the representation vector and T2 transforms the representation vector to the reconstruction vector (Eqs. 31 and 32). The dimensions of the input (X), representation (Y), and reconstruction (Z) vector spaces are d, n, and d, respectively.

W_{n \times d} \, X_{d \times 1} = Y_{n \times 1}    (31)

W'_{d \times n} \, Y_{n \times 1} = Z_{d \times 1}    (32)
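A quick shape check of the two linear maps in Eqs. 31 and 32 (our own example; n = 7 and d = 4 are arbitrary, and the weights are random rather than learned):

import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 7                               # input dimension d and representation dimension n > d
W = rng.standard_normal((n, d))           # encoder weights of Eq. 31
W_prime = rng.standard_normal((d, n))     # decoder weights of Eq. 32

x = rng.standard_normal((d, 1))
y = W @ x                                 # representation Y, shape (n, 1)
z = W_prime @ y                           # reconstruction Z, shape (d, 1)
print(y.shape, z.shape, np.linalg.matrix_rank(W))   # rank(W) = d, so there is no null space in X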

Fig. 3. Left: AE with deterministic mapping. Right: RBM with stochastic mapping.

Fig. 4. T1 and T2 linear transformations in the auto-encoder.

In the proposed method, we assume n > d; thus, each point of the X vector space (the feature space), with a low dimension, is mapped into the Y vector space (the new feature space), with a high dimension. W and W' are full-rank matrices, since they are determined by the backpropagation algorithm and the transformation in the auto-encoder is one-to-one. As a result, there is no null space in the X vector space, so all vectors in X are transformed into the column space of the new vector space Y. The columns of W are a basis for the new vector space; thus, every vector in Y is defined by a combination of these columns and the input-vector elements (Eqs. 33 and 34).

W_{n \times d} \, X_{d \times 1} = Y_{n \times 1}    (33)

\begin{bmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{nd} \end{bmatrix}_{n \times d} \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}_{d \times 1} = x_1 \begin{bmatrix} w_{11} \\ \vdots \\ w_{n1} \end{bmatrix} + x_2 \begin{bmatrix} w_{12} \\ \vdots \\ w_{n2} \end{bmatrix} + \cdots + x_d \begin{bmatrix} w_{1d} \\ \vdots \\ w_{nd} \end{bmatrix}    (34)

The new vector (or the new feature) Y is calculated by Eq. 35. Each element of Y is a linear combination of X elements. These combinations are unique since the linear transformation is one to one.

\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}_{n \times 1} = \begin{bmatrix} w_{11} x_1 + w_{12} x_2 + \cdots + w_{1d} x_d \\ \vdots \\ w_{n1} x_1 + w_{n2} x_2 + \cdots + w_{nd} x_d \end{bmatrix}_{n \times 1}    (35)

Now we detect which elements of the new vector (the new features) Y, i.e., which of the linear combinations in Eq. 35, are significant; in other words, which of them give more information about the discrimination between the classes. The important new features are selected, and the other, unimportant features are removed, since they do not help make the best decision on classification. The number of remaining features determines the number of dimensions of the vector space, i.e., the number of nodes in the encoder layer. We can also apply the proposed method to nonlinear mappings. In a nonlinear transformation, when the vector is transformed from a small space into a large one, its dimensional nature does not change, but it is transformed from the d-dimensional space into a subspace of the same dimension inside the n-dimensional new space. This subspace is a nonlinear hyper-plane in the nonlinear transformation (Stamkopoulos, Diamantaras, Maglaveras, & Strintzis, 1998), but it is difficult to determine exactly what happens in the high-dimensional subspace.

3.1.2. Description of the algorithm

In the first phase, the proposed algorithm determines the number of neurons in the hidden layer. To achieve this goal, it consists of two steps. In the first step, the primary number of nodes in the hidden layer is determined according to the number of inputs (Eq. 36).

n_{primary} = \alpha \times d_{input}    (36)

In Eq. 36, the primary number of neurons (n_primary) is proportional to the input dimension; it is equal to the product of the input dimension (d_input) and a user-defined coefficient α (α > 1), so that n_primary > d_input. The RBM, with the primary number of neurons, is trained using the training data. In the second step, the new features, which are obtained from the nonlinear mapping in the hidden layer, are evaluated. This evaluation is carried out using the SCM algorithm, which estimates the number of neurons by calculating a weight for each of the variables; the weights indicate which of the neurons (or new features) are valuable. The algorithm uses the mean of the weights as a threshold, so the neurons whose weights exceed the threshold are retained, and the others are removed from the hidden layer (as shown in Fig. 5). The number of retained nodes is presented as a reasonable number of neurons. Besides, this approach acts as a regularization method, because the nodes whose weights are lower than the threshold are removed; thus, the RBM learns to copy its input only approximately. The RBM with a reasonable structure is then ready to use. Finally, this algorithm is applied to all of the RBMs that are used for designing a DBN. Table 1 shows the whole process in this phase.

Fig. 5. The self-organizing RBM.

Table 1. Pseudo-code of estimation of the reasonable number of neurons in the hidden layer.
Algorithm: estimation of the reasonable number of neurons
Input: for each training instance, a vector of attributes and a class value
Output: RBM with n neurons
1. Set n := primary number of neurons;
2. Train the RBM with n neurons using the training data;
3. Run the SCM algorithm;
4. Set Threshold := mean of the weights;
5. Set n := number of neurons whose weights are more than the threshold;
6. Keep the neurons whose weights are more than the threshold and remove the other neurons from the hidden layer;
7. The RBM with n neurons is ready to use;

This method can be considered from another perspective. Assume an RBM that succeeds in learning the stochastic mapping T: x → h → x everywhere; such an RBM is not suitable for machine learning tasks. Instead, RBMs are designed to be incapable of learning to copy perfectly; usually, they are restricted in ways that allow them to copy only approximately (Goodfellow et al., 2016). These properties are obtained by using the SCM algorithm, since the proposed algorithm removes the representation features with ranks less than the threshold, thus carrying out this mapping only approximately. As a result, this novel method predicts the reasonable number of hidden neurons that generates the best representation of the input data.
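A compact sketch of this phase is given below. It reuses the cd1_update and scm_ranks helpers from the earlier sketches and is our own outline of Table 1 rather than the authors' implementation; the class labels are used only for the SCM ranking, not for training the RBM.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def self_organize_rbm(X, y, alpha=1.3, epochs=50, rng=None):
    """Table 1: train an RBM with alpha * d_input hidden units, then keep only the
    units whose SCM weight exceeds the mean weight."""
    rng = rng or np.random.default_rng(0)
    d_input = X.shape[1]
    n_primary = int(alpha * d_input)                  # Eq. 36, primary number of neurons
    W = 0.01 * rng.standard_normal((d_input, n_primary))
    b, c = np.zeros(d_input), np.zeros(n_primary)
    for _ in range(epochs):                           # step 2: unsupervised CD-1 training
        W, b, c = cd1_update(W, b, c, X, rng=rng)
    H = sigmoid(X @ W + c)                            # new features in the hidden layer
    weights = scm_ranks(H, y)                         # step 3: SCM weight of each new feature
    keep = weights > weights.mean()                   # steps 4-6: threshold = mean of the weights
    return W[:, keep], b, c[keep]                     # step 7: the pruned RBM is ready to use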

3.2. Estimation of the reasonable number of hidden layers

In the second phase, the proposed algorithm estimates the number of layers. The number of RBMs in the model is increased according to the classification complexity, and this process continues until the best performance is achieved. This phase consists of two steps. In the first step, the algorithm adds the first RBM to the model, and the training process is applied according to the first phase. Then, a fine-tuning layer is added, which is trained using the training data, including the input vector and the class label. Afterward, the mean squared error (MSE) is calculated using the validation data, including the input vector and the class label. The algorithm then removes the fine-tuning layer, adds the second RBM, and repeats the above process for the second RBM. In the second step, the algorithm compares the MSEs of the two RBMs. At this point, there are two situations:

1. If the MSE of the first RBM is greater than the MSE of the second RBM, the presence of the second RBM is useful, and the separation between the classes with two layers is better than with one layer. In this situation, the algorithm continues to add RBMs, repeating this process until the above condition no longer holds or the number of RBMs equals the maximum number of RBMs (a user-defined parameter).

2. If the MSE of the first RBM is less than the MSE of the second RBM, then the second RBM is not useful for solving the classification problem, and the algorithm removes it. In this situation, the algorithm stops the design process, and the DBN with a reasonable structure (a reasonable number of nodes and layers) is ready to use.

The proposed algorithm estimates the reasonable number of layers that generates the best representation of the input data, leading to the best separation between the classes. Table 2 shows the whole process in this phase.

Table 2. Pseudo-code of estimation of the reasonable number of hidden layers.
Algorithm: estimation of the reasonable number of hidden layers
Input: for each training and validation instance, a vector of attributes and a class value
Output: DBN with i hidden layers
1. Set m := maximum number of RBMs;
2. For i := 1 to m do begin
3. Add the i-th RBM;
4. Estimate the reasonable number of neurons;
5. Add the fine-tuning layer;
6. Calculate the MSE of the model using the validation data;
7. Remove the fine-tuning layer;
8. If i > 1 and the MSE of the i-th RBM > the MSE of the (i-1)-th RBM
9. Remove the i-th RBM;
10. Break;
11. Endif;
12. End;
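The layer-growing loop of Table 2 can be outlined as follows. This is again our own sketch reusing self_organize_rbm from the previous snippet; for brevity, the softmax fine-tuning layer of the paper is replaced here by a least-squares linear head on one-hot labels, which is only a stand-in for computing the validation MSE.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hot(y):
    y = np.asarray(y)
    classes = np.unique(y)
    return (y[:, None] == classes[None, :]).astype(float)

def head_mse(H_train, y_train, H_val, y_val):
    """Fit a simple linear head on the representation and return the validation MSE."""
    T_train, T_val = one_hot(y_train), one_hot(y_val)
    A, *_ = np.linalg.lstsq(H_train, T_train, rcond=None)
    return float(np.mean((H_val @ A - T_val) ** 2))

def grow_dbn(X_train, y_train, X_val, y_val, max_rbm=3, rng=None):
    """Table 2: add self-organizing RBMs while the validation MSE keeps decreasing."""
    rng = rng or np.random.default_rng(0)
    layers, best_mse = [], np.inf
    H_train, H_val = X_train, X_val
    for _ in range(max_rbm):
        W, b, c = self_organize_rbm(H_train, y_train, rng=rng)
        cand_train = sigmoid(H_train @ W + c)          # representation of the candidate RBM
        cand_val = sigmoid(H_val @ W + c)
        mse = head_mse(cand_train, y_train, cand_val, y_val)
        if mse >= best_mse:                            # the new RBM does not help: remove it, stop
            break
        layers.append((W, b, c))                       # keep the new RBM and continue growing
        best_mse = mse
        H_train, H_val = cand_train, cand_val
    return layers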

4. Benchmark task

We have tested our novel approach on a benchmark image classification problem: MNIST, the well-known digit classification dataset. We used this classification problem from the same benchmark as Vincent (Vincent, Larochelle, Bengio, & Manzagol, 2008). The MNIST dataset contains 28 × 28 (784) pixel handwritten digit images and consists of 10 categories. We considered 10,000 examples for training, 2000 for validation, and 50,000 for testing (Larochelle, Erhan, Courville, Bergstra, & Bengio, 2007; Rifai, Vincent, Muller, Glorot, & Bengio, 2011; Vincent et al., 2008).

We have also tested our approach on a network traffic dataset. Network traffic classification has significant efficacy in computer network management and is a challenging problem in this area. In this paper, we use the Moore network traffic dataset (A. Moore, Zuev, & Crogan, 2013; A. W. Moore & Zuev, 2005), which is from the University of Cambridge and is called the Moore-set. There are 130,623 samples with 248 features in each sample; some of the features are presented in Table 3. The dataset consists of 12 categories, and statistics of the Moore-set are shown in Table 4. The numbers of samples in the different classes are not balanced, which decreases the classification accuracy, and only the classes with a suitable number of samples (such as WWW, P2P, etc.) can be classified precisely.

Table 3. Part of the features (A. Moore et al., 2013).
Features:
Flow duration
TCP port
Packet inter-arrival time
Payload size
Effective bandwidth based upon entropy
Fourier transform of the packet inter-arrival time

Table 4. Statistics of the Moore-set.
Network flow class | Number of samples
WWW | 109,130
P2P | 10,871
Database | 5387
Bulk | 3160
Mail | 1479
Chat | 185
Interactive | 173
Multimedia | 73
Services | 51
Unknown | 51
Viop | 34
Attack | 29

Fig. 6. Confusion matrix of the MNIST dataset (test).

Fig. 7. Confusion matrix of the Moore-set (test).

To avoid this problem, we extract part of the dataset to build an appropriate dataset. Thirty-nine features contain missing values and are therefore removed from the dataset. Statistics of this extracted dataset are shown in Table 5.

We have also utilized the Wisconsin breast cancer diagnostic dataset, which consists of 569 samples. The samples are classified into two classes, namely malignant and benign, and comprise 30 real-valued features with no missing values. The samples include 357 benign and 212 malignant cases. This dataset is from the UC Irvine Machine Learning Repository and is called WDBC (Wisconsin Diagnostic Breast Cancer).

5. Experimental results

We applied the proposed method to the known datasets to compare the performance of the self-organizing deep model (the self-organizing structure) with other deep models (user-defined structures). The softmax function is utilized in the fine-tuning layer. The learning rate η is set to η_max, which is chosen according to the Lyapunov stability theorem (Alibakhshi, Teshnehlab, Alibakhshi, & Mansouri, 2015; Shoorehdeli, Teshnehlab, & Sedigh, 2008). In the proposed algorithm, we can control the depth and width of the self-organizing RBMs in the deep representation through suitable values for α (the primary number of neurons), β (in the SCM algorithm), and the threshold (the feature selection procedure in the hidden layer). The self-organizing deep model is designed according to the proposed algorithm for the classification of the MNIST, Moore network traffic, and WDBC datasets. Some of the essential parameters of the proposed algorithm for these experiments are shown in Table 6. In all experiments, we do not perform a grid search or random search to determine the model structure, since the self-organizing deep model determines its structure automatically according to the input data. The architecture of the self-organizing deep model, which is designed automatically according to the input data, is presented in Table 7 for all datasets. The performance on the training, validation, and test datasets is reported in Table 8. The confusion matrices on the test datasets are also shown in Figs. 6–8. We believe that the results could be even better if these experiments (according to the proposed algorithm) were carried out using another loss function with different penalty terms, different sampling methods (Keyvanrad & Homayounpour, 2014, 2015), learning algorithms (Ioffe & Szegedy, 2015), different types of units, or different models in the fine-tuning layer.

Dataset sources: MNIST: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007#Downloadable_datasets; Moore-set: http://www.cl.cam.ac.uk/research/srg/netos/projects/brasil/data/index.html; WDBC: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).


Table 5. Statistics of the extracted dataset from the Moore-set.
Num. | Network flow class | Number of samples | Training data | Validation data | Test data
1 | WWW | 109,130 | 1300 | 200 | 1500
2 | P2P | 10,871 | 1300 | 200 | 1500
3 | Database | 5387 | 1300 | 200 | 1500
4 | Bulk | 3160 | 1300 | 200 | 1660
5 | Mail | 1479 | 1000 | 200 | 279
6 | Chat | 185 | 100 | 20 | 65
7 | Interactive | 173 | 100 | 20 | 53
8 | Multimedia | 73 | 0 | 0 | 0
9 | Services | 51 | 0 | 0 | 0
10 | Unknown | 51 | 0 | 0 | 0
11 | Viop | 34 | 0 | 0 | 0
12 | Attack | 29 | 0 | 0 | 0

Table 6. User-defined parameters of the proposed algorithm.
Dataset | α | Maximum number of RBMs | Threshold | β
MNIST | 1.3 | 3 | Mean of weights | 0.5
Moore-set | 1.3 | 3 | Mean of weights | 0.5
WDBC | 1.1 | 3 | Mean of weights | 0.5

Table 7. The architecture of the self-organizing deep model.
Dataset | Number of hidden layers | Number of hidden neurons
MNIST | 1 | 523
Moore | 1 | 96
WDBC | 1 | 16

Table 8. Accuracy and error rate on the MNIST, Moore-set, and WDBC datasets.
Dataset | Dataset category | Accuracy rate (%) | Error rate (%)
MNIST | Training | 97.6 | 2.4
MNIST | Validation | 94.1 | 5.9
MNIST | Test | 93.6 | 6.4
Moore-set | Training | 95.5 | 4.5
Moore-set | Validation | 94.6 | 5.4
Moore-set | Test | 96.2 | 3.8
WDBC | Training | 97.5 | 2.5
WDBC | Validation | 100 | 0
WDBC | Test | 99.1 | 0.9

Fig. 8. Confusion matrix of the WDBC dataset (test).

The learning process of the deep model by testing various numbers of layers and nodes is very time-consuming and has a high computational cost; therefore, the architectures of the user-defined RBMs are taken from the numbers of layers and neurons used in other published works. The structures of the user-defined models and their performance on the test data, as reported in some published works, are shown in Table 9.

In the user-defined deep model, we would have to test many different structures, including several numbers of hidden layers and various numbers of neurons in each hidden layer, and then select the model with the best performance. This procedure is very time-consuming and computationally expensive, especially for big data, so we have to utilize an intelligent algorithm to determine the structure of the deep model. The self-organizing deep model can determine its structure automatically according to the input data. It does not need a random search or grid search, so it requires far less time and computational cost than traditional methods in both the learning and operation phases. Moreover, the self-organizing deep model has obtained better accuracy with the smallest numbers of nodes and layers (Tables 7 and 9). Additionally, the proposed algorithm does not use any separate regularization method, since it acts as a regularizer by removing the neurons whose weights are less than the threshold.

Table 9. The user-defined architectures of RBMs and their accuracy.
Dataset | Reference | Number of hidden layers | Number of hidden neurons | Test accuracy (%)
MNIST | (Keyvanrad & Homayounpour, 2015) | 3 | [500 500 2000] | 93.6
MNIST | (Chen, Qu, & Zhao, 2017) | 3 | [200 200 5000] | 95.6
Moore-set | (Chen et al., 2017) | 5 | [90 90 90 90 90] | 93.8
WDBC | (Abdel-Zaher & Eldeib, 2016) | 2 | [4 2] | 99.7


6. Conclusion

In summary, we discussed some of the principal obstacles in the field of deep learning, specifically the questions of what makes one representation better than another, and how many layers and how many neurons in each layer of a deep structure are appropriate. To address these obstacles, we proposed a new algorithm to design RBMs with a reasonable structure. First, the SCM algorithm is utilized to estimate the number of neurons in each hidden layer; then, validation data is used to determine the number of hidden layers in the DBN. Finally, we applied the proposed approach to classification problems. The results showed that our approach succeeds in this area with high accuracy. In all experiments, the self-organizing deep model was designed automatically according to the input data with the smallest numbers of layers and neurons, whereas other published works utilized many more layers and neurons. We assigned an appropriate capacity to the deep model according to the complexity of the problem, and we did not use any separate regularization method. The proposed algorithm removes the need to try various structures with various numbers of neurons and layers in the learning procedure, so we do not need any random search or grid search. These properties decrease the time spent and the computational burden in both the learning and operation phases.

Future work could include other applications of the proposed algorithm. We will employ it to design self-organizing deep models for health informatics applications and mainly focus on key applications of deep learning in the fields of bioinformatics and medical informatics. In addition, in future studies we can extend the proposed algorithm to learn the structure of other deep models with different building blocks, such as auto-encoders and recurrent neural networks. The results could be even better if these experiments, according to the proposed algorithm, were carried out using another loss function with different penalty terms, different sampling methods, learning algorithms, different types of units, or different models in the fine-tuning layer.

CRediT authorship contribution statement

Saeed Pirmoradi: Investigation, Methodology, Project administration, Writing - original draft. Mohammad Teshnehlab: Supervision, Validation, Writing - review & editing. Nosratollah Zarghami: Formal analysis. Arash Sharifi: Formal analysis.

Authorship confirmation

As corresponding author I, Mohammad Teshnehlab, hereby confirm on behalf of all authors that:
1. This manuscript, or a large part of it, has not been published, was not, and is not being submitted to any other journal.
2. If presented at or submitted to or published at a conference(s), the conference(s) is (are) identified and substantial justification for re-publication is presented below. A copy of conference paper(s) is (are) uploaded with the manuscript.
3. If the manuscript appears as a preprint anywhere on the web, e.g. arXiv, etc., it is identified below. The preprint should include a statement that the paper is under consideration at Expert Systems with Applications.
4. All text and graphics, except for those marked with sources, are original works of the authors, and all necessary permissions for publication were secured prior to submission of the manuscript.
5. All authors each made a significant contribution to the research reported and have read and approved the submitted manuscript.

Declaration of Competing Interest

The authors have no conflicts of interest.

References

Abdel-Zaher, A. M., & Eldeib, A. M. (2016). Breast cancer classification using deep belief networks. Expert Systems with Applications, 46, 139–144. doi:10.1016/j.eswa.2015.10.015.
Alibakhshi, F., Teshnehlab, M., Alibakhshi, M., & Mansouri, M. (2015). Designing stable neural identifier based on Lyapunov method. Journal of AI and Data Mining, 3(2), 141–147. doi:10.5829/idosi.JAIDM.2015.03.02.03.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. doi:10.1109/TPAMI.2013.50.
Bruzzone, L., & Prieto, D. F. (1999). A technique for the selection of kernel-function parameters in RBF neural networks for classification of remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 37(2), 1179–1184.
Chen, L., Qu, H., & Zhao, J. (2017). Generalized correntropy based deep learning in presence of non-Gaussian noises. Neurocomputing.
Côté, M.-A., & Larochelle, H. (2016). An infinite restricted Boltzmann machine. Neural Computation.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems.
Fu, X., & Wang, L. (2003). Data dimensionality reduction with application to simplifying RBF network structure and improving classification performance. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(3), 399–409.
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning. Cambridge: MIT Press.
Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27–48. doi:10.1016/j.neucom.2015.09.116.
Hinton, G. E. (2007). Learning multiple layers of representation. Trends in Cognitive Sciences, 11(10), 428–434.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Hinton, G. E., & Salakhutdinov, R. R. (2012). A better way to pretrain deep Boltzmann machines. In Advances in Neural Information Processing Systems.
Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Momentum, 9(1), 926.
Ichimura, T., Oeda, S., Suka, M., Hara, A., Mackin, K. J., & Katsumi, Y. (2005). Knowledge discovery and data mining in medicine. In Advanced techniques in knowledge discovery and data mining (pp. 177–210). Springer.
Ichimura, T., Oeda, S., Suka, M., & Yoshida, K. (2005). A learning method of immune multi-agent neural networks. Neural Computing and Applications, 14(2), 132–148.
Ichimura, T., Tazaki, E., & Yoshida, K. (1995). Extraction of fuzzy rules using neural networks with structure level adaptation: verification to the diagnosis of hepatobiliary disorders. International Journal of Bio-Medical Computing, 40(2), 139–146.
Ichimura, T., & Yoshida, K. (2004). Knowledge-based intelligent systems for healthcare. Advanced Knowledge International.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research. http://proceedings.mlr.press.
Islam, M. M., Sattar, M. A., Amin, M. F., Yao, X., & Murase, K. (2009). A new adaptive merging and growing algorithm for designing artificial neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(3), 705–722.
Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep belief networks and introducing a new object oriented toolbox (DeeBNet). arXiv:1408.3264.
Keyvanrad, M. A., & Homayounpour, M. M. (2015). Deep belief network training improvement using elite samples minimizing free energy. International Journal of Pattern Recognition and Artificial Intelligence, 29(05), 1551006.
Kristiansen, G., & Gonzalvo, X. (2017). EnergyNet: Energy-based adaptive structural learning of artificial neural network architectures. arXiv preprint.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning.
Larson, R. (2012). Elementary linear algebra. Cengage Learning.
Moore, A. W., & Zuev, D. (2005). Internet traffic classification using Bayesian analysis techniques. In ACM SIGMETRICS Performance Evaluation Review.
Moore, A., Zuev, D., & Crogan, M. (2013). Discriminators for use in flow-based classification (1470–5559).
Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics.
Shoorehdeli, M. A., Teshnehlab, M., & Sedigh, A. (2008). Stable learning algorithm approaches for ANFIS as an identifier. IFAC Proceedings Volumes, 41(2), 7046–7051.
Stamkopoulos, T., Diamantaras, K., Maglaveras, N., & Strintzis, M. (1998). ECG analysis using nonlinear PCA neural networks for ischemia detection. IEEE Transactions on Signal Processing, 46(11), 3058–3067.
Strang, G. (2016). Introduction to linear algebra. Wellesley-Cambridge Press.
Tang, Y., Salakhutdinov, R., & Hinton, G. (2012). Robust Boltzmann machines for recognition and denoising. In 2012 IEEE Conference on Computer Vision and Pattern Recognition.
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning.
Wang, L., & Fu, X. (2006). Data mining with computational intelligence. Springer Science & Business Media.
Zeng, X., & Yeung, D. S. (2006). Hidden neuron pruning of multilayer perceptrons using a quantified sensitivity measure. Neurocomputing, 69(7-9), 825–837.