Dendrite Morphological Neurons Trained by Stochastic Gradient Descent

Accepted manuscript, to appear in Neurocomputing. PII: S0925-2312(17)30795-6. DOI: 10.1016/j.neucom.2017.04.044. Received 5 July 2016; revised 24 January 2017; accepted 16 April 2017.

Erik Zamora (a,*), Humberto Sossa (b)

(a) Instituto Politécnico Nacional, UPIITA, Av. Instituto Politécnico Nacional 2580, Col. Barrio la Laguna Ticoman, Gustavo A. Madero, 07340 Ciudad de México, México
(b) Instituto Politécnico Nacional, CIC, Av. Juan de Dios Batiz S/N, Col. Nueva Industrial Vallejo, Gustavo A. Madero, 07738 Ciudad de México, México

Abstract


Dendrite morphological neurons are a type of artificial neural network that works with min and max operators instead of algebraic products. These morphological operators build hyperboxes in N-dimensional space, which has allowed training methods based on heuristics that use no optimisation at all. In the literature, it has been claimed that such heuristic-based training has clear advantages: there are no convergence problems, perfect classification of the training set can always be reached, and training is performed in only one epoch. In this paper, we show that these assumed advantages come at a cost: because the heuristics are not optimal, they increase the classification error on the test set and generalise poorly. To solve these problems, we introduce a novel method to train dendrite morphological neurons for classification tasks based on stochastic gradient descent, using the heuristics only to initialise the learning parameters. Experiments show that this reduces the testing error compared with solely heuristic-based training methods, and that the approach reaches competitive performance with respect to other popular machine learning algorithms.

Keywords: Dendrite morphological neural network, Morphological perceptron, Gradient descent, Machine learning, Neural network

* Corresponding author. Tel.: +52 5545887175. Email address: [email protected], [email protected] (Erik Zamora)

1. Introduction

Morphological neural networks are an alternative way to model pattern classes. Classical perceptrons divide the input space into several regions, using hyperplanes as decision boundaries; when more layers are present, they divide the input space with a hypersurface. In contrast, morphological neurons divide the same space with several piecewise-linear boundaries that together can create complex nonlinear decision boundaries, allowing classes to be separated with only one neuron (Figure 1). This is not possible with a one-layer perceptron because it is a linear model. Morphological processing involves min and max operations instead of multiplications. These comparison operators generate the piecewise boundaries for classification problems and have the advantage of being easier to implement in logic devices such as FPGAs (Field Programmable Gate Arrays) (Weinstein and Lee [35]) than classical perceptrons.

This paper focuses on a specific type of morphological neuron called a Dendrite Morphological Neuron (DMN). This neuron has dendrites, and each dendrite can be seen as a hyperbox in high-dimensional space (a box in 2D). These hyperboxes have played an important role in the proposal of training methods. The common approach is to enclose patterns with hyperboxes and label each hyperbox with the right class. Many training methods that exploit this approach have been reported. However, most of them are heuristics and are not based on optimisation of the learning parameters. Heuristics can be useful when numerical optimisation is computationally very expensive or impossible to compute; in this paper, we show that this is not the case here. An optimisation of the learning parameters can be computed in reasonable time and improves the classification performance of a DMN over solely heuristic-based methods.

The most important training method in the scope of classical perceptrons has been gradient descent (Schmidhuber [26]). This simple but powerful tool has been applied to train multilayer perceptrons, leading to the backpropagation algorithm, which results from applying the chain rule from the last layer to the first. Even though morphological networks have more than 25 years of development reported in the literature and backpropagation is the most successful training algorithm, most training methods for morphological neurons are based on intuition instead of gradient descent. This is true at least for classification tasks: an approach based on gradient descent has been published in de A. Araujo [9], but only for regression. Extending this approach to classification is not trivial because the morphological operations are not differentiable everywhere. In this paper, we deal with this problem by adding a softmax layer (Bishop [3]) at the end of a DMN, calculating the gradients and evaluating four different methods to initialise the dendrite parameters. The contributions of this paper are:

• We extensively review the different neural architectures and training methods for morphological neurons.
• To the best of our knowledge, for the first time, the DMN architecture is extended with a softmax layer so that it can be trained by stochastic gradient descent (SGD) for classification tasks.
• We propose and evaluate four different heuristic-based methods to initialise dendrite parameters before training by SGD.
• We perform several experiments to show that our approach systematically obtains smaller classification errors than solely heuristic-based training methods.
• The code for our approach is available at Zamora [36].

The paper is organised as follows. In Section 2, we present the background needed to understand the rest of the paper. In Section 3, we provide an extensive review of morphological neural networks. In Section 4, we introduce our approach for training a DMN with SGD. In Section 5, we present the experiments and results, together with a short discussion, to show the effectiveness of our proposal. Finally, in Section 6, we give our conclusions and directions for future research.

2. Basics of DMNs

A DMN has been shown to be useful for pattern classification (Sossa and Guevara [28, 29], Sussner and Esmi [33]). This neuron works similarly to a radial basis function network (RBFN), but it uses hyperboxes instead of radial basis functions and it has only one layer. Each dendrite generates a hyperbox, and the most active dendrite is the one whose hyperbox is closest to the input pattern. The output y of this neuron (see Figure 1) is a scalar given by

\[ y = \arg\max_{c} \left( d_{n,c} \right), \tag{1} \]

where n is the dendrite number, c is the class number and d_{n,c} is the scalar output of a dendrite, given by

\[ d_{n,c} = \min\!\big( \min( \mathbf{x} - \mathbf{w}^{n}_{\min},\; \mathbf{w}^{n}_{\max} - \mathbf{x} ) \big), \tag{2} \]

where x is the input vector and w^n_min and w^n_max are the dendrite weight vectors. The min operators together check whether x is inside the hyperbox whose extreme points are w^n_min and w^n_max, as shown in Figure 1. If d_{n,c} > 0, x is inside the hyperbox; if d_{n,c} = 0, x lies on the hyperbox boundary; otherwise, it is outside. The inner min operator in (2) compares elements of the same dimension between the two vectors, producing another vector, while the outer min operator takes the minimum over all dimensions, producing a scalar. The activation function is the argmax operator proposed by Sussner and Esmi [33], instead of the hard-limit function commonly used in the SLMP (Ritter and Urcid [21]). The argmax operator allows more complex decision boundaries than simple hyperboxes to be built (see Figure 1). It is worth mentioning that if (1) produces several maxima, the argmax operator takes the first maximum as the index of the class to which the input pattern is assigned; qualitatively, this occurs when x is equidistant from more than one hyperbox.

Figure 1: Conventional DMN. (a) The neural architecture; (b) an example of a hyperbox in 2D generated by its dendrite weights: a dendrite output is positive when the pattern is in the green region (inside the corresponding hyperbox), zero when the pattern is on the red region (the hyperbox boundary), and negative in the white region (outside the hyperbox); (c) an example of DMN hyperboxes for a classification problem; (d) the decision boundaries generated by a DMN for the same classification problem as in (c). Note that the decision boundaries can be nonlinear and that a class region can consist of disconnected regions. These properties allow a DMN to solve any classification problem, separating the classes within a given tolerance (Ritter and Urcid [22]).
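To make the computation in Equations (1)-(2) concrete, the following minimal NumPy sketch evaluates a conventional DMN. It is our own illustration rather than the authors' released code; the array names (W_min, W_max, labels) and the toy hyperboxes are assumptions made for the example.

```python
import numpy as np

def dendrite_output(x, w_min, w_max):
    """Equation (2): positive inside the hyperbox, zero on its boundary,
    negative outside."""
    # Inner min: element-wise minimum of the two distance vectors;
    # outer min: smallest value over all dimensions.
    return np.min(np.minimum(x - w_min, w_max - x))

def dmn_predict(x, W_min, W_max, labels):
    """Equation (1): the most active dendrite assigns the class.

    W_min, W_max : (num_dendrites, N) arrays with the hyperbox corners.
    labels       : (num_dendrites,) array with the class of each dendrite.
    """
    d = np.array([dendrite_output(x, lo, hi) for lo, hi in zip(W_min, W_max)])
    # np.argmax returns the first maximum, matching the tie-breaking rule
    # described in the text.
    return labels[np.argmax(d)]

# Toy usage: two unit hyperboxes, one per class.
W_min = np.array([[0.0, 0.0], [2.0, 2.0]])
W_max = np.array([[1.0, 1.0], [3.0, 3.0]])
labels = np.array([0, 1])
print(dmn_predict(np.array([0.4, 0.2]), W_min, W_max, labels))  # prints 0
```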

3. Related Work

In this section, we present the different neural networks that use morphological operations (max, min, +) which have been proposed in the literature over the last 25 years. We discuss their advantages and disadvantages in terms of neural architectures and training methods. In Figure 2, we show a comprehensive review of the different types of decision boundaries that morphological neural networks can build in order to classify patterns. At the end of the section, a comprehensive comparison is given in Tables 1 and 2.

Figure 2: Morphological boundaries. Types of decision boundary for different morphological neural networks. (a) The Single-Layer MP uses semi-infinite hyperboxes parallel to the Cartesian axes. (b) The Two-Layer MP separates classes by hyperplanes parallel to the Cartesian axes. (c) Multilayer MRL-NNs classify data by general hyperplanes. The decision boundaries of the next three networks are closed surfaces: (d) the SLMP uses hyperboxes parallel to the Cartesian axes, (e) the OB-LNN uses rotated hyperboxes and (f) the L1, L∞ SLLP uses polytopes. Finally, (g) MP/CL and DMNN present complex nonlinear surfaces as decision boundaries, due to dendrite processing and the argmax activation function. In general, morphological neural networks have superior classification ability to classical single-layer perceptrons, and some morphological networks, such as (c) and (g), have at least the same capability as multilayer perceptrons. Even so, morphological neurons are less popular than classical perceptrons among practitioners, engineers and researchers. We hope this work helps them to see the benefits of morphological neurons, particularly dendrite morphological neurons (g). (All acronyms are defined in the main text.)

Morphological neurons were created as a combination of image algebra and artificial neural networks. Ritter and Davidson proposed the first morphological neural network model in their seminal papers (Davidson and Ritter [7], Davidson and Sun [8], Ritter and Davidson [18], Davidson and Hummer [6]) for template learning in dilation image processing, and three different learning rules were given for this specific task. In Ritter and Sussner [20], a more general model for binary classification was proposed, called the Single-Layer Morphological Perceptron (Single-Layer MP). These authors studied the computing capabilities of the Single-Layer MP, showing that: 1) it classifies data using semi-infinite hyperboxes parallel to the Cartesian axes as decision boundaries, 2) it can solve some classification problems that are not linearly separable and 3) morphological networks can be used to form associative memories. In Sussner [30], Sussner extended this work, proposing a Two-Layer Morphological Perceptron (Two-Layer MP) and giving a training method. In the following years, the hidden layer would become known as dendrites. The Two-Layer MP is more powerful because it uses parallel hyperplanes to build arbitrary decision boundaries, e.g. to solve the XOR problem.

In Ritter and Urcid [21], Ritter and Urcid proposed the first morphological neuron using dendrite processing. This neuron is also called the Single-Layer Morphological Perceptron (SLMP) or Single-Layer Lattice Perceptron (SLLP); note that its architecture is different from the Single-Layer MP of Ritter and Sussner [20]. The SLMP separates classes using hyperboxes parallel to the Cartesian axes. In Ritter and Urcid [21], the authors prove that for any compact region in the classification space, we can always find an SLMP that separates this region from the rest within a given tolerance. A generalisation of this theorem to the multiclass problem is presented in Ritter and Schmalz [19] and Ritter and Urcid [22], ensuring that an SLMP can always classify different compact regions. Therefore, the SLMP is more powerful than the classical single-layer perceptron and has abilities similar to those of the classical two-layer perceptron. The problem with this theoretical result is that the training method is key to achieving a useful approximation in practice. The use of hyperboxes as decision boundaries allows training methods based on heuristics: 1) Elimination, where the training (Ritter and Urcid [22]) encloses all patterns inside a hyperbox and eliminates misclassified patterns by enclosing them with several other hyperboxes; and 2) Merging, where the training (Ritter and Urcid [22]) encloses each pattern inside a hyperbox and then merges the hyperboxes.

Some improvements have been made to the SLMP. In the standard SLMP, hyperboxes are parallel to the Cartesian axes. To remove this limitation, Barmpoutis and Ritter [2] proposed the Orthonormal Basis Lattice Neural Network (OB-LNN), which uses a rotation matrix that multiplies the inputs. Training can be performed with Elimination or Merging, adding an optimisation procedure to determine the best orientation for each hyperbox; this extra procedure can be time consuming, but it can reduce the number of hyperboxes. In Ritter et al. [24], a new single-layer lattice perceptron is proposed in which the dendrites use the L1 and L∞ norms in order to soften the corners of the hyperboxes. As a result of this modification, this neural network is able to use polytopes as decision boundaries; the Elimination training method is extended to this kind of neuron. If we use the argmax operator as the activation function of an SLMP (instead of a hard limiter), we can create more complex nonlinear surfaces (instead of hyperboxes) to separate classes. Sussner has proposed a Morphological Perceptron with Competitive Neurons (MP/CL) in Sussner and Esmi [31, 32, 33], where competition is implemented by the argmax operator. Sussner and colleagues developed a training algorithm based on the idea of classifying patterns using one hyperbox per class; if two hyperboxes overlap, the overlapping region is broken down into two smaller hyperboxes, one for each class. We call this training method Hyperbox per Class and Divide. Later, using the same neural network model but calling it a Dendrite Morphological Neural Network (DMNN), a new training method based on a Divide and Conquer strategy was proposed in Sossa and Guevara [29]. An application to recognising 3D objects using Kinect is presented in Sossa and Guevara [28]. This training has been shown to outperform Elimination training. It begins by enclosing all patterns in one hyperbox and then divides the hyperbox into 2^N sub-hyperboxes if there is misclassification (where N is the number of dimensions of the pattern space). This procedure is called recursively for each sub-hyperbox until every pattern is perfectly classified.

In another line of research, there have been two proposals to use gradient descent and backpropagation to train morphological neural networks. In Pessoa and Maragos [17], Morphological/rank/linear neural networks (MRL-NNs) combine classical perceptrons with morphological/rank neurons: the output of each layer is a convex combination of linear and rank operations applied to the inputs, and training is based on gradient descent. MRL-NNs have been shown to solve complex classification problems, such as recognising digits in images (the NIST dataset), with results similar to or better than classical MLPs in shorter training times. This approach uses no hyperboxes, which is a major difference with respect to our work; hyperboxes make it easy to initialise the learning parameters for gradient optimisation. In de A. Araujo [9], a neural model similar to the DMNN but with a linear activation function has been applied to regression problems such as forecasting stock markets. This architecture is called the Increasing Morphological Perceptron (IMP) and is also trained by gradient descent. The drawback of these gradient-based training methods is that the number of hidden units (dendrites or rank operations) must be tuned as a hyperparameter; the advantage of creating more hyperboxes during training is lost. Our approach preserves this advantage by using heuristic-based training methods as pre-training. Special mention goes to Dendritic Lattice Associative Memories (DLAM, Ritter et al. [23], Urcid and Ritter [34], Ritter and Urcid [22]). This model can create associative memories using two layers of morphological neurons with a simple training method; associative memories are beyond the scope of this paper.

In summary, morphological neurons have shown abilities that are better than or similar to those of classical perceptrons: 1) decision boundaries are nonlinear with one layer, 2) in some models, units (dendrites/hyperboxes) are generated automatically during training, 3) the number of units is always at most the number of training patterns, 4) training based on heuristics presents no convergence problems and 5) training can be performed in one epoch. Another advantage claimed in the literature is that morphological neurons can achieve perfect classification of any training data. This is true for training methods such as Elimination, Merging, Hyperbox per Class and Divide, and Divide and Conquer. However, perfect classification is inconvenient in practice because it leads to overfitting of the training data and increases the complexity of the learned model. A perfectly fitted model learns irrelevant noise in the data instead of finding the underlying data structure. Furthermore, a perfect model is more complex in the sense that the number of learned parameters is usually close to the number of training patterns. We address these problems by modifying the neural architecture and applying SGD to train these neurons. Our proposal is inspired by the MP/CL and DMNN approaches, but we use a softmax operator as a soft version of argmax, together with dendrite clusters for each class. Our work differs from the IMP (de A. Araujo [9]) in three ways: 1) we focus on classification problems instead of regression problems, 2) we reuse some heuristics to initialise hyperboxes before training and 3) we extend the neural architecture with a softmax layer. It also differs from MRL-NNs (Pessoa and Maragos [17]) in the neural architecture, which does not incorporate linear layers, only morphological layers.

Table 1: Comparison of the different neural networks that use morphological operations in terms of: 1) the task they solve, 2) the geometrical properties of the decision boundaries, 3) the output domain, 4) the number of layers, and 5) whether the network uses dendrites or not.

Neural network model | Task | Decision boundaries | Output domain | # layers | Dendrites
MN (Davidson and Hummer [6]) | Image dilation | - | R | 1 | No
Single-Layer MP (Ritter and Sussner [20]) | Binary classification | Semi-infinite parallel hyperboxes | {0, 1} | 1 | No
Two-Layer MP (Sussner [30]) | Binary classification | Parallel hyperplanes | {0, 1} | 2 | Yes
Multilayer MRL-NNs (Pessoa and Maragos [17]) | Multiclass classification | Hyperplanes | R or [0, 1] | >1 | No
SLMP (Ritter and Urcid [21]) | Multiclass classification | Parallel hyperboxes | {0, 1} | 1 | Yes
OB-LNN (Barmpoutis and Ritter [2]) | Multiclass classification | Rotated hyperboxes | {0, 1} | 1 | Yes
L1, L∞-SLLP (Ritter et al. [24]) | Multiclass classification | Polytopes | {0, 1} | 1 | Yes
DLAM (Ritter and Urcid [22]) | Associative memory | - | R | 2 | Yes
MP/CL (Sussner and Esmi [33]) | Multiclass classification | Complex nonlinear surface | N | 1 | Yes
IMP (de A. Araujo [9]) | Regression | - | R | 1 | Yes
DMNN (Sossa and Guevara [29]) | Multiclass classification | Complex nonlinear surface | N | 1 | Yes


Table 2: Comparison of the different methods for training morphological neural networks in terms of: 1) the corresponding neural network model, 2) whether dendrites are created automatically or not, 3) whether the training is incremental or in batches and 4) the maximum number of epochs for training. All listed training methods are for supervised learning.

Training method | Neural network model | Automatic dendrite creation | Training type | Epochs limit
Template learning for image dilation (Davidson and Hummer [6]) | MN | No | Incremental | No limit
Single-Layer MP rule (Ritter and Sussner [20]) | Single-Layer MP | No dendrites | Incremental | Number of patterns
Two-Layer MP rule (Sussner [30]) | Two-Layer MP | Yes | Unique batch | 1
Elimination (Ritter and Urcid [22]) | SLMP, OB-LNN, L1 L∞-SLLP | Yes | Unique batch | 1
Merging (Ritter and Urcid [22]) | SLMP | Yes | Unique batch | 1
Hyperbox per Class and Divide (Sussner and Esmi [33]) | MP/CL | Yes | Unique batch | 1
Gradient descent (Pessoa and Maragos [17], de A. Araujo [9]) | Multilayer MRL-NNs and IMP | No | Incremental | No limit
Divide and Conquer (Sossa and Guevara [29]) | DMNN | Yes | Unique batch | 1

4. DMN Trained by SGD

In this section, we present our proposal. We modify the neural architecture of a conventional DMN so that it can be trained by SGD (Montavon et al. [16]), and we present the SGD training method in some detail for practitioners who are interested in using it. The code and a demo of the training algorithm are available online at Zamora [36].

4.1. DMN with Softmax Layer

The principal problem in training a conventional DMN by SGD is calculating the gradients when there is an argmax function at the output, because argmax is a discrete function. This can easily be overcome by using a softmax layer instead of the argmax function. The softmax layer normalises the dendrite outputs so that they are restricted between zero and one and can be interpreted as a measure of the likelihood Pr_c that a pattern x belongs to class c, as follows:

\[ Pr_c(\mathbf{x}) = \frac{\exp\big(d_c(\mathbf{x})\big)}{\sum_{k=1}^{N_C} \exp\big(d_k(\mathbf{x})\big)}. \tag{3} \]

The assigned class is then taken from these probabilities as

\[ y = \arg\max_{c} \big( Pr_c(\mathbf{x}) \big). \tag{4} \]

We propose forming dendrite clusters d_c in order to have several hyperboxes per class, as in de A. Araujo [9] and Sossa et al. [27]. Figure 3 shows the proposed neural architecture. If a classification problem has N_c classes, then we need N_c dendrite clusters, one per class. The output of each dendrite cluster takes the maximum over its dendrites h, as follows:

\[ d_c(\mathbf{x}) = \max_{k} \big( h_{k,c}(\mathbf{x}) \big), \tag{5} \]

where dendrite k of cluster c is given by

\[ h_{k,c}(\mathbf{x}) = \min\!\big( \min( \mathbf{x} - \mathbf{w}_{k,c},\; \mathbf{w}_{k,c} + \mathbf{b}_{k,c} - \mathbf{x} ) \big). \tag{6} \]

Equation (6) is very similar to Equation (2): both represent a dendrite in the neural architecture and, graphically, a hyperbox in a space of any dimension. Remember that hyperboxes (dendrites) are the elementary computing units in our model. Note, however, that we change the way a hyperbox is codified mathematically. In the conventional DMN of Equation (2), a hyperbox is represented by its extreme points w_min and w_max, as depicted in Figure 1(b). Equation (6) instead represents a hyperbox by its lowest extreme point w_{k,c} and its size vector b_{k,c}, which determines the hyperbox size along each dimension.
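As an illustration of Equations (3)-(6), the sketch below computes the cluster responses and the softmax probabilities for one input pattern. It is a minimal example under our own conventions (each class is represented by a pair of arrays W and B holding the lower corners w_{k,c} and size vectors b_{k,c}); it is not taken from the code released with the paper.

```python
import numpy as np

def cluster_response(x, W, B):
    """Equations (5)-(6): d_c(x) = max_k min_i min(x_i - w_i, w_i + b_i - x_i).

    W, B : (K, N) arrays; row k holds the lower corner and the size vector
    of the k-th hyperbox of this class."""
    h = np.min(np.minimum(x - W, W + B - x), axis=1)  # one response per dendrite
    return np.max(h)

def dmn_softmax_forward(x, clusters):
    """Equations (3)-(4): class probabilities and predicted label.

    clusters : list with one (W, B) pair per class."""
    d = np.array([cluster_response(x, W, B) for W, B in clusters])
    e = np.exp(d - np.max(d))            # shift for numerical stability
    probs = e / np.sum(e)                # Pr_c(x), Equation (3)
    return probs, int(np.argmax(probs))  # Equation (4)
```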

Figure 3: Neural architecture of a DMN with a softmax layer. Each class corresponds to one dendrite cluster d_c.

4.2. Training Method

A key issue for morphological neural networks is their training. We need to determine automatically the number of hyperboxes in each dendrite cluster and the dendrite parameters w_{k,c} and b_{k,c}. This is a great difference with respect to classical perceptrons, where only the learning parameters are determined during training. Several training approaches for morphological neurons have been proposed, as shown in Section 3. In this paper, we propose to initialise the hyperboxes with heuristics and then optimise the dendrite parameters by SGD (Montavon et al. [16]). In Section 5, we show that this method is better than training based only on heuristics.

We first explain the optimisation stage. The softmax layer makes it possible to estimate the probability Pr_c(x) that a pattern x belongs to class c. The cross entropy between these probabilities and the targets (the ideal probabilities) can be used as the objective function J:

\[ J = - \sum_{q=1}^{Q} \sum_{c=1}^{N_C} 1\{t_q = c\} \log\big( Pr_c(\mathbf{x}_q) \big), \tag{7} \]

where Q is the total number of training samples, q is the index of a training sample, t_q ∈ N is the target class of that training sample, and 1{·} is the indicator function, so that 1{true statement} = 1 and 1{false statement} = 0. The purpose is to minimise the cost function (7) with the SGD method. The softmax layer makes the gradients easier to compute because the cross-entropy and softmax functions are continuous and very well known in the machine learning community. Furthermore, the optimisation especially benefits from this cost function when the task is classification. If we did not add a softmax layer, we would need a 1-norm or 2-norm error between the targets and the dendrite outputs; this kind of cost function works well for regression problems, but not for classification. Kline and Berardi [13] and Golik et al. [11] have shown that cross entropy finds better local optima and has practical advantages over the squared-error function.

In order to update the DMN's parameters by SGD, the gradients dJ/dw_{k,c} and dJ/db_{k,c} must be calculated. They are given by

\[ \frac{dJ}{d\mathbf{w}_{k,c}} = - \sum_{q=1}^{Q_{batch}} \begin{bmatrix} 0 & \cdots & 0 & f_{w_{k,c}} & 0 & \cdots & 0 \end{bmatrix}^{T}, \tag{8} \]

\[ \frac{dJ}{d\mathbf{b}_{k,c}} = - \sum_{q=1}^{Q_{batch}} \begin{bmatrix} 0 & \cdots & 0 & f_{b_{k,c}} & 0 & \cdots & 0 \end{bmatrix}^{T}, \tag{9} \]

where

\[ f_{w_{k,c}} = \begin{cases} 0 & k \neq k^{*} \\ -\big(1\{t_q = c\} - Pr_c(\mathbf{x}_q)\big) & k = k^{*} \wedge j^{*} = 1 \\ 1\{t_q = c\} - Pr_c(\mathbf{x}_q) & k = k^{*} \wedge j^{*} = 2 \end{cases} \tag{10} \]

\[ f_{b_{k,c}} = \begin{cases} 0 & k \neq k^{*} \vee j^{*} = 1 \\ 1\{t_q = c\} - Pr_c(\mathbf{x}_q) & k = k^{*} \wedge j^{*} = 2 \end{cases} \tag{11} \]

Here k* is the index of the dendrite that generates the maximum response in dendrite cluster c, j* is the column of the matrix [x − w_{k,c}, w_{k,c} + b_{k,c} − x] in which the minimum lies, and the position of f_{w_{k,c}} and f_{b_{k,c}} inside the vectors of (8) and (9) is given by the row of that matrix in which the minimum lies. Because of the min and max operators in the DMN, most gradient elements are zero: when the cost is computed from a single training sample, only one component of w_{k,c} and b_{k,c}, and only one dendrite per cluster, is updated by gradient descent. In practice, we prefer to apply gradient descent on batches of Q_batch training samples taken at random.

The pseudo-code of the training is shown in Algorithm 1. The first line sets the hyperparameters: the learning rate α, the number of iterations N_i and the batch size Q_batch. The second line initialises the DMN's parameters (w_{k,c}, b_{k,c}), saving them in the dendrites object variable. The batch is sampled at random in line 4, the gradients are computed in line 5, and the model parameters are updated in lines 6 and 7. The variables (X, T) contain the training samples and the corresponding targets.

Algorithm 1  Training procedure for a DMN based on SGD.
1: initialise hyperparameters α, N_i, Q_batch
2: dendrites := parameters_initialisation(X, T)
3: for i = 0 to N_i do
4:     (X_batch, T_batch) := compute_batch(X, T)
5:     (dJ/dw_{k,c}, dJ/db_{k,c}) := compute_gradients(dendrites, X_batch, T_batch)
6:     dendrites.w_{k,c} := dendrites.w_{k,c} − α dJ/dw_{k,c}   for all k, c
7:     dendrites.b_{k,c} := dendrites.b_{k,c} − α dJ/db_{k,c}   for all k, c
8: end

Algorithm 1 This is the training procedure for a DMN based on SGD. 1: initialise hyperparameters α, Ni , Qbatch 2: dendrites :=parameters_initialisation(X, T) 3: for i = 0 to Ni do 4: (Xbatch , Tbatch ) :=compute_batch(X, T) dJ , dJ ) :=computegradients(dendrites, Xbatch , Tbatch ) 5: ( dw k,c dbk,c

ACCEPTED MANUSCRIPT

14

299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319

CR IP T

298

• dHpC initialisation. This method divides each hyperbox which was previously generated by the above method. The number of hyperboxes produced by this method is equal to 2nd Nc , where nd is the number of dimensions that are divided. For example, if nd = 1, the algorithm divides the hyperboxes in half with respect to the first dimension, producing 2Nc hyperboxes. If nd = 2, it divides the hyperboxes in half with respect to the first and second dimensions, producing 4Nc hyperboxes, and so on.

• D&C initialisation (Sossa and Guevara [29]). The dendrite parameters are set equal to the dendrites produced by the Divide and Conquer training method proposed in Sossa and Guevara [29]. This algorithm encloses all patterns in one hyperbox with a distance margin of M ∈ [0, ∞) and divides this hyperbox into 2N sub-hyperboxes if there is misclassification. This last procedure is called recursively for each sub-hyperbox until each pattern is classified with a given tolerance error E0 . This method is computationally more expensive but it captures the data distribution in a better way. In this paper, we deploy an improved version of this algorithm which can avoid the curse of dimensionality and overfitting. This new algorithm will be published later (Guevara & Sossa, 2016, Personal Communication).

AN US

297

• K-means initialisation. The classical k-means algorithm is applied to obtain S clusters for each training set of the same class. The k-means clusters are transformed to hyperboxes, and we use these hyperboxes to set dendrite parameters (we do the inverse process proposed in Sossa et al. [27]). We use Gaussians as radial basis functions for clustering:

M

296

2

321

322

wk,c = µk,c − ∆k,c ,

(13)

bk,c = 2∆k,c , s − log(y0 ) , ∆k,c = βk,c

(14) (15)

where y0 ∈ (0, 1] determines the hyperbox size (note that vector operations in (15) are element-wise). This initialisation method also increases computational cost but can create more complex learning models than HpC and dHpC methods.

AC

325

CE

323

324

(12)

The cluster centre µk,c and width βk,c are used to calculate the dendrite parameters to form a hyperbox for each Gaussian cluster, as follows:

PT

320

ED

φk,c (x) = exp(−βk,c kx − µk,c k ),

326 327

328

329 330

5. Results and Discussion In this section, we compare our proposal with other training methods for DMN and popular machine learning methods based on synthetic and real datasets

ACCEPTED MANUSCRIPT

PT

ED

M

AN US

CR IP T

15

AC

CE

Figure 4: Initialisation methods. We illustrate the four initialisation methods. The HpC method initialises dendrite parameters, enclosing each class with a hyperbox with a margin (even negative). The dHpC method is the same as the previous method but allows division of hyperboxes in half with respect to some dimensions, generating more hyperboxes per class. The D&C method is specifically to use D&C training to initialise dendrites. The k-means method first makes a clustering to define where to put the initial hyperboxes using the classical k-means method.

ACCEPTED MANUSCRIPT

5.1 Synthetic and Real Datasets

337 338

339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367

5.1. Synthetic and Real Datasets

Our proposed method is first applied to synthetic datasets with high overlap between classes. We chose datasets where overlap is present because overfitting is more relevant in this kind of problem. The first synthetic dataset, called "A", is generated by two Gaussian distributions with standard deviation equal to 0.9; the first class is centred at (0, 0) and the second class at (1, 1). The second synthetic dataset, called "B", consists of three classes following Gaussian distributions with standard deviation equal to 1.0, with class centres at (-1, -1), (1, 1) and (-1, 2).

We also use several real datasets to test the performance of our proposal. Most of them were taken from the UCI Machine Learning Repository [1]. We also use subsets of the MNIST [15] and CIFAR10 [14] datasets for tasks with high dimensionality. We describe them briefly. The Iris dataset is a classical non-linearly separable problem with three types of iris plant and four describing features. The Liver dataset consists of detecting a liver disorder (two classes) using six features related to blood tests and alcohol consumption. The Glass dataset has the purpose of identifying the type of glass found at crime scenes using 10 properties of the glass; the number of classes is six. The Page Blocks dataset requires classifying blocks of documents into five categories based on 10 geometrical block features. The Letter Recognition dataset contains 26 letters from different fonts which must be identified from 16 geometrical and statistical features. The Mice Protein Expression dataset classifies mice into eight classes defined by genotype, behaviour and treatment; in this work, we only use the expression levels of 77 proteins as features. Finally, we apply the proposed method to two image classification tasks, taking raw pixels as features (without intermediate feature extraction): the MNIST dataset is a classical handwritten digit recognition task (0 to 9) with images of 28 × 28 pixels, and CIFAR10 requires recognising 10 types of animals and vehicles from colour images of 32 × 32 pixels. It is worth mentioning that we found duplicates in some of these datasets and deleted them for the experiments. Table 3 summarises the dataset properties in terms of the number of features N, the number of classes N_c, the number of training samples Q_train and the number of testing samples Q_test.

Table 3: Summary of the dataset properties.

Dataset | N | Nc | Qtrain | Qtest
A | 2 | 2 | 1000 | 200
B | 2 | 3 | 1500 | 300
Iris | 4 | 3 | 120 | 30
Liver | 6 | 2 | 300 | 45
Glass | 10 | 6 | 170 | 44
Page Blocks | 10 | 5 | 4450 | 959
Letter Recog. | 16 | 26 | 15071 | 3917
Mice Protein | 77 | 8 | 900 | 180
MNIST | 784 | 10 | 1000 | 200
CIFAR10 | 3072 | 10 | 1000 | 200

5.2. Comparison with Other DMN Training Methods

In this subsection, we analyse the performance of the DMN trained by SGD in comparison with training methods based solely on heuristics. We also compare the performance of the different ways of initialising the hyperboxes that we proposed in Section 4. These comparisons are made on the datasets described above. We measure performance in terms of the training error E_train, the testing error E_test, the training time t_train and the model complexity N_H / Q_train. This last quantity is the ratio of the number of hyperboxes N_H to the number of training samples Q_train. If this ratio equals 1, the neuron uses one hyperbox per training sample, and in this extreme case the model is very complex; if the ratio is close to zero, the number of hyperboxes is much smaller than the number of training samples and the model is simple. Ideally, we are looking for a morphological neuron with relatively few hyperboxes (N_H / Q_train ≈ 0), with the smallest testing error E_test and with E_train ≈ E_test; that is, a simple and well-generalised learning model.

As seen in Section 3, several ways of training a DMN have been reported in the literature. We compare our proposal only with the principal algorithms: Elimination (SLMP) (Ritter and Urcid [22]), Hyperbox per Class (HpC) (a simplified version of the algorithm proposed in Sussner and Esmi [33]) and Divide and Conquer (D&C) (Sossa and Guevara [29]). Elimination training encloses all patterns in a hyperbox and eliminates misclassified patterns by enclosing them with several other hyperboxes. HpC training encloses the patterns of each class with one hyperbox. D&C training is based on a divide-and-conquer strategy: it begins by enclosing all patterns in one hyperbox, divides the hyperbox into 2^N sub-hyperboxes if there is misclassification, and calls this procedure recursively for each sub-hyperbox until the patterns are classified within a given tolerance error. It is important to note that all these training methods are based on heuristics, not on optimisation like our proposal, so we expect SGD training to do better. Additionally, we compare the four initialisation methods proposed in Section 4 for SGD training: HpC, dHpC, D&C and k-means.

Figure 5 summarises our results, showing that SGD training systematically obtains smaller testing errors than the other heuristics on every dataset. Furthermore, when we initialise SGD training with the dHpC method, we obtain better testing errors on average than with the other initialisation methods. On average, the best initialisation method in terms of the fewest dendrites is HpC, and in terms of the least training time it is k-means. We conclude that SGD gives better results than solely heuristic-based methods for training DMNs. For detailed results, see Tables 4 and 5.
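As a small illustration of how the entries of Tables 4 and 5 are obtained, the helper below computes the training error, the testing error and the model complexity for any classifier; the function name and signature are ours, not the paper's.

```python
import numpy as np

def evaluate(predict, X_train, T_train, X_test, T_test, num_hyperboxes):
    """Metrics reported in Tables 4 and 5: errors in percent and N_H / Q_train."""
    e_train = 100.0 * np.mean([predict(x) != t for x, t in zip(X_train, T_train)])
    e_test = 100.0 * np.mean([predict(x) != t for x, t in zip(X_test, T_test)])
    complexity = num_hyperboxes / len(X_train)
    return e_train, e_test, complexity
```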


Figure 5: Comparison with other DMN training methods. (a) Testing errors for each dataset obtained by the heuristic-based training methods, a) SMLP, b) D&C and c) HpC, and by SGD training started from the four different initialisation methods, d) SGD (HpC), e) SGD (dHpC), f) SGD (D&C) and g) SGD (k-means). (b) Averages over all datasets of the testing error E_test, the model complexity N_H / Q_train and the training time t_train. The best value is marked in blue; some quantities are close to zero, so the corresponding bars are not visible.

SMLP D&C HpC SGD (HpC) SGD (dHpC) SGD (D&C) SGD (k-means)

B

SMLP D&C HpC SGD (HpC) SGD (dHpC) SGD (D&C) SGD (k-means)

Iris

SMLP D&C HpC SGD (HpC)

SGD (D&C)

SGD (k-means) Liver

NH /Qtrain

Etrain %

Etest %

ttrain ,seg

0.225 0.013 0.002 0.002 0.002 0.013 0.002 0.192 0.002 0.002 0.002 0.002 0.002 0.002 0.100 0.225 0.025 0.025 0.025 0.225 0.250 0.277 0.290 0.007 0.007 0.107 0.290 0.083 0.035 0.153 0.035 0.035 1.129 0.153 0.035

0.00 21.40 24.90 21.20 21.20 20.90 21.60 0.00 18.80 17.27 16.07 16.07 16.53 15.53 0.00 1.67 5.00 3.33 3.33 0.00 0.83 0.00 19.00 35.67 31.00 24.33 6.33 7.33 0.00 26.47 18.82 8.24 10.00 25.29 1.18

55.00 21.50 25.00 21.50 21.50 20.00 22.00 40.67 21.00 17.67 16.33 16.33 16.67 15.33 10.00 3.30 0.00 0.00 0.00 0.00 0.00 73.33 37.78 40.00 35.56 24.44 31.11 31.11 100.00 29.55 29.55 13.64 9.09 20.45 27.27

98.4 0.2 0.0 10.6 10.6 16.5 0.0 166.1 0.1 0.0 19.4 18.4 6.6 1.3 0.0 0.1 0.0 1.8 1.8 6.7 0.0 3.8 0.2 0.0 103.0 2.1 309.5 5.4 0.0 0.1 0.0 1.5 0.3 102.4 0.0

ED

SGD (dHpC)

NH

225 13 2 2 2 13 2 288 3 3 3 3 3 3 12 27 3 3 3 27 30 83 87 2 2 32 87 25 6 26 6 6 192 26 6

SMLP

PT

D&C HpC

SGD (HpC)

SGD (dHpC)

CE

SGD (D&C)

SGD (k-means)

Glass

SMLP D&C

AC

AN US

Training

A

M

Dataset

HpC

SGD (HpC) SGD (dHpC) SGD (D&C) SGD (k-means)

CR IP T

Table 4: We report our results in detail. For each dataset and each training method, we evaluate the number of dendrites NH , the model complexity NH /Qtrain , the training error Etrain , the testing error Etest and the training time ttrain . The best values are shown in bold. This table shows only the results for the first five datasets; see Table 5 for the rest.


SMLP D&C HpC SGD (HpC) SGD (dHpC) SGD (D&C) SGD (k-means)

Letter Rec.

SMLP D&C HpC SGD (HpC) SGD (dHpC) SGD (D&C) SGD (k-means)

Mice Protein

SMLP D&C HpC SGD (HpC)

SGD (D&C)

SGD (k-means) MNIST

NH /Qtrain

Etrain %

Etest %

ttrain ,seg

0.040 0.053 0.001 0.001 0.001 0.053 0.001 0.270 0.473 0.002 0.002 0.007 0.473 0.088 0.054 0.899 0.009 0.009 0.036 0.899 0.261 0.347 1.000 0.010 0.010 0.010 1.000 0.182 0.098 1.000 0.010 0.010 0.040 1.000 0.010

0.00 2.00 22.63 4.40 4.40 5.57 9.06 2.98 8.36 63.82 32.34 23.42 0.49 7.68 0.00 0.00 8.33 7.89 6.33 0.00 3.33 0.00 0.00 20.40 11.60 11.60 0.00 0.80 0.00 0.00 48.60 4.80 3.80 0.00 76.60

14.91 8.03 28.78 7.19 7.19 7.92 12.20 20.78 31.58 63.39 33.93 25.81 22.06 12.61 63.33 5.00 15.56 15.56 12.78 4.44 11.67 61.00 80.00 39.00 23.50 23.50 36.00 15.50 99.50 83.00 77.50 77.50 77.00 83.00 79.50

174.5 0.7 0.0 1.0 1.0 10.7 0.1 60426.2 60.2 0.0 19.2 1352.5 1278.2 69.6 1.7 2.3 0.0 0.5 154.1 6.5 2.7 110.7 18.3 0.0 89.0 89.0 2318.2 35.3 20.3 72.1 0.1 36.5 380.5 715.3 0.3

ED

SGD (dHpC)

NH

179 235 5 5 5 235 5 4072 7126 26 26 104 7126 1322 49 809 8 8 32 809 235 347 1000 10 10 10 1000 182 98 1000 10 10 40 1000 10

SMLP

PT

D&C HpC

SGD (HpC)

SGD (dHpC)

CE

SGD (D&C)

SGD (k-means)

CIFAR10

SMLP D&C

AC

AN US

Training

M

Dataset

Page Blocks

HpC

SGD (HpC) SGD (dHpC) SGD (D&C) SGD (k-means)

CR IP T

Table 5: We report our results in detail. For each dataset and each training method, we evaluate the number of dendrites NH , the model complexity NH /Qtrain , the training error Etrain , the testing error Etest and the training time ttrain . The best values are shown in bold. This table shows only the results for last five datasets; see Table 4 for the rest.

5.3. Comparison with Other Machine Learning Methods

In this subsection, we compare the performance of popular machine learning methods with our proposal, DMN-SGD. The objective is to show that our proposal obtains better results on low-dimensional datasets and competitive results on high-dimensional datasets. These comparisons are made on the datasets described above. As popular machine learning methods, we chose two-layer perceptrons (MLP) (Rumelhart et al. [25]), the support vector machine (SVM) (Cortes and Vapnik [5]) and the radial basis function network (RBFN) (Broomhead and Lowe [4]). We summarise our results in Figure 6: DMN-SGD obtains the smallest testing errors on some datasets (A, B, Iris and Liver); on others (Glass, Page Blocks, Letters and Mice Protein) it presents errors close to the best result; and on the remaining datasets (MNIST and CIFAR10) it obtains the worst error. We can infer that the DMN-SGD algorithm works better on low-dimensional datasets (fewer features) than on high-dimensional ones. On the other hand, averaged over these 10 datasets, the testing error of DMN-SGD is the second best, behind the MLP model. The main drawback is that DMN-SGD needs more model parameters Npm than the other machine learning models, although on average it trains faster than the MLP. We conclude that a DMN trained by SGD can be another useful technique in the machine learning toolbox for practitioners, because it obtains competitive results with respect to popular machine learning methods on classification problems. Detailed results are given in Table 6.

Figure 6: Comparison with other machine learning methods. We compare our proposal DMN-SGD with popular machine learning methods. (a) Testing errors for each dataset obtained by DMN, MLP, SVM and RBFN. (b) Averages over all datasets of the testing error E_test, the number of learning parameters N_pm and the training time t_train. The best value is marked in blue; some quantities are close to zero, so the corresponding bars are not visible.

Table 6: Detailed results. For each dataset and each machine learning method, we report the number of learning parameters Npm, the training error Etrain, the testing error Etest and the training time ttrain.

Dataset | ML model | Npm | Etrain % | Etest % | ttrain (s)
A | DMN-SGD (D&C) | 52 | 20.90 | 20.00 | 16.5
A | MLP | 252 | 21.20 | 21.50 | 7.6
A | SVM | 10 | 21.10 | 22.00 | 0.5
A | RBFN | 22 | 21.90 | 22.00 | 0.1
B | DMN-SGD (k-means) | 12 | 15.53 | 15.33 | 1.3
B | MLP | 63 | 15.80 | 15.67 | 3.5
B | SVM | 10 | 15.80 | 16.33 | 0.5
B | RBFN | 75 | 16.40 | 16.33 | 0.1
Iris | DMN-SGD (D&C) | 216 | 0.00 | 0.00 | 6.7
Iris | MLP | 83 | 0.83 | 0.00 | 1.2
Iris | SVM | 21 | 4.17 | 3.33 | 0.5
Iris | RBFN | 51 | 3.33 | 0.00 | 0.1
Liver | DMN-SGD (dHpC) | 384 | 24.33 | 24.44 | 2.1
Liver | MLP | 272 | 7.33 | 24.44 | 3.4
Liver | SVM | 18 | 29.33 | 35.56 | 0.4
Liver | RBFN | 299 | 21.00 | 35.56 | 0.1
Glass | DMN-SGD (dHpC) | 3840 | 10.00 | 9.09 | 0.3
Glass | MLP | 1706 | 5.29 | 6.82 | 8.7
Glass | SVM | 78 | 2.94 | 9.09 | 0.6
Glass | RBFN | 244 | 1.76 | 25.00 | 0.1
Page Blocks | DMN-SGD (HpC) | 100 | 4.40 | 7.19 | 1.0
Page Blocks | MLP | 1605 | 3.64 | 6.47 | 9.2
Page Blocks | SVM | 65 | 4.02 | 5.84 | 1.1
Page Blocks | RBFN | 981 | 3.80 | 7.61 | 0.6
Letter Rec. | DMN-SGD (k-means) | 42304 | 7.68 | 12.61 | 69.6
Letter Rec. | MLP | 2176 | 8.63 | 11.26 | 1685.9
Letter Rec. | SVM | 494 | 39.36 | 41.03 | 47.1
Letter Rec. | RBFN | 84865 | 1.84 | 5.28 | 193.2
Mice Protein | DMN-SGD (D&C) | 124586 | 0.00 | 4.44 | 6.5
Mice Protein | MLP | 4308 | 0.00 | 0.00 | 232.0
Mice Protein | SVM | 640 | 0.11 | 0.56 | 1.0
Mice Protein | RBFN | 10414 | 2.78 | 5.56 | 0.2
MNIST | DMN-SGD (k-means) | 285376 | 0.80 | 15.50 | 35.3
MNIST | MLP | 39760 | 1.20 | 11.00 | 148.4
MNIST | SVM | 7870 | 2.70 | 11.50 | 2.7
MNIST | RBFN | 222610 | 0.90 | 7.00 | 2.2
CIFAR10 | DMN-SGD (dHpC) | 245760 | 3.80 | 77.00 | 380.5
CIFAR10 | MLP | 154160 | 50.70 | 70.50 | 103.8
CIFAR10 | SVM | 30750 | 18.40 | 65.00 | 8.7
CIFAR10 | RBFN | 369970 | 24.20 | 62.00 | 3.4

5.4. DMN vs MLP

In the previous subsection, we saw that the performance of the DMN is competitive with popular machine learning techniques; in particular, the MLP is marginally better than the DMN on average. However, there are problems where the DMN far outstrips the MLP, and the purpose of this subsection is to show empirically the greatest benefit of DMNs. According to our proposal, training either a DMN or an MLP requires two steps: 1) initialising the learning parameters of the model and 2) optimising them. In the MLP, initialisation is a very simple step: the initial parameters are chosen at random from some convenient probability distribution (Glorot and Bengio [10], He et al. [12]), and these random initial values rarely give a good decision boundary, so the optimisation has to search harder to find the best parameters. In the DMN, in contrast, the initialisation of the dendritic weights exploits their geometric interpretation as hyperboxes to facilitate the optimisation, solving problems that classical perceptrons cannot, or at least solving them with much less work. This is the greatest benefit of DMNs, and it is where the heuristic methods excel: they provide a smarter parameter initialisation.

To illustrate this, we evaluate the performance of both models (MLP and DMN) on three synthetic classification problems: an XOR-type problem, a double spiral with two laps and a double spiral with 10 laps. The spirals, in particular, are difficult problems for MLPs because MLPs tend to create open decision boundaries with no laps. The experimental results are shown in Figure 7. It is clearly noticeable that, in all three cases, the optimisation requires more iterations for the MLP than for the DMN. We also notice that optimisation in the DMN is just a process of refining the initial model, while in the MLP it is the decisive step, because the initial models are poor. Finally, the most important observation is that the MLP cannot solve the double spiral with 10 laps (or more), even though the decision boundary follows a mathematically simple pattern. The DMN, instead, initialises the decision boundary with a good approximation using the D&C strategy, allowing the optimisation step to refine the model in only a few iterations. Of course, more research is needed to see in which real problems this behaviour appears; here, we just pave the way towards this research.


Figure 7: DMN vs MLP. This figure illustrates why the initialisation of the dendritic weights is the greatest benefit of the DMN. This initialisation: 1) makes the optimisation step easier, reducing the number of iterations required during optimisation; and 2) solves problems that classical perceptrons cannot solve, or can solve only with much more work. These properties stem from the geometric interpretation of the dendritic weights as hyperboxes, which allows initialisation methods based on heuristics. This is not possible with an MLP.


6. Conclusion and Future Research

492

In this paper, we reviewed the different neural architectures and training methods for morphological neurons. Most of these training methods are based on heuristics that exploit the geometrical properties of hyperboxes. The main contribution of this paper is that the neural architecture of the dendrite morphological neuron (DMN) was extended so that it can be trained by stochastic gradient descent (SGD) for classification problems. We proposed and evaluated four different methods for initialising the dendrite parameters at the beginning of SGD training. Our experimental results contradict what is implicitly claimed in the literature, namely that DMN training based purely on heuristics needs no optimisation: the experiments show that our approach trained by SGD systematically obtains smaller classification errors than the training methods based solely on heuristics. At the same time, we show that these heuristics are useful for initialising the dendrite parameters before optimising them. This is an advantage over the MLP, whose learning parameters can only be initialised randomly; in fact, we show that the DMN can solve some problems that the MLP cannot, or can solve only at a much higher cost. Our results also show that the best performance of DMNs trained by SGD was obtained at low dimensionality, and that performance decreases at high dimensionality. We think more exploration is needed to improve results for high-dimensional problems, in terms of both new neural architectures and optimisation-based training for DMNs. This remains an open issue for future research.

Acknowledgements

The authors would like to acknowledge the support for this research work by the UPIITA and the CIC, both schools belonging to the Instituto Politécnico Nacional (IPN). This work was economically supported by SIP-IPN [grant numbers 20160945 and 20161116] and CONACYT [grant numbers 155014 (Basic Research) and 65 (Frontiers of Science)].

References

[1] Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. Irvine, CA: University of California, Department of Information and Computer Science. http://www.ics.uci.edu/~mlearn/MLRepository.html

[2] Barmpoutis, A., Ritter, G. X., 2006. Orthonormal basis lattice neural networks. In: Fuzzy Systems, 2006 IEEE International Conference on. pp. 331–336.

[3] Bishop, C. M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[4] Broomhead, D. S., Lowe, D., 1988. Multivariable functional interpolation and adaptive networks. Complex Systems 2, 321–355.

[5] Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20 (3), 273–297. URL http://dx.doi.org/10.1023/A:1022627411411

[6] Davidson, J. L., Hummer, F., 1993. Morphology neural networks: An introduction with applications. Circuits, Systems and Signal Processing 12 (2), 177–210.

[7] Davidson, J. L., Ritter, G. X., 1990. Theory of morphological neural networks. URL http://dx.doi.org/10.1117/12.18085

[8] Davidson, J. L., Sun, K., Jul. 1991. Template learning in morphological neural nets. In: Gader, P. D., Dougherty, E. R. (Eds.), Image Algebra and Morphological Image Processing II. Vol. 1568. pp. 176–187.

[9] de A. Araujo, R., 2012. A morphological perceptron with gradient-based learning for Brazilian stock market forecasting. Neural Networks 28, 61–81. URL http://www.sciencedirect.com/science/article/pii/S0893608011003200

[10] Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y. W., Titterington, D. M. (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10). Vol. 9. pp. 249–256. URL http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

[11] Golik, P., Doetsch, P., Ney, H., Aug. 2013. Cross-entropy vs. squared error training: a theoretical and experimental comparison. In: Interspeech. Lyon, France, pp. 1756–1760.

[12] He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034.

[13] Kline, D. M., Berardi, V. L., 2005. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Computing & Applications 14 (4), 310–318.

[14] Krizhevsky, A., 2009. Learning Multiple Layers of Features from Tiny Images. https://www.cs.toronto.edu/~kriz/cifar.html

[15] LeCun, Y., Cortes, C., Burges, C. J. C., 1998. The MNIST dataset of handwritten digits. New York: NYU, Google Labs and Microsoft Research. http://yann.lecun.com/exdb/mnist/

[16] Montavon, G., Orr, G. B., Müller, K.-R. (Eds.), 2012. Neural Networks: Tricks of the Trade, Reloaded, 2nd Edition. Vol. 7700 of Lecture Notes in Computer Science (LNCS). Springer.

[17] Pessoa, L. F., Maragos, P., 2000. Neural networks with hybrid morphological/rank/linear nodes: a unifying framework with applications to handwritten character recognition. Pattern Recognition 33 (6), 945–960. URL http://www.sciencedirect.com/science/article/pii/S0031320399001570

[18] Ritter, G. X., Davidson, J. L., 1991. Recursion and feedback in image algebra. URL http://dx.doi.org/10.1117/12.47965

[19] Ritter, G. X., Schmalz, M. S., 2006. Learning in lattice neural networks that employ dendritic computing. In: Fuzzy Systems, 2006 IEEE International Conference on. pp. 7–13.

[20] Ritter, G. X., Sussner, P., Aug 1996. An introduction to morphological neural networks. In: Pattern Recognition, 1996. Proceedings of the 13th International Conference on. Vol. 4. pp. 709–717.

[21] Ritter, G. X., Urcid, G., Mar 2003. Lattice algebra approach to single-neuron computation. IEEE Transactions on Neural Networks 14 (2), 282–295.

[22] Ritter, G. X., Urcid, G., 2007. Computational Intelligence Based on Lattice Theory. Springer Berlin Heidelberg, Berlin, Heidelberg, Ch. Learning in Lattice Neural Networks that Employ Dendritic Computing, pp. 25–44.

[23] Ritter, G. X., Urcid, G., Iancu, L., 2003. Reconstruction of patterns from noisy inputs using morphological associative memories. Journal of Mathematical Imaging and Vision 19 (2), 95–111. URL http://dx.doi.org/10.1023/A:1024773330134

[24] Ritter, G. X., Urcid, G., Juan-Carlos, V. N., July 2014. Two lattice metrics dendritic computing for pattern recognition. In: Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International Conference on. pp. 45–52.

[25] Rumelhart, D. E., Hinton, G. E., Williams, R. J., 1986. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge, MA, USA, Ch. Learning Internal Representations by Error Propagation, pp. 318–362. URL http://dl.acm.org/citation.cfm?id=104279.104293

[26] Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural Networks 61, 85–117. Published online 2014; based on TR arXiv:1404.7828 [cs.NE].

[27] Sossa, H., Cortés, G., Guevara, E., 2014. New Radial Basis Function Neural Network Architecture for Pattern Classification: First Results. Springer International Publishing, Cham, pp. 706–713.

[28] Sossa, H., Guevara, E., 2013. Pattern Recognition: 5th Mexican Conference, MCPR 2013, Querétaro, Mexico, June 26–29, 2013. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, Ch. Modified Dendrite Morphological Neural Network Applied to 3D Object Recognition, pp. 314–324.

[29] Sossa, H., Guevara, E., 2014. Efficient training for dendrite morphological neural networks. Neurocomputing 131, 132–142. URL http://www.sciencedirect.com/science/article/pii/S0925231213010916

[30] Sussner, P., Sep 1998. Morphological perceptron learning. In: Intelligent Control (ISIC), 1998. Held jointly with IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Intelligent Systems and Semiotics (ISAS), Proceedings. pp. 477–482.

[31] Sussner, P., Esmi, E. L., 2009. Constructive Neural Networks. Springer Berlin Heidelberg, Berlin, Heidelberg, Ch. Constructive Morphological Neural Networks: Some Theoretical Aspects and Experimental Results in Classification, pp. 123–144.

[32] Sussner, P., Esmi, E. L., June 2009. An introduction to morphological perceptrons with competitive learning. In: Neural Networks, 2009. IJCNN 2009. International Joint Conference on. pp. 3024–3031.

[33] Sussner, P., Esmi, E. L., 2011. Morphological perceptrons with competitive learning: Lattice-theoretical framework and constructive learning algorithm. Information Sciences 181 (10), 1929–1950. Special Issue on Information Engineering Applications Based on Lattices. URL http://www.sciencedirect.com/science/article/pii/S0020025510001283

[34] Urcid, G., Ritter, G. X., 2006. Noise masking for pattern recall using a single lattice matrix auto-associative memory. In: Fuzzy Systems, 2006 IEEE International Conference on. pp. 187–194.

[35] Weinstein, R. K., Lee, R. H., 2006. Architectures for high-performance FPGA implementations of neural models. Journal of Neural Engineering 3 (1), 21. URL http://stacks.iop.org/1741-2552/3/i=1/a=003

[36] Zamora, E., 2016. Dendrite Morphological Neurons Trained by Stochastic Gradient Descend - Code. https://github.com/ezamorag/DMN_by_SGD


Humberto Sossa was born in Guadalajara, Jalisco, Mexico, in 1956. He received a B.Sc. degree in Electronics from the University of Guadalajara in 1981, an M.Sc. in Electrical Engineering from CINVESTAV-IPN in 1987, and a Ph.D. in Informatics from the National Polytechnic Institute of Grenoble, France, in 1992. He is a full-time professor at the Centre for Computing Research of the National Polytechnic Institute of Mexico. His main research interests are pattern recognition, artificial neural networks, image analysis, and robot control using image analysis.


Erik Zamora is a full professor with the National Polytechnic Institute of Mexico. He received his Diploma in Electronics from UV (2004), his Master's degree in Electrical Engineering from CINVESTAV (2007), and his Ph.D. in Automatic Control from CINVESTAV (2015). He developed the first commercial Mexican myoelectric system for controlling a prosthesis at the Pro/Bionics company, and a robotic navigation system for unknown environments guided by emergency signals at the University of Bristol. He holds a postdoctoral position at CIC-IPN. His current interests include autonomous robots and machine learning. He has published eight technical papers in international conference proceedings and journals and has supervised nine theses on these topics.
