Dendrite Morphological Neurons Trained by Stochastic Gradient Descent
Erik Zamora a,∗ , Humberto Sossa b

a Instituto Politécnico Nacional, UPIITA, Av. Instituto Politécnico Nacional 2580, Col. Barrio la Laguna Ticoman, Gustavo A. Madero, 07340 Ciudad de México, México
b Instituto Politécnico Nacional, CIC, Av. Juan de Dios Batiz S/N, Col. Nueva Industrial Vallejo, Gustavo A. Madero, 07738 Ciudad de México, México
Abstract
Dendrite morphological neurons are a type of artificial neural network that works with min and max operators instead of algebraic products. These morphological operators build hyperboxes in N-dimensional space, which allows training methods based on heuristics without using an optimisation method. In the literature, it has been claimed that these heuristic-based training methods have several advantages: there are no convergence problems, perfect classification can always be reached, and training is performed in only one epoch. In this paper, we show that these assumed advantages come at a cost: the heuristics increase classification errors on the test set because they are not optimal and their learning generalisation is poor. To solve these problems, we introduce a novel method to train dendrite morphological neurons for classification tasks based on stochastic gradient descent, using these heuristics only to initialise the learning parameters. Experiments show that we can reduce the testing error in comparison with solely heuristic-based training methods, and that this approach reaches competitive performance with respect to other popular machine learning algorithms.
Keywords: Dendrite morphological neural network, Morphological perceptron, Gradient descent, Machine learning, Neural network

∗ Corresponding author. Tel.: +52 5545887175. Email addresses: [email protected], [email protected] (Erik Zamora)

1. Introduction

Morphological neural networks are an alternative way to model pattern classes. Classical perceptrons divide the input space into several regions, using hyperplanes as decision boundaries; when more layers are added, they divide the input space using hypersurfaces. In contrast, morphological neurons divide the same space with several piecewise-linear boundaries that together can create complex nonlinear decision boundaries, allowing separation of classes with only one neuron (Figure 1). This is not possible with a one-layer perceptron because it is a linear model. Morphological processing involves min and max operations instead of multiplications. These comparison operators generate the piecewise boundaries for classification problems and have the advantage of being easier to implement in logic devices, such as FPGAs (Field Programmable Gate Arrays) (Weinstein and Lee [35]), than classical perceptrons.

This paper focuses on a specific type of morphological neuron called a Dendrite Morphological Neuron (DMN). This neuron has dendrites, and each dendrite can be seen as a hyperbox in high-dimensional space (a box in 2D). These hyperboxes have played an important role in the proposal of training methods. The common approach is to enclose patterns with hyperboxes and label each hyperbox with the right class. Many training methods that exploit this approach have been reported. However, most of them are heuristics and are not based on the optimisation of learning parameters. Heuristics can be useful when numerical optimisation is computationally very expensive or impossible to compute. In this paper, we show this is not the case: an optimisation of the learning parameters can be computed in a reasonable time and can improve the classification performance of a DMN over solely heuristic-based methods.

The most important training method in the scope of classical perceptrons has been gradient descent (Schmidhuber [26]). This simple but powerful tool has been applied to train multilayer perceptrons through the backpropagation algorithm, which results from applying the chain rule from the last layer to the first. Even though morphological networks have more than 25 years of development reported in the literature and backpropagation is the most successful training algorithm, most training methods for morphological neurons are based on intuition rather than on gradient descent. This is true at least for classification tasks: an approach based on gradient descent has been published in de A. Araujo [9], but only for regression. The extension of this approach to classification is not trivial due to the non-differentiability of morphological operations. In this paper, we deal with this problem by adding a softmax layer (Bishop [3]) at the end of a DMN, calculating the gradients, and evaluating four different methods to initialise the dendrite parameters. The contributions of this paper are:

• We extensively review different neural architectures and training methods for morphological neurons.
• To the best of our knowledge, for the first time, the DMN architecture is extended with a softmax layer to be trained by stochastic gradient descent (SGD) for classification tasks.
• We propose and evaluate four different methods based on heuristics to initialise dendrite parameters before training by SGD.
• We perform several experiments to show that our approach systematically obtains smaller classification errors than solely heuristic-based training methods.
• The code for our approach is available at Zamora [36].
The paper is organised as follows. In section 2, we present the background needed to understand the rest of the paper. In section 3, we provide an extensive review of morphological neural networks. In section 4, we introduce our approach to training a DMN based on SGD. In section 5, we present the experiments and results, together with a short discussion, to show the effectiveness of our proposal. Finally, in section 6, we give our conclusions and directions for future research.

2. Basics of DMNs

A DMN has been shown to be useful for pattern classification (Sossa and Guevara [28, 29], Sussner and Esmi [33]). This neuron works similarly to a radial basis function network (RBFN), but it uses hyperboxes instead of radial basis functions and it has only one layer. Each dendrite generates a hyperbox, and the most active dendrite is the one whose hyperbox is closest to the input pattern. The output y of this neuron (see Figure 1) is a scalar given by

y = \arg\max_c (d_{n,c}),   (1)

where n is the dendrite number, c is the class number and d_{n,c} is the scalar output of a dendrite, given by

d_{n,c} = \min(\min(x − w^n_{min}, w^n_{max} − x)),   (2)

where x is the input vector and w^n_{min}, w^n_{max} are the dendrite weight vectors. Together, the min operators check whether x is inside the hyperbox delimited by w^n_{min} and w^n_{max} as its extreme points, according to Figure 1. If d_{n,c} > 0, x is inside the hyperbox; if d_{n,c} = 0, x is somewhere on the hyperbox boundary; otherwise, it is outside. The inner min operator in (2) compares elements of the same dimension between the two vectors, generating another vector, while the outer min operator takes the minimum among all dimensions, generating a scalar. The activation function is the arg max operator proposed by Sussner and Esmi [33], instead of the hard limit function commonly used in the SLMP (Ritter and Urcid [21]). This arg max operator allows building more complex decision boundaries than simple hyperboxes (see Figure 1). It is worth mentioning that if (1) produces several maxima, the arg max operator takes the first maximum as the index of the class to which the input pattern is assigned. Qualitatively speaking, this occurs when x is equidistant to more than one hyperbox.
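To make Equations (1) and (2) concrete, the following is a minimal NumPy sketch of the forward pass of a conventional DMN. It is only an illustration under our own conventions (dendrites stored as rows of `w_min`/`w_max`, with one class label per dendrite), not the authors' released code.

```python
import numpy as np

def dendrite_outputs(x, w_min, w_max):
    """Eq. (2): for every dendrite, take the element-wise min of
    (x - w_min) and (w_max - x), then the min over all dimensions.
    d > 0 means x lies strictly inside the hyperbox, d = 0 on its
    boundary and d < 0 outside."""
    inner = np.minimum(x - w_min, w_max - x)  # shape (H, N): inner min
    return inner.min(axis=1)                  # shape (H,): outer min

def dmn_predict(x, w_min, w_max, dendrite_class):
    """Eq. (1): the pattern is assigned the class of the most active
    dendrite (the hyperbox closest to, or containing, x)."""
    d = dendrite_outputs(x, w_min, w_max)
    return dendrite_class[np.argmax(d)]

# Toy example: two unit hyperboxes in 2D, one dendrite per class.
w_min = np.array([[0.0, 0.0], [2.0, 2.0]])
w_max = np.array([[1.0, 1.0], [3.0, 3.0]])
dendrite_class = np.array([0, 1])
print(dmn_predict(np.array([0.4, 0.6]), w_min, w_max, dendrite_class))  # prints 0
```

Storing one class label per dendrite and taking the argmax of the responses is equivalent to the per-class indexing d_{n,c} used in Equation (1).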
Figure 1: Conventional DMN. We show: (a) the neural architecture; (b) an example of a hyperbox in 2D generated by its dendrite weights. A dendrite output is positive when the pattern is in the green region (inside the corresponding hyperbox), zero when the pattern is in the red region (on the hyperbox boundary), and negative in the white region (outside the hyperbox); (c) an example of DMN hyperboxes for a classification problem; (d) the decision boundaries generated by a DMN for the same classification problem as in (c). Note that the decision boundaries can be nonlinear and a class region can consist of disconnected regions. These properties allow a DMN to solve any classification problem, separating classes within a given tolerance (Ritter and Urcid [22]).
3. Related Work

In this section, we present the different neural networks that use morphological operations (max, min, +) that have been proposed in the literature over the
last 25 years. We discuss their advantages and disadvantages in terms of the neural architectures and training methods. In Figure 2, we show a comprehensive review of the different types of decision boundaries that morphological neural networks can build in order to classify patterns. At the end of this section, a comprehensive comparison is given in Tables 1 and 2.

Morphological neurons were created as a combination of image algebra and artificial neural networks. Ritter and Davidson proposed the first morphological neural network model in their seminal papers (Davidson and Ritter [7], Davidson and Sun [8], Ritter and Davidson [18], Davidson and Hummer [6]) for template learning in dilation image processing. Three different learning rules were given for this specific task. In Ritter and Sussner [20], Ritter and Sussner proposed a more general model for binary classification, called the Single-Layer Morphological Perceptron (Single-Layer MP). These authors studied the computing capabilities of the Single-Layer MP, showing that: 1) the Single-Layer MP classifies data using semi-infinite hyperboxes parallel to the Cartesian axes as decision boundaries, 2) the Single-Layer MP can solve some classification problems with nonlinear separability, and 3) morphological networks can be used to form associative memories. In Sussner [30], Sussner extends this work, proposing
Figure 2: Morphological boundaries. We show the types of decision boundary for different morphological neural networks. (a) Single-Layer MP uses semi-infinite hyperboxes that are parallel to the Cartesian axes. (b) Two-Layer MP separates classes by hyperplanes parallel to the Cartesian axes. (c) Multilayer MRL-NNs classify data by general hyperplanes. Note that the decision boundaries for the next three networks are closed surfaces. (d) SLMP uses hyperboxes parallel to the Cartesian axes. (e) OB-LNN uses rotated hyperboxes. (f) L1, L∞-SLLP uses polytopes. Finally, (g) MP/CL and DMNN present complex nonlinear surfaces as decision boundaries due to dendrite processing and the argmax operator as activation function. In general, morphological neural networks have superior classification ability to classical single-layer perceptrons, and some morphological networks such as (c) and (g) have at least the same capability as multilayer perceptrons. Even so, morphological neurons are less popular than classical perceptrons among practitioners, engineers and researchers. We hope this work can help them to see the benefits of morphological neurons, particularly dendrite morphological neurons (g) (note that all acronyms are defined in the main text).
a Two-Layer Morphological Perceptron (Two-Layer MP) and giving a training method. In the following years, the hidden layer was to become known as dendrites. This Two-Layer MP is more powerful because it uses parallel hyperplanes to build arbitrary decision boundaries, e.g. to solve the XOR problem. In Ritter and Urcid [21], Ritter and Urcid proposed the first morphological neuron using dendrite processing. This neuron is also called the Single-Layer Morphological Perceptron (SLMP) or Single-Layer Lattice Perceptron (SLLP). Furthermore, the architecture of this neuron is different from that of a Single-Layer MP (Ritter and Sussner [20]). The SLMP separates classes using hyperboxes parallel to the Cartesian axes. In Ritter and Urcid [21], the authors prove that for any compact region in the classification space, we can always find an SLMP that separates this region from the rest within a given tolerance. A generalisation of this theorem to multiclass problems is presented in Ritter and Schmalz [19], Ritter and Urcid [22], ensuring that an SLMP can always classify different compact regions. Therefore, the SLMP is more powerful than the classical single-layer perceptron and presents similar abilities to the classical two-layer perceptron. The problem with this theoretical result is that the training method is key to achieving a useful approximation in practice.

The use of hyperboxes as decision boundaries allows the creation of training methods based on heuristics: 1) Elimination. This training (Ritter and Urcid [22]) encloses all patterns inside a hyperbox and eliminates misclassified patterns by enclosing them with several other hyperboxes. 2) Merging. This training (Ritter and Urcid [22]) encloses each pattern inside a hyperbox and then merges the hyperboxes. Some improvements have been made to the SLMP. In the standard SLMP, hyperboxes are parallel to the Cartesian axes. To eliminate this limitation, Barmpoutis and Ritter [2] proposed the Orthonormal Basis Lattice Neural Network (OB-LNN), using a rotation matrix that multiplies the inputs. The training can be performed based on Elimination or Merging, adding an optimisation procedure to determine the best orientation for each hyperbox. This last procedure can be time consuming, but it can reduce the number of hyperboxes. In Ritter et al. [24], a new single-layer lattice perceptron is proposed where dendrites use the L1 and L∞ metrics in order to soften the corners of the hyperboxes. As a result of this modification, this latter neural network is able to use polytopes as decision boundaries. The Elimination training method is extended for this kind of neuron.

If we use the argmax operator as activation function in the SLMP (instead of a hard limiter function), we can create more complex nonlinear surfaces (instead of hyperboxes) to separate classes. Sussner has proposed a Morphological Perceptron with Competitive Neurons (MP/CL) in Sussner and Esmi [31, 32, 33], where competition is implemented by the argmax operator. Sussner and colleagues have developed a training algorithm based on the idea of classifying patterns using one hyperbox per class. If there is any overlap between hyperboxes, the overlapping region is broken down into two smaller hyperboxes for each class. We call this training method Hyperbox per Class and Divide. Later, using the same neural network model but calling it Dendrite Morphological Neural Networks (DMNN), a new training method based on a Divide and Conquer strategy was proposed in Sossa and Guevara [29]. An application to recognise
3D objects using Kinect is presented in Sossa and Guevara [28]. This training has been shown to outperform Elimination training. It begins by enclosing all patterns in one hyperbox and then divides the hyperbox into 2^N sub-hyperboxes if there is misclassification (where N is the number of dimensions of the pattern space). This procedure is called recursively for each sub-hyperbox until each pattern is perfectly classified.

In another line of research, there have been two proposals to use gradient descent and backpropagation to train morphological neural networks. In Pessoa and Maragos [17], morphological/rank/linear neural networks (MRL-NNs) combine classical perceptrons with morphological/rank neurons. The output of each layer is a convex combination of linear and rank operations applied to the inputs. The training is based on gradient descent. MRL-NNs have been shown to solve complex classification problems, such as recognising digits in images (NIST dataset), generating similar or better results compared to classical MLPs in shorter training times. This approach uses no hyperboxes, which is a great difference with respect to our work; hyperboxes make it easy to initialise the learning parameters for gradient optimisation. In de A. Araujo [9], a neural model similar to the DMNN but with a linear activation function was applied to regression problems such as forecasting stock markets. This neural architecture is called the Increasing Morphological Perceptron (IMP) and is also trained by gradient descent. The drawback of these gradient-based training methods is that the number of hidden units (dendrites or rank operations) must be tuned as a hyperparameter; the advantage of creating more hyperboxes during training is lost. Our approach in this paper preserves this advantage by using heuristic-based training methods as pre-training. Special mention is given to Dendritic Lattice Associative Memories (DLAM, Ritter et al. [23], Urcid and Ritter [34], Ritter and Urcid [22]). This model can create associative memories using two layers of morphological neurons with a simple training method. Associative memories are beyond the scope of this paper.

In summary, morphological neurons have shown better or similar abilities compared to classical perceptrons: 1) decision boundaries are nonlinear with one layer, 2) units (dendrites/hyperboxes) are automatically generated during training in some models, 3) the number of units is always less than or equal to the number of training patterns, 4) training based on heuristics presents no convergence problems, and 5) training can be performed in one epoch. Another advantage claimed in the literature is that morphological neurons can achieve perfect classification for any training data. This is true for training methods like Elimination, Merging, Hyperbox per Class and Divide, and Divide and Conquer. However, perfect classification is inconvenient in practice because it leads to overfitting on the training data and increases the complexity of the learned model. A perfectly classifying model learns irrelevant noise in the data instead of finding the fundamental data structure. Furthermore, a perfect model is more complex in the sense that the number of learned parameters is usually similar to the number of training patterns. We solve these problems by modifying the neural architecture and applying SGD to train these neurons. Our proposal is inspired by
Table 1: We compare the different neural networks that use morphological operations in terms of: 1) the task they solve, 2) the geometrical properties of the decision boundaries, 3) the output domain, 4) the number of layers, and 5) whether these neural networks use dendrites or not.

| Neural network model | Task | Decision boundaries | Output domain | # layers | Dendrites |
| MN (Davidson and Hummer [6]) | Image dilation | - | R | 1 | No |
| Single-Layer MP (Ritter and Sussner [20]) | Binary classification | Semi-infinite parallel hyperboxes | {0, 1} | 1 | No |
| Two-Layer MP (Sussner [30]) | Binary classification | Parallel hyperplanes | {0, 1} | 2 | Yes |
| Multilayer MRL-NNs (Pessoa and Maragos [17]) | Multiclass classification | Hyperplanes | R or [0, 1] | >1 | No |
| SLMP (Ritter and Urcid [21]) | Multiclass classification | Parallel hyperboxes | {0, 1} | 1 | Yes |
| OB-LNN (Barmpoutis and Ritter [2]) | Multiclass classification | Rotated hyperboxes | {0, 1} | 1 | Yes |
| L1, L∞-SLLP (Ritter et al. [24]) | Multiclass classification | Polytopes | {0, 1} | 1 | Yes |
| DLAM (Ritter and Urcid [22]) | Associative memory | - | R | 2 | Yes |
| MP/CL (Sussner and Esmi [33]) | Multiclass classification | Complex nonlinear surface | N | 1 | Yes |
| IMP (de A. Araujo [9]) | Regression | - | R | 1 | Yes |
| DMNN (Sossa and Guevara [29]) | Multiclass classification | Complex nonlinear surface | N | 1 | Yes |
Table 2: We compare the different methods to train morphological neural networks in terms of: 1) the corresponding neural network model, 2) whether there is automatic dendrite creation or not, 3) whether the training is incremental or in batches, and 4) the maximum number of epochs for training. All listed training methods are for supervised learning.

| Training method | Neural network model | Automatic dendrite creation | Training type | Epochs |
| Template learning for image dilation (Davidson and Hummer [6]) | MN | No dendrites | Incremental | No limit |
| Single-Layer MP rule (Ritter and Sussner [20]) | Single-Layer MP | No dendrites | Incremental | Number of patterns |
| Two-Layer MP rule (Sussner [30]) | Two-Layer MP | Yes | Unique batch | 1 |
| Elimination (Ritter and Urcid [22]) | SLMP, OB-LNN, L1 L∞-SLLP | Yes | Unique batch | 1 |
| Merging (Ritter and Urcid [22]) | SLMP | Yes | Unique batch | 1 |
| Hyperbox per Class and Divide (Sussner and Esmi [33]) | MP/CL | Yes | Unique batch | 1 |
| Gradient descent (Pessoa and Maragos [17], de A. Araujo [9]) | Multilayer MRL-NNs and IMP | No | Incremental | No limit |
| Divide and Conquer (Sossa and Guevara [29]) | DMNN | Yes | Unique batch | 1 |
MP/CL and DMNN approaches, but we use a softmax operator as a soft version of argmax, along with dendrite clusters for each class. Our work differs from IMP (de A. Araujo [9]) in three ways: 1) we focus on classification problems instead of regression problems, 2) we reuse some heuristics to initialise hyperboxes before training, and 3) we extend the neural architecture with a softmax layer. Furthermore, it differs from MRL-NNs (Pessoa and Maragos [17]) in the neural architecture, which does not incorporate linear layers, only morphological layers.

4. DMN trained by SGD

In this section, we present our proposal. We modify the neural architecture of a conventional DMN so that it can be trained by SGD (Montavon et al. [16]). We also present the SGD training method in some detail for practitioners who are interested in using it. The code and a demo of the training algorithm are available online at Zamora [36].

4.1. DMN with Softmax Layer

The principal problem in training a conventional DMN by SGD lies in calculating the gradients when there is an argmax function at the output, because argmax is a discrete function. This can easily be overcome by using a softmax layer instead of the argmax function. The softmax layer normalises the dendrite outputs so that they are restricted between zero and one and can be interpreted as a measure of the likelihood Pr_c that a pattern x belongs to class c, as follows:

Pr_c(x) = \exp(d_c(x)) / \sum_{k=1}^{N_C} \exp(d_k(x)),   (3)

The assigned class is taken based on these probabilities by

y = \arg\max_c (Pr_c(x)),   (4)

We propose forming dendrite clusters d_c in order to have several hyperboxes per class, as in de A. Araujo [9], Sossa et al. [27]. In Figure 3, we show our proposed neural architecture. If a particular classification problem has N_c classes, then we need N_c dendrite clusters, one per class. Each dendrite cluster output takes the maximum among its dendrites h, as follows:

d_c(x) = \max_k (h_{k,c}(x)),   (5)

where dendrite k of cluster c is given by

h_{k,c}(x) = \min(\min(x − w_{k,c}, w_{k,c} + b_{k,c} − x)),   (6)

Figure 3: Neural architecture for a DMN with a softmax layer. Each class corresponds to one dendrite cluster d_c.

Equation (6) is very similar to Equation (2). Both represent the same thing: a dendrite in the neural architecture and, graphically, a hyperbox in a space of any dimension. Remember that hyperboxes (dendrites) are the elementary computing units in our model. Note, however, that we change the way a hyperbox is codified mathematically. In a conventional DMN, as in Equation (2), a hyperbox is represented by its extreme points w_min and w_max, as depicted in Figure 1(b). Equation (6) represents a hyperbox by its lowest extreme point w_{k,c} and its size vector b_{k,c}, which determines the hyperbox size along each side.
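The following is a minimal NumPy sketch of the forward pass of the proposed architecture in Equations (3)–(6); the list-of-arrays parameterisation (one array of lowest corners `W[c]` and one of size vectors `B[c]` per cluster) is our own choice for illustration, not the authors' released code.

```python
import numpy as np

def forward(x, W, B):
    """Class probabilities Pr_c(x) of a DMN with a softmax layer.

    x : (N,) input pattern
    W : list of Nc arrays; W[c] has shape (Kc, N), the lowest corners w_{k,c}
    B : list of Nc arrays; B[c] has shape (Kc, N), the size vectors b_{k,c}
    """
    d = np.empty(len(W))
    for c, (Wc, Bc) in enumerate(zip(W, B)):
        h = np.minimum(x - Wc, Wc + Bc - x).min(axis=1)  # Eq. (6), one value per dendrite
        d[c] = h.max()                                   # Eq. (5), cluster output
    e = np.exp(d - d.max())                              # Eq. (3), shifted for numerical stability
    return e / e.sum()

def predict(x, W, B):
    """Eq. (4): assign the class with the largest probability."""
    return int(np.argmax(forward(x, W, B)))
```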
4.2. Training Method

A key issue of morphological neural networks is their training. We need to determine automatically the number of hyperboxes for each dendrite cluster, as well as the dendrite parameters w_{k,c} and b_{k,c}. This is a great difference with
respect to classical perceptrons, where only the learning parameters are determined during training. Several training approaches for morphological neurons have been proposed, as shown in section 3. In this paper, we propose to initialise the hyperboxes based on heuristics and to optimise the dendrite parameters by SGD (Montavon et al. [16]). In section 5, we show that this method is better than training based only on heuristics.

We first explain the optimisation stage. The softmax layer makes it possible to estimate the probability Pr_c(x) that a pattern x belongs to class c. The cross entropy between these probabilities and the targets (ideal probabilities) can be used as the objective function J, as follows:

J = −\sum_{q=1}^{Q} \sum_{c=1}^{N_C} 1{t_q = c} \log(Pr_c(x_q)),   (7)

where Q is the total number of training samples, q is the index of a training sample, t_q ∈ N is the target class for that training sample, and 1{·} is the indicator function, so that 1{true statement} = 1 and 1{false statement} = 0. The purpose is to minimise this cost function (7) by the SGD method. The softmax makes the gradients easier to compute because the cross entropy and softmax functions are continuous and very well known in the machine learning community. Further, the optimisation algorithm especially benefits from this cost function when the task is a classification problem. If we did not add a softmax layer, we would need to use the 1-norm or 2-norm to calculate an error between targets and dendrite outputs; this kind of cost function works well for regression problems, but not for classification problems. Kline and Berardi [13] and Golik et al. [11] have shown that cross entropy allows finding a better local optimum and has practical advantages over the squared-error function.

In order to update the DMN's parameters by SGD optimisation, the gradients dJ/dw_{k,c} and dJ/db_{k,c} must be calculated. They are given by

dJ/dw_{k,c} = −\sum_{q=1}^{Q_{batch}} [0, ..., 0, f_{w_{k,c}}, 0, ..., 0]^T,   (8)

dJ/db_{k,c} = −\sum_{q=1}^{Q_{batch}} [0, ..., 0, f_{b_{k,c}}, 0, ..., 0]^T,   (9)

where

f_{w_{k,c}} = 0 if k ≠ k*;  −(1{t_q = c} − Pr_c(x_q)) if k = k* ∧ j* = 1;  1{t_q = c} − Pr_c(x_q) if k = k* ∧ j* = 2,   (10)

f_{b_{k,c}} = 0 if k ≠ k* ∨ j* = 1;  1{t_q = c} − Pr_c(x_q) if k = k* ∧ j* = 2,   (11)

where k* is the index of the dendrite that generates the maximum response in dendrite cluster c, j* is the column of the matrix [x − w_{k,c}, w_{k,c} + b_{k,c} − x] where the minimum lies, and the position of f_{w_{k,c}} and f_{b_{k,c}} in the vectors (8) and (9) is determined by the row of that matrix where the minimum lies. Due to the min and max operators in the DMN, most gradient elements are zero. This means that only one parameter of w_{k,c} and b_{k,c}, and only one dendrite in each cluster, is updated by the gradient descent method when the cost is calculated from a single training sample. In practice, we prefer to apply gradient descent in batches of Q_batch training samples taken randomly.

We show the pseudo-code of the training in Algorithm 1. The first line requires setting the hyperparameters: the learning rate α, the number of iterations N_i and the batch size Q_batch. The second line initialises the DMN's parameters (w_{k,c}, b_{k,c}), saving them into the dendrites object variable. The batch is randomly computed in line 4, the gradients are calculated in line 5, and the model parameters are updated in lines 6 and 7. The variables (X, T) contain the training samples and the corresponding targets.

Algorithm 1 Training procedure for a DMN based on SGD.
1: initialise hyperparameters α, N_i, Q_batch
2: dendrites := parameters_initialisation(X, T)
3: for i = 0 to N_i do
4:   (X_batch, T_batch) := computebatch(X, T)
5:   (dJ/dw_{k,c}, dJ/db_{k,c}) := computegradients(dendrites, X_batch, T_batch)
6:   dendrites.w_{k,c} := dendrites.w_{k,c} − α dJ/dw_{k,c}  for all k, c
7:   dendrites.b_{k,c} := dendrites.b_{k,c} − α dJ/db_{k,c}  for all k, c
8: end
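As a concrete reading of Algorithm 1 and of the sparse gradients (8)–(11), here is a minimal NumPy sketch that reuses the list-of-arrays parameterisation of the earlier sketch; it is a simplified illustration (fixed learning rate, no stopping criterion), not the authors' released implementation.

```python
import numpy as np

def grads(x, t, W, B):
    """Cross-entropy gradients (8)-(11) for a single sample (x, t): in each
    cluster c only the winning dendrite k* and the single coordinate i*
    where the minimum of [x - w, w + b - x] occurs receive a nonzero value."""
    Nc = len(W)
    d = np.empty(Nc)
    winners = []
    for c in range(Nc):
        M = np.stack([x - W[c], W[c] + B[c] - x], axis=2)        # shape (Kc, N, 2)
        h = M.min(axis=(1, 2))                                   # Eq. (6) per dendrite
        k = int(h.argmax())                                      # Eq. (5): winning dendrite k*
        i, j = np.unravel_index(int(M[k].argmin()), M[k].shape)  # row i*, column j*
        d[c] = h[k]
        winners.append((k, i, j))
    e = np.exp(d - d.max())
    pr = e / e.sum()                                             # Eq. (3)

    dW = [np.zeros_like(W[c]) for c in range(Nc)]
    dB = [np.zeros_like(B[c]) for c in range(Nc)]
    for c, (k, i, j) in enumerate(winners):
        err = (1.0 if t == c else 0.0) - pr[c]                   # 1{t_q = c} - Pr_c(x_q)
        if j == 0:                  # minimum in the column x - w       (j* = 1)
            dW[c][k, i] = err       # dJ/dw = -f_w, Eq. (10)
        else:                       # minimum in the column w + b - x   (j* = 2)
            dW[c][k, i] = -err
            dB[c][k, i] = -err      # dJ/db = -f_b, Eq. (11)
    return dW, dB

def train_sgd(X, T, W, B, alpha=0.01, iters=1000, batch=32, seed=0):
    """Minimal SGD loop in the spirit of Algorithm 1."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):                                                # line 3
        idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)  # line 4
        accW = [np.zeros_like(w) for w in W]
        accB = [np.zeros_like(b) for b in B]
        for q in idx:                                                     # line 5
            dW, dB = grads(X[q], T[q], W, B)
            for c in range(len(W)):
                accW[c] += dW[c]
                accB[c] += dB[c]
        for c in range(len(W)):                                           # lines 6-7
            W[c] -= alpha * accW[c]
            B[c] -= alpha * accB[c]
    return W, B
```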
We now explain the different initialisation methods for the dendrite parameters. Unlike a classical MLP, whose learning parameters are initialised randomly, the DMN allows us to propose different initialisation methods because the dendrite parameters have a clear geometrical interpretation: they are hyperboxes. Hence, we can reuse training methods based on heuristics to give a first approximation at the beginning of the gradient descent iterations. In this paper, we propose the following four initialisation methods, illustrated in Figure 4 (a code sketch of two of them is given after this list).

• HpC initialisation (Sussner and Esmi [33]). Each training set of the same class is enclosed by a hyperbox (Hyperbox per Class) with a distance margin M ∈ (−0.5, ∞). This is the simplest way to initialise dendrites. The drawback is that it restricts the maximum number of dendrites to be equal to the number of classes, and sometimes the pattern distribution may require a greater number of dendrites. Therefore, we propose the following methods, which generate more hyperboxes per class to capture the data structure.

• dHpC initialisation. This method divides each hyperbox previously generated by the above method. The number of hyperboxes produced is equal to 2^{n_d} N_c, where n_d is the number of dimensions that are divided. For example, if n_d = 1, the algorithm divides the hyperboxes in half with respect to the first dimension, producing 2N_c hyperboxes. If n_d = 2, it divides the hyperboxes in half with respect to the first and second dimensions, producing 4N_c hyperboxes, and so on.

• D&C initialisation (Sossa and Guevara [29]). The dendrite parameters are set equal to the dendrites produced by the Divide and Conquer training method proposed in Sossa and Guevara [29]. This algorithm encloses all patterns in one hyperbox with a distance margin M ∈ [0, ∞) and divides this hyperbox into 2^N sub-hyperboxes if there is misclassification. The procedure is called recursively for each sub-hyperbox until every pattern is classified within a given tolerance error E_0. This method is computationally more expensive, but it captures the data distribution better. In this paper, we deploy an improved version of this algorithm that can avoid the curse of dimensionality and overfitting; this new algorithm will be published later (Guevara & Sossa, 2016, Personal Communication).

• K-means initialisation. The classical k-means algorithm is applied to obtain S clusters for each training set of the same class. The k-means clusters are transformed into hyperboxes, and we use these hyperboxes to set the dendrite parameters (we do the inverse of the process proposed in Sossa et al. [27]). We use Gaussians as radial basis functions for clustering:

  φ_{k,c}(x) = \exp(−β_{k,c} ‖x − µ_{k,c}‖^2),   (12)

  The cluster centre µ_{k,c} and width β_{k,c} are used to calculate the dendrite parameters that form a hyperbox for each Gaussian cluster, as follows:

  w_{k,c} = µ_{k,c} − ∆_{k,c},   (13)

  b_{k,c} = 2∆_{k,c},   (14)

  ∆_{k,c} = \sqrt{−\log(y_0)/β_{k,c}},   (15)

  where y_0 ∈ (0, 1] determines the hyperbox size (note that the vector operations in (15) are element-wise). This initialisation method also increases the computational cost, but it can create more complex learning models than the HpC and dHpC methods.
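The sketch below illustrates two of the initialisation methods above in our own Python/NumPy conventions: HpC (one hyperbox per class with a margin M) and the k-means conversion of Equations (13)–(15). We use scikit-learn's KMeans as the "classical k-means algorithm", and the per-cluster width β_{k,c} is estimated here from the mean squared distance to the cluster centre, which is our own assumption since the inverse of the procedure in Sossa et al. [27] is not spelled out in this paper. Both functions return the (W, B) lists expected by the earlier sketches.

```python
import numpy as np
from sklearn.cluster import KMeans

def hpc_init(X, T, margin=0.05):
    """HpC: one hyperbox per class, enlarged on every side by a margin M."""
    W, B = [], []
    for c in np.unique(T):
        Xc = X[T == c]
        lo = Xc.min(axis=0) - margin          # lowest corner w_{1,c}
        hi = Xc.max(axis=0) + margin
        W.append(lo[None, :])
        B.append((hi - lo)[None, :])          # size vector b_{1,c}
    return W, B

def kmeans_init(X, T, S=3, y0=0.5, seed=0):
    """k-means initialisation: S Gaussian clusters per class, each converted
    to a hyperbox with Eqs. (13)-(15)."""
    W, B = [], []
    for c in np.unique(T):
        Xc = X[T == c]
        km = KMeans(n_clusters=min(S, len(Xc)), n_init=10, random_state=seed).fit(Xc)
        Wc, Bc = [], []
        for k in range(km.n_clusters):
            pts = Xc[km.labels_ == k]
            mu = km.cluster_centers_[k]
            # beta_{k,c}: inverse mean squared distance to the centre (our assumption)
            beta = 1.0 / max(np.mean(np.sum((pts - mu) ** 2, axis=1)), 1e-8)
            delta = np.full(Xc.shape[1], np.sqrt(-np.log(y0) / beta))  # Eq. (15)
            Wc.append(mu - delta)                                      # Eq. (13)
            Bc.append(2.0 * delta)                                     # Eq. (14)
        W.append(np.array(Wc))
        B.append(np.array(Bc))
    return W, B
```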
Figure 4: Initialisation methods. We illustrate the four initialisation methods. The HpC method initialises the dendrite parameters by enclosing each class with a hyperbox with a margin (possibly negative). The dHpC method is the same as the previous method but additionally divides the hyperboxes in half with respect to some dimensions, generating more hyperboxes per class. The D&C method uses D&C training to initialise the dendrites. The k-means method first performs a clustering with the classical k-means algorithm to define where to put the initial hyperboxes.

5. Results and Discussion

In this section, we compare our proposal with other training methods for DMNs and with popular machine learning methods, using synthetic and real classification datasets. The results demonstrate that our proposal obtains systematically better accuracy than other training methods for DMNs and achieves similar performance to popular machine learning methods. Furthermore, we discuss the advantages and disadvantages of each hyperbox initialisation method for our proposal, to determine which is the best. The reader can reproduce the experimental results of this paper using the online code provided at Zamora [36]. At the end of the section, we present special experiments to show the best asset of DMNs with respect to MLPs.

5.1. Synthetic and Real Datasets

Our proposed method is first applied to synthetic datasets with high overlapping. We have chosen datasets where overlapping is present because overfitting is more relevant in this kind of problem. The first synthetic dataset is called "A" and is generated by two Gaussian distributions with standard deviation equal to 0.9; the first class is centred around (0,0) and the second class around (1,1). The second synthetic dataset is called "B" and consists of three classes, each following a Gaussian distribution with standard deviation equal to 1.0; the class centres are (-1,-1), (1,1) and (-1,2).

We also use several real datasets to test the performance of our proposal. Most of these datasets were taken from the UCI Machine Learning Repository (Asuncion and Newman [1]). We also use subsets of the MNIST (LeCun Y. and C. [15]) and CIFAR10 (Krizhevsky [14]) datasets for tasks with high dimensionality. We describe them briefly. The Iris dataset is a classical non-linearly separable problem with three types of iris plant and four describing features. The Liver dataset consists of detecting liver disorders (two classes) using six features related to blood tests and alcohol consumption. The Glass dataset has the purpose of identifying the glass type in crime scenes using 10 properties of glass; the number of classes is six. The Page Blocks dataset requires classifying blocks of documents into five categories based on 10 geometrical block features. The Letter Recognition dataset has 26 letters from different fonts which must be identified by 16 geometrical and statistical features. The Mice Protein Expression dataset classifies mice into eight classes based on features such as genotype, behaviour and treatment; in this work, we only use the expression levels of 77 proteins as features. Finally, we apply the proposed method to two image classification tasks, taking pixels as features (without intermediate feature extraction). The MNIST dataset is a classical recognition task for handwritten digits (0 to 9) with an image size of 28 × 28 pixels. CIFAR10 requires the recognition of 10 animals and vehicles from colour images of 32 × 32 pixels. It is worth mentioning that we found duplicates in some of these datasets; we deleted them for the experiments. Table 3 summarises the dataset properties in terms of the number of features N, the number of classes Nc, the number of training samples Qtrain and the number of testing samples Qtest.
Table 3: We summarise the dataset properties.

| Dataset | N | Nc | Qtrain | Qtest |
| A | 2 | 2 | 1000 | 200 |
| B | 2 | 3 | 1500 | 300 |
| Iris | 4 | 3 | 120 | 30 |
| Liver | 6 | 2 | 300 | 45 |
| Glass | 10 | 6 | 170 | 44 |
| Page Blocks | 10 | 5 | 4450 | 959 |
| Letter Recog. | 16 | 26 | 15071 | 3917 |
| Mice Protein | 77 | 8 | 900 | 180 |
| MNIST | 784 | 10 | 1000 | 200 |
| CIFAR10 | 3072 | 10 | 1000 | 200 |

5.2. Comparison with Other DMN Training Methods

In this subsection, we analyse the performance of the DMN trained by SGD in comparison with other training methods based solely on heuristics. We also compare the performance of the different ways to initialise the hyperboxes proposed in section 4. These comparisons are made on the datasets described above. We measure performance in terms of the training error Etrain, the testing error Etest, the training time ttrain and the model complexity NH/Qtrain. This last quantity is defined as the ratio of the number of hyperboxes NH to the number of training samples Qtrain. If this ratio is equal to 1, the neuron uses one hyperbox for each training sample; in this extreme case, the model is very complex. In contrast, if this ratio is close to zero, the number of hyperboxes is much smaller than the number of training samples and the model is simple. Ideally, we are looking for a morphological neuron with relatively few hyperboxes (NH/Qtrain ≈ 0), with the smallest testing error Etest and with Etrain ∼ Etest; that is, a simple and well generalised learning model.

As we saw in section 3, several ways to train a DMN have been reported in the literature. We only compare our proposal with the principal algorithms: Elimination (SMLP) (Ritter and Urcid [22]), Hyperbox per Class (HpC) (a simplified version of the algorithm proposed in Sussner and Esmi [33]), and Divide and Conquer (D&C) (Sossa and Guevara [29]). Elimination training encloses all patterns in a hyperbox and eliminates misclassified patterns by enclosing them with several other hyperboxes. HpC training encloses the patterns of each class with a hyperbox. D&C training is based on a divide and conquer strategy; it begins by enclosing all patterns in one hyperbox, then divides the hyperbox into 2^N sub-hyperboxes if there is misclassification, and this procedure is called recursively for each sub-hyperbox until the patterns are classified within a given tolerance error. It is important to note that all these training methods are based on heuristics, not on optimisation methods like our proposal, so we could expect SGD training to do better. Additionally, we compare the four initialisation methods we propose in section 4 for SGD training: HpC, dHpC, D&C and k-means.

Figure 5 summarises our results, showing that SGD training systematically obtains smaller testing errors on each dataset than the other heuristics. Furthermore, we can see that when we initialise SGD training with the dHpC method, we obtain better testing errors on average than with the other initialisation methods. On average, the best initialisation method in terms of the smallest number of dendrites is the HpC method, and in terms of the shortest training time it is the k-means method. We can conclude that SGD gives better results than solely heuristic-based methods for training DMNs. For detailed results, see Tables 4 and 5.

5.3. Comparison with Other Machine Learning Methods

In this subsection, we compare the performance of popular machine learning methods with our proposal, DMN-SGD. The objective is to show that our proposal gets better results on low dimensional datasets and obtains competitive results on high dimensional datasets. These comparisons are made on the datasets described above. As popular machine learning methods, we chose two-layer multilayer perceptrons (MLP) (Rumelhart et al. [25]), the support vector machine (SVM) (Cortes and Vapnik [5]) and the radial basis function network (RBFN) (Broomhead and Lowe [4]).

We summarise our results in Figure 6, showing that DMN-SGD obtains the smallest testing errors for some datasets (A, B, Iris and Liver). For other datasets (Glass, Page Blocks, Letters and Mice Protein), DMN-SGD presents errors close to the best result; and for the remaining datasets (MNIST and CIFAR10), DMN-SGD obtains the worst error. We can infer that the DMN-SGD algorithm works better for low dimensional datasets (fewer features) than for high dimensional datasets. On the other hand, on average, the testing error of DMN-SGD is the second best, behind the MLP model, for these 10 datasets. The main drawback is that DMN-SGD needs more model parameters Npm than the other machine learning models, but DMN-SGD has a better training time than MLP on average. We can conclude that a DMN trained by SGD can be another useful technique in the machine learning toolbox for practitioners, because it obtains competitive results with respect to popular machine learning methods for classification problems. For detailed results, the reader can see Table 6.

5.4. DMN vs MLP

In the previous subsection we saw that the DMN performance is competitive with popular machine learning techniques. In particular, the MLP has marginally better classification than the DMN on average. However, there are some problems where the DMN far outstrips the MLP. The purpose of this subsection is to show empirically the greatest benefit of DMNs. According to our proposal, training both a DMN and an MLP requires two steps: 1) initialisation of the learning parameters of the model and 2) optimising them. In the MLP, initialisation is a very simple step: choose the initial parameters randomly following some convenient probability distribution (Glorot and Bengio [10], He et al. [12]); rarely do these random initial values give a good decision boundary. Therefore, the optimisation must search harder to find the best parameters in an MLP.
Figure 5: Comparison with other DMN training methods. Our experimental results are summarised. (a) We show the testing errors for each dataset obtained by the training methods based on heuristics in red: a) SMLP, b) D&C and c) HpC; and the SGD training is in green, executed from four different initialisation methods: d) SGD (HpC), e) SGD (dHpC), f) SGD (D&C) and g) SGD (k-means). (b) We show the average among all datasets for testing error Etest , model complexity NH /Qtrain and training time ttrain . We mark the best value with blue. Note that quantities are sometimes zero, meaning that bars disappear.
Table 4: We report our results in detail. For each dataset and each training method, we evaluate the number of dendrites NH, the model complexity NH/Qtrain, the training error Etrain, the testing error Etest and the training time ttrain. The best values are shown in bold. This table shows only the results for the first five datasets; see Table 5 for the rest.

| Dataset | Training | NH | NH/Qtrain | Etrain % | Etest % | ttrain (s) |
| A | SMLP | 225 | 0.225 | 0.00 | 55.00 | 98.4 |
| A | D&C | 13 | 0.013 | 21.40 | 21.50 | 0.2 |
| A | HpC | 2 | 0.002 | 24.90 | 25.00 | 0.0 |
| A | SGD (HpC) | 2 | 0.002 | 21.20 | 21.50 | 10.6 |
| A | SGD (dHpC) | 2 | 0.002 | 21.20 | 21.50 | 10.6 |
| A | SGD (D&C) | 13 | 0.013 | 20.90 | 20.00 | 16.5 |
| A | SGD (k-means) | 2 | 0.002 | 21.60 | 22.00 | 0.0 |
| B | SMLP | 288 | 0.192 | 0.00 | 40.67 | 166.1 |
| B | D&C | 3 | 0.002 | 18.80 | 21.00 | 0.1 |
| B | HpC | 3 | 0.002 | 17.27 | 17.67 | 0.0 |
| B | SGD (HpC) | 3 | 0.002 | 16.07 | 16.33 | 19.4 |
| B | SGD (dHpC) | 3 | 0.002 | 16.07 | 16.33 | 18.4 |
| B | SGD (D&C) | 3 | 0.002 | 16.53 | 16.67 | 6.6 |
| B | SGD (k-means) | 3 | 0.002 | 15.53 | 15.33 | 1.3 |
| Iris | SMLP | 12 | 0.100 | 0.00 | 10.00 | 0.0 |
| Iris | D&C | 27 | 0.225 | 1.67 | 3.30 | 0.1 |
| Iris | HpC | 3 | 0.025 | 5.00 | 0.00 | 0.0 |
| Iris | SGD (HpC) | 3 | 0.025 | 3.33 | 0.00 | 1.8 |
| Iris | SGD (dHpC) | 3 | 0.025 | 3.33 | 0.00 | 1.8 |
| Iris | SGD (D&C) | 27 | 0.225 | 0.00 | 0.00 | 6.7 |
| Iris | SGD (k-means) | 30 | 0.250 | 0.83 | 0.00 | 0.0 |
| Liver | SMLP | 83 | 0.277 | 0.00 | 73.33 | 3.8 |
| Liver | D&C | 87 | 0.290 | 19.00 | 37.78 | 0.2 |
| Liver | HpC | 2 | 0.007 | 35.67 | 40.00 | 0.0 |
| Liver | SGD (HpC) | 2 | 0.007 | 31.00 | 35.56 | 103.0 |
| Liver | SGD (dHpC) | 32 | 0.107 | 24.33 | 24.44 | 2.1 |
| Liver | SGD (D&C) | 87 | 0.290 | 6.33 | 31.11 | 309.5 |
| Liver | SGD (k-means) | 25 | 0.083 | 7.33 | 31.11 | 5.4 |
| Glass | SMLP | 6 | 0.035 | 0.00 | 100.00 | 0.0 |
| Glass | D&C | 26 | 0.153 | 26.47 | 29.55 | 0.1 |
| Glass | HpC | 6 | 0.035 | 18.82 | 29.55 | 0.0 |
| Glass | SGD (HpC) | 6 | 0.035 | 8.24 | 13.64 | 1.5 |
| Glass | SGD (dHpC) | 192 | 1.129 | 10.00 | 9.09 | 0.3 |
| Glass | SGD (D&C) | 26 | 0.153 | 25.29 | 20.45 | 102.4 |
| Glass | SGD (k-means) | 6 | 0.035 | 1.18 | 27.27 | 0.0 |
Table 5: We report our results in detail. For each dataset and each training method, we evaluate the number of dendrites NH, the model complexity NH/Qtrain, the training error Etrain, the testing error Etest and the training time ttrain. The best values are shown in bold. This table shows only the results for the last five datasets; see Table 4 for the rest.

| Dataset | Training | NH | NH/Qtrain | Etrain % | Etest % | ttrain (s) |
| Page Blocks | SMLP | 179 | 0.040 | 0.00 | 14.91 | 174.5 |
| Page Blocks | D&C | 235 | 0.053 | 2.00 | 8.03 | 0.7 |
| Page Blocks | HpC | 5 | 0.001 | 22.63 | 28.78 | 0.0 |
| Page Blocks | SGD (HpC) | 5 | 0.001 | 4.40 | 7.19 | 1.0 |
| Page Blocks | SGD (dHpC) | 5 | 0.001 | 4.40 | 7.19 | 1.0 |
| Page Blocks | SGD (D&C) | 235 | 0.053 | 5.57 | 7.92 | 10.7 |
| Page Blocks | SGD (k-means) | 5 | 0.001 | 9.06 | 12.20 | 0.1 |
| Letter Rec. | SMLP | 4072 | 0.270 | 2.98 | 20.78 | 60426.2 |
| Letter Rec. | D&C | 7126 | 0.473 | 8.36 | 31.58 | 60.2 |
| Letter Rec. | HpC | 26 | 0.002 | 63.82 | 63.39 | 0.0 |
| Letter Rec. | SGD (HpC) | 26 | 0.002 | 32.34 | 33.93 | 19.2 |
| Letter Rec. | SGD (dHpC) | 104 | 0.007 | 23.42 | 25.81 | 1352.5 |
| Letter Rec. | SGD (D&C) | 7126 | 0.473 | 0.49 | 22.06 | 1278.2 |
| Letter Rec. | SGD (k-means) | 1322 | 0.088 | 7.68 | 12.61 | 69.6 |
| Mice Protein | SMLP | 49 | 0.054 | 0.00 | 63.33 | 1.7 |
| Mice Protein | D&C | 809 | 0.899 | 0.00 | 5.00 | 2.3 |
| Mice Protein | HpC | 8 | 0.009 | 8.33 | 15.56 | 0.0 |
| Mice Protein | SGD (HpC) | 8 | 0.009 | 7.89 | 15.56 | 0.5 |
| Mice Protein | SGD (dHpC) | 32 | 0.036 | 6.33 | 12.78 | 154.1 |
| Mice Protein | SGD (D&C) | 809 | 0.899 | 0.00 | 4.44 | 6.5 |
| Mice Protein | SGD (k-means) | 235 | 0.261 | 3.33 | 11.67 | 2.7 |
| MNIST | SMLP | 347 | 0.347 | 0.00 | 61.00 | 110.7 |
| MNIST | D&C | 1000 | 1.000 | 0.00 | 80.00 | 18.3 |
| MNIST | HpC | 10 | 0.010 | 20.40 | 39.00 | 0.0 |
| MNIST | SGD (HpC) | 10 | 0.010 | 11.60 | 23.50 | 89.0 |
| MNIST | SGD (dHpC) | 10 | 0.010 | 11.60 | 23.50 | 89.0 |
| MNIST | SGD (D&C) | 1000 | 1.000 | 0.00 | 36.00 | 2318.2 |
| MNIST | SGD (k-means) | 182 | 0.182 | 0.80 | 15.50 | 35.3 |
| CIFAR10 | SMLP | 98 | 0.098 | 0.00 | 99.50 | 20.3 |
| CIFAR10 | D&C | 1000 | 1.000 | 0.00 | 83.00 | 72.1 |
| CIFAR10 | HpC | 10 | 0.010 | 48.60 | 77.50 | 0.1 |
| CIFAR10 | SGD (HpC) | 10 | 0.010 | 4.80 | 77.50 | 36.5 |
| CIFAR10 | SGD (dHpC) | 40 | 0.040 | 3.80 | 77.00 | 380.5 |
| CIFAR10 | SGD (D&C) | 1000 | 1.000 | 0.00 | 83.00 | 715.3 |
| CIFAR10 | SGD (k-means) | 10 | 0.010 | 76.60 | 79.50 | 0.3 |
Figure 6: Comparison with other machine learning methods. We compare our proposal DMN-SGD with popular machine learning methods, summarising the experimental results. (a) We show the testing errors for each dataset obtained by the following machine learning methods: DMN, MLP, SVM and RBFN. (b) We show the average among all datasets of the testing error Etest, the number of learning parameters Npm and the training time ttrain. We mark the best value in blue. Note that quantities are sometimes zero, so that the bars disappear.
Table 6: We report our results in detail. For each dataset and each machine learning method, we evaluate the number of learning parameters Npm, the training error Etrain, the testing error Etest and the training time ttrain. The best values are shown in bold.

| Dataset | ML model | Npm | Etrain % | Etest % | ttrain (s) |
| A | DMN-SGD (D&C) | 52 | 20.90 | 20.00 | 16.5 |
| A | MLP | 252 | 21.20 | 21.50 | 7.6 |
| A | SVM | 10 | 21.10 | 22.00 | 0.5 |
| A | RBFN | 22 | 21.90 | 22.00 | 0.1 |
| B | DMN-SGD (k-means) | 12 | 15.53 | 15.33 | 1.3 |
| B | MLP | 63 | 15.80 | 15.67 | 3.5 |
| B | SVM | 10 | 15.80 | 16.33 | 0.5 |
| B | RBFN | 75 | 16.40 | 16.33 | 0.1 |
| Iris | DMN-SGD (D&C) | 216 | 0.00 | 0.00 | 6.7 |
| Iris | MLP | 83 | 0.83 | 0.00 | 1.2 |
| Iris | SVM | 21 | 4.17 | 3.33 | 0.5 |
| Iris | RBFN | 51 | 3.33 | 0.00 | 0.1 |
| Liver | DMN-SGD (dHpC) | 384 | 24.33 | 24.44 | 2.1 |
| Liver | MLP | 272 | 7.33 | 24.44 | 3.4 |
| Liver | SVM | 18 | 29.33 | 35.56 | 0.4 |
| Liver | RBFN | 299 | 21.00 | 35.56 | 0.1 |
| Glass | DMN-SGD (dHpC) | 3840 | 10.00 | 9.09 | 0.3 |
| Glass | MLP | 1706 | 5.29 | 6.82 | 8.7 |
| Glass | SVM | 78 | 2.94 | 9.09 | 0.6 |
| Glass | RBFN | 244 | 1.76 | 25.00 | 0.1 |
| Page Blocks | DMN-SGD (HpC) | 100 | 4.40 | 7.19 | 1.0 |
| Page Blocks | MLP | 1605 | 3.64 | 6.47 | 9.2 |
| Page Blocks | SVM | 65 | 4.02 | 5.84 | 1.1 |
| Page Blocks | RBFN | 981 | 3.80 | 7.61 | 0.6 |
| Letter Rec. | DMN-SGD (k-means) | 42304 | 7.68 | 12.61 | 69.6 |
| Letter Rec. | MLP | 2176 | 8.63 | 11.26 | 1685.9 |
| Letter Rec. | SVM | 494 | 39.36 | 41.03 | 47.1 |
| Letter Rec. | RBFN | 84865 | 1.84 | 5.28 | 193.2 |
| Mice Protein | DMN-SGD (D&C) | 124586 | 0.00 | 4.44 | 6.5 |
| Mice Protein | MLP | 4308 | 0.00 | 0.00 | 232.0 |
| Mice Protein | SVM | 640 | 0.11 | 0.56 | 1.0 |
| Mice Protein | RBFN | 10414 | 2.78 | 5.56 | 0.2 |
| MNIST | DMN-SGD (k-means) | 285376 | 0.80 | 15.50 | 35.3 |
| MNIST | MLP | 39760 | 1.20 | 11.00 | 148.4 |
| MNIST | SVM | 7870 | 2.70 | 11.50 | 2.7 |
| MNIST | RBFN | 222610 | 0.90 | 7.00 | 2.2 |
| CIFAR10 | DMN-SGD (dHpC) | 245760 | 3.80 | 77.00 | 380.5 |
| CIFAR10 | MLP | 154160 | 50.70 | 70.50 | 103.8 |
| CIFAR10 | SVM | 30750 | 18.40 | 65.00 | 8.7 |
| CIFAR10 | RBFN | 369970 | 24.20 | 62.00 | 3.4 |
Instead, in the DMN, the initialisation of the dendritic weights exploits their geometric interpretation as hyperboxes to facilitate optimisation, solving problems that classical perceptrons cannot, or at least solving them at a much lower cost. This is the greatest benefit of DMNs; it is where the heuristic methods excel, by providing a smarter parameter initialisation.
To illustrate this, we evaluate the performance of both models (MLP and DMN) on three synthetic classification problems: an XOR-type problem, a double spiral with two laps and a double spiral with 10 laps. The spirals in particular are difficult problems for MLPs, because MLPs tend to create open decision boundaries with no laps. The experimental results are shown in Figure 7. It is clearly noticeable that in all three cases the optimisation requires more iterations for the MLP than for the DMN. We also notice that optimisation in the DMN is just a process of refining the initial model, while optimisation in the MLP is a more decisive step because the initial models are poor. Finally, the most important observation is that the MLP cannot solve the double spiral with 10 laps (or more), even though the decision boundary follows a mathematically simple pattern. Instead, the DMN initialises the decision boundary with a good approximation using the D&C strategy, allowing the optimisation step to refine the model after only a few iterations. Of course, more research is needed to see in which real problems this would happen; here, we just pave the way towards this research.
Figure 7: DMN vs MLP. This figure illustrates why the dendritic weight initialisation is the greatest benefit of the DMN. This initialisation: 1) makes the optimisation step easier, reducing the number of iterations required during optimisation; 2) solves problems which classical perceptrons cannot, or at least can only solve with much more work. These properties are due to the geometric interpretation of the dendritic weights as hyperboxes, which allows initialisation methods based on heuristics; this is not possible with an MLP.
6. Conclusion and Future Research
In this paper, we reviewed the different neural architectures and training methods for morphological neurons. Most of these training methods are based on heuristics that exploit the geometrical properties of hyperboxes. The main contribution of this paper is that the neural architecture of the dendrite morphological neuron (DMN) was extended so that it can be trained by stochastic gradient descent (SGD) for classification problems. We proposed and evaluated four different methods to initialise the dendrite parameters at the beginning of training by SGD. Our experimental results contradict what is implicitly said in the literature, namely that DMN training based only on heuristics does not need optimisation: the experiments show that our approach trained by SGD systematically obtains smaller classification errors than the training methods based solely on heuristics. Instead, we show that these heuristics are useful to initialise the dendrite parameters before optimising them. This is an advantage over the MLP, whose learning parameters can only be initialised randomly. In fact, we show that a DMN can solve some problems that an MLP cannot, or at least can only solve at a much higher cost.

Our results show that the best performance of DMNs trained by SGD was obtained at low dimensionality, and that performance decreases at high dimensionality. We think more exploration must be done to improve results in terms of new neural architectures and optimisation-based training for high dimensional problems in DMNs. This remains an open issue for future investigation.

Acknowledgements

The authors would like to acknowledge the support for this research work by UPIITA and CIC, both schools belonging to the Instituto Politécnico Nacional (IPN). This work was economically supported by SIP-IPN [grant numbers 20160945 and 20161116] and CONACYT [grant numbers 155014 (Basic Research) and 65 (Frontiers of Science)].

References
[1] Asuncion, A., Newman, D.J., 2007. UCI Machine Learning Repository. Irvine, CA: University of California, Department of Information and Computer Science. http://www.ics.uci.edu/~mlearn/MLRepository.html

[2] Barmpoutis, A., Ritter, G. X., 2006. Orthonormal basis lattice neural networks. In: Fuzzy Systems, 2006 IEEE International Conference on. pp. 331–336.
[3] Bishop, C. M., 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

[4] Broomhead, D. S., Lowe, D., 1988. Multivariable Functional Interpolation and Adaptive Networks. Complex Systems 2, 321–355.

[5] Cortes, C., Vapnik, V., 1995. Support-vector networks. Machine Learning 20 (3), 273–297. URL http://dx.doi.org/10.1023/A:1022627411411

[6] Davidson, J. L., Hummer, F., 1993. Morphology neural networks: An introduction with applications. Circuits, Systems and Signal Processing 12 (2), 177–210.

[7] Davidson, J. L., Ritter, G. X., 1990. Theory of morphological neural networks. URL http://dx.doi.org/10.1117/12.18085

[8] Davidson, J. L., Sun, K., Jul. 1991. Template learning in morphological neural nets. In: Gader, P. D., Dougherty, E. R. (Eds.), Image Algebra and Morphological Image Processing II. Vol. 1568. pp. 176–187.

[9] de A. Araujo, R., 2012. A morphological perceptron with gradient-based learning for brazilian stock market forecasting. Neural Networks 28, 61–81. URL http://www.sciencedirect.com/science/article/pii/S0893608011003200

[10] Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks. In: Teh, Y. W., Titterington, D. M. (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10). Vol. 9. pp. 249–256. URL http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

[11] Golik, P., Doetsch, P., Ney, H., Aug. 2013. Cross-entropy vs. squared error training: a theoretical and experimental comparison. In: Interspeech. Lyon, France, pp. 1756–1760.

[12] He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034.

[13] Kline, D. M., Berardi, V. L., 2005. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Computing & Applications 14 (4), 310–318.

[16] Montavon, G., Orr, G. B., Müller, K.-R. (Eds.), 2012. Neural Networks: Tricks of the Trade, Reloaded, 2nd Edition. Vol. 7700 of Lecture Notes in Computer Science (LNCS). Springer.

[17] Pessoa, L. F., Maragos, P., 2000. Neural networks with hybrid morphological/rank/linear nodes: a unifying framework with applications to handwritten character recognition. Pattern Recognition 33 (6), 945–960. URL http://www.sciencedirect.com/science/article/pii/S0031320399001570

[18] Ritter, G. X., Davidson, J. L., 1991. Recursion and feedback in image algebra. URL http://dx.doi.org/10.1117/12.47965

[19] Ritter, G. X., Schmalz, M. S., 2006. Learning in lattice neural networks that employ dendritic computing. In: Fuzzy Systems, 2006 IEEE International Conference on. pp. 7–13.

[20] Ritter, G. X., Sussner, P., Aug 1996. An introduction to morphological neural networks. In: Pattern Recognition, 1996, Proceedings of the 13th International Conference on. Vol. 4. pp. 709–717.

[21] Ritter, G. X., Urcid, G., Mar 2003. Lattice algebra approach to single-neuron computation. IEEE Transactions on Neural Networks 14 (2), 282–295.

[22] Ritter, G. X., Urcid, G., 2007. Computational Intelligence Based on Lattice Theory. Springer Berlin Heidelberg, Berlin, Heidelberg, Ch. Learning in Lattice Neural Networks that Employ Dendritic Computing, pp. 25–44.

[23] Ritter, G. X., Urcid, G., Iancu, L., 2003. Reconstruction of patterns from noisy inputs using morphological associative memories. Journal of Mathematical Imaging and Vision 19 (2), 95–111. URL http://dx.doi.org/10.1023/A:1024773330134

[24] Ritter, G. X., Urcid, G., Juan-Carlos, V. N., July 2014. Two lattice metrics dendritic computing for pattern recognition. In: Fuzzy Systems (FUZZ-IEEE), 2014 IEEE International Conference on. pp. 45–52.
AC
576
CR IP T
549
AN US
548
[15] LeCun Y., C. C., C., B., 1998. The MNIST dataset of handwritten digits. NewYork: NYU, Google Labs and Microsoft Research. http://yann. lecun.com/exdb/mnist/.
M
547
ED
546
[14] Krizhevsky, A., 2009. Learning Multiple Layers of Features from Tiny Images. MIT and NYU. https://www.cs.toronto.edu/~kriz/cifar.html.
PT
545
CE
544
577 578 579 580 581 582
583
27
[25] Rumelhart, D. E., Hinton, G. E., Williams, R. J., 1986. Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1. MIT Press, Cambridge, MA, USA, Ch. Learning Internal Representations by Error Propagation, pp. 318–362. URL http://dl.acm.org/citation.cfm?id=104279.104293
ACCEPTED MANUSCRIPT
REFERENCES
590 591 592 593
594 595 596 597
598 599 600 601
602 603 604 605
606 607 608
609 610 611 612 613 614
615
[28] Sossa, H., Guevara, E., 2013. Pattern Recognition: 5th Mexican Conference, MCPR 2013, Querétaro, Mexico, June 26-29, 2013. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, Ch. Modified Dendrite Morphological Neural Network Applied to 3D Object Recognition, pp. 314–324. [29] Sossa, H., Guevara, E., 2014. Efficient training for dendrite morphological neural networks. Neurocomputing 131, 132 – 142. URL http://www.sciencedirect.com/science/article/pii/ S0925231213010916
[30] Sussner, P., Sep 1998. Morphological perceptron learning. In: Intelligent Control (ISIC), 1998. Held jointly with IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), Intelligent Systems and Semiotics (ISAS), Proceedings. pp. 477–482. [31] Sussner, P., Esmi, E. L., 2009. Constructive Neural Networks. Springer Berlin Heidelberg, Berlin, Heidelberg, Ch. Constructive Morphological Neural Networks: Some Theoretical Aspects and Experimental Results in Classification, pp. 123–144. [32] Sussner, P., Esmi, E. L., June 2009. An introduction to morphological perceptrons with competitive learning. In: Neural Networks, 2009. IJCNN 2009. International Joint Conference on. pp. 3024–3031. [33] Sussner, P., Esmi, E. L., 2011. Morphological perceptrons with competitive learning: Lattice-theoretical framework and constructive learning algorithm. Information Sciences 181 (10), 1929 – 1950, special Issue on Information Engineering Applications Based on Lattices. URL http://www.sciencedirect.com/science/article/pii/ S0020025510001283 [34] Urcid, G., Ritter, G. X., 2006. Noise masking for pattern recall using a single lattice matrix auto-associative memory. In: Fuzzy Systems, 2006 IEEE International Conference on. pp. 187–194.
AC
616
CR IP T
589
[27] Sossa, H., Cortés, G., Guevara, E., 2014. New Radial Basis Function Neural Network Architecture for Pattern Classification: First Results. Springer International Publishing, Cham, pp. 706–713.
AN US
588
M
587
ED
586
[26] Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural Networks 61, 85–117, published online 2014; based on TR arXiv:1404.7828 [cs.NE].
PT
585
CE
584
28
617
618 619
620
621 622
[35] Weinstein, R. K., Lee, R. H., 2006. Architectures for high-performance fpga implementations of neural models. Journal of Neural Engineering 3 (1), 21. URL http://stacks.iop.org/1741-2552/3/i=1/a=003 [36] Zamora, E., 2016. Dendrite Morphological Neurons Trained by Stochastic Gradient Descend - Code. https://github.com/ezamorag/DMN_by_SGD.
ACCEPTED MANUSCRIPT
REFERENCES
29
627 628 629 630 631 632 633
634 635 636 637 638 639 640
Erik Zamora is a full professor with the National Polytechnic Institute of Mexico. He received his Diploma in Electronics from UV (2004), his Master's degree in Electrical Engineering from CINVESTAV (2007), and his Ph.D. in Automatic Control from CINVESTAV (2015). He developed the first commercial Mexican myoelectric system to control a prosthesis at the Pro/Bionics company, and a robotic navigation system for unknown environments guided by emergency signals at the University of Bristol. He currently holds a postdoctoral position at CIC-IPN. His current interests include autonomous robots and machine learning. He has published eight technical papers in international conference proceedings and journals and has directed nine theses on these topics.

Humberto Sossa was born in Guadalajara, Jalisco, Mexico in 1956. He received a B.Sc. degree in Electronics from the University of Guadalajara in 1981, an M.Sc. in Electrical Engineering from CINVESTAV-IPN in 1987 and a Ph.D. in Informatics from the National Polytechnic Institute of Grenoble, France in 1992. He is a full-time professor at the Centre for Computing Research of the National Polytechnic Institute of Mexico. His main research interests are in pattern recognition, artificial neural networks, image analysis, and robot control using image analysis.