Learning parsimonious dendritic classifiers





Manuel Graña, Ana Isabel Gonzalez-Acuña
Computational Intelligence Group, Dept. CCIA, Universidad del Pais Vasco (UPV/EHU), Spain

Available online 8 October 2012

Abstract

From a practical industrial point of view, parsimonious classifiers based on dendritic computing (DC) have two advantages. First, they are implemented using only additive and min/max operators, so they can run on simple processors and provide classification responses extremely fast. Second, parsimonious models improve generalization. In this paper we develop a formulation of dendritic classifiers based on lattice kernels and we train them using a direct Monte Carlo approach and Sparse Bayesian Learning. We compare the results of both kinds of training with the relevance vector machine (RVM) on a collection of benchmark datasets.

Keywords: Dendritic computing; Lattice computing; Sparse Bayesian Learning

1. Introduction

Lattice computing [7] encompasses a wide class of algorithms characterized either by using the lattice operators inf and sup as the computational building blocks or by using lattice theory to produce generalizations or fusions of previous approaches [10–14]. Dendritic computing (DC) [2,17,19–22] is a biologically inspired lattice computing approach to building classifiers for binary classification problems. Specifically, in [18,21] an analytical proof is given that the single neuron lattice model with dendrite computation (SNLDC) provides a perfect approximation to any data distribution. However, it suffers from over-fitting problems. A recent work [4] performed on a specific database [6,23–25] showed that SNLDC has high sensitivity (ability to identify positively the members of the target class) but very low specificity (ability to discard the non-members of the target class), producing low accuracy on average in a 10-fold cross-validation experiment; that work proposed a kernel transformation [28] followed by a lattice independent component analysis (LICA) [8] to improve the generalization of SNLDC classifiers. However, no work so far has aimed at obtaining parsimonious classifiers in the DC framework.

This paper aims to obtain parsimonious classifiers with good generalization capabilities. Sparse Bayesian Learning (SBL) [29–31] is a general Bayesian framework for obtaining sparse solutions to regression and classification tasks. Sparse models have many parameters set to the null value, zero in the conventional ring of the real numbers with the conventional addition and product. This approach obtains dramatically simpler models than other approaches. A popular instance of this approach is the relevance vector machine (RVM), which trains a linear prediction model that is functionally identical to the one used by the support vector machine (SVM) while obtaining much more parsimonious representations, i.e., using far fewer relevant vectors than an SVM of equivalent performance.


Industrial uses of hybrid and soft-computing [1,32], such as building inspection [26] or data traffic monitoring for security [5], will benefit from the use of the SBL approach to obtain parsimonious DC classifiers requiring only addition and max/min operations.

This paper is a step forward from [9] towards embedding lattice computing algorithms into a Bayesian framework. It contains the definition of the single layer lattice kernel neuron (SLKN), its training by direct Monte Carlo methods, and a specific tailoring of SBL to the SLKN. The SBL formulation includes the definition of the likelihood and prior distributions, with corresponding hyperpriors, to formulate learning and parameter relevance determination. Computational experiments on a suite of benchmark datasets provide results that are in some cases comparable to the RVM in terms of classification accuracy.

The structure of the paper is as follows. Section 2 reviews the baseline dendritic approach and introduces the SLKN model. Section 3 gives the baseline Monte Carlo method to train the SLKN. Section 4 reviews Sparse Bayesian Learning. Section 5 provides the Sparse Bayesian Learning formulation for the SLKN. Section 6 provides experimental results comparing the relevant dendritic computing (RDC) approaches and the RVM. Section 7 gives our conclusions and avenues for further research.

2. Dendritic computing

Following the notation in [30], in supervised learning we are given a sample of input vectors $\{\mathbf{x}_n\}_{n=1}^{N}$ along with corresponding targets $\{t_n\}_{n=1}^{N}$, which may be real values or class labels. The linear prediction model is given by a linear function of the form $y(\mathbf{x};\mathbf{w}) = \sum_{i=1}^{M} w_i \phi_i(\mathbf{x}) = \mathbf{w}^{T}\boldsymbol{\phi}(\mathbf{x})$, where the output is a weighted sum of $M$ basis functions. Basis functions may be non-linear without affecting the intrinsically linear nature of the model.


The general linear model is specialized to
$$y(\mathbf{x};\mathbf{w}) = \sum_{n=1}^{N} w_n K(\mathbf{x},\mathbf{x}_n) + w_0,$$
where $K(\mathbf{x},\mathbf{x}_n)$ is a kernel function defining a basis function from each training set sample. This model is used in the relevance vector machine (RVM) and the support vector machine (SVM). To obtain the output classification prediction, a hard-limiter or Heaviside function may be applied to the linear model output, $\hat{t} = f(y(\mathbf{x};\mathbf{w}))$, where
$$f(x) = \begin{cases} 1, & x \geq 0,\\ 0, & x < 0. \end{cases} \qquad (1)$$
We could also apply a logistic sigmoid function,
$$f(x) = \sigma(x) = \frac{1}{1+e^{-x}}, \qquad (2)$$
which can be interpreted as the a posteriori probability of class 1.
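As a concrete illustration, the following Python sketch (our own, not part of the paper; the RBF kernel choice and all names are assumptions) evaluates the kernelized linear model and turns its output into a class prediction with the Heaviside function of Eq. (1) or the logistic sigmoid of Eq. (2).

```python
import numpy as np

def rbf_kernel(x, xn, gamma=1.0):
    # Gaussian (RBF) kernel; any kernel K(x, x_n) could be used here.
    return np.exp(-gamma * np.sum((x - xn) ** 2))

def linear_model(x, X_train, w, w0, gamma=1.0):
    # y(x; w) = sum_n w_n K(x, x_n) + w_0
    return w0 + sum(wn * rbf_kernel(x, xn, gamma) for wn, xn in zip(w, X_train))

def heaviside(y):
    # Eq. (1): hard-limiter returning the class label directly.
    return 1 if y >= 0 else 0

def sigmoid(y):
    # Eq. (2): logistic output, read as P(class 1 | x).
    return 1.0 / (1.0 + np.exp(-y))

# Toy usage with arbitrary weights (no training involved).
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
w, w0 = np.array([1.5, -0.5]), -0.2
x = np.array([0.9, 1.1])
y = linear_model(x, X_train, w, w0)
print(heaviside(y), sigmoid(y))
```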


2.1. The single layer morphological neuron model

Dendrites are branched projections of neurons that conduct the electrochemical stimulus from one neuron to another. Some computational models assert that the bulk of the computation in the nervous system happens in the dendrites instead of the neuron body, the soma. Moreover, this computation seems to be based on addition and maximum/minimum operators, a model that fits nicely into the lattice computing framework. A single layer morphological neuron endowed with dendrite computation based on lattice algebra was introduced in [21]. It is composed of dendrites $D_j$ with associated inhibitory and excitatory weights $(w^{0}_{ij}, w^{1}_{ij})$ from the synapses coming from the $i$-th input variable, modeled by an input neuron. The response of the $j$-th dendrite to the input vector $\mathbf{x}$ is
$$\tau_j(\mathbf{x}) = p_j \bigwedge_{i \in I_j} \bigwedge_{l \in L_{ij}} (-1)^{1-l}\,(x_i + w^{l}_{ij}), \qquad (3)$$
where $I_j$ is the collection of input neurons affecting dendrite $D_j$; $l \in L_{ij} \subseteq \{0,1\}$ specifies whether the weight $w^{l}_{ij}$ is included in the model and its inhibitory/excitatory character; $L_{ij} = \emptyset$ means that there is no synapse from the $i$-th input neuron to the $j$-th dendrite, therefore $i \notin I_j$; and $p_j \in \{-1,1\}$ encodes the inhibitory/excitatory response of the whole dendrite. The complete neuron activation is computed as

$$\tau(\mathbf{x}) = \bigwedge_{k=1}^{j} \tau_k(\mathbf{x}). \qquad (4)$$

Eq. (3) can be rewritten as
$$\tau_j(\mathbf{x}) = p_j \bigwedge_{i \in I_j} \left[ p^{l}_{ij} + (x_i - l_{ij}) \right] \wedge \left[ p^{u}_{ij} + (u_{ij} - x_i) \right], \qquad (5)$$

where $l_{ij} = -w^{1}_{ij}$ and $u_{ij} = -w^{0}_{ij}$ are the lower and upper limits of the interval defined on the range of the $i$-th input by the $j$-th dendrite, respectively. Moreover, $p^{l}_{ij}, p^{u}_{ij} \in \{0,\infty\}$ control the inclusion of the corresponding interval limit in the model; for instance, $L_{ij} = \emptyset$ corresponds to $p^{l}_{ij} = p^{u}_{ij} = \infty$. Notice that $\tau_j(\mathbf{x}) \geq 0$ when $x_i \in [l_{ij}, u_{ij}]$ for all $i \in I_j$ and $p_j = 1$, therefore contributing to assigning $\mathbf{x}$ to class 1. Dendrites with $p_j = -1$ are contributions to class 0.

2.2. Training the SLMN

A constructive algorithm to fit an SLMN to a classification dataset was provided in [21]; it is specified in Algorithm 1 below. The algorithm starts by building a hyperbox enclosing all pattern samples of class 1, that is, $C_1 = \{n : t_n = 1\}$. Then, dendrites are added to the structure trying to remove misclassified patterns of class 0 that fall inside this hyperbox [14,10,11]. In step 6 the algorithm selects at random one such misclassified pattern, computes its minimum Chebyshev distance to a class 1 pattern, and uses the patterns at this distance from the misclassified pattern to build a hyperbox that is removed from the initial $C_1$ hyperbox.
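To make Eqs. (3)–(5) concrete, here is a small Python sketch (our own illustration, not the authors' code) of a dendrite implemented as an interval check per Eq. (5), with the total activation taken as the minimum over dendrites as in Eq. (4); a sample is assigned to class 1 when the activation is non-negative.

```python
import numpy as np

def dendrite_response(x, lower, upper, p_j=1.0):
    # Eq. (5): tau_j(x) = p_j * min_i min( x_i - l_ij, u_ij - x_i ).
    # Every input dimension is assumed to carry both limits
    # (p^l_ij = p^u_ij = 0); an infinite limit would make a dimension irrelevant.
    margins = np.minimum(x - lower, upper - x)
    return p_j * margins.min()

def neuron_activation(x, dendrites):
    # Eq. (4): total response is the minimum over all dendrite responses.
    return min(dendrite_response(x, lo, up, p) for lo, up, p in dendrites)

def classify(x, dendrites):
    # Heaviside of Eq. (1): class 1 iff the activation is non-negative.
    return 1 if neuron_activation(x, dendrites) >= 0 else 0

# Dendrite 1: box enclosing class 1 (excitatory, p_1 = +1).
# Dendrite 2: a box carved out of it (inhibitory, p_2 = -1).
dendrites = [
    (np.array([0.0, 0.0]), np.array([4.0, 4.0]), +1.0),
    (np.array([1.0, 1.0]), np.array([2.0, 2.0]), -1.0),
]
print(classify(np.array([3.0, 3.0]), dendrites))  # inside the big box only -> 1
print(classify(np.array([1.5, 1.5]), dendrites))  # inside the removed box -> 0
```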


This model can approximate any compact data distribution in a high-dimensional Euclidean space within a specified tolerance; a constructive proof is given in [21]. The constructive proof parallels the structure of the ad hoc learning algorithm specified in Algorithm 1. The ad hoc learning algorithm achieves a perfect fit of the training data; however, there is a high risk of overfitting and lack of generalization, which has already motivated works such as [4].

Algorithm 1. Dendritic computing learning based on elimination.

Training set $T = \{(\mathbf{x}_n, t_n);\ \mathbf{x}_n \in \mathbb{R}^d,\ t_n \in \{0,1\};\ n = 1,\ldots,N\}$.
1. Initialize $j = 1$, $I_j = \{1,\ldots,d\}$, $P_j = \{1,\ldots,m\}$, $L_{ij} = \{0,1\}$, $l_{i1} = \bigwedge_{t_n=1} x_{ni}$, $u_{i1} = \bigvee_{t_n=1} x_{ni}$, $\forall i \in I_1$.
2. Compute the response of the current dendrite $D_j$, with $p_j = (-1)^{\mathrm{sgn}(j-1)}$: $\tau_j(\mathbf{x}_n) = p_j \bigwedge_{i \in I_j} [p^{l}_{ij} + (x_{ni} - l_{ij})] \wedge [p^{u}_{ij} + (u_{ij} - x_{ni})]$, $\forall n \in P_j$, where $p^{l}_{ij} = 0$ if $1 \in L_{ij}$, $p^{u}_{ij} = 0$ if $0 \in L_{ij}$, otherwise $p^{l}_{ij} = \infty$, $p^{u}_{ij} = \infty$.
3. Compute the total response of the neuron: $\tau(\mathbf{x}_n) = \bigwedge_{k=1}^{j} \tau_k(\mathbf{x}_n)$, $n = 1,\ldots,N$.
4. If $\forall n\ (f(\tau(\mathbf{x}_n)) = t_n)$, the algorithm stops here with perfect classification of the training set.
5. Create a new dendrite: $j = j+1$, $I_j = I' = X = E = H = \emptyset$, $D = C_1$.
6. Select $\mathbf{x}_\gamma$ such that $t_\gamma = 0$ and $f(\tau(\mathbf{x}_\gamma)) = 1$.
7. $\mu = \bigwedge_{n \neq \gamma} \{ \bigvee_{i=1}^{d} |x_{\gamma i} - x_{ni}| : n \in D \}$.
8. $I' = \{ i : |x_{\gamma i} - x_{ni}| = \mu,\ n \in D \}$; $X = \{ (i, x_{ni}) : |x_{\gamma i} - x_{ni}| = \mu,\ n \in D \}$.
9. $\forall (i, x_{ni}) \in X$: (a) if $x_{\gamma i} > x_{ni}$ then $l_{ij} = x_{ni}$, $E_{ij} = \{1\}$; (b) if $x_{\gamma i} < x_{ni}$ then $u_{ij} = x_{ni}$, $H_{ij} = \{0\}$.
10. $I_j = I_j \cup I'$; $L_{ij} = E_{ij} \cup H_{ij}$.
11. $D' = \{ n \in D : \forall i \in I_j,\ l_{ij} < x_{ni} < u_{ij} \}$. If $D' = \emptyset$ then go to step 2, else set $D = D'$ and go to step 7.
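The sketch below is a deliberately simplified Python rendering of the elimination idea behind Algorithm 1 (it is not the authors' implementation, and it carves out square Chebyshev boxes instead of the tighter per-dimension boxes of steps 8–10): starting from the bounding box of class 1, while some class-0 training pattern is still classified as 1, it removes a box around that pattern whose half-width is its minimum Chebyshev distance to the class-1 patterns.

```python
import numpy as np

def cheb(a, b):
    # Chebyshev (max-norm) distance between two patterns.
    return np.max(np.abs(a - b))

def classify(x, box, removed):
    # Class 1 iff x lies in the enclosing box and outside every removed box.
    lo, up = box
    if np.any(x < lo) or np.any(x > up):
        return 0
    return 0 if any(np.all(np.abs(x - c) < r) for c, r in removed) else 1

def train_elimination(X, t):
    C1 = X[t == 1]
    box = (C1.min(axis=0), C1.max(axis=0))    # step 1: class-1 hyperbox
    removed = []
    while True:
        wrong = [n for n in range(len(X))
                 if t[n] == 0 and classify(X[n], box, removed) == 1]
        if not wrong:                          # step 4: perfect training fit
            break
        g = wrong[0]                           # step 6 (paper picks at random)
        mu = min(cheb(X[g], c1) for c1 in C1)  # step 7: min Chebyshev distance
        if mu == 0:                            # identical patterns: give up
            break
        removed.append((X[g].copy(), mu))      # steps 8-10: carve out a box
    return box, removed

# Toy usage: a class-0 point sitting inside the class-1 bounding box.
X = np.array([[0., 0.], [4., 4.], [0., 4.], [4., 0.], [2., 2.]])
t = np.array([1, 1, 1, 1, 0])
box, removed = train_elimination(X, t)
print([classify(x, box, removed) for x in X])  # matches t on the training set
```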

2.3. Lattice kernels

One out of many configurations of the dendritic model of Eq. (5) is obtained when we define the dendrites as boxes around each data sample:

$$\lambda_n(\mathbf{x},\mathbf{x}_n) = \bigwedge_{i=1}^{d} \left[ p^{l}_{ni} + (x_i - (x_{ni} - \varepsilon^{l}_{ni})) \right] \wedge \left[ p^{u}_{ni} + ((x_{ni} + \varepsilon^{u}_{ni}) - x_i) \right], \qquad (6)$$
where the parameters $p^{l}_{ni}, p^{u}_{ni} \in \{0,\infty\}$ specify the existence of the lower/upper limit of the interval around the $i$-th dimension of the data sample, respectively, and the size of this interval is specified by $\varepsilon^{l}_{ni}, \varepsilon^{u}_{ni} \in \mathbb{R}_{+}$. The response of the single layer lattice kernel neuron (SLKN) is computed as

$$\tau(\mathbf{x}) = \bigvee_{n=1}^{N} \left( \rho_n + \lambda_n(\mathbf{x},\mathbf{x}_n)\, p_n \right), \qquad (7)$$

where $p_n \in \{-1,1\}$ specifies the excitatory/inhibitory role of the kernel; when $\rho_n = -\infty$ the corresponding data sample does not contribute to the computation of the SLKN.
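A minimal Python sketch of the lattice kernel of Eq. (6) and the SLKN activation of Eq. (7) follows (our illustration, with simplified parameter handling: every dimension carries both limits, and a kernel is silenced by setting its gate to minus infinity).

```python
import numpy as np

def lattice_kernel(x, xn, eps_l, eps_u):
    # Eq. (6) with p^l_ni = p^u_ni = 0: a box of half-widths eps_l / eps_u
    # around the training sample xn; non-negative iff x falls inside it.
    return np.minimum(x - (xn - eps_l), (xn + eps_u) - x).min()

def slkn_activation(x, X_train, eps_l, eps_u, p, rho):
    # Eq. (7): max over kernels of rho_n + p_n * lambda_n(x, x_n);
    # rho_n = -inf silences kernel n, p_n in {-1, +1} sets its sign.
    return max(rho_n + p_n * lattice_kernel(x, xn, el, eu)
               for xn, el, eu, p_n, rho_n in zip(X_train, eps_l, eps_u, p, rho))

def classify(x, *params):
    return 1 if slkn_activation(x, *params) >= 0 else 0

# Two active class-1 kernels and one silenced sample (rho = -inf).
X_train = np.array([[0., 0.], [3., 3.], [10., 10.]])
eps_l = eps_u = np.full((3, 2), 1.0)
p = np.array([1., 1., 1.])
rho = np.array([0., 0., -np.inf])
print(classify(np.array([0.5, -0.5]), X_train, eps_l, eps_u, p, rho))  # 1
print(classify(np.array([5.0, 5.0]), X_train, eps_l, eps_u, p, rho))   # 0
```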


Propositions 1 and 3 show that the SLKN has general approximation properties: given a labeled data sample, it is possible to build an SLKN mapping exactly each input pattern to its corresponding label.

Proposition 1. A data sample $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$ can be exactly modeled by a single layer lattice kernel neuron $\tau(\mathbf{x}) = \bigvee_{n=1}^{N} (\rho_n + \lambda_n(\mathbf{x},\mathbf{x}_n)\, p_n)$ with lattice kernels defined as in Eq. (6): $t_n = \hat{t}_n = f(\tau(\mathbf{x}_n))$.

Proof. By construction, we show that there is at least one configuration of parameters achieving such exact modeling. Assume that $\rho_n = -\infty$ if $t_n = 0$; then the activation of the neuron is
$$\tau(\mathbf{x}) = \bigvee_{\substack{n=1 \\ t_n = 1}}^{N} \left( \rho_n + \lambda_n(\mathbf{x},\mathbf{x}_n)\, p_n \right), \qquad (8)$$
and assume that $p_n = 1$, $\rho_n = p^{l}_{ni} = p^{u}_{ni} = 0$ if $t_n = 1$. Then, for all data samples $\mathbf{x}_k$ such that $t_k = 1$, we have that $\lambda_k(\mathbf{x}_k,\mathbf{x}_k) > \lambda_n(\mathbf{x}_k,\mathbf{x}_n)$, $\forall n \neq k$. Moreover, $\lambda_k(\mathbf{x}_k,\mathbf{x}_k) = \bigwedge_i (\varepsilon^{l}_{ki} \wedge \varepsilon^{u}_{ki}) \geq 0$. Therefore $\tau(\mathbf{x}_k) \geq 0$ and $\hat{t}_k = 1 = t_k$. For data samples $\mathbf{x}_k$ such that $t_k = 0$, we have that $\lambda_n(\mathbf{x}_k,\mathbf{x}_n) < 0$ if $|x_{ki} - x_{ni}| > \varepsilon^{l}_{ni}, \varepsilon^{u}_{ni}$ for some dimension $i$, for all $n \neq k$. Then $\tau(\mathbf{x}_k) < 0$ according to Eq. (8) and $\hat{t}_k = 0 = t_k$. □

Corollary 2. A data sample $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$ can be exactly modeled, $t_n = \hat{t}_n = f(\tau(\mathbf{x}_n))$, by a single layer lattice kernel neuron $\tau(\mathbf{x}) = \bigvee_{n=1}^{N} \lambda_n(\mathbf{x},\mathbf{x}_n)$ with lattice kernels defined as in Eq. (6) when $\varepsilon^{l}_{ni} = \varepsilon^{u}_{ni} = 0$.

Proof. Immediate from Proposition 1. □

From Corollary 2 we obtain the degenerate expression of the single layer lattice kernel neuron that would classify as class 0 any point different from the data samples labelled as class 1. Another variation of the lattice kernel can be formulated computing the absolute value:

$$\lambda_n(\mathbf{x},\mathbf{x}_n) = \bigwedge_{i=1}^{d} \left[ p_{ni} - |x_i - x_{ni}| \right], \qquad (9)$$
where $p_{ni} \in \mathbb{R}_{+} \cup \{\infty\}$, so that $\lambda_n(\mathbf{x},\mathbf{x}_n) < 0$ if for any dimension $i$ we have $|x_i - x_{ni}| > p_{ni}$. When $p_{ni} = \infty$ the corresponding dimension becomes irrelevant, as previously. The activation of the SLKN with absolute values (SLKN-A) follows Eq. (7).

Proposition 3. A data sample $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$ can be exactly modeled by an SLKN-A $\tau(\mathbf{x}) = \bigvee_{n=1}^{N} (\rho_n + \lambda_n(\mathbf{x},\mathbf{x}_n)\, p_n)$ with lattice kernels defined as in Eq. (9): $t_n = \hat{t}_n = f(\tau(\mathbf{x}_n))$.

Proof. By construction, we show that there is at least one configuration of parameters achieving such exact modeling. Assume that $\rho_n = -\infty$ if $t_n = 0$; then the activation of the neuron follows Eq. (8) with $p_n = 1$, $\rho_n = 0$, $p_{ni} \geq 0$ if $t_n = 1$. Then, for all data samples $\mathbf{x}_k$ such that $t_k = 1$, we have that $\lambda_k(\mathbf{x}_k,\mathbf{x}_k) > \lambda_n(\mathbf{x}_k,\mathbf{x}_n)$, $\forall n \neq k$, if $p_{ni} < |x_{ki} - x_{ni}|$ for all $n \neq k$. Moreover, $\lambda_k(\mathbf{x}_k,\mathbf{x}_k) = \bigwedge_i p_{ki} \geq 0$. Therefore $\tau(\mathbf{x}_k) \geq 0$ and $\hat{t}_k = 1 = t_k$. For data samples $\mathbf{x}_k$ such that $t_k = 0$, we have that $\lambda_n(\mathbf{x}_k,\mathbf{x}_n) < 0$ if $|x_{ki} - x_{ni}| > p_{ni}$ for some $i$, for all $n \neq k$. Then $\tau(\mathbf{x}_k) < 0$ according to Eq. (8) and $\hat{t}_k = 0 = t_k$. □

3. Learning lattice kernel neurons


The main difficulty faced by training algorithms for lattice models is that they are not differentiable; therefore it is not feasible to define learning as the gradient descent minimization of some error or energy function defined on the model output. Constructive algorithms such as Algorithm 1 are in fact greedy algorithms that often provide overfitted solutions. This problem has also been faced by other lattice computing approaches, such as FLN and FLR [11,10,14], resorting to genetic algorithms to find the optimal parameter values. Here we propose to apply a Monte Carlo method to the search for optimal or near-optimal parameter settings.

Let us focus on the more complex model, the SLKN of Eqs. (6) and (7). Let us denote by $\mathbf{p}$ the set of all parameters that must be tuned, including $\rho_n$, $p^{l}_{ni}$, $p^{u}_{ni}$, $\varepsilon^{l}_{ni}$, $\varepsilon^{u}_{ni}$, and $p_n$. Algorithm 2 shows a pseudo-code schema of the Monte Carlo method that we have applied to train the SLKN. In fact, the initial configuration has always been the null configuration, where all parameters are set to infinity or 0 depending on their range. Generating random perturbations involves selecting the parameter to be changed and generating a proposal of an alternative value in its range. The energy function driving the search for the best parameter values is the accuracy on the training set, $E(k) = \sum_{n} \delta(t_n - f(\tau(\mathbf{x}_n; \mathbf{p}(k))))$, where $\delta(x)$ is Dirac's delta function and $f(x)$ is the Heaviside function of Eq. (1). Algorithm 2 follows the pattern of a Simulated Annealing algorithm [15].

Algorithm 2. Monte Carlo method for the training of the SLKN.

Initialize $\mathbf{p}(0)$ randomly, compute $E(0)$
Set the initial temperature $T(0)$
$k = 0$
Repeat
  generate a random candidate configuration $\mathbf{p}'(k)$
  compute $E'(k)$
  $\Delta E = E'(k) - E(k)$
  compute $P_a(\Delta E, T) = e^{\Delta E / T}$, generate random $r \sim U(0,1)$
  if $\Delta E > 0$ or $P_a(\Delta E, T) > r$ then $\mathbf{p}(k+1) = \mathbf{p}'(k)$; $E(k+1) = E'(k)$
  reduce $T$
Until convergence

Some graphical results of the training are shown in Fig. 1 for two toy examples: the XOR problem and Gaussian distributions centered at the XOR points. The algorithm is able to find solutions that fit the training data, producing some inevitable overfitting. The number of parameters and the size of the search space grow combinatorially with the dimension and the number of samples in the training data. A simplification that may allow larger problems to be managed is to consider a single value of the $\varepsilon$ parameters for all the kernels.
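The following Python sketch shows the flavour of Algorithm 2 on a generic parameter vector (an illustrative re-implementation under our own assumptions, not the authors' code): training accuracy is the energy to maximize, a random perturbation is proposed at each step, and worse proposals are accepted with probability $e^{\Delta E / T}$.

```python
import numpy as np

def accuracy(params, X, t, predict):
    # Energy E = number of correctly classified training patterns.
    return sum(predict(x, params) == tn for x, tn in zip(X, t))

def anneal(params0, X, t, predict, perturb, T0=1.0, cooling=0.99, iters=2000,
           rng=np.random.default_rng(0)):
    params, E = params0, accuracy(params0, X, t, predict)
    T = T0
    for _ in range(iters):
        cand = perturb(params, rng)            # random candidate configuration
        E_cand = accuracy(cand, X, t, predict)
        dE = E_cand - E
        if dE > 0 or rng.random() < np.exp(dE / T):
            params, E = cand, E_cand           # accept (always if better)
        T *= cooling                           # reduce the temperature
        if E == len(t):                        # perfect training accuracy
            break
    return params, E

# Toy usage: tune a single threshold classifier on 1-D data.
X = np.array([[0.1], [0.2], [0.8], [0.9]])
t = np.array([0, 0, 1, 1])
predict = lambda x, th: int(x[0] >= th)
perturb = lambda th, rng: th + rng.normal(0, 0.1)
best_th, best_E = anneal(0.95, X, t, predict, perturb)
print(best_th, best_E)  # best_E should reach 4 (all patterns correct)
```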

4. Sparse Bayesian Learning

We consider a supervised classification problem, where $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$ are the training input–target class pairs, $t_n \in \{0,1\}$. The logistic function of Eq. (2) is applied to the data model $y(\mathbf{x}_n;\mathbf{w})$ to obtain a prediction of the probability of the input belonging to class 1. Assuming a Bernoulli distribution for $P(t\,|\,\mathbf{x})$, we write the training set likelihood as
$$P(\mathbf{t}\,|\,\mathbf{w}) = \prod_{n=1}^{N} f(y(\mathbf{x}_n;\mathbf{w}))^{t_n} \left[ 1 - f(y(\mathbf{x}_n;\mathbf{w})) \right]^{1-t_n}. \qquad (10)$$

A prior distribution on the weights, $p(\mathbf{w}\,|\,\boldsymbol{\alpha})$, embodies our assumptions about them; the most adequate for linear models is the zero-mean Gaussian prior distribution over $\mathbf{w}$:
$$p(\mathbf{w}\,|\,\boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathcal{N}(w_i\,|\,0, \alpha_i^{-1}), \qquad (11)$$


where $\boldsymbol{\alpha}$ is the vector of hyperparameters, each $\alpha_i$ moderating the strength of the prior on the corresponding model parameter $w_i$.
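As a small numerical illustration of Eqs. (10) and (11) (our own sketch, independent of the RVM code used later in the paper; all names and data are assumptions), the log-likelihood and log-prior of a weight vector can be computed as follows.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def log_likelihood(w, Phi, t):
    # Eq. (10): Bernoulli likelihood of the targets under y = Phi @ w.
    p = sigmoid(Phi @ w)
    eps = 1e-12                      # numerical guard against log(0)
    return np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

def log_prior(w, alpha):
    # Eq. (11): independent zero-mean Gaussian priors with precisions alpha_i.
    return np.sum(0.5 * (np.log(alpha) - np.log(2 * np.pi) - alpha * w ** 2))

# Toy design matrix with a bias column and one basis function.
Phi = np.array([[1.0, 0.2], [1.0, 0.9], [1.0, 1.5]])
t = np.array([0, 1, 1])
w = np.array([-1.0, 2.0])
alpha = np.array([1.0, 1.0])
print(log_likelihood(w, Phi, t) + log_prior(w, alpha))  # unnormalized log-posterior
```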


Fig. 1. Distribution of class 1 (blue dot region) obtained by training on the (a) XOR, (b) Gaussians centered at the XOR points, (c) the synthetic data used by Tipping [29]. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

In this setting, an infinite value of a hyperparameter implies that the corresponding weight is zero-valued. The hyperparameters are, therefore, a measure of weight relevance. Sparse learning aims to obtain minimal complexity models by selecting only the relevant weights. The distribution of the hyperparameters $p(\boldsymbol{\alpha})$ is often [30] assumed to be a Gamma distribution:
$$p(\boldsymbol{\alpha}) = \prod_{i=0}^{N} \mathrm{Gamma}(\alpha_i\,|\,a, b), \qquad (12)$$

where setting $a = b = 0$ we obtain non-informative hyperpriors with uniform (flat) distributions. Sparse Bayesian Learning is the simultaneous estimation of the data model parameters and the corresponding hyperparameters. It is achieved by the computation of the posterior distribution over all unknowns conditioned on the training data, $p(\mathbf{w},\boldsymbol{\alpha}\,|\,\mathbf{t})$, which can be decomposed as
$$p(\mathbf{w},\boldsymbol{\alpha}\,|\,\mathbf{t}) = p(\mathbf{w}\,|\,\mathbf{t},\boldsymbol{\alpha})\, p(\boldsymbol{\alpha}\,|\,\mathbf{t}), \qquad (13)$$

where the posterior of the parameters given the data and the hyperparameters can be decomposed by Bayes' rule into the data likelihood and the weight priors:
$$p(\mathbf{w}\,|\,\mathbf{t},\boldsymbol{\alpha}) = \frac{p(\mathbf{t}\,|\,\mathbf{w})\, p(\mathbf{w}\,|\,\boldsymbol{\alpha})}{p(\mathbf{t}\,|\,\boldsymbol{\alpha})} \propto p(\mathbf{t}\,|\,\mathbf{w})\, p(\mathbf{w}\,|\,\boldsymbol{\alpha}). \qquad (14)$$

The relevance learning factor in Eq. (13) is the hyperparameter posterior distribution $p(\boldsymbol{\alpha}\,|\,\mathbf{t})$, which can be approximated as $p(\boldsymbol{\alpha}\,|\,\mathbf{t}) \propto p(\mathbf{t}\,|\,\boldsymbol{\alpha})\, p(\boldsymbol{\alpha})$ and maximized with respect to $\boldsymbol{\alpha}$. Because the hyperprior distribution of the hyperparameters is non-informative, the distribution $p(\mathbf{t}\,|\,\boldsymbol{\alpha})$ provides the same information as $p(\boldsymbol{\alpha}\,|\,\mathbf{t})$. If we consider the maximum a posteriori (MAP) estimation
$$(\mathbf{w},\boldsymbol{\alpha})_{\mathrm{MAP}} = \arg\max \{ p(\mathbf{w},\boldsymbol{\alpha}\,|\,\mathbf{t}) \}, \qquad (15)$$
the solution is often approximated by the iterative interleaved computation of the most probable values $\boldsymbol{\alpha}_{\mathrm{MP}} = \arg\max_{\boldsymbol{\alpha}} \{ p(\boldsymbol{\alpha}\,|\,\mathbf{t}) \}$ and $\mathbf{w}_{\mathrm{MP}} = \arg\max_{\mathbf{w}} p(\mathbf{w}\,|\,\mathbf{t},\boldsymbol{\alpha})$, each computed while keeping the other constant, following a coordinate-wise maximization. Maximization is usually performed on the logarithms of the distributions. When the distributions are not analytically tractable, Simulated Annealing or similar Monte Carlo based approaches can be applied.

5. Relevant dendritic computing

We apply the Sparse Bayesian Learning framework to the SLKN, trying to achieve a MAP estimation of the SLKN parameters and hyperparameters as formulated in Eq. (15). We differentiate the two kinds of SLKN parameters:

- $\mathbf{p} = \{ p^{k}_{ni} \in \{0,\infty\},\ k \in \{l,u\} \}$ are relevance indicators. Their distribution can be assumed to be Bernoulli with parameter $P(p^{l}_{ni} = \infty)$.
- $\boldsymbol{\varepsilon} = \{ \varepsilon^{k}_{ni} \in [0,\infty],\ k \in \{l,u\} \}$ define the limits of the lattice kernel boxes. They are not Gaussian distributed because they are always non-negative; therefore we can assume them to be exponentially distributed with parameter $(\alpha^{k}_{ni})^{-1} = E[\varepsilon^{k}_{ni}]$. If $(\alpha^{k}_{ni})^{-1} = \infty$ we can assume the corresponding relevance indicator to be infinite, that is, $p^{k}_{ni} = \infty$.

The data likelihood is computed as
$$P(\mathbf{t}\,|\,\mathbf{p},\boldsymbol{\varepsilon}) = \prod_{n=1}^{N} f(\tau(\mathbf{x}_n;\mathbf{p},\boldsymbol{\varepsilon}))^{t_n} \left[ 1 - f(\tau(\mathbf{x}_n;\mathbf{p},\boldsymbol{\varepsilon})) \right]^{1-t_n}, \qquad (16)$$
where we apply the logistic function of Eq. (2) to the SLKN output $\tau(\mathbf{x}_n;\mathbf{p},\boldsymbol{\varepsilon})$. The prior distribution of the limits of the lattice kernel boxes can be formulated as a spike and slab mixture prior [27] as follows:
$$p(\boldsymbol{\varepsilon}\,|\,C,\boldsymbol{\alpha}) = \prod_{n=1}^{N} \prod_{i=1}^{d} \prod_{k \in \{l,u\}} \left[ (1-C)\,\delta((\varepsilon^{k}_{ni})^{-1}) + C\,\mathcal{E}(\varepsilon^{k}_{ni}\,|\,(\alpha^{k}_{ni})^{-1}) \right], \qquad (17)$$

where $\boldsymbol{\varepsilon} = \{ \varepsilon^{l}_{ni}, \varepsilon^{u}_{ni} \in [0,\infty] \}$ is the vector composed of the lattice kernel box limits and $\delta(x)$ is Dirac's delta function, so that $\delta((\varepsilon^{k}_{ni})^{-1})$ becomes 1 when the parameter is infinite, i.e., irrelevant. The binary variables $\{ p^{k}_{ni} \in \{0,\infty\};\ k \in \{l,u\} \}$ are such that $\varepsilon^{k}_{ni}\,|\,(p^{k}_{ni} = \infty) \sim \delta((\varepsilon^{k}_{ni})^{-1})$ and $\varepsilon^{k}_{ni}\,|\,(p^{k}_{ni} = 0) \sim \mathcal{E}(\varepsilon^{k}_{ni}\,|\,(\alpha^{k}_{ni})^{-1})$, where $\mathcal{E}(x\,|\,\lambda)$ denotes the exponential distribution with parameter $\lambda$. This prior corresponds to $p^{k}_{ni} = \infty \sim \mathrm{Bernoulli}(1-C)$, and the number of relevant parameters follows a binomial distribution, $|\{ p^{k}_{ni} \neq \infty \}| \sim \mathrm{Bin}(2Nd, C)$. The prior of the lattice kernel box limits conditioned on the knowledge of the relevance parameters is
$$p(\boldsymbol{\varepsilon}\,|\,\mathbf{p},\boldsymbol{\alpha}) = \prod_{\{n,i,k\,:\,p^{k}_{ni} = 0\}} \mathcal{E}(\varepsilon^{k}_{ni}\,|\,(\alpha^{k}_{ni})^{-1}), \qquad (18)$$

and the prior of the relevance indicators is, as said before,
$$p(\mathbf{p}\,|\,C) = \prod_{\{n,i,k\}} C^{\,\delta(p^{k}_{ni})} (1-C)^{\,(1-\delta(p^{k}_{ni}))}; \qquad (19)$$

the joint prior can be decomposed as $p(\boldsymbol{\varepsilon},\mathbf{p}\,|\,C,\boldsymbol{\alpha}) = p(\boldsymbol{\varepsilon}\,|\,\mathbf{p},\boldsymbol{\alpha})\, p(\mathbf{p}\,|\,C)$. The computation of the most probable SLKN parameters $(\mathbf{p},\boldsymbol{\varepsilon})_{\mathrm{MP}}$ maximizes the logarithm of the posterior $p(\boldsymbol{\varepsilon}\,|\,\mathbf{t},\mathbf{p},C,\boldsymbol{\alpha}) \propto P(\mathbf{t}\,|\,\mathbf{p},\boldsymbol{\varepsilon})\, p(\boldsymbol{\varepsilon}\,|\,\mathbf{p},\boldsymbol{\alpha})\, p(\mathbf{p}\,|\,C)$:
$$\log p(\boldsymbol{\varepsilon}\,|\,\mathbf{t},\mathbf{p},C,\boldsymbol{\alpha}) = \sum_{n=1}^{N} \left[ t_n \log y_n + (1-t_n)\log(1-y_n) \right] - \sum_{\{n,i,k\,:\,p^{k}_{ni}=0\}} \left[ \log (\alpha^{k}_{ni})^{-1} + \varepsilon^{k}_{ni}\, \alpha^{k}_{ni} \right] + \sum_{\{n,i,k\}} \left[ \delta(p^{k}_{ni})\log C + (1-\delta(p^{k}_{ni}))\log(1-C) \right], \qquad (20)$$

where $y_n = f(\tau(\mathbf{x}_n;\mathbf{p},\boldsymbol{\varepsilon}))$. This maximization can be done applying a Simulated Annealing procedure [15] similar to Algorithm 2, used to train the SLKN. From the definition of the prior on the lattice kernel limits in Eq. (18), if $\alpha^{k}_{ni} = 0$ the corresponding relevance indicator parameter is $p^{k}_{ni} = \infty$. The computation of the most probable hyperparameters $(\boldsymbol{\alpha}, C)_{\mathrm{MP}}$ can then be decomposed into two steps: first, the estimation of $\boldsymbol{\alpha}_{\mathrm{MP}}$; second, the estimation of $C_{\mathrm{MP}} = 1 - (1/2Nd)\sum_{n,i,k}\delta(\alpha^{k}_{ni})$ on the computed $\boldsymbol{\alpha}_{\mathrm{MP}}$. Therefore, we are applying the decomposition $p(C,\boldsymbol{\alpha}\,|\,\mathbf{t}) = p(\boldsymbol{\alpha}\,|\,\mathbf{t})\, p(C\,|\,\boldsymbol{\alpha})\, p(C)$. The estimation of $\boldsymbol{\alpha}_{\mathrm{MP}}$ under a type II maximum likelihood [31] is stated as
$$\boldsymbol{\alpha}_{\mathrm{MP}} = \arg\max_{\boldsymbol{\alpha}} \int P(\mathbf{t}\,|\,\mathbf{p},\boldsymbol{\varepsilon},\boldsymbol{\alpha})\, p(\boldsymbol{\varepsilon}\,|\,\mathbf{p},\boldsymbol{\alpha})\, d\boldsymbol{\varepsilon}.$$

However, we do not have any closed form for the data likelihood as a function of the hyperprior parameters. Therefore, we would need to apply a Monte Carlo approach involving sampling of both the hyperparameter $\boldsymbol{\alpha}$ and the parameter $\boldsymbol{\varepsilon}$ spaces in order to maximize the integral expression. Instead, we can apply an Expectation–Maximization approach [3], computing the expectation of $p(\boldsymbol{\varepsilon}\,|\,\mathbf{p},\boldsymbol{\alpha}_{\mathrm{MP}})$ as the average value $\bar{\varepsilon}^{k}_{ni}$ of the lattice kernel parameters during the Monte Carlo search for $\boldsymbol{\varepsilon}_{\mathrm{MP}}$; therefore $\alpha^{k,\mathrm{new}}_{ni} = (\bar{\varepsilon}^{k}_{ni})^{-1}$ for all $n,i,k$.

Algorithm 3 summarizes the sparse Bayesian parameter estimation process. We assume non-informative hyperpriors; therefore the hyperparameters $\alpha^{k}_{ni}$ are initialized uniformly at the constant value $Nd$. The probability parameter $C$ of the indicator variables is set to its corresponding non-informative value 0.5. In the experimental work we assume that only class 1 parameters are relevant, in order to alleviate the computational cost of the search for $\boldsymbol{\varepsilon}_{\mathrm{MP}}$.

Algorithm 3. The Sparse Bayesian Learning applied to the SLKN.

1. Initialize the hyperparameters at uninformative values: $\alpha^{k}_{ni} = Nd$, $C = 0.5$.
2. Search for the most probable weights $(\boldsymbol{\varepsilon},\mathbf{p})_{\mathrm{MP}}$ by maximizing, with Monte Carlo methods, the log-posterior of Eq. (20), $\boldsymbol{\varepsilon}_{\mathrm{MP}} = \arg\max_{(\boldsymbol{\varepsilon},\mathbf{p})} \log p(\boldsymbol{\varepsilon}\,|\,\mathbf{t},\mathbf{p},C,\boldsymbol{\alpha})$, where $y_n = f(\tau(\mathbf{x}_n;\mathbf{p},\boldsymbol{\varepsilon}))$.
3. Update the hyperparameters: $\alpha^{k,\mathrm{new}}_{ni} = (\varepsilon^{k}_{\mathrm{MP},ni})^{-1}$.
4. Set the relevant parameters: set $p^{k}_{ni} = \infty$ if $\alpha^{k}_{ni} < E$, for a small threshold $E$.
5. $C = 1 - |\{\alpha^{k}_{ni} < E\}| / 2Nd$.
6. Test convergence. If not converged, repeat from step 2.
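The outer loop of Algorithm 3 can be sketched in Python as follows (a schematic re-implementation under our own simplifications, not the authors' Matlab code: class-1 kernels only, a single symmetric half-width per kernel and dimension, and a crude greedy search standing in for the full Monte Carlo maximisation of Eq. (20)).

```python
import numpy as np

rng = np.random.default_rng(0)

def slkn_output(x, X1, eps):
    # Simplified SLKN: class-1 kernels only, one symmetric half-width per
    # kernel and dimension; eps = +inf marks an irrelevant dimension.
    return max(np.min(e - np.abs(x - xn)) for xn, e in zip(X1, eps))

def train_accuracy(eps, X, t, X1):
    pred = np.array([1 if slkn_output(x, X1, eps) >= 0 else 0 for x in X])
    return (pred == t).mean()

def search_eps(eps, X, t, X1, iters=200):
    # Stand-in for the Monte Carlo maximisation of the log-posterior of
    # Eq. (20): greedily perturb half-widths to raise training accuracy,
    # and track the running average used by the E-M style update.
    best, best_acc, running = eps.copy(), train_accuracy(eps, X, t, X1), []
    for _ in range(iters):
        cand = best.copy()
        i = tuple(rng.integers(s) for s in cand.shape)
        if np.isfinite(cand[i]):
            cand[i] = abs(cand[i] + rng.normal(0, 0.2))
            if train_accuracy(cand, X, t, X1) >= best_acc:
                best, best_acc = cand, train_accuracy(cand, X, t, X1)
        running.append(best.copy())
    return best, np.mean(running, axis=0)

def sbl_slkn(X, t, outer=3, thresh=1e-3):
    X1 = X[t == 1]
    eps = np.ones(X1.shape)                      # box half-widths
    alpha = np.full(X1.shape, float(X.size))     # init as in Algorithm 3
    C = 0.5                                      # non-informative init
    for _ in range(outer):
        eps, mean_eps = search_eps(eps, X, t, X1)      # step 2
        alpha = 1.0 / np.maximum(mean_eps, 1e-9)       # step 3: alpha = 1/mean(eps)
        prune = alpha < thresh                         # step 4: alpha ~ 0 -> irrelevant
        eps[prune] = np.inf
        C = 1.0 - prune.mean()                         # step 5: fraction relevant
    return eps, alpha, C

# Toy usage: two class-1 prototypes and two class-0 patterns.
X = np.array([[0., 0.], [3., 3.], [1.5, 1.5], [5., 5.]])
t = np.array([1, 1, 0, 0])
eps, alpha, C = sbl_slkn(X, t)
print(train_accuracy(eps, X, t, X[t == 1]), C)
```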


Table 1
Information about the experimental classification datasets: name of the dataset, number of patterns in the train and test partitions, input data dimension, and best reported result, specified as the percentage of correct classifications ± the standard deviation.

Dataset       #train  #test  Dimension  Best accuracy result reported
Flare solar   666     400    9          67.57 ± 1.82
Breast        200     77     9          75.23 ± 4.63
Titanic       150     2051   3          77.42 ± 1.18
Thyroid       140     75     5          95.80 ± 2.07
Heart         170     100    13         84.05 ± 3.26
Diabetes      468     300    8          76.79 ± 1.63
German        700     300    20         76.39 ± 2.07
Synth         250     1000   2          –

Table 2
Classification accuracy results as the percentage of correct classifications on the test and train data partitions for each of the considered datasets. Train data results are inside brackets.

Dataset       RVM      SLKN      SLKN-G    SLKN-SBL
Flare solar   65(68)   56(55)    55(55)    62(62)
Breast        73(75)   74(99)    73(99)    74(95)
Titanic       77(80)   33(29)    33(29)    74(70)
Thyroid       88(91)   92(99)    91(99)    92(98)
Heart         80(89)   72(100)   67(100)   77(98)
Diabetes      76(79)   74(94)    75(98)    75(94)
German        79(76)   78(100)   73(100)   78(98)
Synth         90(87)   86(86)    87(86)    89(90)

Table 3
Percentage of the training sample which are relevant vectors for the RVM and SLKN approaches, as a measure of model sparsity. Closer to zero means a more parsimonious model.

Dataset       RVM linear  SLKN-G  SLKN   SLKN-SBL
Flare solar   0.05        0.22    0.26   0.16
Breast        0.02        0.17    0.15   0.12
Titanic       0.05        0.19    0.16   0.11
Thyroid       0.02        0.17    0.17   0.10
Heart         0.03        0.2     0.19   0.14
Diabetes      0.01        0.18    0.15   0.12
German        0.01        0.21    0.18   0.14
Synth         0.004       0.11    0.14   0.10


6. Experimental results

For comparison we use the first version of the SparseBayes code provided by Tipping (http://www.miketipping.com/index.php) to train the RVM, because it is more powerful than later versions, though it has some numerical difficulties with large databases. We are publishing Matlab code for RDC on our research group site (http://www.ehu.es/ccwintco/index.php/Relevance_Dendritic_Computing:_codes_and_examples). The "synth" dataset is the one proposed by Ripley [16], as provided on Tipping's site. Additional experimental datasets were downloaded from "Gunnar Raetsch's Benchmark Datasets" (http://www.fml.tuebingen.mpg.de/Members/raetsch/benchmark/?searchterm=benchmark). Table 1 summarizes the characteristics of the datasets: the number of patterns in the train and test partitions, the input data dimension, and the best result reported on the site.

The average classification accuracy on the test set is specified as the percentage of correct classifications, with the standard deviation preceded by ±. We follow the same validation protocol as reported on the referred site, in order to give comparable results; therefore we use the same training and test data partitions provided on the site. For the same reason we do not define validation sets for model selection.

Table 2 presents the classification accuracy results on the train and test partitions of the linear RVM, the SLKN and the SLKN-G trained with Algorithm 2 (SLKN-G means that the interval size $\varepsilon$ is the same for all relevant lattice kernels), and the SLKN trained with Sparse Bayesian Learning (SLKN-SBL). Presenting both train and test accuracies makes it possible to appreciate the generalization capabilities of the algorithms. We find that on some datasets both SLKN-G and SLKN approach the accuracy of the RVM, and they improve on the RVM on the thyroid dataset. On most datasets, the fit of SLKN and SLKN-G on the training set is much better than on the test set. The SLKN-SBL accuracy results are sometimes better than those of the direct Monte Carlo approach, but most of the time they are still below the results of the RVM. Table 3 presents the percentage of the train dataset which is preserved as relevant parameters by each algorithm.


It can be appreciated that our approach does not yet reach the degree of sparsity obtained by the RVM.

7. Conclusions

Parsimonious models are desirable for industrial applications, both for speed of response and for generalization capabilities. In this paper we introduce a single layer lattice kernel neuron (SLKN) which allows the pruning of the relevant input data features by the systematic application of learning approaches aiming to obtain sparse models. We apply a direct Monte Carlo approach to their training as well as a Sparse Bayesian Learning approach. The experimental results obtained on a suite of benchmark datasets are encouraging. The SLKN accuracy compares well with a state of the art sparse training classifier, the RVM. However, we have found it difficult to obtain comparable sparsity for the SLKN. The greatest model reduction is obtained with the SBL approach, but it is still far from the sparsity obtained by the RVM on the same sets.

From a strategic point of view, the paper is the first attempt to embed a lattice computing approach [7] into a Bayesian reasoning framework. Such disparate paradigms can be effectively combined to give efficient classification systems. Still, the intrinsic nature of lattice computing systems imposes the need for computationally intensive Monte Carlo approaches in order to perform parameter optimization. Future work will be directed to improving the tuning of the numerical approaches. We expect that fine tuning of the Monte Carlo parameters may lead to increased sparsity of the models obtained with the SBL approach, as well as to improvements in accuracy. Alternatively, we will try an incremental approach, starting from minimal models and adding lattice kernels, also following a Bayesian approach.

References

[1] A. Abraham, Editorial: hybrid soft computing and applications, Int. J. Comput. Intell. Appl. 8 (1) (2009) v–vii.
[2] A. Barmpoutis, G.X. Ritter, Orthonormal basis lattice neural networks, in: 2006 IEEE International Conference on Fuzzy Systems, 2006, pp. 331–336.
[3] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[4] D. Chyzhyk, M. Graña, A. Savio, J. Maiora, Hybrid dendritic computing with kernel-LICA applied to Alzheimer's disease detection in MRI, Neurocomputing 75 (1) (2012) 72–77.
[5] E. Corchado, A. Herrero, Neural visualization of network traffic data for intrusion detection, Appl. Soft Comput. 11 (2) (2011) 2042–2056.
[6] M. García-Sebastián, A. Savio, M. Graña, J. Villanúa, On the use of morphometry based features for Alzheimer's disease detection on MRI, in: J. Cabestany, F. Sandoval, A. Prieto, J.M. Corchado (Eds.), Bio-Inspired Systems: Computational and Ambient Intelligence, IWANN 2009 (Part I), Lecture Notes in Computer Science, vol. 5517, 2009, pp. 957–964.
[7] M. Graña, A brief review of lattice computing, in: Proceedings of the WCCI 2008, 2008, pp. 1777–1781.
[8] M. Graña, D. Chyzhyk, M. García-Sebastián, C. Hernández, Lattice independent component analysis for functional magnetic resonance imaging, Inf. Sci. 181 (2011) 1910–1928.
[9] M. Graña, A. Gonzalez-Acuña, Towards relevance dendritic computing, in: NABIC 2011, IEEE, 2011, pp. 588–593.
[10] V.G. Kaburlasos, I.N. Athanasiadis, P.A. Mitkas, Fuzzy lattice reasoning (FLR) classifier and its application for ambient ozone estimation, Int. J. Approx. Reasoning 45 (May) (2007) 152–188.
[11] V.G. Kaburlasos, L. Moussiades, A. Vakali, Fuzzy lattice reasoning (FLR) neural computation for weighted graph partitioning, Neurocomputing 72 (10–12) (2009) 2121–2133 (Special Section on Lattice Computing and Natural Computing, Guest Editor: Manuel Graña).
[12] V.G. Kaburlasos, S.E. Papadakis, A granular extension of the fuzzy-ARTMAP (FAM) neural classifier based on fuzzy lattice reasoning (FLR), Neurocomputing 72 (10–12) (2009) 2067–2078.
[13] V.G. Kaburlasos, S.E. Papadakis, A. Amanatiadis, Binary image 2D shape learning and recognition based on lattice computing (LC) techniques, J. Math. Imaging Vision 42 (2–3) (2012) 118–133.
[14] V.G. Kaburlasos, V. Petridis, Fuzzy lattice neurocomputing (FLN): a novel connectionist scheme for versatile learning and decision making by clustering, Int. J. Comput. Appl. 4 (2) (1997) 31–43.
[15] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680.
[16] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
[17] G. Ritter, P. Gader, Fixed points of lattice transforms and lattice associative memories, Advances in Imaging and Electron Physics, vol. 144, Elsevier, 2006, pp. 165–242.
[18] G.X. Ritter, L. Iancu, Single layer feedforward neural network based on lattice algebra, in: Proceedings of the International Joint Conference on Neural Networks, vol. 4, July 2003, pp. 2887–2892.
[19] G.X. Ritter, L. Iancu, A morphological auto-associative memory based on dendritic computing, in: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol. 2, July 2004, pp. 915–920.
[20] G.X. Ritter, L. Iancu, G. Urcid, Morphological perceptrons with dendritic structure, in: The 12th IEEE International Conference on Fuzzy Systems, FUZZ '03, vol. 2, May 2003, pp. 1296–1301.
[21] G.X. Ritter, G. Urcid, Lattice algebra approach to single-neuron computation, IEEE Trans. Neural Networks 14 (2) (2003) 282–295.
[22] G.X. Ritter, G. Urcid, Perfect recall from noisy input patterns with a dendritic lattice associative memory, in: Proceedings of the International Joint Conference on Neural Networks, 2011, pp. 503–510, art. no. 603326.
[23] A. Savio, M. García-Sebastián, M. Graña, J. Villanúa, Results of an adaboost approach on Alzheimer's disease detection on MRI, in: J. Mira, J.M. Ferrández, J.R. Alvarez, F. de la Paz, F.J. Toledo (Eds.), Bioinspired Applications in Artificial and Natural Computation, Lecture Notes in Computer Science, vol. 5602, 2009, pp. 114–123.
[24] A. Savio, M. García-Sebastián, C. Hernández, M. Graña, J. Villanúa, Classification results of artificial neural networks for Alzheimer's disease detection, in: E. Corchado, H. Yin (Eds.), Intelligent Data Engineering and Automated Learning – IDEAL 2009, Lecture Notes in Computer Science, vol. 5788, 2009, pp. 641–648.
[25] A. Savio, M.T. García-Sebastián, D. Chyzhyk, C. Hernández, M. Graña, A. Sistiaga, A. López de Munain, J. Villanúa, Neurocognitive disorder detection based on feature vectors extracted from VBM analysis of structural MRI, Comput. Biol. Med. 41 (2011) 600–610.
[26] J. Sedano, L. Curiel, E. Corchado, E. de la Cal, J.R. Villar, A soft computing based method for detecting lifetime building thermal insulation failures, Integrated Comput. Aided Eng. 17 (2) (2011) 103–115.
[27] K. Sharp, M. Rattray, Dense message passing for sparse principal component analysis, J. Mach. Learn. Res. 9 (2010) 725–732.
[28] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[29] M.E. Tipping, The relevance vector machine, in: S.A. Solla, T.K. Leen, K.-R. Müller (Eds.), Advances in Neural Information Processing Systems, vol. 12, MIT Press, 2000, pp. 652–658.
[30] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res. 1 (2001) 211–244.
[31] M.E. Tipping, Bayesian inference: an introduction to principles and practice in machine learning, in: O. Bousquet, U. von Luxburg, G. Rätsch (Eds.), Advanced Lectures on Machine Learning, Springer-Verlag, New York, 2004, pp. 41–62.
[32] T. Wilk, M. Wozniak, Soft computing methods applied to combination of one-class classifiers, Neurocomputing 75 (1) (2012) 185–193.

Manuel Graña is a full professor at the Computer Science Department of the Universidad del Pais Vasco. His research interests include image processing, artificial neural network architectures and applications, robotics, and computer vision. He has co-edited several books and published more than 50 journal papers and more than a hundred conference papers.

Ana I. Gonzalez-Acuña is an associate professor at the Computer Engineering Department of the Universidad del Pais Vasco. She is interested in artificial neural networks, classification, and lattice computing. She has published more than 10 journal papers.