Regularized ensemble neural networks models in the Extreme Learning Machine framework

Regularized ensemble neural networks models in the Extreme Learning Machine framework

Neurocomputing 361 (2019) 196–211 Contents lists available at ScienceDirect Neurocomputing journal homepage: www.elsevier.com/locate/neucom Regular...

2MB Sizes 0 Downloads 39 Views

Neurocomputing 361 (2019) 196–211

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Regularized ensemble neural networks models in the Extreme Learning Machine framework Carlos Perales-González∗, Mariano Carbonero-Ruz, David Becerra-Alonso, Javier Pérez-Rodríguez, Francisco Fernández-Navarro Universidad Loyola Andalucía, Spain

a r t i c l e

i n f o

Article history: Received 21 January 2019 Revised 9 April 2019 Accepted 1 June 2019 Available online 15 July 2019 Communicated by Dr Lendasse Amaury Keywords: Extreme Learning Machine Ensemble Hierarchy Diversity Negative Correlation

a b s t r a c t Extreme Learning Machine (ELM) has proven to be an efficient and speedy algorithm for classification. In order to generalize the results of standard ELM, several ensemble meta-algorithms have been implemented. On this manuscript, we propose a hierarchical ensemble methodology that promotes diversity among the elements of an ensemble, explicitly through the loss function in the single-hidden-layer feedforward network version of ELM. The diversity term in the loss function is justified using the concept of regularization from the Negative Correlation Learning framework. Statistical tests show that our proposal is competitive in both performance and diversity measures against bagging and boosting ensemble methodologies.

1. Introduction In the field of multi-class classification, Extreme Learning Machines (ELM) have been extensively used for many applications, from traditional databases [1] to image classification [2], time series prediction [3–6] and optimization [7,8]. It was first conceived as a single-hidden-layer feedforward network (SLFN) where connections among input nodes and nodes in the hidden layer were chosen at random [9]. Then, Huang et al. [1] applied the kernel trick, used in SVM and other algorithms based on distances [10] into ELM. This allowed the generalization of the neural ELM as a kernel version where the mapping function is unknown, and only the inner product is known. The generalization of the mapping on its hidden layer gives ELM its extensive approximation capabilities [9,11]. ELM satisfies both Ridge theory [12] and neural network generalization theory [13]. It applies the Ridge theory on a label binarizer in order to achieve classification from multiple-variable regressions. Thus, the ELM concept uses the structure of neural networks, the kernel from Support Vector Machines (SVM) and the mathematical properties of random projection methods.



Corresponding author. E-mail address: [email protected] (C. Perales-González).

https://doi.org/10.1016/j.neucom.2019.06.040 0925-2312/© 2019 Elsevier B.V. All rights reserved.

© 2019 Elsevier B.V. All rights reserved.

Over the last years, ELM has been considered alongside the best classifiers in the state of the art [14], along with SVM [15–18] and Random Forests [19–23]. There are many reasons why ELM has remained useful in the field of data classification and regression [24]. Randomness in the original ELM enables a speedy calculation of the weights in the hidden layer, thus achieving the processing of heavy datasets in reasonable computational time [25,26]. Unfortunately, the random configuration of the parameters associated to the hidden layer can lead to the hidden node outputs being nonuniformly distributed (thus giving rise to poor generalization performance). ELM practitioners usually solve this problem by including a large numbers of hidden nodes in their models. Thus, two related problems appear in ELM framework: non-uniform distribution in the hidden layer outputs and a large number of neurons in this layer. The first problem can be solved avoiding randomness in the hidden layer parameters [27–29] whereas the second can be addressed by either pruning the models after their training [30–32] or estimating the models in an incremental way [8]. It also facilitates the use of ELM as part of greater structures, such as ensembles [33–35] and deep networks [36,37]. In accordance with the deep learning framework in traditional neural networks, Deep Extreme Learning Machines have been used on problems such as brain monitoring [38] or handwriting recognition [26]. Under these hierarchical layer settings, issues with complex

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

data and noise can be tackled in order, before a final classification or regression is made [39]. Deep learning improves performance and reduces the impact of noisy data at the expense of time execution [40–43]. While deep learning methodologies improve performance through building a classifier with high complexity [44], ensemble methodologies look for performance choosing diverse base learners, relying on diversity, regularization and generalization as the key point to improve performance. Additionally, ELM practitioners can also use GPU and distributed programming to reduce the computational burden of those ensemble models where individuals are trained independently [45]. Considering the taxonomy outlined by Ye and Suganthan [46], there are three main varieties within ensembles: • Kernel diversity: Variations in kernel functions, when creating different classifiers, implies a greater diversity in the solutions offered [47]. Such kernel functions are the core of heuristics in the latest Extreme Learning Machines and Support Vector Machines [48,49]. • Data diversity: Given a large enough dataset, training the same model with different subsamples may yield different performances. This instability in the representativeness of training data can be used to our advantage when diversity is the aim. Cross validation [50], boosting [51] and bagging [52] algorithms are included in this strategy for diversity searches [53]. Boosting meta-algorithms are easily implemented in many different classifiers, just selecting and weighting the instances from the training data [54]. Bagging meta-algorithms use random subselections from the training data [55,56]. AdaBoost [57], is the best-known example from boosting methodology, selecting training subsets on an iteration that depends on the performance of the model in the previous iteration by weighting misclassified instances. The case of Adaboost Negative Correlation, for instance, uses and ambiguity term explicitly derived for classification ensembles in order to introduce diversity [58]. • Parameter diversity: There are parameters affecting the performance of the classifiers. Modifications in the loss function, such as the number of nodes in the hidden layer of neural networks, the perceptron functions or regularization hyperparameters, influence the construction of the classifier itself and its final accuracy [59]. To the best of our knowledge, ELM ensemble literature dwells on kernel and data diversity [54,55,58]. The present manuscript aims at parameters as the driving force for diversity in classification. A regularization term for the ensemble is added, looking for avoiding overfitting while improving generalization, similar to the Negative Correlation Learning framework [60–62]. The idea of regularization for a single ELM is extrapolated to an ensemble of ELMs. Outputs weights are made to be as orthogonal among them as possible, in order to promote this ensemble generalization. The paper is organized as follows: first, an explanation of Extreme Learning Machine for classification problems, following the neural version, on Section 2. Our proposal is discussed on Section 3. The concept of diversity using matrix angles is introduced with a metric in Section 3.1. The detailed description of the implementation is then explained in Section 3.5. The experimental framework is presented in Section 4, and the results and empirical comparisons are on Section 5. Lastly, conclusions are discussed on the final segment of the article. 2. 
Extreme Learning Machine for classification problems The parameters of the ELM models are estimated from a training set D = {(xn , yn )}N , where xn ∈ RK is the vector of attributes n=1 of the nth pattern, K is the dimension of the input space (number of attributes in the problem), yn ∈ RJ is the class label assuming

197

the “1-of-J” encoding (yn j = 1 if xn is a pattern of the jth class, yn j = 0 otherwise) and J is the number of classes. Let us denote

⎛ ⎞

y1 . ⎟ . ⎠, where Yj is the jth column of . yN the Y matrix. Under this formulation, the output function of the classifier is defined as:

⎜ Y as Y = (Y 1 , . . . , Y J ) = ⎝



f (x ) = h (x )β,

(1)

where f (x ) ∈ is the output of ELM, β = (β1 , . . . , βJ ) ∈ is the output matrix of coefficients, β j ∈ RD are the weights of the jth output node, h : RK → RD is the mapping function and D is the number of hidden nodes (the dimension of the transformed space). In a classification problem, f(x) is a vector with J elements and fj (x) will be used to denote the jth element  of that vector. Let us also denote H as H = h (x1 ), . . . , h (xN ) ∈ RN×D , the transformation of the training set from the input space to the transformed one. ELM minimizes the following optimization problem [1]: RD×J

RJ



min

β∈RD×J

2 β2 + C H β − Y  ,

(2)

where C ∈ R is a user-specified parameter, aiming to add Tikhonov regularization [63]. Although the ELM problem is typically presented in its matrix form, the final optimization problem is the sum of J separable vector problems, one for each class. In fact, if we disaggregate the objective function by columns, it can be verified that:

β2 + C H β − Y 2 =

J 

β j 2 + C H β j − Y j 2 .

(3)

j=1

The optimization objective for each class can be reformulated as:

β j 2 + C H β j − Y j 2 = βj β j + C (βj H  − Y j )(H β j − Y j ) =

βj β j + C βj H  H β j − 2CY j H β j + Y jY j . (4)

Hence, the optimization problem can be rewritten for each class as:



min

β j ∈R D

βj (I + C H  H )β j − 2CY j H β j ,

(5)

where Y j Y j is constant. The solution to that optimization problem is well known:

βj =

I

C

+ HH

−1

H Y j

(6)

The final solution is obtained after grouping the βj elements by columns:

β=

I C

+ HH

−1

H Y

(7)

This solution can be alternatively written as:

β = H

I C

+ HH

−1 Y

(8)

The two solutions are equivalent as pointed out in the literature [1]. There are two possible implementations of the ELM framework: the neural version [9] and the kernel version [1]. The main difference between these two approaches lies in the way h(x) is computed (for pattern x) and how that affects the H matrix. In the neural implementation of the framework, h(x) can be explicitly computed and it is defined as:

h ( x ) = ( φ ( x; w d , b d ) , d = 1 , . . . , D ) ,

(9)

198

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

where φ (·; wd , bd ) : RK → R is the mapping function of the dth hidden node, wd ∈ RK is the input weight vector associated to the dth hidden node and bd ∈ R is the bias of the dth hidden node. Any nonlinear piecewise continuous function can be used as a mapping function in the ELM framework [64]. In this neural version, the input weights of the hidden nodes are randomly chosen, the output weight matrix, β, is analytically determined using Eq. (7) and the mapping function chosen is typically sigmoidal, i.e.:

φ ( x; w d , b d ) =

1 1 + exp(−wd · x + bd )

(10)

In the kernel implementation of the framework, the h(x) function is an unknown feature mapping. Fortunately, there are certain functions k(xi , xj ) that compute the dot product in another space,

 k(xi , x j ) = h(xi ), h(x j ) (for all xi and xj in the input space). Finally, it is important to clarify that the predicted class label for a test pattern x is stored in a vector  y(x ) ∈ RJ in which all their values are equal to 0 except the element in position







arg max h (x )β , j j=1,...,J

that is equal to 1.

S S   f (x ) = h ( x )β ( s ) = h ( x ) β (s ) s=1

3.1. Diversity measure In the Negative Correlation Learning (NCL) framework, the error functions of the models in the ensemble include a correlation penalty term in order for each neural network to minimize its mean square error (MSE) together with the correlation of the ensemble [60–62]. According to the definition of NCL, the correlation penalty term in the error function acts as a regularization term of the ensemble [58]. As shown in [65], this correlation can be understood as a covariance among the classifiers of the ensemble. In ELM ensembles (1 ) (S ) there are computable output matrices {β , . . . , β }. Thus, the relationship between the Pearson’s correlation coefficient and the cosine will now be explicitly presented. The covariance between two random variables, U and V, measures the relationship between them, and is defined as:

(11)

where μU and μV are the averaged populations of the U and V random variables, respectively. The correlation coefficient is equal to the covariance normalized to 1:

σU ,V E[(U − μU )(V − μV )] =  . σU σV E [(U − μU )2 ]E [(V − μV )2 ]

(12)

From the data sample, Eq. (12) could also be expressed as:



ρU ,V = 

(Ul − u )(Vl − v )] ,  2 2 l (Ul − u ) l (Vl − v ) l[

ul vl 

=

2 l vl

u, v = cos (∠(u, v ) ), uv

(14)

where ul and vl are the lth components of the random variables U and V after the standardization of the sample. Because of this, fostering diversity between neural network parameters is proposed, analyzing the angle between the vectors involved. These vectors will be most different when, |∠(u, v )| = π /2, u,v and therefore u, v = 0. When most similar, uv = ±1. Taking this into account, a metric of diversity among u and v can be described as1

d ( u, v ) = 1 −

u, v2 u2 v2

(15)

There is no standard definition for the angle between two matrices of the same size. For the purpose of the present work it will be defined as the average of the angles of its homologous columns, since ELM optimization problem can be separated by columns, as seen in Eq. (3). (S )

div(β

(1 )

where J

,...,β

S 2

(S )

(13)

 2 β (jk) , β (jl ) 1−     ( k ) 2  ( l ) 2 j=1 β j  β j 

J S 1

) = S J

2

k
(16)

term is introduced to normalize the diversity between

0 y 1 after the summation. The diversity div(β bounded between 0 (β

In this scenario, the goal is to estimate S matrices associated to (1 ) (S ) the output matrices of S ELM models, {β , . . . , β }.

ρU ,V =

l

2 l ul

(1 )

The aim of this Section is to show how to recursively build an ensemble made of S instances of ELM models in which diversity is imposed explicitly in the error function of the model. We will focus on the neural version of the ELM framework and will use the same transformation of the input space for all instances in the ensemble. So h(x) is the same function for all individuals in the ensemble, and the output of the whole ensemble is

σU ,V = E[(U − μU )(V − μV )],



ρU ,V = 

Definition 1. Given matrices β , . . . , β of the same Extreme Learning Machine ensemble, their diversity is given by

3. Methodology proposed: regularized ensemble learning

s=1

where u and v are the averaged sample. Standardizing the data, Eq. (13) can be presented as:

(1 )

(S )

 . . .  β ) and 1 (β

(1 )

(1 )

,...,β

⊥ ... ⊥ β

(S )

(S )

) is

).

3.2. Error function formulation The proposed ensemble model, named Regularized Ensemble Extreme Learning Machine (RE-ELM), promotes diversity in the β parameters sequentially, disaggregating the matrices by columns. Each β(s) is calculated hierarchically, by minimizing error and maximizing div from Eq. (16) in the loss function, respect to (1 ) (s−1 ) previous β , . . . , β from the ensemble. The first component of the ensemble, β(1) , is estimated using the solution in Eq. (7) whereas β(2) is obtained from β(1) ; β(3) from β(1) and β(2) and so on, up to β(S) . Thus, β(s) , s = {2, . . . , S}, is obtained from the following optimization problem:

 min

β (s) ∈RD×J

(s ) 2

β  + C H β

(s )

2

−Y + N

s−1 k=1

αk

J  (s )

(k )

βj , uj

2

 ,

j=1

(17) (k )

where u(jk ) ∈ RD is the normalized vector associated to β j ∈ RD (the jth output vector of the kth individual in the ensemble), N is the number of instances and α k > 0 is a hyperparameter optimized for diversity, associated to the kth component of the ensemble. The N factor is added to balance diversity against the error term in the loss function, because diversity term is O(J), while error term is O(N × J). We propose that α s parameters are also the weights for base learners in the final prediction of the ensemble, thus expecting S (s) is obtained hierarchically s=1 αs = 1. In addition, since each β by adding diversity as a penalty to the loss function in Eq. (17), 1

The dot product is squared aiming to focus solely on the direction of the vector.

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

199

we consider that choosing α 1 > α 2 >  > α S is the best option in order to keep a good performance. Taking into account both conditions, our proposal for αs (s = 1, . . . , S ) is a geometric progression with r < 1, so these parameters will depend on just one hyperparameter r ∈ (0, 1). Since the geometric progression satisfies

Therefore all its eigenvalues are non negative and matrix A(js ) is invertible. 

S−1

The main drawback of the ensemble method proposed is its high computational cost, compared to other ensemble methods. To solve the complete optimization problem, it is necessary to compute (S − 1 )J inverses, one for each component of the ensemble and for each output vector. In this Section, we will describe a method inspired in the Sherman–Morrison formula [66] to reduce the number of inverses to be computed. The formula is built from an invertible square matrix (G) and two vectors (m and v) with the same rank as G. The matrix F = G + mv is invertible if 1 + v G−1 m = 0. If G + mv is invertible, then its inverse is given by:

rs =

s=1

r − rS , 0 < r < 1. 1−r

from the condition

( 1 − r )r

αs =

S

s−1

αs = 1 we have:

s=1

s = 1, . . . , S.

1 − rS

(18)

The RE-ELM optimization problem (associated to the sth component of the ensemble) can also be formulated as the sum of J separable vector problems, one for each class. Hence, we could rewrite the optimization function as: 2

2

β ( s )  + C H β ( s ) − Y  + N =

J

j=1

=

J

s−1

αk

J  (s )

k=1

β j , u(jk)

2

3.4. Calculation of the inverses via the Sherman–Morrison formula

F −1 = G−1 −

j=1

J  2 s−1 β (js) 2 + C H β (js) − Y j 2 + N αk β (js) , u(jk) k=1

 2

2

β (js)  + C H β (js) − Y j  + N

s−1

j=1



 2 αk β (js) , u(jk) . (19)

In this study, we will first rewrite the A(js ) matrix as:

A(js ) = A(js−1) + v(js−1)

A(j1) =

(s )

vj =

2

From Eq. (19), H β j − Y j  can be rewritten as:



 2 H β (js) − Y j  = β (js) H  H β (js) − 2 β (js) H Y j + Y jY j ,

(20)

being Y j Y j a constant term. The diversity term can also be rewritten as: s−1



αk β (js) , u(jk)

2

=

k=1

s−1



=

 (k )

αk β (js) u j

k=1





(s )

βj

  s−1

uj

(k )

αk u j

(k ) 



β (js)

(k ) 

uj

(21)

k=1

Thus, the optimization problem for the sth component of the ensemble and its jth output vector could be formulated as:

  

 (s ) (s ) (s ) (s )  β min A β − 2 H Y β j , j j j j (s )

(22)

β j ∈R D

where A(js ) =

I C

+ HH +

N C

s−1

k=1

αk u(jk) (u(jk) ) . The solution to that

optimization problem (for positive definite A(js ) matrices) is

 −1 β (js) = A(js) H Y j

(23)

since A(js ) is positive definite and inverse can be calculated. Lemma 1. Matrix CI + H  H + nite for s = 2, . . . , S.

N C

s−1

k=1

αk u(jk) (u(jk) ) is positive defi-

Proof. From other studies [1], it is known that matrix

I C

+ HT H

is positive definite. Thus, to verify that A(js ) is positive definite it s−1 suffices to proof that new diversity term N α u(k) (u(jk) ) is C k=1 k j positive definite: s−1  I N x A(js ) x = x x + x H  H x + αk x, u(jk) 2 > 0. C C k=1



(25)

(26)

(27)

The recurrence relation for the calculation of all the inverses involved in the diversity stage is obtained from the Sherman–Morrison formula using A(j1 ) as the initial matrix (for jth output vector). Hence:



(s ) −1

Aj



(s−1 ) −1

= Aj

 βj .

v(js−1)

Nαs (s ) uj . C

 (s )



I + HH. C



3.3. Matrix formulation (s )

(24)

1 + v G−1 m

with

k=1

j=1

G−1 mv G−1

=

I−





   −1 v(js−1) v(js−1) A(js−1)    −1 (s−1) 1 + v(js−1) A(js−1) vj   (s−1) −1  (s−1)  (s−1)

A(js−1)

−1

Aj uj uj Nαs−1    C 1 + Nαs−1 u(s−1) A(s−1) −1 u(s−1) C j j j

× A(js−1)





−1

⎞  β (js−1) β (js−1) ⎜ ⎟ = ⎝I −

  ⎠ 2 −1 (s−1 ) ( s −1 ) ( s −1 ) ( s −1 ) C β j  + β j Aj βj Nαs−1  (s−1) −1 

A(js−1)

−1

× Aj

(28)

So each j component of output matrix β(s) is obtained hierarchically only with the j components of the previous output (1 ) (s−1 ) matrices {β , . . . , β }. 3.5. Algorithmic summary of the proposed ensemble model In this Section, the algorithmic steps required to estimate the parameters of the proposed method will be briefly described. The RE-ELM algorithm starts randomly generating the hidden nodes output matrix, H (Fig. 1, steps 1–3). After that, the algorithm estimates the parameters of the first ELM model (β(1) ) using Eq. (7) (Fig. 1, step 4). The importance of each element of the ensemble is determined according to a geometric progression in which the first elements of the ensemble are more important than

200

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

Fig. 1. RE-ELM training algorithm framework.

the following (Fig. 1, step 5). The last stage of the RE-ELM algorithm consists on the estimation of the parameters associated to (1 ) (S ) the diversity part ({β , . . . , β }) which are obtained recursively from the first ELM model using the Sherman–Morrison formula (Fig. 1, steps 6–11, Eq. (28)). The elements of the ensemble are combined using the α s (s = 1, . . . , S) parameters which are obtained from a geometric progression (Fig. 1, step 5). Thus, the output of the final ensemble model is:

f (x ) =

S

αs h (x )β (s) ,

(29)

s=1

where f(x) is the numerical output of the ensemble model, fj (x) is the jth element of the vector and α s is obtained following Eq. (18). Finally, it is important to clarify that the predicted class label for a test pattern x is included in a vector  y ( x ) ∈ RJ where all values are equal to 0 except the element in position  (s ) arg max j=1,...,J ( Ss=1 αs (h (x )β )) j , that is equal to 1. 4. Experimental framework An extensive experimental study has been performed in order to validate the algorithm presented. A description of the datasets is available in Section 4.1. The measures employed to evaluate the performance are described in Section 4.2, whereas brief descriptions of the algorithms used for comparison are given in Section 4.3. Finally, statistical tests used to validate the obtained results are specified in Section 4.4. 4.1. Datasets The methodology proposed was applied to forty seven datasets extracted from the UCI repository [67]. Table 1 summarizes the properties of the selected datasets. For each dataset, the number

of patterns (Size), attributes (#Attr.), classes (#Classes) and the distribution of instances within classes (Class distribution) are shown. A detailed description of the datasets can be found at the UCI repository. They were chosen for having class and size variety, including binary and multi-class problems. Each dataset has been downloaded and processed into a common format, dropping the missing values.2 Features have been standardized and rescaled following a normal distribution N (0, 1 ). This transformation of the features is extremely important for distance-based classifiers, such as ELM or Support Vector Machine, normalizing the a priori importance among features. Labels have been binarized, following a 1-of-J encoding, as explained in Section 2. Datasets have been partitioned 10-fold, in order to generalize the results of the metrics. On each iteration, the hyperparameters are chosen with a 5-fold nested cross-validation.

4.2. Performance measures In this manuscript, two types of measures have been considered to evaluate the quality of the ensemble models implemented. The first category groups those metrics assessing the performance of the global ensemble model, i.e., the accuracy of the ensemble model in classifying the N patterns for a given dataset { y ( x1 ),  y ( x2 ), . . . ,  y(xN )}, with respect to the true labels {y1 , y2 , . . . , yN }. These measures are: • Accuracy rate (Acc): the proportion of correct predictions from all predictions made. It has been by far the most commonly used metric to assess the performance of classifiers for years

2 A Python repository has been developed by the authors with this goal in mind and uploaded to Github (https://github.com/cperales/uci- download- process).

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

201

Table 1 Characteristics of the datasets. ID

Dataset

Size

#Attr.

#Classes

Class distribution

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

poker-hand nursery mushroom pen-based-recognition-handwritten-digits spambase optical-recognition-handwritten-digits chess-king-rook-vs-king-pawn thyroid-disease-allhyper thyroid-disease-allhypo thyroid-disease-dis thyroid-disease-allbp thyroid-disease-sick car-evaluation banknote-authentication qsar-biodegradation statlog-project-german-credit connectionist-bench mammographic-mass breast-cancer-wisconsin credit-approval balance-scale breast-cancer-wisconsin-diagnostic climate-model-simulation-crashes heart-disease-hungarian cylinder-bands soybean-large thyroid-disease-new-thyroid glass-identification image-segmentation seeds breast-cancer-wisconsin-prognostic parkinsons flags wine hayes-roth monks-problems-1 heart-disease-switzerland monks-problems-3 echocardiogram zoo post-operative-patient spectf-heart hepatitis soybean-small lenses balloons-a balloons-d

25010 12960 8124 7494 4601 3823 3196 2800 2800 2800 2800 2800 1728 1372 1055 1000 990 830 699 690 625 569 540 294 277 266 215 214 210 210 198 195 194 178 132 124 123 122 105 101 90 80 80 47 24 20 16

10 26 106 16 57 64 38 24 24 24 24 24 21 4 41 59 13 5 8 10 4 30 18 4 99 35 5 9 19 7 32 22 48 13 4 6 4 6 10 16 17 44 19 35 4 4 4

10 5 2 10 2 10 2 4 4 2 3 2 4 2 2 2 11 2 2 2 3 2 2 2 2 15 3 6 7 3 2 2 8 3 3 2 5 2 3 7 4 2 2 4 3 2 2

(12493, 10599, 1206, 513, 93, 54, 36, 6, 5, 5) (4320, 4266, 2, 4044, 328) (4208, 3916) (780, 779, 780, 719, 780, 720, 720, 778, 710, 719) (2788, 1813) (376, 389, 380, 389, 387, 37, 377, 387, 38, 382) (1527, 1669) (8, 7, 62, 2723) (154, 2580, 64, 2) (45, 2755) (9, 124, 2667) (2629, 171) (384, 69, 1210, 65) (762, 610) (699, 356) (700, 300) (90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90) (427, 403) (458, 241) (307, 383) (49, 288, 288) (357, 212) (46, 494) (188, 106) (99, 178) (40, 20, 10, 10, 40, 20, 10, 10, 10, 40, 10, 10, 10, 10, 10) (150, 35, 30) (70, 76, 17, 13, 9, 29) (30, 30, 30, 30, 30, 30, 30) (70, 70, 70) (151, 47) (48, 147) (40, 60, 36, 8, 4, 27, 15, 4) (59, 71, 48) (51, 51, 30) (62, 62) (8, 48, 32, 30, 5) (62, 60) (43, 17, 45) (41, 20, 5, 13, 4, 10) (63, 1, 2, 24) (40, 40) (13, 67) (10, 10, 10, 17) (4, 5, 15) (8, 12) (9, 7)

• Diversity (Div): As it appears in Section 3.1, this is a proposed metric for classifier ensembles, optimized in RE-ELM.

[68]. The mathematical expression of Acc is:

Acc =

N 1 I ( y(xn ) = yn ), N

(30)

n=1

where I(·) is the zero-one loss function. • Root Mean Square Error (RMSE): the standard deviation of the differences between predicted values and target values. This metric is optimized in the ELM loss function, and it is defined as:

 

J N  2 1  1 f j (xn ) − yn j . RMSE = J N n=1

(31)

j=1

where fj (xn ) is the numerical output of the model for the jth class. The second category of measures is referred to the ensemble itself, and is not related with the performance on datasets. Our proposed metric, diversity, belongs to this group, based on the Negative Correlation Learning framework:

div(β

(1 )

,...,β

(S )

 2 β (jk) , β (jl ) 1−     ( k ) 2  ( l ) 2 j=1 β j  β j 

J S 1

) = S J

2

k
This metric can only be evaluated when different β (s) share the same mapping from the input space to the transformed one (same H), so it cannot be applied to bagging approaches. Finally, the time (T) required to train each model has also been evaluated, considering cross-validation training time. The total runtime is divided by the number of hyperparameters to cross-validate, making it a consistent measure.

4.3. Description of the algorithms Six other ensemble algorithms have been implemented to be compared with our proposal. Three of them are from the boosting

202

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

Fig. 2. Accuracy: RE-ELM vs NCELM.

Fig. 3. Accuracy: RE-ELM vs BSELM.

family, and two are a bagging ensemble. Standard ELM is also tested.3 RE-ELM Regularized Extreme Learning Machine, previously detailed in Section 3. ELM Extreme Learning Machine, as described in Section 2. BELM Bagging Extreme Learning Machine [55]. Different subsets are selected from original dataset in order to train a base learner. Then, all the ELMs are composed, as an ensemble where they weight the same. In our example, each subset contains 75% of the training dataset. AELM AdaBoost Extreme Learning Machine [54]. It is based on classical idea of AdaBoost, but training instances are weighted and not completely removed from one base 3 A Python library has been developed by the authors with the algorithms used for these experiments and uploaded publicly to Github (https://github.com/ cperales/pyridge).

Fig. 4. Accuracy: RE-ELM vs AELM.

Fig. 5. Accuracy: RE-ELM vs BELM.

learner to the next. From a given H, the different β are approximated, making patterns that were wrongly classified on previous runs more important than the rest. During all iterations, weights remain normalized avoiding overfitting. Thus, a balance is found that offers a more efficient pattern-based classification. BRELM Boosting Ridge Extreme Learning Machine [69]. The result of applying each classifier to the training dataset, without renormalizing using 1-of-J encoding, is added to the next classifier. Prediction of each β(s) is adjusted to  −1 (l ) Y − sl=1 Hβ . NCELM Negative Correlation Extreme Learning Machine [58]. Diversity among the outputs of the base learners is introduced explicitly through an ambiguity term. For each β(s) , training instances are both weighted, depending on previous behavior of the whole ensemble β and penalized with a hyperparameter λ.

(1 )

,...,β

(s−1 )

,

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

Fig. 6. Accuracy: RE-ELM vs BRELM. Fig. 8. RMSE: RE-ELM vs NCELM.

Fig. 7. Accuracy: RE-ELM vs ELM. Fig. 9. RMSE: RE-ELM vs BSELM.

BSELM Bagging Stepwise Extreme Learning Machine [56]. After selecting random subsets from the dataset, each base learner becomes part of the ensemble if it improves accuracy and diversity, using Q-Statistic [70] as a measure of diversity. We have considered only the single-hiddenlayer neural network version of ELM as base learner, same as all the other ensembles. Table 2 shows the grid in which hyperparameters are searched. All these models use the sigmoid activation function. BELM and BSELM come directly from bagging approaches, while AELM, BRELM and NCELM come from boosting approaches. In both cases, diversity is searched by data sampling. Our method, RE-ELM, looks for diversity by tunning the loss function, hence being categorized as parameter diversity [46].

Table 2 Hyperparameters for each model. Algorithm

Ref.

Hyperparameters

ELM BELM AELM BRELM NCELM

[1] [55] [54] [69] [58]

BSELM

[56]

S = 5, C ∈ {10−2 , . . . , 102 }, D ∈ {10, . . . , 50}, r ∈ {0.01, 0.05, 0.1, 0.3, 0.5} C ∈ {10−2 , 0.1, 1, 10, 102 }, D ∈ {10, 20, 30, 40, 50} S = 5, C ∈ {10−2 , . . . , 102 }, D ∈ {10, . . . , 50} S = 5, C ∈ {10−2 , . . . , 102 }, D ∈ {10, . . . , 50} S = 5, C ∈ {10−2 , . . . , 102 }, D ∈ {10, . . . , 50} S = 5, C ∈ {10−2 , . . . , 102 }, D ∈ {10, . . . , 50}, λ ∈ {0.25, 0.5, 1, 5, 10} S = 5, C ∈ {10−2 , . . . , 102 }, D ∈ {10, . . . , 50}

RE-ELM

203

204

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

Fig. 10. RMSE: RE-ELM vs AELM.

Fig. 12. RMSE: RE-ELM vs BRELM.

Fig. 11. RMSE: RE-ELM vs BELM. Fig. 13. RMSE: RE-ELM vs ELM.

4.4. Statistical tests In the present experimental study, testing methodologies are used to provide statistical support for the study and comparison of our proposal with state-of-the-art algorithms. Specifically, non-parametric tests have been used since the initial conditions that guarantee the reliability of the parametric tests may not be satisfied [71]. We have followed the procedure described by Demšar [71] which has been extensively implemented by other research groups in the machine learning community in recent years [72–74]. Thus, in order to determine the statistical significance of the rank differences for each method in the different datasets, we have carried out three Friedman pre-hoc tests with the ranking of Acc, RMSE and Div (in the generalization set) of the models as the test variables [75]. Accordingly, the Bonferroni–Dunn post-hoc test was the nonparametric pairwise multiple comparison procedure

implemented in this research study (as the null hyphothesis of the Friedman test was rejected in the three cases considered) [76]. 5. Results Comparative analysis both for performance and diversity are carried out following the experimental design in Section 4. The results are shown on tables, figures and statistical tests found in Sections 5.1 and 5.2. 5.1. Classification performance The most widespread procedure in the literature compares the performance of different methods by accuracy. Besides, since ELM accomplishes a RMSE optimization process, we also included this measure in the comparative analysis. Results for accuracy appear

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

205

Table 3 Accuracy generalized results. The best result for each dataset is highlighted in bold-face and the second one in italics. ID

Dataset

RE-ELM

ELM

BELM

AELM

BRELM

NCELM

BSELM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

poker-hand nursery mushroom pen-based-recognition-handwritten-digits spambase optical-recognition-handwritten-digits chess-king-rook-vs-king-pawn thyroid-disease-allhyper thyroid-disease-allhypo thyroid-disease-allbp thyroid-disease-dis thyroid-disease-sick car-evaluation banknote-authentication qsar-biodegradation statlog-project-german-credit connectionist-bench mammographic-mass breast-cancer-wisconsin credit-approval balance-scale breast-cancer-wisconsin-diagnostic climate-model-simulation-crashes heart-disease-hungarian cylinder-bands soybean-large thyroid-disease-new-thyroid glass-identification image-segmentation seeds breast-cancer-wisconsin-prognostic parkinsons flags wine hayes-roth monks-problems-1 heart-disease-switzerland monks-problems-3 echocardiogram zoo post-operative-patient spectf-heart hepatitis soybean-small lenses balloons-a balloons-d

0.5036 0.7996 0.9195 0.9304 0.8876 0.8938 0.8238 0.9725 0.9214 0.9568 0.9839 0.9389 0.7658 0.9985 0.8568 0.7420 0.4253 0.8107 0.9614 0.8496 0.9021 0.9684 0.9204 0.8123 0.6531 0.8761 0.9578 0.6511 0.8810 0.9619 0.7985 0.8223 0.4632 0.9722 0.7262 0.6887 0.3834 0.8058 0.4915 0.9722 0.6961 0.7500 0.8595 0.9800 0.8167 1.0000 0.9333

0.5017 0.7905 0.9196 0.9317 0.8774 0.8875 0.8088 0.9725 0.9214 0.9571 0.9839 0.9389 0.7346 0.9818 0.8482 0.7150 0.4222 0.8132 0.9572 0.8454 0.8959 0.9580 0.9186 0.7955 0.6282 0.8759 0.9483 0.6253 0.8762 0.9429 0.7886 0.8248 0.4266 0.9722 0.7328 0.6315 0.3917 0.7635 0.4964 0.9420 0.6753 0.7125 0.7931 0.9750 0.7833 0.9667 0.8083

0.5032 0.8040 0.7021 0.9356 0.6757 0.8789 0.6304 0.9725 0.9214 0.9571 0.9839 0.9389 0.7779 0.8666 0.6825 0.7000 0.4414 0.6542 0.8828 0.6782 0.8944 0.7959 0.9149 0.6778 0.6321 0.8497 0.9303 0.6427 0.8476 0.9524 0.7580 0.7592 0.4605 0.9889 0.6590 0.5860 0.3741 0.6776 0.4428 0.9260 0.6795 0.5625 0.8405 0.9750 0.8167 0.6833 0.6333

0.4993 0.8095 0.6792 0.9568 0.6003 0.9035 0.6088 0.9722 0.9214 0.9557 0.9839 0.9389 0.7667 0.7522 0.6626 0.7000 0.4566 0.5857 0.6517 0.5984 0.8559 0.6773 0.9149 0.6396 0.6426 0.8115 0.9392 0.5473 0.8190 0.9619 0.7059 0.7542 0.4308 0.9610 0.6272 0.5000 0.3753 0.5327 0.4818 0.9590 0.6961 0.5250 0.8405 1.0000 0.7917 0.6333 0.5667

0.5036 0.7912 0.9183 0.9295 0.8843 0.8920 0.8087 0.9725 0.9214 0.9564 0.9839 0.9389 0.7503 0.9978 0.8454 0.7320 0.4131 0.8107 0.9615 0.8438 0.8927 0.9650 0.9242 0.8128 0.6095 0.8650 0.9483 0.6330 0.8714 0.9524 0.7983 0.8034 0.4043 0.9725 0.7174 0.6696 0.3443 0.7622 0.4297 0.9615 0.6836 0.7250 0.8341 0.9750 0.6583 1.0000 0.9750

0.5008 0.7874 0.6586 0.9596 0.5805 0.8943 0.6135 0.9722 0.9211 0.9554 0.9839 0.9379 0.7772 0.7463 0.6626 0.7000 0.4434 0.5772 0.7199 0.5941 0.8800 0.6625 0.9149 0.6396 0.6388 0.8500 0.9532 0.6278 0.7952 0.9524 0.7120 0.7542 0.4442 0.9672 0.6877 0.5000 0.3890 0.5423 0.4911 0.9581 0.6961 0.5125 0.8405 0.9750 0.8667 0.6333 0.5667

0.4961 0.8016 0.9100 0.9340 0.7827 0.7121 0.8135 0.9669 0.9211 0.9582 0.9839 0.7545 0.7648 0.9738 0.4418 0.6647 0.3525 0.8059 0.3448 0.5551 0.8582 0.4474 0.9149 0.3604 0.6426 0.7489 0.9593 0.6667 0.8524 0.9238 0.2370 0.7495 0.3886 0.9889 0.6785 0.5000 0.3805 0.4923 0.5099 0.8581 0.6878 0.5000 0.8405 0.9800 0.8167 0.9889 0.7972

on Table 3, and for RMSE on Table 4. As explained in Section 4.1, performance metrics are calculated over 10 fold. Figs. 2–7 compare the performance in accuracy between our proposal and the comparative methods in pairs. The difference between the accuracy of RE-ELM and other algorithms is represented for each dataset in each figure. Positive values indicate the extent to which our proposal outperforms the other methods. Datasets are ordered on each figure depending on this outperformance in our proposal. For the sake of comparison, upper and bottom limits in all the figures are fixed to 0.4 and −0.1 respectively. We choose these unbalanced limits because in the datasets where our proposal is not competitive, differences are lower than in the cases where RE-ELM wins. As it can be seen in Fig. 3, differences in accuracy between REELM and BSELM are higher than 0.4 for the first 5 datasets (breastcancer-wisconsin, breast-cancer-wisconsin-prognostic, breast-cancerwisconsin-diagnostic, heart-disease-hungarian, qsar-biodegradation). Similar to accuracy, Figs. 8–13 compare the performance in RMSE between our proposal and the comparative methods in pairs. Since a low value of RMSE is searched, the lower the values in the figures, the more our proposal outperforms the other methods. As

with accuracy, datasets are ordered on each figure depending on this outperformance of our proposal. Upper and bottom limits are 50 and −500 respectively. The Friedman test shows statistical significance of the rank differences with α = 0.05, as the confidence interval is C0 = (0, F0.05 ) = (0, 2.1315 ) and the F-distribution statistical values are F ∗ = 13.0244 ∈ / C0 for Acc and F ∗ = 56.5186 ∈ / C0 for RMSE. Based on the rejection of the null hypothesis, the Bonferroni–Dunn post-hoc test [76] is used to compare all classifiers to our proposal both in Acc and in RMSE. This test considers that the performance of any two classifiers is significantly different if their mean ranks differ by at least the critical difference. For each dataset, R = 1 is assigned to the algorithm that achieves best performance. R = 7 is assigned to the algorithm in last place. Table 5 summarizes the output of Bonferroni–Dunn post-hoc test. Acc and RMSE are average values for accuracy and RMSE, respectively. R is the capital letter for Ranks. From the results in Table 5, it can be concluded that our method is significantly better in accuracy than the compared methods, and at least as competitive as BRELM in RMSE. From a purely descriptive point of view, RE-ELM obtains the best

206

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

Table 4 RMSE 10 fold results. The best result from each dataset is highlighted in bold-face and the second one in italics. ID

dataset

RE-ELM

ELM

BELM

AELM

BRELM

NCELM

BSELM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

poker-hand nursery mushroom pen-based-recognition-handwritten-digits spambase optical-recognition-handwritten-digits chess-king-rook-vs-king-pawn thyroid-disease-allhyper thyroid-disease-allhypo thyroid-disease-allbp thyroid-disease-dis thyroid-disease-sick car-evaluation banknote-authentication qsar-biodegradation statlog-project-german-credit connectionist-bench mammographic-mass breast-cancer-wisconsin credit-approval balance-scale breast-cancer-wisconsin-diagnostic climate-model-simulation-crashes heart-disease-hungarian cylinder-bands soybean-large thyroid-disease-new-thyroid glass-identification image-segmentation seeds breast-cancer-wisconsin-prognostic parkinsons flags wine hayes-roth monks-problems-1 heart-disease-switzerland monks-problems-3 echocardiogram zoo post-operative-patient spectf-heart hepatitis soybean-small lenses balloons-a balloons-d

0.7395 0.5282 0.2033 0.4794 0.2523 0.6109 0.3298 0.1407 0.2153 0.1171 0.0581 0.1121 0.5129 0.0814 0.2759 0.3536 0.8927 0.2664 0.1137 0.2591 0.3673 0.1540 0.1734 0.3132 0.4179 0.5672 0.2141 0.6773 0.4806 0.2908 0.3194 0.3171 0.8209 0.2998 0.6006 0.3966 0.8259 0.3268 0.7380 0.2877 0.6184 0.3602 0.2687 0.2937 0.5161 0.2060 0.2055

1.3618 16.8960 0.2116 1.5174 0.2503 21.5124 0.3270 93.1678 46.5506 1.2227 93.8938 0.1121 1.9969 0.1033 0.2759 0.3589 0.9739 0.2702 0.1025 0.2269 0.9774 0.1521 0.1754 0.2747 0.8463 1.2751 0.9588 0.9136 1.2281 0.9438 0.3236 0.2920 6.7567 2.7477 0.9937 0.4046 17.0691 0.3378 2.4084 0.8591 13.4737 0.3693 5.5989 5.2256 0.7748 0.2850 0.3144

49.9197 130.1036 2.1389 1.5187 0.3126 22.8147 2.2558 421.3358 183.6655 1.3697 473.8633 0.0864 1.6499 1.9526 0.3049 0.3464 1.3480 1.9925 1.4832 2.2265 1.1960 1.5254 409.3009 0.3114 2.6958 5.4810 0.8496 1.0615 5.2928 0.7562 0.3008 3.0709 12.8822 18.5901 0.9515 1.0908 49.5085 0.5896 10.2654 1.3318 91.6468 1.1342 79.1274 25.8382 1.1428 11.0684 0.9068

169.2525 264.3685 8.8801 2.4418 1.4627 4.1005 7.8134 293.4042 245.7647 56.4075 494.8521 1.9077 0.8254 9.7270 0.3386 0.3175 0.9268 6.4107 8.3481 7.8044 0.7743 6.9423 288.2339 0.3461 26.3398 11.8553 0.7233 1.5835 23.7272 0.8243 1.2038 136.8265 31.1166 16.9726 0.9147 34.4263 22.3523 6.1334 24.1896 4.9859 139.6952 3.3935 152.7577 132.1555 1.8111 32.4398 0.4121

0.7380 0.5406 0.2085 0.4698 0.2517 0.5997 0.3307 0.1441 0.2117 0.1175 0.0585 0.1136 0.5284 0.0793 0.2840 0.3518 0.9300 0.2664 0.1014 0.2372 0.3725 0.1510 0.1746 0.2774 0.4259 0.5860 0.2257 0.7151 0.5052 0.2787 0.3210 0.3294 0.8309 0.2981 0.5839 0.3901 0.8645 0.3494 0.8330 0.2792 0.6192 0.3830 0.2601 0.3202 0.6286 0.3178 0.1136

136.6766 354.8782 9.8346 0.8861 1.8839 3.6468 8.0756 247.5193 40.8172 58.6817 496.3852 1.7870 0.7381 9.7081 0.3357 0.3159 1.2914 6.3745 9.0064 8.1598 0.7976 8.3734 284.8894 0.3470 5.9240 0.8488 0.8679 1.4461 22.4396 1.5049 1.0955 144.3558 52.8525 76.0810 0.9109 16.6072 2.7597 5.9841 6.2664 4.6276 101.7169 4.6178 99.2401 337.9482 3.8439 31.8526 0.4101

6.1874 43.1016 0.2067 0.7614 0.3182 6.1668 0.3419 75.3756 54.8497 2.2959 94.1052 0.1130 1.4093 0.1147 0.3113 0.3302 0.9733 0.2690 30.9989 51.8301 0.9568 33.7370 81.1851 31.4170 42.2346 1.2521 0.9804 0.8888 0.6086 3.5484 18.7664 0.3138 5.0227 6.3101 0.9941 35.6359 6.6621 35.6072 1.9249 0.8041 25.6035 30.0402 50.8463 6.4620 0.8725 0.2785 0.3102

Table 5 Test Acc and RMSE averages and rankings together with the results of the non-parametric Bonferroni–Dunn test. Performance

Acc RAcc

RMSE RRMSE

RE-ELM

ELM

BELM

AELM

BRELM

NCELM

BSELM

0.8231 2.0957

0.8076 3.5851

0.7563 4.1702

0.7275 4.7660

0.8094 3.5957

0.7327 4.8511

0.7172 4.9362

RE-ELM

ELM

BELM

AELM

BRELM

NCELM

BSELM

0.3745 1.7340

7.3896 3.5319

43.3746 4.9149

57.2019 5.7234

0.3829 2.0532

55.6513 5.4681

16.8793 4.5745

The best result from each dataset is in bold face and the second one in italics. Bonferroni–Dunn test Acc Compared Methods Control Method

ELM

BELM

AELM

BRELM

NCELM

BSELM

RE-ELM

1.1756•

1.8723•

2.5106•

1.3830•

2.6383•

1.9362•

RMSE Compared Methods Control Method

ELM

BELM

AELM

BRELM

NCELM

BSELM

RE-ELM

1.7979•

3.1809•

3.9894•

0.3192

3.7341•

2.8405•

Bonferroni–Dunn Test: CD(α =0.05) = 1.1756. •: Statistically difference with α = 0.05.

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

207

Table 6 Diversity 10 fold results. The best result from each dataset is highlighted in bold-face and the second one in italics. ID

Dataset

RE-ELM

BRELM

AELM

NCELM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

poker-hand nursery mushroom pen-based-recognition-handwritten-digits spambase optical-recognition-handwritten-digits chess-king-rook-vs-king-pawn thyroid-disease-allhyper thyroid-disease-allhypo thyroid-disease-allbp thyroid-disease-dis thyroid-disease-sick car-evaluation banknote-authentication qsar-biodegradation statlog-project-german-credit connectionist-bench mammographic-mass breast-cancer-wisconsin credit-approval balance-scale breast-cancer-wisconsin-diagnostic climate-model-simulation-crashes heart-disease-hungarian cylinder-bands soybean-large thyroid-disease-new-thyroid glass-identification image-segmentation seeds breast-cancer-wisconsin-prognostic parkinsons flags wine hayes-roth monks-problems-1 heart-disease-switzerland monks-problems-3 echocardiogram zoo post-operative-patient spectf-heart hepatitis soybean-small lenses balloons-a balloons-d

0.6130 0.1472 0.4105 0.6924 0.5913 0.5116 0.6615 0.5281 0.5563 0.6706 0.6572 0.6860 0.6288 0.7517 0.6705 0.5328 0.6192 0.7427 0.6772 0.3021 0.7132 0.6018 0.7038 0.2081 0.3564 0.6573 0.7842 0.6983 0.7348 0.7416 0.5503 0.6886 0.3982 0.6321 0.4889 0.7073 0.5275 0.7304 0.4456 0.6892 0.1803 0.4275 0.3626 0.5994 0.7136 0.6922 0.7021

0.3172 0.3659 0.2857 0.6677 0.1371 0.5836 0.3284 0.3883 0.3368 0.4030 0.0571 0.2912 0.6538 0.3133 0.1222 0.1171 0.7226 0.2920 0.2077 0.2621 0.5664 0.2507 0.0285 0.2756 0.2208 0.4138 0.4546 0.6324 0.4089 0.5939 0.1079 0.0392 0.3577 0.4697 0.6382 0.0703 0.3849 0.2533 0.5255 0.3759 0.1386 0.3095 0.0087 0.1739 0.3233 0.0328 0.1704

0.3138 0.3259 0.3710 0.3493 0.3983 0.4159 0.3799 0.2708 0.2810 0.3340 0.3959 0.2785 0.2863 0.2670 0.3180 0.2789 0.2842 0.2976 0.3250 0.3429 0.2849 0.3579 0.2744 0.2525 0.2977 0.3400 0.3153 0.2501 0.3090 0.3205 0.2299 0.2361 0.2342 0.3654 0.2759 0.2545 0.2824 0.2596 0.2746 0.3322 0.2519 0.2736 0.3347 0.1942 0.2681 0.2424 0.2248

0.4002 0.3707 0.2851 0.7377 0.1674 0.6030 0.3317 0.3989 0.4904 0.3756 0.0784 0.3194 0.6530 0.3158 0.1147 0.1303 0.7077 0.2892 0.2110 0.2644 0.5871 0.2604 0.0267 0.2562 0.2834 0.4974 0.4620 0.6070 0.4317 0.5900 0.1092 0.0431 0.4140 0.4250 0.6609 0.1845 0.4948 0.2551 0.5830 0.2704 0.1512 0.2589 0.1858 0.1722 0.4195 0.0328 0.2020

ranking in both performance measures (RAccRE −E LM = 2.0957 and RRMSERE −E LM = 1.7340), followed by ELM in accuracy (RAcc = 3.5851) and by BRELM in RMSE (RRMSEBRELM = 2.0532). 5.2. Diversity Even though diversity is the measure we explicitly optimize, as seen in Section 3, we consider necessary to ascertain if there are significant differences, in order to achieve a complete comparison of our proposal. As explained in Section 4.2, the values of the diversity metric (16), β (1 ) , . . . , β (S ) from each ensemble needs to be done in the same subspace RD×J , so H in training steps should be equal for all the β (s ) , s = 1, . . . , S. Taking this into account, statistical tests for diversity compare Regularized ELM (RE-ELM) only with AdaBoost ELM (AELM), Boosting Ridge (BRELM) and AdaBoost Negative Correlation ELM (NCELM), but not with Bagging ELM (BELM) or Advanced ELM Ensemble (BSELM). The results presented on Table 6 show how RE-ELM has higher diversity when compared. The nonparametric Friedman test shows that there is a statistically significant difference with α = 0.05, as the confidence interval is C0 = (0, F0.05 ) = (0, 2.6702 ) and the

/ C0 . Bonferroni– F-distribution statistical value is F ∗ = 22.3413 ∈ Dunn post-hoc test is used again, and the output is on Table 7. Fig. 14 shows diversity comparisons against different datasets and different algorithms. Datasets are disposed counterclockwise, drawing the perimeter of a circle. As each line is identified with a method, the further it is from the center of the circle, the higher the diversity the method achieves. The fact that RE-ELM is almost in any dataset over the remaining algorithms indicates that our method outperforms in diversity. 5.3. Execution time Since the Sherman–Morrison theorem is applied, we avoid calculating several matrix inverses. Time results from our proposal are of the same order of magnitude as the other methods. To prove this assertion, Table 8 shows the execution time of every compared method. The execution time shown is the time employed by each method from the beginning of the process until it obtains the final outputs, i.e. the time needed to both train the algorithm and get the metrics, divided by the number of hyperparameters

208

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211 Table 7 Test Div averages and rankings together with the results of the non-parametric Bonferroni–Dunn test. Diversity

Div RDiv

RE-ELM

AELM

BRELM

NCELM

0.5827 1.4255

0.2990 2.9787

0.3208 2.9894

0.3427 2.6064

The best result for each dataset is in bold face and the second one in italics Bonferroni–Dunn test Div Compared Methods Control Method

AELM

BRELM

NCELM

RE-ELM

1.5532•

1.5639•

1.1809•

Bonferroni–Dunn Test: CD(α =0.05) = 0.6376. •: Statistically difference with α = 0.05. Table 8 Execution time in seconds, divided by number of hyperparameters. ID

Dataset

RE-ELM

BELM

AELM

BRELM

NCELM

BSELM

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

poker-hand nursery mushroom pen-based-recognition-handwritten-digits spambase optical-recognition-handwritten-digits chess-king-rook-vs-king-pawn thyroid-disease-allhyper thyroid-disease-allhypo thyroid-disease-allbp thyroid-disease-dis thyroid-disease-sick car-evaluation banknote-authentication qsar-biodegradation statlog-project-german-credit connectionist-bench mammographic-mass breast-cancer-wisconsin credit-approval balance-scale breast-cancer-wisconsin-diagnostic climate-model-simulation-crashes heart-disease-hungarian cylinder-bands soybean-large thyroid-disease-new-thyroid glass-identification image-segmentation seeds breast-cancer-wisconsin-prognostic parkinsons flags wine hayes-roth monks-problems-1 heart-disease-switzerland monks-problems-3 echocardiogram zoo post-operative-patient spectf-heart hepatitis soybean-small lenses balloons-a balloons-d

18.2949 8.4848 9.4866 6.8588 3.9561 5.0749 2.3673 2.7352 2.9876 2.6224 2.1298 2.0969 2.2164 1.6035 1.3267 1.5541 2.6793 1.1703 1.2721 1.1132 1.3886 1.1187 1.1775 0.8477 0.9131 2.9113 1.2877 1.7078 1.9487 1.2769 0.9433 0.7866 2.1542 1.1568 1.0538 0.9273 1.3039 0.9509 1.0616 1.5656 1.2462 0.7622 0.7955 1.1586 1.0163 0.7034 0.7675

11.1708 4.9449 4.1377 3.2510 2.0288 1.9719 1.2096 1.1822 1.3036 1.2058 1.0787 1.0611 0.8675 0.7162 0.5654 0.6364 0.6868 0.4858 0.4806 0.4515 0.4803 0.4563 0.4444 0.3663 0.3662 0.5413 0.3943 0.4438 0.4335 0.4244 0.3605 0.3581 0.4425 0.3732 0.3394 0.3424 0.3710 0.3412 0.3357 0.3715 0.3786 0.3302 0.3100 0.3287 0.3155 0.2950 0.3180

338.5049 82.1875 31.3874 33.8395 12.1925 10.9823 7.0552 6.1399 6.4062 6.2289 5.5005 5.9195 2.8996 1.8912 1.2208 1.2430 1.4299 0.9441 0.8672 0.7726 0.7587 0.7037 0.7198 0.5045 0.4763 0.7070 0.5415 0.5917 0.6054 0.5486 0.5261 0.4621 0.6142 0.4741 0.4630 0.4472 0.4612 0.4562 0.4349 0.4911 0.4625 0.3859 0.3921 0.3884 0.3741 0.3505 0.3871

6.6866 3.1361 3.4059 2.2871 1.5195 1.5323 0.9117 0.9255 0.9993 0.9513 0.8357 0.8340 0.7160 0.5857 0.5182 0.5750 0.6294 0.4402 0.4709 0.4467 0.4546 0.3913 0.4392 0.3371 0.3364 0.5038 0.3841 0.3969 0.4062 0.3765 0.3615 0.3383 0.4285 0.3449 0.3319 0.3247 0.3599 0.3460 0.3261 0.3593 0.3527 0.3043 0.2741 0.2927 0.2802 0.2567 0.2802

1148.2994 294.6805 115.0922 123.0036 46.4871 44.3931 27.4674 24.6717 24.4464 24.2102 23.8768 22.9987 12.6914 8.0187 5.8065 5.5926 6.8819 4.6094 4.0014 3.9612 3.9840 3.4215 3.3574 2.2995 2.4723 3.1082 2.1799 2.3513 2.4837 2.1632 2.0590 2.2238 2.5262 2.0660 1.8826 1.7422 1.9210 1.7685 1.7432 1.9381 1.7079 1.6050 1.5932 1.6519 1.4365 1.4013 1.4553

8.3633 8.5298 9.5851 6.9556 3.9694 5.1232 2.3984 2.8063 3.0764 2.6807 2.1793 2.1398 2.2183 1.6842 1.3363 1.6232 2.7422 1.2592 1.3117 1.1731 1.4512 1.1652 1.1967 0.9463 0.9842 2.9889 1.3273 1.7759 1.9655 1.3360 1.0359 0.8140 2.2071 1.1719 1.1023 0.9360 1.3060 1.0179 1.0713 1.5963 1.2984 0.7872 0.8484 1.1756 1.0353 0.7348 0.7923

each methodology uses in the cross-validation process. Therefore, RE-ELM takes into account 3 hyperparameters; BELM, 2; AELM, 2; BRELM, 2, NCELM, 3, and BSELM, 3. Fig. 15 plots the results from Table 8. As explained in Section 4.1, ID is ordered from highest to lowest according to

the number of instances. As it shows, RE-ELM is more stable than other algorithms when size increases. Thus, Fig. 15 shows RE-ELM as fast as AELM with few instances, and RE-ELM keeps this execution time when datasets with more instances are trained. The difference is especially noticeable for problems with large datasets.

C. Perales-González, M. Carbonero-Ruz and D. Becerra-Alonso et al. / Neurocomputing 361 (2019) 196–211

209

As a consequence of the hierarchical way of generating the base learners and the way in which they are combined, our proposal can also be understood as a generator of classifiers. This deserves further study in some real world problems, where sensitivity to the variables may lead us to some applied solutions to our problem, since each base learner of the ensemble could be considered as a solution by itself.

Declarations of interest None.

References

Fig. 14. Diversity value reached at each dataset.

Fig. 15. Execution time in seconds needed for each dataset.

Time grows excessively in these datasets for AELM or NCELM, but not for our method.

6. Conclusions Our ensemble proposal starts from a standard ELM, and shows a hierarchical way of generating the base learners, each one most different from previous elements of the ensemble, yet minimizing the classification error. As shown in Section 3, the classifiers are combined hierarchically by weighting progressively less in order to achieve diversity without detriment to performance. This idea improves the performance significantly compared to other ensemble methods, as seen in Section 5.1. Our proposal achieves competitive results in both accuracy and RMSE. While Boosting Ridge ELM shows good results in RMSE, it is no longer the case in terms of accuracy against our classifier, which is important in classification problems. The fact that diversity is introduced explicitly in the loss function leads us to such desirable results. As expected, RE-ELM is also a good diverse method, according to Section 5.2.


Carlos Perales-González received his B.Sc. degree in Physics from University of Córdoba, Spain, in 2015, and his M.Sc. degree in Mathematical Engineering from University Complutense of Madrid, Spain, in 2016. He is currently a Ph.D. student in Computer Science at University Loyola Andalucía, Sevilla, Spain. His current research interests include neural networks, supervised machine learning and ensemble classifiers.

Mariano Carbonero-Ruz received his B.Sc. degree in Mathematics from the University of Sevilla, Spain, in 1985, his B.Sc. degree in Economics from the National Distance Education University (UNED), and his Ph.D. in Mathematics from the University of Sevilla in 1995. From 1987 to 2013 he was a Lecturer with ETEA, a private Business Administration faculty affiliated with the University of Córdoba, Spain. Since then he has been an Associate Professor at University Loyola Andalucía. His current research interests include computational intelligence, covering both theoretical research and the application of these methods to real problems in economics and education.

David Becerra-Alonso received his B.Sc. degree in Physics from the University of Córdoba, Spain, in 2005, where he specialized in the simulation of physical systems, and his Ph.D. from the School of Computing at the University of the West of Scotland, Paisley, U.K., in 2010, where he worked on chaotic dynamical systems. He also holds an M.Sc. degree in Bioinformatics from the International University of Andalucía, Seville, Spain. He is currently an Associate Professor at University Loyola Andalucía, Spain. His current research interests include dynamical systems, emergent collective behavior, and machine learning techniques and heuristics.

Javier Pérez-Rodríguez received his B.Sc. in Computer Science from the University of Córdoba, Spain, in 2007, and his M.Sc. in Soft Computing and Intelligent Systems (2012) and Ph.D. in ICT (2015) from the University of Granada. He worked as a Graduate Teaching Assistant at the Department of Computing and Numerical Analysis, University of Córdoba. In 2018 he joined the Department of Quantitative Methods at University Loyola Andalucía as an Associate Professor. His research interests include pattern recognition, evolutionary computation and bioinformatics.


Francisco Fernández-Navarro received his B.Sc. degree in Computer Science from University of Córdoba, Spain, in 2008, and his M.Sc. and Ph.D. in Computer Science and Artificial Intelligence from University of Málaga, Spain, in 2009 and 2011, respectively. He was a Research Fellow in Computational Management with the European Space Agency, Noordwijk, The Netherlands. He is currently an Associate Professor at University Loyola Andalucía, Spain. His current research interests include neural networks, ordinal regression, imbalanced classification, and hybrid algorithms.