Boosting weighted ELM for imbalanced learning


Neurocomputing 128 (2014) 15–21


Kuan Li a, Xiangfei Kong b, Zhi Lu b, Liu Wenyin c,*, Jianping Yin a

a School of Computer Science, National University of Defense Technology, Changsha, China
b City University of Hong Kong, 83 Tat Chee Ave, Kowloon, Hong Kong
c College of Computer Science and Technology, Shanghai University of Electric Power, Shanghai, China

Article history: Received 28 August 2012; Received in revised form 14 May 2013; Accepted 18 May 2013; Available online 25 October 2013

Abstract

Extreme learning machine (ELM) for single-hidden-layer feedforward neural networks (SLFN) is a powerful machine learning technique that has been attracting attention for its fast learning speed and good generalization performance. Recently, a weighted ELM was proposed to deal with data with imbalanced class distributions. The key idea of weighted ELM is that each training sample is assigned an extra weight. Although some empirical weighting schemes were provided, how to determine better sample weights remains an open problem. In this paper, we propose a Boosting weighted ELM, which embeds weighted ELM seamlessly into a modified AdaBoost framework, to solve the above problem. Intuitively, the distribution weights in the AdaBoost framework, which reflect the importance of the training samples, are fed into weighted ELM as training sample weights. Furthermore, AdaBoost is modified in two aspects to be more effective for imbalanced learning: (i) the initial distribution weights are set to be asymmetric so that AdaBoost converges at a faster speed; (ii) the distribution weights are updated separately for different classes to avoid destroying the asymmetry of the distribution weights. Experimental results on 16 binary datasets and 5 multiclass datasets from the KEEL repository show that the proposed method achieves more balanced results than weighted ELM.

Keywords: Extreme learning machine; Weighted extreme learning machine; Imbalanced datasets; AdaBoost

1. Introduction

The extreme learning machine (ELM) approach works for generalized single-hidden-layer feedforward networks (SLFNs) [1–4]. It has been proved effective and efficient in both classification and regression scenarios [3,5,6], and has drawn a significant amount of interest from various fields in recent years, including face recognition [7,8], handwritten character recognition [9], image classification [10], etc. The essence of ELM is that the hidden layer of SLFNs need not be tuned: the hidden neuron parameters are randomly assigned, which may be independent of the training data, and the output weights can be determined analytically by the Moore–Penrose generalized inverse [11]. Compared with other computational intelligence techniques, ELM provides better generalization performance at a much faster learning speed and with the least human intervention.

Imbalanced data sets are quite common in many real applications [12]. Over the years, the classification of imbalanced data sets has attracted tremendous attention [12–15]. Technically speaking, any data set that exhibits an unequal distribution between its classes can be deemed imbalanced. However, the common understanding in the learning community is that an imbalanced data set corresponds to data exhibiting significant imbalance.


* Corresponding author. Tel.: +86 21 6802 9237. E-mail address: [email protected] (L. Wenyin).


When dealing with imbalanced data sets, most standard learning algorithms tend to provide unfavorable accuracy across the classes: the majority classes are often emphasized while the minority classes are neglected. The reason is that these algorithms assume relatively balanced class distributions and equal misclassification costs within imbalanced data sets [13,14]. Typically, researchers have tried to address the issue of class imbalance using the following two kinds of approaches: (i) data-level approaches, which try to re-balance the original data set by re-sampling techniques [16–18]; (ii) algorithmic approaches, which try to modify the classifiers to suit imbalanced datasets [19–22]. A popular algorithmic approach is to assign a different misclassification cost (weight) to each particular training sample [20].

Due to ELM's great advantages, using it to solve the imbalanced classification problem has also drawn a lot of research interest [23,24]. Earlier, weighted regularized ELM was proposed by Toh [25] and Deng et al. [26], but their work did not target the imbalanced data problem. Recently, Zong et al. [24] proposed weighted ELM. The key idea of weighted ELM is that each training sample is assigned an extra weight to strengthen the impact of the minority class while weakening the relative impact of the majority class. Extensive experimental results in their work showed that unweighted ELM tends to be affected by imbalanced datasets, while weighted ELM obtains more balanced classification results and maintains the advantages of the original ELM.



However, the performance of weighted ELM is greatly affected by the training sample weights set by users. Although two empirical schemes, generated from the class information, were provided, how to set better training sample weights remains a problem worth addressing.

In this paper, we propose a Boosting weighted ELM, trying to answer the above question in a novel way. In particular, we embed the weighted ELM proposed in [24] seamlessly into a modified AdaBoost framework. As is known, an important concept in the AdaBoost framework is the distribution weights, which reflect the importance of all training samples. Intuitively, such distribution weights can be used as the training sample weights in weighted ELM. Furthermore, the original AdaBoost framework is adapted in two aspects according to the characteristics of imbalanced learning and weighted ELM: (i) the initial distribution weights are set to be asymmetric, instead of uniform, to make AdaBoost converge at a faster speed; (ii) the distribution weights are updated separately for training samples from different classes to avoid destroying the asymmetry of the distribution weights. Experimental results on 16 binary datasets and 5 multiclass datasets from the KEEL repository show that, under the G-mean metric, the proposed method achieves more balanced results than weighted ELM.

The remainder of this paper is organized as follows. Section 2 outlines related work on ELM, weighted ELM and the evaluation metric used in this paper. The proposed method is presented in Section 3. Some issues related to the proposed method are discussed in Section 4. Experimental results are provided in Section 5. Finally, we conclude in Section 6.

2. Related work

This section briefly introduces the techniques related to ELM and weighted ELM, and a commonly used evaluation metric in imbalanced learning, G-mean.

2.1. Extreme learning machine

The essence of ELM is that the hidden layer of SLFNs need not be tuned: the hidden neuron parameters are randomly assigned, which may be independent of the training data, and the output weights can be determined analytically by the Moore–Penrose generalized inverse [11]. More precisely, given N training samples (x_i, t_i), the mathematical model of the SLFN is

\| H\beta - T \| = 0, \quad \text{or} \quad H\beta = T \qquad (1)

where \beta is the output weight matrix, T is the target matrix and H = [h(x_1); h(x_2); \ldots; h(x_N)] is the hidden layer output matrix. The hidden layer node function h_i(x), i = 1, \ldots, L, also called the feature mapping function, maps a sample x from the original data space to the hidden layer space, forming the hidden layer output row vector h(x) = [h_1(x), \ldots, h_L(x)] of dimensionality L (L nodes). As the keynote feature of ELM, the parameters a_i, b_i in the feature mapping function h_i(x) = G(a_i, b_i, x) are randomly generated according to any continuous probability distribution. Thus, we can analytically obtain the least-squares solution with minimal norm (the output weights \beta) [3]:

\beta = H^{\dagger} T = \begin{cases} H^T \left(\frac{I}{C} + H H^T\right)^{-1} T, & N < L \\ \left(\frac{I}{C} + H^T H\right)^{-1} H^T T, & N \ge L \end{cases} \qquad (2)

A positive parameter C is used here for better generalization performance [27].

ELM can also be explained from the optimization point of view: ELM tries to minimize both \| H\beta - T \|^2 and \| \beta \|. Therefore, the solution of (1) can be obtained [3] from

\text{Minimize:} \quad L_{P_{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|^2
\text{Subject to:} \quad h(x_i)\beta = t_i^T - \xi_i^T, \quad i = 1, \ldots, N \qquad (3)

where \xi_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^T is the training error vector of the m output nodes corresponding to training sample x_i, and C is the trade-off regularization parameter between the minimization of the training errors and the maximization of the marginal distance. The same solution as (2) can be obtained based on the KKT theorem [28].

Note that ELM is capable of both binary and multiclass classification tasks. Assume that the N training samples (x_i, t_i) belong to m classes. For each sample, t_i is typically a vector of length m, where

t_i[j] = \begin{cases} 1, & x_i \in \text{class } j \\ -1, & x_i \notin \text{class } j \end{cases} \qquad (4)

Given a new sample x, the output function of ELM is obtained [3] by

f(x) = \begin{cases} h(x)\, H^T \left(\frac{I}{C} + H H^T\right)^{-1} T, & N < L \\ h(x) \left(\frac{I}{C} + H^T H\right)^{-1} H^T T, & N \ge L \end{cases} \qquad (5)

where f(x) = [f_1(x), \ldots, f_m(x)] is the output function vector. The prediction label of x is then given by (6):

\text{label}(x) = \arg\max_{i \in [1, \ldots, m]} f_i(x) \qquad (6)
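To make the closed-form training of Eqs. (2), (5) and (6) concrete, the following minimal NumPy sketch implements an ELM with a sigmoid hidden layer. It is only an illustration under our own naming (elm_fit, elm_predict), not the authors' implementation, and it assumes targets T encoded as an N x m matrix of +1/-1 entries as in Eq. (4).

```python
import numpy as np

def sigmoid_features(X, A, b):
    """Random feature mapping h(x) with sigmoid additive nodes G(a, b, x)."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))

def elm_fit(X, T, L=100, C=2.0**6, seed=None):
    """Closed-form ELM training, Eq. (2). X: N x d data, T: N x m targets in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    A = rng.uniform(-1.0, 1.0, size=(d, L))   # random input weights a_i
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases b_i
    H = sigmoid_features(X, A, b)             # N x L hidden-layer output matrix
    if N < L:                                 # beta = H^T (I/C + H H^T)^(-1) T
        beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    else:                                     # beta = (I/C + H^T H)^(-1) H^T T
        beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return A, b, beta

def elm_predict(X, model):
    """Output function f(x) of Eq. (5) and the label rule of Eq. (6)."""
    A, b, beta = model
    F = sigmoid_features(X, A, b) @ beta      # N x m output function values
    return F, F.argmax(axis=1)                # scores and predicted class indices
```

The N < L branch solves an N x N system and the N >= L branch an L x L system, mirroring the two cases of Eq. (2).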

2.2. Weighted ELM

Weighted ELM [24] was recently proposed to deal with data with an imbalanced class distribution while maintaining the advantages of the original ELM stated above. In particular, each training sample is assigned an extra weight. Mathematically, an N \times N diagonal matrix W associated with the training samples x_i is defined. Usually, if x_i comes from a minority class, the associated weight W_{ii} is relatively larger than that of samples from a majority class. Therefore, the impact of the minority class(es) is strengthened while the relative impact of the majority class(es) is weakened. Considering the diagonal weight matrix W, the optimization formula of ELM can be revised [24] as

\text{Minimize:} \quad L_{P_{ELM}} = \frac{1}{2}\|\beta\|^2 + C W \frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|^2
\text{Subject to:} \quad h(x_i)\beta = t_i^T - \xi_i^T, \quad i = 1, \ldots, N \qquad (7)

According to the KKT theorem [28], the solution to (7) is

\beta = H^{\dagger} T = \begin{cases} H^T \left(\frac{I}{C} + W H H^T\right)^{-1} W T, & N < L \\ \left(\frac{I}{C} + H^T W H\right)^{-1} H^T W T, & N \ge L \end{cases} \qquad (8)
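As an illustration of Eq. (8), the sketch below adds per-sample weights to the ELM sketch above. The weighting helper follows the W1 scheme of [24] restated in Section 3 (W_ii = 1/#t_i); the function names are ours and only the N >= L branch of Eq. (8) is shown.

```python
import numpy as np

def class_weights_w1(y):
    """Weighting scheme W1 of [24] (see Section 3): W_ii = 1 / #t_i for the class of sample i."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 / counts[y]                    # labels assumed to be 0, ..., m-1

def weighted_elm_fit(H, T, w, C=2.0**6):
    """Solve the N >= L branch of Eq. (8) for a precomputed hidden-layer matrix H (N x L),
    target matrix T (N x m) and per-sample weights w (N,):
        beta = (I/C + H^T W H)^(-1) H^T W T,  with W = diag(w).
    """
    L = H.shape[1]
    HtW = H.T * w                             # H^T W, without forming the explicit N x N diagonal
    return np.linalg.solve(np.eye(L) / C + HtW @ H, HtW @ T)
```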

2.3. Evaluation metrics

In the case of imbalanced learning, conventional evaluation metrics, such as overall accuracy and error rate, fail to provide adequate information about the performance of a classifier [12,13]. Consider a binary classification problem where 99% of the samples belong to one class and the remaining 1% belong to the other: a dumb learner can easily achieve 99% accuracy by assigning all test samples to the majority class. Although such an accuracy looks impressive, the accuracy on the minority class is actually 0. Instead, an evaluation metric commonly used in the imbalanced learning community, G-mean, is adopted in this paper. G-mean is the geometric mean of the recall values of all m classes. Let R_i = n_{ii} / \sum_{j=1}^{m} n_{ij} denote the recall value of class i; then G-mean for both binary and multi-class classification problems is defined as

G\text{-mean} = \left( \prod_{i=1}^{m} R_i \right)^{1/m}

The G-mean metric thus reflects the balance/tradeoff between the recall values of the different classes. Take the dumb classifier of the 99%/1% example above: its G-mean value would be as low as 0, since the recall value of the minority class is 0.

3. Boosting weighted ELM

It can be seen from (8) that the weight matrix W = diag(W_{ii}), i = 1, \ldots, N, plays an important role in weighted ELM. Actually, it determines what degree of re-balancing users are seeking. For the sake of convenience, Ref. [24] empirically proposed two weighting schemes, which are automatically generated from the class information:

\text{Weighting scheme W1:} \quad W_{ii} = \frac{1}{\#t_i}
\text{Weighting scheme W2:} \quad W_{ii} = \begin{cases} 0.618 / \#t_i, & \#t_i > \mathrm{AVG}(\#t_i) \\ 1 / \#t_i, & \#t_i \le \mathrm{AVG}(\#t_i) \end{cases}

where \#t_i is the number of samples belonging to class t_i and \mathrm{AVG}(\#t_i) represents the average number of samples over all classes. Although these two weighting schemes were provided, how to determine training sample weights that achieve better classification performance is still up to users. It remains an open yet meaningful problem. We try to solve it by building a bridge between weighted ELM and the AdaBoost framework.

The AdaBoost (Adaptive Boosting) algorithm [29] is an effective boosting algorithm. It is called adaptive because it uses multiple classifiers (multiple iterations) to generate a single strong learner. These classifiers are trained serially. Before each iteration, the distribution weights on the training samples are adjusted according to the performance of the previous classifiers. Therefore, the distribution weights of the training samples reflect their relative importance, and samples that are often misclassified obtain larger distribution weights than correctly classified samples. This forces the following classifier to concentrate on samples that are hard to classify correctly. This motivates us to embed weighted ELM into AdaBoost and to feed the distribution weights of the training samples directly into weighted ELM as training sample weights.

There are many variants of the AdaBoost algorithm. In this study, AdaBoost.M1 [29] is utilized. Furthermore, based on an analysis of the characteristics of imbalanced datasets and weighted ELM, we make the following two changes to the AdaBoost.M1 framework.

First, in the original AdaBoost framework, the initial distribution weight D_1(x_i) of each sample x_i, i = 1, 2, \ldots, N, is set uniformly to 1/N. Instead, we set the initial distribution weights in an asymmetric way:

D_1(x_i) = \frac{1}{m \cdot \#t_i} \qquad (9)

To recall, m is the number of classes and \#t_i is the number of samples belonging to class t_i. The differences between these two initial distribution weighting schemes will be discussed later.

Second, in the original AdaBoost framework, the distribution weights of the training samples are dynamically updated without reference to the class distribution information, and the normalization procedure operates over all training samples. Formally, the distribution weight D_t(x_i), i = 1, 2, \ldots, N, is updated sequentially in the original AdaBoost by

D_{t+1}(x_i) = \frac{D_t(x_i)\,\exp(-\alpha_t I(\Omega_t(x_i), y_i))}{Z_t} \qquad (10)

where \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} = \frac{1}{2}\ln\frac{\sum_{i:\Omega_t(x_i)=y_i} D_t(x_i)}{\sum_{i:\Omega_t(x_i)\ne y_i} D_t(x_i)} is the weight updating parameter and I(\cdot) is an indicator function

I(\Omega_t(x_i), y_i) = \begin{cases} 1, & \Omega_t(x_i) = y_i \\ 0, & \Omega_t(x_i) \ne y_i \end{cases} \qquad (11)

in which \Omega_t(x_i) is the prediction output of classifier \Omega_t on x_i and Z_t is a normalization factor so that \sum_{i=1}^{N} D_{t+1}(x_i) = 1. Instead, we update the distribution weights separately for samples from different classes. See the pseudo-code of the proposed Boosting weighted ELM, shown in Algorithm 1, for details. The differences between these two operations will also be discussed later.

Algorithm 1. Boosting weighted ELM.
1: Input: training set \aleph = \{(x_i, t_i);\; x_i \in R^n, t_i \in R^m, i = 1, 2, \ldots, N\}, and the number T of weighted ELMs in the final decision rule (also the number of iterations in AdaBoost).
2: Initialization: D_1(x_i) = 1 / (m \cdot \#t_i).
3: for t = 1 to T do
4:   W_t = diag(D_t(x_i)), i = 1, \ldots, N.
5:   Train the weighted ELM classifier \Omega_t based on \aleph and W_t.
6:   Update the distribution weights for each class separately: for class j,
       \alpha_t^j = \frac{1}{2}\ln\frac{\sum_{x_i \in \text{class } j:\, \Omega_t(x_i) = y_i} D_t(x_i)}{\sum_{x_i \in \text{class } j:\, \Omega_t(x_i) \ne y_i} D_t(x_i)}
       \forall x_i \in \text{class } j: \quad D_{t+1}(x_i) = \frac{D_t(x_i)\,\exp(-\alpha_t^j I(\Omega_t(x_i), j))}{Z_t^j}
     where Z_t^j is a normalization factor so that \sum_{x_i \in \text{class } j} D_{t+1}(x_i) = 1/m.
7:   if \sum_{i:\Omega_t(x_i) = y_i} D_t(x_i) \le \sum_{i:\Omega_t(x_i) \ne y_i} D_t(x_i) then set T = t - 1 and abort.
8:   Compute the vote weight of the current weighted ELM classifier \Omega_t:
       \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} = \frac{1}{2}\ln\frac{\sum_{i:\Omega_t(x_i) = y_i} D_t(x_i)}{\sum_{i:\Omega_t(x_i) \ne y_i} D_t(x_i)}
9: end for
10: Test: given an unlabeled sample x, output the weighted-vote label \Theta(x) = \arg\max_k \sum_{t=1}^{T} \alpha_t [\Omega_t(x) = k].
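The sketch below mirrors Algorithm 1 in NumPy. It assumes a weighted ELM trainer and predictor with the interfaces shown (any classifier that accepts per-sample weights would do, for example one built from the weighted ELM sketch in Section 2.2); the helper names and the small numerical safeguards are ours, not part of the original algorithm.

```python
import numpy as np

def boosting_weighted_elm(train_elm, predict_elm, X, y, m, T=10):
    """Sketch of Algorithm 1. train_elm(X, y, w) -> model; predict_elm(model, X) -> labels.
    X: N x d data, y: integer labels in 0..m-1, m: number of classes, T: boosting rounds."""
    counts = np.bincount(y, minlength=m)
    D = 1.0 / (m * counts[y])                      # step 2: asymmetric initialization, Eq. (9)
    models, alphas = [], []
    for _ in range(T):                             # step 3
        model = train_elm(X, y, D)                 # steps 4-5: weighted ELM with W_t = diag(D_t)
        correct = predict_elm(model, X) == y
        good, bad = D[correct].sum(), D[~correct].sum()
        if good <= bad:                            # step 7: no better than chance, so abort
            break
        models.append(model)
        alphas.append(0.5 * np.log(good / max(bad, 1e-12)))   # step 8: vote weight alpha_t
        for j in range(m):                         # step 6: per-class update and normalization
            cls = y == j
            gj = max(D[cls & correct].sum(), 1e-12)
            bj = max(D[cls & ~correct].sum(), 1e-12)
            a_j = 0.5 * np.log(gj / bj)
            Dj = D[cls] * np.exp(-a_j * correct[cls])
            D[cls] = Dj * (1.0 / m) / Dj.sum()     # class j keeps total weight 1/m
    return models, np.asarray(alphas)

def boosted_predict(models, alphas, predict_elm, X, m):
    """Step 10: weighted vote over the trained weighted ELM classifiers."""
    votes = np.zeros((X.shape[0], m))
    for model, a in zip(models, alphas):
        votes[np.arange(X.shape[0]), predict_elm(model, X)] += a
    return np.argmax(votes, axis=1)
```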

4. Discussion

The previous section provided a solution for embedding weighted ELM into a modified AdaBoost framework so that the training sample weights are updated automatically, without human intervention. We know that in AdaBoost the distribution weights of the training samples reflect their relative importance: samples that are often misclassified obtain larger distribution weights than correctly classified samples. Therefore, it is quite intuitive to use the distribution weights of the training samples as the training sample weights in weighted ELM. In the remainder of this section, we focus on explaining the differences between our modified AdaBoost framework and the original AdaBoost framework.

4.1. Initial distribution weights

In the original AdaBoost, the initial distribution weights of all samples are set uniformly to 1/N, where N is the number of training samples. In the context of imbalanced learning, this means that we start the AdaBoost training with an unweighted ELM classifier. In other words, the training sample weights are updated solely by the AdaBoost framework, while the valuable class distribution information is discarded. This results in a much slower convergence of AdaBoost and requires more iteration steps, which roughly translates into longer training and computational time.

Take the binary problem yeast1vs7, with imbalance ratio (6.72; 93.28), for example. The training G-mean values over the AdaBoost iteration steps when applying asymmetric/uniform initial distribution weights are depicted in Fig. 1. When applying asymmetric initial distribution weights, it takes 9 steps for the training process to converge; when applying uniform initial distribution weights, it takes 35 steps. Both cases reach a G-mean value equal to 1.0 on the training set.

Fig. 1. Training G-mean values of yeast1vs7 with AdaBoost iteration steps.

4.2. Distribution weights updating

The initial boosting weights are modified so as to represent the weight asymmetry. However, because boosting re-updates all weights at each iteration, it quickly destroys the initial asymmetry. Although the AdaBoost framework is trained well (its training G-mean value equals 1), the predictor obtained after convergence is usually no different from that produced with uniform initial conditions. Our second natural heuristic is therefore to modify the weight updating scheme. The basic starting point of our weight update scheme, as shown in Algorithm 1, is that each class shares the same amount (sum) of distribution weights. Therefore, in each round of AdaBoost, we can roughly imagine that the quantities of the minority class(es) and majority class(es) are re-balanced into the ratio 1 : 1 : \ldots : 1 over the m classes. Consequently, the distribution weights are re-updated and normalized inside each class. This prevents the weights from gathering in the majority classes, and the asymmetry of the distribution weights is not destroyed during the training process. We will show in the following section that such a weight updating scheme obtains more balanced classification results under the G-mean metric than AdaBoost weighted ELM (for convenience, we denote "the original AdaBoost framework + weighted ELM" as "AdaBoost weighted ELM").
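To make the quantities compared in this section explicit, the following small helpers (our own naming, not the experiment code behind Fig. 1) compute the training G-mean of Section 2.3 and the two initial weight vectors of Section 4.1; plugging either vector into the boosting loop of Section 3 and recording the training G-mean after each round yields curves of the kind discussed above.

```python
import numpy as np

def g_mean(y_true, y_pred, m):
    """Geometric mean of the per-class recalls (Section 2.3); assumes every class is present."""
    recalls = [np.mean(y_pred[y_true == j] == j) for j in range(m)]
    return float(np.prod(recalls) ** (1.0 / m))

def initial_weights(y, m, asymmetric=True):
    """Initial distribution weights: uniform 1/N (original AdaBoost) or asymmetric, Eq. (9)."""
    if not asymmetric:
        return np.full(y.shape[0], 1.0 / y.shape[0])
    counts = np.bincount(y, minlength=m)      # #t_i per class, labels assumed 0..m-1
    return 1.0 / (m * counts[y])
```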

5. Performance evaluation

5.1. Data specification and parameter setting

The main objective of this section is twofold: first, to check the classification ability of the proposed Boosting weighted ELM on imbalanced learning; second, to verify the analysis in the previous sections. We have selected 21 datasets (16 binary-class problems and 5 multi-class problems) with different degrees of imbalance. All datasets are downloaded, with their 5-fold cross-validation partitions, from the KEEL dataset repository (http://sci2s.ugr.es/keel/imbalanced.php). Each result is averaged over 10 runs.

Both Boosting weighted ELM and AdaBoost weighted ELM need to train several weighted ELM classifiers. In our experiments, due to the difference between the convergence speeds stated in Section 4.1, the numbers of weighted ELM learners in Boosting weighted ELM and AdaBoost weighted ELM are set to 10 and 50, respectively. This roughly means that Boosting weighted ELM and AdaBoost weighted ELM take about 10 and 50 times more training/testing time than weighted ELM, respectively. However, considering that the learning speed of ELM is extremely fast, such a cost is affordable in the context of imbalanced learning.

To quantitatively measure the imbalance degree of a dataset, the imbalance ratio (IR) is defined as

\text{Binary:} \quad IR = \frac{\#\text{majority}}{\#\text{minority}}; \qquad \text{Multiclass:} \quad IR = \frac{\max_i(\#t_i)}{\min_i(\#t_i)}, \quad i = 1, \ldots, m

The attributes of the datasets are normalized into [-1, 1]. Details of the datasets can be found in Tables 1 and 2. For each dataset, the number of samples (#Samps.), the number of attributes (#Atts.), the number of classes (#Cl.) and the IR are listed. Both tables are ordered by IR, from lowly to highly imbalanced datasets.

In ELM theory, the feature mapping can be almost any nonlinear continuous piecewise function. In the experiments, we test the algorithms with the Sigmoid additive node, which is a popular choice among researchers. There are two parameters to tune for ELMs with the sigmoid additive node G(a, b, x) = 1 / (1 + \exp(-(a \cdot x + b))): the trade-off constant C and the number of hidden nodes L. Similar to [24], a grid search over C \in \{2^{-18}, 2^{-16}, \ldots, 2^{48}, 2^{50}\} and L \in \{10, 20, \ldots, 990, 1000\} is conducted in search of the optimal result.
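The parameter selection just described can be sketched as a plain grid search. The routine below is only an outline of that protocol under our own naming; fit_and_score stands for any user-supplied function that trains one of the compared models with a given (C, L) and returns its cross-validated G-mean.

```python
import itertools

# Grid from Section 5.1: C in {2^-18, 2^-16, ..., 2^48, 2^50}, L in {10, 20, ..., 990, 1000}.
C_GRID = [2.0 ** k for k in range(-18, 52, 2)]
L_GRID = list(range(10, 1001, 10))

def grid_search(fit_and_score, X, y):
    """Return ((C, L), score) maximizing the cross-validated G-mean."""
    best_pair, best_score = None, -1.0
    for C, L in itertools.product(C_GRID, L_GRID):
        score = fit_and_score(X, y, C=C, L=L)   # e.g. average G-mean over the 5 folds
        if score > best_score:
            best_pair, best_score = (C, L), score
    return best_pair, best_score
```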

5.2. Experimental results

5.2.1. Boosting weighted ELM vs. weighted ELM
From Table 3 it can be seen that, for most datasets, after embedding weighted ELM into the modified AdaBoost framework the G-mean value increases significantly. Even for some datasets where the classes are fairly balanced, e.g. glass and pima, the results of Boosting weighted ELM are better than those of weighted ELM. This shows that the proposed Boosting weighted ELM is applicable not only to imbalanced datasets, but also to well balanced datasets. However, for some datasets where the classes are highly imbalanced, e.g. abalone19, the results of Boosting weighted ELM and weighted ELM are more or less the same.


5.2.2. Boosting weighted ELM vs. AdaBoost weighted ELM
From Table 3 we observe that, for most datasets, AdaBoost weighted ELM performs worse not only than Boosting weighted ELM, but also than weighted ELM. In all tests, AdaBoost weighted ELM performs only slightly better than unweighted ELM. The reason is that the original AdaBoost targets a higher overall accuracy rather than a higher G-mean value: it treats all training samples equally. After several training iterations, the majority class(es) will still be over-emphasized; in other words, the sum of the weights of the majority class(es) will be much larger than that of the minority class(es). The proposed Boosting weighted ELM, in contrast, updates the training sample weights for each class separately, so during the iterations all classes always share the same amount (sum) of training sample weights. These comparisons show that our modification of the original AdaBoost framework is effective.

5.2.3. Multiclass imbalanced learning
Note that the proposed Boosting weighted ELM can deal with multi-class problems without any adaptation. Several multiclass datasets from the KEEL repository are tested in the experiments; see Table 4. Observations similar to those for the binary problems apply to the multiclass problems. The proposed Boosting weighted ELM is verified to be effective on multiclass classification problems as well.

5.2.4. Performance results in terms of accuracy
The performance in terms of accuracy on all the binary problems and multiclass problems is shown in Tables 5 and 6. It is interesting to see that AdaBoost weighted ELM outperforms the others for most datasets. This is consistent with our analysis in Section 5.2.2: AdaBoost weighted ELM performs best at achieving a higher overall accuracy, while the proposed Boosting weighted ELM aims at more balanced classification results (a higher G-mean value), at the cost of a relatively slight decrease in overall accuracy.

Table 1. Summary description of the imbalanced binary-class datasets.

Dataset | #Samps. | #Atts. | IR
glass1 | 214 | 9 | 1.82
pima | 768 | 8 | 1.90
vehicle3 | 846 | 18 | 2.52
haberman | 306 | 3 | 2.68
ecoli1 | 336 | 7 | 3.36
yeast3 | 1484 | 8 | 8.11
ecoli3 | 336 | 7 | 8.19
glass016vs2 | 192 | 9 | 10.29
glass2 | 214 | 9 | 10.39
yeast1vs7 | 459 | 8 | 13.87
abalone9vs18 | 731 | 8 | 16.39
yeast2vs8 | 482 | 8 | 23.10
yeast4 | 1484 | 8 | 28.41
yeast1289vs7 | 947 | 8 | 30.56
yeast6 | 1484 | 8 | 39.15
abalone19 | 4174 | 8 | 128.87

Table 2. Summary description of the imbalanced multi-class datasets.

Dataset | #Samps. | #Atts. | #Cl. | IR
hayes-roth | 132 | 4 | 3 | 1.70
new-thyroid | 215 | 5 | 3 | 4.84
glass | 214 | 9 | 6 | 8.43
thyroid | 720 | 21 | 3 | 39.19
page-blocks | 548 | 10 | 5 | 163.24

Table 3 Performance result of binary problems. Dataset (IR)

glass1(1.82) pima(1.90) vehicle3(2.52) haberman(2.68) ecoli1(3.36) yeast3(8.11) ecoli3(8.19) glass016vs2(10.29) glass2(10.39) yeast1vs7(13.87) abalone9vs18 (16.39) yeast2vs8(23.10) yeast4(28.41) yeast1289vs7 (30.56) yeast6(39.15) abalone19(128.87)

G-mean (Sigmoid node) Unweighted ELM

Weighted ELM W1

Weighted ELM W2

Boosting weighted ELM

AdaBoost weighted ELM

(C,L)

Testing result (%)

(C,L)

Testing result (%)

(C,L)

Testing result (%)

(C,L)

Testing result (%)

(C,L)

Testing result (%)

(222,390) (22,92) (26,690) (214,230) (216,120) (222,230) (212,630) (216,530) (220,260) (210,900) (220,250)

73.25 70.63 77.24 47.91 86.62 79.81 76.13 65.96 69.37 64.13 72.58

(216,30) (22,970) (212,230) (2  2,690) (24,410) (24,810) (24,170) (214,360) (26,60) (28,200) (212,210)

72.32 75.34 82.14 64.88 89.78 92.29 88.34 81.43 78.03 75.95 88.78

(226,240) (20,580) (212,400) (218,70) (224,50) (210,90) (216,110) (214,450) (210,580) (20,340) (22,180)

73.42 74.78 82.49 63.04 89.44 92.32 88.43 79.86 82.95 75.81 88.07

(220,250) (22,210) (210,40) (20,810) (28,80) (26,160) (216,140) (212,180) (210,10) (212,240) (26,230)

76.81 76.06 85.71 65.13 91.55 92.92 90.45 88.09 90.31 79.25 90.64

(224,510) (210,150) (214,250) (216,110) (214,340) (28,400) (222,190) (218,30) (220,680) (210,650) (210,420)

75.72 72.60 80.71 51.79 89.64 84.47 79.38 69.02 71.46 66.19 75.22

77.73 84.64 74.74

(212,190) 79.22 (222,640) 65.28 (216,380) 63.62

(20,760) 72.83 (214,810) 63.82 (232,510) 57.94

(216,30) 76.42 (2  10,720) 83.59 (2  2,940) 72.17

(28,310) 76.65 (2  2,770) 82.88 (24,20) 68.09

(220,290) (22,120) (24,600)

(220,900) 64.28 (240,850) 47.17

(212,60) (24,260)

(28,190) (20,640)

(2  2,430) 89.29 (2  10,870) 77.06

88.25 77.99

87.75 76.09

(214,260) (220, 890)

75.50 50.11



Table 4. Performance result (G-mean, %) of multiclass problems (Sigmoid node). Each cell gives the selected (C, L) followed by the testing result.

Dataset (IR) | Unweighted ELM | Weighted ELM W1 | Weighted ELM W2 | Boosting weighted ELM | AdaBoost weighted ELM
hayes-roth (1.70) | (2^16, 40) 77.34 | (2^8, 50) 81.12 | (2^10, 40) 81.53 | (2^12, 60) 85.75 | (2^10, 50) 82.55
new-thyroid (4.84) | (2^8, 210) 93.50 | (2^12, 30) 99.16 | (2^10, 110) 98.48 | (2^6, 450) 99.72 | (2^2, 540) 100.00
glass (8.43) | (2^10, 970) 55.08 | (2^6, 840) 66.15 | (2^8, 50) 58.80 | (2^0, 940) 71.45 | (2^10, 620) 61.73
thyroid (39.19) | (2^26, 770) 46.49 | (2^14, 450) 67.57 | (2^12, 990) 64.62 | (2^10, 520) 86.24 | (2^10, 260) 84.03
page-blocks (163.24) | (2^12, 240) 42.40 | (2^14, 580) 82.25 | (2^16, 310) 80.95 | (2^8, 840) 87.59 | (2^6, 80) 63.01

Table 5. Performance result of binary problems in terms of accuracy (%, Sigmoid node). Each cell gives the selected (C, L) followed by the testing result.

Dataset (IR) | Unweighted ELM | Weighted ELM W1 | Weighted ELM W2 | Boosting weighted ELM | AdaBoost weighted ELM
glass1 (1.82) | (2^4, 820) 75.68 | (2^12, 50) 73.38 | (2^16, 240) 74.76 | (2^18, 260) 77.08 | (2^16, 70) 79.01
pima (1.90) | (2^0, 900) 77.47 | (2^12, 300) 75.64 | (2^2, 430) 72.40 | (2^14, 110) 76.56 | (2^4, 240) 77.73
vehicle3 (2.52) | (2^8, 880) 84.04 | (2^10, 420) 82.97 | (2^14, 670) 82.74 | (2^12, 520) 84.87 | (2^10, 290) 85.93
haberman (2.68) | (2^-2, 900) 74.18 | (2^-10, 430) 73.20 | (2^26, 580) 60.44 | (2^4, 50) 73.85 | (2^0, 530) 75.49
ecoli1 (3.36) | (2^8, 770) 91.38 | (2^10, 410) 89.59 | (2^22, 610) 87.52 | (2^8, 40) 90.77 | (2^6, 260) 92.87
yeast3 (8.11) | (2^8, 700) 94.54 | (2^12, 370) 92.72 | (2^10, 710) 90.03 | (2^14, 230) 92.99 | (2^10, 350) 95.22
ecoli3 (8.19) | (2^14, 900) 92.55 | (2^22, 650) 88.11 | (2^18, 620) 86.91 | (2^18, 70) 90.18 | (2^0, 810) 94.03
glass016vs2 (10.29) | (2^6, 50) 91.16 | (2^10, 860) 84.89 | (2^8, 420) 80.70 | (2^8, 220) 88.53 | (2^-10, 910) 91.16
glass2 (10.39) | (2^2, 620) 92.06 | (2^6, 560) 85.01 | (2^10, 610) 81.75 | (2^8, 120) 88.29 | (2^-2, 820) 92.06
yeast1vs7 (13.87) | (2^10, 960) 95.21 | (2^20, 820) 83.01 | (2^24, 420) 81.70 | (2^12, 850) 88.45 | (2^8, 270) 95.42
abalone9vs18 (16.39) | (2^24, 210) 95.89 | (2^24, 210) 93.43 | (2^22, 950) 90.42 | (2^18, 230) 94.11 | (2^4, 250) 96.85
yeast2vs8 (23.10) | (2^-2, 990) 97.92 | (2^-18, 580) 97.92 | (2^0, 70) 89.21 | (2^-2, 90) 97.93 | (2^6, 50) 97.92
yeast4 (28.41) | (2^12, 230) 96.83 | (2^20, 830) 89.35 | (2^24, 820) 87.33 | (2^18, 700) 92.12 | (2^18, 280) 96.97
yeast1289vs7 (30.56) | (2^0, 920) 96.83 | (2^28, 870) 85.22 | (2^20, 630) 78.77 | (2^20, 490) 89.23 | (2^16, 210) 97.04
yeast6 (39.15) | (2^16, 850) 98.18 | (2^24, 680) 92.65 | (2^22, 650) 89.76 | (2^14, 420) 94.41 | (2^4, 810) 98.31
abalone19 (128.87) | (2^10, 890) 99.23 | (2^26, 850) 89.27 | (2^22, 920) 86.54 | (2^18, 450) 95.04 | (2^-10, 840) 99.23

Table 6. Performance result of multiclass problems in terms of accuracy (%, Sigmoid node). Each cell gives the selected (C, L) followed by the testing result.

Dataset (IR) | Unweighted ELM | Weighted ELM W1 | Weighted ELM W2 | Boosting weighted ELM | AdaBoost weighted ELM
hayes-roth (1.70) | (2^10, 40) 78.77 | (2^16, 50) 83.28 | (2^12, 30) 83.33 | (2^14, 40) 86.32 | (2^10, 40) 83.33
new-thyroid (4.84) | (2^8, 60) 98.14 | (2^14, 190) 99.07 | (2^10, 610) 97.67 | (2^0, 470) 99.53 | (2^0, 660) 100.00
glass (8.43) | (2^12, 40) 66.85 | (2^14, 850) 68.26 | (2^20, 380) 64.49 | (2^10, 430) 69.22 | (2^16, 50) 69.22
thyroid (39.19) | (2^4, 90) 92.64 | (2^22, 340) 85.14 | (2^22, 650) 83.06 | (2^10, 650) 91.53 | (2^12, 120) 93.75
page-blocks (163.24) | (2^14, 180) 94.70 | (2^24, 510) 93.61 | (2^20, 480) 93.61 | (2^16, 430) 94.52 | (2^12, 60) 96.16

6. Conclusion

In this study, a Boosting weighted ELM is proposed, aiming to obtain more balanced classification results on imbalanced data than weighted ELM. For weighted ELM, several empirical weighting schemes were provided; however, when facing new datasets, how to determine the weights is still up to users, and this remains an open problem worth solving. Inspired by the distribution weight updating mechanism of AdaBoost, we embedded weighted ELM seamlessly into a modified AdaBoost framework. Intuitively, the distribution weights in AdaBoost, which reflect the importance of the training samples, are fed into weighted ELM as training sample weights. Furthermore, such training sample weights are dynamically updated during the iterations of AdaBoost.

Considering the characteristics of imbalanced learning, we modified the original AdaBoost.M1 in two aspects. First, the initial distribution weights are set to be asymmetric to make AdaBoost converge at a faster speed. This results in a Boosting classifier with fewer weighted ELM classifiers and saves much computational time. Second, the distribution weights are updated separately for different classes to avoid destroying the asymmetry of the distribution weights. The proposed Boosting weighted ELM does not need users to determine the weighting terms; all procedures are carried out automatically, without human intervention. Compared with weighted ELM, the proposed Boosting weighted ELM needs several iteration steps, but usually fewer than 10 steps are enough. Considering the very fast learning speed of ELM, such costs are well worthwhile.

Experimental results on 16 binary datasets and 5 multiclass datasets from the KEEL repository show that the proposed method achieves more balanced results than weighted ELM. Future work includes taking more kinds of AdaBoost variants into consideration, and applying the proposed Boosting weighted ELM to more datasets with a large variety of class distributions.

Acknowledgment

We thank the anonymous reviewers for their constructive comments. This work was fully supported by a grant from the City University of Hong Kong (Project no. 7002774) and the Natural Science Foundation of China (under Grant nos. 91024012, 60970034, 61073189, 61170287, 61232016 and 61305094).


References

[1] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[2] G.-B. Huang, D.-H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2) (2011) 107–122.
[3] G.-B. Huang, H.-M. Zhou, X.-J. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 42 (2) (2012) 513–529.
[4] X.-Z. Wang, A. Chen, H. Feng, Upper integral network with extreme learning mechanism, Neurocomputing 74 (16) (2011) 2520–2525.
[5] Q. Liu, Q. He, Z. Shi, Extreme support vector machine classifier, in: Advances in Knowledge Discovery and Data Mining, Springer, 2008, pp. 222–233.
[6] G.-B. Huang, X.-J. Ding, H.-M. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1–3) (2010) 155–163.
[7] W. Zong, G.-B. Huang, Face recognition based on extreme learning machine, Neurocomputing 74 (16) (2011) 2541–2551.
[8] A. Mohammed, R. Minhas, Q. Jonathan Wu, M. Sid-Ahmed, Human face recognition based on multidimensional PCA and extreme learning machine, Pattern Recognit. 44 (10) (2011) 2588–2597.
[9] B.P. Chacko, V.V. Krishnan, G. Raju, P.B. Anto, Handwritten character recognition using wavelet energy and extreme learning machine, Int. J. Mach. Learn. Cybern. 3 (2) (2012) 149–161.
[10] W. Jun, W. Shitong, F.-l. Chung, Positive and negative fuzzy rule system, extreme learning machine and image classification, Int. J. Mach. Learn. Cybern. 2 (4) (2011) 261–271.
[11] D. Serre, Matrices: Theory and Applications, Springer Verlag, 2010.
[12] N.V. Chawla, Data mining for imbalanced datasets: an overview, in: Data Mining and Knowledge Discovery Handbook, Springer US, 2010, pp. 875–886.
[13] H.-B. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[14] C. KrishnaVeni, T. Rani, On the classification of imbalanced datasets, Int. J. Comput. Sci. Technol. 2 (SP1) (2011) 145–148.
[15] S. Wang, X. Yao, Multiclass imbalance problems: analysis and potential solutions, IEEE Trans. Syst. Man Cybern. Part B: Cybern., Preprint, 2012.
[16] X.-Z. Wang, L.-C. Dong, J.-H. Yan, Maximum ambiguity-based sample selection in fuzzy decision tree induction, IEEE Trans. Knowl. Data Eng. 24 (8) (2012) 1491–1505.
[17] S.-J. Yen, Y.-S. Lee, Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. 36 (3) (2009) 5718–5727.
[18] X.-Y. Liu, J.-X. Wu, Z.-H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 39 (2) (2009) 539–550.
[19] Y. Sun, M.S. Kamel, A.K. Wong, Y. Wang, Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit. 40 (12) (2007) 3358–3378.
[20] Z.-H. Zhou, X.-Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Trans. Knowl. Data Eng. 18 (1) (2006) 63–77.
[21] M. Gao, X. Hong, S. Chen, C.J. Harris, A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems, Neurocomputing 74 (17) (2011) 3456–3466.
[22] S.-H. Oh, Error back-propagation algorithm for classification of imbalanced data, Neurocomputing 74 (6) (2011) 1058–1061.
[23] S. Suresh, R. Venkatesh Babu, H. Kim, No-reference image quality assessment using modified extreme learning machine classifier, Appl. Soft Comput. 9 (2) (2009) 541–552.
[24] W. Zong, G.-B. Huang, Y. Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (2013) 229–242.
[25] K.-A. Toh, Deterministic neural classification, Neural Comput. 20 (6) (2008) 1565–1595.
[26] W.-Y. Deng, Q.-H. Zheng, L. Chen, Regularized extreme learning machine, in: IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2009), 2009, pp. 389–395.
[27] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1) (1970) 55–67.
[28] R. Fletcher, Practical Methods of Optimization, vol. 2: Constrained Optimization, John Wiley & Sons, Inc., 1981.
[29] R. Schapire, The boosting approach to machine learning: an overview, in: MSRI Workshop on Nonlinear Estimation and Classification, 2002, pp. 149–172.

Kuan Li received his Ph.D. degree in Computer Science from the National University of Defense Technology in 2012. He is currently an Assistant Professor in the School of Computer Science, National University of Defense Technology, China. His research interests include medical image processing, pattern recognition and parallel computing.

Xiangfei Kong received his B.S.E. degree in Computer Science from Shandong University of Science and Technology, Qingdao, Shandong, China, in 2005. He is currently working toward his Ph.D. degree in the Department of Computer Science, City University of Hong Kong, China. His research interests include extreme learning machines, noise reduction analysis, image processing, and pattern recognition.

Zhi Lu received his M.Sc. and M.Phil. degrees in computer science from the City University of Hong Kong, Hong Kong S.A.R. He is currently a third year Ph.D. student in the Department of Computer Science at the City University of Hong Kong and visiting the Australian Centre for Visual Technologies, the University of Adelaide, Australia. His research interests include medical image processing and pattern recognition.

Liu Wenyin is a "Bright Scholar" Professor in the College of Computer Science and Technology, Shanghai University of Electric Power. He was an Assistant Professor at the City University of Hong Kong from 2002 to 2012 and a full-time researcher at Microsoft Research China/Asia from 1999 to 2002. His research interests include anti-phishing, question answering, graphics recognition, and performance evaluation. He has a B.E. and an M.E. in Computer Science from Tsinghua University, Beijing, and a D.Sc. from the Technion, Israel Institute of Technology, Haifa. In 2003, he was awarded the International Conference on Document Analysis and Recognition Outstanding Young Researcher Award by the International Association for Pattern Recognition (IAPR). He was the TC10 chair of IAPR from 2006 to 2010. He served on the editorial boards of the International Journal of Document Analysis and Recognition (IJDAR) from 2006 to 2011 and the IET Computer Vision journal from 2011 to 2012. He is a Fellow of IAPR and a senior member of IEEE.

Jianping Yin received his M.S. degree and Ph.D. degree in Computer Science from the National University of Defense Technology, China, in 1986 and 1990, respectively. He is a full professor of computer science in the National University of Defense Technology. His research interests involve artificial intelligence, pattern recognition, algorithm design, and information security.