A novel automatic two-stage locally regularized classifier construction method using the extreme learning machine


Neurocomputing 102 (2013) 10–22


Dajun Du a,b, Kang Li b,*, George W. Irwin b, Jing Deng b

a Shanghai Key Laboratory of Power Station Automation Technology, School of Mechatronical Engineering and Automation, Shanghai University, Shanghai 200072, China
b School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT9 5AH, UK

* Corresponding author. E-mail addresses: [email protected], [email protected] (K. Li).
http://dx.doi.org/10.1016/j.neucom.2011.12.052

Available online 5 June 2012

Abstract

This paper investigates the design of a linear-in-the-parameters (LITP) regression classifier for two-class problems. Most existing algorithms learn a classifier (model) from the available training data based on some stopping criterion, such as Akaike's final prediction error (FPE). The drawback is that the resulting classifier is not obtained directly on the basis of its generalization capability. The main objective of this paper is to improve the sparsity and generalization capability of a classifier, while reducing the computational expense of producing it. This is achieved by proposing an automatic two-stage locally regularized classifier construction (TSLRCC) method using the extreme learning machine (ELM). In this new algorithm, the nonlinear parameters in each term, such as the width of the Gaussian function and the power of a polynomial term, are first determined by the ELM. In the first stage, an initial classifier is then generated by direct evaluation of the candidate models according to their leave-one-out (LOO) misclassification rates. In the second stage, the significance of each selected regressor term is checked and insignificant ones are replaced. To reduce the computational complexity, a proper regression context is defined which allows fast implementation of the proposed method. Simulation results confirm the effectiveness of the proposed technique.

Keywords: Classification; Extreme learning machine; Leave-one-out (LOO) misclassification rate; Linear-in-the-parameters model; Regularization; Two-stage stepwise selection

1. Introduction

Classification problems are widespread in, for example, medical imaging, disease diagnosis, and credit risk assessment [1-3]. Without loss of generality [4], this paper is restricted to the two-class problem. Classifier design plays a key role here: the classification algorithm generally learns a suitable model (i.e., the classifier) from available training data, which is then used to classify unseen data. Considering both the complexity of the classifier and the size of the data set, various classification algorithms have been proposed to improve both accuracy and generalization capability. For example, a hybrid linear/nonlinear two-class classifier is presented in [1], which involves two stages: in the first stage, the input features are linearly combined to form a scalar variable; the likelihood ratio of this scalar variable is then used as the decision variable for classification. An incremental construction method for a classifier is presented in [5], where an ensemble of classifiers is first constructed to choose a subset from a larger set, and an ensemble of discriminants is then produced.

Two-class classification problems can also be posed in a linear-in-the-parameters (LITP) regression framework (e.g. a support vector machine (SVM) model [4] or a radial basis function (RBF) neural model [6,7]), using the known class labels as the system output for model training. For the LITP model, performance critically depends upon the determination of the nonlinear parameters in each model term, such as the width of a Gaussian function and the power of a polynomial. A conventional strategy is to randomly select some input data points as the RBF centers, after which the output weights can be estimated using a stochastic gradient approach [6]. Unfortunately, such a simple hybrid procedure may produce a network with poor performance. To tackle this, clustering techniques have been introduced [7], which determine the center locations of the RBF functions using both input and output data. In contrast to such traditional computational intelligence techniques, extreme learning machines (ELMs) have been proposed in [8-11]. These apply random computational nodes in the hidden layer that do not need to be tuned. The hidden layer thus has fixed parameters, allowing the output weights to be resolved using least-squares.

After the nonlinear parameters in each term have been determined, many methods for selecting the regressor terms are available. These include the popular forward orthogonal least-squares (OLS) [12,13] and the fast forward recursive algorithm (FRA) [14,15], which select the regressor terms based on their contributions to maximizing the model error reduction ratio. Unlike OLS, which applies a QR decomposition to the regression matrix, the more recent FRA [14] presents a regression context based on


which fast selection of the model structure and fast estimation of the model parameters are achievable from the original regression matrix. It has been shown that the FRA requires less computational effort and is also numerically more stable than the alternatives.

In practice, the real data to be classified often include a large amount of noise, which may lead to the model over-fitting the data. To address this, a dynamic classifier using an ensemble selection strategy was proposed in [16]. Regularization [17,18] is a useful technique for enforcing model sparsity and overcoming over-fitting, but the regularization parameters have to be properly tuned to obtain satisfactory performance [18]. In Bayesian learning theory [19,20], a regularization parameter is equivalent to the ratio of the related hyperparameter to a noise parameter. The Bayesian approach provides a rigorous framework for automatic adjustment of the regularization parameters to near-optimal values. This is achieved by marginalizing the hyperparameters when making inferences, and a validation data set is not needed. Bayesian techniques have now been integrated with both OLS [21] and the FRA [22].

The above forward selection algorithms only provide an efficient tool for determining the nonlinear parameters in each model term and for selecting a set of regression terms. Determining the size of an LITP model so as to improve the generalization performance remains a major issue. Different cost functions, often involving a trade-off between model complexity and training accuracy, lead to alternative architectures. When the training mean-square error (MSE) or an information-based measure such as the Akaike final prediction error (FPE) [23] is used, only the stopping point of the selection procedure is affected, while regressor terms that may cause poor model performance are not penalized. To improve the model generalization, a better approach is to incorporate this feature directly in the model selection procedure. Leave-one-out (LOO) cross-validation [24] is a widely used criterion for this purpose and has been incorporated into both OLS [25] and two-stage stepwise construction [26-28]. For the two-class problem, LOO cross-validation can further be interpreted as the LOO misclassification rate, which is combined with a forward orthogonal selection procedure (OFS+LOO) in [29].

In this paper, the extreme learning machine and the LOO misclassification rate are introduced into the recently proposed locally regularized forward recursive algorithm [22], leading to a novel two-stage locally regularized classifier construction (TSLRCC) method. An initial classifier is first generated by direct evaluation of the candidate models according to their LOO misclassification rates. The significance of each selected regressor term is then checked and insignificant ones are replaced. To alleviate the computational complexity, a new regression context is defined, which supports a fast implementation. Therefore, unlike [29], the proposed algorithm can directly choose the regressor terms from the original regression matrix and produce an optimized compact classifier.

The paper is organized as follows. Section 2 gives some preliminaries on the linear-in-the-parameters classifier and on the leave-one-out misclassification rate. Section 3 presents the main results of the paper, including the proposed automatic two-stage locally regularized classifier construction method, the regression context, and the complete algorithm procedure.
The simulation results are presented in Section 4, followed by concluding remarks in Section 5.

2. Preliminaries

2.1. Linear-in-the-parameters classifier

Consider the problem of separating a training data set belonging to two separate classes,
$$D_N = \{X(t), y(t)\}_{t=1}^{N}, \quad X(t) \in \mathbb{R}^n, \; y \in \{-1, 1\}. \qquad (1)$$
Suppose that a two-class classifier $f(X): \mathbb{R}^n \to \{-1, 1\}$ is formed using this data set, the linear-in-the-parameters classifier being expressed as
$$\hat{y}(t) = \mathrm{sgn}(f(t)), \qquad f(t) = \sum_{j=1}^{M} w_j \varphi_j(X(t)), \qquad (2)$$
where $\hat{y}(t)$ is the model-predicted class label for $X(t)$, $w_j$ ($j = 1, \ldots, M$) are the linear parameters, $M$ is the number of regressors (kernels) and $\varphi_j(\cdot)$ denotes the classifier kernel with a known nonlinear basis function. Here, a Gaussian kernel function $\phi(x, c_i) = \exp(-\|x - c_i\|^2 / \sigma_i^2)$, which contains two adjustable parameters, the center $c_i$ and the width $\sigma_i$, is used. Finding suitable values of these can help to improve the classification results. The nonlinear width parameter can be estimated using various methods, such as conjugate gradient optimization or an exhaustive search [15]. Compared to these methods, the extreme learning machine (ELM) [9,10] is claimed to produce better generalization performance with an extremely fast learning speed. Therefore, both of these parameters will be determined using an ELM, as follows.

2.2. Determining the centers and widths using an ELM

The ELM [9,10] was originally applied to single-hidden-layer feedforward neural networks (SLFNs) and then extended to "generalized" SLFNs. The essence of an ELM is that, differing from commonly used learning methods, the hidden layer of the SLFN need not be tuned. The target is then simply a linear combination of the hidden nodes, and the output layer weights can be easily estimated by least-squares. As a result, the learning speed of an ELM can be much faster than that of traditional learning approaches. Further, like an SLFN, the RBF neural model can be considered as an LITP model if the kernel centers and widths are arbitrarily given. Therefore, the centers and widths in (2) can be determined using extreme learning. In the ELM, if the activation function $\varphi(\cdot)$ is infinitely differentiable and the required number of hidden nodes satisfies $M \le N$, the following theorems hold.

Theorem 1 (Huang et al. [10]). Given any small positive value $\varepsilon > 0$ and an activation function $\varphi(\cdot): \mathbb{R} \to \mathbb{R}$ which is infinitely differentiable in any interval, there exists $M \le N$ such that, for $N$ arbitrary distinct samples $(x(t), y(t))$, with $x(t) \in \mathbb{R}^n$ and $y(t) \in \mathbb{R}^m$, and for any weights $w_i$ and hidden-node thresholds $b_i$ randomly chosen from any intervals of $\mathbb{R}^n$ and $\mathbb{R}$, respectively, according to any continuous probability distribution, then with probability one the regression matrix $\Phi$ is invertible and $\|\Phi W - Y\| < \varepsilon$.

Theorem 2 (Rao and Mitra [31, p. 51]). Let there exist a matrix $G$ such that $GY$ is a minimum norm least-squares solution of the linear system $\Phi W = Y$. Then it is necessary and sufficient that $G = \Phi^{+}$, the Moore-Penrose generalized inverse of the matrix $\Phi$.

According to Theorems 1 and 2, the ELM can find the kernels and output weights. It not only produces the smallest squared error on the training data but also obtains the smallest output weights. The main idea behind an ELM is to randomly choose the centers and the widths. Using this idea, the following two steps are involved in forming the regression matrix $\Phi$ in (2). First, the RBF centers $c_i$ are determined by randomly selecting $M$ ($M \le N$) samples from the training data. Secondly, the widths $\sigma_i$ are found by generating random values uniformly within a specific range $[\sigma_{\min}, \sigma_{\max}]$. After the centers and the widths have been determined, the regression model (2) can be built using the training samples.

Next, define the modeling residual as $e(t) = y(t) - f(t)$; then (2) can be re-written in matrix form as
$$Y = \Phi W + \Xi, \qquad (3)$$
where $Y = [y(1), y(2), \ldots, y(N)]^T \in \mathbb{R}^N$, $W = [w_1, w_2, \ldots, w_M]^T$, $\Xi = [e(1), \ldots, e(N)]^T \in \mathbb{R}^N$ with the properties $E(\Xi) = 0$ and $E(\Xi \Xi^T) = \sigma^2 I_N$, and $\Phi = [\varphi_1, \ldots, \varphi_M] \in \mathbb{R}^{N \times M}$ with column vectors $\varphi_i = [\varphi_i(X(1)), \ldots, \varphi_i(X(N))]^T$, $i = 1, \ldots, M$. The row vectors in $\Phi$ are denoted $\varphi_M(t) = [\varphi_1(X(t)), \ldots, \varphi_M(X(t))]$, $t = 1, \ldots, N$.

To obtain a sparse model and avoid over-fitting, a regularized cost function is introduced [17,18,30]:
$$J = \Xi^T \Xi + \sum_{i=1}^{M} \lambda_i w_i^2 = \Xi^T \Xi + W^T \Lambda W, \qquad (4)$$
where $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_M]^T$ is the regularization parameter vector, of the same dimension as the number of column vectors of the regression matrix $\Phi$, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M)$ is a diagonal matrix. The regularized least-squares estimate of the linear output weights that minimizes (4) is given by
$$\hat{W} = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T Y. \qquad (5)$$

Section 3 will show that such parameter regularization can be effectively integrated with our proposed two-stage stepwise construction method [26] and the LOO misclassification rate, to produce an efficient automatic two-stage locally regularized classifier construction method.

2.3. Leave-one-out misclassification rate

Leave-one-out (LOO) cross-validation [24] is a commonly used metric for measuring model generalization. The idea is that, for any predictor in a set of $K$ predictors (or models), each data point in the data set $D_N$ is set aside in turn, a model is estimated using the remaining $N-1$ data points, and the prediction error is derived for the point that was removed. Thus, for $t = 1, \ldots, N$, the $j$th model is estimated by removing the $t$th data point from the data set, and the output of the model estimated from the remaining $N-1$ data samples is denoted as $f_j^{(-t)}(t \mid N-1)$. The corresponding predicted class label is given by
$$\hat{y}_j^{(-t)}(t) = \mathrm{sgn}\big(f_j^{(-t)}(t \mid N-1)\big). \qquad (6)$$
Note that the LOO mean-square error (MSE) of the model is derived by averaging all of the modeling errors. Unlike the LOO MSE, the LOO misclassification rate is defined as
$$J_j = \frac{1}{N} \sum_{t=1}^{N} I_d\big(S_j(t)\big) = \frac{1}{N} \sum_{t=1}^{N} I_d\big(y(t) f_j^{(-t)}(t \mid N-1)\big), \qquad (7)$$
where $I_d(\cdot)$ denotes the misclassification indicator for a data sample and is defined as
$$I_d\big(S_j(t)\big) = \begin{cases} 1, & S_j(t) \le 0, \\ 0, & S_j(t) > 0. \end{cases}$$
Here $S_j(t) \le 0$ means that the $t$th data sample is misclassified, i.e., the class label produced by the model $f_j^{(-t)}$ differs from the actual class label $y(t)$. This provides a good basis for model selection by direct evaluation of the candidate models according to their classification performance: the best one, with the minimum LOO misclassification rate, is chosen, i.e., the $n$th candidate model with $n = \arg[\min(J_j, j = 1, \ldots, K)]$.

It is well known that the LOO misclassification rate is computationally expensive to evaluate. However, when a predictor is identified based on a linear-in-the-parameters model like (2), the LOO model residuals can be derived algebraically using the Sherman-Morrison-Woodbury theorem [32] rather than by sequentially splitting the data set. Thus (see Appendix A)
$$e_j^{(-t)}(t) = y(t) - f_j^{(-t)}(t \mid N-1) = \frac{y(t) - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \Phi_j^T Y}{1 - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \varphi_j^T(t)}. \qquad (8)$$
Multiplying both sides of (8) by $y(t)$ and noting that $y^2(t) = 1$ produces
$$y(t) f_j^{(-t)}(t \mid N-1) = 1 - y(t) \left( \frac{y(t) - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \Phi_j^T Y}{1 - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \varphi_j^T(t)} \right). \qquad (9)$$
Defining $a(t) = y(t) - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \Phi_j^T Y$ and $b(t) = 1 - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \varphi_j^T(t)$, and using (9), (7) can be re-written as
$$J_j = \frac{1}{N} \sum_{t=1}^{N} I_d\left(1 - \frac{y(t) a(t)}{b(t)}\right). \qquad (10)$$
The LOO misclassification rate of a linear-in-the-parameters classifier (3) can then be explicitly computed from (10).

To efficiently construct a parsimonious model with good generalization using the LOO misclassification rate for model selection, the LOO misclassification rate of each candidate model needs to be computed explicitly. In subset selection for linear regression, forward selection algorithms that construct a growing model structure, one regressor at a time, are considered superior to both backward and exhaustive search methods in terms of computational efficiency. It is therefore natural to combine the LOO misclassification rate with forward regression, as described in the next section.
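The closed-form expressions (8)-(10) are easy to check numerically. The sketch below, using illustrative names and toy random data, computes the LOO misclassification rate directly from (10) and compares it with a brute-force implementation that really deletes each sample and refits the regularized model; under these assumptions the two values coincide.

```python
import numpy as np

def loo_misclassification_rate(Phi, y, lam):
    """LOO misclassification rate of the regularized LITP classifier,
    computed in closed form via Eqs. (8)-(10) (no data splitting)."""
    A_inv = np.linalg.inv(Phi.T @ Phi + np.diag(lam))
    theta = A_inv @ Phi.T @ y
    a = y - Phi @ theta                                   # a(t) in Eq. (9)
    b = 1.0 - np.einsum('ij,jk,ik->i', Phi, A_inv, Phi)   # b(t) in Eq. (9)
    return np.mean(1.0 - y * a / b <= 0.0)

def loo_misclassification_rate_bruteforce(Phi, y, lam):
    """Reference implementation that removes each sample in turn and refits."""
    N = len(y)
    wrong = 0
    for t in range(N):
        keep = np.arange(N) != t
        A = Phi[keep].T @ Phi[keep] + np.diag(lam)
        theta_t = np.linalg.solve(A, Phi[keep].T @ y[keep])
        wrong += (y[t] * (Phi[t] @ theta_t) <= 0.0)
    return wrong / N

rng = np.random.default_rng(1)
N, M = 80, 10
Phi = rng.normal(size=(N, M))
y = np.sign(rng.normal(size=N))
lam = 1e-3 * np.ones(M)
print(loo_misclassification_rate(Phi, y, lam))
print(loo_misclassification_rate_bruteforce(Phi, y, lam))   # identical value
```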

3. Automatic two-stage locally regularized classifier construction method based on the LOO misclassification rate

3.1. Stage 1: forward model selection

For forward subset selection, suppose that $k$ out of $M$ regression vectors, denoted as $p_1, \ldots, p_k$, have already been selected. The remaining vectors from the full regression matrix $\Phi$ are denoted as $\varphi_{k+1}, \ldots, \varphi_M$. The resulting regression matrix becomes $\Phi_k = [p_1, \ldots, p_k]$, and the regularization parameter matrix $\Lambda_k = \mathrm{diag}\{\lambda_1, \ldots, \lambda_k\}$. The corresponding LOO misclassification rate is then expressed as
$$J_k = \frac{1}{N} \sum_{t=1}^{N} I_d\left(1 - \frac{y(t)\big(y(t) - \varphi_k(t)(\Phi_k^T \Phi_k + \Lambda_k)^{-1} \Phi_k^T Y\big)}{1 - \varphi_k(t)(\Phi_k^T \Phi_k + \Lambda_k)^{-1} \varphi_k^T(t)}\right) = \frac{1}{N} \sum_{t=1}^{N} I_d\left(1 - \frac{y(t) a_k(t)}{b_k(t)}\right), \qquad (11)$$
where
$$a_k(t) = y(t) - \varphi_k(t)(\Phi_k^T \Phi_k + \Lambda_k)^{-1} \Phi_k^T Y, \qquad b_k(t) = 1 - \varphi_k(t)(\Phi_k^T \Phi_k + \Lambda_k)^{-1} \varphi_k^T(t).$$

Next, at the $(k+1)$th regression step, if a new term $\varphi_j \in \{\varphi_{k+1}, \ldots, \varphi_M\}$ is chosen, the resulting regression matrix increases by one column, becoming $\Phi_{k+1} = [\Phi_k, \varphi_j]$. The corresponding $\lambda_j \in \{\lambda_{k+1}, \ldots, \lambda_M\}$ is also picked for $\varphi_j$, producing a new regularization parameter matrix $\Lambda_{k+1} = \mathrm{diag}\{\Lambda_k, \lambda_j\}$. The corresponding LOO misclassification rate is expressed as
$$J_{k+1}^{j} = \frac{1}{N} \sum_{t=1}^{N} I_d\left(1 - \frac{y(t)\big(y(t) - \varphi_{k+1}(t)(\Phi_{k+1}^T \Phi_{k+1} + \Lambda_{k+1})^{-1} \Phi_{k+1}^T Y\big)}{1 - \varphi_{k+1}(t)(\Phi_{k+1}^T \Phi_{k+1} + \Lambda_{k+1})^{-1} \varphi_{k+1}^T(t)}\right) = \frac{1}{N} \sum_{t=1}^{N} I_d\left(1 - \frac{y(t) a_{k+1}(t)}{b_{k+1}(t)}\right). \qquad (12)$$

Eq. (12) expresses the LOO misclassification rate of a selected regressor term. For each of the $M-k$ remaining candidate terms ($\forall \varphi_j \in \{\varphi_{k+1}, \ldots, \varphi_M\}$), the LOO misclassification rate is computed as $J_{k+1}^{j}$. The one with the minimal LOO misclassification rate is then chosen as the $(k+1)$th regressor, and the corresponding LOO misclassification rate is retained as $J_{k+1}$. The order of the regressor vectors in the full regression matrix $\Phi$ changes each time a regressor term is chosen. More specifically, all of the selected $k+1$ regressor vectors, denoted as $p_1, \ldots, p_{k+1}$, are shifted as a group to the left-hand side of $\Phi$. All of the remaining $M-k-1$ vectors in $\Phi$ are then moved to the right-hand side of $\Phi$ and grouped together as one candidate term set. By monitoring the LOO misclassification rate over two successive steps, this forward selection process is automatically terminated when
$$J_n \le J_{n+1}, \qquad (13)$$
yielding an $n$-unit model. A detailed explanation of how the LOO misclassification rate can balance the training error and the model complexity was provided in [12]. In summary, there exists an "optimal" model size $n$ such that, for $k \le n$, $J_k$ decreases as $k$ increases until condition (13) holds. This property is very useful, as it enables the model selection procedure to be terminated automatically, without the need for a special termination criterion.

Unfortunately, (12) still requires extensive computational effort. To reduce this complexity, a regression context is first established, which includes intermediate matrices and vectors that are generated during the model selection. By defining this regression context, the whole model construction process can be concisely formulated and easily implemented with significantly reduced computation.
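A deliberately naive version of this Stage-1 loop is sketched below: every candidate model is re-evaluated from scratch with the closed-form LOO rate of (10), and the selection stops automatically once condition (13) is met. The helper and variable names are illustrative, and the sketch ignores the fast recursions introduced next, so its cost is far higher than that of the actual algorithm.

```python
import numpy as np

def loo_rate(Phi_S, y, lam_S):
    """Closed-form LOO misclassification rate (Eq. (10)) of one candidate model."""
    A_inv = np.linalg.inv(Phi_S.T @ Phi_S + np.diag(lam_S))
    a = y - Phi_S @ (A_inv @ Phi_S.T @ y)
    b = 1.0 - np.einsum('ij,jk,ik->i', Phi_S, A_inv, Phi_S)
    return np.mean(1.0 - y * a / b <= 0.0)

def forward_select(Phi, y, lam):
    """Stage-1 forward selection with the automatic stopping rule (13),
    recomputing every candidate model from scratch (naive, for illustration)."""
    M = Phi.shape[1]
    selected, pool, J_prev = [], list(range(M)), 1.0        # J_0 = 1
    lam = np.asarray(lam)
    while pool:
        rates = [loo_rate(Phi[:, selected + [j]], y, lam[selected + [j]])
                 for j in pool]
        k = int(np.argmin(rates))
        if rates[k] >= J_prev and selected:                  # J_n <= J_{n+1}: stop
            break
        selected.append(pool.pop(k))
        J_prev = rates[k]
    return selected, J_prev

rng = np.random.default_rng(2)
N, M = 120, 40
Phi = rng.normal(size=(N, M))
y = np.sign(Phi[:, :3] @ np.array([1.0, -1.0, 0.5]) + 0.3 * rng.normal(size=N))
selected, J = forward_select(Phi, y, 1e-6 * np.ones(M))
print("selected terms:", selected, " LOO rate:", J)
```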

(1) The residual matrix $R_k$: a matrix series $R_k$ is defined as
$$R_k \triangleq I - \Phi_k(\Phi_k^T \Phi_k + \Lambda_k)^{-1} \Phi_k^T, \quad k = 1, \ldots, M, \qquad (14)$$
where $R_0 \triangleq I$. The matrix $R_k$ is a residual matrix with the following properties:
$$R_{k+1} = R_k - \frac{R_k \varphi_{k+1} \varphi_{k+1}^T R_k}{\varphi_{k+1}^T R_k \varphi_{k+1} + \lambda_{k+1}}, \quad k = 0, \ldots, M-1, \qquad (15)$$
$$R_k = R_k^T, \quad k = 0, \ldots, M, \qquad (16)$$
$$R_{1,\ldots,p,\ldots,q,\ldots,k} = R_{1,\ldots,q,\ldots,p,\ldots,k}, \quad p, q \le k. \qquad (17)$$

(2) Quantities $p_i^{(k)}$ and $\varphi_i^{(k)}$, upper triangular matrices $A$ and $D$, lower triangular matrix $C$, and vector $b$: consider a selected regressor term or a remaining candidate $\forall \varphi_j \in \{\varphi_{k+1}, \ldots, \varphi_M\}$, and define
$$p_i^{(k)} = R_k p_i, \quad i = 1, \ldots, k, \qquad \varphi_j^{(k)} = R_k \varphi_j, \quad j = k+1, \ldots, M, \qquad (18)$$
the upper triangular matrix $A \triangleq [a_{i,j}]_{k \times M}$ with
$$a_{i,j} \triangleq \begin{cases} p_i^T R_{i-1} p_j = (p_i^{(i-1)})^T p_j, & i \le j \le k, \\ p_i^T R_{i-1} \varphi_j = (p_i^{(i-1)})^T \varphi_j, & k < j \le M, \end{cases} \qquad (19)$$
the lower triangular matrix $C \triangleq [c_{i,j}]_{k \times k}$ with
$$c_{i,j} \triangleq \begin{cases} 0, & i \le j, \\ p_i^T R_{i-1} p_j / \lambda_j = a_{i,j} / \lambda_j, & j < i \le k, \end{cases} \qquad (20)$$
the upper triangular matrix $D \triangleq [d_{i,j}]_{k \times k}$ with
$$d_{i,j} \triangleq \begin{cases} 0, & j \le i, \\ (p_i^{(j-1)})^T Y / \lambda_i, & i < j \le k, \end{cases} \qquad (21)$$
and the vector $b \triangleq [b_i]_{M \times 1}$ with
$$b_i \triangleq \begin{cases} p_i^T R_{i-1} Y = (p_i^{(i-1)})^T Y, & 1 \le i \le k, \\ \varphi_i^T R_k Y = (\varphi_i^{(k)})^T Y, & k < i \le M. \end{cases} \qquad (22)$$

Using the properties of the residual matrix $R_k$, the matrices, vectors and quantities defined above can be computed as follows:
$$a_{i,j} = p_i^T \varphi_j - \sum_{s=1}^{i-1} \frac{a_{s,i} a_{s,j}}{a_{s,s} + \lambda_s}, \quad i = 1, \ldots, k, \; j = k, \ldots, M, \qquad (23)$$
$$c_{i,j} = \frac{a_{j,i}}{a_{j,j} + \lambda_j} - \sum_{s=j+1}^{i-1} \frac{a_{s,i} c_{s,j}}{a_{s,s} + \lambda_s}, \quad i = 1, \ldots, k, \; j = 1, \ldots, i-1, \qquad (24)$$
$$b_i = \begin{cases} p_i^T Y - \sum_{s=1}^{i-1} (a_{s,i} b_s)/(a_{s,s} + \lambda_s), & i = 1, \ldots, k, \\ \varphi_i^T Y - \sum_{s=1}^{k} (a_{s,i} b_s)/(a_{s,s} + \lambda_s), & i = k+1, \ldots, M, \end{cases} \qquad (25)$$
$$d_{i,j} = \frac{b_i}{a_{i,i} + \lambda_i} - \sum_{s=i+1}^{j-1} \frac{c_{s,i} b_s}{a_{s,s} + \lambda_s}, \quad i = 1, \ldots, j-1, \qquad (26)$$
$$p_i^{(i-1)} = p_i^{(i-2)} - \frac{a_{i-1,i}}{a_{i-1,i-1} + \lambda_{i-1}} p_{i-1}^{(i-2)}, \quad i = 1, \ldots, k, \qquad \varphi_j^{(i-1)} = \varphi_j^{(i-2)} - \frac{a_{i-1,j}}{a_{i-1,i-1} + \lambda_{i-1}} p_{i-1}^{(i-2)}, \quad j = k+1, \ldots, M. \qquad (27)$$

The model residuals can now be expressed in matrix form as
$$a_{k+1} = Y - \Phi_{k+1}(\Phi_{k+1}^T \Phi_{k+1} + \Lambda_{k+1})^{-1} \Phi_{k+1}^T Y = R_{k+1} Y, \qquad (28)$$
where $a_{k+1}(t)$ in (12) is the $t$th element of the vector $a_{k+1}$. Noting $b_{k+1}(t)$ in (12), it turns out that $b_{k+1}(t)$ is the $t$th diagonal element of the matrix $R_{k+1}$, i.e.,
$$b_{k+1}(t) = R_{k+1}(t, t). \qquad (29)$$
Further, using (15), (16) and (27), and defining $b_{k+1} = [b_{k+1}(1), \ldots, b_{k+1}(N)]^T$, the quantities $a_{k+1}(t)$ and $b_{k+1}(t)$ for $\varphi_j \in \{\varphi_{k+1}, \ldots, \varphi_M\}$ can be updated recursively as follows:
$$a_{k+1} = a_k - \frac{b_j^{k+1}}{a_{j,j}^{k+1} + \lambda_j} \varphi_j^{(k)}, \qquad b_{k+1} = b_k - \frac{1}{a_{j,j}^{k+1} + \lambda_j} \big(\varphi_j^{(k)}\big)^{[2]}, \quad j = k+1, \ldots, M, \qquad (30)$$
where $a_{j,j}^{k+1}$ and $b_j^{k+1}$ are the updated values for each of the remaining candidate regressor terms at the $(k+1)$th step, i.e.,
$$a_{j,j}^{k+1} = a_{j,j}^{k} - \frac{a_{k,j}^2}{a_{k,k} + \lambda_k}, \qquad b_j^{k+1} = b_j^{k} - \frac{a_{k,j} b_k}{a_{k,k} + \lambda_k}.$$
Thus, after $a_{k+1}$ and $b_{k+1}(t)$ have been recursively updated using (30), $J_{k+1}$ follows from (12). In this way, the LOO misclassification rates are calculated with minimal computational effort.
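The following small check, on random data with illustrative names, confirms numerically that the rank-one style recursion (15) reproduces the direct definition (14) and preserves the symmetry property (16); it is only a sanity check of these identities, not part of the algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(3)
N, k = 30, 5
P = rng.normal(size=(N, k))              # selected regressors p_1, ..., p_k
lam = rng.uniform(0.01, 0.1, size=k)

def R_direct(j):
    """Direct definition (14): R_j = I - Phi_j (Phi_j^T Phi_j + Lambda_j)^{-1} Phi_j^T."""
    if j == 0:
        return np.eye(N)
    Pj = P[:, :j]
    return np.eye(N) - Pj @ np.linalg.solve(Pj.T @ Pj + np.diag(lam[:j]), Pj.T)

# Recursion (15), applied one regressor at a time
R = np.eye(N)
for j in range(k):
    p = P[:, j:j + 1]
    Rp = R @ p
    R = R - (Rp @ Rp.T) / (p.T @ R @ p + lam[j])
    assert np.allclose(R, R_direct(j + 1))    # (15) agrees with (14)
    assert np.allclose(R, R.T)                # symmetry, property (16)
print("recursion (15) reproduces definition (14) for all steps")
```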


3.2. Stage 2: backward model refinement

The forward stepwise stage generates an initial classifier by direct evaluation of the candidate models according to the leave-one-out (LOO) misclassification rate. However, a good model can still be missed. To construct an optimized compact classifier, each previously selected regressor term needs to be checked again, and its LOO misclassification rate compared with those of all the remaining terms in the candidate pool. With the aid of the regression context defined above, the backward model refinement can now be described.

3.2.1. Regression context reconstruction

After an initial classifier with $n$ regressor terms has been produced, the selected $n$ regressor terms are shifted as a group to the left-hand side of $\Phi$, the resulting regression matrix becoming $\Phi = [p_1, p_2, \ldots, p_n, \varphi_{n+1}, \ldots, \varphi_M]$. The corresponding regularization parameter vector is expressed as $\lambda = [\lambda_1, \lambda_2, \ldots, \lambda_n, \lambda_{n+1}, \ldots, \lambda_M]$. This regression context is illustrated in Fig. 1.

Fig. 1. The regression context produced in the first stage.

To review each previously selected regressor term, it must be shifted to the $n$th position through a series of interchanges between adjacent regressor terms. Suppose that $p_q$ and $p_{q+1}$ in the regression matrix $\Phi$ are interchanged, so that $\hat{p}_q = p_{q+1}$, $\hat{p}_{q+1} = p_q$, $\hat{\lambda}_q = \lambda_{q+1}$, $\hat{\lambda}_{q+1} = \lambda_q$. The property given in (17) means that only $R_q$ in the residual matrix series $R_k$ is changed at each step, becoming $\hat{R}_q$. Therefore, with $\hat{p}_q$ and $\hat{p}_{q+1}$ interchanged in the regression matrix $\Phi$ and $R_q$ replaced by $\hat{R}_q$ in the residual matrix series, the other items in the regression context must also change. Thus, in the matrix $A$, the $q$th and $(q+1)$th columns, for elements from row 1 to $q-1$, are exchanged, and the $q$th and $(q+1)$th rows must be recalculated. In the matrix $C$, the $q$th and $(q+1)$th columns, for elements from row $q+2$ to $n$, are exchanged, and the $q$th and $(q+1)$th rows are updated. Only the $q$th and $(q+1)$th elements are changed in the vector $b$. In the matrix $D$, the $q$th and $(q+1)$th rows, for elements from column $q+2$ to $n$, are exchanged, and the $(q+1)$th column, for elements from row 1 to $q$, is revised. By comparison with Fig. 1, the corresponding changes in both the regression matrix $\Phi$ and the regression context are highlighted by the red boundaries in Fig. 2.

Fig. 2. Changes in the regression context when two adjacent selected regressor terms $p_q$ and $p_{q+1}$ are interchanged.

The required changes to the regression context are given as follows:

(1) The single residual matrix $R_q$ is recalculated as
$$\hat{R}_q = R_{q-1} - \frac{R_{q-1} \hat{p}_q \hat{p}_q^T R_{q-1}}{\hat{p}_q^T R_{q-1} \hat{p}_q + \hat{\lambda}_q}. \qquad (31)$$

(2) For the matrix $A$, the $q$th and $(q+1)$th columns, for elements from row 1 to $q-1$, are exchanged. The $q$th and $(q+1)$th rows are then given by
$$\hat{a}_{q,j} = \begin{cases} a_{q+1,q+1} + a_{q,q+1}^2 / (a_{q,q} + \lambda_q), & j = q, \\ a_{q,q+1}, & j = q+1, \\ a_{q+1,j} + (a_{q,q+1} a_{q,j}) / (a_{q,q} + \lambda_q), & q+2 \le j \le M, \end{cases} \qquad (32)$$
$$\hat{a}_{q+1,j} = \begin{cases} a_{q,q} - a_{q,q+1}^2 / (\hat{a}_{q,q} + \hat{\lambda}_q), & j = q+1, \\ a_{q,j} - a_{q,q+1} \hat{a}_{q,j} / (\hat{a}_{q,q} + \hat{\lambda}_q), & q+2 \le j \le M. \end{cases} \qquad (33)$$

(3) For the matrix $C$, the $q$th and $(q+1)$th columns, for elements from row $q+2$ to $n$, are exchanged. The $q$th and $(q+1)$th rows are then given by
$$\hat{c}_{q,j} = c_{q+1,j} + a_{q,q+1} c_{q,j} / (a_{q,q} + \lambda_q), \quad 1 \le j \le q-1, \qquad (34)$$
$$\hat{c}_{q+1,j} = \begin{cases} a_{q,q+1} / (\hat{a}_{q,q} + \hat{\lambda}_q), & j = q, \\ c_{q,j} - a_{q,q+1} \hat{c}_{q,j} / (\hat{a}_{q,q} + \hat{\lambda}_q), & j < q. \end{cases} \qquad (35)$$

(4) For the vector $b$, only the $q$th and $(q+1)$th elements are changed, to
$$\hat{b}_q = b_{q+1} + \frac{a_{q,q+1} b_q}{a_{q,q} + \lambda_q}, \qquad \hat{b}_{q+1} = b_q - \frac{a_{q,q+1} \hat{b}_q}{\hat{a}_{q,q} + \hat{\lambda}_q}. \qquad (36)$$

(5) For the matrix $D$, the $q$th and $(q+1)$th rows, for elements from column $q+2$ to $n$, are exchanged. The $(q+1)$th column, for elements from row 1 to $q$, is revised using
$$\hat{d}_{i,q+1} = \begin{cases} \hat{b}_q / (\hat{a}_{q,q} + \hat{\lambda}_q), & i = q, \\ d_{i,q} - \hat{b}_q \hat{c}_{q,i} / (\hat{a}_{q,q} + \hat{\lambda}_q), & 1 \le i < q. \end{cases} \qquad (37)$$

(6) Finally, $p_q^{(q-1)}$ and $p_{q+1}^{(q)}$ can be updated to
$$\hat{p}_q^{(q-1)} = p_{q+1}^{(q)} + \frac{a_{q,q+1}}{a_{q,q} + \lambda_q} p_q^{(q-1)}, \qquad \hat{p}_{q+1}^{(q)} = p_q^{(q-1)} - \frac{a_{q,q+1}}{\hat{a}_{q,q} + \hat{\lambda}_q} \hat{p}_q^{(q-1)}. \qquad (38)$$

3.2.2. LOO misclassification rate comparisons

After the regressor term $p_k$ ($\forall k \in \{1, \ldots, n-2\}$) has moved to the $n$th position in the full regression matrix $\Phi$, the LOO misclassification rate of the previous model, which includes the selected $n$ regressor terms, remains unchanged. To compare the LOO misclassification rate of the last selected term with those of all the remaining terms in the candidate pool, the LOO misclassification rate of each remaining candidate term is re-calculated by replacing the $n$th term (i.e., $p_k$) in the regression matrix $\Phi_n$ with $\varphi_j \in \{\varphi_{n+1}, \ldots, \varphi_M\}$. Referring to (27) and (30), the LOO misclassification rate of each remaining candidate term ($\forall \varphi_j \in \{\varphi_{n+1}, \ldots, \varphi_M\}$) can be computed using
$$\hat{a}_{j,j} = a_{j,j}^{n+1} + \frac{(\hat{a}_{n,j})^2}{\hat{a}_{n,n} + \hat{\lambda}_n}, \qquad \hat{b}_j = b_j^{n+1} + \frac{\hat{b}_n \hat{a}_{n,j}}{\hat{a}_{n,n} + \hat{\lambda}_n}, \qquad \hat{\varphi}_j^{(n-1)} = \varphi_j^{(n)} + \frac{\hat{a}_{n,j}}{\hat{a}_{n,n} + \hat{\lambda}_n} \hat{p}_n^{(n-1)}, \qquad (39)$$
$$\hat{a}_j = a_n + \frac{\hat{b}_n}{\hat{a}_{n,n} + \hat{\lambda}_n} \hat{p}_n^{(n-1)} - \frac{\hat{b}_j}{\hat{a}_{j,j} + \hat{\lambda}_j} \hat{\varphi}_j^{(n-1)}, \qquad \hat{b}_j = b_n + \frac{1}{\hat{a}_{n,n} + \hat{\lambda}_n} \big[\hat{p}_n^{(n-1)}\big]^{[2]} - \frac{1}{\hat{a}_{j,j} + \hat{\lambda}_j} \big[\hat{\varphi}_j^{(n-1)}\big]^{[2]}, \qquad (40)$$
where $[\cdot]^{[2]}$ denotes the square operation on each element. The LOO misclassification rate $\hat{J}_n^{j}(\varphi_j)$ of a term $\varphi_j \in \{\varphi_{n+1}, \ldots, \varphi_M\}$ in the candidate term pool can now be derived from (12) and (40). Assuming $\hat{J}_n^{m}(\varphi_m) = \min\{\hat{J}_n^{j}, \varphi_j \in \{\varphi_{n+1}, \ldots, \varphi_M\}\}$, if $\hat{J}_n^{m}(\varphi_m) < \hat{J}_n(\hat{p}_n)$, then $\varphi_m$ replaces $\hat{p}_n$ in the selected regression matrix $\Phi_n$. Meanwhile $\hat{p}_n$ is put back into the candidate term pool and takes the position of $\varphi_m$.

3.2.3. Updating the regression context

After the last selected regressor term $\hat{p}_n$ and $\varphi_m$ have exchanged their positions, the corresponding elements in the regression context must be updated.

(1) For the regularization parameter vector $\lambda$ (or the diagonal matrix $\Lambda$), the $n$th and $m$th elements are exchanged, i.e., $\hat{\lambda}_n = \lambda_m$, $\hat{\lambda}_m = \lambda_n$.

(2) For the matrix $A$, the $n$th and $m$th columns, for elements from row 1 to $n-1$, are exchanged. The $n$th row is revised to
$$\hat{a}_{n,j} = \begin{cases} a_{m,m}, & j = n, \\ a_{n,m}, & j = m, \\ \varphi_j^T \varphi_m - \sum_{s=1}^{n-1} (a_{s,m} a_{s,j})/(a_{s,s} + \lambda_s), & \forall j \ne n, \; j \ne m. \end{cases} \qquad (41)$$

(3) For the matrix $C$, the $n$th row, for elements from column 1 to $n-1$, is calculated using
$$\hat{c}_{n,j} = \frac{\hat{a}_{j,n}}{a_{j,j} + \lambda_j} - \sum_{s=j+1}^{n-1} \frac{\hat{a}_{s,n} c_{s,j}}{a_{s,s} + \lambda_s}. \qquad (42)$$

(4) For the vector $b$, the $n$th element is set equal to the $m$th element.

(5) The quantity $\hat{p}_n^{(n-1)}$ and the vectors $\hat{\varphi}_j^{(n)}$ are changed to
$$\hat{p}_n^{(n-1)} = \hat{\varphi}_m^{(n-1)}, \qquad \hat{\varphi}_j^{(n)} = \hat{\varphi}_j^{(n-1)} - \frac{\hat{a}_{n,j}}{\hat{a}_{n,n} + \hat{\lambda}_n} \hat{p}_n^{(n-1)}. \qquad (43)$$

(6) Finally, the vectors $\hat{a}_n$ and $\hat{b}_n$ are updated as $\hat{a}_n = \hat{a}_m$, $\hat{b}_n = \hat{b}_m$, and $\hat{a}_{j,j}^{(n+1)}$ and $\hat{b}_j^{(n+1)}$ are updated using
$$\hat{a}_{j,j}^{(n+1)} = \begin{cases} a_{n,n} - (\hat{a}_{n,m})^2 / (\hat{a}_{n,n} + \hat{\lambda}_n), & j = m, \\ a_{j,j} - (\hat{a}_{n,j})^2 / (\hat{a}_{n,n} + \hat{\lambda}_n), & j \ne m, \end{cases} \qquad (44)$$
$$\hat{b}_j^{(n+1)} = \begin{cases} b_n - (\hat{a}_{n,m} \hat{b}_n) / (\hat{a}_{n,n} + \hat{\lambda}_n), & j = m, \\ b_j - (\hat{a}_{n,j} \hat{b}_n) / (\hat{a}_{n,n} + \hat{\lambda}_n), & j \ne m. \end{cases} \qquad (45)$$
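The interchange formulas used throughout this section all rest on property (17), namely that $R_k$ depends only on which regressors (and regularization parameters) are included and not on their order, so that swapping two adjacent selected terms changes only one matrix in the residual series. The short check below, with illustrative names and random data, verifies this numerically by building the series from the definition (14) for the original and the swapped orderings.

```python
import numpy as np

def residual_series(P, lam):
    """Residual matrices R_1, ..., R_k from definition (14) for the given ordering."""
    N, k = P.shape
    series = []
    for j in range(1, k + 1):
        Pj = P[:, :j]
        series.append(np.eye(N) - Pj @ np.linalg.solve(Pj.T @ Pj + np.diag(lam[:j]), Pj.T))
    return series

rng = np.random.default_rng(4)
N, n = 25, 4
P = rng.normal(size=(N, n))
lam = rng.uniform(0.01, 0.1, size=n)

q = 1                                    # interchange the 2nd and 3rd selected terms
order = list(range(n))
order[q], order[q + 1] = order[q + 1], order[q]

R_old = residual_series(P, lam)
R_new = residual_series(P[:, order], lam[order])

# With this swap only R_2 (the matrix at position q in 0-based indexing) should
# change; all other matrices in the series stay identical, as property (17) states.
for j in range(n):
    print(f"R_{j+1} unchanged:", np.allclose(R_old[j], R_new[j]))
```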


3.2.4. Updating the regularization parameters

Once the model refinement procedure is completed, the coefficients of the selected terms are computed recursively as
$$w_i = d_{i,n+1} = \frac{b_i}{a_{i,i} + \lambda_i} - \sum_{s=i+1}^{n-1} \frac{c_{s,i} b_s}{a_{s,s} + \lambda_s}, \quad i = 1, \ldots, n. \qquad (46)$$
The regularization parameters are then optimized by the Bayesian evidence procedure [19,20]; they are updated as
$$\lambda_i^{\mathrm{new}} = \frac{\gamma_i \, \Xi^T \Xi}{(N - \gamma) w_i^2}, \qquad \gamma = \sum_{i=1}^{n} \gamma_i, \qquad \gamma_i = 1 - \lambda_i h_i, \quad i = 1, \ldots, n, \qquad (47)$$
where $h^n = \mathrm{diag}\big((\Phi_n^T \Phi_n + \Lambda_n)^{-1}\big)$. The vector $h^n$ can be updated recursively as $h^n = \big[h^{n-1} + [c_n]^{[2]} (a_{n,n} + \lambda_n)^{-1}, \; (a_{n,n} + \lambda_n)^{-1}\big]$, where $c_n$ is the $n$th row of the matrix $C$, for elements from column 1 to $n-1$.
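A direct, non-recursive sketch of this evidence-style update is given below: it recomputes $\hat{W}$, the residual vector $\Xi$ and the diagonal elements $h_i$ from the selected regression matrix at every iteration instead of using the recursive update of $h^n$ described above, and it iterates on a fixed selected matrix rather than alternating with the two-stage selection as in the complete algorithm. Names and data are illustrative only.

```python
import numpy as np

def update_regularization(Phi_n, y, lam, n_iter=10):
    """Evidence-style update of the local regularization parameters, Eq. (47),
    written directly in terms of the selected regression matrix Phi_n."""
    N = len(y)
    lam = np.asarray(lam, dtype=float).copy()
    for _ in range(n_iter):
        A_inv = np.linalg.inv(Phi_n.T @ Phi_n + np.diag(lam))
        w = A_inv @ Phi_n.T @ y                  # Eq. (5) on the selected terms
        xi = y - Phi_n @ w                       # residual vector Xi
        h = np.diag(A_inv)                       # h_i = [(Phi^T Phi + Lambda)^{-1}]_{ii}
        gamma = 1.0 - lam * h                    # gamma_i = 1 - lambda_i h_i
        lam = gamma * (xi @ xi) / ((N - gamma.sum()) * w**2)
    return lam, w

rng = np.random.default_rng(5)
N, n = 100, 6
Phi_n = rng.normal(size=(N, n))
y = np.sign(Phi_n[:, 0] - Phi_n[:, 1] + 0.5 * rng.normal(size=N))
lam, w = update_regularization(Phi_n, y, 1e-6 * np.ones(n))
print(np.round(lam, 4))    # noise-only terms receive large lambda and are shrunk
```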

Fig. 3. Comparison of the computations (scenario 1: L = 1, S_Ratio < 1 if n > 9; scenario 2: L = 2, S_Ratio < 1 if n > 15; scenario 3: L = 3, S_Ratio < 1 if n > 21).

3.3. Complete algorithm

The complete algorithm can be summarized as follows.

Step 1: Initialization. (A) Using the concept of the ELM, construct the candidate regression matrix $\Phi$ by randomly selecting the center vectors from the training data and the widths from a specific range $[\sigma_{\min}, \sigma_{\max}]$. (B) Set $\lambda_i$, $1 \le i \le M$, to the same small positive value (e.g. $10^{-6}$), and set the iteration index $I = 1$. Set the model size $k = 0$, $\varphi_j^{(0)} = \varphi_j$, $a_{j,j}^{(1)} = \varphi_j^T \varphi_j$, $b_j^{(1)} = \varphi_j^T Y$, and $J_0 = 1$.

Step 2: Forward selection. (A) At the $k$th step, for $\forall \varphi_j \in \{\varphi_k, \ldots, \varphi_M\}$, use (30) and (12) to calculate $a_k$, $b_k$ and the corresponding LOO misclassification rate $J_k^{j}$, respectively. (B) Find $j_k = \arg[\min\{J_k^{j}, k \le j \le M\}]$, select $J_k = J_k^{j_k}$, and add the $k$th selected regression term to the regression matrix, $\Phi_k = [\Phi_{k-1}, \varphi_{j_k}]$. (C) Update the matrices $A$, $C$, $D$, the vector $b$ and the quantity $p_k^{(k-1)}$ using (23)-(27), respectively, and pre-calculate $\varphi_j^{(k)}$, $a_{j,j}^{k+1}$ and $b_j^{k+1}$. (D) The procedure is monitored and terminated when $J_{k-1} \le J_k$, which produces a $(k-1)$-unit model. Otherwise, set $k = k+1$ and go to Step 2(A).

Step 3: Backward model refinement. (A) Interchange the positions of $p_k$ and $p_{k+1}$ ($k = n-1, \ldots, 1$), and update the related terms according to (32)-(38). This process continues until the regressor term $p_k$ has been moved to the $n$th position. (B) Update $\varphi_j^{(k)}$, $a_{j,j}^{k+1}$, $b_j^{k+1}$, $\hat{a}_j$ and $\hat{b}_j$ using (39) and (40), and compute the new LOO misclassification rates. (C) If $\hat{J}_n^{m}(\varphi_m) < \hat{J}_n(\hat{p}_n)$, then $\varphi_m$ replaces $\hat{p}_n$ in the selected regression matrix $\Phi_n$; meanwhile $\hat{p}_n$ is put back into the candidate term pool and takes the position of $\varphi_m$. (D) Update the related terms according to (41)-(45). If $k > 1$, set $k = k-1$ and go to Step 3(A). (E) If one or more regressor terms were changed, set $k = n-1$ and repeat Steps 3(A)-(D) to review all the terms again.

Step 4: Updating the regularization parameters. (A) Using the final set of selected regressor terms, calculate the model coefficient vector $\hat{W}$ and update the regularization parameters $\Lambda$ using (46) and (47). (B) If $\lambda$ remains essentially unchanged over two successive iterations, or a pre-set maximum number of iterations is reached, stop; otherwise, update the candidate set with the $k-1$ selected significant centers, set $I = I+1$, and go to Step 2.
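For reference, the following sketch mimics Step 3 in its naive form: each selected position is compared against every pool term using the closed-form LOO rate, and a swap is made whenever the rate improves, repeating until nothing changes. It deliberately ignores the regression-context recursions (31)-(45), so it is far slower than the algorithm above, and all names and data are illustrative.

```python
import numpy as np

def loo_rate(Phi_S, y, lam_S):
    """Closed-form LOO misclassification rate (Eq. (10)) of one candidate model."""
    A_inv = np.linalg.inv(Phi_S.T @ Phi_S + np.diag(lam_S))
    a = y - Phi_S @ (A_inv @ Phi_S.T @ y)
    b = 1.0 - np.einsum('ij,jk,ik->i', Phi_S, A_inv, Phi_S)
    return np.mean(1.0 - y * a / b <= 0.0)

def backward_refine(Phi, y, lam, selected):
    """Stage-2 refinement in naive form: replace a selected term by a pool term
    whenever the LOO misclassification rate improves, until no change occurs."""
    lam = np.asarray(lam)
    selected = list(selected)
    pool = [j for j in range(Phi.shape[1]) if j not in selected]
    changed = True
    while changed:
        changed = False
        for pos in range(len(selected)):
            J_cur = loo_rate(Phi[:, selected], y, lam[selected])
            best_j, best_J = None, J_cur
            for j in pool:
                trial = selected.copy()
                trial[pos] = j
                J_try = loo_rate(Phi[:, trial], y, lam[trial])
                if J_try < best_J:
                    best_j, best_J = j, J_try
            if best_j is not None:                    # swap the term and the pool entry
                pool.remove(best_j)
                pool.append(selected[pos])
                selected[pos] = best_j
                changed = True
    return selected

rng = np.random.default_rng(6)
N, M = 120, 30
Phi = rng.normal(size=(N, M))
y = np.sign(Phi[:, 0] + Phi[:, 1] + 0.3 * rng.normal(size=N))
print(backward_refine(Phi, y, 1e-6 * np.ones(M), selected=[5, 6, 7]))
```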

3.4. Computational complexity analysis

The computation involved in the proposed algorithm is dominated by the selection of the regressor terms. Assuming that the computation time is dominated by multiplication/division operations, only these are counted here.

Stage 1: suppose there are $M$ candidate model terms and $N$ data samples are used for training. If a classifier with $n$ regressor terms is constructed, the number of multiplication/division operations is approximately
$$S_{1st} \approx 12 n^2 M + 12 n M N - 6 n^2 N + 5 n M. \qquad (48)$$

Stage 2: the backward model refinement of the second stage includes shifting each selected term to the $n$th position, comparing the new LOO error, and exchanging a regressor term of interest with a candidate pool term. The number of multiplication/division operations for one complete check loop in the worst case is approximately
$$S_{2nd} \approx 2 n^2 M + \tfrac{8}{3} n^3 - 8 n^2 N + 9 n M N. \qquad (49)$$

Updating the regularization parameters by the procedure given above requires approximately
$$S_{parameter\ update} \approx 3 n^2 + 12 n^2 N \qquad (50)$$
multiplication/division operations.

$S_{2nd}$ is the number of multiplication/division operations for one check loop in the worst case; if more than one check loop is performed, say $L$ check loops, then the total complexity will be less than $L S_{2nd}$. In practice $1 \le n \ll N$ and $1 \le n \ll M$, so the number of multiplication/division operations can be simplified to
$$S_{1st+2nd+parameter\ update} \approx (12 + 9L)\, n M N. \qquad (51)$$
If $I$ is the number of iterations in updating the regularization parameters, the total number of multiplication/division operations then becomes
$$S_{total}^{TSLRCC} \approx (12 + 9L)\, n M N I. \qquad (52)$$
For comparison, for the OFS+LOO algorithm [29], if the number of iterations in updating the regularization parameters is also $I$, the total number of multiplication/division operations can be calculated as
$$S_{total}^{OFS+LOO} \approx I M N (3 n^2 + 15 n)/2. \qquad (53)$$
The ratio of (52) to (53) is
$$S_{Ratio} = \frac{8 + 6L}{n + 5} \times 100\%, \qquad (54)$$
and provides a convenient index for comparing the computational loads of the two methods. This shows that if $3 + 6L < n$, then $S_{Ratio} < 1$. Fig. 3 shows the ratio for varying model size and different numbers of check loops. Therefore, the computation involved in the proposed method is larger than that of the OFS+LOO approach for small model sizes (small $n$), but gives a significant improvement for larger values of $n$.
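The break-even model sizes quoted for Fig. 3 follow directly from (54), as the following few lines confirm.

```python
# Ratio (54) of the two operation counts, S_Ratio = (8 + 6L)/(n + 5);
# the proposed method is cheaper (S_Ratio < 1) exactly when n > 3 + 6L.
for L in (1, 2, 3):
    breakeven = next(n for n in range(1, 100) if (8 + 6 * L) / (n + 5) < 1)
    print(f"L = {L}: S_Ratio < 1 once the model size n reaches {breakeven}")
# prints 10, 16 and 22, i.e. n > 9, n > 15 and n > 21, as in the Fig. 3 caption
```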

4. Simulation examples

Table 1. Details of the data sets.

Dataset         Input features   Training patterns   Testing patterns   Number of replications
Breast cancer   9                200                 77                 100
Heart           13               170                 100                100
Diabetes        8                468                 300                100

Table 2. Parameter settings for the TSLRCC algorithm.

Parameter                  Breast cancer   Heart        Diabetes
Width (d)                  [0.9, 1.2]      [0.9, 1.2]   [0.95, 1.15]
Initial value of lambda    10^-4           10^-4        10^-4
Maximum iteration number   12              8            10

To verify the effectiveness of the proposed algorithm, well-known benchmark data sets from Refs. [29,33,34], including breast cancer, heart and diabetes, were processed using the proposed TSLRCC algorithm. We did not experiment on all of the data sets, as the aim is simply to demonstrate that the proposed approach is a viable alternative to others in the literature [29,33]. Each data set contains 100 batches of randomly selected training and test data points. Table 1 lists the details of these data sets.

Several existing classification algorithms have been applied to these data sets and the results reported in [29,33]. ADABOOST, short for Adaptive Boosting, is a machine learning algorithm formulated by Freund and Schapire [35]. To overcome over-fitting under noisier learning conditions, regularization, linear programming and quadratic programming were incorporated to produce regularized ADABOOST (ADABOOST_REG), regularized linear programming ADABOOST (LP_REG-ADABOOST) and regularized quadratic programming ADABOOST (QP_REG-ADABOOST) [33]. Further, in [29], the OFS+LOO method was compared with six alternatives, including an RBF neural network, ADABOOST with an RBF, ADABOOST_REG, LP_REG-ADABOOST, QP_REG-ADABOOST and an SVM with RBF kernels. It was found that the OFS+LOO approach could construct parsimonious classifiers with competitive classification accuracy for the data sets in Table 1.

For the proposed TSLRCC algorithm, the Gaussian kernel function $\phi(x, c_i) = \exp(-\|x - c_i\|^2 / \sigma_i^2)$ was used. In the three experiments, using the concept of the ELM, the centers ($c_i$, $1 \le i \le M$) were obtained by randomly selecting samples from the training data, and the widths ($d_i$, $1 \le i \le M$) were randomly selected from the specific ranges shown in Table 2. The initial value and the maximum number of iterations for the regularization parameters are also listed in Table 2.

Fig. 4. Breast cancer: model sizes and misclassification rates (%) over the 100 data sets, and their distributions.

The TSLRCC algorithm was first applied to the breast cancer data sets, producing 100 models and the corresponding misclassification rates. Fig. 4 shows the model sizes and misclassification rates. Most of the model sizes lie between 2 and 6, while the misclassification rates cluster around 25%. The results are summarized in Table 3. The proposed method produced not only the most compact classifier but also the best classification performance, with the smallest mean and variance of the misclassification rate.

The TSLRCC algorithm was then applied to the heart data sets; the model sizes and misclassification rates are shown in Fig. 5. Most of the model sizes lie between 2 and 4, while the misclassification rates cluster around 15%. The results are summarized in Table 4, which shows that the new method again produced not only the most compact classifier but also the best overall classification performance, with the minimum mean and variance of the misclassification rate.

Table 3. Average misclassification rate for the breast cancer data and model size, compared to the results from [29].

Algorithm                   Model size      Misclassification rate (%)
RBF [29]                    5               27.6 +/- 4.7
ADABOOST with RBF [33]      5               30.4 +/- 4.7
ADABOOST_REG [33]           5               26.5 +/- 4.5
LP_REG-ADABOOST [33]        5               26.8 +/- 6.1
QP_REG-ADABOOST [33]        5               25.9 +/- 4.6
SVM with RBF kernel [29]    Not available   26.0 +/- 4.7
LOO+OFS [29]                6 +/- 2         25.74 +/- 5
ELM [36]                    6               29.8 +/- 4.7
Proposed TSLRCC algorithm   4.7 +/- 2.9     25.21 +/- 2.2

Table 4. Average misclassification rate for the heart data and model size, compared to the results from [29].

Algorithm                   Model size      Misclassification rate (%)
RBF [29]                    4               17.6 +/- 3.3
ADABOOST with RBF [33]      4               20.3 +/- 3.4
ADABOOST_REG [33]           4               16.5 +/- 3.5
LP_REG-ADABOOST [33]        4               17.5 +/- 3.5
QP_REG-ADABOOST [33]        4               17.2 +/- 3.4
SVM with RBF kernel [29]    Not available   16.0 +/- 3.3
LOO+OFS [29]                10 +/- 3        15.8 +/- 3.7
ELM [36]                    20              37.8 +/- 5.4
Proposed TSLRCC algorithm   3.8 +/- 1.8     15.6 +/- 2.4

Fig. 5. Heart: model sizes and misclassification rates (%) over the 100 data sets, and their distributions.

Fig. 6. Diabetes: model sizes and misclassification rates (%) over the 100 data sets, and their distributions.

Table 5. Average misclassification rate for the diabetes data and model size, compared to the results from [29].

Algorithm                   Model size      Misclassification rate (%)
RBF [29]                    15              24.3 +/- 1.9
ADABOOST with RBF [33]      15              26.5 +/- 2.3
ADABOOST_REG [33]           15              23.8 +/- 1.8
LP_REG-ADABOOST [33]        15              24.1 +/- 1.9
QP_REG-ADABOOST [33]        15              25.4 +/- 2.2
SVM with RBF kernel [29]    Not available   23.5 +/- 1.7
LOO+OFS [29]                6 +/- 1         23.0 +/- 1.7
ELM [36]                    15              32.5 +/- 3.0
Proposed TSLRCC algorithm   6.4 +/- 2.7     22.6 +/- 1.0

Finally, the TSLRCC algorithm was applied to the diabetes data sets, with the model sizes and misclassification rates shown in Fig. 6. Most of the model sizes lie between 2 and 8, while the misclassification rates cluster around 22-22.6%. The results are summarized in Table 5, which shows that the model size obtained by the proposed method is slightly larger than that from LOO+OFS but far smaller than those from the other five methods. It produced the best classification performance, with the minimum mean and variance of the misclassification rate. Compared to the alternative methods, the proposed TSLRCC algorithm performs well. In particular, the performance of the classifier produced by the proposed TSLRCC algorithm is much better than that of the basic ELM.

5. Concluding remarks

A novel automatic two-stage locally regularized classifier construction method using the ELM has been proposed for the two-class classification problem. This is achieved by combining the LOO misclassification rate and the ELM with the locally regularized forward recursive algorithm. The nonlinear parameters in each term are first determined by the ELM. An initial classifier is then generated by direct evaluation of the candidate models according to the LOO misclassification rate in the first stage. The significance of each selected regressor term is then checked, and insignificant ones are replaced in the second stage. The proposed algorithm can directly choose the regressor terms from the original regression matrix and produce an optimized compact classifier. To reduce the computational complexity, a proper regression context is defined which allows fast implementation of the proposed method. Simulation results confirm the effectiveness of the proposed algorithm compared to alternatives from the literature.

Acknowledgment

This work was supported in part by the Research Councils UK under Grants EP/G042594/1, EP/G059489/1 and EP/F021070/1, the National Science Foundation of China (61074032, 60834002, 51007052 and 61104089), the Science and Technology Commission of Shanghai Municipality (11ZR1413100), the Leading Academic Discipline Project MEE&AMA of Shanghai University, and the innovation fund project for Shanghai University.

Appendix A

If the $t$th data sample is deleted, the parameter vector $\hat{W}^{(-t)}$ is computed by
$$\hat{W}^{(-t)} = \big[(\Phi_j^{(-t)})^T \Phi_j^{(-t)} + \Lambda_j\big]^{-1} (\Phi_j^{(-t)})^T Y^{(-t)}, \qquad (A.1)$$
where $\Phi_j^{(-t)}$ and $Y^{(-t)}$ denote the regression matrix and output vector, respectively, with the $t$th row removed. With some further matrix manipulation, it can be shown that
$$(\Phi_j^{(-t)})^T Y^{(-t)} = \Phi_j^T Y - y(t)\, \varphi_j^T(t). \qquad (A.2)$$
Using the well-known matrix inversion lemma $[A + BCD]^{-1} = A^{-1} - A^{-1}B[DA^{-1}B + C^{-1}]^{-1}DA^{-1}$, the inverse of $[(\Phi_j^{(-t)})^T \Phi_j^{(-t)} + \Lambda_j]$ can be calculated as
$$\big[(\Phi_j^{(-t)})^T \Phi_j^{(-t)} + \Lambda_j\big]^{-1} = (\Phi_j^T \Phi_j + \Lambda_j)^{-1} + \frac{(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \varphi_j^T(t) \varphi_j(t) (\Phi_j^T \Phi_j + \Lambda_j)^{-1}}{1 - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \varphi_j^T(t)}. \qquad (A.3)$$
Using (A.1)-(A.3), the LOO model residual is given by
$$e_j^{(-t)}(t) = y(t) - f_j^{(-t)}(t \mid N-1) = y(t) - \varphi_j(t) \hat{W}^{(-t)} = \frac{y(t) - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \Phi_j^T Y}{1 - \varphi_j(t)(\Phi_j^T \Phi_j + \Lambda_j)^{-1} \varphi_j^T(t)}. \qquad (A.4)$$

References

[1] W. Chen, C.E. Metz, M.L. Giger, K. Drukker, A novel hybrid linear/nonlinear classifier for two-class classification: theory, algorithm and applications, IEEE Trans. Med. Imaging 29 (2) (2010) 428-441.
[2] R. Zhang, G.-B. Huang, N. Sundararajan, P. Saratchandran, Improved GAP-RBF network for classification problems, Neurocomputing 70 (2007) 3011-3018.
[3] B. Twala, Multiple classifier application to credit risk assessment, Expert Syst. Appl. 37 (2010) 3326-3336.
[4] S.R. Gunn, Support Vector Machines for Classification and Regression, Technical Report, University of Southampton, 1998.
[5] A. Ulaş, M. Semerci, O.T. Yıldız, E. Alpaydın, Incremental construction of classifier and discriminant ensembles, Inf. Sci. 179 (2009) 1298-1318.
[6] S. Elanayar, Y.C. Shin, Radial basis function neural network for approximation and estimation of nonlinear stochastic dynamic systems, IEEE Trans. Neural Networks 5 (4) (1994) 594-603.
[7] L. Xu, A. Krzyzak, A. Yuille, On radial basis function nets and kernel regression: statistical consistency, convergence rates, and receptive field size, Neural Networks 7 (4) (1994) 609-628.
[8] X. Wang, A. Chen, H. Feng, Upper integral network with extreme learning mechanism, Neurocomputing 74 (2011) 2520-2525.
[9] G.-B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2011) 107-122.
[10] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489-501.
[11] G.-B. Huang, H. Zhou, X. Ding, R. Zhang, Extreme learning machine for regression and multi-class classification, IEEE Trans. Syst. Man Cybern. B: Cybern. 42 (2) (2012) 513-529.
[12] X. Hong, P.M. Sharkey, K. Warwick, Automatic nonlinear predictive model-construction algorithm using forward regression and the PRESS statistic, IEE Proc. Control Theory Appl. 150 (3) (2003) 245-254.
[13] K.Z. Mao, RBF neural network center selection based on Fisher ratio class separability measure, IEEE Trans. Neural Netw. 13 (5) (2002) 1211-1217.
[14] K. Li, J. Peng, G.W. Irwin, A fast nonlinear model identification method, IEEE Trans. Autom. Control 50 (8) (2005) 1211-1216.
[15] J. Peng, K. Li, G.W. Irwin, A novel continuous forward algorithm for RBF neural modelling, IEEE Trans. Autom. Control 52 (1) (2007) 117-122.
[16] J. Xiao, C. Hen, X. Jiang, D. Liu, A dynamic classifier ensemble selection approach for noise data, Inf. Sci. 180 (2010) 3402-3421.
[17] T. Poggio, V. Torre, C. Koch, Computational vision and regularization theory, Nature 317 (1985) 314-319.
[18] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1) (1970) 55-67.
[19] D.J.C. MacKay, Bayesian interpolation, Neural Comput. 4 (3) (1992) 415-447.
[20] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res. 1 (2001) 211-244.
[21] S. Chen, X. Hong, C.J. Harris, P.M. Sharkey, Sparse modelling using orthogonal forward regression with PRESS statistic and regularization, IEEE Trans. Syst. Man Cybern. B: Cybern. 34 (2) (2004) 898-911.
[22] D. Du, X. Li, M.R. Fei, G.W. Irwin, A novel locally regularized automatic construction method for RBF neural models, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2011.05.045, in press.
[23] H. Akaike, Fitting autoregressive models for prediction, Ann. Inst. Stat. Math. 21 (1) (1969) 243-247.
[24] M. Stone, Cross validatory choice and assessment of statistical predictions, J. R. Stat. Soc. 36 (1) (1974) 117-147.
[25] X. Hong, R.J. Mitchell, Backward elimination model construction for regression and classification using leave-one-out criteria, Int. J. Syst. Sci. 38 (2) (2007) 101-113.
[26] K. Li, J. Peng, E.W. Bai, A two-stage algorithm for identification of nonlinear dynamic systems, Automatica 42 (2006) 1189-1197.
[27] J. Deng, K. Li, G.W. Irwin, Fast automatic two-stage nonlinear model identification based on the extreme learning machine, Neurocomputing 74 (16) (2011) 2422-2429.
[28] K. Li, J. Peng, E.W. Bai, Two-stage mixed discrete-continuous identification of radial basis function (RBF) neural models for nonlinear systems, IEEE Trans. Circuits Syst. I 56 (3) (2009) 630-643.
[29] X. Hong, S. Chen, C.J. Harris, A fast linear-in-the-parameters classifier construction algorithm using orthogonal forward selection to minimize leave-one-out misclassification rate, Int. J. Syst. Sci. 39 (2) (2008) 119-125.
[30] X. Hong, R.J. Mitchell, S. Chen, C.J. Harris, K. Li, G.W. Irwin, Model selection approaches for non-linear system identification: a review, Int. J. Syst. Sci. 39 (10) (2008) 925-946.
[31] C.R. Rao, S.K. Mitra, Generalized Inverse of Matrices and Its Applications, Wiley, New York, 1971.
[32] R.H. Myers, Classical and Modern Regression with Applications, PWS-KENT, Boston, 1990.
[33] G. Rätsch, T. Onoda, K.-R. Müller, Soft margins for AdaBoost, Mach. Learn. 42 (2001) 287-320.
[34] University of East Anglia Computational Biology Laboratory, http://theoval.cmp.uea.ac.uk/gcc/matlab/.
[35] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119-139.
[36] Q.-Y. Zhu, G.-B. Huang, MATLAB Codes of ELM Algorithm, http://www.ntu.edu.sg/home/egbhuang/ELM-Codes.htm.

Dajun Du received his B.Sc. and M.Sc. degrees from Zhengzhou University, China, in 2002 and 2005, respectively, and his Ph.D. degree in control theory and control engineering from Shanghai University in 2010. From September 2008 to September 2009, he was a visiting Ph.D. student with the Intelligent Systems and Control (ISAC) Research Group at Queen's University Belfast, UK. He is a lecturer at Shanghai University. His main research interests include neural networks, system modeling and identification, and networked control systems, and he is currently working on a UK-China Science Bridge project on sustainable energy and built environment at Queen's University Belfast.

Kang Li received his Ph.D. degree in Control Theory and Applications from Shanghai Jiaotong University, China, in 1995. He is currently a Professor of Intelligent Systems and Control at the School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, UK, where he teaches and conducts research in control engineering. His research interests cover nonlinear system modelling, identification and control, and bio-inspired computational intelligence, with recent applications to power systems and polymer extrusion. He has also extended his research to bioinformatics and systems biology with applications in food safety and healthcare. He has published over 160 papers in the above areas. Dr. Li serves as an Associate Editor or Member of the editorial board for the Transactions of the Institute of Measurement & Control, Neurocomputing, Cognitive Computation and the International Journal of Modelling, Identification and Control. He chairs the IEEE UKRI Section Control and Communication Ireland chapter, and was the Secretary of the IEEE UKRI Section (2008-2012). Dr. Li is a Senior Member of the IEEE, a Fellow of the Higher Education Academy, UK, a Member of the IFAC Technical Committee on Computational Intelligence in Control, and a Member of the Executive Committee of the UK Automatic Control Council. He is also a Visiting Professor of Harbin Institute of Technology, Shanghai University, and Ningbo Institute of Technology of Zhejiang University, and has held visiting fellowships of the National University of Singapore, University of Iowa, New Jersey Institute of Technology and Tsinghua University, and a visiting professorship with the Technical University of Bari, Taranto.


George Irwin is a first-class honours graduate in Electrical and Electronic Engineering (1972) from Queen's University Belfast. He also obtained his Ph.D. in Control Theory (1976) and a D.Sc. (1998) from the same University. He has held a personal Chair in Control Engineering since 1989 and has just retired as Research Director of the Intelligent Systems and Control cluster. Prof. Irwin is a Fellow of the UK Royal Academy of Engineering, a Member of the Royal Irish Academy and was recently elected a Fellow by IFAC. He also holds Fellowships of the IEEE and the UK Institute of Measurement and Control. International recognitions include Honorary Professorships at Harbin Institute of Technology and Shandong University, and he is a Visiting Professor at Shanghai University. Prof. Irwin's research covers identification, monitoring, and control, including neural networks, fuzzy neural systems and multivariate statistics, much of which involves industrial collaboration. His most recent personal contributions have been on wireless networked control systems, fault diagnosis of internal combustion engines and novel techniques for fast temperature measurement. He was Technical Director of Anex6 Ltd., a University spin-out company specialising in process monitoring, and has some 350 research publications, including 120 peer-reviewed journal papers. George Irwin has received four IEE Premiums, a Best Paper award from the Czech Academy of Sciences and the 2002 Honeywell International Medal from the Institute of Measurement and Control. He is a former Editor-in-Chief of the IFAC journal Control Engineering Practice and Past Chair of the UK Automatic Control Council. As well as serving on a number of IFAC technical committees, he has chaired the IFAC Awards and Publications Committees and is currently a member of the Publications Management Board. He received an IFAC Distinguished Service Award in 2008.

Jing Deng is a postdoctoral research fellow at Queen's University Belfast. He received his B.Sc. and M.Sc. degrees from the National University of Defence Technology, China, in 2001 and 2005, and an M.Sc. degree from Shanghai University, China, in 2007. From March 2008 to April 2011 he carried out his Ph.D. research with the Intelligent Systems and Control (ISAC) Group at Queen's University Belfast, UK. His main research interests include advanced system modeling, pattern recognition and fault detection, and he is currently working on an EPSRC project developing advanced control techniques for the polymer extrusion process.