Feature selection of generalized extreme learning machine for regression problems


Communicated by Dr. G.-B. Huang

Accepted Manuscript

Yong-Ping Zhao, Ying-Ting Pan, Fang-Quan Song, Liguo Sun, Ting-Hao Chen

PII: S0925-2312(17)31815-5
DOI: 10.1016/j.neucom.2017.11.056
Reference: NEUCOM 19121

To appear in: Neurocomputing

Received date: 21 April 2017
Revised date: 19 September 2017
Accepted date: 26 November 2017

Please cite this article as: Yong-Ping Zhao, Ying-Ting Pan, Fang-Quan Song, Liguo Sun, Ting-Hao Chen, Feature selection of generalized extreme learning machine for regression problems, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.11.056

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Feature selection of generalized extreme learning machine for regression problems

Yong-Ping Zhao a, Ying-Ting Pan a, Fang-Quan Song a, Liguo Sun b,†, Ting-Hao Chen c

a Jiangsu Province Key Laboratory of Aerospace Power Systems, College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
b School of Aeronautic Science and Engineering, Beihang University, Beijing 100091, China
c Guangdong Maritime Safety Administration, Guangzhou 510260, China

Abstract

Recently, a generalized single-hidden-layer feedforward network was proposed as an extension of the original extreme learning machine (ELM). Different from the traditional ELM, this generalized ELM (GELM) uses the p-order reduced polynomial functions of the complete input features as output weights. Empirical results indicate that some of the input features used to construct the p-order reduced polynomial output weights in GELM may be insignificant or redundant. However, to date there has been no work on selecting appropriate input features for constructing the output weights of GELM. Hence, in this paper two greedy learning algorithms, i.e., a forward feature selection algorithm (FFS-GELM) and a backward feature selection algorithm (BFS-GELM), are first proposed to tackle this issue. To reduce the computational complexity, an iterative strategy is used in FFS-GELM, and its convergence is proved. In BFS-GELM, a decreasing iteration is applied to shrink the model, and an accelerating scheme is proposed to speed up the removal of the insignificant or redundant features. To show the effectiveness of the proposed FFS-GELM and BFS-GELM, experiments on twelve benchmark data sets are carried out. The results demonstrate that both FFS-GELM and BFS-GELM can select appropriate input features for constructing the p-order reduced polynomial function as output weights of GELM, and that both enhance the generalization performance while reducing the testing time compared to the original GELM. BFS-GELM works better than FFS-GELM in terms of the sparsity ratio, the testing time and the training time, but it is slightly inferior to FFS-GELM in generalization performance.

Key words: single hidden layer feedforward network; extreme learning machine; feature selection; greedy learning; iterative updating

† Corresponding author: E-mail: [email protected]

1 Introduction

In the field of neural networks, feedforward neural networks (FNNs) have been investigated thoroughly in the past two decades due to their excellent supervised learning performance. Single hidden layer feedforward neural networks (SLFNs) are the simplest FNNs, with only one hidden layer. Typical examples include multilayer perceptrons with one hidden layer and radial basis function networks, which mathematically consist of a linear combination of some basis functions. Because of their simple form and outstanding approximation capabilities [1–3], SLFNs are popular in many applications such as pattern recognition, signal processing, time-series prediction, nonlinear system modeling and control, and so on [4–13]. It has already been proven that a SLFN with N − 1 hidden nodes can realize any N input-target relations exactly [14, 15]. In [16], the same conclusion, that a SLFN with N − 1 sigmoidal hidden nodes can learn N distinct samples without error, was confirmed again. Subsequently, Huang et al. [17] extended this result and rigorously proved that a SLFN with at most N hidden nodes and any bounded nonlinear activation function which has a limit at one infinity can learn N distinct samples with zero error.

Generally speaking, to obtain a good SLFN there are two subproblems to be solved: optimising the tunable parameters and constructing the optimal network architecture. On the one hand, since backpropagation was proposed [18], a large number of gradient-based optimization methods, such as the Levenberg-Marquardt algorithm [19, 20], the scaled conjugate gradient algorithm [21], the BFGS algorithm [22], and so forth, have been investigated to search for the optimal parameters of FNNs. On the other hand, greedy learning algorithms, including forward learning and backward learning, are commonly utilized to determine appropriate and compact architectures for SLFNs. The forward learning algorithms start with a small initial network and gradually add new hidden nodes until a satisfactory architecture is obtained. The forward orthogonal least squares method [4], the fast recursive algorithm [23], kernel matching pursuit [24, 25], and least angle regression [26] belong to this class of algorithms. The backward learning algorithms start by training a larger-than-necessary network and then delete one hidden node at a time, namely the node that is least significant in terms of the reduction of the user-defined cost function. The pruning algorithms [27–33] are typical members. Generally, the forward learning algorithms tend to be more popular than the backward learning algorithms since they are computationally cheap and tend to have modest memory requirements. In contrast, the backward learning algorithms are less efficient with respect to numerical operations, but from a theoretical point of view they have provable convergence properties, while the forward learning algorithms offer no such known guarantees [34]. However, neither the forward nor the backward learning algorithms are guaranteed to be optimal [35].

In the learning community of SLFNs, two members have recently stood out: the support vector machine (SVM) [36–38] and the extreme learning machine (ELM) [39–41]. For SVM, there are no parameters in the hidden layer to be tuned; only the parameters of the output layer need to be optimised. Although the optimisation of these tunable parameters begins with all the training samples, only a small portion of them, referred to as support vectors, is retained to construct the final model. SVM is thus a sparse machine learning algorithm in theory, but the sparsity of its solution is often not as good as expected, which greatly hinders its practical use. Investigations [42–44] illustrated that the set of support vectors of SVM is sometimes inflated and contains dispensable support vectors, so some algorithms [42, 45] were proposed to further optimise the SVM solution.


As another special type of SLFN, ELM generates the input weights and biases of its hidden nodes randomly, and its output weights are analytically determined by solving a linear system. Compared with conventional gradient-based learning algorithms, the training speed of ELM is thousands of times faster [40]. Additionally, ELM has weaker generalization ability than SVM on small samples but can generalize as well as SVM on large samples, and it gains an overwhelming superiority over SVM in computational speed, especially on large-scale problems [46].
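As a rough illustration of this training scheme (a minimal sketch only, not the exact formulation recalled in section 2), the hidden layer can be drawn at random and the output weights obtained from a single least-squares solve. The function names, the sigmoid activation and the pseudoinverse solve below are our own illustrative choices.

    import numpy as np

    def train_elm(X, t, L=100, seed=0):
        """Minimal ELM sketch: random hidden layer, least-squares output weights.
        X: (N, n) inputs, t: (N,) targets, L: number of hidden nodes."""
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        W = rng.standard_normal((n, L))          # random input weights (never trained)
        b = rng.standard_normal(L)               # random hidden biases (never trained)
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output matrix (sigmoid)
        beta = np.linalg.pinv(H) @ t             # output weights from one linear solve
        return W, b, beta

    def predict_elm(X, W, b, beta):
        H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        return H @ beta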

For the original ELM, there still exist a number of efforts on optimising its network architecture. An incremental ELM [41] and its improvements [47–49] were proposed in order to achieve a more compact network architecture. Moreover, an error-minimization-based method [50] was presented to grow hidden nodes for ELM one by one or group by group, with the output weights updated incrementally, which significantly reduces the computational complexity. Recently, constructive parsimonious ELMs [51–53] were developed based on orthogonal transformations. These aforementioned algorithms fall within the range of the forward learning algorithms. Likewise, there are backward learning algorithms to compact the ELM network. To address classification problems, a pruned ELM was proposed [54], which begins with an initially large number of hidden nodes and then prunes the irrelevant nodes by considering their relevance to the class labels. Subsequently, an optimally pruned ELM methodology was presented [55], in which an original ELM is initially constructed, the hidden nodes are then ranked with the multi-response sparse regression technique [56], and the number of hidden nodes is finally decided by leave-one-out cross-validation [57]. In addition, there are corresponding destructive methods for constructing parsimonious ELMs [51, 58].

Recently, a generalized SLFN (GSLFN) was investigated [59]. In the original ELM, the output weights are constants, independent of the sample inputs. In contrast, GSLFN uses p-order reduced polynomial functions of the inputs as output weights, together with any infinitely differentiable activation function for the hidden nodes. Evidently, when p = 0 GSLFN degenerates into the original ELM; that is, the original ELM is a special case of GSLFN. Extensive experiments have shown that the flexible polynomial order p gives GSLFN a considerably more compact architecture and clear superiority over the original ELM in terms of generalization and learning speed. In our opinion, the SLFN proposed in [59] is essentially an extension of the original ELM, so in this paper we refer to it as the generalized ELM (GELM).
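The exact form of the p-order reduced polynomial output weights is given in [59] and recalled in section 2. As a loose first-order (p = 1) illustration only, the model remains linear in its unknown coefficients: each hidden output can be paired with copies scaled by the selected input features, and all coefficients are then obtained from one least-squares solve. The helper below is our own sketch, not the implementation of [59].

    import numpy as np

    def gelm_design_matrix(H, X, feature_idx):
        """p = 1 sketch: augment the hidden matrix H (N, L) so that the output
        weight of each hidden node becomes an affine function of the selected
        input features instead of a constant."""
        blocks = [H] + [H * X[:, [i]] for i in feature_idx]   # element-wise column scaling
        return np.hstack(blocks)

    # With H from the ELM sketch above and, say, features {0, 2} selected:
    #   A = gelm_design_matrix(H, X, [0, 2])
    #   coeffs = np.linalg.pinv(A) @ t   # all polynomial output-weight coefficients at once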

It is known that the generalization performance of the original ELM is not sensitive to the number of hidden nodes, and good performance can be reached as long as the number of hidden nodes is large enough [60]. Although a large number of hidden nodes can guarantee that ELM gains good performance, the testing time grows with it; that is, more hidden nodes usually mean a longer testing time. A large number of hidden nodes therefore impairs the use of ELM in applications where the testing time is strictly constrained. Hence, many of the algorithms stated previously were derived to make the network architecture of ELM more compact. In [59], experimental results demonstrate that GELM usually needs fewer hidden nodes yet performs better than the original ELM in terms of generalization and learning speed. However, in GELM the output weights are closely related to the sample inputs, i.e., the sample features: the more features are involved in the p-order reduced polynomial function used as output weights, the longer the testing time becomes, just as with more hidden nodes. In addition, it has been shown that the generalization performance of GELM can be further enhanced by using only partial input features, instead of the complete input features, to construct the p-order reduced polynomial function as output weights, because insignificant or redundant input features exist [59]. Removing these insignificant or redundant features thus brings two benefits, reducing the testing time and enhancing the generalization performance, so it is very important to select appropriate features for constructing the p-order reduced polynomial function as output weights for GELM. In the naive GELM, the p-order reduced polynomial function of the complete input features is chosen as the output-layer weights. Therefore, in this paper we aim to make some contributions to feature selection for the p-order reduced polynomial function in GELM.

Enlightened by the above experience of compacting network architectures with greedy learning algorithms, the greedy learning strategies are here extended to select appropriate input features for constructing the p-order reduced polynomial function as output weights for GELM. As a result, two algorithms for this feature selection task are obtained, as follows (a schematic sketch of both selection procedures is given after this list):

(1) The forward learning algorithm for feature selection of GELM (FFS-GELM): FFS-GELM begins with an original ELM, and then it gradually adds input features to the p-order reduced polynomial function used as output weights according to some criterion. It does not terminate until the stopping criterion is satisfied. To reduce the training computational burden, an iterative strategy is utilized in FFS-GELM.

(2) The backward learning algorithm for feature selection of GELM (BFS-GELM): Contrary to FFS-GELM, BFS-GELM starts with a full model, i.e., a GELM model whose output weights are the p-order reduced polynomial function of the complete input features as in [59]. Then it removes the insignificant or redundant features from the p-order reduced polynomial function one by one until the required performance is obtained. Similar to FFS-GELM, BFS-GELM employs the iterative strategy to shrink the model gradually in order to cut down the computational complexity. Moreover, a scheme is proposed to relieve the computational burden of computing the criterion for removing the insignificant or redundant features.
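As a purely schematic view of these two greedy directions (ignoring the iterative updating and accelerating schemes derived in sections 3 and 4), a wrapper-style skeleton could look as follows. Here fit_gelm and score are hypothetical callables that stand for fitting a GELM on a candidate feature subset and for evaluating the selection criterion (lower is better), with fit_gelm([]) understood as the original ELM.

    def forward_select(features, fit_gelm, score, max_feats):
        """Grow the selected feature set one feature at a time (FFS-GELM flavour)."""
        selected, remaining = [], list(features)
        best_so_far = score(fit_gelm(selected))          # baseline: original ELM
        while remaining and len(selected) < max_feats:
            cand = min(remaining, key=lambda f: score(fit_gelm(selected + [f])))
            cand_score = score(fit_gelm(selected + [cand]))
            if cand_score >= best_so_far:                # no further improvement: stop
                break
            selected.append(cand)
            remaining.remove(cand)
            best_so_far = cand_score
        return selected

    def backward_select(features, fit_gelm, score, tol=0.0):
        """Prune features one at a time from the full set (BFS-GELM flavour)."""
        selected = list(features)
        current = score(fit_gelm(selected))              # full GELM model
        while len(selected) > 1:
            cand = min(selected, key=lambda f: score(fit_gelm([g for g in selected if g != f])))
            cand_score = score(fit_gelm([g for g in selected if g != cand]))
            if cand_score > current + tol:               # removing any feature hurts too much
                break
            selected.remove(cand)
            current = cand_score
        return selected

The repeated refits in this naive skeleton are exactly the cost that the iterative strategies and the accelerating scheme of FFS-GELM and BFS-GELM are designed to avoid.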

To confirm the usefulness of the proposed FFS-GELM and BFS-GELM, experimental evaluations over a range of benchmark data sets are carried out. The reported results favor the effectiveness of FFS-GELM and BFS-GELM.

The rest of this paper is organized as follows. In section 2, the original ELM and GELM are introduced. In section 3, FFS-GELM is proposed to gradually select appropriate features for the original GELM; to reduce the computational complexity, an iterative scheme is applied to accelerate FFS-GELM, and its convergence is proved. In the following section, BFS-GELM is developed, and a scheme is presented to speed up its implementation. To confirm the effectiveness and feasibility of the proposed FFS-GELM and BFS-GELM, experiments on twelve benchmark data sets are carried out in section 5. Finally, conclusions follow.

2 ELM and GELM

2.1 ELM

Given a set of training samples {(x_k, t_k)}_{k=1}^{N}, where x_k ∈