A novel extreme learning machine using privileged information


Neurocomputing 168 (2015) 823–828


Wenbo Zhang*, Hongbing Ji, Guisheng Liao, Yongquan Zhang
School of Electronic Engineering, Xidian University, Xi'an 710071, China
* Corresponding author. Tel./fax: +86 29 88201658. E-mail address: [email protected] (W. Zhang).
http://dx.doi.org/10.1016/j.neucom.2015.05.042

Article info

Article history: Received 11 December 2014; Received in revised form 6 May 2015; Accepted 12 May 2015; Available online 22 May 2015. Communicated by G.-B. Huang.

Abstract

Extreme learning machine (ELM) is a competitive machine learning technique, which is much more efficient and usually leads to better generalization performance than traditional classifiers. In order to further improve its performance, we propose a novel ELM called ELM+, which introduces privileged information into the traditional ELM method. This privileged information, which is ignored by the classical ELM but often exists in human teaching and learning, optimizes the training stage by constructing a set of correcting functions. We demonstrate the performance of ELM+ on datasets from the UCI machine learning repository, the Mackey–Glass time series and radar emitter recognition, and also present comparisons with SVM, ELM and SVM+. The experimental results indicate the validity and advantage of our method. © 2015 Published by Elsevier B.V.

Keywords: Extreme learning machine (ELM); ELM+; Privileged information; Hidden information; Radar emitter recognition

1. Introduction

Extreme learning machine (ELM) [1–5] was originally proposed for single-hidden-layer feedforward neural networks (SLFNs) and was then extended to generalized SLFNs where the hidden layer need not be neuron alike. In ELM, the input weights of the SLFNs are randomly chosen without iterative tuning, and the output weights are analytically determined. Thus, the training speed of ELM can be thousands of times faster than that of traditional iterative implementations of SLFNs. In addition, different from traditional learning algorithms for neural-type SLFNs, ELM aims to reach not only the smallest training error but also the smallest norm of the output weights. Bartlett's theory [6] shows that, for feedforward neural networks reaching a small training error, the smaller the norm of the weights, the better the generalization performance the networks tend to have. Because of its good performance, ELM has been attracting attention from more and more researchers [6–12], and various extensions have been made to render ELM more efficient and more suitable for real-world problems, such as ELM for imbalanced data [13], ELM for online sequential data [14–16], and ELM for noisy data [17,18]. However, further improvement of the performance of ELM has gradually reached a bottleneck.

In a data-rich world, there often exists privileged information about training samples that is not reflected directly in the training set. For example, in radar emitter recognition, traditional approaches [19] separate the received pulses into individual


emitter groups, such as passenger or cargo (civil), or Model A or Model B (military). In contrast, state-of-the-art approaches [20] utilize individual parameters to identify specific emitters through precise measurement of the intercepted signals. The groups of radar emitters can then be considered privileged information that can be used to improve recognition performance. However, this privileged information is easily ignored by traditional learning machines.

Recently, Vapnik and Vashist [21] proposed a general approach for solving such problems, known as Learning Using Privileged Information (LUPI), where at the training stage some additional information x* about a training example x is given. An SVM-based optimization formulation under the LUPI setting is called SVM+, which can effectively utilize this privileged information to improve performance. Recent empirical comparisons [22,23] show that SVM+ provides improved generalization accuracy for handwritten digit recognition and landmine detection, respectively. Liang and Cherkassky [24,25] presented empirical validation of SVM+ for classification, including medical diagnosis data. Zhu and Zhong [26] proved that privileged information can also improve the performance of one-class SVM. Feyereisl and Aickelin [27] utilized the notion of privileged information in unsupervised learning in order to improve clustering performance. Sharmanska et al. [28] proposed a Rank Transfer method based on privileged information and applied it to visual object classification tasks. In fact, for almost all machine learning problems there exists some sort of privileged information. Currently, nearly all learning models based on privileged information focus on SVM; there is no published work on ELM exploiting the advantages of privileged information.

In this paper, ELM+ based on privileged information is proposed by embedding the additional information into the corresponding


optimization problem. This so-called privileged information can, in a way, be seen as group information. We suppose that the available training data can be partitioned into several groups in a meaningful way. In order to be useful for learning, this group information is related to the slack variables, and additional constraints on the slack variables are introduced for samples from different groups. That is, we introduce different constraints on the slacks of different groups. The slack variables for each group are modeled by correcting functions, which are defined in the correcting space. Thus, the main difference between ELM+ and the standard ELM is that the standard ELM projects inputs into one space, whereas ELM+ projects inputs into two different spaces: the decision space and the correcting space. The experimental results show that the performance can be improved by introducing privileged information in the correcting space.

The remainder of the paper is organized as follows. Section 2 briefly summarizes the principles of ELM and illustrates privileged information with two specific examples. The proposed algorithm is described in detail in Section 3. In Section 4, the experiments and an analysis of the results are presented. The conclusions are drawn in Section 5.

2. Related work

The proposed ELM+ is based on ELM, so this section provides a brief review of ELM, together with a description of privileged information.

2.1. Extreme learning machine

ELM [2] was originally proposed for single-hidden-layer feedforward neural networks and was then extended to "generalized" single-hidden-layer feedforward networks (SLFNs) where the hidden layer need not be neuron alike. The output of an ELM with $\tilde{N}$ hidden nodes can be represented by
$$f_{\tilde{N}}(x) = \sum_{i=1}^{\tilde{N}} \beta_i\, G(a_i, b_i, x), \qquad a_i \in \mathbb{R}^n,\ x \in \mathbb{R}^n \qquad (1)$$

where $a_i$ and $b_i$ are the connection weights between the inputs and the hidden nodes, $\beta_i$ is the weight connecting the $i$-th hidden node to the output node, and $G(a_i, b_i, x)$ is the output of the $i$-th hidden node with respect to the input $x$. For $N$ arbitrary distinct samples $(x_k, t_k)$, if ELM can classify them accurately, there exist $a_i$, $b_i$ and $\beta_i$ such that
$$\sum_{i=1}^{\tilde{N}} \beta_i\, G(a_i, b_i, x_k) = t_k, \qquad k = 1, \ldots, N. \qquad (2)$$

Eq. (2) can be written compactly as
$$H\beta = T, \qquad (3)$$

where
$$H(a_1, \ldots, a_{\tilde{N}}, b_1, \ldots, b_{\tilde{N}}, x_1, \ldots, x_N) =
\begin{bmatrix}
g(a_1 \cdot x_1 + b_1) & \cdots & g(a_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\
\vdots & \cdots & \vdots \\
g(a_1 \cdot x_N + b_1) & \cdots & g(a_{\tilde{N}} \cdot x_N + b_{\tilde{N}})
\end{bmatrix}_{N \times \tilde{N}}$$
and $T = [t_1, t_2, \ldots, t_N]^T$. $H$ is called the hidden layer output matrix of the network, and the parameters $(a_i, b_i)$ of $H$ are randomly chosen. The classification problem for ELM can then be formulated as
$$\text{Minimize: } L_{\text{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|^2$$
$$\text{Subject to: } h(x_i)\beta = t_i^T - \xi_i^T, \qquad i = 1, \ldots, N \qquad (4)$$

where $\xi_i$ is the training error vector for the training sample $x_i$, and $C$ is the regularization parameter representing the trade-off between minimizing the training errors and maximizing the marginal distance. According to the Karush–Kuhn–Tucker (KKT) theorem [29], training ELM is equivalent to solving the following dual optimization problem:
$$L_{\text{ELM}} = \frac{1}{2}\|\beta\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m} \alpha_{i,j}\bigl(h(x_i)\beta_j - t_{i,j} + \xi_{i,j}\bigr). \qquad (5)$$

The corresponding KKT optimality conditions are:
$$\frac{\partial L_{\text{ELM}}}{\partial \beta_j} = 0 \;\Rightarrow\; \beta_j = \sum_{i=1}^{N} \alpha_{i,j}\, h(x_i)^T \;\Rightarrow\; \beta = H^T\alpha$$
$$\frac{\partial L_{\text{ELM}}}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C\xi_i, \qquad i = 1, \ldots, N$$
$$\frac{\partial L_{\text{ELM}}}{\partial \alpha_i} = 0 \;\Rightarrow\; h(x_i)\beta - t_i^T + \xi_i^T = 0, \qquad i = 1, \ldots, N \qquad (6)$$
where $\alpha = [\alpha_1, \ldots, \alpha_N]^T$. From (6), we have
$$\beta = H^T\Bigl(\frac{I}{C} + HH^T\Bigr)^{-1} T. \qquad (7)$$
Then, the output function of the ELM classifier is
$$f(x) = h(x)\beta = h(x)H^T\Bigl(\frac{I}{C} + HH^T\Bigr)^{-1} T. \qquad (8)$$
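To make the closed-form training of Eqs. (7) and (8) concrete, here is a minimal Python/NumPy sketch of a basic ELM with sigmoid hidden nodes (our illustration only; the paper's experiments were run in MATLAB, and the function names are hypothetical):

```python
import numpy as np

def elm_train(X, T, n_hidden=100, C=1.0, seed=0):
    """Random hidden layer, then output weights in closed form, Eq. (7)."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # input weights a_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))                   # hidden layer output matrix H
    N = H.shape[0]
    # beta = H^T (I/C + H H^T)^{-1} T
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Output function f(x) = h(x) beta, Eq. (8)."""
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```

For a multiclass problem, T is the one-hot target matrix and the predicted label is the index of the largest output.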

In summary, the input weights in ELM are randomly chosen without iterative tuning, and the output weights are analytically determined. Thus, ELM can achieve similar or much better generalization performance at a much faster learning speed than traditional learning machines such as SVM.

2.2. Privileged information

Unlike in human learning, a teacher does not play an important role in the traditional machine learning paradigm. However, a teacher is very important in the process of teaching and learning, because along with examples a teacher can provide students with explanations, comments, comparisons, and so on. Vapnik and Vashist [21] proposed an advanced learning paradigm called LUPI, where at the training stage a teacher gives some additional information x* about a training example x to improve generalization. Since the additional information is available only at the training stage and not for the test set, it is called privileged information. Let us consider two examples where a teacher provides privileged information during the training stage.

1) In medical diagnosis, a lung cancer predictive model is estimated using a training set of male and female patients. Gender can affect the medical test results to a certain degree, and men and women have different lung cancer risks. Thus, gender can be considered privileged information, and this information can be used to improve generalization.

2) In radar emitter recognition, our goal is to find a rule to classify radar emitters into specific types. At the training stage, apart from the radar signal datasets, a teacher can also give additional information about the uses of the radar emitters, such as civil or military. This additional information is available at the training stage but not at the testing stage, and it can be used to improve the classification performance.

3. ELM+ using privileged information

As described above, ELM is a competitive learning method that achieves excellent performance in both accuracy and run time. Recently, many researchers have devoted themselves to further improving its performance.


In this section, we introduce privileged information into the traditional ELM to enhance its performance.

During the training stage, we consider the privileged information $x^* \in X^*$, which belongs to the correcting space $X^*$, a space different from the feature space $X$. The labeled data $(x_i, t_i)$ are then replaced by triplets $(x_i, x_i^*, t_i)$. Since data from different groups may have different characteristics, the privileged information of the data should be employed to improve performance. The group information can be used by introducing additional constraints on the slack variables for samples from different groups. In the standard ELM, $\xi_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^T$ is the training error vector of the $m$ output nodes with respect to the training sample $x_i$. Thus, in order to be used in the learning phase, this privileged information needs to be related to the slack variables $\xi_i$. That is, we need to introduce different constraints on the slacks of different groups. Vapnik [21] proposed to define the slacks according to different privileged information by the so-called correcting functions
$$\xi_i(x_i) = \phi(x_i^*), \qquad \forall (x_i, x_i^*, y_i). \qquad (9)$$
These correcting functions are defined in the correcting space $X^*$. Then, we map the vector $x$ of the training triplets $(x_i, x_i^*, t_i)$ to the hidden-layer feature space by $h(x) = [h_1(x), \ldots, h_L(x)]$ and the vector $x^*$ to the hidden-layer correcting space by $h^*(x^*) = [h_1^*(x^*), \ldots, h_L^*(x^*)]$, respectively. In our method, the kernel functions of $h(\cdot)$ and $h^*(\cdot)$ can be the same or different. Note that points of different groups are mapped into the same decision space, and they are all used to construct the decision function; however, there are different correcting functions for different groups. Hence, by substituting (9) into (4), the classification problem for ELM+ can be formulated as
$$\text{Minimize: } L_{\text{ELM+}} = \frac{1}{2}\|\beta\|^2 + \frac{\mu}{2}\|\beta^*\|^2 + \frac{C}{2}\sum_{i=1}^{N}\bigl(h^*(x_i^*)\beta^*\bigr)^2$$
$$\text{Subject to: } h(x_i)\beta = t_i - h^*(x_i^*)\beta^*, \qquad i = 1, \ldots, N \qquad (10)$$
where $\beta_i^*$ is the correcting weight connecting the $i$-th hidden node to the output node. The capacity of the decision function is reflected by $\|\beta\|^2$ and the capacity of the correcting function for the privileged information is reflected by $\|\beta^*\|^2$, where we define our correcting function as $h^*(x_i^*)\beta^*$. The parameter $\mu$ is used to adjust the relative weights of these two capacities, and it is empirically chosen from $\{0.01, 0.1, 1, 10, 100\}$ according to the application. Based on the KKT theorem, training ELM+ is equivalent to solving the following dual optimization problem:
$$L_{\text{ELM+}} = \frac{1}{2}\|\beta\|^2 + \frac{\mu}{2}\|\beta^*\|^2 + \frac{C}{2}\sum_{i=1}^{N}\bigl(h^*(x_i^*)\beta^*\bigr)^2 - \sum_{i=1}^{N}\alpha_i\bigl(h(x_i)\beta - t_i + h^*(x_i^*)\beta^*\bigr) \qquad (11)$$
The corresponding KKT optimality conditions are:
$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = \sum_{i=1}^{N}\alpha_i\, h(x_i)^T = H^T\alpha \qquad (12a)$$
$$\frac{\partial L}{\partial \beta^*} = 0 \;\Rightarrow\; \mu\beta^* + C(H^*)^T H^*\beta^* - (H^*)^T\alpha = 0 \qquad (12b)$$
$$\frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; h(x_i)\beta - t_i + h^*(x_i^*)\beta^* = 0 \;\Rightarrow\; H\beta + H^*\beta^* = T \qquad (12c)$$
$\beta$ can be obtained from (12a)–(12c):
$$\beta = \Bigl(H + H^*\bigl(\mu I + C H^{*T} H^*\bigr)^{-1} H^{*T} \bigl(H^T\bigr)^{\dagger}\Bigr)^{\dagger} T \qquad (13)$$
where $(H^T)^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix $H^T$ [30]. Then, the output function of the ELM+ classifier is
$$f(x) = h(x)\beta = h(x)\Bigl(H + H^*\bigl(\mu I + C H^{*T} H^*\bigr)^{-1} H^{*T} \bigl(H^T\bigr)^{\dagger}\Bigr)^{\dagger} T. \qquad (14)$$
For a binary classification problem, ELM+ needs only one output node, and the decision function is
$$f(x) = \operatorname{sign}\Bigl(h(x)\Bigl(H + H^*\bigl(\mu I + C H^{*T} H^*\bigr)^{-1} H^{*T} \bigl(H^T\bigr)^{\dagger}\Bigr)^{\dagger} T\Bigr). \qquad (15)$$
For multiclass cases, the predicted class label of a testing point is the index of the output node with the highest output value for the given testing sample:
$$\text{label}(x) = \arg\max_{j \in \{1, \ldots, m\}} f_j(x). \qquad (16)$$
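To complement the derivation, the following minimal Python/NumPy sketch computes the ELM+ output weights of Eq. (13); it is our illustration under stated assumptions (hypothetical function name; H and H* are the hidden-layer matrices built from x and from the privileged vectors x*, e.g. as in the ELM sketch of Section 2.1):

```python
import numpy as np

def elmplus_train(H, H_star, T, C=1.0, mu=1.0):
    """ELM+ output weights beta via Eq. (13).

    H      : (N x L)  hidden-layer matrix of the decision space (from x)
    H_star : (N x L*) hidden-layer matrix of the correcting space (from x*)
    T      : (N x m)  target matrix
    """
    L_star = H_star.shape[1]
    # inner term: H* (mu I + C H*^T H*)^{-1} H*^T, an N x N matrix
    inner = H_star @ np.linalg.solve(
        mu * np.eye(L_star) + C * (H_star.T @ H_star), H_star.T
    )
    # beta = (H + inner (H^T)^dagger)^dagger T
    beta = np.linalg.pinv(H + inner @ np.linalg.pinv(H.T)) @ T
    return beta
```

Note that the correcting space is used only here: at test time the prediction is $f(x) = h(x)\beta$, exactly as in standard ELM, so $x^*$ is never needed for test samples.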

4. Experimental results

In the following, four methods are compared on four classification datasets: SVM, SVM+, the traditional ELM, and the proposed ELM+. The datasets are collected from the UCI Machine Learning Repository [31], the Mackey–Glass time series [32] and a set of radar emitter datasets [33]. The simulations of all the algorithms on all the datasets are carried out in MATLAB R2014a running on a 3.10-GHz CPU with 48 GB of RAM. In this paper, SVM, SVM+, ELM and ELM+ use the Gaussian kernel function $K(u, v) = \exp(-\gamma\|u - v\|^2)$ unless stated otherwise. In order to achieve good generalization performance, the trade-off constant $C$ and the kernel parameter $\gamma$ need to be chosen appropriately; both are searched over the range $\{2^{-18}, 2^{-17}, \ldots, 2^{24}, 2^{25}\}$.
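As a concrete illustration of this model selection (a sketch under our own assumptions, not the authors' published code; the SVC estimator and GridSearchCV come from scikit-learn and stand in for the paper's MATLAB implementation), the grid search over $C$ and $\gamma$ for the SVM baseline could look like:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# C and gamma are searched over {2^-18, 2^-17, ..., 2^24, 2^25},
# mirroring the range stated in the text.
param_grid = {
    "C": [2.0 ** k for k in range(-18, 26)],
    "gamma": [2.0 ** k for k in range(-18, 26)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: hypothetical training split
# best_C, best_gamma = search.best_params_["C"], search.best_params_["gamma"]
```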


4.1. UCI machine learning repository

There are 4177 instances in the Abalone dataset, each of which has 8 attributes: Sex, Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, and Shell weight. The goal is to predict the age of an abalone according to its physical features. First, we select a group attribute from the input attributes, and the whole dataset is sorted by this attribute. Second, the data are partitioned into several groups corresponding to different values of the group attribute. Each group should be of roughly similar size so that the attribute distribution in each group is similar. Finally, we use fivefold cross-validation to evaluate test accuracy, so that 80% of the instances are used for training and 20% for testing.

For the Abalone dataset, we test three different group attributes. First, the attribute Length is used to separate the data into two groups: group1 (Length < 0.5) and group2 (Length ≥ 0.5). In the process of human teaching and learning, we would call group1 "short" and group2 "long", and this difference in length helps a human estimate the age of an abalone. Similarly, we can use this difference as so-called privileged information at the training stage of ELM+ to help improve the performance. Second, the attribute Height is used to separate the data into two groups: group1 (Height < 0.15, "thin") and group2 (Height ≥ 0.15, "thick"). Third, the attribute Whole weight is used to separate the data into two groups: group1 (Whole weight < 0.8, "light") and group2 (Whole weight ≥ 0.8, "heavy"). As shown in Table 1, SVM+ and ELM+ achieve better performance than SVM and ELM.

Table 1
Test accuracies (%) on the Abalone dataset with different group attributes.

Group attribute   Length   Height   Whole weight
SVM               47.5     50.3     50.2
SVM+              67.0     73.0     71.5
ELM               48.5     51.1     50.8
ELM+              69.4     74.2     73.6

There are 270 instances in the Statlog Heart Disease dataset, each of which has 13 attributes. The goal is to predict the absence or presence of heart disease according to the medical test results. Because the medical test results vary with gender and age, experienced doctors will make different diagnoses for similar data according to the patient's gender and age. Thus, we select two attributes as privileged information in the experiment. First, we separate the data into two groups by the attribute Sex: group1 (Sex = 0) and group2 (Sex = 1). Second, we separate the data into four groups by the attribute Age: group1 (Age < 40), group2 (40 ≤ Age < 50), group3 (50 ≤ Age < 60) and group4 (Age ≥ 60). As shown in Table 2, ELM+ using privileged information achieves better performance than SVM and ELM.

Table 2
Test accuracies (%) on the Statlog Heart Disease dataset with different group attributes.

Group attribute   Sex    Age
SVM               52.1   53.2
SVM+              69.9   66.7
ELM               51.5   53.3
ELM+              69.4   67.1
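As an illustration of this grouping (a sketch; the thresholds are those stated above, and the function name is hypothetical), the privileged group labels can be derived directly from the chosen attribute:

```python
import numpy as np

def abalone_groups(length, height, whole_weight):
    """Two-way group labels used as privileged information for Abalone.

    One attribute's grouping is used per experiment; thresholds follow
    the text: Length 0.5, Height 0.15, Whole weight 0.8.
    """
    return {
        "Length": np.where(length < 0.5, 1, 2),              # 1 = "short", 2 = "long"
        "Height": np.where(height < 0.15, 1, 2),             # 1 = "thin",  2 = "thick"
        "Whole weight": np.where(whole_weight < 0.8, 1, 2),  # 1 = "light", 2 = "heavy"
    }
```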

4.2. Mackey–Glass time series datasets

In many real-world problems, it is important to predict values of time series. For example, in the stock market, we would like to predict whether the price will go up or down at moment t according to the prices before moment t. In the historical data, however, along with observations about the prices before moment t we also have observations about the prices after moment t. This information can be used as privileged information to help us achieve better predictive performance. Many researchers [34,35] consider the time series introduced by Mackey and Glass, which is the solution of the equation
$$\frac{dx(t)}{dt} = -a\,x(t) + \frac{b\,x(t-\tau)}{1 + x^{10}(t-\tau)} \qquad (17)$$
where $a$, $b$ and $\tau$ are parameters of the equation. In order to compare ELM+ with SVM+, we use the Mackey–Glass series with the same parameters used in [19]. To predict whether $x(t+T) > x(t)$, we use a four-dimensional vector of observations on the time series
$$x_t = \bigl(x(t-3),\, x(t-2),\, x(t-1),\, x(t)\bigr). \qquad (18)$$
We partly use future observations after moment $t$ as the privileged information
$$x_t^* = \bigl(x(t+T-1),\, x(t+T-2),\, x(t+T+1),\, x(t+T+2)\bigr). \qquad (19)$$
First, we consider one-step-ahead prediction, namely $T = 1$. As shown in Table 3, we present three comparisons for different training sizes. It can be seen that ELM+ always achieves accuracy comparable to that of SVM+. Moreover, the learning speed of ELM+ is much faster than that of SVM+.

Table 3
Performance comparisons of SVM, SVM+, ELM and ELM+ for different training sizes.

Training size        100                   200                   400
              time (s)   rate (%)   time (s)   rate (%)   time (s)   rate (%)
SVM           0.251      91.3       0.512      91.6       1.168      93.2
SVM+          0.276      91.7       0.563      92.1       1.204      93.6
ELM           0.007      91.0       0.017      91.4       0.038      92.8
ELM+          0.008      91.5       0.020      91.9       0.049      93.3

(time = training time; rate = testing rate.)

Table 4 shows the performance comparison of SVM, SVM+, ELM and ELM+ for different numbers of steps ahead ($T = 1, 5, 8$). It can be seen that the performance decreases as the number of steps increases, and ELM+ always achieves accuracy comparable to that of SVM+, which is better than that of SVM and ELM.

Table 4
Performance comparisons (testing rate, %) of SVM, SVM+, ELM and ELM+ for different steps ahead.

Steps ahead    1      5      8
SVM           91.3   90.1   88.7
SVM+          91.7   90.6   89.1
ELM           91.0   89.9   88.4
ELM+          91.5   90.4   88.9
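The feature construction of Eqs. (18) and (19) can be sketched as follows (our illustration; `x` is the simulated series as a 1-D array and the function name is hypothetical):

```python
import numpy as np

def mackey_glass_features(x, T):
    """Decision features (Eq. (18)), privileged features (Eq. (19)) and labels.

    x : 1-D array of the time series values x(0), x(1), ...
    T : number of steps ahead; y[i] = +1 if x(t+T) > x(t), else -1.
    """
    X, X_star, y = [], [], []
    for t in range(3, len(x) - T - 2):
        X.append([x[t - 3], x[t - 2], x[t - 1], x[t]])                           # Eq. (18)
        X_star.append([x[t + T - 1], x[t + T - 2], x[t + T + 1], x[t + T + 2]])  # Eq. (19)
        y.append(1 if x[t + T] > x[t] else -1)
    return np.array(X), np.array(X_star), np.array(y)
```

Only X is available at test time; X_star enters the training objective through the correcting space.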

4.3. Radar emitter datasets

There are 1952 instances belonging to thirteen categories in the radar emitter datasets. The radar pulse signal is single frequency, and the signal-to-noise ratio (SNR) is in the range of 15–25 dB. To make the experiment more convincing, we select three types of datasets in our experiments: Envelope, Empirical Mode Decomposition (EMD) and Ambiguity Function Representative-Cat (AFRC). The characteristics of radar emitters, such as the envelope, usually vary according to the uses of the carriers. Thus, military experts achieve better recognition rates when they know the uses of the carriers. As shown in Fig. 1(a), we first choose the uses of the radar emitters as the privileged information to create a two-level hierarchical organization. In Table 5, we can observe that the value of the parameter μ affects the performance of SVM+ and ELM+. Moreover, ELM+ achieves better performance than SVM and ELM.

[Fig. 1. Hierarchical organization of radar emitters. (a) Two levels: Radar Emitter splits into Civil (Labels 1–8) and Military (Labels 9–13). (b) Three levels: the Civil group is further split into Passenger (Labels 1–5) and Cargo, and the Military group into Model A (Labels 9–11) and Model B.]
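The two-level hierarchy of Fig. 1(a) reduces to a simple mapping from class labels to use groups; a sketch (our illustration, with the label ranges read off the figure):

```python
def emitter_use_group(label):
    """Privileged use group for an emitter class label (1-13), per Fig. 1(a)."""
    return "civil" if 1 <= label <= 8 else "military"

# The three-level variant of Fig. 1(b) refines each use group further
# (civil -> Passenger/Cargo, military -> Model A/Model B), yielding more
# detailed privileged information.
```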

Table 5
Performance comparisons of SVM, SVM+, ELM and ELM+ for different values of the parameter μ.

μ          0.1                        1                          10
Datasets   Envelope  EMD    AFRC      Envelope  EMD    AFRC      Envelope  EMD    AFRC
SVM        60.21     68.97  69.56     60.21     68.97  69.56     60.21     68.97  69.56
SVM+       60.87     69.78  70.33     61.25     70.18  70.72     61.48     70.39  70.93
ELM        59.90     69.02  69.46     59.90     69.02  69.46     59.90     69.02  69.46
ELM+       60.72     69.89  70.23     61.12     70.23  70.65     61.35     70.41  70.86

Table 6
The impact of hierarchical levels on performance.

Hierarchical level   2                          3
Features             Envelope  EMD    AFRC      Envelope  EMD    AFRC
SVM                  60.21     68.97  69.56     60.21     68.97  69.56
SVM+                 61.48     70.39  70.93     62.24     71.32  71.70
ELM                  59.90     69.02  69.46     59.90     69.02  69.46
ELM+                 61.35     70.41  70.86     62.28     71.26  71.69

Table 7
The impact of the output functions of the hidden nodes on performance.

Output functions   Gaussian+Gaussian   Sigmoid+Sigmoid   Sigmoid+Gaussian   Gaussian+Sigmoid
Envelope           62.28               62.19             62.25              62.35
EMD                71.26               71.16             71.24              71.36
AFRC               71.69               71.60             71.66              71.78

Second, we consider the impact of the hierarchical level on the performance. As shown in Fig. 1(b), we choose the uses and sorts of the radar emitters as the privileged information to create a three-level hierarchical organization. As seen in Table 6, the more detailed the privileged information is, the better the performance that ELM+ can achieve.

Finally, we consider the impact of the output functions of the hidden nodes on the performance. Because of the difference in dimensions or structures between the feature space X and the correcting space X*, we should choose appropriate output functions for the feature x and the privileged information x*. For example, if we map the vector x by the Sigmoid kernel function and the vector x* by the Gaussian kernel function, the combination is marked as "Sigmoid+Gaussian". As shown in Table 7, "Gaussian+Sigmoid" is the most appropriate combination for the radar emitter datasets.
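Since the decision and correcting spaces may use different hidden-node output functions, a combination such as "Gaussian+Sigmoid" simply means building H and H* with different random mappings. A sketch (our illustration; the Gaussian-node parameterization below is an assumption):

```python
import numpy as np

def hidden_layer(X, n_hidden, kind="sigmoid", seed=0):
    """Random hidden-layer mapping with a selectable output function."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # weights / centers a_i
    b = rng.uniform(0.1, 1.0, size=n_hidden)                 # biases / widths b_i
    if kind == "sigmoid":
        return 1.0 / (1.0 + np.exp(-(X @ A + b)))
    # Gaussian node (assumed form): exp(-b_i * ||x - a_i||^2)
    sq_dists = ((X[:, None, :] - A.T[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-b * sq_dists)

# "Gaussian+Sigmoid": Gaussian nodes for the decision space (x) and
# sigmoid nodes for the correcting space (x*).
# H      = hidden_layer(X,      200, kind="gaussian")
# H_star = hidden_layer(X_star, 200, kind="sigmoid")
```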

5. Conclusions

In this paper, we propose a new ELM called ELM+, which introduces privileged information into the traditional ELM method. In order to be used in the learning phase, the privileged information needs to be related to the training errors, or slack variables ξ_i, used in the traditional ELM method. The privileged information can then be used to introduce additional constraints on the slack variables for the training samples according to the practical application. Furthermore, the proposed method retains the advantages of ELM: all the hidden node parameters are randomly generated and the output weights are analytically determined. The experimental results show that ELM+ yields better performance than ELM, and achieves accuracy comparable to that of SVM+ with a much faster learning speed. In the process of human teaching and learning, privileged information is used to solve various learning problems; thus, ELM+ can be applied in many fields, such as medical diagnosis, handwritten digit recognition, and face recognition. In future work, we will investigate better methods for determining privileged information suited to a wider range of real-world applications. Moreover, we will further analyze the impact of different values of the parameter μ and of different output functions on the performance of the algorithm.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant no. 61301286.

References

[1] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, in: Proceedings of the International Joint Conference on Neural Networks, vol. 2, Budapest, Hungary, 25–29 July 2004, pp. 985–990.
[2] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1–3) (2006) 489–501.
[3] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Networks 17 (4) (2006) 879–892.
[4] G.-B. Huang, X.-J. Ding, H.-M. Zhou, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. 42 (2) (2012) 513–529.
[5] G.-B. Huang, L. Chen, C.-K. Siew, Universal approximation using incremental constructive feedforward networks with random hidden nodes, IEEE Trans. Neural Networks 17 (4) (2006) 879–892.
[6] P.L. Bartlett, The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE Trans. Inf. Theory 44 (2) (1998) 525–536.
[7] W. Deng, Q. Zheng, L. Chen, Regularized extreme learning machine, in: IEEE Symposium on Computational Intelligence and Data Mining, March 30–April 2, 2009, pp. 389–395.
[8] X. Tang, M. Han, Partial Lanczos extreme learning machine for single-output regression problems, Neurocomputing 72 (13–15) (2009) 3066–3076.
[9] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, A. Lendasse, OP-ELM: optimally pruned extreme learning machine, IEEE Trans. Neural Networks 21 (1) (2010) 158–162.


[10] K. Neumann, J.J. Steil, Optimizing extreme learning machines via ridge regression and batch intrinsic plasticity, Neurocomputing 102 (2013) 23–30.
[11] Y.-G. Wang, F.-L. Cao, Y.-B. Yuan, A study on effectiveness of extreme learning machine, Neurocomputing 74 (16) (2011) 2483–2490.
[12] P. Horata, S. Chiewchanwattana, K. Sunat, Robust extreme learning machine, Neurocomputing 105 (2013) 31–44.
[13] W.-W. Zong, G.-B. Huang, Y. Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (2013) 229–242.
[14] J.-W. Zhao, Z.-H. Wang, D.S. Park, Online sequential extreme learning machine with forgetting mechanism, Neurocomputing 87 (2012) 79–89.
[15] H.-J. Rong, G.-B. Huang, N. Sundararajan, P. Saratchandran, Online sequential fuzzy extreme learning machine for function approximation and classification problems, IEEE Trans. Syst. Man Cybern. 39 (4) (2009) 1067–1072.
[16] N.-Y. Liang, G.-B. Huang, P. Saratchandran, N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Trans. Neural Networks 17 (6) (2006) 1411–1423.
[17] Q. Yu, Y. Miche, E. Eirola, M. van Heeswijk, E. Severin, A. Lendasse, Regularized extreme learning machine for regression with missing data, Neurocomputing 102 (2013) 45–51.
[18] Z.-H. Man, K. Lee, D.-H. Wang, Z.-W. Cao, S.Y. Khoo, Robust single-hidden layer feedforward network-based pattern classifier, IEEE Trans. Neural Networks Learn. Syst. 23 (12) (2012) 1974–1986.
[19] C.S. Shieh, C.T. Lin, A vector neural network for emitter identification, IEEE Trans. Antennas Propag. 50 (8) (2002) 1120–1127.
[20] L. Li, H.B. Ji, Radar emitter recognition based on cyclostationary signatures and sequential iterative least-square estimation, Expert Syst. Appl. 38 (2011) 2140–2147.
[21] V. Vapnik, A. Vashist, A new learning paradigm: learning using privileged information, Neural Networks 22 (5) (2009) 544–557.
[22] V. Vapnik, Empirical Inference Science, Springer, New York, 2006.
[23] F. Cai, Advanced Learning Approaches Based on SVM+ Methodology, University of Minnesota, 2011.
[24] L. Liang, V. Cherkassky, Connection between SVM+ and multi-task learning, in: Proceedings of the IEEE International Joint Conference on Neural Networks, IEEE, 2008, pp. 2048–2054.
[25] L. Liang, V. Cherkassky, Learning using structured data: application to fMRI data analysis, in: Proceedings of the IEEE International Joint Conference on Neural Networks, IEEE, 2007, pp. 495–499.
[26] W.-X. Zhu, P. Zhong, A new one-class SVM based on hidden information, Knowl.-Based Syst. 60 (2014) 35–43.
[27] J. Feyereisl, U. Aickelin, Privileged information for data clustering, Inf. Sci. 194 (2012) 4–23.
[28] V. Sharmanska, N. Quadrianto, C.H. Lampert, Learning to rank using privileged information, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2013, pp. 825–832.
[29] R. Fletcher, Practical Methods of Optimization: Constrained Optimization, vol. 2, Wiley, New York, 1981.
[30] C.R. Rao, S.K. Mitra, Generalized Inverse of Matrices and Its Applications, Wiley, New York, 1971.
[31] C.B.S. Hettich, C. Merz, UCI Repository of Machine Learning Databases, 1998. Available at <http://www.ics.uci.edu/~mlearn/MLRepository.html>.
[32] S. Singh, Mackey Glass Time Series. Available at <http://www.mathworks.com/matlabcentral/fileexchange/30232-mackey-glass-time-series>.
[33] V.C. Chen, Time-frequency/time-scale analysis for radar applications. Available at <http://airborne.nrl.navy.mil/~vchen/tftsa.html>.
[34] Y. Chen, B. Yang, J. Dong, Time-series prediction using a local linear wavelet neural network, Neurocomputing 69 (4) (2006) 449–465.
[35] M.A. Farsa, S. Zolfaghari, Chaotic time series prediction with residual analysis method using hybrid Elman-NARX neural networks, Neurocomputing 73 (13) (2010) 2540–2553.

Wenbo Zhang was born in Shaanxi, China in 1985. He received the B.S. (2005) and M.S. (2009) degrees from the School of Telecommunications Engineering and the Ph.D. (2014) degree from the School of Electronic Engineering, all at Xidian University. Currently, he is a lecturer in the School of Electronic Engineering at Xidian University. His research interests include pattern recognition, support vector machines and extreme learning machines.

Hongbing Ji was born in Shaanxi, China in 1963. He graduated from Northwest Telecommunications Engineering College (the predecessor of Xidian University) with a B.S. degree in radar engineering in 1983. He received the M.S. (1989) degree in Circuits, Signals and Systems and the Ph.D. (1999) degree in Signal and Information Processing from Xidian University. After graduating in 1989, he joined the School of Electronic Engineering at Xidian University, where he was a lecturer from 1990 to 1995, an associate professor from 1995 to 2000, and has been a professor since 2000. From 1996 to 2002 he served as a vice dean of the School of Electronic Engineering. Since 2002, he has been the executive dean of the graduate school of Xidian University and a vice chairman of the Academic Degree Evaluation Committee. His primary research areas are radar signal processing, automatic target recognition, and multisensor information fusion and target tracking. Prof. Ji is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), a member of the IEEE Signal Processing Society and a member of the IEEE Aerospace & Electronic Systems Society.

Guisheng Liao was born in Guilin, China. He received the B.S. degree from Guangxi University, Nanning, China, in 1985 and the M.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 1990 and 1992, respectively. He has been a Senior Visiting Scholar with The Chinese University of Hong Kong, Shatin, Hong Kong. He is currently a Professor with the National Key Laboratory of Radar Signal Processing, Xidian University. His research interests include synthetic aperture radar (SAR), space-time adaptive processing, SAR ground moving target indication, and distributed small satellite SAR system design. Prof. Liao is currently a member of the National Outstanding Person program in China.

Yongquan Zhang was born in Gansu, China in 1985. He received the B.S. (2007) degree in computer science and technology and the M.S. (2010) degree in computer applications technology from Lanzhou University of Technology. He received the Ph.D. (2014) degree from the School of Electronic Engineering at Xidian University. Currently, he is a lecturer in the School of Electronic Engineering at Xidian University. His research interests include machine learning, signal processing, target tracking and data fusion.