Single index fuzzy neural networks using locally weighted polynomial regression



Jer-Guang Hsieh a, Jyh-Horng Jeng b,∗, Yih-Lon Lin b, Ying-Sheng Kuo c

a Department of Electrical Engineering, I-Shou University, Kaohsiung 84001, Taiwan


b Department of Information Engineering, I-Shou University, Kaohsiung 84001, Taiwan
c General Education Center, Open University of Kaohsiung, Kaohsiung 81249, Taiwan


Received 8 February 2018; received in revised form 12 February 2019; accepted 12 February 2019


Abstract


Novel single index fuzzy neural network models are proposed in this paper for general machine learning problems. The proposed models differ from the usual fuzzy neural network models in that the output nodes of the networks are replaced by (nonparametric) single index models. Specifically, instead of pre-specifying the output activation functions as in the usual models, they are re-estimated adaptively during the training process via "Loess" ("LOcal regrESSion"), a powerful (nonparametric) scatterplot smoother. These estimated activation functions are not necessarily the usual sigmoidal or identity functions. It is interesting to find that in many cases the estimated output activation functions are well approximated by simple polynomial or generalized hyperbolic tangent functions. These problem-tailored simple functions can, if necessary, then be used as the actual output activation functions for neural network training and prediction. Particle swarm optimization, a commonly used evolutionary computation technique, is adopted in this study to search for the optimal connection weights of the neural networks. The main advantages of the single index fuzzy neural network models are that they are well suited to situations in which information about the probability distribution of the response is lacking, and that the output activation functions of the neural networks need not be specified in advance. Simulation results show that the proposed models usually provide better fits than the usual models for the data at hand.
© 2019 Published by Elsevier B.V.


Keywords: Fuzzy neural network; Single index model; Generalized linear model; Loess; Scatterplot smoother


1. Introduction

Fuzzy neural networks (FNNs) have long been regarded as efficient and powerful learning machines for many problems in science and engineering, for instance, dynamic systems [1], health care [2], time series forecasting [3], intelligent transportation systems (ITS) [4], and robot control [5].


* Corresponding author.

E-mail addresses: [email protected] (J.-G. Hsieh), [email protected] (J.-H. Jeng), [email protected] (Y.-L. Lin), [email protected] (Y.-S. Kuo).
https://doi.org/10.1016/j.fss.2019.02.010


Nomenclature
p	the index set {1, 2, · · · , p}
ℝ^n	n-dimensional real space
X × Y	Cartesian product of sets X and Y
‖x‖	Euclidean norm of x ∈ ℝ^n


Fig. 1. Standard fuzzy system.


The class of approximating functions represented by fuzzy neural networks, with the number of hidden nodes left unfixed, possesses the universal approximation property [6,7], in the sense that, given any continuous (or, more generally, square-integrable) function g(x) defined on a compact set U ⊆ ℝ^n and any positive constant ε > 0, no matter how small, there is a fuzzy neural network f_ε belonging to this class of functions such that

sup_{x ∈ U} |f_ε(x) − g(x)| ≤ ε.

The universal approximation property is crucial for the success of a learning machine in a variety of applications. Consider the standard fuzzy system with n inputs and p outputs as shown in Fig. 1, where U is the input space and V is the output space [8–11]. The fuzzy rule base is composed of m fuzzy rules in canonical form:

R_j: IF x_1 is A_{1j} and x_2 is A_{2j} and . . . and x_n is A_{nj}, THEN y_1 is B_{j1} and y_2 is B_{j2} and . . . and y_p is B_{jp},

where j ∈ m. Suppose we use (normal) Gaussian membership functions for the fuzzy sets A_{ij}, and choose the singleton fuzzifier, product inference engine, and center-average defuzzifier [8]. Then the input-output relationship of the fuzzy system, represented as a crisp nonlinear map from the input space U to the output space V, is given by

y_k = f_k(x) = \frac{\sum_{j=1}^{m} w_{jk} \exp[-\sum_{i=1}^{n} (x_i - c_{ij})^2 v_{ij}]}{\sum_{j=1}^{m} \exp[-\sum_{i=1}^{n} (x_i - c_{ij})^2 v_{ij}]},  k ∈ p,   (1)

where x = [x_1 . . . x_n]^T ∈ ℝ^n, w_{jk} is the center of the normal fuzzy set B_{jk}, and c_{ij} and v_{ij} are the center and "precision" parameter (reciprocal of the "variance" parameter), respectively, of the Gaussian fuzzy set A_{ij} [8]. Define, for i ∈ n, j ∈ m, and k ∈ p,

u_j = \sum_{i=1}^{n} (x_i - c_{ij})^2 v_{ij},   r_j = \exp(-u_j),   (2a)


Fig. 2. Fuzzy neural network.


s_k = \sum_{j=1}^{m} w_{jk} r_j,   g = \sum_{j=1}^{m} r_j,   (2b)

then

y_k = s_k / g.   (2c)

It is desirable to extend the applicability of fuzzy systems so that more types of data can be handled under the same framework. For instance, the fuzzy system should be able to perform logistic regression for binary or binomial data, Poisson regression for count data, Gaussian regression for bell-shaped data, and gamma regression for right-skewed continuous response data. To this end, a natural strategy is to add an output activation function f_{ok} to the kth output y_k. Then (2c) becomes


t_k = s_k / g,   y_k = f_{ok}(t_k),   (2d)

and the predictive function (1) of the fuzzy system is generalized to become

y_k = f_k(x) = f_{ok}\left( \frac{\sum_{j=1}^{m} w_{jk} \exp[-\sum_{i=1}^{n} (x_i - c_{ij})^2 v_{ij}]}{\sum_{j=1}^{m} \exp[-\sum_{i=1}^{n} (x_i - c_{ij})^2 v_{ij}]} \right),  k ∈ p.   (3)
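To make the map in (1)–(3) concrete, the following is a minimal R sketch of the forward pass of the fuzzy system for a single input vector; the function name fnn_predict and its argument names are illustrative and not from the paper's original code.

```r
# Minimal sketch of the FNN predictive function (3) for one input vector x.
# c_mat, v_mat: n x m matrices of centers c_ij and precisions v_ij;
# w_mat: m x p matrix of consequent centers w_jk; f_o: output activation.
fnn_predict <- function(x, c_mat, v_mat, w_mat, f_o = identity) {
  u <- colSums((x - c_mat)^2 * v_mat)   # u_j in (2a), j = 1, ..., m
  r <- exp(-u)                          # r_j in (2a)
  s <- as.vector(t(w_mat) %*% r)        # s_k in (2b)
  g <- sum(r)                           # g in (2b)
  f_o(s / g)                            # y_k = f_ok(t_k) as in (2d)
}
```

Applying one common f_o to every output is a simplification made here for brevity; in the models discussed below, each output node k has its own activation f_{ok}.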

The fuzzy system (3) can now be represented as a feedforward network, namely the fuzzy neural network (FNN), as shown in Fig. 2 [8,10,12]. The advantage of using an FNN for machine learning is that the parameters c_{ij}, v_{ij}, and w_{jk} usually have clear physical meanings and we have some intuitive methods to choose good initial values for them [8]. The activation functions (i.e., the inverse link functions) for the output nodes may be specified according to the distribution of the response data, and these are not necessarily the usual sigmoidal or identity functions. Unfortunately, in many real situations, it is sometimes hard to pre-specify appropriate output activation functions for the data at hand. Inappropriate specification of the output activation functions can possibly give a poor fit to the data [13–15]. In [13], various activation functions (from a pool of pre-specified functions) in each hidden layer and the output layer of the artificial neural network were compared in order to select the best activation functions for identifying the


types of internal fault of the transformer winding. In [14], a procedure was proposed to determine the optimum activation function for an artificial neural network. It regards selection of the most suitable activation function as a discrete optimization problem, which involves generating various combinations of functions from a pool of pre-specified functions, evaluating their performances as activation functions in a neural network, and returning the optimal function or combination of functions which yields the best result. A bridge scour problem was used to demonstrate the performance of their algorithm. It was pointed out in [15] that the learning task of an artificial neural network would be better faced by relying on a more appropriate, problem-specific basis of activation functions. A connectionist model which exploits adaptive activation functions was proposed in [15]. In their approach, each hidden unit in the network is associated with a specific pair (f(·), p(·)), where f(·) is the activation function and p(·) is the likelihood of the unit being relevant to the computation of the network output over the current input. The activation function f(·) is optimized in a supervised manner, while the likelihood p(·) is realized via a statistical parametric model learned through unsupervised (or partially supervised) estimation.
In order to provide a better fit for the problem at hand, it would be better to estimate suitable output activation functions from the available data, instead of selecting them from a pool of pre-specified functions as in the aforementioned papers. In this study, the output nodes will be replaced by the (nonparametric) single index models from semiparametric regression theory [16,17]. A single index model summarizes the effects of the explanatory variables within a single variable called the index [16]. More recent developments on single index models can be found in [18–20] and the references therein. We will show in this paper how scatterplot smoothers may be used to estimate the output activation functions from the given data. As a consequence, the approach proposed in this study may be suited to data with arbitrary response distributions, not just those from the usual exponential family of distributions. Furthermore, our approach can easily be generalized to other neural networks such as artificial neural networks and (generalized) radial basis function networks.

2. Single index fuzzy neural networks


Since the output nodes of the network will be replaced by single index models, the fuzzy neural network shown in Fig. 2 may be called a single index fuzzy neural network (SI-FNN). Let X ⊆ ℝ^n and Y ⊆ ℝ^p. Suppose we are given the following training set

S := {(x_q, d_q)}_{q=1}^{l} ⊆ X × Y.   (4)

In (4), X × Y denotes the Cartesian product of the sets X and Y. In the following, we will use the subscript q to denote the qth example. For instance, x_{qi} denotes the ith component of the qth input x_q ∈ ℝ^n. By this convention, we have, for q ∈ l, j ∈ m, and k ∈ p,

u_{qj} = \sum_{i=1}^{n} (x_{qi} - c_{ij})^2 v_{ij},   r_{qj} = \exp(-u_{qj}),   (5a)

s_{qk} = \sum_{j=1}^{m} w_{jk} r_{qj},   g_q = \sum_{j=1}^{m} r_{qj},   (5b)

t_{qk} = s_{qk} / g_q,   y_{qk} = f_{ok}(t_{qk}).   (5c)

To compute the estimated output y_{qk}, i.e., to estimate the output activation function f_{ok} in (5), the input and output data of the scatterplot smoother (i.e., Loess in this study) are {t_{qk}}_{q=1}^{l} in (5c) and {d_{qk}}_{q=1}^{l} in (4). The residual (or error) e_{qk} at the kth output node due to the qth example is defined by

e_{qk} := d_{qk} − y_{qk},   q ∈ l, k ∈ p.   (6)

In the usual least squares approach, the optimal weights c_{ij}, v_{ij} and w_{jk} are obtained by minimizing the sum of squared errors in (6), given by

Ψ := \frac{1}{2} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2.   (7)
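As a concrete illustration of (5)–(7), the following R sketch computes the indices t_{qk}, estimates f_{ok} with loess(), and evaluates the residuals and the cost Ψ. The function name si_fnn_cost and its arguments are illustrative, and the span/degree values are only defaults here, not prescriptions from the paper.

```r
# Sketch: given FNN weights, compute t_qk by (5), estimate f_ok by Loess,
# and return the least squares cost (7). X: l x n inputs; D: l x p targets.
si_fnn_cost <- function(X, D, c_mat, v_mat, w_mat, span = 0.75, degree = 2) {
  l <- nrow(X); p <- ncol(D)
  Tmat <- matrix(0, l, p)
  for (q in 1:l) {
    r <- exp(-colSums((X[q, ] - c_mat)^2 * v_mat))   # r_qj in (5a)
    Tmat[q, ] <- as.vector(t(w_mat) %*% r) / sum(r)  # t_qk in (5c)
  }
  Y <- matrix(0, l, p)
  for (k in 1:p) {                                   # estimate f_ok node by node
    fit <- loess(D[, k] ~ Tmat[, k], span = span, degree = degree)
    Y[, k] <- fitted(fit)                            # y_qk = f_ok(t_qk)
  }
  E <- D - Y                                         # residuals e_qk in (6)
  0.5 * sum(E^2)                                     # cost Psi in (7)
}
```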


3. Scatterplot smoothers


Scatterplot smoothing is commonplace in data analysis where one is interested in discovering the underlying trend in the scatterplot. The scatterplot points are simply a collection of points on a plane, without any regard to an underlying probabilistic model. We could think of the vertical position of each point as a realization of a random variable y conditioned on the univariate variable x with value corresponding to the horizontal position of the point [17]. The probabilistic model of scatterplot smoothing can be written as

y_i = f(x_i) + ε_i,   E[ε_i] = 0.   (8)

In (8), the ε_i are the random errors and E[·] denotes the mathematical expectation. In nonparametric regression, the function f is some unspecified "smooth" function that needs to be estimated from the data (x_i, y_i), i ∈ n [17]. There are several available methods for smoothing a scatterplot, including spline regression, penalized spline regression, local polynomial fitting (including kernel smoothers, e.g., Nadaraya-Watson estimators), locally weighted polynomial regression (e.g., Loess), and series-based smoothing (e.g., truncated Fourier series regression and wavelet regression) [17,21]. In spline regression, the selection of good basis functions and of the number of knots is usually challenging, though some automatic knot selection methods have been developed [17]. Adding more knots gives more flexibility. However, it is well known that if there is too much flexibility, the plot is usually heavily "overfitted", meaning that the fitted function follows small, apparently random, fluctuations in the data as well as the main features [17]. For a kernel smoother, the prediction is based on a weighted average of the data. Typically, the weights are chosen such that they are (nearly) equal to zero for all data outside of a defined "neighborhood" of the specific location of interest. These kernel smoothers use a "bandwidth" to define this neighborhood of interest [21]. A large value of the bandwidth results in more of the data being used to predict the response at the specific location. Consequently, the resulting plot of predicted values becomes much smoother as the bandwidth increases. On the other hand, as the bandwidth decreases, less of the data are used to generate the prediction, and the resulting plot looks more wiggly or bumpy [21]. One of the difficulties of kernel smoothing is that the denominator of the estimator at a point may be equal to zero. In that case, the numerator is also zero and the estimate is not defined at this point. This will be the case in regions of sparse data [16].
An important and powerful scatterplot smoother is the so-called "Loess" ("LOcal regrESSion"). This smoother will be adopted in this study to estimate the output activation functions of the fuzzy neural network. The input and output data of the smoother are {t_{qk}}_{q=1}^{l} and {d_{qk}}_{q=1}^{l}. We now briefly describe the scatterplot smoother Loess. Let x_0 be the specific location of interest. Like kernel regression, Loess uses the data from a neighborhood around x_0, but in a totally different way. The set of data points in a neighborhood of x_0 to be used for the weighted least squares fit at x_0 is defined in terms of a smoothing parameter called the "span". For instance, a span of 0.75 means that the nearest 3/4 of the data points will be used to form the neighborhood of x_0. The span determines the amount of the tradeoff between reduction in bias and increase in variance. At each point in the data set, a low-degree (usually linear or quadratic) polynomial is fitted to the points in the neighborhood. The weights for the weighted least squares fit at x_0 are usually based on the distances from x_0 of the points used in the estimation. Most software packages use the tri-cube weighting function as the default. Specifically, let Δ(x_0) be the maximum distance from x_0 over points in the neighborhood of x_0 [21]. The weight assigned to the point x_j is

W(|x_0 − x_j| / Δ(x_0)),   (9a)

where

W(t) = (1 − t³)³ for 0 ≤ t < 1, and W(t) = 0 elsewhere.   (9b)

The weighted least squares fits in Loess are usually performed with the weights given by (9). As pointed out in [22], the biggest advantage Loess has over many other methods is that it does not require the specification of a function to fit a model to all of the data in the sample. Our programs for the simulations in this study were written in R, a powerful statistical programming language. The function loess() in R implements the Loess procedure [23]. It will be used to estimate the activation functions of the


output nodes. In loess(), the argument "span" defines the fraction of the total points used to form neighborhoods, degree = 1 (resp., degree = 2) means that a first-order (resp., second-order) polynomial is used for local fitting, family = "gaussian" indicates that local fitting is by least squares, and family = "symmetric" indicates that an M-estimator is used with Tukey's biweight (or bisquare) function [21]. Tukey's biweight (or bisquare) function, one of the so-called ρ-functions, is given by

ρ(u) := (t²/6) {1 − [1 − (u/t)²]³} for |u| ≤ t, and ρ(u) := t²/6 for |u| > t,   (10a)
u = e/c,   (10b)

where e is the residual, c > 0 is a scale parameter, and t is a positive constant. This function is often used as the cost function in robust regression.
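The following short R sketch shows the loess() arguments described above, together with a direct transcription of the ρ-function (10); the example data are illustrative only, and the default tuning values in rho_tukey simply echo the settings later used for Example 6 (Table 9).

```r
# Illustrative loess() calls with the arguments discussed above.
set.seed(1)
x <- runif(100, -2, 2)
y <- x^2 + rnorm(100, sd = 0.1)
fit_ls  <- loess(y ~ x, span = 0.75, degree = 2, family = "gaussian")   # local least squares
fit_rob <- loess(y ~ x, span = 0.75, degree = 2, family = "symmetric")  # robust (biweight) fitting

# Tukey's biweight rho-function (10a)-(10b): e = residual, c = scale, t = tuning constant.
# Defaults c = 1, t = 5.5 match the Example 6 settings reported in Table 9.
rho_tukey <- function(e, c = 1, t = 5.5) {
  u <- e / c
  ifelse(abs(u) <= t, (t^2 / 6) * (1 - (1 - (u / t)^2)^3), t^2 / 6)
}
```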


4. Particle swarm optimization

Particle swarm optimization (PSO) is one of the best known optimization techniques among the many evolutionary computation methods [24–26]. This optimizer will be adopted in this study for finding the optimal FNN weights c_{ij}, v_{ij}, and w_{jk}. These weights are the components of a particle in the PSO algorithm. In the FNN shown in Fig. 2, the number of weights is 2nm + mp, which is then the number of components in each particle of the PSO algorithm. Let ξ_{ij} be the jth position component of the ith particle and η_{ij} the jth velocity component of the ith particle. The updating formulae for the velocity and position of the ith particle are given by, respectively,

η_{ij} ← w · η_{ij} + c_1 · r_{ij}^{(1)} · (best_{ij} − ξ_{ij}) + c_2 · r_{ij}^{(2)} · (best_{champion,j} − ξ_{ij}),   (11a)
ξ_{ij} ← ξ_{ij} + η_{ij}.   (11b)

Here, best_{ij} is the jth component of the best previous position of the ith particle, champion is the particle giving the best objective value of all particles up to the present, w is the inertia weight balancing local and global searches, c_1 and c_2 are two pre-specified constants, sometimes called the individuality and sociality coefficients, respectively, and r_{ij}^{(1)}, r_{ij}^{(2)} are two random numbers drawn from [0, 1]. In many practical cases, it is preferable to put bounds on the velocities η_{ij} and/or positions ξ_{ij}. Moreover, we may need to apply the particle swarm optimization algorithm many times to find a reasonably good approximate solution. The parameters w = 0.6 and c_1 = c_2 = 1.7 used in our simulations were suggested by [27].

5. Algorithm

The algorithm for training a single index fuzzy neural network is listed below. Our programs were written in R. Recall that the components of a particle contain an array of connection weights c_{ij}, v_{ij}, and w_{jk}.

Algorithm 1.
Data: Training set S = {(x_q, d_q)}_{q=1}^{l}.
Goal: Find c_{ij}, v_{ij}, w_{jk}, and f_{ok}.
Settings: the number of epochs, epoch_max.
Step 1: for all i from 1 to epoch_max
Step 2: Initialize a population of particles containing c_{ij}, v_{ij} and w_{jk} using random numbers.
Step 3: For each particle, calculate u_{qj}, r_{qj}, s_{qk}, and t_{qk} according to (5).
Step 4: Estimate the activation functions f_{ok} of the output nodes by Loess using the training data {(t_{qk}, d_{qk})}_{q=1}^{l}.
Step 5: Calculate the estimated outputs y_{qk} from (5c), the residuals e_{qk} from (6), and the cost Ψ in (7) or the Tukey biweight cost in (10).
Step 6: If the stopping criterion for PSO is met, then save the best particle best[i], increase the epoch by 1, and go to Step 1; otherwise, go to Step 7.
Step 7: Update the weights c_{ij}, v_{ij}, and w_{jk} by the particle swarm optimization formulae (11) and go to Step 3.
Step 8: End of the for loop.


Fig. 3. True and predictive functions in Example 1. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)


Step 9: Choose the best particle from the best particles best[i] of each epoch.
Step 10: Get the best connection weights c_{ij}, v_{ij}, and w_{jk} by copying the appropriate components of the best particle found in Step 9.
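The following R sketch outlines one way Algorithm 1 could be implemented for a single output node, using the forward pass and cost of (5)–(7), a Loess-estimated output activation, and the PSO updates (11a)–(11b). All function and variable names (e.g., train_si_fnn) are illustrative, and the stopping criterion is simplified to a fixed number of PSO iterations rather than the unspecified criterion in Step 6.

```r
# Sketch of Algorithm 1 (single output, p = 1) under the stated assumptions.
# X: l x n input matrix; d: length-l target vector; m: number of hidden nodes.
train_si_fnn <- function(X, d, m, n_particles = 40, n_iter = 100,
                         w = 0.6, c1 = 1.7, c2 = 1.7, span = 0.75, degree = 2) {
  n <- ncol(X); dim_w <- 2 * n * m + m            # 2nm + mp weights per particle, p = 1
  cost_fn <- function(theta) {                    # unpack a particle and evaluate (5)-(7)
    c_mat <- matrix(theta[1:(n * m)], n, m)
    v_mat <- matrix(abs(theta[(n * m + 1):(2 * n * m)]), n, m)  # precisions kept positive
    w_vec <- theta[(2 * n * m + 1):dim_w]
    t_q <- apply(X, 1, function(x) {
      r <- exp(-colSums((x - c_mat)^2 * v_mat))
      sum(w_vec * r) / sum(r)
    })
    fit <- loess(d ~ t_q, span = span, degree = degree)  # Step 4: estimate f_o by Loess
    0.5 * sum((d - fitted(fit))^2)                        # Step 5: cost (7)
  }
  pos <- matrix(runif(n_particles * dim_w, -2, 2), n_particles, dim_w)  # Step 2
  vel <- matrix(runif(n_particles * dim_w, -1, 1), n_particles, dim_w)
  best_pos <- pos
  best_val <- apply(pos, 1, cost_fn)
  champ <- best_pos[which.min(best_val), ]
  for (iter in 1:n_iter) {                         # Steps 3-7 repeated
    r1 <- matrix(runif(n_particles * dim_w), n_particles, dim_w)
    r2 <- matrix(runif(n_particles * dim_w), n_particles, dim_w)
    champ_mat <- matrix(champ, n_particles, dim_w, byrow = TRUE)
    vel <- w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (champ_mat - pos)  # (11a)
    pos <- pos + vel                                                           # (11b)
    val <- apply(pos, 1, cost_fn)
    improved <- val < best_val
    best_pos[improved, ] <- pos[improved, ]
    best_val[improved] <- val[improved]
    champ <- best_pos[which.min(best_val), ]
  }
  champ                                            # best connection weights found
}
```

A full implementation would also bound the velocities and positions, use the Tukey biweight cost of (10) when family = "symmetric" is chosen, and repeat the whole loop over several epochs as in Steps 1, 6, and 9–10.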


6. Illustrative examples

In this section, the main features of the proposed SI-FNNs will be illustrated via several numerical examples. In the tables below for the simulation examples, the quantity "cost" is the estimated minimal value of the cost functional relevant to the problem. The cost is the sum of squared errors defined in (7) for uncorrupted data and is the Tukey biweight cost defined in (10) for corrupted data. If unit normal scaling is applied to the given data, then the "cost" is expressed in the scaled version. Let e_{qk} be the residual at the kth output node due to the qth example given in (6). The mean squared error (MSE) and mean absolute deviation (MAD) of the residuals are defined by, respectively,

MSE := \frac{1}{l·p} \sum_{q=1}^{l} \sum_{k=1}^{p} e_{qk}^2,   MAD := \frac{1}{l·p} \sum_{q=1}^{l} \sum_{k=1}^{p} |e_{qk}|.   (12)

These two quantities in (12) are commonly regarded as performance indicators of regressors for uncorrupted data. Note that the MSEs and MADs in the following simulations are expressed in the original target scales. Let us begin by considering two simple function learning (or regression estimation) problems.


Example 1. Consider the 50 data points shown in Fig. 3, which were randomly generated from the fractional power function

g(x) = x^{2/3},   x ∈ [−2, 2].

The parameter settings are shown in Table 1. As seen in Table 1, we set the span to 0.75 and use a second-order polynomial for the weighted least squares fits in Loess. The estimated output activation function shown in Fig. 4 looks like a quadratic function. The least squares quadratic approximation is given by

y = f(t) = 0.2393 + 0.8418 t + 1.3001 t².

It seems natural to ask whether we may actually use this quadratic function as the output activation function and still obtain a good fit to the data. Let us call the FNN with this polynomial output activation function Poly-FNN, and the FNN with the (fixed) identity output activation function Lin-FNN. The true and predictive functions are drawn in Fig. 3. The three predictive functions are almost indistinguishable from the true regression function. It is interesting to notice that


Table 1. Parameter settings of the SI-FNN in Example 1.
No. of hidden nodes: 4
Span: 0.75
Degree: 2
Family: gaussian
Number of PSO epochs: 100
Population size: 40
Range of initial particle positions: c_{ij} ∈ [−2, 2], v_{ij} ∈ [0.01, 4], w_{jk} ∈ [−2, 2]
Range of particle velocities: [−1, 1]


Fig. 4. Estimated output activation function in Example 1.


Table 2. Simulation results in Example 1.
        SI-FNN     Poly-FNN   Lin-FNN
Cost    0.0049     0.0040     0.0013
MSE     2 × 10⁻⁴   2 × 10⁻⁴   1 × 10⁻⁴
MAD     0.0090     0.0079     0.0053

the problem-tailored polynomial output activation function given above is not monotonic, while common practice is to use monotonic functions such as sigmoidal or identity functions as output activation functions. Table 2 lists some statistics of the simulation. As observed, the performances of the three FNNs are about the same in both MSE and MAD.
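If one wishes to replace the Loess-estimated activation by the simple quadratic reported above, a least squares polynomial fit of the smoothed outputs against the index t can be obtained in R roughly as follows; the variable names t_k and y_hat are placeholders for the index values and the Loess-fitted values of the trained SI-FNN.

```r
# Fit a quadratic to the Loess-estimated activation (as in Example 1),
# where t_k holds the indices t_qk and y_hat the Loess-fitted outputs.
poly_fit <- lm(y_hat ~ poly(t_k, 2, raw = TRUE))
coef(poly_fit)        # intercept and coefficients of t and t^2
f_poly <- function(t) predict(poly_fit, newdata = data.frame(t_k = t))
```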


Example 2. Consider the 50 data points shown in Fig. 5, which were randomly generated from the sinc function

g(x) = 1 for x = 0, and g(x) = sin(x)/x for x ≠ 0,   x ∈ [−10, 10].

The parameter settings are also shown in Table 1, except that c_{ij} ∈ [−5, 5]. The estimated output activation function shown in Fig. 6 looks like a cubic function. The least squares cubic approximation is given by

y = f(t) = 3.1201 + 28.8908 t + 79.9609 t² + 68.1865 t³.


Fig. 5. True and predictive functions in Example 2.


Fig. 6. Estimated output activation function in Example 2.


Table 3. Simulation results in Example 2.
        SI-FNN     Poly-FNN   Lin-FNN
Cost    0.0088     0.0265     0.0223
MSE     4 × 10⁻⁴   0.0011     9 × 10⁻⁴
MAD     0.0142     0.0205     0.0154

Table 3 lists some statistics of the simulation. The true and predictive functions are drawn in Fig. 5. This time, SI-FNN provides the best fit, especially on both ends, and the performance of Lin-FNN is a little better than that of Poly-FNN.


Now consider some real-world datasets on regression problems from R packages and other popular data sources. The LIBSVM datasets are from National Taiwan University [28] and the UCI datasets are from the University of California at Irvine [29].


Table 4. Description of datasets in Example 3.
Dataset     No. of cases   No. of predictors   Source       Polynomial fit   Scaling
Dewpoint    72             2                   R-DAAG       quadratic        unit normal
Faithful    272            1                   R-datasets   linear           unit normal
Geophones   56             1                   R-DAAG       quadratic        unit normal
Bodyfat     252            14                  LIBSVM       linear           unit normal
Concrete    1030           8                   UCI          quadratic        unit normal
Housing     506            6                   LIBSVM       quadratic        unit normal
Mg          1385           6                   LIBSVM       cubic            none
MPG         392            4                   LIBSVM       cubic            unit normal

To better evaluate the trained fuzzy neural network, some form of cross validation can be employed. A popular method is k-fold cross-validation. In this method, the available data are randomly subdivided into k roughly equal-sized parts, and the modeling (or training) process is repeated k times, leaving one part out each time for validation purposes. The values of the performance indicators, e.g., MSE and/or MAD, are computed on the validation data in each of the k training processes. Finally, we take the average (mean or median) of those k performance values as the estimated values of the performance indicators. In this study, 5-fold cross validation will be adopted, and we use the median to average the 5 performance values.
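A minimal R sketch of this 5-fold procedure, with the MSE and MAD of (12) computed on each validation fold and the median taken across folds, is given below; train_si_fnn and predict_si_fnn stand for hypothetical training and prediction routines of the SI-FNN and are not part of any existing package.

```r
# 5-fold cross validation with median aggregation of MSE and MAD.
# X: input matrix; d: target vector; train_si_fnn/predict_si_fnn are assumed
# user-supplied routines implementing Algorithm 1 and the fitted predictor.
cv_si_fnn <- function(X, d, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(X)))   # random fold assignment
  mse <- mad_ <- numeric(k)
  for (i in 1:k) {
    tr <- folds != i
    model <- train_si_fnn(X[tr, , drop = FALSE], d[tr], m = 4)
    pred  <- predict_si_fnn(model, X[!tr, , drop = FALSE])
    e <- d[!tr] - pred
    mse[i]  <- mean(e^2)       # MSE of (12) on the validation fold
    mad_[i] <- mean(abs(e))    # MAD of (12) on the validation fold
  }
  c(MSE = median(mse), MAD = median(mad_))
}
```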


Example 3. Consider the real-world datasets described in Table 4. The target and predictor variables of each dataset are briefly described below.

Dewpoint [30]:
target: dewpt: monthly average dewpoint
predictors: maxtemp: monthly maximum temperature; mintemp: monthly minimum temperature

Faithful [23]:
target: waiting: waiting time to next eruption (in mins)
predictor: eruption: eruption time in mins

Geophones [30]:
target: thickness: time for signal to pass through substratum
predictor: distance: the location of geophone

Bodyfat:
target: density determined from underwater weighing
predictors: remaining 14 variables

Concrete:
target: concrete compressive strength
predictors: remaining 8 variables (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age)

Housing: (scaled to [−1, 1])
target: MEDV: median value of owner-occupied homes in $1000's
predictors: selected 6 variables (CRIM, RM, AGE, DIS, B, LSTAT)

Table 5. Simulation results in Example 3.
Dataset          SI-FNN                Poly-FNN              Lin-FNN
Dewpoint   MSE   0.2069 (0.0704)       0.1938 (0.0893)       0.2173 (0.0558)
           MAD   0.3463 (0.0463)       0.3746 (0.0547)       0.4190 (0.0458)
Faithful   MSE   34.6032 (9.0366)      35.4183 (4.3021)      37.9912 (8.1798)
           MAD   4.7676 (0.5993)       4.7012 (0.3806)       4.7779 (0.5924)
Geophones  MSE   6.5901 (89.0449)      7.7465 (6.1713)       8.6396 (13.8744)
           MAD   1.9193 (2.7152)       2.3746 (0.6312)       2.2665 (0.8245)
Bodyfat    MSE   3 × 10⁻⁶ (2 × 10⁻⁵)   6 × 10⁻⁶ (8 × 10⁻⁵)   8 × 10⁻⁶ (2 × 10⁻⁵)
           MAD   0.0010 (5 × 10⁻⁴)     0.0016 (7 × 10⁻⁴)     0.0012 (6 × 10⁻⁴)
Concrete   MSE   39.2251 (2.5014)      43.3497 (2.7737)      39.0163 (3.0427)
           MAD   4.6113 (0.2226)       5.0146 (0.2048)       4.7664 (0.3268)
Housing    MSE   15.5928 (4.2948)      17.6948 (7.2020)      15.6415 (9.3234)
           MAD   2.4766 (0.2407)       2.8187 (0.3685)       2.8644 (0.3664)
Mg         MSE   0.0150 (8 × 10⁻⁴)     0.0145 (0.0011)       0.0154 (0.0011)
           MAD   0.0944 (0.0023)       0.0930 (0.0044)       0.0969 (0.0040)
MPG        MSE   16.4614 (2.1189)      16.7639 (3.5273)      16.4720 (2.1418)
           MAD   2.9612 (0.1796)       2.9078 (0.2153)       2.9611 (0.1793)

Mg: (scaled to [−1, 1])
target: V1 (first variable)
predictors: remaining 6 variables

MPG: (6 "NA" omitted) (scaled to [−1, 1])
target: mpg: city-cycle fuel consumption in miles per gallon
predictors: remaining 4 continuous variables (displacement, horsepower, weight, acceleration)

The parameter settings are similar to those in Table 1 except the range of initial particle positions. Table 5 shows the 5-fold cross validated simulation results, where the values in the parentheses are the standard deviations of the performance indicators. Note that the standard deviation in MSE of the “Geophones” dataset is unreasonably big.


The reason is that there are only 56 cases in this dataset, and each fold in 5-fold cross validation has only about 10 cases. In that case, the estimate of the standard deviation (or variance) may become quite unstable because of the random sampling of the training and validation datasets in each fold. In most cases in Table 5, the SI-FNN has the best performance in terms of MSE and/or MAD. Poly-FNN and Lin-FNN have about the same performances. Next, consider a simple binary classification problem. The target values are labeled as +1 or −1.

Fig. 7. True and estimated decision boundaries in Example 4.


Example 4. Consider the 50 randomly generated data points shown in Fig. 7, with the true discriminant function given by the sinc-type function

g(x_1, x_2) = x_2 − 1 for x_1 = 0, and g(x_1, x_2) = x_2 − sin(x_1)/x_1 for x_1 ≠ 0,   x_1 ∈ [−10, 10], x_2 ∈ [−2, 2].

The parameter settings are shown in Table 1. This time, the estimated output activation function shown in Fig. 8 looks like a mirror image of a sigmoidal function. It may well be approximated by a simple generalized hyperbolic tangent function of the form

y = f(t) = \frac{1 − e^{−b(t−a)}}{1 + e^{−b(t−a)}}.

The least squares approximation gives a = −1.5513 and b = −9.4361. Let us call the FNN with this output activation function Htan-FNN, and the FNN with the (fixed) bipolar sigmoidal output activation function Bip-FNN. Table 6 lists some statistics of the simulation. The true and estimated decision boundaries are drawn in Fig. 7. The decision boundaries determined are all reasonably close to the true decision boundary.
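The generalized hyperbolic tangent above can be fitted to the Loess-estimated activation by nonlinear least squares in R, for example with nls(); t_k and y_hat again denote the index values and Loess-fitted outputs and are placeholder names, and the starting values shown are only illustrative.

```r
# Fit y = (1 - exp(-b*(t-a)))/(1 + exp(-b*(t-a))) to the Loess-estimated activation.
htan_fit <- nls(y_hat ~ (1 - exp(-b * (t_k - a))) / (1 + exp(-b * (t_k - a))),
                start = list(a = 0, b = -1))
coef(htan_fit)   # least squares estimates of a and b
```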

Now consider several real-world datasets on classification problems.

Example 5. Consider the real-world datasets described in Table 7. The target and predictor variables of each dataset are briefly described below:


Australian:
target: A1 (first variable)
predictors: 6 continuous variables and 2 categorical variables


Fig. 8. Estimated output activation function in Example 4.


Table 6. Simulation results in Example 4.
                            SI-FNN     Htan-FNN   Bip-FNN
Cost                        4 × 10⁻⁶   1 × 10⁻⁸   2 × 10⁻⁶
No. of misclassifications   0          0          0

Table 7. Description of datasets in Example 5.
Dataset           No. of cases   No. of predictors   Source   Scaling
Australian        690            8                   LIBSVM   unit normal
Breast Cancer     200            9                   UCI      unit normal
Diabetes          768            8                   UCI      unit normal
German Number     1000           3                   LIBSVM   unit normal
Haberman          306            3                   UCI      unit normal
Liver Disorders   345            6                   UCI      unit normal

Breast Cancer:
target: class label (2 for benign, 4 for malignant)
predictors: remaining 9 variables (except the sample code number)

Diabetes:
target: Class variable (1: tested positive for diabetes, 0: tested negative for diabetes)
predictors: remaining 8 variables

German Number:
target: V1 (first variable)
predictors: V2, V4, V10

Haberman:
target: two possible survival statuses: positive (survival of less than 5 years), negative (survival of more than 5 years)
predictors: age of the patient at time of operation; patient's year of operation; number of positive axillary nodes detected

Liver Disorders:
target: selector (field used to split the data into two sets)
predictors: remaining 6 variables

Table 8. Simulation results of misclassification rates in Example 5.
Dataset           SI-FNN            Htan-FNN          Bip-FNN
Australian        0.2464 (0.0321)   0.2609 (0.0364)   0.2464 (0.0379)
Breast Cancer     0.3077 (0.0531)   0.3000 (0.1212)   0.3250 (0.0627)
Diabetes          0.2662 (0.0387)   0.2532 (0.0545)   0.2468 (0.0401)
German Number     0.2800 (0.0288)   0.2900 (0.0266)   0.2850 (0.0288)
Haberman          0.2787 (0.0430)   0.2623 (0.0632)   0.3279 (0.0904)
Liver Disorders   0.3188 (0.0418)   0.3043 (0.0423)   0.3333 (0.0627)

The parameter settings are shown in Table 1, except for the Australian dataset, where we set c_{ij} ∈ [−0.2, 0.2], v_{ij} ∈ [0.1, 1], and w_{jk} ∈ [−1, 1]. Table 8 shows the 5-fold cross validated simulation results of the misclassification rates, where the values in the parentheses are the standard deviations of the misclassification rates. For the datasets in Table 7, it is seen that the SI-FNNs and Htan-FNNs have about the same performances, and both are usually better than the Bip-FNNs. Moreover, the standard deviations of the misclassification rates using SI-FNNs are usually smaller, which means that the SI-FNN results are more consistent.
In the following example, we consider a data set that contains outliers, i.e., atypical observations. These are observations well separated from the majority or bulk of the data, or that in some way deviate from the general pattern of the data [9,10,31–34]. Indeed, in a broad range of practical applications, the collected data inevitably contain one or more outliers. As is well known, a classical least squares fit of a regression model can be very adversely influenced by outliers, even by a single one, and often fails to provide a good fit to the bulk of the data [33]. Robust (or resistant) regression, which resists the adverse effects of outlying response (and/or explanatory) values, offers a half-way house between including outliers and omitting them entirely. Rather than omitting outliers, it dampens their influence on the fitted regression curve by down-weighting them [9,10,34]. One of the main approaches to robust regression involves M-estimation [10,31–33]. Specifically, Tukey's biweight (or bisquare) function will be used as the cost function in the next robust regression problem.

Example 6. Consider the 50 randomly generated data points shown in Fig. 9, with the true regression function given by the sinc function defined in Example 2. This dataset clearly contains some outliers. The parameter settings are given in Table 9. The estimated output activation function shown in Fig. 10 may be approximated by a cubic polynomial given by

y = f(t) = 5.3896 + 58.7363 t + 198.5614 t² + 211.9176 t³.


Fig. 9. True and predictive functions in Example 6.


Table 9. Parameter settings of the SI-FNN in Example 6.
No. of hidden nodes: 4
Span: 0.75
Degree: 2
Family: symmetric
ρ-function: Tukey, t = 5.5, scale = 1
Number of PSO epochs: 100
Population size: 40
Range of initial particle positions: c_{ij} ∈ [−5, 5], v_{ij} ∈ [0.01, 4], w_{jk} ∈ [−2, 2]
Range of particle velocities: [−1, 1]


Fig. 10. Estimated output activation function in Example 6.


Table 10. Simulation results in Example 6.
        SI-FNN   Poly-FNN   Lin-FNN
Cost    1.9143   2.1320     1.9575
MSE     0.0779   0.0867     0.0793
MAD     0.1597   0.1867     0.1913

Table 11. Parameter settings of the SI-FNN in Example 7.
No. of hidden nodes: 4
Span: 0.5
Degree: 2
Family: gaussian
Number of PSO epochs: 100
Population size: 100
Range of initial particle positions: c_{ij} ∈ [−2, 2], v_{ij} ∈ [0.01, 4], w_{jk} ∈ [−2, 2]
Range of particle velocities: [−1, 1]

Table 12. Simulation results in Example 7.
                         SI-FNN   Lin-FNN
Cost for training data   3.2776   4.6127
MSE for training data    0.0109   0.0154
MAD for training data    0.0719   0.0941
MSE for testing data     0.0227   0.0222
MAD for testing data     0.0957   0.1069

Table 10 lists some statistics of the simulation. Note that, for data containing severe outliers, MSE is usually not a good performance indicator of a learning machine. The true and predictive functions are drawn in Fig. 9. The predictive function determined by the SI-FNN is much closer to the true regression function than the other two.

In the final example, we consider a multi-input and multi-output case. We will consider the performance of the SI-FNN on training and testing data.


Example 7. Consider a data set with 4 inputs and 3 outputs. The 200 training data and 100 testing data were randomly generated from the true regression functions given by

g_1(x_1, x_2, x_3, x_4) = 1 for x_1 x_2 x_3 = 0, and g_1(x_1, x_2, x_3, x_4) = sin(x_1 x_2 x_3)/(x_1 x_2 x_3) for x_1 x_2 x_3 ≠ 0,
g_2(x_1, x_2, x_3, x_4) = g_1(x_1, x_2, x_3, x_4),
g_3(x_1, x_2, x_3, x_4) = 0.5 [x_1 sin(x_1) + cos(x_2) + cos(x_3 x_4)],
−2 ≤ x_i ≤ 2, i = 1, . . . , 4.

We deliberately set g_2 = g_1 to see whether the estimated output activation functions of the first and second output nodes are similar. The parameter settings are given in Table 11. Table 12 summarizes the simulation results for the training and testing data. The performances of both FNNs are about the same. The estimated output activation functions for the three output nodes are shown in Figs. 11–13, respectively. In general, it cannot be expected that the estimated output activation functions will have simple shapes that can be approximated with simple parametric functions. Moreover, in multi-output cases the estimated activation functions for different output nodes may be totally different in shape, as is


Fig. 11. Estimated output activation function of node 1 in Example 7.


Fig. 12. Estimated output activation function of node 2 in Example 7.


shown in Figs. 11–13. The estimated output activation functions of the first and second output nodes look like mirror images of each other. It is preferable to use the FNN with fixed output activation functions if it has about the same level of performance as the SI-FNN, because of its computational simplicity.

From the simulation results reported in the preceding examples (regression and classification problems), it is natural to ask why SI-FNNs usually produce better fits than the usual FNNs with prescribed output activation functions. The reason is that we actually use more parameters in the SI-FNN to estimate the target values. For instance, in estimating the output activation function using Loess in Example 1, the equivalent number of parameters used is about 3.8225, which is not an integer. It can be calibrated with polynomial fits: a scatterplot smoother with υ degrees of freedom summarizes the data to about the same extent as a (υ − 1)-degree polynomial [17]. For more information about the equivalent number of parameters in nonparametric scatterplot smoothers, see [17]. However, if the problem-tailored output activation functions are used, then the number of parameters is the same as that of the usual FNN with prescribed output activation functions.
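In R, the equivalent number of parameters of a fitted Loess smoother can be read directly from the fitted object, which is one way to obtain a value such as the 3.8225 quoted above (the exact number depends, of course, on the data and on the span/degree settings); the data below are illustrative only.

```r
# The equivalent number of parameters (degrees of freedom) of a loess fit.
set.seed(1)
t <- runif(50, -1, 1)
y <- t^2 + rnorm(50, sd = 0.05)
fit <- loess(y ~ t, span = 0.75, degree = 2)
fit$enp            # equivalent number of parameters of the smoother
summary(fit)       # also reports it as "Equivalent Number of Parameters"
```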



Fig. 13. Estimated output activation function of node 3 in Example 7.


7. Conclusion

In this paper, novel single index fuzzy neural network models were proposed for general machine learning problems, in which the output nodes of the networks are replaced by (nonparametric) single index models and their output activation functions are re-estimated adaptively during the training process by Loess, a nonparametric scatterplot smoother. It was pointed out that in many cases the output activation functions may well be approximated by simple parametric functions, e.g., polynomial or generalized hyperbolic tangent functions. Such problem-tailored output activation functions can then be used for neural network training and prediction if necessary. Particle swarm optimization was adopted to search for the optimal connection weights of the neural networks. Some numerical examples, including function learning (or regression estimation) problems and binary classification problems, were used to illustrate the main features of the proposed models. Some data sets contained obvious outliers, in which case robust regression was employed in the single index neural network training. Simulation results showed that the proposed models usually provide better fits than the usual models for these examples. The reason why the SI-FNNs usually produce better fits than the usual FNNs with prescribed output activation functions was explained in terms of the equivalent number of parameters used in estimating the output activation functions via the Loess procedure. The main advantages of the single index fuzzy neural network models are that they are well suited to situations in which information about the probability distribution of the response is lacking, and that the output activation functions of the neural networks need not be specified in advance. This research may provide an alternative to, but not a substitute for, the usual neural network models with prescribed output activation functions.

Acknowledgements

The authors would like to thank the anonymous referees for their valuable comments, which helped the authors improve the clarity of this paper. The research reported here was supported by the Ministry of Science and Technology, Taiwan, under grant no. MOST-106-2221-E-214-002.

References

[1] C.F. Juang, Y.Y. Lin, R.B. Huang, Dynamic system modeling using a recurrent interval-valued fuzzy neural network and its hardware implementation, Fuzzy Sets Syst. 179 (1) (2011) 83–99.
[2] W.H. Allen, A. Rubaai, R. Chawla, Fuzzy neural network-based health monitoring for HVAC system variable-air-volume unit, IEEE Trans. Ind. Appl. 52 (3) (2016) 2513–2524.
[3] G.D. Wu, Z.W. Zhu, An enhanced discriminability recurrent fuzzy neural network for temporal classification problems, Fuzzy Sets Syst. 237 (2014) 47–62.
[4] J. Tang, F. Liu, Y. Zou, W. Zhang, Y. Wang, An improved fuzzy neural network for traffic speed prediction considering periodic characteristic, IEEE Trans. Intell. Transp. Syst. 18 (9) (2017) 2340–2350.
[5] H.G. Han, L.M. Ge, J.F. Qiao, An adaptive second order fuzzy neural network for nonlinear system modeling, Neurocomputing 214 (2016) 837–847.
[6] L.X. Wang, J.M. Mendel, Fuzzy basis functions, universal approximation, and orthogonal least squares learning, IEEE Trans. Neural Netw. 3 (5) (1992) 807–814.
[7] B. Kosko, Fuzzy systems as universal approximators, IEEE Trans. Comput. 43 (11) (1994) 1329–1333.
[8] L.X. Wang, A Course in Fuzzy Systems and Control, Prentice-Hall, New Jersey, 1997.
[9] J.G. Hsieh, Y.L. Lin, J.H. Jeng, Preliminary study on Wilcoxon learning machines, IEEE Trans. Neural Netw. 19 (2) (2008) 201–211.
[10] H.K. Wu, J.G. Hsieh, Y.L. Lin, J.H. Jeng, On maximum likelihood fuzzy neural networks, Fuzzy Sets Syst. 161 (21) (2010) 2795–2807.
[11] H.K. Wu, Y.L. Lin, J.G. Hsieh, J.H. Jeng, Study on semiparametric Wilcoxon fuzzy neural networks, Soft Comput. 16 (1) (2012) 11–21.
[12] V. Kecman, Learning and Soft Computing, MIT Press, Cambridge, Massachusetts, 2001.
[13] A. Ngaopitakkul, C. Jettanasen, Selection of proper activation functions in back-propagation neural networks algorithm for identifying the phase with fault appearance in transformer windings, Int. J. Innov. Comput. Inf. Control 8 (6) (2012) 4299–4318.
[14] A. Ismail, D.S. Jeng, L.L. Zhang, J.S. Zhang, Predictions of bridge scour: application of a feed-forward neural network with an adaptive activation function, Eng. Appl. Artif. Intell. 26 (2013) 1540–1549.
[15] I. Castelli, E. Trentin, Combination of supervised and unsupervised learning for training the activation functions of neural networks, Pattern Recognit. Lett. 37 (2014) 178–191.
[16] W. Härdle, M. Müller, S. Sperlich, A. Werwatz, Nonparametric and Semiparametric Models, Springer, Berlin, 2004.
[17] D. Ruppert, M.P. Wand, R.J. Carroll, Semiparametric Regression, Cambridge University Press, New York, 2003.
[18] F. Leitenstorfer, G. Tutz, Estimation of single-index models based on boosting techniques, Stat. Model. 11 (3) (2011) 203–217.
[19] J. Liu, R. Zhang, W. Zhao, Y. Lv, A robust and efficient estimation method for single index models, J. Multivar. Anal. 122 (2013) 226–238.
[20] Q. Zou, Z. Zhu, M-estimators for single-index model using B-spline, Metrika 77 (2) (2014) 225–246.
[21] D.C. Montgomery, E.A. Peck, G.G. Vining, Introduction to Linear Regression Analysis, 4th ed., Wiley, New Jersey, 2006.
[22] Local regression, Wikipedia, http://en.wikipedia.org/wiki/Local_regression.
[23] R Development Core Team, R: a language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, 2013, http://www.R-project.org/.
[24] R.C. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in: Proceedings of the Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan, 1995, pp. 39–43.
[25] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 4, Perth, Australia, 1995, pp. 1942–1948.
[26] J. Kennedy, R.C. Eberhart, Swarm Intelligence, Morgan Kaufmann, San Francisco, California, 2001.
[27] I.C. Trelea, The particle swarm optimization algorithm: convergence analysis and parameter selection, Inf. Process. Lett. 85 (2003) 317–325.
[28] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2016.
[29] C. Blake, C. Merz, UCI machine learning repository, http://www.ics.uci.edu/~mlearn/MLRepository.htm, 1998.
[30] J.H. Maindonald, W.J. Braun, DAAG: Data Analysis And Graphics data and functions, R package version 1.18, http://CRAN.R-project.org/package=DAAG, 2013.
[31] F.R. Hampel, E.M. Ronchetti, P.J. Rousseeuw, W.A. Stahel, Robust Statistics: The Approach Based on Influence Functions, Wiley, New York, 2005.
[32] P.J. Huber, E.M. Ronchetti, Robust Statistics, Wiley, New Jersey, 2009.
[33] R.A. Maronna, R.D. Martin, V.J. Yohai, Robust Statistics: Theory and Methods, Wiley, United Kingdom, 2006.
[34] Y.L. Lin, J.G. Hsieh, J.H. Jeng, W.C. Cheng, On least trimmed squares neural networks, Neurocomputing 161 (2015) 107–112.