Interpolating support information granules


Neurocomputing 71 (2008) 2433–2445


Bruno Apolloni a,*, Simone Bassis a, Dario Malchiodi a, Witold Pedrycz b

a Department of Computer Science, University of Milan, Via Comelico 39/41, 20135 Milano, Italy
b Department of Electrical and Computer Engineering, University of Alberta, ECERF, 9107 - 116 Street, Edmonton, Alberta, Canada T6G 2V4

Available online 7 May 2008

Abstract

We introduce a regression method that fully exploits both global and local information about a set of points in search of a suitable function explaining their mutual relationships. The points are assumed to form a repository of information granules. At a global level, statistical methods discriminate between regular points and outliers. Then the local component of the information embedded in the former is used to draw an optimal regression curve. We address the challenge of using a variety of standard machine learning tools, such as support vector machines (SVMs) or slight variants of them, under the unifying umbrella of Granular Computing to obtain a nonlinear regression method with definitely new features. The performance of the proposed approach is illustrated with the aid of three well-known benchmarks and ad hoc featured datasets.
© 2008 Elsevier B.V. All rights reserved.

Keywords: Algorithmic Inference; Granular Computing; Linear regression confidence region; Modified SVM; Kernel methods

1. Introduction

In the machine learning domain, when dealing with uncertainty we are quite often concerned with the probabilistic or the fuzzy set framework. While combining these two frameworks could be beneficial, we are often faced with serious problems: on the one hand, the theoretical foundations of the two methodologies generally look quite different; on the other, we are used to exploiting statistical methods to estimate parameters of fuzzy models and truth-functionality tools to separately manage components of probabilistic structures (see [11] for an extended discussion of misunderstandings, bridges and gaps between the frameworks). Quite often fuzzy sets serve as a second option when we do not have enough knowledge about the data to be processed, so we try to compensate for this lack by questioning the operational experience. On the contrary, within the Granular Computing framework [21] we may get a more robust integration between the two approaches, as the data we observe are considered repositories of information that we may consider either at a global level as a sample of a probabilistic population (hence within a statistical inference framework) or at a local level as information granules (by being treated in terms of fuzzy sets). Roughly speaking, consider the total probability formula $P(A) = \sum_i P(A \mid e_i) P(e_i)$ [20]. It says that we may compute the probability $P(A)$ of an event $A$

* Corresponding author. Tel.: +39 02 50316284; fax: +39 02 50316228.
E-mail addresses: [email protected] (B. Apolloni), [email protected] (S. Bassis), [email protected] (D. Malchiodi), [email protected] (W. Pedrycz).
doi:10.1016/j.neucom.2007.11.038

from the probabilities $P(A \mid e_i)$ of the same event under specific conditions, provided that the conditioning events $e_i$'s constitute a partition of the sample space and we know the probabilities $P(e_i)$'s of these events as well. Within the fuzzy sets framework we are in the situation where we know a set of conditional probabilities playing the role of normalized membership functions, but we have no guarantee about the partitioning of the sample space through the conditioning events and in any case do not know their probabilities. In this paper we consider this evident lack not as diminishing our knowledge; rather, we enjoy the availability of additional local features supplied by the conditional probabilities. Thus, we work with a mixture of local and global information brought by sample data, with the aim of synthesizing with suitable nonlinear regression curves the structure underlying them. We pay, however, great attention to distinguishing the nature of this information and use inferential tools that are specific to each of them. In the literature we find a huge repository of studies on statistical regression theory (refer to [27,9] as representative examples). Also fuzzy regression has gained some visibility, where the drifts of the model w.r.t. the observed data are associated with the fuzziness with which the whole data generation system (the coefficients of the regression line included) can be defined [31,26]. The developed fuzzy methods, however, suffer from the mentioned hybridization of the frameworks. This leads, for instance, to the use of statistics on the data, hence approximations of probabilistic models' parameters such as sample mean and variance, to infer membership function parameters. We guess this to be a bias induced by the fact that both approaches start, in any case, from the general assumption of the existence of a true model [12,36], concealed from humans apart from some air-holes releasing sample observations alternatively framed into


either an exact though indeterminate framework or a context not susceptible of plain numeric computations. On the contrary, our starting point is the sample data we try to organize into operationally suitable descriptions. Within this scope, facing density diagrams, denoting either probability densities or membership functions of fuzzy sets, we sharply divide horizontal methods, useful in the former case, from vertical methods, suitable in the latter. Namely, with horizontal methods we use the abscissas of the diagrams, in the basic idea that observed points are fair representatives of a random variable population that we consequently weight with their frequency. Vertical methods, on the contrary, concern the ordinates of the diagrams, which we manage with operations such as minimum and maximum in substitution of the convex combination $\sum_i P(A \mid e_i) P(e_i)$ that would be requested by the total probability theorem (and that we are unable to implement). Also the goals achieved with the two families of tools are different. We meet the current thread of refining data processing by taking into account the quality of the single data items [35,37,17,32,25]. This generally passes through a relevance score that is either attributed to data items a priori with the problem specification within a fuzzy sets framework [35,37,34,16], or is estimated a posteriori on the basis of statistical techniques [14] within a probabilistic framework [23], or also managed with intermediate approaches such as possibilistic/belief ones [13,28]. Then the scores are used to weight the contributions of the single data items to the solution of a wisely reformulated regression problem [14,15,30]. As for us, with the global vision supported by the probabilistic framework we separate regular points from outliers on the basis of a linear model, as an early robust relation between data (see Fig. 1). This is a usual praxis in statistical data analysis [24]. Then, with simple topological models we locally gather points into fuzzy sets described by bell-shaped membership functions. They actually represent conditional Gaussian densities that we read in terms of relevance score profiles of the sample space elements [7] (see Fig. 2(a)). Finally, we build a relevance landscape through the norm of the above bells, and set up an optimization procedure to identify the curve maximizing the line integral of this norm (see Fig. 2(b)). The focus of the method lies exactly in a sound balance between the two directions of this global/local information dichotomy. In principle, our intent is not to introduce new specific methods but rather to revisit algorithms that are readily available in the frameworks of Algorithmic Inference [5] and Granular Computing [21], such as twisting arguments [2] in linear regression and Fuzzy C-Means (FCM) clustering [6]. On the other hand, the optimization of the regression curve on the membership landscape represents an innovative technique. It constitutes a variant of a support vector machine (SVM) learning problem [8], including the adoption of kernels in the case of nonlinear curves. In short, with an SVM we try to draw a line passing along the valleys of the relevance landscape, while with the regression curve we follow its crests. This is an alternative way of formalizing regression problems with SVMs, where the general thread is to minimize the diameter of a tube along a linear axis containing almost all sample points [10]. In terms of the main results, we construct a quick procedure for identifying a regression curve that complies both with the statistical indications coming from the examples as a whole, and with local peculiarities captured by the membership functions. The paper is organized as follows: Section 2 describes the concept of the regression model we propose, while Section 3 covers its actual implementation. In Section 4 we report numerical examples to demonstrate the efficiency of the method, and in Section 5 we offer some conclusions and elaborate on future developments of the proposed approach.

2. The design of the regression model

Fig. 1. A synopsis of the proposed method.

Fig. 2. Fitting the granules' information with a line. (a) Relevance score contour lines; x: independent variable, y: dependent variable. (b) Crossing the landscape with the regression line; h: relevance score.

Fig. 3. (a) A sample of pairs of variables' specifications extracted from the SMSA dataset. %NW: percentage of non-white persons; M: age adjusted mortality. (b) Distinction between regular pairs (gray points) and outliers (black points). Thin lines: divide lines between the two categories of points.

Broadly, we look for a principled linear regression model taking into account the granularity of the data. Two main steps

are sought: (i) removal of outliers from the experimental data and (ii) quantification of the relevance of the remaining data versus the regression line. Since relevance reverberates into the information granularity associated with these points, a third step of the procedure is devoted to the determination of a line crossing the landscape of the granules' membership functions with an optimal section. As a working example we will use the SMSA dataset [33] (see Fig. 3(a)), listing age adjusted mortality specifications (M) as a function of a demographic index (%NW, the non-white percentage). Throughout the paper we will use different typesettings of a same character in order to highlight aspects of the variable it refers to. Namely, capital letters (such as $U$, $X$) will denote random variables and small letters ($u$, $x$) their corresponding realizations; the sets the realizations belong to will be denoted by capital gothic letters ($\mathfrak{U}, \mathfrak{X}$), while strings of these realizations are set in boldface ($\boldsymbol{u}, \boldsymbol{x}$). The general target of this section is a linear model that we usually denote through the implicit equation $w \cdot x + q = 0$ in vectorial notation, $\cdot$ being a suitable inner product. However, when we specialize it to two-dimensional data we prefer using the explicit equation $y = a + b(x - c_x)$ in scalar notation in order to avoid vector indices, having split $x$ into the two components $(x, y)$, with $c_x$ being a central parameter of the x-axis. For some extensions of two-dimensional results, however, the n-dimensional model will read with the hybrid notation $y = a + b_1(x_1 - c_{x_1}) + \cdots + b_{n-1}(x_{n-1} - c_{x_{n-1}})$.

2.1. Identifying information granules

2.1.1. The probability versant

Abandoning the paradigm where there exists a truth about the data that we want to discover with some approximation, our modest but feasible goal is to give the data a suitable description through a model that is compatible with them. The attitude is that of a person who wants to forecast the winner of the next horse race: he looks for a result that is compatible with the racing pedigree of the various competitors. The way of appreciating compatibility in the Algorithmic Inference approach is through a probability distribution over the candidates; the tool for assessing the probabilities is a model family of the data that we call a sampling mechanism, together with a set of logical implications centered around a set of master equations. These equations are aimed at twisting probability measures of the data given their model into compatibility measures of the model given the data. Before continuing we clearly state that the latter is definitely not derived from the Bayesian approach to probability, as will be clear in a moment [3]. Namely, for the questioned data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ we assume a

sampling mechanism

$$y_i = a + b(x_i - \bar{x}) + \epsilon_i \tag{1}$$

to explain the relation between coordinates of both current and future observations from the same source of data, where $\epsilon_i$ denotes a certain random noise (for instance Gaussian) $E_i$ and $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$. Apart from some free parameters, such as the variance, we are not interested in exploring further the properties of the $E_i$s, which instead play the role of unquestionable random seeds. On the contrary, we are interested in the rest of the sampling model, i.e., in the linear relation $r$ between coordinates

$$y = a + b(x - \bar{x}), \tag{2}$$

playing the role of the explaining function within the sampling mechanism in the Algorithmic Inference notation [5] and of the regression curve according to regression theory [27]. Typical master equations w.r.t. this mechanism are:

$$s_A = ma + \sum_{i=1}^{m} \epsilon_i, \tag{3}$$

$$s_B = \sum_{i=1}^{m} b(x_i - \bar{x})^2 + \sum_{i=1}^{m} \epsilon_i (x_i - \bar{x}), \tag{4}$$

where $s_A$ and $s_B$ are the observed statistics $\sum_{i=1}^{m} y_i$ and $\sum_{i=1}^{m} y_i (x_i - \bar{x})$, respectively. The rationale with which we use them is the following. We have observed $s_A$ and $s_B$ and know neither the parameters $a$ and $b$, nor the seeds $\{\epsilon_1, \ldots, \epsilon_m\}$. However, to any specification $\{\breve{\epsilon}_1, \ldots, \breve{\epsilon}_m\}$ of them would correspond, for the observed $s_A, s_B$, the parameters

$$\breve{a} = \frac{s_A - \sum_{i=1}^{m} \breve{\epsilon}_i}{m}, \qquad \breve{b} = \frac{s_B - \sum_{i=1}^{m} \breve{\epsilon}_i\,(x_i - \bar{x})}{S_{xx}}, \tag{5}$$

where $S_{xx} = \sum_{i=1}^{m} (x_i - \bar{x})^2$, with a consequent transfer of probability masses (densities) coming from $\{\breve{\epsilon}_1, \ldots, \breve{\epsilon}_m\}$ to $\breve{a}, \breve{b}$ as specifications of the random variables $A$ and $B$. Since the two parameters are jointly involved in the model (1) this transfer is not trivial, as explained in [5,2,4]. Luckily, under the not heavily restrictive hypothesis of the $E_i$s being independently and identically distributed symmetrically around 0, we obtain a straightforward description of a confidence region for the line $y = A + B(x - \bar{x})$ at the basis of model (1). Namely, since they are random variables, $A$ and $B$ define a random function $R$, i.e., a family of lines each affected by a probability mass (density). We may define a region exactly containing elements of this family up to a predefined probability $1 - \gamma$.

Definition 1. Given the sets $\mathfrak{X}, \mathfrak{Y}$ and a random function $C: \mathfrak{X} \mapsto \mathfrak{Y}$, a confidence region with confidence level $\gamma$ is a domain $\mathcal{C} \subseteq \mathfrak{X} \times \mathfrak{Y}$ containing $C$ with a probability $1 - \gamma$, i.e.,

$$P(C \subseteq \mathcal{C}) = 1 - \gamma, \tag{6}$$
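For concreteness, the following sketch (our illustration, not the authors' code; the function names and the choice of Gaussian seeds with a known standard deviation sigma are assumptions) builds a population of $(\breve{a}, \breve{b})$ replicas through (5) and extracts crude per-parameter bounds in the spirit of Definition 1.

```python
import numpy as np

def parameter_replicas(x, y, n_rep=10_000, sigma=1.0, seed=None):
    """Twist the master equations (3)-(4): resample the seeds and solve
    them for the parameters through (5), obtaining a population of
    compatible (a, b) specifications."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    m, x_bar = len(x), np.mean(x)
    s_A = y.sum()                             # observed statistic in (3)
    s_B = ((x - x_bar) * y).sum()             # observed statistic in (4)
    S_xx = ((x - x_bar) ** 2).sum()
    eps = rng.normal(0.0, sigma, size=(n_rep, m))          # seed specifications
    a = (s_A - eps.sum(axis=1)) / m                         # eq. (5)
    b = (s_B - (eps * (x - x_bar)).sum(axis=1)) / S_xx      # eq. (5)
    return a, b

def crude_region_bounds(a, b, gamma=0.10):
    """Symmetric per-parameter bounds leaving out a fraction gamma of the
    replicated lines; a rough stand-in for the region of Definition 1."""
    q = [gamma / 2, 1 - gamma / 2]
    return np.quantile(a, q), np.quantile(b, q)
```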

Fig. 4. Contour of the confidence region for a regression line computed through (7) and (8) where (9) is coupled with: (a) (10) and (b) (11). Black piecewise lines: upper and lower contour of the confidence region; gray lines: lines summing up the contours; gray points identify the contours.

Fig. 5. 0.90 confidence regions for a regression plane assuming a Gaussian noise. Landscapes: confidence region bounds; plane: ML estimate. Points: sample extracted from the SMSA dataset. %NW and M as in Fig. 3; SO2: concentration of sulfur dioxide. (a) The surfaces are identified by their contour edges; (b) magnification of the central part of (a) using filled surfaces.

where we denote by $c \subseteq \mathcal{C}$ the inclusion of the set $\{(x, c(x)),\ \forall x \in \mathfrak{X}\}$ in $\mathcal{C}$.

We refer the reader to the references indicated above for a complete treatment of the subject. Here we come to identifying the confidence region as follows. In the case where the $E_i$s are Gaussian, for scalar $X$ the region is spanned by the sheaf of lines $y = a + b(x - \bar{x})$, with

$$\frac{s_A}{m} - z_{1-\gamma/2}\frac{\sigma}{\sqrt{m}} \le a \le \frac{s_A}{m} + z_{1-\gamma/2}\frac{\sigma}{\sqrt{m}}, \tag{7}$$

$$\frac{s_B}{S_{xx}} - z_{1-\gamma/8}\,\sigma\sqrt{\frac{1}{m}+\frac{1}{S_{xx}}} + \left(a - \frac{s_A}{m}\right) \le b \le \frac{s_B}{S_{xx}} + z_{1-\gamma/8}\,\sigma\sqrt{\frac{1}{m}+\frac{1}{S_{xx}}} - \left(a - \frac{s_A}{m}\right), \tag{8}$$

where $\sigma$ is the standard deviation and $z_\eta$ is the $\eta$ quantile of the normal distribution. For different distributions of the $E_i$s, we find analogous forms when the above general hypotheses are maintained.¹ Actually, the general form of the sheaf is captured by the inequality

$$|\Delta a + \Delta b| \le t_{\mathrm{tot}} \tag{9}$$

and, alternatively, either

$$|\Delta a| \le t_a, \tag{10}$$

like in (7), or

$$|\Delta b| \le t_b, \tag{11}$$

giving rise to shapes similar to those shown in Fig. 4(a) and (b) for suitable thresholds $(t_{\mathrm{tot}}, t_a)$ and $(t_{\mathrm{tot}}, t_b)$, respectively. $\Delta a, \Delta b$ denote the shifts around the central values of the $A$ and $B$ distributions, where the central values are represented by $s_A/m$ and $s_B/S_{xx}$, in the order. The choice between those or some intermediate shape is tied to the development of a very concentrated confidence region and depends on the ratio between the $E$ and $X$ variances. In the general case of multivariate $X$, the model reads as follows according to the mentioned hybrid notation

$$y_i = a + b_1(x_{1,i} - \bar{x}_1) + \cdots + b_{n-1}(x_{(n-1),i} - \bar{x}_{n-1}) + \epsilon_i, \tag{12}$$

where the second index in $x_{j,i}$ refers to the sample listing. In this case the system (9)–(10) (respectively, (9)–(11)) is converted into a relation

$$\left|\Delta a + \sum_{i=1}^{n} \Delta b_i\right| \le t_{\mathrm{tot}} \tag{13}$$

together with single constraints on $\Delta a$ and the $\Delta b_i$s, thus obtaining regions like those illustrated in Fig. 5.

¹ For instance, for the Laplace distribution or a zero-symmetric Pareto distribution [5].


2.1.2. Outliers' pruning

A dual concept of the confidence interval is the alpha-cut of a fuzzy set, collecting all points having a membership grade in the set greater than or equal to $\alpha$. We have

Definition 2. Given a fuzzy set $A$ on a domain $\mathfrak{X}$, i.e., $A = \{(x, \mu_A(x)),\ x \in \mathfrak{X}\}$, where $\mu_A: \mathfrak{X} \to [0, 1]$ denotes the membership grade of the points of $\mathfrak{X}$ in $A$, the alpha-cut $A_\alpha$ of $A$ is defined as

$$A_\alpha = \{x \in \mathfrak{X} : \mu_A(x) \ge \alpha\}. \tag{14}$$

In this respect we use the borders of the confidence region, for a suitable confidence level, as a divide between regular points and outliers, hence as the contour of an alpha-cut of the fuzzy set gathering the regular points [22]. The choice of the confidence level is not algorithmic. While in the statistical framework it may be connected to expected percentages of estimation failure, here we just aim at narrowing the focus of the data analysis. On one side, we may expect that points substantiated by a sampling mechanism whose hyperplane is close to but inside these borders do trespass them because of the contribution of $E$. But, since we want to capture the structure of the model, we do not take these shifts into consideration and formally assume points outside the confidence region to be outliers. On the other side, the value of the confidence level is fixed case by case as a function of the problem at hand. For instance, in Fig. 3, starting from the sample in part (a) we draw a 0.90 confidence region for the regression line in part (b) (under the hypothesis that $E$ is distributed according to a Gaussian random variable) and keep only the points inside this region for the next processing steps.
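As a rough operational reading of this divide, the sketch below (ours; it approximates the region by the envelope of the extreme lines obtained from per-parameter bounds like those sketched after Definition 1, which is coarser than the piecewise contours of Fig. 4) marks as regular the points falling between the lower and upper envelopes of the sheaf.

```python
import numpy as np

def regular_point_mask(x, y, a_bounds, b_bounds, x_bar):
    """Boolean mask of the points lying inside the envelope of the sheaf of
    lines y = a + b (x - x_bar), with a and b ranging in the given intervals;
    points outside the envelope are treated as outliers."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    corner_lines = np.array([a + b * (x - x_bar)
                             for a in a_bounds for b in b_bounds])
    lower, upper = corner_lines.min(axis=0), corner_lines.max(axis=0)
    return (y >= lower) & (y <= upper)
```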


2.2. Assigning relevance to the granules

With the above pruning we consider the global features of the data $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ under analysis exhausted. As a result we come to a subset $\{(x_{i_1}, y_{i_1}), \ldots, (x_{i_{m'}}, y_{i_{m'}})\}$ with $m' \le m$ on which we cast the granularity of the data information in terms of an expanded version of the representative points. In other words, the fact that we have observed some points—that we denote example points henceforth—in place of the surrounding ones does not mean that the latter do not exist. Rather, we may expect that the structural properties we observe for the single points extend to their surroundings as well, though with some reasonable smoothing. This leads us to associate to each example point, say with coordinates $(x_i, y_i)$, a fuzzy set centered on the point and described by a bell-shaped membership function $\mu_i$ defined as follows:

$$\mu_i(x, y) = h_i\, e^{-\pi h_i \left((x - x_i)^2 + (y - y_i)^2\right)}. \tag{15}$$

Each of these functions looks like a local Gaussian density function, with the same variance $\sigma^2 = (2\pi h_i)^{-1}$ on each axis and covariance $\rho = 0$. Determining the set $\{h_i,\ i = 1, \ldots, m\}$ is the operational way of making the model definite. Given the impossibility of relating the distributions to each other within a global probabilistic model, this corresponds to embedding in the $i$-th granule some information about its relevance. Indeed, the higher the value of $h_i$, the smaller the variance of the corresponding density. As mentioned in Section 1, having a quantitative appreciation of the quality of the data under processing and exploiting it to refine learning algorithms is a recent thread in the scientific community that has been debated and examined from many perspectives. Since we lack a probability connection between the bells, we cannot use statistical methods. Rather, we decide to determine $h_i$ by revealing a local topology of the examples through some clustering mechanism, say FCM. Using the vectorial notation, having fixed the number of clusters to be equal to $c$, once their centroids $\{v_1, \ldots, v_c\}$ have been identified we compute the relevance score $h_i$ as the maximal value of the membership grades of $x_i$ to the various clusters, that is

$$h_i = \max_{1 \le k \le c}\left(\sum_{j=1}^{c}\left(\frac{\|x_i - v_k\|}{\|x_i - v_j\|}\right)^{2/(r-1)}\right)^{-1}, \tag{16}$$

where $r \in \mathbb{N}$ ($r > 1$) is a fuzzification factor whose value has been selected when running the clustering procedure. The typical value of this factor is 2.0 or 3.0, the latter having the virtue of enhancing differences in membership grades. For $c = 3$, Fig. 6(a) shows the output of such a procedure applied to our leading example together with the emerging clusters (represented by alpha-cuts in order to visually enhance their fuzziness), while Fig. 6(b) shows the bell membership functions corresponding to the points gathered by the central alpha-cut.² These images show a two-layer bell texture, the lower spreading the information of the single questioned points, the higher locally linking data into a topological structure. We expect that the global trend of the examples goes linear or quasi-linear; with the score distribution we give different weights to the example contributions in the identification of the regression curve. Namely, we look for a curve regressing phenomena (say, hyperpoints) rather than single examples; thus, the optimal regression curve should get closest to the points that are mostly representative of them. The tools for evidencing these phenomena, such as FCM, and their parametrization, such as the mentioned $r$ or the number $c$ of clusters, require either additional information or a sensitivity analysis; we will discuss them in Section 4.
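The computation of the relevance scores through (16) can be sketched as follows (our code; the centroids are assumed to come from a previous Fuzzy C-Means run with the same fuzzification factor r, and the guard against zero distances is an implementation detail of ours).

```python
import numpy as np

def relevance_scores(points, centroids, r=2.0):
    """Relevance h_i of each point as its maximal FCM membership grade,
    following eq. (16)."""
    pts = np.asarray(points, float)          # shape (m, d)
    ctr = np.asarray(centroids, float)       # shape (c, d)
    # pairwise distances ||x_i - v_k||, shape (m, c)
    d = np.linalg.norm(pts[:, None, :] - ctr[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                 # guard: point sitting on a centroid
    # membership of point i to cluster k: 1 / sum_j (d_ik / d_ij)^(2/(r-1))
    ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (r - 1.0))
    u = 1.0 / ratios.sum(axis=2)             # shape (m, c)
    return u.max(axis=1)                     # h_i = max_k u_ik
```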

2.3. Finding the optimal regression line

Moving from an interval to a point estimate involves the use of some optimality criterion. A typical solution in the statistical framework is represented by the maximum likelihood estimator (MLE). Hence we look for the model parametrization that maximizes the joint probability of observing exactly the sample at hand, leading to the least squares solution under the default hypothesis of a Gaussian error distribution. Thanks to the above membership functions, we have spread over the entire sample space a set of relevance scores linked to the observed points, such that the sum of these scores in each point determines a relevance landscape. In this scenario the companion criterion of maximum likelihood deals with the maximization of the relevance score integral along the regression line $r$. This resolves into the maximization of the cross sections of the single bells around the questioned points, under the constraint that the sections represent integrals of bell functions along a same base line. Let us denote by $I_i(a, b)$ the integral over the $i$-th bell, where $a$ and $b$ are the common coefficients of $r$. In the plane having as its axes $r$ and any line orthogonal to it, say having coordinates $x$ and $c$, given the radial symmetry of the bell membership function, we may express the latter again as a bidimensional Gaussian density function

$$\mu_i(x, c) = h_i\, e^{-\pi h_i \left((x - x_i)^2 + (c - c_i)^2\right)}. \tag{17}$$

Summing up, the integral $I_i(a, b)$ corresponding to the $i$-th granule is

$$I_i(a, b) = \int_{-\infty}^{\infty} \mu_i(x, c_i)\, \mathrm{d}x = \mu_i(c_i)\int_{-\infty}^{\infty} \mu_i(x \mid c_i)\, \mathrm{d}x = \mu_i(c_i), \tag{18}$$

² We focus on this subset of points to facilitate visualization.


Fig. 6. (a) Output of the Fuzzy C-Means procedure with 3 clusters applied to the points shown in Fig. 3. Gray disks: cluster alpha-cuts. Black circles have a radius proportional to the relevance of the points they refer to. (b) Bell membership functions obtained by applying (15) to the points gathered in the central alpha-cut.

Fig. 7. Comparison between: (a) optimal regression line (22) (black line) and MLE line (dashed line), and (b) the line minimizing the distance (without rendering distortion) of the farthest points from itself.

where $\int_{-\infty}^{\infty} \mu_i(x \mid c_i)\, \mathrm{d}x = 1$ since $\mu_i(x \mid c_i)$ has the shape and mathematical properties of a conditional density function of a random variable $X$ given the value of the companion variable $C = c_i$. Analogously, $\mu_i(c_i)$ is the marginal distribution of $C$ evaluated at $c_i$. Hence

$$\mu_i(c_i) = h_i^{1/2}\, e^{-\pi h_i c_i^2}. \tag{19}$$

Finally, as $c_i$ is the distance of the point $(x_i, y_i)$ from $r$, the relation between the new coordinate and the original reference framework is the following:

$$c_i = \frac{|y_i - a - b(x_i - \bar{x})|}{\sqrt{1 + b^2}}, \tag{20}$$

so that the integral value becomes equal to

$$I_i(a, b) = h_i^{1/2}\, e^{-\pi h_i (y_i - a - b(x_i - \bar{x}))^2 / (1 + b^2)}. \tag{21}$$

Therefore, the optimal regression line w.r.t. the relevance profiles obtained as shown in Section 2.2 has parameters determined by the relationship

$$(a^*, b^*) = \arg\max_{a,b} \sum_{i=1}^{m} h_i^{1/2}\, e^{-\pi h_i (y_i - a - b(x_i - \bar{x}))^2 / (1 + b^2)}. \tag{22}$$

This looks like a stretched variant of a mean square optimization target, since the squares of the distances are exponentially enhanced as a function of the points' relevance, with relevance playing a role which is similar but not equal to the conditioning probabilities of a Gaussian mixture model. In order to solve the related optimization problem, we can turn to an incremental algorithm, e.g. a simple gradient descent, exploiting the simple expression of the derivatives of the integrals w.r.t. the parameters $a$ and $b$ of the regression line. Other techniques such as simulated annealing [1] could be an alternative. The sole constraint we put is that the final line must not trespass the borders of the confidence region computed on the original dataset as in Section 2.1.1. In our example, after a few thousand iterations of the gradient descent algorithm we obtained the results shown in Fig. 7(a). This formalization directly extends to multivariate data.
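A minimal sketch of this incremental optimization follows (ours, not the authors' implementation): it performs plain gradient ascent on the objective of (22), i.e. the sum of the integrals (21), using central finite differences instead of the analytic derivatives, and it omits, for brevity, the constraint that the line must stay inside the confidence region.

```python
import numpy as np

def granular_regression_line(x, y, h, lr=1e-3, n_iter=5000):
    """Maximize the relevance-landscape objective of eq. (22) by gradient
    ascent (equivalently, gradient descent on its negative)."""
    x, y, h = (np.asarray(v, float) for v in (x, y, h))
    x_bar = x.mean()

    def objective(a, b):
        resid = y - a - b * (x - x_bar)
        return np.sum(np.sqrt(h) * np.exp(-np.pi * h * resid ** 2 / (1 + b ** 2)))

    a, b = y.mean(), 0.0                     # start from a flat, centred line
    eps = 1e-6
    for _ in range(n_iter):
        g_a = (objective(a + eps, b) - objective(a - eps, b)) / (2 * eps)
        g_b = (objective(a, b + eps) - objective(a, b - eps)) / (2 * eps)
        a, b = a + lr * g_a, b + lr * g_b    # climb the relevance landscape
        # NOTE: clipping (a, b) to the confidence-region borders of
        # Section 2.1.1 is omitted in this sketch.
    return a, b
```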

3. Shortcutting computations

With the shift from statistical to granular contexts, we are left with a landscape optimization problem whose solution may prove to be very hard depending on the granules' membership functions and the regression curve complexity. On the other hand, we declared the challenge of drawing well-assessed tools from the machine learning arsenal in order to obtain a fast and easy regression method. In this section, we adopt a sharp simplification of the optimization problem (22) in order to use SVM techniques for its solution. Since this simplification comes at a cost, we will compensate for it by allowing regression curves more complex than linear ones, thanks to a special implementation of the kernel trick commonly used in these computations.

3.1. An SVM instance for regression problems

In view of saving computational costs, we focus on the problem of minimizing the distance of the farthest points from the line. This is a tangible change w.r.t. the above optimization problem, which directly involves the square distances of all points. However, the change proves not so disruptive, in consideration of the fact that: (i) on the one hand we identify a confidence region for the regression line on the basis of the regression lines' distribution law and (ii) within this region we are looking for a meaningful curve. Now, as we have already discriminated between regular and outlier points, preserving the influence of the farthest points within the former looks like a worthwhile target to pursue. For expository reasons we will start with equal bells formed around each point, so that what counts is their topological distance from the regression line; we will remove this constraint later on.

Fig. 8. (a) Symmetrical and (b) relevance-based shifts of sample points orthogonally to a tentative regression line.

Let us directly start with the vectorial notation, so that the regression curve is a hyperplane and the optimization problem may be enunciated as follows:

Definition 3 (Original problem). Given a set of points $S = \{x_i,\ i = 1, \ldots, m\}$, maximize the norm of $w$ under the constraint that all points have functional distance $|w \cdot x_i + q|$ less than or equal to 1 from the hyperplane $w \cdot x + q = 0$. In formulas

$$\max_{w,q}\left\{\tfrac{1}{2}\|w\|^2 \text{ such that } \eta_i (w \cdot x_i + q) \le 1\ \forall i\right\}, \tag{23}$$

where $\eta_i = \mathrm{Sign}(w \cdot x_i + q)$. In terms of Lagrangian multipliers $\alpha_i \ge 0$, (23) reads:

$$\max_{w,q}\left\{\frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\big(\eta_i(w \cdot x_i + q) - 1\big)\right\}. \tag{24}$$

The drawback of this problem is that the function we want to optimize does not have a saddle point in the space $w \times \alpha$. Hence, to fulfill this condition and work with a dual problem in $\alpha$, we consider the equivalent problem which we obtain after having symmetrically slid the points orthogonally to the regression hyperplane, by a fixed quantity that is sufficient to swap the positions of the farthest points w.r.t. the hyperplane. In this way the closest point on one side of it becomes the farthest on the other side (see Fig. 8(a)) and the maximization problem in (24) translates into a minimization one. Namely:

Definition 4 (Modified problem). For hyperplane, points $S$ and labels as in Definition 3 and a suitable instantiation of the hyperplane $w \cdot x + q = 0$, denote by $\hat{w}$ the direction orthogonal to it and by $d_{\max} = \max_{x \in S}\{|w \cdot x + q|\}$. Then, map $S$ into $S'$ through the mapping $x_i' = x_i + \eta_i (d_{\max} + \epsilon)\hat{w}$ with $\epsilon$ small and positive. Then find a solution to the following problem:

$$\min_{w,q}\left\{\tfrac{1}{2}\|w\|^2 \text{ such that } \eta_i(w \cdot x_i' + q) \ge 1\ \forall i\right\}, \tag{25}$$

i.e.,

$$\min_{w,q}\left\{\frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i\big(\eta_i(w \cdot x_i' + q) - 1\big)\right\}. \tag{26}$$

Given this problem we arrive at the dual formulation:

$$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} \eta_i \eta_j \alpha_i \alpha_j\, x_i' \cdot x_j' \quad \text{such that } \sum_{i=1}^{m} \eta_i \alpha_i = 0;\ \alpha_i \ge 0\ \forall i. \tag{27}$$

Note that all we require of the point shifts is that they are orthogonal to the separating hyperplane and of the same extent but in different directions, depending on whether the point belongs to one or the other of the separated half spaces. This allows us to revert the maximization problem (24) into the minimization problem (26) on the same function but different arguments. Apart from rare pathologies in the original instantiation of the hyperplane, the procedure that computes the $\eta_i$'s, translates points according to the running hyperplane and updates the latter on the basis of the above operations has a fixed point in the solution of the problem in Definition 3. In particular, continuing our illustrative example we obtain the line in Fig. 7(b), where $w = \sum_{i=1}^{m} \alpha_i \eta_i x_i'$ and $q = \frac{1}{n_s}\sum_{i=1}^{n_s}\big(\eta_i - \sum_{j=1}^{n_s} \alpha_j \eta_j\, x_j' \cdot x_i'\big)$, where both sums range over the $n_s$ support vectors corresponding to the non-null $\alpha$s obtained from (27). We remark that the modified problem is used to find $w$, while $q$ locates the hyperplane in the middle of the farthest points of the original $S$ along the $w$ direction.
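A sketch of the fixed-point iteration just described is given below (our reading, not the authors' code): each round slides the points by $(d_{\max} + \epsilon)$ along the current unit normal so that the modified problem (25)–(26) becomes an ordinary separable SVM problem, which we delegate to scikit-learn's SVC with a large C as a stand-in for the hard-margin dual (27); the re-centring of $q$ follows the remark above only loosely, and we assume the tentative hyperplane leaves points on both sides.

```python
import numpy as np
from sklearn.svm import SVC   # a large C approximates the hard-margin problem

def farthest_point_hyperplane(points, w_init, q_init, eps=1e-3, n_iter=20):
    """Iterate the construction of Definitions 3-4: label points by the side
    of the current hyperplane, slide them by (d_max + eps) along its unit
    normal, retrain a linear SVM on the slid points, then re-centre q on the
    original points along the new w direction."""
    X = np.asarray(points, float)
    w, q = np.asarray(w_init, float), float(q_init)
    for _ in range(n_iter):
        margins = X @ w + q
        eta = np.where(margins >= 0, 1.0, -1.0)              # side labels
        w_hat = w / np.linalg.norm(w)
        d_max = np.abs(margins).max() / np.linalg.norm(w)
        X_slid = X + eta[:, None] * (d_max + eps) * w_hat    # near/far swap
        svm = SVC(kernel="linear", C=1e6).fit(X_slid, eta)   # solve (25)-(27)
        w = svm.coef_.ravel()
        proj = X @ (w / np.linalg.norm(w))
        q = -(proj.max() + proj.min()) / 2 * np.linalg.norm(w)  # rough re-centring
    return w, q
```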

3.2. Toward more complex regression curves

As can be seen in Figs. 7(a) and (b) (and will be seen in Fig. 9(b) later on), the line computed on the basis of the support vectors lies on the opposite side of the MLE curve w.r.t. the optimal regression line from (22), thus emphasizing the drawbacks of the simplification. To compensate for this decrease in the amount of information we process (at most only three points), we may try to render their processing more sophisticated by introducing nonlinear explaining functions. A proper introduction of kernels allows us to solve the related regression problem in the same way as for hyperplanes. As is well known, this boils down to the fact that the optimization objective in (27) depends on the points only through the inner product $x_i \cdot x_j$.

Fig. 9. (a) SVM parabola (fine-grain dashed black curve) and (b) MLE line (coarse-grain dashed gray line), MLE parabola (fine-grain dashed gray curve), optimal regression line (plain black line), SVM regression line (coarse-grain dashed black line) and SVM parabola (fine-grain dashed black curve) computed from the SMSA dataset.

Fig. 10. Testing with the Norris dataset: (a) sample points with embedded relevance score and 0.9 confidence region for the regression line (same notation as in previous figures); (b) comparison between optimal regression line (black line) and MLE line (dashed gray line).

Then the kernel trick consists of assuming the inner product to be a special instance of a symmetric function³ $k(x_i, x_j)$—the kernel—and repeating the computation outlined in Definition 4 for any other instance of this function, intended as the inner product $\nu_i \cdot \nu_j$ with $\nu_i = \phi(x_i)$ ranging in a suitable feature space. In this way we obtain a fitting of the points according to a linear function in $\nu$, hence a possibly nonlinear function in $x$. The vector $\nu$ typically has a higher dimension than $x$ (the additional components actually take into account the nonlinearities of the fitting function). We will come back to the leading example after having introduced the last point of the procedure.

3.3. Freeing the shapes of the granular bells

In principle, having different bells around each sample point locates them at virtual distances that are different from the topological ones. As is evident from (17), we may reformulate the optimization problem (22) with $h = 1$ and a modified distance of the points from the regression hyperplane, so that the more relevant a point, the farther it must be considered from the hyperplane in the $x$ space. We may induce this virtual metric by simply pushing or pulling the points along the direction orthogonal to the hyperplane (see Fig. 8(b)). With the same notation as in Definition 4, the solution of the new problem may be achieved after a preliminary mapping of $S$ into $S^t$ through the mapping $x_i^t = x_i - \eta_i f(h_i)\hat{w}$, where $f$ is a function of the relevance $h_i$ possibly introducing some nonlinearities in order to privilege more/less relevant points—e.g. $f(h_i) = h_i^v$ with $v > 1$ to strengthen the influence of relevant points, and the opposite behavior for $0 < v < 1$. Then we slide $S^t$ into $S'^t$ and proceed with the optimization problem (26) with $x_i'$ correspondingly substituted by $x_i'^t$. An analogous formulation holds with hyperplanes in the extended $\nu$ space. While in principle this figures as an extension of the procedure adopted in Fig. 8(a), with the sole variant that the point shifts are specifically calibrated, the problem is that the virtual distances now depend not only on the versor but also on the position of the hyperplane (i.e., on the $q$ coefficient). Thus we must find both parameters at the fixed point of the whole procedure. This may require some damping operator, such as exponential smoothing, to converge to a fixed point: for instance, as shown in Fig. 9, when dealing with the example where we substituted the dot product in the last procedure with an ad hoc polynomial kernel computing the class of parabolas. Namely, the kernel is $k_P(x_i, x_j) = x_i \cdot x_j + x_{i1}^2 x_{j1}^2$, accounting for the features $\nu_i = \{x_{i1}, x_{i2}, x_{i1}^2\}$, with $x_{ik}$ denoting the $k$-th component of the point $x_i$. Fig. 9(a) shows such a curve minimizing the distance (in the feature space) between the farthest points according to the relevance score correction, while Fig. 9(b) summarizes the types of forms obtained so far.

³ Plus some regularity conditions.
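To make Sections 3.2–3.3 concrete, here is a small sketch (ours; the function names and the exponent value are illustrative assumptions): the explicit feature map realizing the parabola kernel $k_P$, and the relevance-based virtual relocation $x_i^t = x_i - \eta_i f(h_i)\hat{w}$ with $f(h_i) = h_i^v$.

```python
import numpy as np

def parabola_features(X):
    """Explicit feature map for k_P(x_i, x_j) = x_i . x_j + x_i1^2 * x_j1^2,
    i.e. nu_i = (x_i1, x_i2, x_i1^2): a linear fit in this feature space is
    a parabola in the original plane."""
    X = np.asarray(X, float)
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2])

def relevance_relocation(X, h, w, q, v=1.5):
    """Virtual metric of Section 3.3: move each point along the unit normal
    of the tentative hyperplane by f(h_i) = h_i**v, signed by the side it
    lies on, before applying the sliding of Definition 4."""
    X, h, w = np.asarray(X, float), np.asarray(h, float), np.asarray(w, float)
    w_hat = w / np.linalg.norm(w)
    eta = np.where(X @ w + q >= 0, 1.0, -1.0)
    return X - eta[:, None] * (h ** v)[:, None] * w_hat
```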

4. Numerical experiments

We experimented with the proposed method by considering a series of well-known benchmarks, with the general aim of both appreciating the effects of the local information and exploring the viability of the simplified SVM approach. Moreover, we devised a pair of case studies in order to understand the inner mechanisms of the proposed method. Let us start with an unfavorable example. The Norris benchmark [19], published by the US National Institute of Standards and Technology (NIST), contains 36 items concerning the calibration of ozone monitors. This is an example of a highly structured

dataset: the corresponding scatter plot, drawn in Fig. 10(a), shows 10 granules whose centers almost lie on a straight line, with the x-coordinates equally distributed. As a consequence, all points are contained in the confidence region, i.e., none of them is excluded from the information granules' set. As a further consequence, the FCM routine (run with c = 4 in the experiments) introduces an undue over-structure on the points; it actually partitions them into four groups of contiguous points, with the result of giving higher relevance to the points close to the centers of the clusters. As a result we obtain an optimal (granular) regression line that shifts slightly from the MLE line [29] because of the enhancement of the original random shifts of the sampled points, which is improperly introduced by the FCM procedure. In any case the residual errors sum up to a few hundredths of a percent of the full scale, indicating good performance of the algorithm even in this calibrating instance and suggesting the conclusion that the introduction of local features is neither effective nor noising when the data have a fair spread along the regression curve.

Fig. 11. Exploiting local structure in the Swiss database. The notation is the same as used in Fig. 10.

We have a more complex structure inside a dataset reported in [18] that describes standardized socio-economic indicators for 47 French-speaking provinces of Switzerland at about 1888. In particular we get the scatter plots in Fig. 11 by crossing the Ig fertility indicator (vertical axes) with a second indicator that alternatively refers to the percentages of males involved in agriculture, of Catholic people, of educated people, of people

achieving the highest exam scores and of infant mortality in those provinces. In the figure we see that the linear model obtained through (22) (black line) looks to cross crowded ensembles of points (denoting the effect of a specific phenomenon), while the common MLE line (dashed gray line) looks to balance the shifts. As mentioned in Section 2.2, this is exactly the main benefit we expect from scoring the points with a relevance indicator coming from clustering. When the two targets coincide, like in Fig. 11(b), the two lines overlap. Vice versa, the influence of the two different targets is clearly shown in Fig. 11(d) and (e). On the contrary, in Fig. 11(a) and (c) the way through which the granular regression target is pursued is less clear, because it derives from a compromise between the attractions the line receives from variously spread directions. Also the identification of a suitable number of clusters and of the fuzzification factor is quite immediate in Fig. 11(d) and (e), though with rules of thumb, while we used trial and error procedures in the other cases. As for the SVM algorithm, in all experiments of this section we enhanced the relevance of the points with the power function $f(h_i) = h_i^v$ discussed in Section 3.3, with $v$ ranging between 1 and 2. Results obtained with other functions, such as a scaled tanh, proved similar. Returning in greater detail to the scatter plot in Fig. 11(e), Fig. 12(a) shows its 0.90 confidence region—denoting the fertility percentage by F and infant mortality by IM—while Fig. 12(b) compares the optimal regression line (plain line) with the SVM

line (dashed black line) and the MLE line (dashed gray line), both computed on the points lying in the confidence region and enriched with relevance information (obtained by applying the FCM algorithm with c = 3). Finally, moving to nonlinear models, Fig. 13 compares the MLE parabola (dashed gray curve) with the one obtained through the SVM procedure using the above kernel $k_P$ and taking into account the additional relevance information (dashed black curve).

Fig. 12. (a) The 0.90 confidence regions from the dataset of Fig. 11(e). (b) Optimal regression line (plain line), SVM regression line (dashed black line) and MLE line (dashed gray line).

Fig. 13. Comparison between MLE parabola (dashed gray curve) and SVM parabola (dashed black curve).

Fig. 14. Density function of a two-dimensional Laplace random variable.

Both examples, the latter and the one used to lead the exposition, seem to denote a common utility of the method. In the presence of local structures in addition to the general quasi-linear trend, the landscape section optimization produces lines more attracted by the cores of these structures than MLE regression lines. The over-simplification introduced by the SVM target reduces this benefit, while the additional degrees of freedom allowed by the kernel trick partly recover this drawback. To test this behavior we devised a case study constituted by three groups of points, each distributed according to a two-dimensional Laplace distribution (just to move away from the usual Gaussian framework, see Fig. 14), with centers variously displaced along a line and standard deviations around one tenth of the maximum distance between them. In the mosaic in Fig. 15 we report pictures corresponding to different locations of the centers that confirm the mentioned behavior. It becomes more accentuated when the SVM approximation gives rise to a line very different from the exact optimization's, like in Fig. 15(d). In this case the linear prejudice about the general trend excluded many points as outliers, thus cutting off a great part of the third group. Thus the optimal regression line recognizes only two bulks corresponding to the other groups and computes a line crossing their centers. The

SVM algorithm instead locates first and second order curves more centered w.r.t. the confidence region, as the pruned points are well spread inside it, thus getting close to the MLE curves. Finally, in Figs. 15(e) and (f) we parametrized the FCM algorithm with an exceeding number of clusters (4 and 5, respectively), with the consequence of giving undue relevance to some peripheral points of the groups. In this case the optimal line somehow wanders in the feature space and the quadratic SVM prefers staying farther from it than the linear SVM. We also experimented with other kernel operators. For instance, in Fig. 16 we used the typical polynomial kernel of degree 2, $k_D(x_i, x_j) = (x_i \cdot x_j + \theta)^2$ for any constant $\theta$,⁴ on a specially devised dataset [4]. Note that the equidistance of the farthest points is obtained in the $\nu$ feature space, hence from the hyperplane fitting the points in this space. Thus what we really exploit of this hyperplane is its versor (the angular coefficients), while the constant term must be renegotiated in the $x$ space. At this point we are free to add more stringent requirements, for instance that the weighted sum of the quadratic distances from the fitting curve be minimized as in the original goal. In this way we obtain an approximate solution of the original problem in a reasonable amount of time thanks to the use of kernels in the dual optimization problem. Note that, thanks to the quadratic shape of the membership functions, the curve is very close to the one obtained with a $q$ minimizing the distances of the farthest points (dashed curve), as in all previous examples. Moreover, assuming the point lying in the second quadrant to be a further outlier and therefore deleting it, we obtain the different scenario depicted in Fig. 16(b).

⁴ Actually a slight variant also embedding the shifts of points to pivot the curves on the farthest points, as explained in Definition 4.

Fig. 15. Comparison between regression curves computed through the proposed approaches and ML estimates on a synthetic dataset. Gray rings denote the centers of the original Laplace distributions; thin lines: divide lines between regular points and outliers. Plain black line: granular regression line; coarse-grain dashed black line: SVM regression line; fine-grain dashed black curve: SVM regression parabola. Coarse-grain dashed gray line: MLE line; fine-grain dashed gray curve: MLE regression parabola. Pictures (a–d) differ in the displacement of the distributions' means; pictures (e–f) in the number of FCM centroids.

Fig. 16. Nonlinear regression curves. Dashed curves: regression curves found by minimizing the distances of the farthest points from the SVM curve obtained using a polynomial kernel of degree 2; plain curves: curves obtained by minimizing the weighted distance of points from curves. Points' size is proportional to the relevance score. (a) All the points, and (b) only those lying in the first quadrant are considered.

The procedure can be applied with no additional variation to multidimensional data points. For the sake of visualization we focus on a three-dimensional benchmark constituted by the above SMSA dataset, where the age adjusted mortality now depends on both the non-white percentage and the sulphur dioxide

concentration SO2. Fig. 17 shows a paraboloid surface solving the dual problem (27).

Fig. 17. Three-dimensional SVM paraboloid fitting the points drawn from the SMSA dataset as in Fig. 5.

5. Conclusions

In the perspective of probability as a way of organizing available information about a phenomenon rather than a physical property of the phenomenon, we consider additional information which is local, hence not gathered through a measure summing to 1 over a population. In particular, with respect to the linear regression problem, we focus on: (i) alpha-cuts identified through statistical methods and (ii) a local density of clusters of points that reverberates in a membership function of population points to the information granules represented by the sample points. In the perspective that the representation of this information still has to be negotiated with the suitability of its exploitation, we used an augmented kernel trick to have the possibility of locating the information granules in the virtual space we feel most proper, and the dual formulation of the SVM problem to get results quickly. The proposed method is very general and exploits both types of information to efficiently produce regression lines different from those obtained with straight statistical methods whenever local information is prominent. This could help move forward toward full exploitation of the information available with data. Further future extensions of the approach may focus on the following directions:

- embedding the procedure with an optimization module for the kernel identification, intended as the best representation of the data, something that may figure as a nonlinear and finalized counterpart of the search for independent components; and
- considering new forms of local information and ways of reverberating them into the virtual localization of the sampled points within the feature space.

References

[1] E. Aarts, J. Korst, Simulated annealing and Boltzmann machines: a stochastic approach to combinatorial optimization and neural computing, Wiley, Chichester, 1989.

[2] B. Apolloni, S. Bassis, S. Gaito, D. Iannizzi, D. Malchiodi, Learning continuous functions through a new linear regression method, in: B. Apolloni, M. Marinaro, R. Tagliaferri (Eds.), Biological and Artificial Intelligence Environments, Springer, Berlin, 2005, pp. 235–243. [3] B. Apolloni, S. Bassis, S. Gaito, D. Malchiodi, Appreciation of medical treatments by learning underlying functions with good confidence, Curr. Pharm. Des. 13 (15) (2007) 1545–1570. [4] B. Apolloni, D. Iannizzi, D. Malchiodi, W. Pedrycz, Granular regression, in: B. Apolloni, M. Marinaro, R. Tagliaferri (Eds.), Proceedings of WIRN 2005, Lecture Notes in Computer Science, Springer, Berlin, 2006, pp. 147–156. [5] B. Apolloni, D. Malchiodi, S. Gaito, Algorithmic Inference in Machine Learning, second ed., Advanced Knowledge International, Magill, 2006. [6] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981. [7] J. Bi, T. Zhang, Support vector classification with input data uncertainty, in: Advances in Neural Information Processing Systems, NIPS’04, vol. 17, 2004, pp. 161–168. [8] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, 2000. [9] B.M. Douglas, D.J. Watts, Nonlinear Regression Analysis and its Applications, Wiley, New York, 1988. [10] H. Drucker, C. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, in: Advances in Neural Information Processing Systems, NIPS, vol. 9, 1997, pp. 155–161. [11] D. Dubois, H. Nguyen, H. Prade, Fuzzy sets and probability: misunderstandings bridges and gaps, in: D. Dubois, H. Prade (Eds.), Fundamentals of Fuzzy Sets, Kluwer, Boston, 2000, pp. 343–438. [12] D. Dubois, H. Prade, Possibility Theory, Plenum Press, London, 1976. [13] D. Dubois, H. Prade, Properties of measures of information in evidence and possibility theories, Fuzzy Sets and Systems 24 (2) (1987) 161–182 (special issue: Measures of Uncertainty). [14] T. Gesterl, J. Suykens, B. De Moor, J. Vandewalle, Automatic relevance determination for least squares support vector machine regression, in: Proceedings of International Joint Conference on Neural Networks IJCNN ’01, vol. 4, 2001, pp. 2416–2421. [15] P. Hao, J. Chiang, A fuzzy model of support vector machine regression, in: The 12th IEEE International Conference on Fuzzy Systems, FUZZ ’03, vol. 1, 2003, pp. 738–742. [16] H. Huang, Y. Liu, Fuzzy support vector machines for pattern recognition and data mining, Int. J. Fuzzy Syst. 4 (3) (2002) 826–835. [17] Y. Ivanov, B. Blumberg, A. Pentland, Expectation maximization for weakly labeled data, in: Proceedings of the Eighteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, 2001, pp. 218–225. [18] F. Mosteller, J.W. Tukey, Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, Reading, MA, 1977. [19] National Institute of Standards and Technology, StRD dataset Norris, hhttp:// www.itl.nist.gov/div898/strd/lls/data/Norris.shtmli (online, accessed April 2005). [20] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, 1965. [21] W. Pedrycz, Granular computing in data mining, in: M. Last, A. Kandel (Eds.), Data Mining & Computational Intelligence, Springer, Berlin, 2001, pp. 37–61. [22] W. Pedrycz, F. Gomide, An Introduction to Fuzzy Sets: Analysis and Design, The MIT Press, Cambridge, MA, 1998. [23] J. 
Platt, Probabilistic outputs for support vector machines and comparison to regularized likelihood methods, in: A. Smola, P. Bartlett, B. Schoelkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000, pp. 61–74. [24] J. Rice, Mathematical Statistics and Data Analysis, Wadsworth and Brooks/ Cole, Pacific Grove, 1988. [25] K. Sankar, S. Mitra, P. Mitra, Rough fuzzy MLP: modular evolution, rule generation and evaluation, IEEE Trans. Knowl. Data Eng. 15(1). [26] D. Savic, W. Pedrycz, Evaluation of fuzzy regression models, Fuzzy Sets Syst. 39 (1991) 51–63. [27] G.A.F. Seber, L.J. Alan, Linear Regression Analysis, second ed., WileyInterscience, Hoboken, 2003. [28] P. Smets, Numerical representation of uncertainty, Handbook of Defeasible Reasoning and Uncertainty Management Systems 3 (1998) 265–309. [29] S. Stigler, The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, Cambridge, MA and London, England, 1986. [30] Z. Sun, Y. Sun, Fuzzy support vector machine for regression estimation, IEEE Int. Conf. Syst. Man and Cybern. 4 (2003) 3336–3341. [31] H. Tanaka, S. Uejima, K. Asai, Linear regression analysis with fuzzy model, IEEE Trans. Syst. Man Cybern. 12 (6) (1982) 903–907. [32] C. Thiel, F. Schwenker, G. Palm, Using Dempster–Shafer theory in MCF systems to reject samples, in: N.C. Oza, R. Polikar, J. Kittler, F. Roli (Eds.), Proceedings of the Sixth International Workshop on Multiple Classifier Systems MCS 2005, Springer, Berlin, 2005, pp. 118–127. [33] U.S. Department Labor Statistics, SMSA dataset, Air pollution and mortality hhttp://lib.stat.cmu.edu/DASL/Datafiles/SMSA.htmli. [34] T. Villmann, B. Hammer, F. Schleif, T. Geweniger, W. Herrmann, Fuzzy classification by fuzzy labeled neural gas, Neural Networks. 19 (6) (2006) 772–779 (special issue: Advances in Self-Organizing Maps). [35] T. Villmann, F. Schleif, B. Hammer, Fuzzy labeled soft nearest neighbor classification with relevance learning, in: Proceedings of the Fourth International Conference on Machine Learning and Applications (ICMLA’05), vol. 00, IEEE Computer Society, Washington, DC, 2005, pp. 11–15.


[36] L. Zadeh, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets Syst. 100 (1999) 9–34.
[37] L. Zouhal, T. Denoeux, Generalizing the evidence-theoretic k-NN rule to fuzzy pattern recognition, in: Proceedings of the Second International Symposium on Fuzzy Logic and Applications, ISFL'97, ICSC, Academic Press, Zurich, 1997, pp. 294–300.

Bruno Apolloni is Full Professor in Computer Science at the University of Milan, Italy. His main research interests are in the frontier area between probability and mathematical statistics and computer science, with special regard to pattern recognition and multivariate data analysis, probabilistic analysis of algorithms, subsymbolic and symbolic learning processes, and fuzzy systems. He has published more than 120 papers in international journals and proceedings of international congresses. Apolloni is head of the Neural Networks Research Laboratory (LAREN, http://laren.dsi.unimi.it) in the Department of Computer Science of the University of Milan as well as President of the Italian Society for Neural Networks (SIREN, http://siren.dsi.unimi.it). He is a member of the editorial board of the journals Neural Networks, Neurocomputing and International Journal of Hybrid Intelligent Systems.

Simone Bassis is an Assistant Professor at the Department of Computer Science, University of Milano, Italy. His main research activities concern the inference of spatial and temporal processes, ranging from linear and nonlinear statistical regression methodologies to techniques of fractal process identification, passing through the analysis and synthesis of both population and neuronal evolutionary dynamics. He has published around 15 papers in international journals and conference proceedings.

Dario Malchiodi is an Assistant Professor in the Computer Science Department, University of Milano, Italy. He is currently teaching in the "laboratory of computer programming," "advanced java programming," and "simulation" classes. His research activities concern the treatment of uncertain information and related aspects of mathematical statistics and artificial intelligence, including applications to machine learning, population dynamics, and pervasive computing. He has published around 45 papers in international journals and conference proceedings.

Witold Pedrycz received the M.Sc., and Ph.D., D.Sci. all from the Silesian University of Technology, Gliwice, Poland. He is a Professor and Canada Research Chair (CRC) in Computational Intelligence in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. He is also with the Polish Academy of Sciences, Systems Research Institute, Warsaw, Poland. His research interests encompass computational intelligence, fuzzy modeling, knowledge discovery and data mining, fuzzy control including fuzzy controllers, pattern recognition, knowledge-based neural networks, granular and relational computing, and software engineering. He has published numerous papers in these areas. He is also an author of 11 research monographs. Witold Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. He currently serves as an Associate Editor of IEEE Transactions on Systems Man and Cybernetics, IEEE Transactions on Neural Networks and IEEE Transactions on Fuzzy Systems. He is also an Editor-in-Chief of Information Sciences (Elsevier).