Robust fitting for the Sugeno integral with respect to general fuzzy measures

Robust fitting for the Sugeno integral with respect to general fuzzy measures

Robust fitting for the Sugeno integral with respect to general fuzzy measures Journal Pre-proof Robust fitting for the Sugeno integral with respect ...

795KB Sizes 0 Downloads 34 Views

Robust fitting for the Sugeno integral with respect to general fuzzy measures

Journal Pre-proof

Robust fitting for the Sugeno integral with respect to general fuzzy measures Gleb Beliakov, Marek Gagolewski, Simon James PII: DOI: Reference:

S0020-0255(19)31069-2 https://doi.org/10.1016/j.ins.2019.11.024 INS 15016

To appear in:

Information Sciences

Received date: Revised date: Accepted date:

29 May 2019 10 October 2019 13 November 2019

Please cite this article as: Gleb Beliakov, Marek Gagolewski, Simon James, Robust fitting for the Sugeno integral with respect to general fuzzy measures, Information Sciences (2019), doi: https://doi.org/10.1016/j.ins.2019.11.024

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier Inc.

Robust fitting for the Sugeno integral with respect to general fuzzy measures Gleb Beliakova , Marek Gagolewskia,b,c , Simon Jamesa a

b

School of Information Technology, Deakin University, Geelong, AUSTRALIA Faculty of Mathematics and Information Science, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, POLAND c Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, POLAND

Abstract The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require efficient methods for learning the associated fuzzy measure. While such methods are well developed for the Choquet integral, the fitting problem is more difficult for the Sugeno integral because it is not amenable to being expressed as a linear combination of weights, and more generally due to plateaus and non-differentiability in the objective function. Previous research has hence focused on heuristic approaches or simplified fuzzy measures. Here we show that the problem of fitting the Sugeno integral to data such that the maximum absolute error is minimized can be solved using an efficient bilevel program. This method can be incorporated into algorithms that learn fuzzy measures with the aim of minimizing the median residual. This equips us with tools that make the Sugeno integral a feasible option in robust data regression and analysis. We provide experimental comparison with a genetic algorithms approach and an example in data analysis. Keywords: Sugeno integral, Fuzzy measure, Parameter learning, Email addresses: [email protected] (Gleb Beliakov), [email protected] (Marek Gagolewski), [email protected] (Simon James) Preprint submitted to Information Sciences

November 15, 2019

Aggregation functions 1. Introduction The Sugeno integral has been extensively studied since its introduction in 1974 [1] and is well understood as a sophisticated aggregation function that can be applied on both real and ordinal inputs. Further interest has arisen recently as it was recognized that the famous Hirsch index [2] used in bibliometrics is actually a special case of the Sugeno integral [3, 4]. Semantically, the output of the Sugeno integral summarizes a set of inputs with a value that captures both the extent (quality) and breadth (quantity) to which inputs are high. For example, in the simplest case possible, if the inputs represent decision criteria then the output is: The highest value y such that a subset of criteria with combined importance at least y (breadth) all have scores of at least y (extent).

In the case of the Hirsch index (or h-index), the input scores are paper citations and the importance is measured by the number of papers (the counting measure), so that the common definition is: The highest value h such that at least h published papers (breadth) have at least h citations (extent).

In decision-making and fuzzy sets research, the Sugeno integral and the Choquet integral [5] (for discrete sets of inputs) have received the most attention among the fuzzy integrals, however applications of the Choquet integral are far more prolific (e.g., [6–8]), owing at least in part to the well developed methods available for solving the Least Squares (LS) [9] and Least Absolute Deviation (LAD) [10] fitting problems. In both cases, we suffer from an exponentially increasing number of parameters for general measures, i.e., we need to define the measure of importance across all subsets A ⊂ N for |N | = n inputs. However, with the Sugeno integral we face two additional concerns: 1. The output cannot be represented as a linear combination of the fuzzy measure weights and inputs, hence the usual approaches to minimizing the LS and LAD residuals are unavailable; 2

2. Applications involving ordinal inputs require us to meaningfully define what constitutes goodness-of-fit. In response to the first problem, previous studies on parameter learning for the Sugeno integral have focused on simplified measures and/or employed general purpose machine learning techniques. Keller and Osborn’s approach to the learning problem (applied to classifier combination) in [11] used an iterative reward-and-punish algorithm to estimate the weights of the bestperforming λ-fuzzy measures (also called Sugeno measures). Sugeno measures are determined by the individual weights of the inputs and a parameter λ, and so only require n + 1 parameters rather than the 2n − 1 weights requiring identification in the case of general fuzzy measures. Once the n weights associated with each of the inputs are known, λ can be determined numerically. Keller and Osborn’s algorithm made small adjustments to the fuzzy measure values associated with correct and incorrect classifications, increasing or decreasing them respectively, before then recalculating the λ value at each step. A more recent study by Saleh et al. [12] also used λ-fuzzy measures, where the individual weights were based on the rule confidence (ratio of support for a particular class against support for other classes) of a number of fuzzy decision rules learned from the data. The authors also considered distorted probability measures, another form of simplified measure where the importance of each subset is calculated from a probability measure via a univariate function g, e.g., g(Pr(A)) = (Pr(A))q . In these experiments the Sugeno integral proved to outperform the Choquet integral and other methods in classifier combination. In our own work we have investigated branch-and-bound based algorithms for symmetric fuzzy measures [13, 14] toward the aim of minimizing LS and LAD. As with λ-measures and distorted probabilities, symmetric fuzzy measures require far fewer parameters for their determination. One of the few previous studies (to our knowledge) that has considered the learning of general (unsimplified) fuzzy measures is that detailed in [15]. In the more general context of triangular fuzzy valued fuzzy measures, they considered a custom-built genetic algorithm with the initial fuzzy measure population randomly generated and then modified to satisfy monotonicity constraints. Experiments in this case focused on measures defined for n = 4 inputs. On the matter of meaningfully measuring the goodness-of-fit in the case of ordinal inputs, there are a number of approaches one can take. Depending 3

on the granularity of the scale used, methods based on LS and LAD can still provide a good proxy for well performing measures. In [16, 17] we have explored techniques for ordinal fitting, where the aim is to preserve the relative ranking of the outputs. In this contribution we focus on minimizing the maximum absolute error (i.e., the l∞ norm) and the median absolute error. While such measures still imply some interpretation of distance over the ordinal scale, i.e., being within 3 labels is treated equally regardless of the labels’ position, they can still be considered robust in that ‘close’ fitting across a number of data instances cannot compensate for poor fitting elsewhere. The maximum absolute error can be useful in real world applications for modeling that focuses on the worst case scenario. By proposing an auxiliary penalty, we show that the problem of minimizing the maximum absolute error can be solved efficiently to an arbitrary level of precision using a bilevel optimazation approach consisting of a singlevariate step (numerically solvable using, e.g., bisection search) and a linear programming step. In doing so, we gain access to feasible sampling-based methods for minimizing the median absolute error (or indeed any fitting criterion) toward practical applications of the Sugeno integral. In the Preliminaries section, we give the required background on the Sugeno integral and fitting functions to data. Section 3 presents an auxiliary residual function, which is a key component of the bilevel ftting approach we develop in Section 4. In Section 5 we detail how this can be used to find the fuzzy measure that fits to data with respect to the least median residual and we show its comparative performance against a genetic algorithms approach in Section 6. In Section 7 we provide a data analysis example using the Boston housing dataset and Section 8 concludes. 2. Preliminaries In this section we present definitions pertaining to the Sugeno integral and give an overview of relevant linear formulations for data fitting. 2.1. The Sugeno integral We consider Sugeno integrals defined with respect to discrete fuzzy measures.

4

Definition 1. Let N = {1, 2, . . . , n}. A discrete fuzzy measure is a set function µ : 2N → [0, 1] which is monotonic (i.e., µ(A) 6 µ(B) whenever A ⊂ B) and satisfies µ(∅) = 0 and µ(N ) = 1.

We assume that the fuzzy measure values range over [0, 1], however in general, the Sugeno integral can be calculated with respect to any set function over a bounded chain with analogous properties. Calculation of the Sugeno integral for an input vector x is then expressed as follows. Definition 2. The Sugeno integral with respect to a fuzzy measure µ and input vector x ∈ [0, 1]n is given by  _  Sµ (x) = (min xi ) ∧ µ(A) , (1) A⊆N

i∈A

where ∨ is the maximum and ∧ is the minimum.

Rather than calculating over all subsets, this can be simplified by means of rearranging the input vector into non-decreasing order x(1) 6 x(2) 6 · · · 6 x(n) and defining a weighting vector h ∈ [0, 1]n corresponding with the chain of nested subsets induced by the ordering of x. This gives us S(x, h) = max min{x(i) , hi }, i=1,...,n

(2)

where x(i) denotes the i-th smallest argument of x and hi = µ({(i), (i + 1), . . . , (n)}). The Sugeno integral can also be expressed in terms of the median where S(x, h) = Med(x1 , x2 , . . . , xn , h2 , h3 , . . . , hn ). It can be visually depicted as the maximum point of overlap between the decreasing distribution h and the increasing distribution of x, e.g., as shown in Fig. 1 (Section 3). When the fuzzy measure is symmetric, i.e., the measure of a subset depends only on its cardinality, |S| = |T | =⇒ µ(S) = µ(T ), then h will be the same regardless of the relative ordering of x. Throughout the paper we may make reference to the Choquet integral. Definition 3. The Choquet integral with respect to a fuzzy measure µ and input vector x is given by Cµ (x) =

n X (x(i) − x(i−1) )hi , i=1

5

(3)

where x(i) and hi are defined as before and x(0) = 0 by convention. As fuzzy measures are defined by O(2n ) values, the Shapley indices are often used to estimate the relative importance of each variable. For a given fuzzy measure µ, the Shapley value [18, 19] for each index i is given by, φ(i) =

X

A⊆N \{i}

(n − |A| − 1)!|A|! [µ(A ∪ {i}) − µ(A)]. n!

(4)

~ The Shapley value is the vector φ(v) = (φ(1), . . . , φ(n)). 2.2. Fitting functions to data For a given dataset consisting of observed input vectors x(k) and observed outputs y (k) , k = 1, . . . , m, the aim in data fitting is to find the parameters of a function f such that f (x(k) ) is as close to y (k) as possible over the observed instances. The criterion for closeness is usually the sum of least squared differences (f (x(k) ) − y (k) )2 , however other choices are possible, for example, the sum, median, or maximum of absolute deviations |f (x(k) ) − y (k) |. With aggregation functions, this optimization needs to be done whilst preserving the monotonicity and boundary conditions. Here we are interested in finding the parameters µ such that the maximum absolute error is minimized, i.e., with the objective Minimize max |fµ (x(k) ) − y (k) | = t, w.r.t. µ, t, k=1,...,m

(5)

where t is an additional decision variable we introduce to make fitting possible. When f is the Choquet integral, this can be solved with linear constraints with respect to µ and t by first mapping the n dimensional input vectors to 2n − 1 length vectors corresponding with the fuzzy measure coefficients,   gA (x) = max 0, min xi − max xi i∈A

so that Cµ (x) =

P

A⊆N

i∈N \A

µ(A)gA (x).

Denoting the decision variables µ(A) by µA and the coefficients from the (k) data gA (x(k) ) by gA , the constraints will be:

6

• (data constraints) (k)

(k)

(k)

(k)

(k)

(k)

µ{1} g{1} + µ{2} g{2} + · · · + µN gN − t 6 y (k) , ∀k

(6)

µ{1} g{1} + µ{2} g{2} + · · · + µN gN + t > y (k) , ∀k

• (monotonicity and boundary constraints) µA − µA\{i} > 0, ∀i ∈ A, ∀A ⊆ N . µ∅ = 0 µN = 1.

(7)

We refer the reader to [20] for formulations based on minimizing the sum of least squares and least absolute deviations. 3. An auxiliary residual function for the Sugeno integral Fitting the Sugeno integral to data using linear-constraint based optimization presents particular difficulties because the weights cannot be expressed as a linear function of the inputs as they can with the Choquet integral in (5)–(7). However in the case of zero error (where the outputs are exactly the Sugeno integral of the input vector with respect to the unknown fuzzy measure), the weights can be inferred from input and output pairs since the Sugeno integral always returns either one of the inputs, or one of the fuzzy measure weights. For a given input vector x(k) and observed output y (k) , let a be such that x(a−1) < yk 6 x(a) . The Sugeno integral then reduces to min{ha , x(a) }. We recall the example presented in [13]. Example 1. Suppose we observe the following (x(k) , y (k) ) instances and can assume that S(x(k) , h) = y (k) exactly in each case. k 1 2 3

(k)

x(1) 0.21 0.6 0.33

(k)

(k)

(k)

x(2) x(3) x(4) 0.3 0.31 0.7 0.78 0.9 0.91 0.42 0.67 0.71

7

(k)

x(5) 0.73 1 0.86

y (k) 0.33 0.78 0.63

For k = 1, the observed output falls between the 3rd and 4th inputs and so (k) we can deduce that h4 ∧ x(4) = 0.33 and hence h4 = 0.33. For k = 2, y (k) is equal to the 2nd input. This means that as long (k) as h2 > 0.78, we will have h2 ∧ x(2) = 0.78. This will be greater than (k)

(k)

h1 ∧ x(1) = x(1) and then we would require h3 6 0.78 in order to ensure that (k)

h3 ∧ x(3) is not taken as the output. In the case of k = 3, the input falls between the 2nd and 3rd inputs, so (similar to the case of k = 1) we will have h3 = 0.63. (k)

(k)

In general, for x(a−1) < y (k) < x(a) , it follows that ha = y (k) and for (k)

(k)

x(b−1) = y (k) < x(b) we have hb−1 > y (k) and hb 6 y (k) . Any fuzzy measure satisfying these constraints will hence interpolate the data. For a given Sugeno integral with respect to the induced vector h, we define the following residual function Rh (x, y). Definition 4. For an input vector x ∈ [0, 1]n , let h ∈ [0, 1]n denote the vector induced from a fuzzy measure µ according to the non-decreasing ordering x(1) 6 x(2) 6 · · · 6 x(n) , i.e., hi = µ({(i), . . . , (n)}), i = 1, . . . , n. Let a denote the maximum index such that x(a−1) < y and b the minimum index such that x(b) > y, that is, a = max{i|x(i−1) < y},

b = min{i|x(i) > y},

with x(0) = −∞, x(n+1) = ∞, hn+1 = −∞ by convention. We define the residual function Rh (x, y) : [0, 1]n+1 → [0, 1] associated with the Sugeno integral with respect to h as,  x(1) 6 y 6 x(n) ;  max(0, y − ha , hb − y), max(0, y − x(n) , y − hn ), y > x(n) ; Rh (x, y) = (8)  max(0, x(1) − y, h2 − y), y < x(1) .

Note that if y 6= x(a) , then a = b. The vector h depends on x(k) but we omit this in order not to overload notation. The following proposition ensures that this function is appropriate for data fitting. Proposition 1. The residual Rh (x, y) = 0 if and only if S(x, h) = y. 8

Proof. We will show that S(x, h) = y implies Rh (x, y) = 0 and conversely that Rh (x, y) = 0 implies S(x, h) = y. (i) S(x, h) = y =⇒ Rh (x, y) = 0. For an observation pair (x, y), if S(x, h) = y, then we know that y = min(x(a) , ha ) where a is defined as in Def. 4. If ha < x(a) then y = ha and y − ha = 0. Furthermore, hb 6 ha and hence hb − y 6 0. Therefore, Rh = max(0, y − ha , hb − y) = 0. Otherwise, if ha > x(a) then y = min(x(a) , ha ) = x(a) and y − ha 6 0. It again follows that hb 6 y, hence Rh = 0. (ii) Rh (x, y) = 0 =⇒ S(x, h) = y. If Rh (x, y) = 0, from Eq. (8) it follows that x(1) 6 y 6 x(n) and that y 6 ha , y > hb . Since we also know that a = b whenever y 6= x(a) (from the definition of a and b) for this situation we have y 6 ha , y > ha , i.e., that y = ha . From this it follows that x(a−1) 6 ha 6 x(a) , hence S(x, h) = min(x(a) , ha ) = ha = y. For the case that y is equal to one or more of the x(i) , by the definition of a we have y = x(a) . Recall again that y 6 ha , y > hb from Eq. (8), and so we infer that hb 6 x(a) 6 ha and hence S(x, h) =

max min(x(i) , hi ) = min(x(a) , ha ) = x(a) .

i=a,...,b−1

We therefore obtain y = S(x, h). This means that if we minimize the residual Rh such that we obtain a value of 0, our obtained function interpolates the data. However for cases where S cannot model this residual always over-estimates the data exactly, the actual residual S x(k) − y (k) . The following example helps illustrate this.

Example 2. Consider Fig. 1. For this vector with n = 10, x is shown as shaded bars in non-decreasing order. The corresponding values of hi for each x(i) are shown as the decreasing step-function. The obtained value of S(x, h) is shown as the horizontal dotted line, while an observed value of y is shown as the horizontal solid line. Since the line for y intersects the x distribution between x(2) and x(3) , the value of a is 3, which is the same as b and the value of Rh is ha − y = h3 − y. 9

y S(x,h)

aux resid



0.6 0.0

0.2

0.4

Value

0.8

1.0

x h

2

4

6

8

10

Index

Figure 1: The introduced residual Rh (the auxiliary error) indicated by the arrow and calculated with respect to an input vector x and fuzzy measure h. The residual is the difference between y and the relevant weight h3 , where x(2) < y < x(3) .

The amount that Rh overestimates |S(x, h)−y| will largely depend on how close a is to the actual index that is used in the Sugeno integral calculation (i.e., the i such that S(x, h) = min(x(i) , hi )). However, as we will show in the following section, with the help of an additional parameter, we can use Rh to help us minimize the actual maximum error for any given dataset. 4. Minimizing the maximum absolute error The auxiliary residual Rh presented in the previous section can be used as the basis of linear constraints that will allow us to fit the Sugeno integral to data such that the maximum absolute error is minimized. We start by formulating the program that minimizes the maximum Rh . We let ak , bk denote the values of a and b with respect to the instance 10

(x(k) , y (k) ) as per Def. 4. Then hak refers to the weight ha corresponding to the instance k. Similarly, hbk is the weight corresponding to hb for instance k. We remind that hi corresponds with µ({(i), (i + 1), . . . , (n)}). Now, let r denote the maximum Rh (x(k) , y (k) ) for all k. For the objective we have Minimize r, w.r.t. µ, r

(9)

Then we have the following data constraints: (k)

(k)

For all k such that x(1) 6 y (k) 6 x(n) hak + r > y (k) −hbk + r > −y (k) , (k)

For all k such that y (k) < x(1)

(k)

r > x(1) − y (k)

−h2k + r > −y (k) , (k)

For all k such that y (k) > x(n) (k)

r > y (k) − x(n)

hnk + r > y (k) .

Following this we can add constraints to ensure monotonicity and Pbound- ary conditions of the fuzzy measure (as in (7)). In general there are ns=2 ns monotonicity constraints. In the case of symmetric fuzzy measures, we need only consider the n decision variables h1 , h2 , . . . , hn (i.e., each hik will be the same for all k) along with r, while for general fuzzy measures, we can consider the fuzzy measure values µ({1}), µ({2}), µ({1, 2}), µ({3}), . . . , µ({1, . . . , n}) (using any appropriate indexing system) along with r. For data that can be interpolated, an objective of 0 will be obtained in (5), however, as stated previously, r will generally overestimate the maximum absolute residual. To fit the data so that the actual maximum residual is minimized, we introduce an additional parameter and bilevel optimization approach. 11

Consider a non-negative value t such that we now associate the interval [y − t, y (k) + t] with the observation y (k) . We let α denote the maximum index such that x(α−1) < y−t and β the minimum index such that x(β) > y+t, i.e. α = max{i|x(i−1) < y − t}, β = min{i|x(i) > y + t} (k)

with x(0) = −∞, x(n+1) = ∞, hn+1 = −∞ by convention. We can now update Rh with respect to t.

+ − Rh (x, y, t) = max(Rh,t (x, y, t), Rh,t (x, y, t)),

(10)

+ − where Rh,t (x, y) and Rh,t (x, y) are given by

Rh+ (x, y, t)

Rh− (x, y, t)

=



max(0, hβ − (y + t)), y + t > x(1) ; max(0, x(1) − (y + t), h2 − (y + t)), y + t < x(1) ,

(11)

=



max(0, (y − t) − hα ), y − t 6 x(n) ; max(0, (y − t) − x(n) , (y − t) − hn ), y − t > x(n) .

(12)

If t = 0, then α and β will be equal to a and b respectively, however if t > 0 then we expand the interval at which Rh = 0. Example 3. Fig. 2 shows the same x, y and h as given in Fig. 1. Now we have α = 1 and β = 4, and a residual value of Rh (x, y, t) = max(h4 − (y + t), (y − t) − h1 ) = h4 − (y + t). Equipped with this construction, we can extend (9) to the following.

12

y S(x,h)

y +/− t aux resid



0.6 0.0

0.2

0.4

Value

0.8

1.0

x h

2

4

6

8

10

Index

Figure 2: The auxiliary error calculated with respect to the same input vector x and induced weights h, where an additional parameter t is used to create an interval around y. The parameter reduces the difference and also changes the relevant weight, which is now h4 with x(1) > y − t and x(4) > y + t. Once t becomes equal to or greater than the absolute error, i.e., t > |S(x, h) − y|, the relevant weight will be h5 (or h6 , h7 , etc.) and Rh (x, y, t) = 0.

13

Minimize r

w.r.t. µ, r

s.t. hαk + r > y

(k)

− t,

(13) ∀k|y

hnk + r > y (k) − t,

(k)

∀k|y (k) − t > (k)

r > y (k) − t − x(n) ,

−hβk + r > −(y (k) + t), (k)

> > = =

(k)

∀k|y (k) − t > x(n) (k)

∀k|y (k) + t > x(1)

r > x(1) − (y (k) + t),

−h2k + r µA − µA\{i} µ∅ µN

−t6

(k) x(n) (k) x(n)

(k)

∀k|y (k) + t < x(1) (k)

−(y (k) + t) ∀k|y (k) + t < x(1) 0, ∀i ∈ A, ∀A ⊆ N 0 1.

For a fixed t, all index values αk , βk are determined and (13) reduces to a linear program. As a bilevel optimization problem we hence have the objective of minimizing t such that r = 0 in (13). For an optimal t = t∗ , all choices t > t∗ will also result in r = 0 and hence bisection search is an appropriate method for this step. A high degree of precision can be obtained using only 10-20 iterations. When solving the LP for a given t, we will require the same number of monotonicity constraints and a similar number of data constraints (either 2 or 3 data constraints per observed instance) as we do for the Choquet integral formulation detailed in Section 2. The monotonicity constraints are independent of t, however, and so can be stored rather than needing to generate them at each iteration. We hence have an efficient and tractable way of solving the maximum residual fitting problem for the Sugeno integral. Increasing the number of bisections allows us to achieve this to an arbitrary level of precision. The maximum error can be a useful fitting criterion in itself when it comes to finding a solution that achieves a certain level of performance across all the data, i.e. where t is the error in the worst case scenario. This also makes particular sense for ordinal data, since it allows us to state that the function predicts values within a given number of steps with respect to the ordinal scale. 14

5. Application to least median residual In addition to the bilevel solution of (13) providing us with a fuzzy measure that guarantees a level of performance across the dataset, it can also be incorporated into algorithmic approaches to fuzzy measure learning that achieves the Least Median (of Squares) residual (LMS). The LMS as a fitting criterion was proposed by Rousseeuw and others [21–23] in the context of linear regression to alleviate the effect that outliers can have on fitting accuracy. In [24] we explored a number of algorithms for weights estimation for the weighted arithmetic mean based on random sampling and fitting to the reduced dataset. The following algorithm adapts these to fitting the Sugeno integral with the aim of minimizing the LMS. For simplicity we will use the term median residual when referring to the median absolute residual, which corresponds with the median of squares residual for purposes of fitting. Algorithm 1. Fuzzy measure weights minimizing LMS Input: a dataset consisting of m observations (x(k) , y (k) ), sampling size (s), number of bisections (B), number of iterations (Q) Output: Estimate of best fitting fuzzy measure (µ∗ ) and the median residual (rMed ) 1. Initialize the fuzzy measure µ∗ as the additive fuzzy measure with equal weights allocated to the singletons; 2. Calculate and initialize the best residual rMed , where rMed = Med({|Sµ (x(k) ) − y (k) |, k = 1, . . . , m});

3. For each of Q iterations: (a) Randomly sample s instances (x(k) , y (k) ) from the dataset; (b) Use bisection search (B bisections) to minimize t such that r = 0 in (13) across the s observations and obtain µS ; (c) Calculate the residuals, r(k) = |SµS (x(k) ) − y (k) | and let rS = Med(r(1) , . . . , r(m) ); (d) If rS < rMed then replace rMed with rS and replace µ∗ with µS ; 4. Return µ∗ along with rMed . The least median residual can be seen as meaningful in the case of ordinal data since it corresponds to 50% of the data being calculated within a natural distance along the chain or ordinal values. 15

6. Comparison with genetic algorithms approach To evaluate the usefulness of Algorithm 1, we compare with a genetic algorithms learning approach. 6.1. Algorithm formulation The genetic algorithm proposed in [15] stands as the only method for fitting Sugeno integrals to data with respect to general fuzzy measures (to our knowledge and aside from our own recent explorations). It was actually developed for a generalized version of the Sugeno integral where the inputs and fuzzy measure weights are expressed as triangular membership values. Here we adapt it to real values in the unit interval and calculate the goodnessof-fit according to the median residual. Algorithm 2. Fuzzy measure weights by genetic algorithms Input: a dataset consisting of m observations (x(k) , y (k) ) observations, initial fuzzy measure (FM) population size (M ), number of parent FMs (P ), number of generations (G), number of additional random FMs at each generation (C), probability of an offspring FM undergoing crossover (λ1 ), probability of subset exchange (λ2 ), probability of an offspring FM being shifted (λ3 ), probability of subset shift (λ4 ), factor of shift (λ5 ), probability of offspring undergoing mutation (λ6 ), probability of subset mutation (λ7 ) Output: Estimate of best fitting fuzzy measure (µ∗ ) and the median residual (rMed ) 1. Define the function repair(µ) such that for all subsets A ⊆ N , |A| > 2, the subset µ(A) is replaced with max(µ(B)). This ensures the fuzzy B⊆A

measure satisfies monotonicity; 2. Initialize the starting population by generating M random vectors of length 2n − 1 and apply µ0 = repair(µ) for all M vectors; 3. For each of G generations: (a) Generate C random vectors of length 2n − 1 and apply repair, then add these to the current population; (b) Evaluate the median residual corresponding to each member of the population; 16

(c) Determine the best performing P fuzzy measures (parents) and duplicate in random order to give the set of offspring; (d) Crossover i. For each pair of offspring (pairs (µ1 , µ2 ), (µ3 , µ4 ), . . . , (µP −1 , µP )) determined by the random order, with probability λ1 , the pair undergoes crossover; ii. If the pair undergoes crossover, for all A ⊆ N , with probability λ2 , exchange µi (A) with µj (A) iii. If the pair underwent crossover, apply repair; (e) Shift i. For each of the P offspring, perform shifting with probability λ3 ; ii. If the offspring undergoes shifting, for each A ⊆ N shift the value µ(A) with probability λ4 and set µ0 (A) = µ(A) + λ5 · U (−1, 1) where U (−1, 1) represents a random value in [−1, 1] drawn uniformly; iii. If the offspring underwent shifting, apply repair; (f) Mutate i. For each of the P offspring, perform mutation with probability λ6 ; ii. If the offspring undergoes mutation, for each A ⊆ N with probability λ7 , set µ0 (A) = (1 − α)µ(A) + α max µ(B), where |B|=|A|

α is drawn from a uniform distribution U (0, 1); iii. If the offspring underwent mutation, apply repair; (g) Combine the parents and offspring and set as the new population; 4. Evaluate the fitness of the final population according to the median residual; 5. Return the best µ∗ along with its associated median residual rMed . Each time a fuzzy measure is generated or altered, the algorithm requires all subsets |A| > 2 to be checked for monotonicity and potentially repaired. This operation in particular can result in the algorithm being comparatively slow, however the random variations mean that with enough time, it should converge to a well fitting fuzzy measure.

17

6.2. Experimental study To give an idea of the comparative behavior of the two algorithms, we performed numerical experiments in the following way. For each experiment, a fuzzy measure was randomly generated with µ(A) each taken from a uniform distribution U (0, 1) and multiplied by |A|/n. These values were then transformed such that µ0 (A) = (µ(A))q with q being a random value drawn from [0.5, 1.5] uniformly. The fuzzy measure’s monotonicity was then ensured using the repair function described in Algorithm 2. Each dataset was comprised of 500 instances x with the xi values taken from a uniform distribution, and each observed y value calculated using Sµ (x). Random Gaussian noise was then added to each y with 0 mean and σ = 0.1. Values below 0 or above 1 were fixed to 0 and 1 respectively and all values (for the fuzzy measure and the observed vectors) were rounded to 5 decimal places. We then applied Algorithms 1 and 2 with 10, 100 and 1000 iterations/generations for n = 4, n = 6 and n = 8. For Algorithm 1 we set the sample size to 50 and the number of bisections to 10. For Algorithm 2, based on the parameters used in [15] we set the initial population to M = 50, number of parents to P = 20, additional random FMs to C = 5, then probability and extent parameters to λ1 = 0.3, λ2 = 0.5, λ3 = 0.2, λ4 = 0.5, λ5 = 0.2, λ6 = 0.7, λ7 = 0.25. For each combination of n and iterations/generations, we ran the experiment 100 times in the R environment [25]. Table 1 summarizes the results. Values represent the average, and standard deviation in parentheses, over 100 trials. Median residuals are rounded to 3 (4) decimal places. Times are in seconds and rounded to 2 (3) decimal places. Bold indicates lowest average median residual obtained for each setting, however we should stress that direct comparisons between these approaches in terms of time taken and number of iterations/generations would be misleading. The experiments and results rather can give some indication about how the algorithms scale with increasing n. With these values of n, one iteration for Algorithm 1 runs far quicker than a generation for Algorithm 2. As can be seen from the table, the LP-based method with sampling is able to achieve a good level of accuracy quite quickly. Even in the case of n = 8 (255 unknown fuzzy measure values), an average median residual of 0.080 was obtained after 10 iterations (within about 4 seconds). For this many variables, the genetic algorithms approach did not get close to this level of fitting performance, even after 1000 iterations (which took approximately 28 minutes). However for fewer variables, although it was much slower, it was able to obtain better performance. The progress of each of the algorithms 18

Table 1: Comparison of fitting performance (median residual rMed ) and time taken (in seconds) for the LP-based and genetic algorithms approaches. LP-based n 4 6 8

10 iterations rMed time 0.077 (0.0047) 0.24 (0.031) 0.083 (0.0067) 0.43 (0.018) 0.080 (0.0076) 3.96 (0.357)

Genetic Alg. n 4 6 8

10 generations rMed time 0.088 (0.0144) 9.27 (0.220) 0.148 (0.0328) 11.33 (0.249) 0.200 (0.0335) 17.98 (0.367)

100 iterations rMed time 0.073 (0.0042) 2.22 (0.062) 0.079 (0.0044) 4.09 (0.125) 0.078 (0.0077) 36.25 (2.721)

1000 iterations rMed time 0.069 (0.0036) 22.24 (0.462) 0.077 (0.0036) 40.82 (1.658) 0.078 (0.0064) 270.39 (13.1973)

100 generations rMed time 0.067 (0.0041) 82.65 (0.918) 0.108 (0.0215) 103.42 (1.210) 0.174 (0.0309) 163.67 (1.410)

1000 generations rMed time 0.062 (0.0040) 830.33 (24.100) 0.071 (0.0081) 1039.24 (29.541) 0.122 (0.0270) 1649.35 (34.044)

as the number of iterations and generations are increased is illustrated by Figs. 3(a)-(d). Fig. 3(a) shows the progress for a single experiment with n = 6 for Algorithms 1 and 2. We see that the best value of rMed was found within the first 100 iterations while for the genetic algorithm we see gradual improvements made over the entire 1000 generations (which takes about 15 minutes compared to less than a second using Algorithm 1). Figs. 3(b)-(d) show the average results over 100 trials for n = 4, n = 6 and n = 8. Similar to the single example, overall we see that Algorithm 1 finds its best solution fairly quickly while Algorithm 2 gradually improves. While the LP-based approach has clear advantages in terms of time taken and achieves a good level of performance for all values of n, one can see the potential to combine the two methods. As well as generating fuzzy measures randomly in Step 3(a) of Algorithm 2, a pool of fuzzy measures found using Algorithm 1 could be included in the population. This would mean a good level of performance is already achieved in the first few generations and additional generations can be used to provide incremental improvements. 7. Data analysis example: Comparison with the Choquet integral Tools for using the Sugeno integral in data analysis have been limited in the past and so there has been little investigation into how the Choquet integral and Sugeno integral may compare when inferring values from the fuzzy measure values fit to data, i.e. both functions are defined with respect to fuzzy measures and fuzzy measures are usually interpreted using the Shapley 19

0.15

median residual

0.20

LP−based approach Genetic Algorithm

0.05

0.10

0.15 0.10 0.05

median residual

0.20

LP−based approach Genetic Algorithm

0

200

400

600

800

1000

0

iterations / generations

600

800

1000

(b) Average results for n = 4 LP−based approach Genetic Algorithm

0.15 0.10 0.05

0.05

0.10

0.15

median residual

0.20

LP−based approach Genetic Algorithm

0.20

400

iterations / generations

(a) Single experiment with n = 6

median residual

200

0

200

400

600

800

1000

0

iterations / generations

200

400

600

800

1000

iterations / generations

(c) Average results for n = 6

(d) Average results for n = 8

Figure 3: Comparison of fitting performance with the number of iterations / generations for Algorithms 1 and 2. Note that the time required for running 1000 generations in the genetic algorithm (Algorithm 2) is substantially longer than 1000 iterations of the LP-based approach (Algorithm 1).

indices, but the choice of function will have an impact on the best fitting fuzzy measure.

20

7.1. The dataset We consider 5 variables of the Boston Housing Data, which consists of 506 instances with 4 predictors: 1. 2. 3. 4.

INDUS — proportion of non-retail business acres per town, RM — average number of rooms per dwelling, PTRATIO — pupil-teacher ratio by town, LSTAT — % lower status of the population

and 1 target variable: 1. MEDV — median value of owner-occupied homes in $1000s. The variables then were transformed using sigmoid functions x0i = 1/(1 + exp(axi + b)) where a and b are learned automatically for each variable by minimizing the squared difference between axi + b and log(1/y − 1), approximating the values that maximize the correlation between each variable and y (k) after the y values are scaled to the unit interval. 7.2. Fitting results We then follow the steps in Algorithm 1 where step (3b) is replaced with fitting the Choquet integral as well as the Sugeno integral for comparison. The sample size is set at 16 (the number of fuzzy measure values when n = 4) and the number of iterations is set to 1000. The least median residual obtained for the Choquet integral was 0.0486 while the least median residual for the Sugeno integral was 0.0529 (to put this in perspective, fitting a Choquet integral using the least squared residual criterion has a median residual of 0.060). Predicted values Sµ (x), Cµ (x) against observed y (transformed) are shown in Figs. 4 (a)-(b). We can now consider interpretation of the fuzzy measure values and associated Shapley indices, shown in Tables 2 and 3. The fuzzy measure values are usually interpreted as indicating the importance of each coalition. However in making such interpretations, it should be understood that the Choquet integral weights behave more similarly to a weighted mean, where values are multiplied by the given weights, while the Sugeno integral weights result in median-like behavior. In the case of the Sugeno integral, the weight of 0.648 for µ({2, 3, 4}) (the highest of the 3-tuples) means that the function cannot exceed 0.648 unless all 4 of the inputs are above 0.648 – having only variables 2, 3 and 4 above 0.648 will still 21

1.0

1.0

0.8

0.8

● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●● ●● ● ● ● ● ●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ●●● ● ● ● ●● ● ● ●● ●● ● ● ● ● ●●● ● ●● ●● ● ●●● ● ●● ● ● ●●● ● ● ●● ● ● ● ●●● ● ●● ●● ● ●● ●●●● ● ● ● ●● ● ●● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ●●●● ●● ● ●● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ● ●● ●● ● ●● ● ● ● ●●● ● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ●●● ●●● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ●● ● ●● ● ●● ●● ●●● ●● ●●● ●● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●●●●●●● ●● ● ● ● ●● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ●●●● ● ●● ●● ● ● ● ● ● ●● ● ●● ●● ● ● ● ●● ●● ●● ●● ●●● ● ● ● ● ●●●●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ● ● ● ● ● ●● ●● ●● ● ● ● ● ● ●● ●● ● ● ●● ● ●● ●● ● ● ● ●●●● ●●



0.0

0.6 0.4 0.2

predicted

●● ●● ● ●●●● ●● ●● ●● ● ●● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ●● ●● ● ● ● ● ● ●● ● ●● ●● ● ● ●●● ●● ● ● ● ● ●● ●● ● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ●● ●● ●● ● ●●● ●● ●● ● ●●●● ● ●● ● ● ● ● ●● ●● ● ● ● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ● ● ● ●● ● ● ● ●●● ● ● ● ● ●● ● ● ●● ● ● ●● ● ● ●● ● ●● ● ● ● ● ●● ● ● ● ●●● ●● ● ● ● ● ● ● ●●●●● ● ● ●● ●● ●●● ● ●● ●● ●●●●●● ●●●●●● ●● ● ●● ● ●● ●● ●● ● ● ●● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ●●●● ● ● ● ●● ● ●● ● ●●● ● ●● ●● ● ● ●● ● ● ●● ● ● ●● ●●●● ●● ● ● ● ● ● ●●●● ● ●● ● ●●● ●● ●● ●●● ●●●●● ●

0.0

0.6 0.4 0.0

0.2

predicted



0.2

0.4

0.6

0.8

1.0



0.0

0.2

observed

0.4

0.6

0.8

1.0

observed

(a) Sugeno integral

(b) Choquet integral

Figure 4: Predicted against observed values after fitting (a) a Sugeno integral and (b) a Choquet integral with respect to the minimum median residual. Grey area shows values falling within the median residual from the observed output. Table 2: Fuzzy measure values obtained by fitting a Sugeno integral (Sµ ) and Choquet integral (Cµ ) with respect to minimizing the median residual. Note that µ({1, 2, 3, 4}) is fixed to 1 in both cases.

µ µ({1}) µ({2}) µ({3}) µ({4}) µ({1, 2}) µ({1, 3}) µ({1, 4})

Sµ 0 0 0 0.353 0 0.102 0.353

Cµ 0 0 0.173 0.005 0 0.173 0.005

µ µ({2, 3}) µ({2, 4}) µ({3, 4}) µ({1, 2, 3}) µ({1, 2, 4}) µ({1, 3, 4}) µ({2, 3, 4})

Sµ Cµ 0 0.173 0.353 0.682 0.353 0.323 0.173 0.102 0.371 0.682 0.353 0.323 0.648 1

result in an ouput of 0.648. This behavior is observed in the plot, where no predicted values are above 0.648. Each of the weights in the Sugeno integral hence behaves as somewhat of a threshold criterion. The Shapley values for the fitted integrals are similar, although in the case of the Sugeno integral, more importance is placed on the LSTAT variable. Note that with the Choquet integral, the Shapley value indicates the average 22

Table 3: Shapley values for each of the predictor variables calculated from the fuzzy measures in Table 1.

predictor Sµ Cµ

INDUS RM PTRATIO LSTAT 0.106 0.188 0.199 0.507 0 0.338 0.246 0.416

weight applied to a particular variable, however in the case of the Sugeno integral, this interpretation is not as applicable. Here the Shapley index is indicative of the average drop in ha when xi = x(a−1) and hence summarizes, to a degree, the requirement of xi to be high. 8. Conclusion We have introduced a bilevel programming method for determining the values of a general fuzzy measure from data for aggregation with the Sugeno integral. The optimization is performed with the objective of minimizing the maximum error between the observed and predicted values, which in turn can be applied to minimizing the median residual (or any aggregation of residuals that can be calculated for potential µ). The median and maximum residual based objectives are meaningful whether the data are expressed over a real or ordinal scale, and allow us to apply the Sugeno integral as a model for data analysis and regression. Furthermore, such methods are feasible, relative to the complexity of the fuzzy measure fitting problem in general. The same methodology can be applied to symmetric fuzzy measures with the usual benefits of reducing the number of variables that need to be identified. We showed that the proposed method had significant time savings compared to a genetic algorithms approach with random population generation, and that it was practical and efficient for higher values of n. We also showed an example in data analysis to compare with the Choquet integral and the interpretations that can be drawn from the two methods. For future research, we are interested to further explore how simplification strategies could be used in conjunction with Algorithm 1 for larger datasets as well as the robustness of the Sugeno integral for prediction problems when fit using the least median residual.

23

Acknowledgment Marek Gagolewski wishes to acknowledge the support by the Czech Science Foundation through the project No.18-06915S. References References [1] M. Sugeno, Theory of fuzzy integrals and its applications, Ph.D. thesis, Tokyo Institute of Technology, 1974. [2] J. E. Hirsch, An index to quantify individual’s scientific research output, Proceedings of the National Academy of Sciences 102 (46) (2005) 16569– 16572. [3] V. Torra, Y. Narukawa, The h-index and the number of citations: Two fuzzy integrals, IEEE Transactions on Fuzzy Systems 16 (3) (2008) 795– 797. [4] M. Gagolewski, R. Mesiar, Monotone measures and universal integrals in a uniform framework for the scientific impact assessment problem, Information Sciences 263 (2014) 166–174. [5] G. Choquet, Theory of Capacities, Ann. Inst. Fourier 5 (1953) 1953– 1954. [6] G. Beliakov, S. James, Citation-based journal ranks: The use of fuzzy measures, Fuzzy Sets and Systems 167 (2011) 101–119. [7] D. Liginlal, T. Ow, Modeling attitude to risk in human decision processes: An application of fuzzy measures, Fuzzy Sets and Systems 157 (2006) 3040–3054. [8] R. Yang, Z. Wang, P. Heng, K. Leung, Fuzzified Choquet integral with a fuzzy-valued integrand and its application on temperature prediction, IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics 38 (2) (2008) 367–380.

24

[9] M. Grabisch, I. Kojadinovic, P. Meyer, A review of methods for capacity identification in Choquet integral based multi-attribute utility theory: Applications of the Kappalab R package, European Journal of Operational Research 186 (2) (2008) 766–785. [10] G. Beliakov, Construction of aggregation functions from data using linear programming, Fuzzy Sets and Systems 160 (1) (2009) 65–75. [11] J. Keller, J. Osborn, Training the fuzzy integral, International Journal of Approximate Reasoning 15 (1996) 1–24. [12] E. Saleh, A. Valls, A. Moreno, P. Romero-Aroca, V. Torra, H. Bustince, Learning fuzzy measures for aggregation in fuzzy rule-based models, in: V. T. et al. (Ed.), MDAI 2018, LNAI 11144, 114–127, 2018. [13] M. Gagolewski, S. James, Fitting Symmetric Fuzzy Measures for Discrete Sugeno Integration, in: Advances in Intelligent Systems and Computing, vol. 642, Springer, 104–116, 2018. [14] M. Gagolewski, S. James, G. Beliakov, Supervised learning to aggregate data with the Sugeno integral, IEEE Transactions on Fuzzy Systems 27 (4) (2019) 810–815. [15] D. Anderson, J. Keller, T. Havens, Learning fuzzy-valued fuzzy measures for the fuzzy-valued Sugeno fuzzy integral, Lecture Notes in Artificial Intelligence 6178 (2010) 502–511. [16] G. Beliakov, M. Gagolewski, S. James, Aggregation on ordinal scales with Sugeno integral for biomedical applications, Under review . [17] G. Beliakov, M. Gagolewski, S. James, DC optimization for constructing discrete Sugeno integrals and learning nonadditive measures, Under review . [18] L. Shapley, A value for n-person games, Contributions to the Theory of Games 2 (28) (1953) 307–317. [19] M. Grabisch, M. Roubens, An axiomatic approach to the concept of interaction among players in cooperative games, International Journal of Game Theory 28 (4) (1999) 547–565.

25

[20] G. Beliakov, S. James, J.-Z. Wu, Discrete Fuzzy Measures - Computational Aspects, Springer Nature, Cham, Switzerland, 2019. [21] P. J. Rousseeuw, Least median of squares regression, Journal of the American Statistical Association 79 (388) (1984) 871–880. [22] G. E. Dallal, P. J. Rousseeuw, LMSMVE: A program for Least median of squares regression and robust distances, Computers and Biomedical Research 25 (1992) 384 – 391. [23] P. J. Rousseeuw, M. Hubert, Recent developments in PROGRESS, Statistical Procedures and Related Topics, IMS Lecture Notes – Monograph Series 31 (1997) 201–214. [24] M. Gagolewski, S. James, G. Beliakov, Least median of squares (LMS) and least trimmed squares (LTS) fitting for the weighted arithmetic mean, in: M. J. et al. (Ed.), Information Processing and Management of Uncertainty in Knowledge-Based Systems. Theory and Foundations. IPMU 2018. Communications in Computer and Information Science, vol 854, Springer, Cham, 367–378, 2018. [25] R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org, 2019.

26

Declaration of interests ☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. ☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: