Entropy for non-additive measures in continuous domains

Vicenç Torra

School of Informatics, University of Skövde, Skövde, Sweden
E-mail addresses: [email protected], [email protected]
Received 29 January 2016; received in revised form 11 September 2016; accepted 1 October 2016

Abstract

In a recent paper we introduced a definition of f-divergence for non-additive measures. In this paper we use this result to give a definition of entropy for non-additive measures in a continuous setting. It is based on the Kullback–Leibler divergence for this type of measure. We prove some properties and show that the definition can be used to find a measure satisfying the principle of minimum discrimination.
© 2016 Published by Elsevier B.V.

Keywords: Entropy; KL-divergence; Non-additive measures

1. Introduction
Entropy is an important concept in information theory, defined for probability distributions. It measures the difference in the quantity of information before and after a data transmission. It is also used in statistics to help define a probability distribution under some constraints, through the application of the maximum entropy principle [2]. For example, the Gaussian distribution maximizes the entropy over all distributions with the same variance (see e.g. [2], p. 255 and Ch. 12). For continuous probability distributions the principle of minimum discrimination is used instead, which is based on the Kullback–Leibler divergence [11].

Non-additive measures [21,3,19] (also known as capacities and as fuzzy measures) generalize additive measures, and thus probabilities, replacing additivity by monotonicity. They have been applied in a large variety of contexts (computer vision, decision making, economics). At present there exist several generalizations [24,12] of the entropy for discrete non-additive measures; see also [8] for an overview. Nevertheless, to the best of our knowledge there are no definitions available for measures on continuous domains.

In this paper we focus on this problem. We introduce a definition of entropy for non-additive measures on infinite sets. The definition is rooted in our recent definition [23] of the f-divergence for non-additive measures.

The structure of the paper is as follows. In Section 2 we present some definitions needed later on, and in Section 3 we introduce our definition and give some properties. Section 4 generalizes the principle of minimum discrimination to non-additive measures. The paper finishes with some conclusions and lines of future work.


2. Preliminaries

In this section we review some results on non-additive measures and integrals, divergences on measures, and entropies. See [19,21,3,5,18] for details.

2.1. Non-additive measures and the Choquet integral

A non-additive measure is a monotonic set function.

Definition 1. Let (Ω, F) be a measurable space. A set function μ defined on F is called a non-additive measure (or fuzzy measure) if and only if

1. 0 ≤ μ(A) ≤ ∞ for any A ∈ F;
2. μ(∅) = 0;
3. if A1 ⊆ A2 with A1, A2 ∈ F, then μ(A1) ≤ μ(A2).

Distorted Lebesgue measures, introduced in [7], are an example of non-additive measures. They are defined in terms of the Lebesgue measure λ and a non-decreasing distortion function: μ is a distorted Lebesgue measure if it can be expressed as μ = m ∘ λ, where m is a non-decreasing distortion function and λ is the Lebesgue measure. Recall that the Lebesgue measure of an interval [a, b] is λ([a, b]) = b − a.

Some other types of measures are useful in this paper.

Definition 2. Given a non-additive measure μ,

1. we say that μ is submodular if
μ(A) + μ(B) ≥ μ(A ∪ B) + μ(A ∩ B);
2. we say that μ is supermodular if
μ(A) + μ(B) ≤ μ(A ∪ B) + μ(A ∩ B).

The Choquet integral of a function with respect to a non-additive measure is defined below. Using the Choquet integral, we can consider the derivative of a non-additive measure with respect to another.

Definition 3. [1] Let (Ω, F) be a measurable space and let ν, μ : F → R⁺ be non-additive measures. We say that ν is a Choquet integral of μ if there exists a measurable function g : Ω → R⁺ with
$$\nu(A) = (C)\int_A g \, d\mu \qquad (1)$$
for all A ∈ F.

In this paper we need the following result related to the Choquet integral.

Theorem 1. [1,3,5] Let μ be a non-additive measure on (R, B), and let f, g be non-negative measurable functions. Then, the following properties hold.

1. When μ is submodular, then
$$(C)\int (f+g) \, d\mu \le (C)\int f \, d\mu + (C)\int g \, d\mu.$$
2. When μ is supermodular, then
$$(C)\int (f+g) \, d\mu \ge (C)\int f \, d\mu + (C)\int g \, d\mu.$$
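For distorted Lebesgue measures, the Choquet integral can be approximated numerically through its level-set (horizontal) representation, since μ(A) = m(λ(A)) depends only on the Lebesgue measure of each level set. The following Python sketch illustrates the idea under these assumptions; the function names and the grid scheme are ours, not the software of [20].

```python
import numpy as np

def choquet_distorted_lebesgue(f, m, a=0.0, b=1.0, n=10_000):
    """Approximate the Choquet integral of a non-negative f on [a, b] with
    respect to m o lambda, using the level-set formulation:
    (C)-integral of f = integral over t of mu({x : f(x) >= t})."""
    dx = (b - a) / n
    xs = a + dx * (np.arange(n) + 0.5)          # midpoint grid on [a, b]
    v = np.append(np.sort(f(xs))[::-1], 0.0)    # sampled values, largest first
    # The top-i sample points have Lebesgue measure ~ i*dx, so the measure
    # of the level set {f >= v_i} is ~ m(i*dx) for a distorted Lebesgue mu.
    mu_levels = m(dx * np.arange(1, n + 1))
    return float(np.sum((v[:n] - v[1:]) * mu_levels))

# Sanity checks: m(x) = x recovers the Lebesgue integral of x**2 on [0, 1].
print(choquet_distorted_lebesgue(lambda x: x**2, lambda t: t))       # ~0.3333
print(choquet_distorted_lebesgue(lambda x: x**2, lambda t: t**0.5))  # distorted
```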

We will also need the concept of derivative of a non-additive measure with respect to another one.

Definition 4. Let μ and ν be two non-additive measures. If ν is a Choquet integral of μ, and g is a function such that Equation (1) is satisfied, we write
$$d\nu/d\mu = g, \qquad (2)$$
and we say that g is a derivative of ν with respect to μ.

Several authors have considered the conditions that permit us to find the derivative of a non-additive measure with respect to another one. For example, Graf [6] focused on subadditive measures and gives a theorem (Theorem 4.3) with necessary and sufficient conditions for this derivative to exist. Rébillé [15] considers the case of almost subadditive set functions of bounded sum. Sugeno [16] considered the same problem for distorted Lebesgue measures. We will use the following result.

Proposition 1. (Proposition 4 in [16]) Let f(t) be a continuous and increasing function with f(0) = 0, and let μm be a distorted Lebesgue measure. Then there exists an increasing (non-decreasing) function g such that
$$f(t) = (C)\int_{[0,t]} g(\tau) \, d\mu_m,$$
and the following holds:
$$G(s) = F(s)/(s M(s)). \qquad (3)$$
Here, F(s) is the Laplace transform of f, M the Laplace transform of m, and G the Laplace transform of g. That is,
$$g(t) = \mathcal{L}^{-1}[F(s)/(s M(s))].$$
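Proposition 1 can also be checked symbolically. The following SymPy sketch computes the derivative g for an illustrative pair of our own choosing (distortion m(t) = t² and f(t) = t³); SymPy may keep a Heaviside(t) factor in the result.

```python
import sympy as sp

t, s = sp.symbols('t s', positive=True)

m = t**2   # illustrative distortion (our choice)
f = t**3   # illustrative target function (our choice)

M = sp.laplace_transform(m, t, s, noconds=True)   # 2/s**3
F = sp.laplace_transform(f, t, s, noconds=True)   # 6/s**4
g = sp.inverse_laplace_transform(F / (s * M), s, t)

print(sp.simplify(g))  # 3*t (possibly times Heaviside(t)): the derivative g
```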

2.2. Divergences

For discrete probability distributions P and Q, the f-divergence is defined as follows.

Definition 5. Let P = (p1, ..., pk) and Q = (q1, ..., qk) be two probability distributions, and let f be a convex function with f(1) = 0. Then, the discrete f-divergence is defined by
$$D_f(P, Q) = \sum_{i=1}^{k} q_i \, f\!\left(\frac{p_i}{q_i}\right).$$

When f(x) = x log x we get the Kullback–Leibler divergence. Its expression is as follows:
$$KL(P, Q) = \sum_{i=1}^{k} p_i \log\left(\frac{p_i}{q_i}\right).$$

Similar expressions exist for the continuous case. They use the derivative of one measure with respect to the other, that is, the Radon–Nikodym derivative. The most general expression for the Kullback–Leibler divergence is given in terms of a third measure μ. In this case, if P and Q are absolutely continuous with respect to μ, then
$$KL_\mu(P, Q) = \int_X p \log\frac{p}{q} \, d\mu,$$
where p = dP/dμ and q = dQ/dμ. That is, p and q are the Radon–Nikodym derivatives of P and Q with respect to μ. For additive measures, this definition does not depend on the measure μ.
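A direct transcription of Definition 5 and of its Kullback–Leibler case could look as follows (a sketch assuming strictly positive pi and qi; the helper names are ours):

```python
import numpy as np

def f_divergence(p, q, f):
    """Discrete D_f(P, Q) = sum_i q_i * f(p_i / q_i) of Definition 5."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

def kl(p, q):
    """Kullback-Leibler divergence: the f-divergence with f(x) = x log x."""
    return f_divergence(p, q, lambda x: x * np.log(x))

print(kl([0.2, 0.5, 0.3], [1/3, 1/3, 1/3]))  # KL for a toy pair (P, Q)
```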

2.3. Entropies

Shannon introduced the entropy of a discrete probability distribution p as
$$H(p) = -\sum_i p_i \log p_i,$$
and then defined in an analogous way the entropy of a continuous distribution p as
$$H(p) = -\int p(x) \log p(x) \, dx.$$

This is known as the differential entropy. This definition is translation invariant: if we define p′(X + c) = p(X), the differential entropy of p and p′ is the same. However, several properties of the discrete entropy do not hold in the continuous case. One is that the entropy is not invariant under some coordinate transformations; another is that the entropy can be negative. In addition, Jaynes [9,10] shows that the definition for the continuous case is not the limit of the expression for discrete probabilities. As an alternative, Jaynes [9,10] introduced the following definition for continuous distributions, which can be expressed as the limit of discrete distributions. In addition, this definition is invariant under some transformations.

Definition 6. Let p be a probability distribution, and let m be a density function. Then, the entropy of p with respect to m is defined by
$$H_J(p) = -\int p(x) \log\frac{p(x)}{m(x)} \, dx.$$

When m is a probability distribution, Definition 6 is the negative of the Kullback–Leibler divergence. HJ is also known as relative entropy. In the discrete case, the Kullback–Leibler divergence and the entropy are different, but the following equality can be proven.

Proposition 2. Let X be a reference set and p a probability distribution on a finite set X. Then
$$H(p) = \log N - KL(p, u),$$
where N is the cardinality of X (i.e., N = |X|) and u is the uniform distribution (i.e., u(x) = 1/N for all x ∈ X).
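A quick numerical check of Proposition 2, with a toy distribution of our own:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])   # an arbitrary toy distribution
N = len(p)
u = np.full(N, 1.0 / N)         # the uniform distribution on X

H = -np.sum(p * np.log(p))      # discrete Shannon entropy H(p)
KL = np.sum(p * np.log(p / u))  # KL(p, u)

print(np.isclose(H, np.log(N) - KL))  # True: H(p) = log N - KL(p, u)
```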

The principle of maximum entropy establishes that, given a set of constraints, we have to select the probability distribution that maximizes the entropy. The rationale is that this is the distribution that "leaves you the largest remaining uncertainty" [14].

3. Continuous entropy for non-additive measures

In this section we introduce a definition of continuous entropy. We base it on the definition of the KL-divergence for non-additive measures, which is in turn based on the f-divergence. Therefore, let us start by reviewing the f-divergence for continuous non-additive measures. We introduced this definition in [23].

Definition 7. Let μ1 and μ2 be two non-additive measures that are Choquet integrals of ν, and let f be a convex function. The f-divergence between μ1 and μ2 is defined as
$$D_{f,\nu}(\mu_1, \mu_2) = (C)\int \frac{d\mu_2}{d\nu} \, f\!\left(\frac{d\mu_1/d\nu}{d\mu_2/d\nu}\right) d\nu.$$
Here dμ1/dν and dμ2/dν are the derivatives of μ1 and μ2 with respect to ν according to Definition 4.
Different functions f lead to different divergences and distances. As in the case of additive measures, with f(x) = (√x − 1)² we have the Hellinger distance, with f(x) = |x − 1| the variation distance, and with f(x) = x log x, f(x) = x^α, and f(x) = (x − 1)² we have expressions for the Kullback–Leibler, Rényi, and χ² distances, respectively. The f-divergence was proposed and studied in [23], where some properties of the Hellinger distance were also proven. In this paper we use the Kullback–Leibler divergence. It corresponds to the following definition.

Definition 8. Let μ1 and μ2 be two non-additive measures that are Choquet integrals of ν. Then, the Kullback–Leibler divergence between μ1 and μ2 with respect to ν is defined as
$$KL_\nu(\mu_1, \mu_2) = (C)\int \frac{d\mu_1}{d\nu} \log\!\left(\frac{d\mu_1/d\nu}{d\mu_2/d\nu}\right) d\nu.$$
Here dμ1/dν and dμ2/dν are the derivatives of μ1 and μ2 with respect to ν according to Definition 4.
It is easy to prove that this definition is the f-divergence with f(x) = x log x. So, the following holds.

Proposition 3. The Kullback–Leibler divergence is a particular case of the f-divergence.

If g and h are derivatives of μ1 and μ2 with respect to ν, we will also write, for the sake of simplicity and with an abuse of notation, KLν(g, h).

We need to prove some properties of this definition. The first one is that it is well defined. To prove it, we need to consider what happens when the derivative of μ1 and μ2 with respect to ν is not unique. Graf [6] studied and characterized when the Radon–Nikodym derivative exists for subadditive measures, i.e., when dμ/dν exists for μ and ν subadditive measures (see Theorem 4.3 in [6], based on the strong decomposition property). Graf proves that when the derivative exists, if g1 and g2 are two different derivatives of μ with respect to ν, then ν(g1 ≠ g2) = 0.

In the next proposition we prove that the divergence is well defined because it is the same for pairs of derivatives that are equal except on sets of measure zero. That is, given two pairs of functions g1, h1 and g2, h2 such that ν(g1 ≠ g2) = 0 and ν(h1 ≠ h2) = 0, the divergence is the same.

Proposition 4. Given submodular non-additive measures μ1, μ2 and ν, with g1 and g2 two different derivatives dμ1/dν, and h1 and h2 two different derivatives dμ2/dν, such that
ν({x | g1(x) ≠ g2(x)}) = 0 and ν({x | h1(x) ≠ h2(x)}) = 0,
then
KLν(g1, h1) = KLν(g2, h2).

Proof. First, let us assume that g2 ≤ g1 and h2 ≤ h1. We define γg as the function such that g1 = g2 + γg and, similarly, γh as the function such that h1 = h2 + γh. It is clear that ν({x | γg(x) ≠ 0}) = 0 and, similarly, ν({x | γh(x) ≠ 0}) = 0. For simplicity, we denote the set {x | γg(x) ≠ 0} by γg ≠ 0 and the set {x | γh(x) ≠ 0} by γh ≠ 0. Note that, as ν is submodular, ν({x | γg(x) ≠ 0} ∪ {x | γh(x) ≠ 0}) = 0, because
0 ≤ ν({x | γg(x) ≠ 0} ∪ {x | γh(x) ≠ 0}) ≤ ν({x | γg(x) ≠ 0}) + ν({x | γh(x) ≠ 0}) = 0 + 0 = 0.

Now, we consider
$$KL_\nu(g_1, h_1) = (C)\int g_1 \log\frac{g_1}{h_1} \, d\nu = (C)\int (g_2 + \gamma_g) \log\frac{g_2 + \gamma_g}{h_2 + \gamma_h} \, d\nu$$
and
$$KL_\nu(g_2, h_2) = (C)\int g_2 \log\frac{g_2}{h_2} \, d\nu.$$
The next step is to prove that we can express KLν(g1, h1) as
$$(C)\int \left( g_2 \log\frac{g_2}{h_2} + \gamma \right) d\nu$$
for a certain γ that can only be non-zero either when γg ≠ 0 or when γh ≠ 0. To see this, observe that for x ∉ {(γh ≠ 0) ∪ (γg ≠ 0)} we have that (g2 + γg) log((g2 + γg)/(h2 + γh)) equals g2 log(g2/h2); therefore, for these x we have γ(x) = 0. From now on, we assume that γ(x) ≥ 0 for all x; we would rename (or redefine) g1, g2, h1, and h2 if this were not the case. Now, as ν is submodular, we can apply Theorem 1:
$$(C)\int g_2 \log\frac{g_2}{h_2} \, d\nu = KL_\nu(g_2, h_2) \le KL_\nu(g_1, h_1) = (C)\int \left( g_2 \log\frac{g_2}{h_2} + \gamma \right) d\nu \le (C)\int g_2 \log\frac{g_2}{h_2} \, d\nu + (C)\int \gamma \, d\nu = KL_\nu(g_2, h_2) + (C)\int \gamma \, d\nu. \qquad (4)$$

The fact that γ can only be different from zero on the set (γg ≠ 0) ∪ (γh ≠ 0), which has measure zero, implies that the integral of γ with respect to ν is zero. Therefore,
KLν(g1, h1) = KLν(g2, h2).

We have assumed that g2 ≤ g1 and h2 ≤ h1. Naturally, if g1 ≤ g2 and/or h1 ≤ h2, the same argument applies. Let us now consider the general case, with some x such that g1(x) < g2(x) and others such that g1(x) > g2(x) (a similar discussion applies to h1 and h2). In this case we define gn(x) = min(g1(x), g2(x)) and gx(x) = max(g1(x), g2(x)). It is easy to see that gn, gx, g1 and g2 are all derivatives of μ1 with respect to ν, because defining ga(x) = g1(x) − gn(x) and gb(x) = gx(x) − g1(x), it follows that
$$(C)\int g_n \, d\nu \le (C)\int g_1 \, d\nu = (C)\int (g_n + g_a) \, d\nu \le (C)\int g_x \, d\nu = (C)\int (g_n + g_a + g_b) \, d\nu \le (C)\int g_n \, d\nu + (C)\int g_a \, d\nu + (C)\int g_b \, d\nu = (C)\int g_n \, d\nu.$$
Note that the penultimate inequality follows from the submodularity of ν and Theorem 1, and that the last equality follows from the previous expression because ν(ga ≠ 0) = 0 and ν(gb ≠ 0) = 0. As gn, gx, g1 and g2 are all derivatives of μ1 with respect to ν, the first part of the proof can be applied to the appropriate derivatives, and thus the proposition is proven. □
Following the usual interpretation in information theory, we can understand the Kullback–Leibler divergence as relative entropy. Now, following Jaynes [9,10] and Definition 6, we introduce a definition of entropy of a non-additive measure as follows. While Jaynes uses the minus sign, it is usual (see e.g. [2]) to define it in positive terms. We use this approach here.

Definition 9. Let μ be a non-additive measure that is a Choquet integral of ν. Then, the (relative) entropy of μ with respect to ν is defined as
$$KL_\nu(\mu, \lambda),$$
where λ is the Lebesgue measure.

When μ and ν are both additive, this definition reduces to Jaynes' definition. In general, we can use other additive measures (densities) instead of λ. We use λ because it is the one that corresponds to the uniform distribution (see Proposition 2) in the continuous case. Therefore, the following holds (using the positive sign as in [2]).

Proposition 5. Entropy according to Definition 9 generalizes Jaynes' definition of entropy.

Let us illustrate the definition of the Kullback–Leibler divergence with an example. We use distorted Lebesgue measures so that we can apply Proposition 1 to compute the derivatives.

Example 1. Let μ1 be a distorted Lebesgue measure μ1 = m1 ∘ λ with m1(x) = x², and let ν be a distorted Lebesgue measure ν = n ∘ λ with n(x) = x^{1/2}. Naturally, the distortion of λ itself is l(x) = x. Then, let us consider the computation of dμ1/dν and dλ/dν. Here M(s), N(s), and L(s) denote the Laplace transforms of m1, n, and l. Recall that the Laplace transform of t^p is Γ(p+1)/s^{p+1} for p > −1, and that the inverse Laplace transform of 1/s^v is t^{v−1}/Γ(v). That is, L(t^p) = Γ(p+1)/s^{p+1} and L^{−1}(1/s^v) = t^{v−1}/Γ(v). Therefore, we can compute the following derivatives:
$$\frac{d\mu_1}{d\nu} = \mathcal{L}^{-1}\!\left[\frac{1}{s}\frac{M(s)}{N(s)}\right] = \mathcal{L}^{-1}\!\left[\frac{1}{s}\frac{\Gamma(3)/s^3}{\Gamma(3/2)/s^{3/2}}\right] = \mathcal{L}^{-1}\!\left[\frac{\Gamma(3)}{\Gamma(3/2)}\frac{1}{s^{5/2}}\right] = \frac{\Gamma(3)}{\Gamma(3/2)\,\Gamma(5/2)}\,x^{3/2},$$
$$\frac{d\lambda}{d\nu} = \mathcal{L}^{-1}\!\left[\frac{1}{s}\frac{L(s)}{N(s)}\right] = \mathcal{L}^{-1}\!\left[\frac{1}{s}\frac{1/s^2}{\Gamma(3/2)/s^{3/2}}\right] = \mathcal{L}^{-1}\!\left[\frac{1}{\Gamma(3/2)}\frac{1}{s^{3/2}}\right] = \frac{1}{\Gamma(3/2)\,\Gamma(3/2)}\,x^{1/2}.$$

Now we compute (dμ1/dν)/(dλ/dν), because we need this result below:
$$\frac{d\mu_1/d\nu}{d\lambda/d\nu} = \frac{\frac{\Gamma(3)}{\Gamma(3/2)\,\Gamma(5/2)}\,x^{3/2}}{\frac{1}{\Gamma(3/2)\,\Gamma(3/2)}\,x^{1/2}} = \frac{\Gamma(3)\,\Gamma(3/2)}{\Gamma(5/2)}\,x = \frac{4}{3}\,x.$$

Then, the entropy of μ1 according to Definition 9, that is, the Kullback–Leibler divergence of μ1 and λ with respect to ν, corresponds to
$$KL_\nu(\mu_1, \lambda) = (C)\int \frac{d\mu_1}{d\nu} \log\!\left(\frac{d\mu_1/d\nu}{d\lambda/d\nu}\right) d\nu = (C)\int \frac{\Gamma(3)}{\Gamma(3/2)\,\Gamma(5/2)}\,x^{3/2} \log\!\left(\frac{4}{3}x\right) d\nu = (C)\int \alpha_0 t^{3/2} \log(\alpha_1 t) \, d\nu,$$
where α0 = Γ(3)/(Γ(3/2) Γ(5/2)) and α1 = 4/3. Fig. 1 displays the function α0 t^{3/2} log(α1 t) to be integrated by means of the Choquet integral. Current theory for the Choquet integral does not permit us to find an analytical solution of this integral for non-additive measures ν. Nevertheless, we can compute its value numerically. We have used for this purpose the software introduced in [20].
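The derivatives of this example can also be reproduced symbolically via Proposition 1. This SymPy sketch is ours, not the software of [20]; SymPy may keep Heaviside(t) factors in intermediate results.

```python
import sympy as sp

t, s = sp.symbols('t s', positive=True)

def derivative_wrt_nu(f, n):
    """Derivative df/dnu via Proposition 1: g = L^{-1}[F(s) / (s N(s))],
    where f and n are the distortions of two distorted Lebesgue measures."""
    F = sp.laplace_transform(f, t, s, noconds=True)
    N = sp.laplace_transform(n, t, s, noconds=True)
    return sp.inverse_laplace_transform(F / (s * N), s, t)

n = sp.sqrt(t)                          # distortion of nu
dmu1_dnu = derivative_wrt_nu(t**2, n)   # d mu_1 / d nu, with m_1(t) = t**2
dlam_dnu = derivative_wrt_nu(t, n)      # d lambda / d nu, with l(t) = t

print(sp.simplify(dmu1_dnu / dlam_dnu))  # 4*t/3, i.e. alpha_1 = 4/3
```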

Fig. 1. Function for which the Choquet integral has to be calculated.
As can be seen from the figure, the function is split into two parts: a positive part and a negative part. The zero, which can also be found numerically, is at x = 0.75. There are two alternative ways to define the Choquet integral of a function taking negative values: the symmetric and the asymmetric definitions. Here we use the symmetric integral because, according to [4] (see also [18]), for ratio scales the symmetric integral seems to be the most suitable one: "The symmetric integral maps a ratio scale (where also the ratio of numbers is meaningful) to a ratio scale. A ratio scale has a zero with a fixed position, while its position is arbitrary for a difference scale". Then, for ν(A) = (λ(A))^{1/2}, the Choquet integral of the function in [0, 1] is 0.34; of this, 0.185 corresponds to the integral of the negative part and 0.155 corresponds to the integral of the positive part.
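A minimal numerical sketch of this symmetric (Šipoš) computation, using the level-set scheme from Section 2.1 above (the scheme and names are ours, not the software of [20]; the values obtained depend on the grid and can be compared with those reported above):

```python
import numpy as np
from math import gamma

n = 20_000
xs = (np.arange(n) + 0.5) / n              # midpoint grid on [0, 1]
a0 = gamma(3) / (gamma(1.5) * gamma(2.5))  # alpha_0 of Example 1
h = a0 * xs**1.5 * np.log(4.0 * xs / 3.0)  # integrand alpha_0 t^{3/2} log(alpha_1 t)

def choquet_sqrt(vals):
    """Choquet integral of a non-negative sample w.r.t. nu(A) = (lambda(A))^{1/2}."""
    v = np.append(np.sort(vals)[::-1], 0.0)  # values, largest first
    return float(np.sum((v[:n] - v[1:]) * np.sqrt(np.arange(1, n + 1) / n)))

pos = choquet_sqrt(np.maximum(h, 0.0))   # Choquet integral of the positive part
neg = choquet_sqrt(np.maximum(-h, 0.0))  # Choquet integral of the negative part
print(pos, neg, pos - neg)               # symmetric integral = pos - neg
```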

The results in the previous example can be generalized. It is easy to prove that for μr = mr ∘ λ with mr(x) = x^r for r > 1/2, the Radon–Nikodym derivatives with respect to ν are
$$\frac{d\mu_r}{d\nu} = \frac{\Gamma(r+1)}{\Gamma(3/2)\,\Gamma(r+1/2)}\,t^{r-1/2}$$
and
$$\frac{d\mu_r/d\nu}{d\lambda/d\nu} = \frac{\Gamma(r+1)\,\Gamma(3/2)}{\Gamma(r+1/2)}\,t^{r-1}.$$
So, defining α0 = Γ(r+1)/(Γ(3/2) Γ(r+1/2)) and α1 = Γ(r+1) Γ(3/2)/Γ(r+1/2), we can express the Kullback–Leibler divergence as follows:
$$KL_\nu(\mu_r, \lambda) = (C)\int \alpha_0 t^{r-1/2} \log(\alpha_1 t^{r-1}) \, d\nu. \qquad (5)$$

Fig. 2 displays four of the functions to be integrated using the Choquet integral. They correspond to the functions obtained for μr with r = 0.6, 2, 4 and 7. The computation of this divergence depends on the reference measure ν. Using a different reference measure ν leads to a different result.

Fig. 2. Functions to be integrated using a Choquet integral in Equation (5) to complete the calculation of KLν(μr, λ), for r = 0.6, 2 (top left and right) and r = 4, 7 (bottom left and right).

4. Principle of minimum discrimination

When considering continuous distributions, the principle of maximum entropy can be applied to the differential entropy, or by means of the relative entropy or the Kullback–Leibler divergence. In this latter case, as the divergence is minimized when the entropy is maximized, we have the Principle of Minimum Discrimination Information, analogous to the Principle of Maximum Entropy (see e.g. [2]). The Kullback–Leibler divergence for non-additive measures can be used to define and apply the principle of minimum discrimination information to a family of non-additive measures. We formalize this principle below and apply it in Example 2 to a family of distorted Lebesgue measures. Note, however, that this approach can be applied using any prior measure instead of the Lebesgue measure; our definition is given in such general terms.

Fig. 3. KLν(μr, λ) as a function of r for 0.5 ≤ r ≤ 10. The minimum is obtained at r = 1.

Definition 10. Given a prior non-additive measure μ, a reference measure ν, and a family of non-additive measures M, a measure μ* compliant with the principle of minimum discrimination information is a measure such that
$$\mu^* \in \arg\min_{\mu_i \in M} KL_\nu(\mu_i, \mu).$$

We illustrate this definition with an example.

Example 2. Let us consider the problem of finding a distorted Lebesgue measure with mr(x) = x^r for r > 1/2 compliant with the principle of minimum discrimination information. The solution is the r > 1/2 that minimizes Equation (5). The numerical integration of Equation (5) for 0.5 ≤ r ≤ 10 is displayed in Fig. 3. We can observe that the Kullback–Leibler divergence increases as we depart from r = 1, and the minimum divergence is obtained at r = 1, as expected.
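The minimization of Example 2 can be sketched numerically by evaluating Equation (5) on a grid of values of r with the same symmetric level-set scheme as before (a sketch of ours; the curve of Fig. 3 was computed with the software of [20]):

```python
import numpy as np
from math import gamma

def kl_r(r, n=20_000):
    """Numerical value of Equation (5) for mu_r, via the symmetric Choquet
    integral w.r.t. nu(A) = (lambda(A))^{1/2} on a grid over [0, 1]."""
    a0 = gamma(r + 1) / (gamma(1.5) * gamma(r + 0.5))
    a1 = gamma(r + 1) * gamma(1.5) / gamma(r + 0.5)
    xs = (np.arange(n) + 0.5) / n
    h = a0 * xs**(r - 0.5) * np.log(a1 * xs**(r - 1.0))

    def choquet(vals):  # level-set approximation for non-negative samples
        v = np.append(np.sort(vals)[::-1], 0.0)
        return np.sum((v[:n] - v[1:]) * np.sqrt(np.arange(1, n + 1) / n))

    return float(choquet(np.maximum(h, 0)) - choquet(np.maximum(-h, 0)))

rs = np.linspace(0.6, 10.0, 48)
print(min(rs, key=kl_r))   # expected to be near r = 1, as in Fig. 3
```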

5. Conclusion

In this paper we have introduced the entropy of non-additive measures on a continuous domain. We have also introduced the corresponding principle of minimum discrimination, proven some properties, and given some examples. Further work is needed in this area: we need to compare the definitions introduced here with the ones for discrete domains, and to consider the application to other types of measures, in particular, distorted probabilities in general.

Uncited references

[13] [17] [22]

References
[1] G. Choquet, Theory of capacities, Ann. Inst. Fourier 5 (1953/54) 131–295.
[2] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, 1991.

[3] D. Denneberg, Non Additive Measure and Integral, Kluwer Academic Publishers, Dordrecht, 1994.
[4] M. Grabisch, C. Labreuche, The symmetric and asymmetric Choquet integrals on finite spaces for decision making, Stat. Pap. 43 (2002) 37–52.
[5] M. Grabisch, J.-L. Marichal, R. Mesiar, E. Pap, Aggregation Functions, Cambridge University Press, 2009.
[6] S. Graf, A Radon–Nikodym theorem for capacities, J. Reine Angew. Math. 1980 (1980) 192–214.
[7] A. Honda, Canonical fuzzy measure on (0, 1], Fuzzy Sets Syst. (2001) 147–150.
[8] A. Honda, Entropy of capacity, in: V. Torra, Y. Narukawa, M. Sugeno (Eds.), Non-additive Measures: Theory and Applications, 2013, pp. 79–95.
[9] E.T. Jaynes, Information theory and statistical mechanics, in: K.W. Ford (Ed.), Brandeis University Summer Institute Lectures in Theoretical Physics, W.A. Benjamin, 1963, pp. 181–218.
[10] E.T. Jaynes, Prior probabilities, IEEE Trans. Syst. Sci. Cybern. 4 (3) (1968) 227–241.
[11] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 79–86.
[12] J.-L. Marichal, M. Roubens, Entropy of discrete fuzzy measures, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 8 (6) (2000) 625–640.
[13] Y. Narukawa, V. Torra, M. Sugeno, Choquet integral with respect to a symmetric fuzzy measure of a function on the real line, Ann. Oper. Res. (2016), in press.
[14] P. Penfield, Principle of maximum entropy, 2015, Ch. 9 of the course Information, Entropy and Computation 6.050J/2.110J.
[15] Y. Rébillé, A super Radon–Nikodym derivative for almost subadditive set functions, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 21 (2013) 347–365.
[16] M. Sugeno, A note on derivatives of functions with respect to fuzzy measures, Fuzzy Sets Syst. 222 (2013) 1–17.
[17] M. Sugeno, A way to Choquet calculus, IEEE Trans. Fuzzy Syst. (2014), in press.
[18] M. Sugeno, T. Murofushi, Fuzzy Measure (in Japanese), Nikkan Kogyo Shinbunsha, Tokyo, 1993.
[19] V. Torra, Y. Narukawa, Modeling Decisions: Information Fusion and Aggregation Operators, Springer, 2007.
[20] V. Torra, Y. Narukawa, Numerical integration for the Choquet integral, Inf. Fusion 31 (2016) 137–145.
[21] V. Torra, Y. Narukawa, M. Sugeno (Eds.), Non-additive Measures: Theory and Applications, Springer, 2013.
[22] V. Torra, Y. Narukawa, M. Sugeno, M. Carlson, Hellinger distance for fuzzy measures, in: Proc. EUSFLAT 2013, 2013.
[23] V. Torra, Y. Narukawa, M. Sugeno, On the f-divergence for non-additive measures, Fuzzy Sets Syst. (2016), in press.
[24] R.R. Yager, On the entropy of fuzzy measures, Technical Report #MII-1917R, Machine Intelligence Institute, Iona College, New Rochelle, NY, 1999.
