INFORMATION SCIENCES 35, 145-156 (1985)
Trigonometric Entropies, Jensen Difference Divergence Measures, and Error Bounds*

ANNIBAL P. SANT'ANNA
Instituto de Matemática, Universidade Federal do Rio de Janeiro, 21944 Rio de Janeiro, RJ, Brasil

and

INDER JEET TANEJA
Departamento de Matemática, Universidade Federal de Santa Catarina, 88.000 Florianópolis, SC, Brasil
ABSTRACT

Various authors have attempted to characterize generalized entropies which in special cases reduce to the Shannon entropy. In this paper, we also characterize new trigonometric entropies, and some information-theoretic properties are studied. Bounds on the Bayesian probability of error in terms of trigonometric entropies and Jensen difference divergence measures have been obtained. Ideas of paired entropies applied to statistical mechanics and fuzzy set theory are also discussed.
1. INTRODUCTION
Various entropies have been introduced in the literature, taking the Shannon entropy as basic. It was Rényi [21] who for the first time gave a parametric generalization of the Shannon entropy, known as the entropy of order α; later Havrda and Charvát [15] introduced another kind, known as the entropy of degree β. Sharma and Mittal [23] introduced a third kind involving two parameters which unifies those of Rényi with those of Havrda and Charvát, known as the entropy of order α and degree β. Sharma and Taneja [24] also gave a direct generalization of entropy, that of entropy of degree (α, β). All these generalizations are based on a power function of the type $f(P) = \sum_{i=1}^{n} p_i^r$, where r is any parameter greater than zero. Taneja [26] for the first time gave a systematic way
*Partially supported by CNPq (Brazil).
©Elsevier Science Publishing Co., Inc. 1985
52 Vanderbilt Ave., New York, NY 10017
to generalize the Shannon entropy, involving the sine function, and later Sharma and Taneja [25] characterized it jointly with the entropy of degree (α, β). There are two main basic approaches adopted to characterize these entropies: one axiomatic, and another by functional equations. It is true that the Shannon entropy is fundamental from the applications point of view and arises naturally from statistical concepts. But during past years, researchers have also examined the applications of generalized entropies in different fields [2, 4, 6, 8] and found them as good as the Shannon entropy, and sometimes better because of the flexibility of the parameters, especially for comparison purposes. Here our aim is to characterize new families involving the sine function. In special cases, these either reduce to the Shannon entropy or are as good as the Shannon entropy. Some information-theoretic properties are studied. Bounds on the Bayesian probability of error are obtained. The idea of the Jensen difference divergence measure or information radius has been generalized. Some possible applications to statistical mechanics and fuzzy set theory are discussed.
2. SINE ENTROPIES AND THEIR PROPERTIES
Let Δ_n = {P = (p_1, p_2, ..., p_n) | p_i ≥ 0, Σ_{i=1}^n p_i = 1} be the set of all complete finite discrete probability distributions associated with a discrete random variable taking a finite number of values. The sine entropy, introduced by Taneja [26] (see also [25]), is given by
$$H_\beta(P) = -\frac{1}{\sin\beta} \sum_{i=1}^{n} p_i \sin(\beta \log p_i), \qquad \beta \neq k\pi, \quad k = 0, 1, 2, \ldots, \tag{1}$$
for all P = (p_1, p_2, ..., p_n) ∈ Δ_n, and its characterization is based on the functional equation arising from the following generalized additivity:
$$H(P * Q) = H(P)\,G(Q) + G(P)\,H(Q), \tag{2}$$
where H(P) = Σ_{i=1}^n h(p_i) and G(P) = Σ_{i=1}^n g(p_i) for all P ∈ Δ_n, Q ∈ Δ_m, P ∗ Q ∈ Δ_{nm}, and h and g are continuous functions defined over [0, 1]. It is easy to verify that
$$\lim_{\beta \to 0} H_\beta(P) = H(P) = -\sum_{i=1}^{n} p_i \log p_i.$$
It is understood throughout the paper that all the logarithms are to base 2, 0 log 0 = 0, and 0 sin(β log 0) = 0, β ≠ 0.
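As a purely illustrative aside (not part of the original paper), the following Python sketch evaluates the sine entropy (1), verifies numerically that it approaches the Shannon entropy for small β, and checks the generalized additivity (2) using the companion choice g(p) = p cos(β log p); this choice of g is our assumption, one under which (2) holds by the sine addition formula.

    import numpy as np

    def sine_entropy(p, beta):
        # Sine entropy (1): -(1/sin beta) * sum_i p_i sin(beta * log2 p_i).
        # Zero probabilities contribute 0, following the convention in the text.
        p = np.asarray(p, float)
        q = p[p > 0]
        return -np.sum(q * np.sin(beta * np.log2(q))) / np.sin(beta)

    def shannon_entropy(p):
        p = np.asarray(p, float)
        q = p[p > 0]
        return -np.sum(q * np.log2(q))

    def G(p, beta):
        # Companion sum G(P) = sum_i p_i cos(beta * log2 p_i) (assumed form of g).
        p = np.asarray(p, float)
        q = p[p > 0]
        return np.sum(q * np.cos(beta * np.log2(q)))

    P = np.array([0.5, 0.3, 0.2])
    Q = np.array([0.6, 0.4])
    beta = 0.7

    # limiting behavior: H_beta(P) -> H(P) as beta -> 0
    print(sine_entropy(P, 1e-5), shannon_entropy(P))

    # generalized additivity (2) for the product distribution P * Q = (p_i q_j)
    PQ = np.outer(P, Q).ravel()
    lhs = sine_entropy(PQ, beta)
    rhs = sine_entropy(P, beta) * G(Q, beta) + G(P, beta) * sine_entropy(Q, beta)
    print(lhs, rhs)   # the two values agree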
The above sine entropy enjoys many interesting properties similar to the Shannon entropy. Here the log is inside the sine function. In the following we give characterizations of three different sine entropies; in two of them the sine function is inside the log function, while the third is only in terms of the sine function. Some information-theoretic properties are also studied.
2.1. CHARACTERIZATION
Let h and f be two continuous functions defined over [0, 1] and satisfying the following relations: (4) and
$$f(p+q)\,f(p-q) = f(p)^2 - f(q)^2, \tag{5}$$
for all p, q ∈ [0, 1]. Also consider the following generalized average sums:
$$\sum_{i=1}^{n} h(f(p_i)), \tag{6}$$
$$\sum_{i=1}^{n} f(p_i)\,h(f(p_i)), \tag{7}$$
and
$$\sum_{i=1}^{n} f(p_i), \tag{8}$$
for all P = (p_1, p_2, ..., p_n) ∈ Δ_n. The most general nontrivial continuous solutions of the functional equations (4) and (5) (see Aczél [1]) are given by
$$h(f(p)) = A \log f(p), \qquad f(p) = c_1 \sin \beta p, \qquad \text{and} \qquad f(p) = c_2 \sinh \beta p,$$
where A, c_1, c_2, and β are arbitrary constants. The above set of solutions, under the boundary conditions f(1/2) = 1/2 and h(1/2) = 1, leads to
$$f(p) = \frac{\sin \beta p}{2\sin(\beta/2)}, \qquad f(p) = \frac{\sinh \beta p}{2\sinh(\beta/2)}, \qquad \text{and} \qquad h(f(p)) = -\log f(p).$$
This, together with the sum representations (6), (7), and (8), gives the explicit expressions for the trigonometric entropies. It is easy to verify that, in special cases, these reduce to the Shannon entropy.
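For readers who want to experiment numerically, the sketch below (ours, not the authors') encodes the bounded solutions f and h above and evaluates the raw sums (6)-(8). Any normalizing constants used in the paper's final entropy definitions are not legible in this copy and are not reproduced here, so the values should be read only as the unnormalized sums.

    import numpy as np

    def f_sin(p, beta):
        # f(p) = sin(beta p) / (2 sin(beta/2)); satisfies f(1/2) = 1/2
        return np.sin(beta * np.asarray(p, float)) / (2.0 * np.sin(beta / 2.0))

    def f_sinh(p, beta):
        # hyperbolic solution f(p) = sinh(beta p) / (2 sinh(beta/2))
        return np.sinh(beta * np.asarray(p, float)) / (2.0 * np.sinh(beta / 2.0))

    def h(x):
        # h(f(p)) = -log2 f(p); satisfies h(1/2) = 1
        return -np.log2(x)

    def sums_678(P, beta):
        # raw sums (6), (7), (8) with the sine solution; zero p_i terms are dropped
        p = np.asarray(P, float)
        F = f_sin(p[p > 0], beta)
        return np.sum(h(F)), np.sum(F * h(F)), np.sum(F)

    print(f_sin(0.5, 1.2), h(f_sin(0.5, 1.2)))   # 0.5 and 1.0 (boundary conditions)
    print(sums_678([0.5, 0.3, 0.2], 1.2))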
2.2. PAIRED ENTROPIES

The Maxwell-Boltzmann distribution of statistical mechanics can be obtained by maximizing the Shannon entropy subject to the constraint that the average energy of the system is prescribed. This distribution, however, is not obeyed by any particle in nature. All particles in nature obey either Bose-Einstein statistics or Fermi-Dirac statistics. Evidently, these distributions cannot be derived by maximizing the ordinary Shannon entropy. They can, however, be derived from a modification of the Shannon entropy. Such a procedure was used in Capocelli and De Luca [9], where the Bose-Einstein statistics is derived along with the Fermi-Dirac statistics and intermediate ones. Forte and Sempi [13]
showed that the abovementioned entropies can be derived, without recourse to a special entropy, by maximizing the Shannon conditional entropy. The Bose-Einstein distribution, which is satisfied by bosons (photons, and nuclei and atoms containing an even number of particles), can be derived by maximizing the Bose-Einstein entropy (see Kapur [17, 18]), viz.,
$$-\sum_{i=1}^{n} p_i \ln p_i + \sum_{i=1}^{n} (1 + p_i)\ln(1 + p_i) - 2\ln 2. \tag{9}$$
Similarly, the Fermi-Dirac distribution, which is satisfied by electrons, neutrons, and protons, is obtained by maximizing the Fermi-Dirac entropy or paired entropy (see Kapur [17, 18]), viz.,
$$-\sum_{i=1}^{n} p_i \ln p_i - \sum_{i=1}^{n} (1 - p_i)\ln(1 - p_i). \tag{10}$$
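A small Python sketch (ours) of the two paired entropies as reconstructed in (9) and (10); the constant term −2 ln 2 in (9) is read from a partly garbled display and should be checked against Kapur [17, 18].

    import numpy as np

    def fermi_dirac_entropy(p):
        # Paired (Fermi-Dirac) entropy (10): -sum p ln p - sum (1-p) ln(1-p),
        # with the convention 0 ln 0 = 0.
        p = np.asarray(p, float)
        a = p[p > 0]
        b = 1.0 - p[p < 1]
        return -np.sum(a * np.log(a)) - np.sum(b * np.log(b))

    def bose_einstein_entropy(p):
        # Bose-Einstein entropy (9), as reconstructed:
        # -sum p ln p + sum (1+p) ln(1+p) - 2 ln 2.
        p = np.asarray(p, float)
        a = p[p > 0]
        return -np.sum(a * np.log(a)) + np.sum((1 + p) * np.log(1 + p)) - 2 * np.log(2)

    print(fermi_dirac_entropy([0.5, 0.5]))   # 2 ln 2, about 1.386
    print(bose_einstein_entropy([0.5, 0.5]))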
This idea of paired entropies has been systematically carried over to fuzzy-set theory by De Luca and Termini [11], where the generalized additivity (2) appears as one of the properties of entropy of fuzzy sets. Recently Ebanks [12] took it as one of the axioms and came up with an entropy of fuzzy sets which is known as the quadratic entropy [27] in information theory. In a manner somewhat similar to Kapur [17], Burbea [6] recently extended (9) and (10) to the entropies of degree β. Here our aim is just to introduce trigonometric paired entropies, whose details of applications to fuzzy-set theory and statistical mechanics will be discussed elsewhere. These are as follows:
$$\sum_{i=1}^{n} S_k^{\beta}(p_i, 1 - p_i), \qquad k = 1, 2, 3, 4, \tag{11}$$
where S_k^β(p, 1−p), k = 1, 2, 3, 4, are the binary trigonometric entropies. In a similar way, Bose-Einstein trigonometric entropies can be introduced.
2.3. PROPERTIES
In this section, we shall give some information-theoretic properties of the trigonometric entropies. These properties are subject to the condition that β ∈ (0, π]. Using the periodicity of the sine function, we can extend them to other intervals.

(i) Nonnegativity: S_k^β(P), k = 1, 2, 3, 4, are nonnegative for 0 < β ≤ π;
(ii) Continuity: S_k^β(P), k = 1, 2, 3, 4, are continuous functions of P;
(iii) Symmetry: S_k^β(p_1, p_2, ..., p_n), k = 1, 2, 3, 4, are symmetric functions of their arguments;
(iv) Expansibility: S_k^β(p_1, p_2, ..., p_n, 0) = S_k^β(p_1, p_2, ..., p_n), k = 1, 2, 3, 4;
(v) Normality: S_k^β(1/2, 1/2) = 1, k = 1, 2, 3, 4;
(vi) Decisivity:
$$S_1^{\beta}(1, 0) = S_1^{\beta}(0, 1) = 0,$$
$$S_2^{\beta}(1, 0) = S_2^{\beta}(0, 1) = \log\left(\cos\frac{\beta}{2}\right)^{-1}, \qquad \beta \neq \pi,$$
$$S_3^{\beta}(1, 0) = S_3^{\beta}(0, 1) = \left(\cos\frac{\beta}{2}\right)\log\left(\cos\frac{\beta}{2}\right)^{-1}, \qquad \beta \neq \pi,$$
$$S_4^{\beta}(1, 0) = S_4^{\beta}(0, 1) = \cos\frac{\beta}{2}.$$

In the second case the entropy is never decisive. In the third and fourth cases it is decisive only when β = π, i.e., S_3^π(1, 0) = S_3^π(0, 1) = 0 and S_4^π(1, 0) = S_4^π(0, 1) = 0.
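As a consistency check (ours), if S_3^β and S_4^β are built from the sums (7) and (8) with the bounded solution f(p) = sin βp / (2 sin(β/2)) and the convention 0 log 0 = 0, the decisivity values listed above follow from the identity sin β = 2 sin(β/2) cos(β/2). This is only a sketch under that assumption, since the paper's explicit definitions are not fully legible in this copy:
$$f(1) = \frac{\sin\beta}{2\sin(\beta/2)} = \cos\frac{\beta}{2}, \qquad f(0) = 0,$$
$$S_4^{\beta}(1,0) = f(1) + f(0) = \cos\frac{\beta}{2},$$
$$S_3^{\beta}(1,0) = -f(1)\log f(1) - f(0)\log f(0) = \left(\cos\frac{\beta}{2}\right)\log\left(\cos\frac{\beta}{2}\right)^{-1},$$
both of which vanish at β = π, in agreement with property (vi).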
3. TRIGONOMETRIC ENTROPIES AND ERROR BOUNDS
Let us consider the decision-theory problem of classifying an observation X as coming from one of n possible classes (hypotheses) C_1, C_2, ..., C_n. Let p_i = Pr{C = C_i}, i = 1, 2, ..., n, denote the a priori probability of the classes, and let p(x|C_i) denote the probability density function of the random variable given that C_i is the true class or hypothesis. We assume that the p_i and the p(x|C_i) are completely known. Given any observation x on X, we can calculate the conditional (a posteriori) probabilities p(C_i|x) by the Bayes rule. Consider the decision rule which chooses the hypothesis with the largest a posteriori probability. Using this rule, the partial probability of error for X = x is expressed by
$$P_e(x) = 1 - \max\{p(C_1 \mid x), \ldots, p(C_n \mid x)\}.$$
Prior to observing X, the probability of error P_e associated with X is defined as the expected probability of error, i.e.,
$$P_e = E_X\{P_e(X)\} = \int_X p(x)\,P_e(x)\,dx,$$
where p(x) = Σ_{i=1}^n p_i p(x|C_i) is the unconditional density of X evaluated at x.
In the recent literature, researchers in pattern recognition have shown considerable interest in the applications of certain probabilistic information and distance measures as criteria for feature selection. Kanal [16] and Chen [10] provided a fairly good list of information and distance measures, corresponding bounds, and relationships among them. Now we will give bounds on P_e in terms of trigonometric entropies in two different ways: by a sum representation, and by Jensen difference divergence measures.
3.1. SUM REPRESENTATION AND ERROR BOUNDS
Kovalevski [19] gave a pointwise upper bound for P_e in terms of the parameter t, and a Fano bound taking the Shannon entropy into consideration. Based on Kovalevski's idea, Ben-Bassat [3] extended his results to the general class of functions satisfying the sum property, defined by
$$\Gamma(f) = \left\{ H(P) \;\middle|\; H : \Delta_n \to \mathbb{R},\; n \geq 2,\; H(P) = \sum_{i=1}^{n} f(p_i),\; f \text{ strictly concave},\; f'' \text{ exists},\; f(0) = \lim_{p \to 0} f(p) = 0 \right\}.$$
The bounds of [3] relate P_e(x) to H(P), where t is an integer such that
$$\frac{t-1}{t} \leq P_e(x) < \frac{t}{t+1}.$$
The particular cases considered by Ben-Bassat [3] are the Shannon entropy, the quadratic entropy [27], and the entropy of degree β [15]. We can extend them to the trigonometric entropies as follows. For t = 1, i.e., 0 < P_e(x) ≤ 1/2, the upper bounds on P_e are given by
$$P_e \leq \tfrac{1}{2} S_k^{\beta}(C \mid X), \qquad k = 1, 2, 3, 4, \tag{12}$$
where S_k^β(C|X) = ∫_X S_k^β(C|x) p(x) dx, and S_k^β(C|x) (k = 1, 2, 3, 4) are the conditional trigonometric entropies of C for X = x. The upper bounds on S_k^β(C|X) are the Fano-type bounds given by
$$S_k^{\beta}(C \mid X) \leq S_k^{\beta}\!\left(\frac{P_e}{n-1}, \ldots, \frac{P_e}{n-1}, 1 - P_e\right), \qquad k = 1, 2, 3, 4. \tag{13}$$
The bounds given in (12) and (13) are subject to the condition of concavity: for k = 1, f(p) is concave provided β log tan(β log p) ≤ 1, which holds for β small and p ∈ (0, 1]. For k = 2 and 3, the bounds are valid for β ∈ (0, π/4]. For k = 4, the bound is valid for β ∈ (0, π].
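For the Shannon special case (one of the cases listed by Ben-Bassat [3]), the analogues of (12) and (13) are the classical bounds P_e ≤ (1/2) H(C|X) and H(C|X) ≤ H(P_e, 1 − P_e) + P_e log(n − 1). The Python sketch below (ours, with an arbitrary discrete example) checks both numerically; it does not implement the trigonometric cases, whose explicit normalizations are not legible in this copy.

    import numpy as np

    def H(p):
        # Shannon entropy in bits, with 0 log 0 = 0
        p = np.asarray(p, float)
        q = p[p > 0]
        return -np.sum(q * np.log2(q))

    priors = np.array([0.3, 0.3, 0.4])                 # three classes (assumed)
    cond = np.array([[0.6, 0.2, 0.1, 0.1],
                     [0.2, 0.5, 0.2, 0.1],
                     [0.1, 0.1, 0.3, 0.5]])            # p(x | C_i), assumed

    p_x = priors @ cond
    post = (priors[:, None] * cond) / p_x              # p(C_i | x)

    P_e = np.sum(p_x * (1.0 - post.max(axis=0)))       # Bayes error
    H_CX = np.sum(p_x * np.array([H(post[:, j]) for j in range(cond.shape[1])]))

    n = len(priors)
    print(P_e <= 0.5 * H_CX)                                       # analogue of (12)
    print(H_CX <= H([P_e, 1 - P_e]) + P_e * np.log2(n - 1))        # Fano-type bound, analogue of (13)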
3.2. JENSEN DIFFERENCE DIVERGENCE MEASURES AND ERROR BOUNDS
Recently, Burbea and Rao [7] and Burbea [5] considered three different classes of divergence measures and studied their convexity properties. Two of them are direct generalizations of the J-divergence of Jeffreys, Kullback, and Leibler, and one is based on the Jensen difference. They put the greatest emphasis on the Jensen difference. It has a wide range of applications in biological sciences [22], information theory [14], statistics [20], and other related areas. Here our aim is to consider divergence measures in terms of the Jensen difference and to obtain bounds on the probability of error. Some trigonometric examples are considered. Here we consider only the two-class case. In terms of the prior probabilities, the Jensen difference divergence measure based on φ is given by
$$J_\phi = \int_X \left[ \frac{\phi(p(x \mid C_1)) + \phi(p(x \mid C_2))}{2} - \phi\!\left( \frac{p(x \mid C_1) + p(x \mid C_2)}{2} \right) \right] dx, \tag{14}$$
where φ is a convex function defined on [0, 1]. Let us consider the more general class of measures
$$J_\phi(p_1, p_2) = \int_X \left[ \frac{\phi(p_1 p(x \mid C_1)) + \phi(p_2 p(x \mid C_2))}{2} - \phi\!\left( \frac{p_1 p(x \mid C_1) + p_2 p(x \mid C_2)}{2} \right) \right] dx. \tag{15}$$
Let us consider the Jensen difference in terms of the posterior probabilities as
$$\tilde{J}_\phi(p_1, p_2) = \int_X J_\phi(x)\,p(x)\,dx, \tag{16}$$
where
$$J_\phi(x) = \frac{\phi(p(C_1 \mid x)) + \phi(p(C_2 \mid x))}{2} - \phi\!\left(\tfrac{1}{2}\right).$$
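A short Python sketch (ours) of the posterior-based Jensen difference (16), as reconstructed above, on a discrete observation space; it takes φ(p) = p log2 p (the choice of Example 1 below), and both φ and the discrete setup are illustrative assumptions.

    import numpy as np

    def phi(p):
        # phi(p) = p log2 p, with phi(0) = 0
        p = np.asarray(p, float)
        return np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)

    def jensen_difference_posterior(priors, cond):
        # (16): an integral (here a sum) over x of J_phi(x) p(x), with
        # J_phi(x) = [phi(p(C1|x)) + phi(p(C2|x))]/2 - phi(1/2)
        p_x = priors @ cond
        post = (priors[:, None] * cond) / p_x
        J_x = 0.5 * (phi(post[0]) + phi(post[1])) - phi(0.5)
        return np.sum(p_x * J_x)

    priors = np.array([0.4, 0.6])
    cond = np.array([[0.1, 0.2, 0.3, 0.4],
                     [0.5, 0.3, 0.1, 0.1]])
    print(jensen_difference_posterior(priors, cond))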
Now we will obtain bounds on P_e in terms of J̃_φ(p_1, p_2) and then will consider some examples. In the two-class case P_e(x) = min{p(C_1|x), p(C_2|x)}. Since J_φ(x) is symmetric in p(C_1|x) and p(C_2|x), consider p(C_1|x) = P_e(x), so that p(C_2|x) = 1 − P_e(x). This gives
$$J_\phi(x) = \frac{\phi(P_e(x)) + \phi(1 - P_e(x))}{2} - \phi\!\left(\tfrac{1}{2}\right).$$
(i) Lower bound on J̃_φ(p_1, p_2) in terms of P_e: J_φ(x) is a convex function of P_e(x). Thus, since φ is convex, Jensen's inequality gives
$$\tilde{J}_\phi(p_1, p_2) = \int_X J_\phi(x)\,p(x)\,dx \geq \frac{\phi(P_e) + \phi(1 - P_e)}{2} - \phi\!\left(\tfrac{1}{2}\right).$$
This is a lower bound on J̃_φ(p_1, p_2) in terms of P_e, which in turn gives a bound on P_e in terms of J̃_φ(p_1, p_2), but in a complicated form.
(ii) Upper bound on P_e in terms of J̃_φ(p_1, p_2): In order to obtain an upper bound on P_e, let us put the following conditions on the function φ:
$$\phi(1) = \phi(0) = 0 \qquad \text{and} \qquad \phi\!\left(\tfrac{1}{2}\right) = -\tfrac{1}{2}.$$
This gives J_φ(0) = J_φ(1) = 1/2 and J_φ(1/2) = 0. Consider the function
$$f_\phi(x) = 1 - 2J_\phi(x);$$
then f_φ(1) = f_φ(0) = 0 and f_φ(1/2) = 1. Also f_φ is concave. Then by the shape of f_φ and P_e(x) we can easily see that
$$P_e(x) \leq \tfrac{1}{2} f_\phi(x),$$
i.e.,
$$P_e(x) \leq \tfrac{1}{2}\left[1 - 2J_\phi(x)\right],$$
i.e.,
$$P_e \leq \tfrac{1}{2}\left[1 - 2\tilde{J}_\phi(p_1, p_2)\right]. \tag{18}$$

EXAMPLE 1. For φ(p) = p log p, we have
$$J(p_1, p_2) = \tilde{J}(p_1, p_2)$$
and
$$\tfrac{1}{2}\left[1 - H(P_e, 1 - P_e)\right] \leq J(p_1, p_2),$$
where
$$J(p_1, p_2) = \int_X \left[ \frac{p(C_1 \mid x)\log p(C_1 \mid x) + p(C_2 \mid x)\log p(C_2 \mid x) + 1}{2} \right] p(x)\,dx$$
$$= \int_X \left[ \frac{p_1 p(x \mid C_1)\log p_1 p(x \mid C_1) + p_2 p(x \mid C_2)\log p_2 p(x \mid C_2)}{2} - \frac{p_1 p(x \mid C_1) + p_2 p(x \mid C_2)}{2}\log\frac{p_1 p(x \mid C_1) + p_2 p(x \mid C_2)}{2} \right] dx,$$
and H(P_e, 1 − P_e) = −P_e log P_e − (1 − P_e) log(1 − P_e).
EXAMPLE 2. For
$$\phi(p) = \frac{p\sin(\beta \log p)}{\sin\beta},$$
we have, from (18),
$$P_e \leq \tfrac{1}{2}\left[1 - 2\tilde{J}_\phi(p_1, p_2)\right],$$
where
$$\tilde{J}_\phi(p_1, p_2) = \int_X \left[ \frac{p(C_1 \mid x)\sin(\beta \log p(C_1 \mid x)) + p(C_2 \mid x)\sin(\beta \log p(C_2 \mid x))}{2\sin\beta} + \frac{1}{2} \right] p(x)\,dx.$$
EXAMPLE 3. For
$$\phi(p) = \frac{\sin \pi p}{2}\log\frac{\sin \pi p}{2},$$
we have, from (18),
$$P_e \leq \tfrac{1}{2}\left[1 - 2\tilde{J}_\phi(p_1, p_2)\right],$$
where
$$\tilde{J}_\phi(p_1, p_2) = \int_X \left[ \frac{\sin \pi p(C_1 \mid x)}{2}\log\frac{\sin \pi p(C_1 \mid x)}{2} + \frac{1}{2} \right] p(x)\,dx.$$

EXAMPLE 4. For φ(p) = −(sin πp)/2, we have, from (18),
$$P_e \leq \tfrac{1}{2}\left[1 - 2\tilde{J}_\phi(p_1, p_2)\right],$$
where
$$\tilde{J}_\phi(p_1, p_2) = \int_X \frac{1}{2}\left[1 - \sin \pi p(C_1 \mid x)\right] p(x)\,dx.$$
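The convex functions of Examples 2-4 (Example 3's form is our reconstruction from the legible fragments) all satisfy the conditions φ(0) = φ(1) = 0 and φ(1/2) = −1/2 required in the derivation of (18). The Python sketch below (ours) checks this numerically, treating the endpoints p = 0 and p = 1 as limits.

    import numpy as np

    def phi_ex2(p, beta=1.0):
        # Example 2: phi(p) = p sin(beta log2 p) / sin(beta)
        return 0.0 if p == 0 else p * np.sin(beta * np.log2(p)) / np.sin(beta)

    def phi_ex3(p):
        # Example 3 (reconstructed): phi(p) = (sin(pi p)/2) * log2(sin(pi p)/2)
        s = np.sin(np.pi * p) / 2.0
        return 0.0 if s == 0 else s * np.log2(s)

    def phi_ex4(p):
        # Example 4: phi(p) = -sin(pi p)/2
        return -np.sin(np.pi * p) / 2.0

    for phi in (phi_ex2, phi_ex3, phi_ex4):
        print(phi(0.0), phi(1.0), phi(0.5))   # expect 0, 0, -0.5 (up to rounding)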
This work was started during the first author's stay with the Departamento de Matemática, Universidade Federal de Santa Catarina, 88.000 Florianópolis, SC, Brazil, from August to November 1983, and was completed during the second author's stay with the Istituto di Scienze dell'Informazione, Facoltà di Scienze, Università di Salerno, 84100 Salerno, Italy, from November 1983 to October 1984; both authors are thankful to those universities for providing facilities and hospitality. Thanks are also extended to CNPq (Brazil) for partial support.

REFERENCES

1. J. Aczél, Lectures on Functional Equations and Their Applications, Academic, 1966.
2. S. Arimoto, Information measures and capacity of order α for discrete memoryless channels, in Colloquium on Information Theory, Keszthely, Hungary, 1975, pp. 41-52.
3. M. Ben-Bassat, f-entropies, probability of error and feature selection, Inform. and Control 39:227-242 (1978).
4. M. Ben-Bassat and J. Raviv, Rényi's entropy and the probability of error, IEEE Trans. Inform. Theory IT-24:324-331 (1978).
5. J. Burbea, J-divergence and related concepts, in Encyclopedia of Statistical Sciences, Vol. 4, 1983, pp. 290-296.
6. J. Burbea, The Bose-Einstein entropy of degree α and its Jensen difference, Utilitas Math., to appear.
7. J. Burbea and C. R. Rao, On the convexity of some divergence measures based on entropy functions, IEEE Trans. Inform. Theory IT-28:489-495 (1982).
8. L. L. Campbell, A coding theorem and Rényi's entropy, Inform. and Control 8:423-429 (1965).
9. R. M. Capocelli and A. De Luca, Fuzzy sets and decision theory, Inform. and Control 23:446-473 (1973).
10. C. H. Chen, On information and distance measures, error bounds, and feature selection, Inform. Sci. 10:159-173 (1976).
11. A. De Luca and S. Termini, A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory, Inform. and Control 20:301-312 (1972).
12. B. R. Ebanks, On measures of fuzziness and their representation, J. Math. Anal. Appl. 94:24-37 (1981).
13. B. Forte and C. Sempi, Maximizing conditional entropies: A derivation of quantal statistics, Rend. Mat. (6) 9:551-566 (1976).
14. R. G. Gallager, Information Theory and Reliable Communication, Wiley, New York, 1968.
15. J. Havrda and F. Charvát, Quantification method of classification processes: Concept of structural α-entropy, Kybernetika (Prague) 3:30-35 (1967).
16. L. Kanal, Patterns in pattern recognition, IEEE Trans. Inform. Theory IT-20:697-722 (1974).
17. J. N. Kapur, Measures of uncertainty, mathematical programming and information theory, J. Indian Soc. Agric. Statist. 24:47-66 (1972).
18. J. N. Kapur, Non-additive measures of entropy and distributions of statistical mechanics, Indian J. Pure Appl. Math. 14:1372-1387 (1983).
19. V. A. Kovalevski, The problem of character recognition from the point of view of mathematical statistics, in Character Readers and Pattern Recognition, 1968, pp. 3-30.
20. C. R. Rao, Diversity and dissimilarity coefficients: A unified approach, Theoret. Population Biol. 21:24-43 (1982).
21. A. Rényi, On measures of entropy and information, in Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Univ. of California Press, Berkeley, 1961, pp. 547-561.
22. R. Sibson, Information radius, Z. Wahrsch. Verw. Gebiete 14:149-160 (1969).
23. B. D. Sharma and D. P. Mittal, New nonadditive measures of entropy for discrete probability distributions, J. Math. Sci. 10:28-40 (1975).
24. B. D. Sharma and I. J. Taneja, Entropy of type (α, β) and other generalized measures in information theory, Metrika 22:205-215 (1975).
25. B. D. Sharma and I. J. Taneja, Three generalized additive measures of entropy, Elektron. Informationsverarb. Kybernet. 13:419-433 (1977).
26. I. J. Taneja, A study of generalized measures in information theory, Ph.D. Thesis, Univ. of Delhi, India, 1975.
27. I. Vajda, Bounds on the minimal error probability and checking a finite or countable number of hypotheses, Inform. Trans. Problems 4:9-17 (1968).
Received 13 November 1984; revised 8 January 1985