Journal of Statistical Planning and Inference 142 (2012) 1716–1732
On the convergence of Shannon differential entropy, and its connections with density and entropy estimation

Jorge F. Silva*, Patricio Parada

University of Chile, Department of Electrical Engineering, Av. Tupper 2007, Santiago 412-3, Chile
Article history: Received 9 June 2011; Accepted 13 February 2012; Available online 21 February 2012

Keywords: Convergence of probability measures; Shannon information measures; Strong consistency; Density estimation; Differential entropy estimation; Consistency in information divergence; Histogram-based estimators

Abstract

This work extends the study of convergence properties of the Shannon differential entropy, and its connections with the convergence of probability measures in the sense of total variation and direct and reverse information divergence. The results relate the topics of distribution (density) estimation and Shannon information measure estimation, with special focus on the case of differential entropy. On the application side, this work presents an explicit analysis of density estimation and differential entropy estimation for distributions defined on a finite-dimensional Euclidean space $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$. New consistency results are derived for several histogram-based estimators: the classical product scheme, Barron's estimator, one of the approaches proposed by Györfi and Van der Meulen, and the data-driven partition scheme of Lugosi and Nobel.
1. Introduction

The estimation of Shannon information measures, such as the differential entropy and the mutual information (Shannon, 1948; Cover and Thomas, 1991; Gray, 1990; Csiszár and Shields, 2004), is fundamentally related to the problem of distribution (density) estimation (Devroye and Györfi, 1985; Devroye and Lugosi, 2001), as these information measures are functionals of a probability distribution. These two important learning scenarios are well understood and have been systematically studied by the statistical learning community. Density estimation, when posed as a histogram-based problem, has been characterized extensively in the literature, where strong consistency in the $L_1$ sense is well understood (Devroye and Györfi, 1985). Necessary and sufficient conditions are known, in particular, for product non-adaptive histogram-based estimates (Abou-Jaoude, 1976; see also Devroye and Györfi, 1985). In recent years, some extensions have been derived using data-dependent partitions (Lugosi and Nobel, 1996) and the family of histogram-based estimators proposed by Barron et al. (1992). In the particular case of the Barron-type histogram-based estimator, research has addressed consistency under topologically stronger notions, such as consistency in direct information divergence (I-divergence) (Barron et al., 1992; Györfi and Van der Meulen, 1994), in $\chi^2$-divergence and expected $\chi^2$-divergence (Györfi et al., 1998; Vajda and Van der Meulen, 2001), and in the general family of Csiszár's $f$-divergences (Berlinet et al., 1998).

For the estimation of information measures, there is a large body of literature dealing with mutual information (MI) and Shannon differential entropy estimation for distributions defined on a finite-dimensional Euclidean space $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ (see Beirlant et al., 1997, and references therein for an excellent review). In particular, consistency is well known for histogram-based and
* Corresponding author.
E-mail addresses: [email protected], [email protected] (J.F. Silva), [email protected] (P. Parada).
URL: http://www.ids.uchile.cl/~josilva/ (J.F. Silva).
doi:10.1016/j.jspi.2012.02.023
kernel plug-in estimates (Beirlant et al., 1997; Györfi and Van der Meulen, 1987). In the case of histogram-based estimators, the standard approach considers non-adaptive product partitions (Beirlant et al., 1997), and some extensions have been proposed for data-driven partitions (Darbellay and Vajda, 1999; Silva and Narayanan, 2010b).

A natural question to ask is whether there is a connection between the many flavors of consistency for density estimation (in total variation, or $L_1$ in the case of distributions absolutely continuous with respect to the Lebesgue measure; in direct and reverse I-divergence; in Csiszár's $f$-divergence) and the problem of estimating Shannon information measures. We are interested in knowing what flavor of consistency for the density estimate, if any, is sufficient, or needed, to achieve a strongly consistent estimate of the differential entropy. A version of this question was originally stated by Györfi and Van der Meulen (1987). Based on their results (on two histogram-based constructions), they conjectured that extra conditions are always needed to make $L_1$-consistent histogram-based density estimates consistent for the differential entropy. Silva and Narayanan (2010a, 2010b) found congruent results when working with data-dependent partitions in the context of MI and Kullback–Leibler divergence (KLD) estimation. In particular, they found stronger conditions for estimating MI and KLD than those obtained for a consistent estimation of the underlying density in the $L_1$ sense (Lugosi and Nobel, 1996). These findings, although interesting, are partial in the sense that they are valid only for specific constructions (estimators), and consequently general conclusions cannot be derived from them. To the best of our knowledge, the stipulation of concrete results connecting the topics of information measure estimation and density estimation remains an open problem. Such results would provide cross-fertilization between these two important lines of research, which to our knowledge have mostly developed as independent tracks.

Moving in this direction, this work studies the Shannon differential entropy as a functional on the space of probability distributions, in particular in terms of its convergence properties with respect to deterministic sequences of measures. This is the basic ingredient for understanding consistency, since in the learning scenario we also have sequences of measures, although they are random objects driven by an empirical process (Devroye and Györfi, 1985). Along these lines, Piera and Parada (2009) recently studied this problem and derived a number of conditions on a sequence of probability measures $\{P_n : n\in\mathbb{N}\}$ and the limiting distribution $P$ that guarantee $\lim_{n\to\infty} H(P_n) = H(P)$. In the first part of this work, we revisit, refine, and extend these convergence results. From them, we derive concrete relationships between convergence in (reverse and direct) I-divergence and the convergence of the Shannon differential entropy. These relationships are obtained under different settings, varying from stronger to weaker conditions on the limiting distribution, and correspondingly from weaker to stronger conditions on the way the sequence converges to $P$. Interestingly, in many of these settings, convergence in I-divergence suffices to guarantee the convergence of the differential entropy.
The results ratify the conjecture raised by Györfi and Van der Meulen (1987), in the sense that convergence in total variation is not sufficient to obtain convergence of the Shannon differential entropy in the continuous alphabet case. These findings also agree with recent results that demonstrate the discontinuity of the Shannon measures in the countable alphabet scenario (Ho and Yeung, 2009, 2010).

In the second part of this article, we apply these convergence results to the problem of histogram-based estimation. Specifically, we study four particular estimators: the classical product-type partition estimator (Abou-Jaoude, 1976), the data-driven partition estimator (Lugosi and Nobel, 1996), the Barron histogram-based estimator (Barron et al., 1992), and the histogram-based estimator of Györfi and Van der Meulen (1987). We derive new density-free strong consistency results for each estimator, either for the density estimation problem (in the sense of I-divergence) or for the Shannon differential entropy estimation problem.

The rest of the paper is organized as follows. Section 2 introduces the notation and background needed for the rest of the exposition. Section 3 addresses the convergence of the Shannon differential entropy. Section 4 presents applications of the results to the two previously mentioned statistical learning scenarios. Finally, some of the proofs are presented in the Appendix.
2. Preliminaries

We start with some basic notation and definitions needed for the rest of the exposition. Let $(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ denote the standard $d$-dimensional Euclidean measurable space equipped with the Borel sigma field (Halmos, 1950; Breiman, 1968). Let $\mathcal{X}\in\mathcal{B}(\mathbb{R}^d)$ be a separable and complete subset of $\mathbb{R}^d$ (i.e., $\mathcal{X}$ is a Polish subspace of $\mathbb{R}^d$). For this space, let $\mathcal{P}(\mathcal{X})$ be the collection of probability measures on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$, and let $AC(\mathcal{X})\subset\mathcal{P}(\mathcal{X})$ denote the set of probability measures absolutely continuous with respect to $\lambda$, the Lebesgue measure.¹ For any $\mu\in AC(\mathcal{X})$, $\frac{d\mu}{d\lambda}(x)$ denotes the Radon–Nikodym (RN) derivative of $\mu$ with respect to $\lambda$. In addition, let $AC^+(\mathcal{X})$ denote the collection of probability measures $\mu\in AC(\mathcal{X})$ for which $\frac{d\mu}{d\lambda}(x)$ is strictly positive Lebesgue-almost everywhere in $\mathcal{X}$, i.e., $\mathrm{support}(d\mu/d\lambda)$ differs from $\mathcal{X}$ in a set of Lebesgue measure zero.² Note that when $\mu\in AC^+(\mathcal{X})$, $\mu$ and $\lambda$ are mutually absolutely continuous on $\mathcal{X}$, and consequently $\frac{d\lambda}{d\mu}(x)$ is well-defined and, furthermore, equal to $(\frac{d\mu}{d\lambda}(x))^{-1}$ for Lebesgue-almost every (Lebesgue-a.e.) point $x\in\mathcal{X}$.

¹ A measure $\sigma$ is absolutely continuous with respect to a measure $\mu$, denoted by $\sigma\ll\mu$, if for any event $A$ such that $\mu(A)=0$ we have $\sigma(A)=0$. Consequently, $d\sigma/d\mu$, the Radon–Nikodym derivative or density, is well-defined and, furthermore, $\forall A\in\mathcal{B}(\mathcal{X})$, $\sigma(A)=\int_A (d\sigma/d\mu)\,d\mu$.
² Let $f:\mathcal{X}\to\mathbb{R}$ be a real function; then its support is the closure of the set $\{x : f(x)>0\}$.
Let $M(\mathcal{X})$ denote the space of measurable functions from $\mathcal{X}$ to $\mathbb{R}$. Then for every $\mu\in AC(\mathcal{X})$ let us define

$L_1(d\mu) = \left\{ f\in M(\mathcal{X}) : \int_{\mathcal{X}} |f|\,d\mu < \infty \right\},$  (1)

the space of $\mu$-integrable functions. Consequently, the $L_1$-norm of $f\in L_1(d\mu)$ is

$\|f\|_{L_1(d\mu)} = \int_{\mathcal{X}} |f|\,d\mu.$  (2)

In addition, $L_\infty(d\mu)\subset M(\mathcal{X})$ denotes the space of functions bounded $\mu$-almost everywhere, where for every $f\in L_\infty(d\mu)$ we can define

$\|f\|_{L_\infty(d\mu)} = \inf\{M>0 : \mu(f^{-1}([-M,M]^c)) = 0\} < \infty.$  (3)
2.1. Total variation and Kullback–Leibler divergence

Let $\nu$ and $\mu$ be two probability measures in $\mathcal{P}(\mathcal{X})$. The total variation distance between $\nu$ and $\mu$ is given by

$V(\nu,\mu) = \sup_{A\in\mathcal{B}(\mathcal{X})} |\nu(A)-\mu(A)|,$  (4)

which is a metric on $\mathcal{P}(\mathcal{X})$ and has been widely adopted as an error criterion in statistics, for instance to show consistency in distribution estimation (Devroye and Lugosi, 2001; Devroye and Györfi, 1985). If $\mu$ and $\nu$ are absolutely continuous with respect to $\lambda$, with RN derivatives (densities) $f$ and $g$, respectively, Scheffé's identity (Scheffé, 1947) connects the total variation with the $L_1$-norm of the densities involved (see Devroye and Lugosi, 2001; Devroye and Györfi, 1985). More precisely,

$V(\nu,\mu) = \frac{1}{2}\int_{\mathcal{X}} |f(x)-g(x)|\,d\lambda(x) = \frac{1}{2}\|f-g\|_{L_1(d\lambda)}.$  (5)

A stronger notion of similarity between distributions was proposed by Kullback and Leibler (1951) (see also Kullback, 1958; Csiszár, 1967; Csiszár and Shields, 2004; Gray, 1990) as an indicator of discrimination information in a binary hypothesis testing setting. Considering again $\mu$ and $\nu$ in $\mathcal{P}(\mathcal{X})$, the Kullback–Leibler divergence (KLD) or I-divergence of $\mu$ with respect to $\nu$ is given by (see Gray, 1990; Cover and Thomas, 1991)

$D(\mu\|\nu) = \sup_{\{A_i\}\in\mathcal{Q}(\mathcal{X})} \sum_i \mu(A_i)\log\frac{\mu(A_i)}{\nu(A_i)},$  (6)

where $\mathcal{Q}(\mathcal{X})$ denotes the collection of finite measurable partitions of $\mathcal{X}$. It is well known that $D(\mu\|\nu)\ge 0$, and that $D(\mu\|\nu)=0$ if, and only if, $\mu=\nu$; but since the functional is neither symmetric nor satisfies the triangle inequality, it is not a proper metric. Furthermore, for $D(\mu\|\nu)$ to be well-defined, it is necessary that $\mu\ll\nu$. Pinsker's inequality (see Kullback, 1967; Csiszár, 1967; Csiszár and Shields, 2004) establishes a relationship between the I-divergence and the total variation, namely, $\forall\mu,\nu\in\mathcal{P}(\mathcal{X})$,

$\frac{2}{\ln 2}\, V(\mu,\nu)^2 \le D(\mu\|\nu).$  (7)
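To fix ideas, the following minimal numerical sketch (ours, not part of the original analysis; all numerical choices are illustrative) discretizes two Gaussian densities on a grid, evaluates the total variation via Scheffé's identity (5) and the I-divergence in bits, and checks Pinsker's inequality (7):

```python
# Minimal sketch: total variation, I-divergence and Pinsker's inequality (7)
# for two densities discretized on a grid. Illustrative example, not from the paper.
import numpy as np

def total_variation(p, q, dx):
    # Scheffe's identity (5): V = (1/2) * L1 distance between the densities.
    return 0.5 * np.sum(np.abs(p - q)) * dx

def kl_divergence(p, q, dx):
    # I-divergence D(p||q) in bits; assumes q > 0 wherever p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask])) * dx

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
p = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)          # N(0,1)
q = np.exp(-0.5 * (x - 1.0)**2) / np.sqrt(2.0 * np.pi)  # N(1,1)

V = total_variation(p, q, dx)   # ~0.383
D = kl_divergence(p, q, dx)     # ~0.721 bits
print(V, D, (2.0 / np.log(2)) * V**2 <= D)  # Pinsker: (2/ln 2) V^2 <= D holds
```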
Given $\mu\in\mathcal{P}(\mathcal{X})$, let $AC(\mathcal{X}|\mu)$ denote the collection of probability measures absolutely continuous with respect to $\mu$. Restricted to this set, let us define

$\mathcal{H}(\mathcal{X}|\mu) = \left\{ \sigma\in AC(\mathcal{X}|\mu) : \log\frac{d\sigma}{d\mu}\in L_1(d\sigma) \right\}.$  (8)

$\mathcal{H}(\mathcal{X}|\mu)$ denotes the collection of probability measures for which the I-divergence with respect to $\mu$ is well-defined, and where, alternatively to (6), we have $\forall\sigma\in\mathcal{H}(\mathcal{X}|\mu)$ (Gray, 1990)

$D(\sigma\|\mu) = \int_{\mathcal{X}} \log\frac{d\sigma}{d\mu}\,d\sigma,$  (9)

which is a well-formulated expression since $\sigma\ll\mu$.

2.2. Shannon differential entropy

Let us consider the space

$\mathcal{H}(\mathcal{X}) = \left\{ \mu\in AC(\mathcal{X}) : \log\frac{d\mu}{d\lambda}\in L_1(d\mu) \right\}.$  (10)
Then $\forall\mu\in\mathcal{H}(\mathcal{X})$, we can properly define

$H(\mu) = -\int_{\mathcal{X}} \log\frac{d\mu}{d\lambda}\,d\mu = -\int_{\mathcal{X}} \frac{d\mu}{d\lambda}\log\frac{d\mu}{d\lambda}\,d\lambda < \infty,$  (11)

the Shannon differential entropy of $\mu$ (see Cover and Thomas, 1991; Gray, 1990; Beirlant et al., 1997).
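As a quick numerical sanity check of (11) (our example, not the paper's), the grid evaluation of $H(\mu)$ for a Gaussian density can be compared against the closed form $\frac{1}{2}\log_2(2\pi e\sigma^2)$:

```python
# Sanity check of Eq. (11): numerical differential entropy of N(0, sigma^2)
# on a fine grid versus the closed form. Illustrative example.
import numpy as np

sigma = 2.0
x = np.linspace(-12.0 * sigma, 12.0 * sigma, 200001)
dx = x[1] - x[0]
f = np.exp(-0.5 * (x / sigma)**2) / (sigma * np.sqrt(2.0 * np.pi))

H_numeric = -np.sum(f * np.log2(f)) * dx                 # Eq. (11), in bits
H_closed = 0.5 * np.log2(2.0 * np.pi * np.e * sigma**2)
print(H_numeric, H_closed)                               # both ~3.047 bits
```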
2.3. Convergence

Here we revisit some classical convergence notions for probability measures and their interrelationships. In what follows, we denote the space of continuous and bounded real-valued functions on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$ by $C_b(\mathcal{X})$.

Definition 1. A sequence $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{P}(\mathcal{X})$ is said to converge weakly to $\mu\in\mathcal{P}(\mathcal{X})$ if, $\forall f\in C_b(\mathcal{X})$,

$\lim_{n\to\infty} \int_{\mathcal{X}} f\,d\mu_n = \int_{\mathcal{X}} f\,d\mu.$  (12)
The standard notation is $\mu_n\Rightarrow\mu$. Since $\{\mu_n : n\in\mathbb{N}\}$ and $\mu$ are probability measures (in particular, finite), the Lebesgue integrals in (12) are well-defined for any $f\in C_b(\mathcal{X})$.

Definition 2. A sequence $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{P}(\mathcal{X})$ is said to converge in total variation (or in variation) to $\mu\in\mathcal{P}(\mathcal{X})$ if

$\lim_{n\to\infty} V(\mu_n,\mu) = 0.$  (13)
Finally, we introduce two notions of convergence in I-divergence proposed by Barron et al. (1992).

Definition 3. Let $\{\mu_n : n\in\mathbb{N}\}$ be a sequence in $\mathcal{H}(\mathcal{X}|\mu)$ with $\mu\in\mathcal{P}(\mathcal{X})$. We say that $\{\mu_n : n\in\mathbb{N}\}$ converges to $\mu$ in reverse I-divergence if

$\lim_{n\to\infty} D(\mu_n\|\mu) = 0.$  (14)

Alternatively, let $\{\mu_n : n\in\mathbb{N}\}$ be in $\mathcal{P}(\mathcal{X})$ and $\mu\in\mathcal{P}(\mathcal{X})$ such that $\mu\ll\mu_n$ for all $n$. Then $\{\mu_n : n\in\mathbb{N}\}$ is said to converge to $\mu$ in direct I-divergence if

$\lim_{n\to\infty} D(\mu\|\mu_n) = 0.$  (15)
It is well understood that convergence in total variation implies weak convergence (see Billingsley, 1999) and, from Pinsker's inequality in (7), that convergence in direct or reverse I-divergence is stronger than total variation convergence. New results connecting these notions of convergence of probability measures with the convergence of the Shannon differential entropy functional in (11) are presented in the next section.

3. Convergence of Shannon entropy

We revisit, extend, and refine the results recently presented by Piera and Parada (2009), who studied the convergence of the Shannon entropy in the continuous and discrete alphabet settings. We focus on the continuous scenario introduced in Section 2. The results are presented from stronger to weaker conditions on the limiting measure $\mu\in\mathcal{P}(\mathcal{X})$, while imposing progressively more convergence and bounding conditions on the sequence of measures $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{P}(\mathcal{X})$.

Theorem 1 (Piera and Parada, 2009). Let $\mu$ be a probability measure in $AC^+(\mathcal{X})$ such that $\log(d\mu/d\lambda)\in C_b(\mathcal{X})$, and let us consider a sequence of measures $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{H}(\mathcal{X})$. Then $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{H}(\mathcal{X}|\mu)$, $\mu\in\mathcal{H}(\mathcal{X})$, and the following two conditions are equivalent:

(i) $\mu_n\Rightarrow\mu$ and $H(\mu_n)\to H(\mu)$;
(ii) $\lim_{n\to\infty} D(\mu_n\|\mu) = 0$.

The proof is presented in Piera and Parada (2009, Theorem 1, p. 83). Direct implications of this result are the following.

Corollary 1. Under the setting of Theorem 1, if $\mu_n\Rightarrow\mu$, then

$\lim_{n\to\infty} D(\mu_n\|\mu) = 0 \iff \lim_{n\to\infty} H(\mu_n) = H(\mu).$

Corollary 2. Under the setting of Theorem 1,

$\lim_{n\to\infty} D(\mu_n\|\mu) = 0 \implies \lim_{n\to\infty} H(\mu_n) = H(\mu).$

Hence, under the condition that $\log(d\mu/d\lambda)$ is a bounded and continuous function, there is a close connection between convergence in reverse I-divergence and the convergence of the Shannon entropy. In fact, from Corollary 1, equivalence is obtained under the weak convergence of $\{\mu_n : n\in\mathbb{N}\}$ to the limiting measure $\mu$.
In the following result, we relax the continuity condition on $\log(d\mu/d\lambda)$ at the expense of imposing a stronger convergence on $\{\mu_n : n\in\mathbb{N}\}$.

Theorem 2. Let $\mu$ be in $AC^+(\mathcal{X})$ with the condition that $\log(d\mu/d\lambda)\in L_\infty(d\lambda)$, and let us consider $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{H}(\mathcal{X})$. Then $\mu_n\in\mathcal{H}(\mathcal{X}|\mu)$ for all $n$, and the following two conditions are equivalent:

(i) $\lim_{n\to\infty} V(\mu_n,\mu) = 0$ and $H(\mu_n)\to H(\mu)$;
(ii) $\lim_{n\to\infty} D(\mu_n\|\mu) = 0$.

From Theorem 2 we can derive results analogous to Corollaries 1 and 2, connecting the convergence in reverse I-divergence with the convergence of the Shannon differential entropy, assuming only that $\log(d\mu/d\lambda)$ is bounded Lebesgue-a.e. in $\mathcal{X}$.

Proof. First note that $D(\mu_n\|\mu) = \int_{\mathcal{X}}\log\frac{d\mu_n}{d\mu}\,d\mu_n = -H(\mu_n) - \int_{\mathcal{X}}\log\frac{d\mu}{d\lambda}\,d\mu_n < \infty$, since $\mu_n\in\mathcal{H}(\mathcal{X})$ and $\log(d\mu/d\lambda)$ is bounded $\mu_n$-a.e. (because $\mu_n\ll\lambda$). Note that $\log(d\mu/d\lambda)\in L_\infty(d\lambda)$ implies $\mu\in\mathcal{H}(\mathcal{X})$, so we can use the following expansion:

$D(\mu_n\|\mu) = H(\mu) - H(\mu_n) + \int_{\mathcal{X}}\log\frac{d\mu}{d\lambda}\left[\frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda}\right]d\lambda,$  (16)

where the magnitude of the last term on the right-hand side (RHS) of (16) is upper bounded by $2\|\log(d\mu/d\lambda)\|_{L_\infty(d\lambda)}\,V(\mu_n,\mu)$, which suffices to conclude the result from (16). □

In addition to the assumptions of Theorem 2, the next result establishes conditions on $\{\mu_n : n\in\mathbb{N}\}$ under which convergence in variation is also sufficient to guarantee the convergence of the Shannon differential entropy. The idea, originally explored in Piera and Parada (2009, Theorem 2), is to state conditions under which the reverse I-divergence is dominated by the variational distance; together with Eq. (7), this makes both convergence criteria equivalent. A refined version of this result is presented below.

Theorem 3. Let us assume the conditions on $\{\mu_n : n\in\mathbb{N}\}$ and $\mu$ stipulated in Theorem 2. In addition, let us consider that the uniform bounding conditions (UBCs) in (17) are satisfied for Lebesgue-almost every (a.e.) point in $\mathcal{X}$:

$\inf_n \frac{d\mu_n}{d\mu}(x) > 0, \qquad \sup_n \frac{d\mu_n}{d\mu}(x) < \infty.$  (17)

Then $\lim_{n\to\infty} V(\mu_n,\mu) = 0$ is equivalent to $\lim_{n\to\infty} D(\mu_n\|\mu) = 0$ and, furthermore, $\lim_{n\to\infty} V(\mu_n,\mu) = 0$ implies that $\lim_{n\to\infty} H(\mu_n) = H(\mu)$.

Proof. First note that $\mu_n\in\mathcal{H}(\mathcal{X}|\mu)$ (from Theorem 2). Then we get pointwise, from the triangle inequality, that

$\left|\frac{d\mu}{dx}\log\frac{d\mu}{dx} - \frac{d\mu_n}{dx}\log\frac{d\mu_n}{dx}\right| \le \left|\log\frac{d\mu}{dx}\right|\left|\frac{d\mu}{dx} - \frac{d\mu_n}{dx}\right| + \frac{d\mu_n}{dx}\left|\log\frac{d\mu_n}{d\mu}\right|,$  (18)

where $\frac{d\mu_n}{dx}(x)$ is shorthand notation for $\frac{d\mu_n}{d\lambda}(x)$. Let us concentrate on the last term of the RHS of (18). By the hypotheses, there exists $N_{\min}>0$ such that $\inf_n \frac{d\mu_n}{d\mu}(x) > N_{\min}$ Lebesgue-a.e., and consequently we get

$\left|\log\frac{d\mu_n}{d\mu}\right| \le \frac{|\log N_{\min}|}{|N_{\min}-1|}\left|\frac{d\mu_n}{d\mu} - 1\right|,$  (19)

Lebesgue-almost surely (a.s.), from the fact that $|\log(x)| \le \frac{|\log(x_0)|}{|x_0-1|}|x-1|$ for all $x\in(x_0,\infty)$ (Piera and Parada, 2009, p. 85). In addition, since $\log(d\mu/dx)\in L_\infty(d\lambda)$, we have

$\left|\frac{d\mu_n}{d\mu} - 1\right| \le \left\|\left(\frac{d\mu}{dx}\right)^{-1}\right\|_{L_\infty(d\lambda)}\left|\frac{d\mu_n}{dx} - \frac{d\mu}{dx}\right|$  (20)

Lebesgue-a.e. Combining (19) and (20), we obtain

$\frac{d\mu_n}{dx}\left|\log\frac{d\mu_n}{d\mu}\right| \le \frac{|\log N_{\min}|}{|N_{\min}-1|}\,\frac{d\mu_n}{d\mu}\left|\frac{d\mu_n}{dx} - \frac{d\mu}{dx}\right| \le N_{\max}\,\frac{|\log N_{\min}|}{|N_{\min}-1|}\left|\frac{d\mu_n}{dx} - \frac{d\mu}{dx}\right|,$  (21)

Lebesgue-a.e., where $N_{\max}<\infty$ is a constant such that $\sup_n \frac{d\mu_n}{d\mu}(x) < N_{\max}$ Lebesgue-a.e. Integrating both sides of (21) with respect to $\lambda$, we get one of the results, i.e., $D(\mu_n\|\mu) \le K(N_{\min},N_{\max})\,V(\mu_n,\mu)$ for all $n>0$, where $K(N_{\min},N_{\max})$ denotes twice the constant in (21) (absorbing the factor of Scheffé's identity (5)). Finally, integrating (18) with respect to $\lambda$, we have, $\forall n>0$,

$|H(\mu_n) - H(\mu)| \le \left[2\left\|\log\frac{d\mu}{dx}\right\|_{L_\infty(d\lambda)} + K(N_{\min},N_{\max})\right]V(\mu_n,\mu),$  (22)

which concludes the proof. □
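The mechanism behind Theorem 3 can be observed numerically. The following simulation (our construction, with arbitrary illustrative densities on $[0,1]$) builds a sequence $\mu_n$ whose density ratios with respect to $\mu$ satisfy the UBCs in (17) and verifies that $V(\mu_n,\mu)$, $D(\mu_n\|\mu)$ and $|H(\mu_n)-H(\mu)|$ vanish together:

```python
# Illustration of Theorem 3 (our construction): f_n = (1 - 1/n) f + (1/n) g on [0,1],
# with f, g bounded away from 0, so the ratios f_n/f satisfy the UBCs in (17).
import numpy as np

x = np.linspace(0.0, 1.0, 10001)
dx = x[1] - x[0]
f = 0.5 + x        # limiting density on [0,1], bounded in [0.5, 1.5]
g = 1.5 - x        # perturbing density, also bounded in [0.5, 1.5]

def entropy_bits(p):
    return -np.sum(p * np.log2(p)) * dx

for n in [2, 10, 100, 1000]:
    fn = (1.0 - 1.0 / n) * f + (1.0 / n) * g
    V = 0.5 * np.sum(np.abs(fn - f)) * dx        # total variation, Eq. (5)
    D = np.sum(fn * np.log2(fn / f)) * dx        # reverse I-divergence, in bits
    print(n, V, D, abs(entropy_bits(fn) - entropy_bits(f)))  # all vanish with n
```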
Note that the UBCs imposed on $d\mu_n/d\mu$ (see (17)) make total variation convergence equivalent to the in general stronger reverse I-divergence convergence. We conjecture that, given the nature of the log functional, these types of uniform bounding conditions are always needed to pass from convergence in variation to convergence of the entropy functional.

Regarding the application of these results to learning settings (density estimation and differential entropy estimation), it is desirable to impose as few conditions as possible (ideally none) on the limiting distribution $\mu$, in order to achieve universal or density-free consistency (Devroye and Lugosi, 2001; Devroye and Györfi, 1985). A result in that direction was presented in Piera and Parada (2009), after introducing a stronger pointwise convergence of $\{\mu_n : n\in\mathbb{N}\}$ to the limit $\mu$.

Theorem 4 (Piera and Parada, 2009). Let $\mu$ be in $AC(\mathcal{X})$ such that $H(\mu)<\infty$, and let us assume that $\{\mu_n : n\in\mathbb{N}\}$ belongs to $AC(\mathcal{X}|\mu)$. If

$\lim_{n\to\infty} \frac{d\mu_n}{d\mu}(x) = 1$

for $\mu$-almost every $x$ in $\mathcal{X}$, and

$\sup_{n>0} \left\|\frac{d\mu_n}{d\mu}\right\|_{L_\infty(d\mu)} < \infty,$

then $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{H}(\mathcal{X}|\mu)\cap\mathcal{H}(\mathcal{X})$ and

$\lim_{n\to\infty} D(\mu_n\|\mu) = 0 \quad\text{and}\quad \lim_{n\to\infty} H(\mu_n) = H(\mu).$

The proof is based purely on the dominated convergence theorem (see Halmos, 1950; Breiman, 1968; Varadhan, 2001), for which the bounding condition and the pointwise convergence condition (both with respect to the measure $\mu$) are strictly needed. The proof of Theorem 4 is presented in Piera and Parada (2009, Theorem 3, p. 87).

Remark 1. It is important to mention that the pointwise convergence of $\frac{d\mu_n}{d\mu}(x)$ is stronger than convergence in total variation (from Scheffé's lemma, Scheffé, 1947; see Devroye and Lugosi, 2001), used in the previous results. Thus it can be interpreted as the penalty we have to pay to reduce the constraints on $\mu$. It is also interesting to note that this pointwise convergence agrees with the sufficient conditions found by Silva and Narayanan (2010a, 2010b) to obtain density-free convergence of the MI and I-divergence functionals in a histogram-based learning scenario.

3.1. Studying the connections with the direct I-divergence: $D(\mu\|\mu_n)$

The results presented so far show a number of concrete connections between the convergence of the Shannon entropy and the convergence of the involved distributions in the reverse I-divergence sense. A natural question is whether similar relationships can be established with respect to the direct I-divergence instead. The previously mentioned results in Theorems 1–3 were obtained working mainly with the following equality:

$H(\mu) - H(\mu_n) = D(\mu_n\|\mu) + \int_{\mathcal{X}}\log\frac{d\mu}{d\lambda}\left[\frac{d\mu_n}{d\lambda} - \frac{d\mu}{d\lambda}\right]d\lambda.$  (23)

In the following, we work with

$H(\mu_n) - H(\mu) = D(\mu\|\mu_n) + \int_{\mathcal{X}}\log\frac{d\mu_n}{d\lambda}\left[\frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda}\right]d\lambda$  (24)

as a natural way to relate the Shannon entropy to the direct I-divergence. Using the same measure-theoretic tools in this new scenario, the basic requirements are that $\mu\ll\mu_n$, $\forall n$, together with two flavors of uniform bounding conditions on $\{\log(d\mu_n/dx) : n\in\mathbb{N}\}$.

Theorem 5. Let $\mu\in\mathcal{H}(\mathcal{X})$ and let $\mu_n\in AC^+(\mathcal{X})$ be such that $\log(d\mu_n/dx)\in L_\infty(d\lambda)$ for all $n$, and let us assume that

$\sup_{n>0} \left\|\log\frac{d\mu_n}{dx}\right\|_{L_\infty(d\lambda)} < \infty.$

Then $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{H}(\mathcal{X})$, $\mu\in\mathcal{H}(\mathcal{X}|\mu_n)$ for all $n$, and the following two conditions are equivalent:

(i) $\lim_{n\to\infty} V(\mu_n,\mu) = 0$ and $H(\mu_n)\to H(\mu)$;
(ii) $\lim_{n\to\infty} D(\mu\|\mu_n) = 0$.
This result can be considered the counterpart of Theorem 2. With respect to that result, here we relax the conditions on $\mu$ at the expense of a uniform bounding condition on $\{\log(d\mu_n/dx) : n>0\}$, not considered in previous results. The proof is presented in Appendix A. Under the assumptions of Theorem 5, we have the following straightforward corollaries.

Corollary 3. If $\lim_{n\to\infty} V(\mu_n,\mu) = 0$, then $\lim_{n\to\infty} D(\mu\|\mu_n) = 0 \iff \lim_{n\to\infty} H(\mu_n) = H(\mu)$.

Corollary 4. $\lim_{n\to\infty} D(\mu\|\mu_n) = 0 \implies \lim_{n\to\infty} H(\mu_n) = H(\mu)$.

Our next step is to add conditions to Theorem 5 that make total variation convergence equivalent to direct I-divergence convergence (the counterpart of Theorem 3). By this means, we can achieve the objective of convergence of the Shannon differential entropy.

Theorem 6. Let $\mu\in AC^+(\mathcal{X})$ be such that $\log(d\mu/dx)\in L_\infty(d\lambda)$. In addition, let $\mu_n\in AC^+(\mathcal{X})$ with $\log(d\mu_n/dx)\in L_\infty(d\lambda)$ for all $n>0$, and $\sup_{n>0}\|\log(d\mu_n/dx)\|_{L_\infty(d\lambda)} < \infty$. Then the following three convergence criteria are equivalent:

$\lim_{n\to\infty} V(\mu_n,\mu) = 0,$
$\lim_{n\to\infty} D(\mu_n\|\mu) = 0,$
$\lim_{n\to\infty} D(\mu\|\mu_n) = 0.$

In addition, $\lim_{n\to\infty} V(\mu_n,\mu) = 0$ implies that $\lim_{n\to\infty} H(\mu_n) = H(\mu)$.
The proof is presented in Appendix B.

Remark 2. The uniform bounding condition (UBC) of Theorem 6 implies the UBC of Theorem 3; see Appendix B for details.³

It is important to note that, in the process of achieving an equivalence between total variation and direct I-divergence, we end up with stronger conditions than those established in Theorem 3 (the counterpart of this result for the reverse I-divergence). This explains the equivalence among total variation, reverse, and direct I-divergence obtained in Theorem 6. Furthermore, the set of conditions of Theorem 6 is the same as the conditions stipulated in Piera and Parada (2009, Theorem 2), although that result only connects convergence in reverse I-divergence and total variation (a refined version of it is presented in Theorem 3).

Finally, we close this section with the counterpart of Theorem 4, where we explore the pointwise convergence of $\frac{d\mu_n}{d\mu}(x)$ to 1 and the application of the dominated convergence theorem.

Theorem 7. Let us consider $\mu\in\mathcal{H}(\mathcal{X})$ and $\mu_n\in\mathcal{H}(\mathcal{X})$ for all $n$, such that

$\sup_{n>0} \left\|\log\frac{d\mu_n}{dx}\right\|_{L_\infty(d\mu)} < \infty.$

In addition, let us assume that $\mu_n\in AC(\mathcal{X}|\mu)$ and that

$\lim_{n\to\infty} \frac{d\mu_n}{d\mu} = 1$

for $\mu$-almost every point in $\mathcal{X}$. Then $\mu\in\mathcal{H}(\mathcal{X}|\mu_n)$,

$\lim_{n\to\infty} D(\mu\|\mu_n) = 0 \quad\text{and}\quad \lim_{n\to\infty} H(\mu_n) = H(\mu).$

The proof is presented in Appendix C.

Remark 3. The conditions $\mu\in\mathcal{H}(\mathcal{X})$ and $\sup_{n>0}\|\log(d\mu_n/dx)\|_{L_\infty(d\mu)} < \infty$ of Theorem 7 imply the UBC of Theorem 4, i.e., $\sup_{n>0}\|d\mu_n/d\mu\|_{L_\infty(d\mu)} < \infty$. This shows again that establishing a connection between the direct I-divergence and the Shannon entropy requires stronger conditions than doing so with its counterpart, the reverse I-divergence, under the same measure-theoretic tools used to construct both sets of results. Finally, integrating Theorems 4 and 7, we can state the following result.

Corollary 5. Under the setting and assumptions of Theorem 7, we have that $\mu\in\mathcal{H}(\mathcal{X}|\mu_n)$ for all $n>0$, $\{\mu_n : n\in\mathbb{N}\}\subset\mathcal{H}(\mathcal{X}|\mu)$, and

$\lim_{n\to\infty} D(\mu\|\mu_n) = 0, \quad \lim_{n\to\infty} D(\mu_n\|\mu) = 0 \quad\text{and}\quad \lim_{n\to\infty} H(\mu_n) = H(\mu).$
3.2. Discussion on the convergence results

Some remarks and discussion about the results presented in this section follow. It is important to mention that, in order to achieve convergence of the Shannon entropy (see Theorems 3, 4, 6 and 7), we ended up dominating both the reverse and the direct I-divergence, respectively, by the total variation, which made those convergence notions equivalent.

³ More precisely, $\sup_{n>0}\|\log(d\mu_n/dx)\|_{L_\infty(d\lambda)} < \infty$ and $\log(d\mu/dx)\in L_\infty(d\lambda)$ imply that $\sup_{n>0}\|\log(d\mu_n/d\mu)\|_{L_\infty(d\lambda)} < \infty$, and this last expression is equivalent to the UBC of Theorem 3.
This was done by means of uniform bounding conditions (UBCs) on $\{\log(d\mu_n/dx) : n\in\mathbb{N}\}$ and $\{\log(d\mu_n/d\mu) : n\in\mathbb{N}\}$, respectively (Lebesgue or $\mu$ almost everywhere, depending on the set of results). In light of this, we strongly believe that these kinds of UBCs are needed, in addition to total variation convergence, to achieve convergence of the Shannon entropy functional. In that sense, this set of results echoes and ratifies the conjecture stated by Györfi and Van der Meulen (1987): extra conditions are needed to get the convergence of the differential entropy, beyond the conditions needed to achieve convergence of the respective densities (in total variation).

Another point worth emphasizing is that the set of conditions connecting the reverse I-divergence with the Shannon entropy convergence (Section 3) is weaker than its counterpart connecting the Shannon entropy with the direct I-divergence (Section 3.1). Thus, the reverse I-divergence provides the more natural criterion to connect convergence of distributions with convergence of information functionals. In fact, in the applications of these results to statistical learning, the topic of the next section, the results involving the reverse I-divergence will be used exclusively, as the UBCs on $\{\log(d\mu_n/dx) : n\in\mathbb{N}\}$ (required by Theorems 5–7) are hard (or even impossible) to verify in the concrete learning contexts presented next.

4. Applications to statistical learning

The results of the previous sections are now contextualized in the scenario where the distribution sequence $\{\mu_n : n\in\mathbb{N}\}$ is not deterministic, but is driven by an empirical process, i.e., independent and identically distributed (i.i.d.) realizations of an underlying measure $\mu\in\mathcal{P}(\mathcal{X})$. In particular, we are interested in the learning problem of inferring $\mu$ and $H(\mu)$ from the sequence of induced measures $\{\mu_n : n\in\mathbb{N}\}$, almost surely with respect to the empirical process distribution. We concentrate on the connections between density estimation (in reverse I-divergence) and differential entropy estimation, mainly adopting the results of Theorem 4 and also Theorem 2. The reason is that Theorem 4 imposes the weakest set of conditions on the limiting distribution $\mu$. Concerning the learning techniques used to induce $\{\mu_n : n\in\mathbb{N}\}$ from the empirical data, we focus on instances of the histogram-based density estimator. (We refer the reader to Devroye and Lugosi, 2001; Devroye and Györfi, 1985; Lugosi and Nobel, 1996; Barron et al., 1992 for excellent expositions of these techniques.)

4.1. Preliminaries and problem setting

Let $\mu$ be a probability measure on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$, with $\mathcal{X}$ a Polish subspace of $\mathbb{R}^d$. Let us denote by $X_1, X_2, X_3,\ldots$ the empirical process induced from i.i.d. realizations of a random variable driven by $\mu$, i.e., $X_i\sim\mu$ for all $i$. Let $\mathbb{P}_\mu$ denote the empirical process distribution on the space of sequences $(\mathcal{X}^\infty,\mathcal{B}(\mathcal{X}^\infty))$ and $\mathbb{P}_\mu^n$ the finite-block distribution of $X_1^n\equiv(X_1,\ldots,X_n)$ on the product space $(\mathcal{X}^n,\mathcal{B}(\mathcal{X}^n))$.⁴ Given a realization of $X_1, X_2, X_3,\ldots$, the idea is to estimate $\mu$ from this empirical data, where we consider a histogram-based approach. More precisely, let $\pi_n = \{A_1,\ldots,A_{m_n}\}$ be in $\mathcal{Q}(\mathcal{X})$, the collection of finite measurable partitions of $\mathcal{X}$, with $|\pi_n| = m_n$ cells. Then the standard empirical distribution restricted to the events in $\pi_n$ is given by

$\mu_n(A) = \frac{1}{n}\sum_{k=1}^{n}\mathbb{1}_A(X_k), \quad \forall A\in\pi_n,$  (25)
where $\mathbb{1}_A(x)$ denotes the indicator function of the set $A$. From this measure defined on $(\mathcal{X},\sigma(\pi_n))$,⁵ we can define the Radon–Nikodym (RN) derivative (or density) with respect to $\lambda$ by

$\frac{d\mu_n}{d\lambda}(x) = \sum_{A\in\pi_n}\mathbb{1}_A(x)\,\frac{\mu_n(A)}{\lambda(A)}, \quad \forall x\in\mathcal{X},$  (26)

or, alternatively,

$\frac{d\mu_n}{d\lambda}(x) = \frac{\mu_n(A_n(x))}{\lambda(A_n(x))}, \quad \forall x\in\mathcal{X},$  (27)

where $A_n(x)$ denotes the set in $\pi_n$ that contains the point $x$, which is well-defined since $\pi_n$ is a partition of $\mathcal{X}$. Consequently, $\forall B\in\mathcal{B}(\mathcal{X})$,

$\mu_n(B) = \int_B \frac{d\mu_n}{d\lambda}(x)\,d\lambda = \sum_{A\in\pi_n}\mu_n(A)\,\frac{\lambda(B\cap A)}{\lambda(A)},$  (28)

which is the histogram-based extension of the empirical distribution to the full space $(\mathcal{X},\mathcal{B}(\mathcal{X}))$, such that it belongs to $AC(\mathcal{X})$. Note that, by construction, $\mu_n$ is a measurable function of $X_1^n$ and is consequently a random object defined on the space $(\mathcal{X}^n,\mathcal{B}(\mathcal{X}^n),\mathbb{P}_\mu^n)$; however, for the sake of simplicity, this dependency will be left implicit. Since $\mu_n\ll\lambda$, we can compute the Shannon differential entropy by

$H(\mu_n) = -\int_{\mathcal{X}}\frac{d\mu_n}{d\lambda}(x)\log\frac{d\mu_n}{d\lambda}(x)\,d\lambda(x) = -\sum_{A\in\pi_n}\mu_n(A)\log\frac{\mu_n(A)}{\lambda(A)},$  (29)

which is well-defined as long as the partition $\pi_n$ is constructed so that $\lambda(A)>0$, $\forall A\in\pi_n$, and $|\pi_n| = m_n < \infty$.

⁴ $\mathcal{B}(\mathcal{X}^n)$ denotes the Borel sigma field of the product space $\mathcal{X}^n$, i.e., the smallest sigma field that contains the sets in $\mathcal{B}(\mathcal{X})\times\cdots\times\mathcal{B}(\mathcal{X})$.
⁵ $\sigma(\pi_n)$ denotes the smallest sigma field that contains $\pi_n$.
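The construction (25)–(29) is straightforward to implement. The following sketch (ours; the sample size and partition size are illustrative, not prescribed by the theory) computes the empirical masses on a fixed partition in $d=1$ and evaluates the plug-in entropy (29):

```python
# Histogram-based construction (25)-(29) in d = 1 (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=20000)   # i.i.d. sample; mu = Unif[0,1], H(mu) = 0

m_n = 32                                 # number of cells |pi_n| (illustrative)
edges = np.linspace(0.0, 1.0, m_n + 1)
counts, _ = np.histogram(X, bins=edges)

mu_n = counts / counts.sum()             # empirical cell masses, Eq. (25)
lam = np.diff(edges)                     # Lebesgue measure of each cell

# Plug-in entropy, Eq. (29): H(mu_n) = -sum_A mu_n(A) log2(mu_n(A)/lambda(A)).
nz = mu_n > 0
H_n = -np.sum(mu_n[nz] * np.log2(mu_n[nz] / lam[nz]))
print(H_n)                               # close to H(mu) = 0 bits
```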
On the other hand, assuming that $\mathrm{support}(\mu)=\mathcal{X}$ and that $\mu\in AC^+(\mathcal{X})$, we have $\mu_n\ll\mu$, $\mathbb{P}_\mu$-a.s. Hence we can consider

$D(\mu_n\|\mu) = \int_{\mathcal{X}}\log\left(\frac{d\mu_n}{d\lambda}(x)\left(\frac{d\mu}{d\lambda}(x)\right)^{-1}\right)\frac{d\mu_n}{d\lambda}(x)\,d\lambda(x),$  (30)

where all the elements on the RHS of (30) are well-defined. In the sections that follow, we study density-free (universal) conditions on $\pi_n$ that guarantee a strongly consistent estimate of $\mu$ (in I-divergence) and, at the same time, a consistent estimate of $H(\mu)$, as the number of sample points $n$ tends to infinity.

4.2. Data-driven partitions

Data-driven partitions are a special case of histogram-based density estimators, where the partition of $\mathcal{X}$ is adaptively constructed from the data (Lugosi and Nobel, 1996; Nobel, 1996; Devroye et al., 1996). More precisely, let $\pi_n(\cdot):\mathcal{X}^n\to\mathcal{Q}(\mathcal{X})$ denote a partition rule that maps a sequence of points of length $n$ to the space of finite measurable partitions of $\mathcal{X}$. We then define the partition scheme $\Pi = \{\pi_1,\pi_2,\ldots\}$ as the collection of partition rules for all lengths. Given i.i.d. realizations $X_1,\ldots,X_n$ driven by $\mu$, the learning problem uses the data in two stages: first, constructing the partition $\pi_n(X_1,\ldots,X_n)$ and, second, estimating the empirical distribution by (25), (26) and (28). Universal conditions on $\Pi$ that guarantee strong consistency in density estimation (in the $L_1$ sense), classification, regression, and the estimation of mutual information and KL divergence have been established in Lugosi and Nobel (1996), Nobel (1996), Wang et al. (2005), and Silva and Narayanan (2010b, 2010a), respectively.

A new set of results will be presented showing consistency in reverse I-divergence for the density estimation problem, as well as consistency of the differential entropy, adopting in the latter case the plug-in estimate in (29). To obtain these results, we study conditions on $\Pi$ that guarantee that the sufficient conditions of Theorem 4 are satisfied $\mathbb{P}_\mu$-a.s. To state the result, we first need to introduce some terminology and complexity notions for collections of partitions.

Definition 4 (Lugosi and Nobel, 1996). Let $\mathcal{A}\subset\mathcal{Q}(\mathcal{X})$ be a collection of measurable partitions of $\mathcal{X}$. The maximum cell count of $\mathcal{A}$ is given by
$M(\mathcal{A}) = \sup_{\pi\in\mathcal{A}} |\pi|.$  (31)

In addition, let $x_1^n = (x_1,\ldots,x_n)$ be a finite sequence in $\mathcal{X}^n$ and let $\Delta(\mathcal{A},x_1^n)$ denote the number of different partitions of $\{x_1,x_2,\ldots,x_n\}$ induced by the elements of $\mathcal{A}$ (partitions of the form $\{\{x_1,x_2,\ldots,x_n\}\cap B : B\in\pi\}$ with $\pi\in\mathcal{A}$). Then the growth function of $\mathcal{A}$ is given by

$\Delta_n^*(\mathcal{A}) = \sup_{x_1^n\in\mathcal{X}^n}\Delta(\mathcal{A},x_1^n).$  (32)
Definition 5. Let $\Pi = \{\pi_1,\pi_2,\ldots\}$ be a partition scheme. Let us define

$\mathcal{A}_n = \{\pi_n(x_1,\ldots,x_n) : (x_1,\ldots,x_n)\in\mathcal{X}^n\}\subset\mathcal{Q}(\mathcal{X}),$  (33)

the collection of measurable partitions associated with $\pi_n(\cdot)\in\Pi$, and denote by $\pi_n(x|X_1^n)$ the cell of $\pi_n(X_1^n)$ that contains $x\in\mathcal{X}$.

Definition 6. Let $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$ be two sequences of non-negative real numbers. $(a_n)$ dominates $(b_n)$, denoted by $(b_n)\preceq(a_n)$ (or, alternatively, $(b_n)$ is $O(a_n)$), if there exist $C>0$ and $k\in\mathbb{N}$ such that $b_n\le C\,a_n$, $\forall n\ge k$. $(b_n)_{n\in\mathbb{N}}$ and $(a_n)_{n\in\mathbb{N}}$ (both strictly positive) are asymptotically equivalent, denoted by $(b_n)\approx(a_n)$, if there exists $C>0$ such that $\lim_{n\to\infty} a_n/b_n = C$.

Theorem 8. Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$ and let us consider a measure $\mu\in\mathcal{H}(\mathcal{X})$ with $d\mu/dx\in C_b(\mathcal{X})$. Let $\Pi = \{\pi_1(\cdot),\pi_2(\cdot),\ldots\}$ be a partition scheme of $\mathcal{X}$ driven by i.i.d. realizations $X_1, X_2,\ldots$ with $X_i\sim\mu$ for all $i$. If there exists $\tau\in(0,1)$ such that:

(i) $\lim_{n\to\infty}\frac{1}{n^\tau}M(\mathcal{A}_n) = 0$,
(ii) $\lim_{n\to\infty}\frac{1}{n^\tau}\log\Delta_n^*(\mathcal{A}_n) = 0$,
(iii) there exists a sequence $(k_n)_{n\in\mathbb{N}}$ of non-negative numbers, with $(k_n)\approx(n^{0.5+\tau/2})$, such that, $\forall n>0$ and $\forall(x_1,\ldots,x_n)\in\mathcal{X}^n$,
$\inf_{B\in\pi(x_1^n)}\mu_n(B) \ge \frac{k_n}{n},$
(iv) and, $\forall\delta>0$,
$\lim_{n\to\infty}\mu(\{x\in\mathcal{X} : \mathrm{diam}(\pi_n(x|X_1^n)) > \delta\}) = 0, \quad \mathbb{P}_\mu\text{-a.s.},$⁶

⁶ For an event $B\in\mathcal{B}(\mathcal{X})$, its diameter is given by $\mathrm{diam}(B) = \sup_{x,y\in B}\|x-y\|$, with $\|\cdot\|$ denoting the Euclidean norm in $\mathbb{R}^d$.
then

$\lim_{n\to\infty} H(\mu_n) = H(\mu) \quad\text{and}\quad \lim_{n\to\infty} D(\mu_n\|\mu) = 0, \quad \mathbb{P}_\mu\text{-a.s.}$

Proof. From Theorem 4, it is sufficient to verify that $\lim_{n\to\infty}\frac{d\mu_n}{d\mu}(x) = 1$ (in measure with respect to $\mu$) and $\sup_{n>0}\|d\mu_n/d\mu\|_{L_\infty(d\mu)} < \infty$, both $\mathbb{P}_\mu$-a.s. (with respect to the empirical process distribution). First note that $\frac{d\mu_n}{d\mu}(x) = \frac{d\mu_n}{dx}(x)\left(\frac{d\mu}{dx}(x)\right)^{-1}$ is a well-defined expression on $\mathcal{X}$, since $\mu\in AC^+(\mathcal{X})$ and, by construction of the histogram-based approach, $\mu_n\in AC(\mathcal{X})$. If we denote by $A_n(x)$ the cell of $\pi_n(X_1^n)$ that contains $x$, then

$\frac{d\mu_n}{d\mu}(x) = \frac{\mu_n(A_n(x))}{\mu(A_n(x))}\cdot\left[\frac{\mu(A_n(x))}{\lambda(A_n(x))}\left(\frac{d\mu}{dx}(x)\right)^{-1}\right], \quad \forall x.$  (34)

On the RHS of (34) we recognize two terms: first, the estimation error component, which we denote by $\hat m_n(x)\equiv \mu_n(A_n(x))/\mu(A_n(x))$, and, second, the approximation error expression, which we denote by $\tilde m_n(x)\equiv [\mu(A_n(x))/\lambda(A_n(x))]\,(\frac{d\mu}{dx}(x))^{-1}$, which relates to the notion of asymptotic sufficiency of $\pi_n(X_1^n)$ (Csiszár, 1973). Concerning the estimation error, we use the following result.

Proposition 1 (Silva and Narayanan, 2007). Under conditions (i), (ii) and (iii) of Theorem 8,

$\lim_{n\to\infty}\sup_{A\in\pi_n(X_1^n)}\left|\frac{\mu_n(A)}{\mu(A)} - 1\right| = 0, \quad \mathbb{P}_\mu\text{-a.s.}$  (35)

From Proposition 1, we have that $\lim_{n\to\infty}\sup_{x\in\mathcal{X}}\hat m_n(x) = 1$, $\mathbb{P}_\mu$-a.s., which implies that, $\forall\epsilon>0$,

$\lim_{n\to\infty}\{x\in\mathcal{X} : |\hat m_n(x)-1| > \epsilon\} = \emptyset, \quad \mathbb{P}_\mu\text{-a.s.},$

and, by the dominated convergence theorem (Varadhan, 2001), $\forall\epsilon>0$,

$\lim_{n\to\infty}\mu(\{x\in\mathcal{X} : |\hat m_n(x)-1| > \epsilon\}) = 0, \quad \mathbb{P}_\mu\text{-a.s.}$  (36)

In other words, $\hat m_n(x)$ converges to 1 in measure with respect to $\mu$, for $\mathbb{P}_\mu$-almost every realization of the empirical process.

Concerning the approximation error, first note that $\mu\in\mathcal{H}(\mathcal{X})$ implies that $\log(\frac{d\mu}{d\lambda}(x))\in L_1(d\mu)$ and, consequently, $\exists m>0$, $\exists M>0$ such that $\frac{d\mu}{d\lambda}(x)\in[m,M]$ for $\mu$-almost every point in $\mathcal{X}$. Then

$|\tilde m_n(x) - 1| = \frac{\left|\frac{\mu(A_n(x))}{\lambda(A_n(x))} - \frac{d\mu}{d\lambda}(x)\right|}{\frac{d\mu}{d\lambda}(x)} \le \frac{\left|\frac{d\mu}{d\lambda}(\bar x) - \frac{d\mu}{d\lambda}(x)\right|}{m},$  (37)

the last equality by the mean value theorem, with $\bar x\in A_n(x)$. As $\frac{d\mu}{d\lambda}(x)$ is continuous on the compact set $\mathcal{X}$, it is uniformly continuous on $\mathcal{X}$. Consequently, $\forall\epsilon>0$, $\exists\delta>0$ such that, $\forall x\in\mathcal{X}$, $\mathrm{diam}(A_n(x)) < \delta \implies \sup_{\bar x\in A_n(x)}|\frac{d\mu}{d\lambda}(\bar x) - \frac{d\mu}{d\lambda}(x)| < \epsilon$. Then, from (37), $\forall\epsilon>0$ there exists $\delta>0$ such that

$\lim_{n\to\infty}\mu(\{x\in\mathcal{X} : |\tilde m_n(x)-1| > \epsilon\}) \le \lim_{n\to\infty}\mu(\{x\in\mathcal{X} : \mathrm{diam}(A_n(x)) > \delta\}) = 0,$  (38)

the last equality from the shrinking-cell condition in (iv). Since we have the same result for $\hat m_n(x)$ in (36), (34) implies that $\lim_{n\to\infty}\frac{d\mu_n}{d\mu}(x) = 1$, in measure with respect to $\mu$, for $\mathbb{P}_\mu$-almost every sequence $X_1, X_2,\ldots$.

To conclude the proof, we need to verify the UBC of Theorem 4. First note that, by definition, $\|\frac{d\mu_n}{d\mu}\|_{L_\infty(d\mu)} \le \|\tilde m_n\|_{L_\infty(d\mu)}\,\|\hat m_n\|_{L_\infty(d\mu)}$. From Proposition 1, it follows that $\lim_{n\to\infty}\sup_{x\in\mathcal{X}}\hat m_n(x) = 1$, so $\sup_{x\in\mathcal{X}}\hat m_n(x) < 1+\epsilon$, $\mathbb{P}_\mu$-a.s., for sufficiently large $n$; hence there is $N>0$ such that $\sup_{n>N}\|\hat m_n\|_{L_\infty(d\mu)} < \infty$, $\mathbb{P}_\mu$-a.s. On the other hand, from the mean value theorem it is simple to show that $\tilde m_n(x) < M/m$, and consequently $\sup_{n>0}\|\tilde m_n\|_{L_\infty(d\mu)} < \infty$, $\mathbb{P}_\mu$-a.s., which concludes the proof. □

There are two concrete data-driven constructions, Gessaman's statistically equivalent blocks (Gessaman, 1970) and a binary tree-structured partition, for which Theorem 8 has direct implications. These results derive from Silva and Narayanan (2010b) and are briefly stated next.

4.2.1. Gessaman's statistically equivalent blocks

Let us assume that the support of $\mu$ is a closed and bounded rectangle, i.e., $\mathcal{X} = \prod_{j=1}^{d}[L_j,U_j]$. Gessaman's partition sequentially splits every coordinate of $\mathcal{X}$ using axis-parallel hyperplanes, following a statistically equivalent partition principle. More precisely, let $l_n>0$ denote the least number of sample points that we want in every bin of $\pi_n(X_1^n)$, and let us
choose a particular sequential order for the axis coordinates, say the standard order $(1,\ldots,d)$. With that, $T_n = \lfloor (n/l_n)^{1/d}\rfloor$ is the number of splits to create along every coordinate. The inductive construction, illustrated in the sketch below, goes as follows:

- Project the i.i.d. samples $X_1,\ldots,X_n$ onto the first coordinate, which for simplicity we denote by $Y_1,\ldots,Y_n$.
- Compute the order statistics $Y_{(1)}, Y_{(2)},\ldots,Y_{(n)}$, i.e., the permutation of $Y_1,\ldots,Y_n$ such that $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}$; this permutation exists with probability one if $\mu\in AC(\mathcal{X})$ (Devroye et al., 1996). Based on this, the following set of intervals is induced: $\{I_i : i=1,\ldots,T_n\} = \{[L_1, Y_{(s_n)}], (Y_{(s_n)}, Y_{(2s_n)}],\ldots,(Y_{((T_n-1)s_n)}, U_1]\}$, where $s_n = \lfloor n/T_n\rfloor$.
- Assign the samples $X_1,\ldots,X_n$ to the resulting cells, i.e., $\{(I_i\times\mathbb{R}^{d-1})\cap\mathcal{X} : i=1,\ldots,T_n\}$.
- Conduct the same process in each of the cells of $\{(I_i\times\mathbb{R}^{d-1})\cap\mathcal{X} : i=1,\ldots,T_n\}$, projecting its data onto the second coordinate, and iterate the previous steps until the last coordinate.

At the end of this process, we obtain the Gessaman data-dependent partition, denoted by $\pi_n(X_1^n)$.
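A compact implementation sketch of this construction for $d=2$ follows (ours; the value of $l_n$ is an illustrative choice):

```python
# Gessaman's statistically equivalent blocks in d = 2 (illustrative sketch).
import numpy as np

def gessaman_cells(X, l_n):
    """Group the rows of X (shape (n, 2)) into Gessaman cells."""
    n, d = X.shape
    T_n = int(np.floor((n / l_n) ** (1.0 / d)))    # splits per coordinate
    # First coordinate: order statistics induce T_n statistically equivalent slabs.
    slabs = np.array_split(X[np.argsort(X[:, 0])], T_n)
    cells = []
    for slab in slabs:
        # Second coordinate: split each slab into T_n equivalent cells.
        cells.extend(np.array_split(slab[np.argsort(slab[:, 1])], T_n))
    return cells

rng = np.random.default_rng(1)
X = rng.uniform(size=(4096, 2))
cells = gessaman_cells(X, l_n=64)
print(len(cells), min(len(c) for c in cells))   # T_n^2 cells, ~l_n points each
```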
Theorem 9. Under the assumptions of Theorem 8, let us in addition assume that $\mathrm{support}(\mu)=\mathcal{X}$ is a bounded rectangle. If $(l_n)\approx(n^{0.5+\tau/2})$ for some $\tau\in(1/3,1)$, then $\mu_n$, induced from Gessaman's data-driven partition and $X_1, X_2,\ldots$ ($X_i\sim\mu$ for all $i$), is such that

$\lim_{n\to\infty} H(\mu_n) = H(\mu) \quad\text{and}\quad \lim_{n\to\infty} D(\mu_n\|\mu) = 0, \quad \mathbb{P}_\mu\text{-a.s.}$
The proof reduces to verifying the sufficient conditions of Theorem 8 and is included in Silva and Narayanan (2010b, Proof of Theorem 4).

4.2.2. A tree-structured partition (TSP)

Let us consider an instance of a balanced search tree (see Devroye et al., 1996, Chapter 20.3), in particular the binary case. More precisely, given i.i.d. realizations $X_1, X_2,\ldots,X_n$ of $\mu$ with support in a compact rectangle of the form $\mathcal{X} = \prod_{j=1}^{d}[L_j,U_j]\subset\mathbb{R}^d$, this partition scheme chooses a coordinate axis of $\mathbb{R}^d$ in sequential order, say dimension $i$ for the first step, and then the $i$ axis-parallel half-space

$H_i(X_1^n) = \{x\in\mathcal{X} : x(i) \le X_{(\lceil n/2\rceil)}(i)\},$  (39)

where $X_{(1)}(i) < X_{(2)}(i) < \cdots < X_{(n)}(i)$ denotes the order statistics (Abou-Jaoude, 1976) obtained by a permutation of the sample points $\{X_1,\ldots,X_n\}$ projected onto the target dimension $i$. Note that this permutation exists with probability one, as the $i$-marginal distribution of $\mu$ is equipped with a density (Devroye et al., 1996; Lugosi and Nobel, 1996). Using this hyperplane, $\mathcal{X}$ is divided into two statistically equivalent rectangles with respect to the coordinate axis $i$, denoted by $U_{(1,0)} = \mathcal{X}\cap H_i(X_1^n)$ and $U_{(1,1)} = \mathcal{X}\cap H_i(X_1^n)^c$, respectively. Assigning the sample points $(X_1,\ldots,X_n)$ to their respective cells in $\{U_{(1,0)}, U_{(1,1)}\}$, we can choose a new coordinate axis, in the mentioned sequential order, and continue the splitting process independently in each of the two intermediate cells. Continuing this process in an inductive fashion, the termination criterion is a stopping rule that guarantees a minimum number of sample points per cell, denoted by $k_n>0$. At the end of the process, a binary tree-structured partition (TSP) of $\mathcal{X}$ is induced, with almost the same empirical mass in each of its cells; a sketch follows below. Note that the stopping criterion agrees with restriction (iii) of Theorem 8.
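A recursive sketch of this splitting process (ours; the stopping parameter $k_n$ is an illustrative choice) is the following:

```python
# Binary tree-structured partition (TSP): median splits as in (39), cycling through
# the coordinates, stopped so that every cell keeps at least k_n points. Sketch.
import numpy as np

def tsp_cells(X, k_n, axis=0):
    n, d = X.shape
    if n < 2 * k_n:                # stopping rule: both children need >= k_n points
        return [X]
    order = np.argsort(X[:, axis])
    half = int(np.ceil(n / 2))     # empirical median: statistically equivalent halves
    left, right = X[order[:half]], X[order[half:]]
    nxt = (axis + 1) % d           # next coordinate in the sequential order
    return tsp_cells(left, k_n, nxt) + tsp_cells(right, k_n, nxt)

rng = np.random.default_rng(2)
X = rng.uniform(size=(4096, 2))
cells = tsp_cells(X, k_n=128)
print(len(cells), min(len(c) for c in cells))   # balanced cells, each >= k_n points
```

In fact, the stopping condition above, matching restriction (iii) of Theorem 8, is fundamental to obtaining the following consistency result.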
Theorem 10. Under the hypotheses of Theorem 8, and assuming that $\mathcal{X}$ is a bounded rectangle, if $(k_n)\approx(n^{0.5+\tau/2})$ for some $\tau\in(1/3,1)$, the empirical measure $\mu_n$ induced from the TSP satisfies

$\lim_{n\to\infty} H(\mu_n) = H(\mu) \quad\text{and}\quad \lim_{n\to\infty} D(\mu_n\|\mu) = 0, \quad \mathbb{P}_\mu\text{-a.s.}$
The proof again reduces to verifying the conditions of Theorem 8; the argument is contained in Silva and Narayanan (2010b, Proof of Theorem 5).

4.3. The Barron distribution estimator

Consider now the Barron-type histogram-based estimator proposed in Barron et al. (1992). It was originally proposed for the problem of density estimation consistent in direct I-divergence. The application of Theorem 4 to this construction provides a feasible set of conditions under which this estimator is also strongly consistent in reverse I-divergence. In addition, strong consistency for the estimation of the differential entropy can be obtained when adopting the plug-in estimate in (29). This construction assumes a known reference measure $v\in\mathcal{P}(\mathcal{X})$ such that $\mu\ll v$ and $D(\mu\|v)$ is finite (Barron et al., 1992). If we assume that $\mathrm{support}(\mu)=\mathcal{X}$ is compact and, consequently, bounded ($\lambda(\mathcal{X})$ is finite), then we can consider $v = \lambda/\lambda(\mathcal{X})\in\mathcal{P}(\mathcal{X})$ as the reference measure, where $D(\mu\|v) = -H(\mu) + \log(\lambda(\mathcal{X})) < \infty$ when $\mu\in\mathcal{H}(\mathcal{X})$. Barron's construction is based on a finite partition $\pi_n = \{A_{n,1},\ldots,A_{n,m_n}\}\in\mathcal{Q}(\mathcal{X})$ formed by statistically equivalent blocks with respect to the reference measure $v$. More precisely, given a sequence of positive real numbers $(h_n)_{n>0}$, $v(A_{n,j}) = h_n$ for all $j\in\{1,\ldots,m_n\}$, where $h_n = 1/m_n$ and $(m_n)_{n>0}$ denotes the sequence of positive integers associated with $|\pi_n|$. The Barron estimator is a
modification of the classical histogram-based estimator in (28), where the empirical distribution is induced by

$\mu_n^*(B) = (1-a_n)\sum_{j=1}^{m_n}\mu_n(A_{n,j})\,\frac{v(A_{n,j}\cap B)}{v(A_{n,j})} + a_n\,v(B),$  (40)

$\forall B\in\mathcal{B}(\mathcal{X})$, where $\mu_n$ is the standard empirical distribution in (25) and $(a_n)_{n>0}$ is a smoothing sequence with values in $(0,1)$. Note that $\mu_n^*$ is a well-defined measure in $AC^+(\mathcal{X})$ with density

$\frac{d\mu_n^*}{d\lambda}(x) = (1-a_n)\,\frac{\mu_n(A_n(x))}{\lambda(A_n(x))} + \frac{a_n}{\lambda(\mathcal{X})} > 0$  (41)

for all $x\in\mathcal{X}$.
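For concreteness, here is a sketch of (40)–(41) on $\mathcal{X}=[0,1]$ (ours; the rates chosen for $a_n$ and $h_n$ are illustrative and should be checked against the conditions of Theorem 11 below):

```python
# Barron density estimator (40)-(41) on X = [0,1] with reference v = lambda
# (so lambda(X) = 1): m_n = 1/h_n equal-measure cells mixed with the uniform
# density through the smoothing weight a_n. Illustrative parameter choices.
import numpy as np

rng = np.random.default_rng(3)
n = 20000
X = rng.beta(2.0, 5.0, size=n)       # sample from a density supported on [0,1]

m_n = 64                              # = 1/h_n cells, each with v-measure h_n
a_n = n ** -0.25                      # smoothing sequence, o(1)
edges = np.linspace(0.0, 1.0, m_n + 1)
counts, _ = np.histogram(X, bins=edges)
mu_n = counts / n                     # empirical cell masses, Eq. (25)

def barron_density(x):
    # Eq. (41): strictly positive everywhere, so mu << mu_n* by construction.
    j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m_n - 1)
    return (1.0 - a_n) * mu_n[j] * m_n + a_n  # mu_n(A)/lambda(A), lambda(A) = 1/m_n

print(barron_density(np.linspace(0.0, 1.0, 5)))   # everywhere >= a_n > 0
```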
Barron et al. (1992) proposed this mixture construction, with $1>a_n>0$, to guarantee that $\mu\ll\mu_n^*$, and by doing so were able to characterize the direct I-divergence in (15). In addition, this smoothing technique provides a way of controlling the minimum empirical mass of $\mu_n^*$, where by construction in (40), $\mu_n^*(A_{n,j})\ge a_n h_n$ for every cell $A_{n,j}\in\pi_n$. This bounding condition turns out to be the key for controlling estimation error effects in the process of estimating Shannon information measures. In fact, a similar bounding technique has been adopted in Silva and Narayanan (2010a, 2010b) for the consistent estimation of mutual information and KL divergence. Finally, since by assumption $\mu\in AC^+(\mathcal{X})$, $d\mu_n^*/d\mu$ is well-defined on $\mathcal{X}$; consequently, we can also consider the reverse I-divergence in this setting. We can then state the following result.

Theorem 11. Let $\mu$ be in $AC^+(\mathcal{X})\cap\mathcal{H}(\mathcal{X})$ and let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Under the Barron distribution estimator in (40), induced from i.i.d. realizations $X_1, X_2,\ldots$ driven by $\mu$, if

(i) $(a_n)$ is $o(1)$ and $(h_n)$ is $o(1)$,
(ii) $(d\mu/d\lambda)(x)$ is a continuous function on $\mathcal{X}$,
(iii) and there exists $\tau\in(0,\tfrac{1}{2})$ such that $(1/(a_n h_n))$ is $o(n^\tau)$,

then

$\lim_{n\to\infty} H(\mu_n^*) = H(\mu) \quad\text{and}\quad \lim_{n\to\infty} D(\mu_n^*\|\mu) = 0, \quad \mathbb{P}_\mu\text{-a.s.}$
Remark 4. It is worth mentioning that (iii) in Theorem 11 implies the sufficient condition of Barron et al. (1992, Theorem 2, Eq. (3.16)),⁷ and consequently from that result we also have that

$\lim_{n\to\infty} D(\mu\|\mu_n^*) = 0, \quad \mathbb{P}_\mu\text{-a.s.}$
Proof. The problem reduces to verifying that the sufficient conditions of Theorem 4 are met $\mathbb{P}_\mu$-a.s. We consider

$\frac{d\mu_n^*}{d\mu}(x) = \frac{d\mu_n^*}{d\lambda}(x)\left(\frac{d\mu}{d\lambda}(x)\right)^{-1} = \left[(1-a_n)\frac{\mu_n(A_n(x))}{\lambda(A_n(x))} + \frac{a_n}{\lambda(\mathcal{X})}\right]\left(\frac{d\mu}{d\lambda}(x)\right)^{-1}$  (42)

$= \left[\frac{(1-a_n)\frac{\mu_n(A_n(x))}{\lambda(A_n(x))} + \frac{a_n}{\lambda(\mathcal{X})}}{(1-a_n)\frac{\mu(A_n(x))}{\lambda(A_n(x))} + \frac{a_n}{\lambda(\mathcal{X})}}\right]\left[\frac{(1-a_n)\frac{\mu(A_n(x))}{\lambda(A_n(x))} + \frac{a_n}{\lambda(\mathcal{X})}}{\frac{d\mu}{d\lambda}(x)}\right].$  (43)

The first term on the RHS of (43) captures the estimation error, which we denote by $\frac{d\mu_n^*}{d\tilde\mu_n}(x)$, with $\tilde\mu_n$ the probability measure induced by the Barron estimator in (40) when $\mu_n$ is replaced with the true distribution $\mu$. The second term on the RHS of (43) captures the approximation error, which we denote by $\frac{d\tilde\mu_n}{d\mu}(x)$. Let us first focus on the estimation error component:

$\sup_{x\in\mathcal{X}}\left|\frac{d\mu_n^*}{d\tilde\mu_n}(x) - 1\right| \le \sup_{A\in\pi_n}\left|\frac{\mu_n^*(A)}{\tilde\mu_n(A)} - 1\right| \le \sup_{A\in\pi_n}\frac{|\mu_n^*(A) - \tilde\mu_n(A)|}{a_n h_n}.$  (44)

The first inequality in (44) is by definition of the Barron measures, and the second by the fact that $\tilde\mu_n(B)\ge a_n h_n$ for every cell $B\in\pi_n$, from (43). We will use a version of the celebrated Vapnik–Chervonenkis inequality, presented in Silva and Narayanan (2010a, Lemma 2) for the case of mixture distributions, which implies that

$\mathbb{P}_\mu\left(\sup_{A\in\pi_n}|\mu_n^*(A) - \tilde\mu_n(A)| > \epsilon\,a_n h_n\right) \le 8|\pi_n|\exp\left(-\frac{n(\epsilon\,a_n h_n)^2}{32}\right).$  (45)

Consequently, from hypothesis (iii), there exists $t_o\in(0,1)$ such that

$\frac{1}{n^{t_o}}\log\mathbb{P}_\mu\left(\sup_{A\in\pi_n}|\mu_n^*(A) - \tilde\mu_n(A)| > \epsilon\,a_n h_n\right) \le \frac{1}{n^{t_o}}\log(8|\pi_n|) - \frac{(n^{(1-t_o)/2}\,a_n h_n)^2\,\epsilon^2}{32} \to -\infty,$  (46)

⁷ The condition of Barron et al. (1992) requires that $\limsup_{n\to\infty} \frac{1}{h_n a_n n} \le 1$.
for all $\epsilon>0$. Then

$\left(\mathbb{P}_\mu\left(\sup_{A\in\pi_n}|\mu_n^*(A) - \tilde\mu_n(A)| > \epsilon\,a_n h_n\right)\right)_{n>0} \preceq (\exp\{-n^{t_o}\})_{n>0},$  (47)

which from the Borel–Cantelli lemma implies that $\lim_{n\to\infty}\sup_{x\in\mathcal{X}}\frac{d\mu_n^*}{d\tilde\mu_n}(x) = 1$, $\mathbb{P}_\mu$-a.s.

Concerning the approximation error component $\frac{d\tilde\mu_n}{d\mu}(x)$, as $(a_n)$ is $o(1)$, it is sufficient to analyze the following approximation error expression:

$\left|\frac{\frac{\mu(A_n(x))}{\lambda(A_n(x))}}{\frac{d\mu}{d\lambda}(x)} - 1\right| \le \frac{\left|\frac{\mu(A_n(x))}{\lambda(A_n(x))} - \frac{d\mu}{d\lambda}(x)\right|}{m} \le \frac{\sup_{\bar x\in A_n(x)}\left|\frac{d\mu}{d\lambda}(\bar x) - \frac{d\mu}{d\lambda}(x)\right|}{m},$  (48)

where the first inequality in (48) uses $m = \inf_{x\in\mathcal{X}}\frac{d\mu}{d\lambda}(x) > 0$, from the fact that $\frac{d\mu}{d\lambda}(x)$ is continuous and defined on a compact support $\mathcal{X}$, while the second inequality derives from the mean value theorem. It is simple to show that, as $\mathcal{X}$ is bounded and $\pi_n$ is induced from $m_n = 1/h_n$ statistically equivalent blocks with respect to $\lambda$, the fact that $(h_n)$ is $o(1)$ implies that

$\lim_{n\to\infty}\sup_{x\in\mathcal{X}}\mathrm{diam}(A_n(x)) = 0.$  (49)

In other words, $\pi_n$ is a $v$-approximating partition of $\mathcal{X}$ (Csiszár, 1973); see Definition 7 in Section 4.5. From this condition, and the fact that $\frac{d\mu}{d\lambda}(x)$ is uniformly continuous on $\mathcal{X}$, we have that

$\lim_{n\to\infty}\sup_{x\in\mathcal{X}}\left|\frac{d\tilde\mu_n}{d\mu}(x) - 1\right| \le \lim_{n\to\infty}\sup_{A_{n,j}\in\pi_n}\sup_{x,\bar x\in A_{n,j}}\frac{\left|\frac{d\mu}{d\lambda}(\bar x) - \frac{d\mu}{d\lambda}(x)\right|}{m} = 0.$  (50)

Consequently, from (43), (46) and (50), $\lim_{n\to\infty}\sup_{x\in\mathcal{X}}\frac{d\mu_n^*}{d\mu}(x) = 1$, $\mathbb{P}_\mu$-a.s. From the uniform nature of this convergence, it follows directly that $\sup_{n>0}\sup_{x\in\mathcal{X}}\frac{d\mu_n^*}{d\mu}(x)$ is bounded $\mathbb{P}_\mu$-a.s., which concludes the proof. □

4.4. The Györfi and Van der Meulen estimator

In this section we consider one of the density-free histogram-based estimators proposed by Györfi and Van der Meulen (1987) for the problem of differential entropy estimation. Let $\pi_n = \{A_{n,1},\ldots,A_{n,m_n}\}$ be a finite measurable partition of $\mathcal{X}\subset\mathbb{R}^d$, with the condition that $\lambda(A_{n,k})\ge(h_n)^d$ for a given sequence of strictly positive real numbers $(h_n)_{n>0}$. Let $\mu$ be in $AC(\mathcal{X})$ and $X_1,\ldots,X_n$ a sequence of i.i.d. realizations with $X_i\sim\mu$. Györfi and Van der Meulen (1987) proposed the following estimator for the entropy:

$\tilde H_n = -\sum_{i\in F_n}\mu_n(A_{n,i})\log\frac{\mu_n(A_{n,i})}{\lambda(A_{n,i})},$  (51)
where $\mu_n(A_{n,i})$ is the standard empirical measure in (25) and

$F_n \equiv \{i : \mu_n(A_{n,i}) \ge a_n\} \subset \{1,\ldots,m_n\},$  (52)

with $(a_n)_{n>0}$ a sequence of non-negative numbers. We consider a small variation of (51), where we first induce a well-defined density with respect to $\lambda$,

$\bar f_n(x) = \frac{\mu_n(A_n(x))}{\lambda(A_n(x))}\cdot\frac{\mathbb{1}_{\bigcup_{i\in F_n}A_{n,i}}(x)}{\mu_n\left(\bigcup_{i\in F_n}A_{n,i}\right)},$  (53)

and its induced measure

$\bar\mu_n(A) = \int_A \bar f_n(x)\,d\lambda(x), \quad \forall A\in\mathcal{B}(\mathcal{X}),$  (54)

which is then used in the plug-in estimator

$H(\bar\mu_n) = -\int_{\mathcal{X}}\bar f_n(x)\log\bar f_n(x)\,d\lambda(x) = \frac{1}{\mu_n\left(\bigcup_{i\in F_n}A_{n,i}\right)}\,\tilde H_n + \log\mu_n\left(\bigcup_{i\in F_n}A_{n,i}\right).$  (55)
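A sketch of this trimmed construction in $d=1$ follows (ours; the rates for $h_n$ and $a_n$ are illustrative and are not claimed to satisfy the summability condition of Proposition 2 below):

```python
# Gyorfi-Van der Meulen estimator (51)-(55) in d = 1: cells of side h_n, cells with
# empirical mass below a_n are discarded (F_n), retained mass renormalized. Sketch.
import numpy as np

rng = np.random.default_rng(4)
n = 50000
X = rng.normal(0.0, 1.0, size=n)

h_n = n ** -0.2                       # cell side, o(1); illustrative rate
a_n = n ** -0.5                       # trimming threshold, o(1); illustrative rate
edges = np.arange(np.floor(X.min()), np.ceil(X.max()) + h_n, h_n)
counts, _ = np.histogram(X, bins=edges)
p = counts / n                        # empirical cell masses, Eq. (25)

F = p >= a_n                          # retained cells, Eq. (52)
S = p[F].sum()                        # mu_n(union of retained cells)
H_tilde = -np.sum(p[F] * np.log2(p[F] / h_n))    # Eq. (51), in bits
H_bar = H_tilde / S + np.log2(S)                 # plug-in entropy, Eq. (55)
print(H_bar, 0.5 * np.log2(2.0 * np.pi * np.e))  # target: H(N(0,1)) ~ 2.047 bits
```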
Györfi and Van der Meulen (1987, Theorem 2) stipulated conditions on $(h_n)$ and $(a_n)$ that make $\tilde H_n$ a density-free strongly consistent estimate of $H(\mu)$. The next result shows that the plug-in estimate in (55) is strongly consistent under the same conditions.
Proposition 2. Let $\mu$ be in $\mathcal{H}(\mathcal{X})$. If $(a_n)$ is $o(1)$, $1/h_n$ is an integer such that $(h_n)$ is $o(1)$, and

$\sum_{n\ge 0}\frac{1}{a_n h_n^d}\,e^{-c\,n\,a_n h_n^d} < \infty, \quad \forall c>0,$
then

$\lim_{n\to\infty} H(\bar\mu_n) = H(\mu), \quad \mathbb{P}_\mu\text{-a.s.}$
¨ and Van der Meulen (1987, Theorem 2). The argument is presented in The proof derives almost completely from Gyorfi Appendix D. ¨ and Van der Meulen (1987, Section 5, p. 434), the conditions presented in Proposition 2 are stronger As stated in Gyorfi than its counterpart for the problem of estimating m consistently in total variation. Consequently, m n is a strongly consistent estimate (in total variation) of m and, therefore, the application of Theorem 2 implies the following result: Corollary 6. Under the conditions on ðhn Þ and ðan Þ stipulated in Proposition 2, if m 2 AC þ ðXÞ and logðdm=dlÞ 2 L1 ðdlÞ, then m n is strongly consistent in reverse I-divergence, i.e., lim Dðm n JmÞ ¼ 0,
n-1
Pm -a:s:
ð56Þ
The proof derives from Theorem 2 after verifying that $\lim_{n\to\infty} V(\bar\mu_n,\mu) = 0$, $\mathbb{P}_\mu$-a.s. The argument for this last part is presented in Appendix E.

4.5. Classical histogram-based estimator: revisited

We conclude this section by revisiting the classical histogram-based estimator $\mu_n$ in (28), induced from i.i.d. realizations $X_1, X_2,\ldots$ of $\mu$ and with $\pi_n$ a finite measurable partition of $\mathcal{X}\subset\mathbb{R}^d$. As in previous sections, we restrict ourselves to the case where $\mathcal{X} = \mathrm{support}(\mu)$ is a compact subset of $\mathbb{R}^d$. For density estimation based on this construction, strong consistency in total variation is well known (Abou-Jaoude, 1976). Barron et al. (1992) extended this result to the more general problem of distribution estimation and explored conditions for consistency in reverse I-divergence. The following result is a corollary of this seminal work. Before presenting it, we introduce the following approximation goodness property for partitions.

Definition 7 (Csiszár, 1973). Let $\{\pi_n : n\in\mathbb{N}\}$ be a sequence of measurable partitions of $\mathcal{X}$ and let $v$ be a measure on $(\mathcal{X},\mathcal{B}(\mathcal{X}))$. $\{\pi_n : n\in\mathbb{N}\}$ is said to be $v$-approximating if, for all $A\in\mathcal{B}(\mathcal{X})$ with $v(A)<\infty$ and for all $\epsilon>0$, there exists $N$ such that $\forall n>N$ there exists $A_n\in\sigma(\pi_n)$ with $v(A\triangle A_n)<\epsilon$.

Theorem 12 (Barron et al., 1992). Let $\mathcal{X}$ be a compact subspace of $\mathbb{R}^d$ and $\mu\in AC^+(\mathcal{X})$ such that $\frac{d\mu}{d\lambda}(x)$ is a continuous function on $\mathcal{X}$. Let $\{\pi_n : n>0\}\subset\mathcal{Q}(\mathcal{X})$ be a sequence of partitions. If $\{\pi_n : n\in\mathbb{N}\}$ is $\lambda$-approximating in $\mathcal{X}$ and, with $|\pi_n| = m_n$, $(m_n/n)$ is $o(1)$, then

$\lim_{n\to\infty}\mathbb{E}_\mu(D(\mu_n\|\mu)) = 0,$  (57)
where the expected value is taken with respect to $\mathbb{P}_\mu$, the process distribution of $X_1, X_2,\ldots$. This result derives directly from Barron et al. (1992, Theorem 5), which considers the special case where $\mu$ has a compact rectangular support and $\frac{d\mu}{d\lambda}(x)$ is uniformly continuous (see details in Barron et al., 1992, Remarks 9 and 11). The proof of this result controls two terms, associated with an estimation error and a bias error. For the estimation error part, $(m_n/n)$ being $o(1)$ is sufficient to get the result (see Barron et al., 1992, Theorem 3), while for the approximation error the $\lambda$-approximating property of $\{\pi_n : n\in\mathbb{N}\}$ is sufficient (see Barron et al., 1992, Theorem 4 and Remark 11). In fact, Barron et al. (1992, Theorem 5) offers consistency in the sense of (57) under more general conditions than the version stated in Theorem 12. However, for this more constrained setting, we can complement the result and obtain strong consistency in reverse I-divergence.

Theorem 13. Under the setting and assumptions of Theorem 12, if in addition the partition sequence $\{\pi_n : n>0\}$ is such that

$\lim_{n\to\infty}\sup_{A\in\pi_n}\mathrm{diam}(A) = 0,$  (58)

then $\lim_{n\to\infty} D(\mu_n\|\mu) = 0$, $\mathbb{P}_\mu$-a.s.

The proof derives from Theorem 4 and the shrinking-cell condition in (58). Note that (58) is sufficient for $\pi_n$ to be $\lambda$-approximating in our case of separable metric spaces (see the result by Csiszár, 1973). The proof is presented at the end of this section. Finally, from Theorems 2 and 13, the following result can be stated.

Corollary 7. Under the assumptions of Theorem 13, given that $d\mu/d\lambda\in C_b(\mathcal{X})$, we have $\{\mu_n : n>0\}\subset\mathcal{H}(\mathcal{X})$ and

$\lim_{n\to\infty} H(\mu_n) = H(\mu), \quad \mathbb{P}_\mu\text{-a.s.}$  (59)

Remark 5. Note that $(m_n/n)$ being $o(1)$ and the shrinking-cell condition in (58) are the necessary and sufficient conditions for the classical histogram-based estimator to be density-free strongly consistent in the variational distance sense
(Abou-Jaoude, 1976; see also Devroye and Györfi, 1985). Notably, by adding the compact support assumption on $\mu$ and the continuity assumption on its density $(d\mu/d\lambda)(x)$, we are able to match these conditions and achieve the stronger convergence in reverse I-divergence, as well as consistency of the differential entropy (from the plug-in estimator).

Proof. Adopting the decomposition proposed by Barron et al. (1992, Eq. (4.4)), we have that
$$D(\mu_n \| \mu) = \int_X \hat f_n(x) \log \frac{\hat f_n(x)}{f(x)}\, d\lambda(x) = D_{\sigma(\pi_n)}(\mu_n \| \mu) + \int_X \hat f_n(x) \log \frac{\bar f_n(x)}{f(x)}\, d\lambda(x), \qquad (60)$$
with the densities $\hat f_n(x) = \mu_n(A_n(x))/\lambda(A_n(x))$, $\bar f_n(x) = \mathbb{E}_\mu(\hat f_n(x)) = \mu(A_n(x))/\lambda(A_n(x))$ and $f(x) = (d\mu/d\lambda)(x)$, where $A_n(x)$ denotes the cell in $\pi_n$ that contains the point $x$. Note that in (60) the estimation error is characterized by $D_{\sigma(\pi_n)}(\mu_n \| \mu)$, the I-divergence of $\mu_n$ with respect to $\mu$ restricted to the sigma field induced by the partition $\pi_n$ (Gray, 1990). Barron et al. (1992, Proof of Theorem 3, p. 1450) showed that, under the condition that $(m_n/n)$ is $o(1)$, $D_{\sigma(\pi_n)}(\mu_n \| \mu)$ tends to zero $P_\mu$-a.s. We therefore focus on the second term in (60), which captures the approximation error. By the triangle inequality,
$$\left| \int_X \hat f_n(x) \log \frac{\bar f_n(x)}{f(x)}\, d\lambda(x) \right| \le \left| \int_X \bar f_n(x) \log \frac{\bar f_n(x)}{f(x)}\, d\lambda(x) \right| + \left\| \log \frac{\bar f_n}{f} \right\|_{L^\infty(d\lambda)} \int_X \left| \bar f_n(x) - \hat f_n(x) \right| d\lambda(x) \qquad (61)$$
$$= D(\bar\mu_n \| \mu) + \left\| \log \frac{d\bar\mu_n}{d\mu} \right\|_{L^\infty(d\lambda)} V(\bar\mu_n, \mu_n), \qquad (62)$$
where, from (61) to (62), $\bar\mu_n$ denotes the measure in $AC(X)$ induced by the density $\bar f_n(x)$. From the shrinking-cell condition in (58) and the continuity of $(d\mu/d\lambda)(x)$, the same arguments adopted in the proof of Theorem 11 can be used to bound $D(\bar\mu_n \| \mu)$ in (62). More precisely, we have that $\lim_{n\to\infty} \sup_{x \in X} |(d\bar\mu_n/d\mu)(x) - 1| = 0$, which implies that $\lim_{n\to\infty} (d\bar\mu_n/d\mu)(x) = 1$ for all $x$, and that $\sup_{n > N} \|\log(d\bar\mu_n/d\mu)\|_{L^\infty(d\lambda)} < \infty$ for a sufficiently large $N$. Hence, from Theorem 4, $\lim_{n\to\infty} D(\bar\mu_n \| \mu) = 0$. For the remaining term on the RHS of (62), we have that $V(\mu_n, \bar\mu_n) = \sup_{A \in \sigma(\pi_n)} |\mu_n(A) - \mu(A)|$, and the celebrated VC inequality (Vapnik and Chervonenkis, 1971) yields the exponential bound $P_\mu(V(\mu_n, \bar\mu_n) > \epsilon) \le K |\pi_n| e^{-n\epsilon^2}$ for all $\epsilon > 0$, with $K$ a constant independent of $\mu$. Finally, considering that $(m_n/n)$ is $o(1)$ and invoking the Borel-Cantelli lemma, we have that $\lim_{n\to\infty} V(\mu_n, \bar\mu_n) = 0$, $P_\mu$-a.s., which concludes the proof. □
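For intuition about the decomposition (60)-(62), the following fragment (an illustration under the same hypothetical setting as the sketch after Corollary 7) evaluates the estimation term $D_{\sigma(\pi_n)}(\mu_n \| \mu)$, a finite-alphabet divergence between cell masses, and the term $D(\bar\mu_n \| \mu)$ that controls the approximation error. At fixed $n$ the two move in opposite directions as the number of cells grows, which is the tradeoff the proof balances.

```python
import numpy as np

rng = np.random.default_rng(2)

def error_terms(n, m):
    """Estimation term D_{sigma(pi_n)}(mu_n||mu) and approximation term
    D(bar_mu_n||mu) from (60)-(62), for the illustrative f(x) = 1/2 + x."""
    u = rng.random(n)
    x = (-1.0 + np.sqrt(1.0 + 8.0 * u)) / 2.0   # inverse-CDF samples of mu
    edges = np.linspace(0.0, 1.0, m + 1)
    counts, _ = np.histogram(x, bins=edges)
    p_hat = counts / n                          # mu_n(A_{n,i})
    p = np.diff(0.5 * edges + 0.5 * edges**2)   # mu(A_{n,i}) from the cdf
    nz = p_hat > 0                              # empty cells contribute 0
    est = (p_hat[nz] * np.log(p_hat[nz] / p[nz])).sum()
    # D(bar_mu_n||mu): bar f_n is the cell-averaged density mu(A_i)/lambda(A_i)
    g = np.linspace(0.0, 1.0, 20000, endpoint=False) + 0.5 / 20000
    fbar = (p * m)[np.minimum((g * m).astype(int), m - 1)]
    approx = np.mean(fbar * np.log(fbar / (0.5 + g)))
    return est, approx

for n in [10**3, 10**5]:
    for m in [4, 32, 256]:
        est, approx = error_terms(n, m)
        print(f"n={n:>6} m={m:>3}  estimation={est:.5f}  approximation={approx:.5f}")
```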
Acknowledgments

We want to thank Todd Coleman for a stimulating discussion on the topic of information measure estimation, which helped us set the initial ideas and questions that were ultimately studied in this work. We thank Sandra Beckman for proofreading this material. The work presented in this manuscript was supported by FONDECYT Grant 1110145, CONICYT-Chile.

Appendix A. Theorem 5

Proof. First note that $\mu_n \in H(X)$ by the bounding condition and the fact that $\mu_n \ll \lambda$, where $\mu \ll \mu_n$ follows from $d\mu_n/d\lambda > 0$ Lebesgue-a.e. for all $n > 0$. We can consider $D(\mu \| \mu_n) = \int_X \log(d\mu/d\mu_n)\, d\mu \le \int_X |\log(d\mu/d\mu_n)|\, d\mu \le \|\log(d\mu/d\mu_n)\|_{L^\infty(d\mu)} < \infty$ (see details in (64)). Then $\mu \in H(X|\mu_n)$ for all $n > 0$. Finally, it is simple to show that
$$\left| \int_X \log \frac{d\mu_n}{d\lambda} \left( \frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda} \right) d\lambda \right| \le \sup_{n > 0} \left\| \log \frac{d\mu_n}{d\lambda} \right\|_{L^\infty(d\lambda)} V(\mu_n, \mu). \qquad (63)$$
The result then follows from (24) and Pinsker's inequality. □
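For completeness, Pinsker's inequality in the form used here, $V(\mu, \nu) \le \sqrt{D(\mu \| \nu)/2}$ with $V$ the sup-form total variation, admits a quick numerical sanity check on a finite alphabet (the alphabet size and the Dirichlet sampling are arbitrary choices; other conventions for $V$ differ by a factor of two):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pinsker's inequality: V(p, q) <= sqrt(D(p||q)/2), with V the sup-form
# total variation, i.e. half the L1 distance between the mass functions.
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    V = 0.5 * np.abs(p - q).sum()
    D = (p * np.log(p / q)).sum()
    assert V <= np.sqrt(D / 2.0) + 1e-12
    print(f"V={V:.4f}  sqrt(D/2)={np.sqrt(D / 2.0):.4f}")
```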
Appendix B. Theorem 6

Proof. First note that we can obtain the UBC of Theorem 3. It is clear that $|\log(d\mu_n/d\mu)| \le |\log(d\mu_n/d\lambda)| + |\log(d\mu/d\lambda)|$; then
$$\sup_{n > 0} \left\| \log \frac{d\mu_n}{d\mu} \right\|_{L^\infty(d\lambda)} \le \sup_{n > 0} \left\| \log \frac{d\mu_n}{d\lambda} \right\|_{L^\infty(d\lambda)} + \left\| \log \frac{d\mu}{d\lambda} \right\|_{L^\infty(d\lambda)} < \infty. \qquad (64)$$
Then, from Theorem 3, $\lim_{n\to\infty} V(\mu_n, \mu) = 0$ is equivalent to $\lim_{n\to\infty} D(\mu_n \| \mu) = 0$, and $\lim_{n\to\infty} V(\mu_n, \mu) = 0$ implies that $\lim_{n\to\infty} H(\mu_n) = H(\mu)$. On the other hand, $D(\mu \| \mu_n)$ is well defined from Theorem 5. Then, using (24), it is sufficient to show that under the hypotheses of the theorem we can upper bound $D(\mu \| \mu_n)$ by $V(\mu_n, \mu)$. From (64) there exists $N_{\min} > 0$ such that $\inf_n (d\mu/d\mu_n)(x) > N_{\min}$ Lebesgue-a.e. Then,
$$D(\mu \| \mu_n) = \int_X \log \frac{d\mu}{d\mu_n}\, d\mu \le \frac{\log(N_{\min})}{N_{\min} - 1} \int_X \left| \frac{d\mu}{d\mu_n} - 1 \right| d\mu$$
$$= \frac{\log(N_{\min})}{N_{\min} - 1} \int_X \left| \frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda} \right| \left( \frac{d\mu_n}{d\lambda} \right)^{-1} \frac{d\mu}{d\lambda}\, d\lambda \le \frac{\log(N_{\min})}{N_{\min} - 1} \left\| \left( \inf_{n > 0} \frac{d\mu_n}{d\lambda} \right)^{-1} \right\|_{L^\infty(d\lambda)} \left\| \frac{d\mu}{d\lambda} \right\|_{L^\infty(d\lambda)} \int_X \left| \frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda} \right| d\lambda = C \cdot V(\mu_n, \mu). \qquad (65)$$
The first inequality uses the fact that for any $x_0 > 0$, $|\log x| \le \frac{\log x_0}{x_0 - 1}\,|x - 1|$ for all $x \in (x_0, \infty)$ (see Piera and Parada, 2009), while the second inequality uses the UBC and the fact that $\log(d\mu/d\lambda) \in L^\infty(d\lambda)$. □
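The pointwise bound invoked in the first inequality of (65) is elementary to verify numerically; a minimal check (the choice $x_0 = 1/2$ and the truncation of the range at 50 are arbitrary) is:

```python
import numpy as np

# check: |log x| <= (log(x0) / (x0 - 1)) * |x - 1| for all x in (x0, infinity),
# with equality at x = x0; here x0 in (0, 1), so the constant is positive
x0 = 0.5
c = np.log(x0) / (x0 - 1.0)            # equals 2 log 2 for x0 = 1/2
x = np.linspace(x0, 50.0, 1_000_000)
assert np.all(np.abs(np.log(x)) <= c * np.abs(x - 1.0) + 1e-12)
print(f"constant log(x0)/(x0-1) = {c:.4f}; bound holds on [{x0}, 50]")
```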
Appendix C. Theorem 7

Proof. Using the identity
$$H(\mu_n) - H(\mu) = D(\mu \| \mu_n) + \int_X \log \frac{d\mu_n}{d\lambda} \left( \frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda} \right) d\lambda, \qquad (66)$$
it is simple to show that $D(\mu \| \mu_n) < \infty$. Let us work with the second term on the RHS of (66):
$$\left| \int_X \log \frac{d\mu_n}{d\lambda} \left( \frac{d\mu}{d\lambda} - \frac{d\mu_n}{d\lambda} \right) d\lambda \right| = \left| \int_X \log \frac{d\mu_n}{d\lambda} \left( 1 - \frac{d\mu_n}{d\mu} \right) d\mu \right| \le \sup_{n > 0} \left\| \log \frac{d\mu_n}{d\lambda} \right\|_{L^\infty(d\mu)} \int_X \left| 1 - \frac{d\mu_n}{d\mu} \right| d\mu. \qquad (67)$$
From (64), the function $d\mu_n/d\mu$ is bounded $\mu$-a.e., and given that $\lim_{n\to\infty} d\mu_n/d\mu = 1$ $\mu$-a.e., the term in (67) converges to zero by the dominated convergence theorem (Varadhan, 2001). From (66), it is then sufficient to show that $\lim_{n\to\infty} D(\mu \| \mu_n) = 0$. This is a simple consequence of the fact that $|\log(d\mu/d\mu_n)|$ is bounded by $\sup_{n > 0} \|\log(d\mu_n/d\mu)\|_{L^\infty(d\mu)} < \infty$ for $\mu$-almost every point in $X$, together with the dominated convergence theorem. □

Appendix D. Proposition 2
Proof. The proof is based on the relationship in (55) and the fact that, under these hypotheses, $\tilde H_n \to H(\mu)$, $P_\mu$-a.s., by Györfi and Van der Meulen (1987, Theorem 2). Then it is sufficient to show that $\lim_{n\to\infty} \mu_n(\bigcup_{i \in F_n} A_{n,i}) = 1$, $P_\mu$-a.s. Concerning this, Györfi and Van der Meulen (1987, Eq. (3.34), p. 432) showed instead that $\lim_{n\to\infty} \mu(\bigcup_{i \in F_n} A_{n,i}) = 1$, $P_\mu$-a.s. Then, if we consider the new sequence of partitions $\bar\pi_n = \{\bigcup_{i \in F_n} A_{n,i},\; X \setminus \bigcup_{i \in F_n} A_{n,i}\} \in Q(X)$ for all $n$, from the VC inequality (Vapnik and Chervonenkis, 1971; see footnote 8) and the Borel-Cantelli lemma we have that
$$\lim_{n\to\infty} \sup_{A \in \bar\pi_n} |\mu_n(A) - \mu(A)| = 0, \quad P_\mu\text{-a.s.}, \qquad (68)$$
which concludes the proof from the construction of $\bar\pi_n$. □
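To see why the two-cell partition $\bar\pi_n$ makes (68) elementary, note that $\sup_{A \in \bar\pi_n} |\mu_n(A) - \mu(A)|$ reduces to a single binomial deviation, for which even the basic Hoeffding bound is summable in $n$. A small simulation (the mass $\mu(B) = 0.3$ and the level $\epsilon = 0.05$ are hypothetical choices) compares the empirical exceedance probability with that bound:

```python
import numpy as np

rng = np.random.default_rng(4)

# For a two-cell partition {B, X \ B}, sup over the cells of |mu_n(A) - mu(A)|
# equals |mu_n(B) - mu(B)|, a binomial deviation. Hoeffding gives
# P(|mu_n(B) - mu(B)| > eps) <= 2 exp(-2 n eps^2), summable in n, so the
# Borel-Cantelli lemma yields the almost-sure convergence in (68).
pB, eps, runs = 0.3, 0.05, 20000
for n in [10**2, 10**3, 10**4]:
    dev = np.abs(rng.binomial(n, pB, size=runs) / n - pB)
    print(f"n={n:>6}  P_hat(dev > eps)={np.mean(dev > eps):.4f}  "
          f"Hoeffding bound={2.0 * np.exp(-2.0 * n * eps**2):.2e}")
```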
Appendix E. Corollary 6
Proof. Let us consider the standard histogram-based estimator based on $\pi_n$, i.e., for all $A \in \mathcal{B}(X)$,
$$\mu_n(A) = \sum_{i=1}^{m_n} \mu_n(A_{n,i}) \frac{\lambda(A_{n,i} \cap A)}{\lambda(A_{n,i})}. \qquad (69)$$
On the other hand, our histogram-based estimator in (53) is given by
$$\mu_n^*(A) = \frac{1}{\mu_n\left( \bigcup_{i \in F_n} A_{n,i} \right)} \sum_{i \in F_n} \mu_n(A_{n,i}) \frac{\lambda(A_{n,i} \cap A)}{\lambda(A_{n,i})}, \quad \forall A \in \mathcal{B}(X). \qquad (70)$$
We know from Györfi and Van der Meulen (1987) that $\lim_{n\to\infty} V(\mu_n, \mu) = 0$, $P_\mu$-a.s. From the triangle inequality, it is then sufficient to analyze
$$V(\mu_n, \mu_n^*) = \max\left\{ \sup_{i \in F_n} \left| \mu_n(A_{n,i}) - \frac{\mu_n(A_{n,i})}{\mu_n\left( \bigcup_{j \in F_n} A_{n,j} \right)} \right|,\; \mu_n\left( \bigcup_{i \notin F_n} A_{n,i} \right) \right\}, \qquad (71)$$
which, from the fact that $\lim_{n\to\infty} \mu_n(\bigcup_{i \in F_n} A_{n,i}) = 1$, $P_\mu$-a.s. (see Appendix D), concludes the proof. □
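The comparison between (69) and (70) is mechanical once the cell masses are in hand. The sketch below computes the two contributions appearing in (71); the selection rule for $F_n$ (keep cells whose empirical mass exceeds a vanishing threshold) and the Beta(2,2) sample are hypothetical stand-ins for the construction behind (53):

```python
import numpy as np

rng = np.random.default_rng(5)

n, m = 10**4, 50
x = rng.beta(2.0, 2.0, size=n)                 # placeholder sample on [0, 1]
counts, _ = np.histogram(x, bins=m, range=(0.0, 1.0))
p = counts / n                                 # cell masses mu_n(A_{n,i}) of (69)

# hypothetical selection rule for F_n: keep cells whose empirical mass
# exceeds a vanishing threshold (a stand-in for the rule that defines (53))
keep = p > 0.25 / m
mass_kept = p[keep].sum()                      # mu_n(union of kept cells)

q = np.zeros(m)
q[keep] = p[keep] / mass_kept                  # renormalized masses of (70)

rescale_dev = np.max(np.abs(p[keep] - q[keep]))  # deviation on kept cells, cf. (71)
missing_mass = p[~keep].sum()                    # mass outside F_n, cf. (71)
print(f"mu_n(F_n cells)={mass_kept:.4f}  rescaling dev={rescale_dev:.5f}  "
      f"missing mass={missing_mass:.5f}")
```

As the kept mass approaches one, both contributions in (71) vanish, which is exactly the mechanism used in the proof.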
8 Note that the shatter coefficient of $\bar\pi_n$ is constant, since $|\bar\pi_n| = 2$ for all $n$, which guarantees an exponential rate of convergence for $P_\mu(\sup_{A \in \bar\pi_n} |\mu_n(A) - \mu(A)| > \epsilon)$, $\forall \epsilon > 0$. An excellent exposition of the Vapnik-Chervonenkis inequality and its combinatorial complexity metrics can be found in Vapnik (1998), Devroye and Lugosi (2001), and Devroye et al. (1996).
References

Abou-Jaoude, S., 1976. Conditions nécessaires et suffisantes de convergence L1 en probabilité de l'histogramme pour une densité. Annales de l'Institut Henri Poincaré 12, 213–231.
Barron, A., Györfi, L., van der Meulen, E.C., 1992. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Transactions on Information Theory 38 (5), 1437–1454.
Beirlant, J., Dudewicz, E., Györfi, L., van der Meulen, E.C., 1997. Nonparametric entropy estimation: an overview. International Journal of Mathematics and Mathematical Sciences 6, 17–39.
Berlinet, A., Vajda, I., van der Meulen, E.C., 1998. About the asymptotic accuracy of Barron density estimates. IEEE Transactions on Information Theory 44 (3), 999–1009.
Billingsley, P., 1999. Convergence of Probability Measures, second ed. Wiley Series in Probability and Statistics.
Breiman, L., 1968. Probability. Addison-Wesley.
Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory. Wiley Interscience, New York.
Csiszár, I., 1967. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 2, 299–318.
Csiszár, I., 1973. Generalized entropy and quantization problems. In: Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes. Academia, pp. 159–174.
Csiszár, I., Shields, P.C., 2004. Information Theory and Statistics: A Tutorial. Now Publishers Inc.
Darbellay, G.A., Vajda, I., 1999. Estimation of the information by an adaptive partition of the observation space. IEEE Transactions on Information Theory 45 (4), 1315–1321.
Devroye, L., Györfi, L., 1985. Nonparametric Density Estimation: The L1 View. Wiley Interscience, New York.
Devroye, L., Györfi, L., Lugosi, G., 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
Devroye, L., Lugosi, G., 2001. Combinatorial Methods in Density Estimation. Springer-Verlag, New York.
Gessaman, M.P., 1970. A consistent nonparametric multivariate density estimator based on statistically equivalent blocks. The Annals of Mathematical Statistics 41, 1344–1346.
Gray, R.M., 1990. Entropy and Information Theory. Springer-Verlag, New York.
Györfi, L., Liese, F., Vajda, I., Van der Meulen, E.C., 1998. Distribution estimates consistent in χ²-divergence. Statistics 32 (1), 31–57.
Györfi, L., Van der Meulen, E.C., 1987. Density-free convergence properties of various estimators of entropy. Computational Statistics and Data Analysis 5, 425–436.
Györfi, L., Van der Meulen, E.C., 1994. Density estimation consistent in information divergence. In: IEEE International Symposium on Information Theory, p. 35.
Halmos, P.R., 1950. Measure Theory. Van Nostrand, New York.
Ho, S.-W., Yeung, R.W., 2009. On the discontinuity of the Shannon information measures. IEEE Transactions on Information Theory 55 (12), 5362–5374.
Ho, S.-W., Yeung, R.W., 2010. The interplay between entropy and variational distance. IEEE Transactions on Information Theory 56 (12), 5906–5929.
Kullback, S., 1958. Information Theory and Statistics. Wiley, New York.
Kullback, S., 1967. A lower bound for discrimination in terms of variation. IEEE Transactions on Information Theory 13, 126–127.
Kullback, S., Leibler, R., 1951. On information and sufficiency. The Annals of Mathematical Statistics 22, 79–86.
Lugosi, G., Nobel, A.B., 1996. Consistency of data-driven histogram methods for density estimation and classification. The Annals of Statistics 24 (2), 687–706.
Nobel, A.B., 1996. Histogram regression estimation using data-dependent partitions. The Annals of Statistics 24 (3), 1084–1105.
Piera, F., Parada, P., 2009. On convergence properties of Shannon entropy. Problems of Information Transmission 45 (2), 75–94.
Scheffé, H., 1947. A useful convergence theorem for probability distributions. The Annals of Mathematical Statistics 18, 434–458.
Shannon, C.E., 1948. A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.
Silva, J., Narayanan, S., 2007. Universal consistency of data-driven partitions for divergence estimation. In: IEEE International Symposium on Information Theory. IEEE.
Silva, J., Narayanan, S., 2010a. Information divergence estimation based on data-dependent partitions. Journal of Statistical Planning and Inference 140 (11), 3180–3198.
Silva, J., Narayanan, S., 2010b. Non-product data-dependent partitions for mutual information estimation: strong consistency and applications. IEEE Transactions on Signal Processing 58 (7), 3497–3511.
Vajda, I., Van der Meulen, E.C., 2001. Optimization of Barron density estimates. IEEE Transactions on Information Theory 47 (5), 1867–1883.
Vapnik, V., 1998. Statistical Learning Theory. John Wiley.
Vapnik, V., Chervonenkis, A.J., 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 264–280.
Varadhan, S., 2001. Probability Theory. American Mathematical Society.
Wang, Q., Kulkarni, S.R., Verdú, S., 2005. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory 51 (9), 3064–3074.