Journal of Environmental Management (1997) 49, 43–51

Three Information-Theoretical Methods to Estimate a Random Variable

Niels C. Lind
Institute for Risk Research, University of Waterloo, Waterloo, ON, Canada N2L 3G1
Address for correspondence: Niels C. Lind, #504–640 Montreal Street, Victoria, BC, Canada V8V 1Z8. Fax (250) 721 6051.
Three information-theoretical methods to estimate a continuous univariate distribution are proposed for estimation when the distribution type is uncertain, when data are scarce, or when extremes are important. The first is a new version of Jaynes' MaxEnt method. The second, minimizing Shannon's information measure, yields a minimally informative estimate. The third analysis produces a pair of distributions that together minimize the relative entropy (cross-entropy), satisfying the principle of least information. One distribution (p) provides minimal information, among all chosen candidates, about the interval events defined by the sample, while the other (q) minimizes the information among all distributions that satisfy the sample rule, an essential constraint. Model p is also identical to the "maximum product of spacings" estimate. Distribution q is determined from p by simple algebra. The principle of least information yields a unique solution (p, q) when other methods fail. The algorithm is the same for all types of distributions; the estimation process introduces a minimum of information, approaching objectivity; and the solution is invariant under monotonic variable transformations. All three methods are computationally simple but involve optimization. There are also several approximate information-theoretical methods that retain some of the advantages cited and are computationally simpler. © 1997 Academic Press Limited
Keywords: distribution, estimation, information, entropy, sample
1. Introduction

Statistical estimation of random variables from random sample data is an important problem that can be solved by many established methods, such as the method of moments or maximum likelihood. This paper describes an alternative: information-theoretical estimation of the distribution of a real-valued variable of unknown mathematical type, for which a set of independent observations is available.

The established methods of estimation have shortcomings for many applications. In natural hazards analysis, for
example, the interest is often focused on one of the extreme tails of the distribution, but traditional methods perform better in the central portion (the "tail sensitivity problem"). The data are often scarce. Moreover, especially when sensitive political issues are involved, probability statements are often questioned and must stand up to intense scrutiny, possibly in a public forum under an adversarial format. Objectivity helps in their defense and should therefore replace arbitrariness as far as possible. Available statistical methods of estimation are not well suited for such applications.

One response to such difficulties in recent years has been to relax the conventional assumptions about distribution type. This trend is reflected in the developments in density estimation (Silverman, 1986). These methods have useful applications, although not under conditions of scarce data.

In 1957 Jaynes presented a radically new approach to statistical inference based on information theory, called (maximum) entropy estimation (Jaynes, 1983), yielding the "least biased" estimate of the random variable that is possible on the given information. Jaynes' analysis can be applied when the random variable is discrete-valued and the data are given in the form of a finite set of moments. If Jaynes' method is applied to continuous random variables, the distribution type depends on which moments are used in the estimation process. This is a weakness from the classical (frequentist) viewpoint, which considers the distribution type to be an objective property of the random variable that ought not to be determined by the moments chosen to reflect the data. This may be bad enough when the moments are observed directly, but when the data are sample values of the random variable, the moments are derived quantities. If a random sample is available, the analyst is free to calculate and use any selection and any number of moments, but the distribution type will be different for each selection. For example, it will be Gaussian if the analyst chooses to use only the first two moments in the calculations. Another fault is that the result is not invariant under monotonic transformations of the variable, although it should be.

A straightforward way to extend Jaynes' method to continuous random variables has emerged recently (Lind, 1994). It is presented below in some detail. The domain of the random variable is considered to be subdivided into intervals by the observations. This creates a finite set of discrete events. Any distribution model assigns probabilities to the intervals. Hence, it determines the amount of Shannon information (see Section 4). It also determines the entropy, which is merely the mathematical expectation of the Shannon information. If the model disagrees with the data, the entropy becomes small. For a good fit the entropy is large. A principle of maximum entropy is therefore suggested as one means of identifying a best fitting model, p(1), called the maximum entropy ("MaxEnt") distribution, in a given set.

Maximum entropy estimation was first extended to continuous random variables by Kullback and Leibler (1951) and axiomatized by Shore and Johnson (1980). It has been further extended to accommodate data in the form of quantiles (Lind and Solana, 1988). These extensions to Jaynes' entropy theory all employ a reference distribution, but they provide no principles or guidance for its selection.
As a further development of the information–theoretical approach, a method of estimation derives from the idea that the process should add as little information as possible to the information provided by the sample. If a distribution model disagrees with the data, the Shannon information becomes large; for a good fit this information is small. A principle of minimum information is therefore suggested as a means of identifying a best fitting model. Kapur and Kesavan (1992) invoke
"arguments based on common sense and scientific principles:
• Speak the truth and nothing but the truth; and
• Make use of all the information that is given and scrupulously avoid making assumptions about information that is not available".
This principle leads to the minimum information distribution p(2). In the univariate case p(2) is identical with the maximum product of spacings (MPS) distribution proposed by Cheng and Amin (1983) and Ranneby (1984). The information-theoretical perspective supplements those earlier studies; conversely, those studies, together with more recent work (Titterington, 1985; Cheng and Stephens, 1989), provide considerable information on the properties of the least information method, as summarized herein.

The minimum information distribution in a candidate set of distributions is not, in general, a global minimum information model. Among the global minimum information models there is, however, a unique model q(2) closest to p(2) in the sense that it minimizes the relative entropy with respect to p(2) under the constraints of the data (Shore and Johnson, 1980). The two principles, of minimum information and minimum relative entropy, are unified into the principle of least information for selection of a reference distribution p(2) and determination of the corresponding posterior distribution q(2).
2. Information and entropy

Let {x_1, x_2, ..., x_n} be a random sample of a real-valued random variable X defined in domain E = [x_0, x_m]. The sample partitions E into m = n + 1 discrete events x ∈ E_0 = [x_0, x_1], x ∈ E_1 = (x_1, x_2], ..., x ∈ E_n = (x_n, x_m]. Let p = p(x) be an element in a set C of candidate distributions for X, all defined and positive everywhere in domain E; p assigns to event E_i the probability

p_i = \int_{E_i} p(x) dx,   i = 0, 1, ..., n    (1)
The sample rule asserts that all m interval events E_i, i = 0, ..., n, are equally probable (Feller, 1968; Lind and Solana, 1988). The sample rule is exact. It is based on the symmetry argument that an element selected at random from any set of m numbers is equally likely to occupy any one of the m positions in the ordered set. An arbitrary element p of a candidate set C will not normally satisfy the sample rule, but a suitable candidate set will contain some elements that are close. The (information-theoretical or Shannon) entropy of a model p is defined as

H = -\sum_{i=0}^{n} p_i \log p_i    (2)
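As a concrete illustration of Equations (1) and (2), the following is a minimal sketch assuming NumPy and SciPy; the sample values and the exponential candidate (with its scale set to the sample mean) are illustrative only.

```python
import numpy as np
from scipy import stats

sample = np.array([1.2, 0.7, 3.4, 2.1, 5.0])                 # hypothetical observations
candidate = stats.expon(scale=sample.mean())                 # a convenient candidate p in C
edges = np.concatenate(([0.0], np.sort(sample), [np.inf]))   # x_0 < x_1 < ... < x_m
p_i = np.diff(candidate.cdf(edges))                          # interval probabilities, Eq. (1)
H = -np.sum(p_i * np.log(p_i))                               # entropy, Eq. (2)
print(p_i.round(3), H, np.log(len(p_i)))                     # H cannot exceed log m
```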
3. Principle of maximum entropy

The entropy H takes a maximum value, H_0, when all p_i are equal (Jaynes, 1983). If the model disagrees with the data, the entropy becomes small. For a good fit the entropy is large. A principle of maximum entropy is therefore suggested as a means of identifying a best fitting model, p(1), called the maximum entropy ("MaxEnt") distribution, from a candidate set C.
(1) Principle of Maximum Entropy. Given is a random sample of X that partitions the space of outcomes and a set C of candidate distributions. Select a model p(1) in C that maximizes the entropy for the discrete events defined by the partitioning.

4. Principle of minimum information

The (Shannon) information of a model p is defined as

I = -\sum_{i=0}^{n} \log p_i    (3)
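Equations (2) and (3) can be extremized over a parametric candidate set by one-dimensional optimization. The following is a sketch assuming SciPy, for an exponential family indexed by its mean; the sample values and parameter bounds are illustrative, and the minimum information member anticipates the principle stated below.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

sample = np.array([1.2, 0.7, 3.4, 2.1, 5.0])                 # illustrative observations
edges = np.concatenate(([0.0], np.sort(sample), [np.inf]))   # partition of the domain E

def interval_probs(mu):
    """p_i of Equation (1) for the exponential candidate with mean mu."""
    return np.diff(stats.expon(scale=mu).cdf(edges))

def entropy(mu):                                             # Equation (2)
    p = interval_probs(mu)
    return -np.sum(p * np.log(p))

def information(mu):                                         # Equation (3)
    return -np.sum(np.log(interval_probs(mu)))

# p(1): maximize the entropy; p(2): minimize the information (the MPS estimate).
mu_maxent = minimize_scalar(lambda mu: -entropy(mu), bounds=(0.1, 50.0), method="bounded").x
mu_mininf = minimize_scalar(information, bounds=(0.1, 50.0), method="bounded").x
print(mu_maxent, mu_mininf)    # the two selected candidates are close but not identical
```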
I takes a minimum value, I_0, if all p_i are equal (to 1/m), since \sum p_i = 1 and the information is a symmetric, convex function of the p_i. The set of all such models is called the set of global minimum information models, S. If the set of candidate distributions is chosen for mathematical convenience it will not normally contain a distribution that assigns all p_i equal to 1/m. Thus, set C normally has no element in common with S, but it contains at least one element, p(2), called the minimum information model in C. The minimum information distribution is normally close, but not identical, to the MaxEnt distribution. The set S of global minimum information models is identical to the set of all probability distributions q on E that satisfy the sample rule, i.e.

q_i = \int_{E_i} q(x) dx = 1/m,   i = 0, 1, ..., n    (4)
If the distribution satisfies the sample rule, the associated information assumes the greatest lower bound I_0 = m \log m; otherwise the information is greater than I_0. If a model does not satisfy the sample rule, it ought at least to approach satisfying it as far as possible. Also, it should convey a minimum amount of information, little more than the sample itself. This suggests the following principle:

(2) Minimum Information Principle. Given is a random sample of X that partitions the space of outcomes and a set C of candidate distributions. Select a model p(2) in C that minimizes the information about the discrete events defined by the partitioning.

5. Other extremum principles

The information in Equation (3) is the negative logarithm of the product p_0 p_1 ... p_n. Thus, the minimum information principle is equivalent to a principle of maximizing the product of the terms given by Equation (1). This may be compared with the likelihood of the sample, which is the product p(x_1)p(x_2)...p(x_n). Evidently the maximum likelihood principle should lead to a similar estimate. Indeed, Titterington (1985) has suggested that p(2) can be regarded as the maximum likelihood estimate based on grouped data.

Cheng and Amin (1983) considered a candidate set in the form of distributions with density p(x, θ) and c.d.f. P(x, θ), where θ is a parameter vector. They transformed the sample {x_i} and the end points of the domain of X into the unit interval (0, 1) using the transformation y = P(x, θ), subdividing the unit interval into m spacings. The Maximum Product of Spacings (MPS) method chooses θ to maximize the geometric mean of the spacings. The motivation is that the maximum is obtained only when all the spacings
are equal, which in a rough sense corresponds to trying to set θ equal to the "true" but unknown parameter vector, for which the spacings become identically distributed. Ranneby (1984) observed that a good inference method ought to minimize the distance between the "true" distribution and the model, measured in terms of a suitable metric; the relative entropy is one such possible metric. Ranneby showed that maximizing the product of spacings is equivalent to minimizing a quantity that converges to the relative entropy as the sample size goes to infinity. Thus, these two studies provide similar but distinctly different rationales for the reference distribution. Both rationales rest on the notion of a "true" distribution, which is sometimes meaningless, as in the context of climatic processes.

6. Relative entropy or cross-entropy

Assume now that p(x) > 0 and q(x) > 0 for all x ∈ E. The Kullback–Leibler relative entropy (or cross-entropy) of distribution q on E with respect to p is

D(p, q) = \int_E q(x) \log q(x) dx - \int_E q(x) \log p(x) dx    (5)
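For fully specified densities, Equation (5) can be evaluated directly by numerical quadrature. A minimal sketch, assuming SciPy; the two exponential scales are arbitrary, and the second evaluation illustrates the directional character of D noted next.

```python
import numpy as np
from scipy import stats, integrate

p = stats.expon(scale=4.0)   # reference distribution p (arbitrary scale)
q = stats.expon(scale=3.0)   # a second distribution q (arbitrary scale)

def D(p, q):
    """Relative entropy of q with respect to p, Equation (5)."""
    integrand = lambda x: q.pdf(x) * (q.logpdf(x) - p.logpdf(x))
    return integrate.quad(integrand, 0.0, np.inf)[0]

print(D(p, q))   # equals log(4/3) + 3/4 - 1 ≈ 0.038 for these two exponentials
print(D(q, p))   # a different value: D is a directed distance, not a symmetric one
```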
D(p, q) is a semi-metric, and may also be called the directed distance from p to q (Kapur and Kesavan, 1992). The relative entropy is invariant under a monotonic transformation of x.

Let p be a reference distribution that does not satisfy the sample rule; that is, p is not an element of S. We seek a distribution q in S that is closest to p in the sense that D(p, q) is minimized, according to the principle of minimum relative entropy:

(3) Principle of Minimum Relative Entropy. Given a reference distribution p, select a distribution q that satisfies the sample rule and minimizes the entropy relative to p.

In terms of p_i given by Equation (1) and q_i given by Equation (4) the solution is (Lind and Solana, 1988)

q(x) = p(x) (q_i / p_i),   x ∈ E_i,   i = 0, 1, ..., n    (6)
If q satisfies Equation (6), then D(p, q) in Equation (5) will assume the minimum value

D_min = \sum_{i=0}^{n} q_i \log(q_i / p_i) = -\sum q_i \log p_i - (-\sum q_i \log q_i) = I(p)/m - \log m    (7)
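The following sketch, assuming SciPy and illustrative inputs, constructs q from a reference p by Equation (6), evaluates Equation (5) by quadrature, and checks it against the closed form of Equation (7).

```python
import numpy as np
from scipy import stats, integrate

sample = np.array([2.0, 4.0])                                # illustrative observations
p = stats.expon(scale=3.92)                                  # reference p, assumed scale
edges = np.concatenate(([0.0], np.sort(sample), [np.inf]))
m = len(edges) - 1
P = p.cdf(edges)
p_int = np.diff(P)                                           # p_i, Equation (1)
q_int = np.full(m, 1.0 / m)                                  # q_i = 1/m, the sample rule

def log_q(x):
    """log q(x), with q given by Equation (6): q = p * (q_i / p_i) on E_i."""
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m - 1)
    return p.logpdf(x) + np.log(q_int[i]) - np.log(p_int[i])

# Equation (5) by quadrature, one interval at a time.
integrand = lambda x: np.exp(log_q(x)) * (log_q(x) - p.logpdf(x))
D_quad = sum(integrate.quad(integrand, a, b)[0] for a, b in zip(edges[:-1], edges[1:]))

# Closed form of Equation (7): D_min = I(p)/m - log m.
D_min = -np.log(p_int).sum() / m - np.log(m)
print(D_quad, D_min)     # the two values agree to quadrature accuracy
```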
D(p, q) can attain its minimum value over C only if p conforms to the minimum information principle, i.e. p = p(2). The corresponding distribution q may be called the posterior distribution and is denoted by q(2).

7. The posterior distribution

The posterior distribution function Q(x) is monotonically increasing everywhere in E and can be interpreted as a transformation of x, denoted by y, with an associated transformation Y of the random variable X:

Q: E \to [0, 1];   Y = Q(X);   X = Q^{-1}(Y)    (8)
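A sketch of the transformation in Equation (8), with Q built from an illustrative reference p and sample by Equation (6); SciPy is assumed. The inverse transformation can be used, for example, to draw realizations from the posterior distribution.

```python
import numpy as np
from scipy import stats

sample = np.array([2.0, 4.0])                                # illustrative observations
p = stats.expon(scale=3.92)                                  # reference p, assumed scale
edges = np.concatenate(([0.0], np.sort(sample), [np.inf]))
m = len(edges) - 1
P = p.cdf(edges)                                             # reference c.d.f. at the edges
p_int = np.diff(P)                                           # p_i, Equation (1)

def Q(x):
    """Posterior distribution function: mass 1/m in E_i, spread in proportion to p."""
    i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, m - 1)
    return (i + (p.cdf(x) - P[i]) / p_int[i]) / m

def Q_inv(y):
    """Inverse transformation X = Q^{-1}(Y) of Equation (8)."""
    i = np.clip(np.floor(np.asarray(y) * m).astype(int), 0, m - 1)
    return p.ppf(P[i] + (np.asarray(y) * m - i) * p_int[i])

print(Q(np.array([2.0, 4.0, 10.0])))                     # the observations map to 1/3 and 2/3
print(Q_inv(np.random.default_rng(0).uniform(size=3)))   # three draws from the posterior
```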
Figure 1. Schematic representation of the linear variety of distributions showing the set C of candidate distributions, the set S of distributions satisfying the sample rule, and the three optimal distributions p(1), p(2) and q(2).
The relative entropy D(p, q) is invariant under the transformation. It may be considered as pertaining to the estimation of Y. It is then a functional of the reference distribution p_Y and the posterior distribution q_Y, which is the standard uniform distribution u on domain [0, 1]: D(p, q) = D_Y(p_Y, u). Writing D in the form of a Stieltjes integral and observing that \log u is identically zero gives

D(p, q) = -\int_0^1 \log p_Y(y) dy    (9)
Thus, the relative entropy of q with respect to p equals the information provided by p about the transformed random variable Y. Since Y may be identified with the distribution function Q(x), the principle of minimum relative entropy and the principle of minimum information may be unified into one principle:

(4) Principle of Least Information. Let C be a set of candidates for a reference distribution p to estimate a real random variable X, and let S be the set of posterior distributions q satisfying the sample rule. Choose distributions p(2) ∈ C and q(2) ∈ S that minimize the information about the distribution function of X.

Figure 1 shows symbolically the sets C and S, the generic distributions p and q, and the three solutions p(1), p(2) and q(2). Distributions p in set C can be selected to have convenient mathematical form. Distributions q in set S satisfy the sample rule but are generally not differentiable.

8. Some properties of the minimum information distribution

The minimum information distribution p(2), identical to the known MPS estimate, has properties that have been demonstrated by Cheng and Amin (1983), Ranneby (1984) and Cheng and Stephens (1989). It exists and is a consistent estimator. Maximum likelihood estimation, in contrast, can break down when the likelihood is unbounded, giving inconsistent estimators; common examples are the three-parameter lognormal, Weibull and gamma distributions, and distributions uniform over unknown intervals. In parametric estimation the reference distribution parameters will not necessarily be functions of sufficient statistics.
Figure 2. The linear variety \sum P_i = 1, representing all possible probability assignments in the example. The candidate set of exponential distributions projects into the curve shown. The set S projects into the center of the triangle. (Labelled points: P_H = {0.397, 0.239, 0.363}; P_c = {0.487, 0.250, 0.263}; P_I = {0.400, 0.240, 0.360}; S = {1/3, 1/3, 1/3}.)
However, when the limits of the support of the reference density function are known, the estimators will have the same properties as maximum likelihood estimators, including that of asymptotic sufficiency (Cheng and Amin, 1983). From the spacings p_i in parametric estimation, with θ̂ an efficient estimator of the reference distribution parameter vector, we can form Moran's statistic

M(θ̂) = -\sum_{i=0}^{n} \log p_i    (10)

which is the information of Equation (3) evaluated at θ̂.
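A minimal sketch of Equation (10), assuming SciPy; the sample is illustrative and the sample mean stands in for an efficient estimate of the exponential parameter. The constants of the normal and chi-squared approximations to its null distribution are given by Cheng and Stephens (1989) and are not reproduced here.

```python
import numpy as np
from scipy import stats

sample = np.sort(np.array([1.2, 0.7, 3.4, 2.1, 5.0]))        # illustrative observations
edges = np.concatenate(([0.0], sample, [np.inf]))
theta_hat = sample.mean()                                    # an efficient estimate (MLE)
p_i = np.diff(stats.expon(scale=theta_hat).cdf(edges))       # spacings p_i at theta_hat
M = -np.sum(np.log(p_i))                                     # Moran's statistic, Eq. (10)
print(M)
```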
Cheng and Stephens (1989) have shown that under mild assumptions M(θ̂) has asymptotically a normal distribution, and they have given a chi-squared approximation so that a simple test can be made for small samples.

9. Example

For simplicity, suppose that there are just two observations of X, 2 and 4, and that the distribution is believed to be exponential, so that the candidate set C is the single-parameter family of all exponential distributions. The sample divides the domain of the real axis into three intervals, i.e. (0, 2], (2, 4] and (4, ∞). An exponential distribution with mean μ, i.e. with density p(x) = (1/μ) exp(-x/μ), assigns probabilities to these three intervals as follows: P_1 = 1 - exp(-2/μ), P_2 = exp(-2/μ) - exp(-4/μ), P_3 = exp(-4/μ). As the parameter μ varies, the three probabilities vary as shown in Figure 2 by the curve labelled "C" that represents the candidate set. The entropy and the information are functions of μ: H(μ) = -P_1 \log P_1 - P_2 \log P_2 - P_3 \log P_3 and I(μ) = -\log(P_1 P_2 P_3).
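These two functions can be extremized numerically; a sketch, assuming SciPy, with arbitrary optimization bounds:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def probs(mu):
    """P_1, P_2, P_3 for the exponential candidate with mean mu."""
    a, b = np.exp(-2.0 / mu), np.exp(-4.0 / mu)
    return np.array([1.0 - a, a - b, b])

H = lambda mu: -np.sum(probs(mu) * np.log(probs(mu)))        # entropy of the three events
I = lambda mu: -np.sum(np.log(probs(mu)))                    # information of the three events

mu1 = minimize_scalar(lambda mu: -H(mu), bounds=(0.5, 50.0), method="bounded").x
mu2 = minimize_scalar(I, bounds=(0.5, 50.0), method="bounded").x
print(mu1, mu2)               # compare with the means reported in the text below
print(probs(mu2).round(3))    # approximately [0.4, 0.24, 0.36], cf. Figure 2
```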
Extremizing H(μ) and I(μ) yields, respectively, the MaxEnt distribution p(1) with mean μ_1 = 3.95 and the least information distribution p(2) with mean μ_2 = 3.92. The latter assigns a probability P_3 = 0.360 to the high-X interval. By the sample rule this probability should equal 1/3; therefore, the relative entropy is minimized by the posterior distribution q(2), whose distribution function in the high tail is Q(x) = 1 - (0.333/0.360) exp(-x/3.92), x > 4.

10. Some simple approximations

The posterior distribution q is insensitive to small variations in p except for very high or very small probabilities. This makes it possible in some circumstances to avoid optimization to determine p when it is only auxiliary to determining q. One solution is to use the method of moments to estimate p instead of the optimal p(2). Another is to select p to pass through the extreme high or low quantile observation, as appropriate for the application, i.e. either (x_1, 1/m) or (x_n, 1 - 1/m). If p has two parameters it can be made to pass through another point, which could be the median observation (x_M, 1/2); in general, if C is an N-parameter family, p can thus be made a priori to satisfy N of the constraints of the sample rule. If the remaining n - N sample rule constraints (judiciously chosen) are ignored, then q coincides with p.

11. Conclusions

Entropy methods derived from information theory may be adapted to the estimation of a real random variable given a random sample. The sample divides the domain into intervals, and any distribution that assigns probabilities to these intervals has thereby an associated entropy and information. This defines (1) a maximum entropy distribution and (2) a minimum information distribution within a set of distributions. These distributions do not generally satisfy the combinatorial sample rule, which requires that the interval probabilities be exactly equal. Minimizing the relative entropy by the principle of least information identifies (3) a unique optimal posterior distribution that does. This principle has intuitive appeal since it ensures that the results incorporate a minimum of information. The minimum information distribution is approached by the maximum likelihood distribution in the candidate set.

The least information method has many desirable attributes:
(1) It yields a unique solution pair (p, q) when other methods (e.g. maximum likelihood) fail.
(2) The posterior distribution q satisfies the logical constraints of the sample rule, unlike common parametric solutions.
(3) The objective function (the entropy of q relative to p) has a rationale that is easy to understand and justify.
(4) The algorithm is the same for all sets of candidate distributions.
(5) The estimation process introduces a minimum of information and approaches objectivity in a well defined sense.
(6) The distributions p and q are invariant under monotonic transformations of the variable.

Thanks are due to the Natural Sciences and Engineering Research Council of Canada for financial support.
References

Cheng, R. C. H. and Amin, N. A. K. (1983). Estimating parameters in continuous univariate distributions with a shifted origin. Journal of the Royal Statistical Society, Series B 45, 394–403.
Cheng, R. C. H. and Stephens, M. A. (1989). A goodness-of-fit test using Moran's statistic with estimated parameters. Biometrika 76, 385–392.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, Third edition. New York, NY: John Wiley and Sons, Inc.
Jaynes, E. T. (1983). Papers on Probability, Statistics, and Statistical Physics. Dordrecht, The Netherlands: D. Reidel.
Kapur, J. N. and Kesavan, H. K. (1992). Entropy Optimization Principles with Applications. London: Academic Press, Inc.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.
Lind, N. C. (1994). Information theory and maximum product of spacings estimation. Journal of the Royal Statistical Society, Series B 56, 341–343.
Lind, N. C. and Solana, V. (1988). Estimation of random variables with fractile constraints. IRR Paper No. 11. Waterloo, ON: Institute for Risk Research, University of Waterloo.
Ranneby, B. (1984). The maximum spacing method. An estimating method related to the maximum likelihood method. Scandinavian Journal of Statistics 11, 93–112.
Shore, J. and Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory IT-26, No. 1, 26–37.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Titterington, D. M. (1985). Comment on "Estimating parameters in continuous univariate distributions". Journal of the Royal Statistical Society, Series B 47, 115–116.