Advances in Water Resources 36 (2012) 3–10
Generalized priors in Bayesian inversion problems

Peter K. Kitanidis
Stanford University, Civil and Environmental Engineering, 473 Via Ortega, Stanford, CA 94305, United States
E-mail address: [email protected]
Article history: Available online 14 June 2011

Keywords: Inverse modeling; Interpolation; Probabilistic analysis; Uncertainty quantification
Abstract

Inverse problems, when only hard data are considered, are mathematically ill-posed because they have multiple solutions and are sensitive to small changes in the data. An example is the determination of hydraulic conductivity from head data. Bayesian methods provide a general stochastic framework within which one can solve such problems. The approach allows one to combine hard data with other information in order to explore the range of possible solutions. We consider the role of generalized (also known as improper) probability distributions within this framework. These distributions are very useful because they represent information parsimoniously (with simple equations and few parameters) and are ideal for representing situations with limited information. They are particularly useful in introducing prior information about the dependent variables of the inverse problem and about structural parameters that describe the degree of continuity or smoothness of the dependent variables. Additionally, they are very useful from a computational perspective because they can lead to formulations with special structure, such as a high degree of sparsity, or structure that can be exploited through high-performance computing algorithms. We examine, in the context of a simple example, some of the consequences of using different generalized priors.

© 2011 Elsevier Ltd. All rights reserved.
1. Inverse problems
The term inverse problem is sometimes used as a synonym for parameter estimation. Although this may be a valid description in some cases, a distinguishing feature of an inverse problem is that it is mathematically ill-posed in the sense of Hadamard. Such problems are studied in, for example, Lavrentiev [17]. A problem is said to be well posed if: (1) a solution exists; (2) the solution is unique; and (3) the solution depends continuously on the data (i.e., a small change in the data produces a small change in the solution). An example is the simplest version of the deconvolution problem,
y(t) = \int_a^t s(\tau)\, d\tau    (1)
where y is a known function and s is the unknown function. One can write the solution explicitly,
s(t) = \frac{dy}{dt}    (2)
In an idealized world devoid of errors, this problem should be straightforward to solve. However, in practical applications, y is data and always affected by error. Consider that y is measured without gaps in a certain interval but the "true" value y_0(t) is contaminated with an error y_1(t), which is, for illustration,

y_1(t) = \varepsilon \cos\left( \frac{2\pi t}{T} \right)    (3)
Then, the solution is
s(t) = \frac{dy_0}{dt} - \frac{2\pi\varepsilon}{T} \sin\left( \frac{2\pi t}{T} \right)    (4)
where the second term is the error. This illustrates that even though the error in the data may be small, in the sense that its amplitude ε may be small, the error in the result may be large if the period T is small. This is a consequence of the well-known fact that the process of differentiation accentuates the small-scale features. From a practical estimation perspective, the conclusion is that it is virtually impossible to estimate the small-scale features of s(t) by solving the deconvolution problem, because they are likely to be swamped by noise in the data. This is the fundamental resolution limitation that affects many hydrogeologic and hydrogeophysical methods, like hydraulic tomography and seismic exploration. This type of mathematical problem has been referred to as "ill-posed" by Hadamard because small changes in the data can effect large changes in the results.

A hydrogeologic example is the classic problem of inverse modeling of conductivity from head observations and other information (e.g., see [21,4]). The estimate of the conductivity depends on the head gradients and, thus, the problem is similar to the deconvolution problem. Consequently, the estimation of conductivity from head observations is an ill-posed problem, even in the hypothetical case that the head is measured everywhere.
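To make the resolution limitation concrete, here is a minimal numerical sketch of the deconvolution example of Eqs. (1)-(4) (our own illustration, not from the original paper; the amplitude ε, the period T, and the grid are arbitrary choices):

```python
import numpy as np

# Deconvolution example of Eqs. (1)-(4): s(t) = dy/dt.
# True data y0(t) = t (so the true s is 1) plus a small oscillatory error.
eps, T = 1e-3, 1e-2            # error amplitude and (short) period
t = np.linspace(0.0, 1.0, 10001)
y0 = t                          # "true" data; s0 = dy0/dt = 1
y = y0 + eps * np.cos(2 * np.pi * t / T)

s_est = np.gradient(y, t)       # numerical differentiation of the noisy data

# Error in the data is eps; error in the derivative is about 2*pi*eps/T.
print("max data error      :", eps)                       # 1e-3
print("max derivative error:", np.max(np.abs(s_est - 1.0)))
print("predicted 2*pi*eps/T:", 2 * np.pi * eps / T)       # ~0.63
```

A data error of one part in a thousand produces a derivative error of order one, exactly the amplification factor 2πε/T of Eq. (4).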
In actual practice, of course, the head is not measured everywhere, creating another serious difficulty: that of data sparsity. In many cases, the number of observations is finite, n, while the unknown function is discretized into a larger number, m, giving rise to the issue of algebraic indeterminacy (i.e., fewer observations than unknowns). This means that the second requirement for well-posedness is also not met. A final complication is that observations are never without some error, introducing additional ambiguity.

In a general sense, then, there are infinitely many solutions consistent with the available hard data. In any specific application, however, there is additional information that allows one to limit the range of possible solutions. One particular approach is to add requirements that force the problem into a unique solution. This includes the method of regularization or Tikhonov regularization [24], where the problem is reformulated to that of finding a function that is consistent with the data but also minimizes a certain functional, such as the norm of the unknown function or the norm of the gradient of the function. The solution is then unique and depends continuously on the data. However, it would be hard to rationalize that the regularization requirement is based on hard physics. Surely the range of possible solutions contains mostly solutions that have very large variance and are highly irregular, and the regularization requirement allows one to get rid of solutions that have variability beyond what is needed to strictly reproduce the data. But to state that the unique solution that is produced by regularization is the true and only solution is quite a stretch and neglects to recognize uncertainty. We will return to such methods and interpret this methodology within a Bayesian framework.

Similarly, in a different approach, one can change the problem by imposing some form of structure, such as that the function is piecewise constant, thus making the number of unknowns smaller than the number of data. Then the problem can be examined as a least squares problem from which one can obtain estimates and confidence intervals. However, this uncertainty quantification is conditional on what may turn out to be an unrealistic, or at least highly coarse-grained, representation of the function.

Nevertheless, in any context or specific application, for example hydrogeologic practice, additional information exists and needs to be included. Such information may not even require a close understanding of the site hydrogeology but is a consequence of characteristics of the unknown function: the porosity is between 0 and 1, the conductivity is positive, the variance of any quantity is finite, the quantities cannot change in a completely erratic way at every point, etc. Of course, ideally, one should have more specific information regarding the geologic structure of the site that can dramatically reduce the range of possible solutions. All these types of information are seldom precisely certain and can best be described through probabilistic concepts that allow quantification of uncertainty. The Bayesian methods that are the focus of this work utilize probability theory in the processing of data, but these statistical methods are better viewed as systematic procedures to make logical inferences in the face of uncertainty, rather than to analyze frequencies of repeated trials as in classical statistics [9].
Some of these issues have already been discussed by the author elsewhere [14,15]. The emphasis of this paper is on:

1. How prior information and reasonable criteria can be translated into prior distributions.
2. Choosing prior distributions to achieve certain objectives, including reduction of computational cost in the particular case of large-scale inverse problems.
3. Finding connections between various and seemingly disparate approaches, such as geostatistical and variational.
2. Bayes theorem

We will start by recalling some concepts from probability theory (e.g., [2]). Consider events B_1, B_2, \ldots, B_n that are mutually exclusive and collectively exhaustive (i.e., they do not intersect and the probability of their union is 1). Consider another event A. Then, Bayes theorem states that the probability of event A conditional on event B_i is proportional to the probability of event B_i given A times the probability of event A:
p(A \mid B_i) = \frac{p(B_i \mid A)\, p(A)}{p(B_i)} \propto p(B_i \mid A)\, p(A)    (5)
The term p(A) is called the prior probability, p(B_i|A) is called the likelihood, and p(A|B_i) is called the posterior probability. For a variable, scalar or vector, an analogous relation holds in terms of probability density functions (pdf). In place of event A we use the unknown s, and in place of B_i the data y. Then, Bayes theorem takes the form
p(s \mid y) = \frac{p(y \mid s)\, p(s)}{\int p(y \mid s)\, p(s)\, ds} \propto p(y \mid s)\, p(s)    (6)
A common notation is to use prime for prior and double-prime for posterior
p''(s) \propto p(y \mid s)\, p'(s)    (7)
Note that in most practical applications, it is impractical to estimate the coefficient of proportionality, and we say that Eq. (7) expresses the kernel of the posterior. Computer-intensive methods rely on the availability of the kernel of the posterior of s in order to generate realizations (e.g., [8]). Before proceeding further with the Bayesian approach, we will briefly review the concept of the generalized probability density function.
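To illustrate working with the kernel of Eq. (7), the following toy sketch (our own example, assuming a single Gaussian observation and the flat improper prior of Eq. (11) below) recovers the normalized posterior numerically on a grid:

```python
import numpy as np

# Posterior kernel of Eq. (7): p''(s) ∝ p(y|s) p'(s).
# Toy setting: one observation y = s + noise, noise ~ N(0, sigma^2), flat prior.
y_obs, sigma = 1.3, 1.0

s = np.linspace(-10.0, 10.0, 4001)       # grid over the unknown
ds = s[1] - s[0]
prior = np.ones_like(s)                   # improper flat prior
likelihood = np.exp(-0.5 * ((y_obs - s) / sigma) ** 2)
kernel = likelihood * prior               # un-normalized posterior

# Recover the proportionality constant by numerical integration.
post = kernel / (kernel.sum() * ds)
print("posterior mean:", (s * post).sum() * ds)                          # ~ 1.3
print("posterior std :", np.sqrt(((s - y_obs) ** 2 * post).sum() * ds))  # ~ 1.0
```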
3. Generalized pdf

It is known from probability theory that the probability density function (pdf) of a random variable s, denoted by f(s), must be nonnegative and integrate to 1,

f(s) \geq 0, \qquad \int_{-\infty}^{\infty} f(u)\, du = 1    (8)
Sometimes, the kernel of the pdf may be given, for example
f(s) \propto \exp\left( -\tfrac{1}{2} |s| \right), \quad s \geq 0    (9)
Here, the expression is nonnegative everywhere the pdf is defined, so the first condition is met. As for the second condition, what matters is that the integral is finite, because one can then compute the coefficient of proportionality that allows us to satisfy it. Since the value of the integral is 2, one can easily see that the second condition is met with coefficient of proportionality 1/2,
f(s) = \tfrac{1}{2} \exp\left( -\tfrac{1}{2} |s| \right), \quad s \geq 0    (10)
However, in Bayesian analysis, it may be advantageous to work with pdfs that do not satisfy the second condition. Examples include:
f(s) \propto 1    (11)

f(s) \propto \frac{1}{s}, \quad s > 0    (12)
Such pdfs are sometimes referred to as improper. Here, we are interested in particular functions that are useful in applications and we will refer to them as generalized pdfs. Here are some important points:
1. Generalized priors are limiting cases. In most interesting cases, a generalized prior can be seen as a limiting case of a proper prior (see discussion in [18,6,1]). For example, consider a Gaussian with mean 0 and variance σ². At the limit σ² → ∞, the distribution tends to the uniform with bounds from minus infinity to plus infinity, which can be represented through the generalized pdf of Eq. (11). This is a useful distribution that is often used to denote a "noninformative" prior for variables that can take values anywhere on the real axis. (The term noninformative is used in a loose sense, since one cannot find a truly and completely noninformative prior. We simply use the term to denote any distribution that lets the likelihood dominate the shape of the posterior.) Another example is when s takes only positive values, like a hydraulic conductivity or a measure of spread in a distribution, and its natural logarithm is Gaussian with mean 0 and variance σ². Then, s follows a lognormal distribution
f(s) = \frac{1}{s\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\ln s)^2}{2\sigma^2} \right), \quad s > 0    (13)
At the limit σ² → ∞, we get the function of Eq. (12). One can easily verify this result by plotting (see also the short numerical check after this list). For a more constructive proof, set ε = 1/σ² and expand f(s) in a power series in ε. As ε tends toward 0, the leading term (i.e., all other terms are even smaller) is \sqrt{\varepsilon}/(s\sqrt{2\pi}). Thus, f(s) ∝ 1/s.
2. Role in Bayes theorem. This kind of function is often used as a prior in Bayesian analysis and may not be appropriate to use in other applications. For instance, it is impractical and perhaps even meaningless to try to generate realizations or compute moments from a generalized pdf, except in the context of Bayesian analysis, where it makes sense to introduce additional requirements or constraints coming from the data. In fact, it may be advisable not to interpret it as a distribution of s literally and in isolation from its role in Bayes theorem. For example, if the likelihood is highly informative, using Eq. (11) indicates that we are willing to rely only on the likelihood. The essential point is that the prior is flat over the effective range of the likelihood. One can see a similarity to the generalized functions (e.g., see [22]) used in mathematical physics, as the following example illustrates. The Dirac delta function, δ(x - x_0), is an example of a generalized function and can be seen as the limit of a Gaussian when the variance tends to 0. The idea is that, viewing δ as a density, the mass is highly concentrated near a point x_0 where another function g(x) is effectively flat, so that \int_{-\infty}^{\infty} g(u)\, \delta(u - x_0)\, du = g(x_0). That is, the use of a generalized function is justified in the context of applications by the properties of another function. For generalized priors in the context of Bayesian analysis, this other function is the likelihood.

3. Multivariate generalized pdf. The same basic ideas apply when we refer to the pdf of a random vector s instead of a single variable s. In fact, these cases are even more interesting because there is the additional aspect that the generalized prior may characterize part of the vector s, such as the projection of s onto a lower-dimensional space. We will discuss some cases later.
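The limit in point 1 is easy to check numerically. Below is a minimal sketch (our own verification, not from the paper; the grid and variance values are arbitrary) showing that as σ² grows, the lognormal pdf of Eq. (13) becomes proportional to 1/s over any fixed range:

```python
import numpy as np

# Check that the lognormal pdf of Eq. (13) tends to f(s) ∝ 1/s, Eq. (12).
def lognormal_pdf(s, var):
    return np.exp(-np.log(s) ** 2 / (2 * var)) / (s * np.sqrt(2 * np.pi * var))

s = np.linspace(0.5, 5.0, 10)
for var in [1.0, 1e2, 1e4, 1e6]:
    ratio = lognormal_pdf(s, var) * s      # should become constant in s
    spread = ratio.max() / ratio.min()     # 1.0 means exactly ∝ 1/s
    print(f"sigma^2 = {var:8.0f}: max/min of s*f(s) = {spread:.6f}")
```

The printed spread tends to 1 as σ² → ∞, confirming the 1/s behavior.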
Perhaps a cautionary note is in order: although many different improper priors can be constructed, only a subset is likely to be useful in applications. In the next section, we will see a method to devise such priors.

4. Max-entropic assignment of prior distribution

The maximum entropy formalism for assignment of a prior distribution was put forward by Jaynes; it has been well explained and advanced in books such as [25,5] and has found applications in assigning probabilities (for example [26]). The topic is relevant because the maximum entropy approach produces probability distributions that are commonly generalized, as we will see later in examples. The basic idea is as follows: Let s be the vector of unknown quantities. Information about s may be conveniently expressed in terms of some overall measures of s, such as the square norm of s, or the maximum of the absolute value of s, etc. These are scalar functions g_k(s), such as
g(s) = \|s\|_2^2 = \sum_{i=1}^{m} s_i^2    (14)

or

g(s) = \|s\|_1 = \sum_{i=1}^{m} |s_i|    (15)
In the probabilistic terms used in stochastic methods, these scalar functions can be treated as having known expected values, as a way of introducing information, or restricting the allowable solutions without overly prejudging the solution. The maximum entropy formalism chooses the probability distribution f(s) that maximizes the functional known as entropy, subject to the constraints given by known expected values. To wit, select the pdf f(s) to maximize the functional
S = -\int f(s) \ln(f(s))\, ds    (16)
subject to the constraints
\int g_k(s)\, f(s)\, ds = V_k, \quad k = 1, \ldots, K    (17)
where g_k(s) is a scalar function of s and V_k is a known value. A trivial example would be g(s) = 1 with corresponding V = 1, which expresses that the integral of a probability distribution must be 1 (the normalization condition); however, the prior pdf needs to be known only within a constant, so this condition is of no assistance here. A more useful g may be some norm of the vector s, such as the ones in Eqs. (14) and (15). (Examples that are more specific and relevant to geostatistical approaches will be given soon.) To state that the expected value of such a norm is known enforces a constraint on the solution, e.g., by eliminating solutions that have infinite values or infinite changes. The solution to this functional maximization problem is (see Appendix A)
f(s) \propto \exp[-\lambda_1 g_1(s) - \cdots - \lambda_K g_K(s)]    (18)
where λ_1, \ldots, λ_K are scalar multipliers. The nature of this distribution will be discussed in the next section. As for the values of λ_k, they can be selected following a variety of approaches. One is based on knowledge or a guess of the values of the moments V_k, followed by solving the system of K equations given by expression (17). Another approach, which utilizes the data directly, is to treat the λ's as structural parameters to be estimated according to available methods [16,10]. Notice that this approach makes it clear that the prior information introduces reasonable conditions about the structure. This is in sharp contrast to the way traditional geostatistics is introduced, where a long list of seemingly restrictive assumptions is made, such as that the function is stationary and that the covariance or the semivariogram is exactly known.
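As a concrete one-dimensional instance of Eq. (18), suppose the only constraint is a known E[|s|] = V. The maximum-entropy prior is then the Laplace (double-exponential) density f(s) ∝ exp(-λ|s|), for which E[|s|] = 1/λ, so λ = 1/V. A minimal numerical check (our own sketch; the value V = 0.5 is arbitrary):

```python
import numpy as np

# Max-entropy prior with one constraint g(s) = |s|, E[|s|] = V (Eq. 17).
# Eq. (18) gives f(s) ∝ exp(-lam*|s|); for this density E[|s|] = 1/lam.
V = 0.5                        # assumed known expected absolute value
lam = 1.0 / V                  # multiplier matching the constraint

s = np.linspace(-50.0, 50.0, 200001)
ds = s[1] - s[0]
f = np.exp(-lam * np.abs(s))
f /= f.sum() * ds              # normalize numerically

print("E[|s|] =", (np.abs(s) * f).sum() * ds, "  target V =", V)
print("entropy:", -(f * np.log(f)).sum() * ds)   # maximal among pdfs with this E[|s|]
```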
5. The likelihood function
The likelihood is the pdf of the data y given a specific value of s. A posteriori, y is given and s is a random quantity; thus, in the context of the Bayesian approach, the likelihood is a function of s that
behaves more like an un-normalized pdf of s than anything else. In an intuitive sense, this function is proportional to the pdf that would be supported by the data alone. However, in the case of inverse problems, this distribution is not proper. Let us examine the issue in more detail. The relation between y and s is often modeled through
y = h(s) + \varepsilon    (19)
where h is a vector function from the m-dimensional space to the n-dimensional one, and ε is a random vector that is usually modeled as having mean 0 and a certain pdf. Then the likelihood, which is the conditional pdf of y given s, is the same pdf as that of ε with the mean increased by h(s). More specifically, consider the prevalent model that ε is Gaussian with mean 0 and covariance R. In this case, the likelihood is Gaussian with mean h(s) and covariance R. However, if n is smaller than m, the likelihood constitutes an improper pdf of s because its integral is not bounded. This becomes easier to see if we consider the more specific case of a linear observation function, h(s) = Hs. We can then write the likelihood
p(y \mid s) \propto \exp\left( -\tfrac{1}{2} (y - Hs)^T R^{-1} (y - Hs) \right)    (20)
This is the exponential of a quadratic function of s, which indicates a Gaussian distribution. However, in order to be a proper Gaussian distribution, it should be possible to write it as

p(y \mid s) \propto \exp\left( -\tfrac{1}{2} (s - \mu)^T V^{-1} (s - \mu) \right)

where V^{-1} is an invertible m \times m matrix. If the problem is algebraically underdetermined, n < m, it is impossible to determine such a V. In a way, then, the essence of the ill-posedness of the problem is that the likelihood of the data is not sufficient to characterize the probability distribution of s.
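This can be exhibited numerically: in the sketch below (a made-up linear example with n = 3 and m = 5; H is random), the matrix HᵀR⁻¹H that would have to serve as V⁻¹ has rank n < m, and the likelihood is constant along null-space directions of H:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                          # fewer observations than unknowns
H = rng.standard_normal((n, m))      # linear observation matrix, Eq. (20)
R_inv = np.eye(n)                    # noise covariance R = I for simplicity

A = H.T @ R_inv @ H                  # candidate V^{-1}; m x m but rank n
print("rank of H^T R^-1 H:", np.linalg.matrix_rank(A), "< m =", m)

# Any direction in the null space of H leaves the likelihood unchanged.
_, _, Vt = np.linalg.svd(H)
z = Vt[-1]                           # a null-space direction, H z ≈ 0
s = rng.standard_normal(m)

def loglike(s):
    r = H @ s                        # log-likelihood up to a constant (y = 0)
    return -0.5 * r @ R_inv @ r

print("loglike(s)       :", loglike(s))
print("loglike(s + 10*z):", loglike(s + 10 * z))   # identical
```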
6. Quadratic norms and relation to the geostatistical approach

In this section, we will focus on the case of a single g that is a quadratic function of s,

g = \frac{1}{2} \|Ls\|^2 = \frac{1}{2} s^T L^T L s    (21)

where L is an (m - p) \times m matrix (more about how it can be chosen later), and p is a nonnegative integer. This case is interesting for two reasons: (1) such norms are manageable in terms of optimization and statistical analysis and yet yield good results in many applications; and (2) it is instructive to see their relation to existing methods. We will also use this case to illustrate a number of important issues. The prior pdf is

p'(s) \propto \exp\left( -\frac{1}{2\theta} s^T L^T L s \right)    (22)

where θ is a positive structural parameter. Perhaps the first feature to notice is that for positive p this function is not a proper probability distribution [8]. A simple example will illustrate this point. Consider m = 2 and L = [1, -1]; then

L^T L = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}    (23)

and

p'(s) = p'(s_1, s_2) \propto \exp\left( -\frac{1}{2\theta} (s_1 - s_2)^2 \right)    (24)

The integral is infinite, and the same holds true for other moments. This may create confusion, and discussions ensue on how to interpret this kind of function. For example, does the fact that \int_{-\infty}^{\infty} s_1^2 \exp\left( -\tfrac{1}{2} (s_1 - s_2)^2 \right) ds_1 is infinite somehow mean that s_1 is modeled with a priori infinite variance? This issue is important because, as we will discuss later, improper priors are invaluable tools in the Bayesian approach to inverse problems. In addition to the points we made before, there is an important point related to the multivariate nature of the problem. In intuitive terms, though an improper prior may not fully characterize s, it characterizes part of s. Let us approach the issue in more precise mathematical terms. For specificity, consider that L has rank m - p and define q = Ls. Then, from (22) one can see that

p'(q) \propto \exp\left( -\frac{1}{2} q^T q \right)    (25)

meaning that q follows a normal distribution with zero mean and identity covariance matrix. Thus, the prior specifies the distribution of some linear transformation or increments of s. In the example above, the increment is the difference (s_1 - s_2).

6.1. Relation to intrinsic functions

We will explore the connection of generalized priors to Matheron's concepts of the semivariogram and the generalized covariance function [20,11,13]. An ordinary covariance function, like the exponential, is associated with an ordinary prior distribution. For example, for the common geostatistical model that the mean is constant and the covariance function is exponential,

E[s(x_i)] = \beta, \qquad E[(s(x_i) - \beta)(s(x_j) - \beta)] = \sigma^2 \exp\left( -\frac{|x_i - x_j|}{\ell} \right)    (26)

then, with the assumption of a Gaussian distribution, the prior of a vector of s values is

p'(s) \propto \exp\left( -\frac{1}{2} (s - X\beta)^T Q^{-1} (s - X\beta) \right)    (27)

where X is a vector of 1's and Q_{ij} = \sigma^2 \exp(-|x_i - x_j|/\ell). The covariance matrix Q is positive definite, provided of course that there are no redundant entries in s, like two values at the same location. By contrast, an intrinsic model with linear semivariogram or generalized covariance function states

E[s(x_i) - s(x_j)] = 0, \qquad \frac{1}{2} E\left[ (s(x_i) - s(x_j))^2 \right] = \theta |x_i - x_j|    (28)

That is, the mean is a constant and the "semi-variance" is a linear function of separation distance. With this information, we cannot form an ordinary prior probability density function, since neither the mean nor the covariance function is known. It is known (e.g., see [12]) that the negative semivariogram (for the constant-mean case) can be used instead of a regular covariance provided that only "authorized increments" of s are involved, i.e., linear combinations of s entries that are independent of the value of the mean. For example, s(x_i) - s(x_j) and s(x_i) - 2s(x_j) + s(x_k) are authorized increments. Let us now revisit this problem and connect it to the concept of a generalized covariance function. Consider the following model:

E[s] = X\beta, \qquad E[(s - X\beta)(s - X\beta)^T] = Q    (29)

Then,

p'(s \mid \beta) \propto \exp\left( -\frac{1}{2} (s - X\beta)^T Q^{-1} (s - X\beta) \right)    (30)

which is a proper prior, provided that the covariance function is a proper positive definite function, such as the exponential covariance K(h) = \sigma^2 \exp(-h/\ell). Now assume that β is random with uniform prior distribution. After integrating over β, one can write the prior marginal pdf of s as
p'(s) \propto \exp\left( -\frac{1}{2} s^T G s \right)    (31)
where
G = Q^{-1} - Q^{-1} X (X^T Q^{-1} X)^{-1} X^T Q^{-1}    (32)
One can easily verify that G is of rank no larger than m - p as follows. We can write G as
G = Q^{-1} \left[ I - X (X^T Q^{-1} X)^{-1} X^T Q^{-1} \right]    (33)
where the term in the square brackets is a projection matrix with p eigenvalues equal to 0 and m - p eigenvalues equal to 1. Thus G is not invertible and the expression of Eq. (31) is a generalized prior. This example illustrates that a generalized prior can be interpreted as the kernel of a marginal pdf. The example also illustrates that the mean and covariance we started with are subsumed into the scalar s^T G s. Thus, one could arrive at the same point by simply choosing a reasonable norm. This approach allows one not only to use a broader repertory of models but also to choose models with a form that is more amenable to computations, while at the same time meeting the usual objectives of introducing information about the structure of the unknown function. A specific example will illustrate the point. Consider a function s(x) defined on a line. Let s be created through uniform discretization of s(x) into 5 values. Then, referring to Eq. (21), a possible L is the first-order differentiation matrix, the 4 \times 5 matrix
L = \begin{bmatrix} -1 & 1 & 0 & 0 & 0 \\ 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}    (34)
Notice that in L the nonzero elements are 8 out of 20. If instead of 5 there were m values, the ratio of nonzero elements would be about 2/m, indicating that the ratio of nonzero elements would be very small when m is large. The most important feature is in terms of scaling of the problem: the number of nonzero elements of the L matrix increases roughly linearly with the number of unknowns m. Thus, working with the generalized prior of Eq. (21) with an L of the form (34) has important computational advantages. It is also noted that the product L^T L is
L^T L = \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}    (35)
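The scaling argument is easy to check with sparse-matrix tools. The sketch below (our own illustration using the standard scipy.sparse API; the sizes are arbitrary) builds the differencing matrix of Eq. (34) for several m and confirms that the fraction of nonzeros behaves like 2/m, while LᵀL is the tridiagonal matrix of Eq. (35):

```python
import numpy as np
from scipy import sparse

def first_difference(m):
    """(m-1) x m first-order differencing matrix, as in Eq. (34)."""
    e = np.ones(m - 1)
    return sparse.diags([-e, e], offsets=[0, 1], shape=(m - 1, m), format="csr")

for m in [5, 100, 10000]:
    L = first_difference(m)
    ratio = L.nnz / (L.shape[0] * L.shape[1])
    print(f"m = {m:6d}: nonzeros = {L.nnz:6d}, ratio (≈ 2/m) = {ratio:.5f}")

# L^T L is the tridiagonal matrix of Eq. (35); storage stays O(m).
LtL = (first_difference(5).T @ first_difference(5)).toarray()
print(LtL)
```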
Furthermore, let us compare this formulation with the linear generalized covariance model.
X = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \qquad Q = -\theta \begin{bmatrix} 0 & 1 & 2 & 3 & 4 \\ 1 & 0 & 1 & 2 & 3 \\ 2 & 1 & 0 & 1 & 2 \\ 3 & 2 & 1 & 0 & 1 \\ 4 & 3 & 2 & 1 & 0 \end{bmatrix}    (36)
Then, we can compute G,
G = \frac{1}{2\theta} \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{bmatrix}    (37)
Thus, introducing the generalized prior (21) with the sparse matrix L of (34) is mathematically equivalent to using the intrinsic geostatistical model with linear semivariogram or generalized covariance function. This mathematical equivalence was anticipated (see [13]). However, the main point concerns the computational cost: adopting formulations with sparse matrices, or for that matter other structures that can be exploited, may reduce the computational cost significantly. This idea has been exploited in another paper [19]. The example also shows that the concept of the generalized prior helps bridge the gap between stochastic inversion methods and variational methods such as the ones used in the context of Tikhonov regularization [24]. One can use the Tikhonov regularization term as a means to rank possible solutions (or realizations) rather than simply to find a single optimal solution. One can also use statistical estimation techniques developed in the context of stochastic methods to select the appropriate weights of the regularization versus the data-fit terms in variational methods (see [13]).
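The equivalence can be verified directly. The sketch below (a numerical check of Eqs. (32), (36), and (37); θ = 1 is an arbitrary choice) computes G from the linear generalized covariance and compares it with LᵀL/(2θ):

```python
import numpy as np

theta = 1.0
m = 5
x = np.arange(m)

# Linear generalized covariance model, Eq. (36): Q_ij = -theta * |x_i - x_j|.
Q = -theta * np.abs(x[:, None] - x[None, :])
X = np.ones((m, 1))

# G from Eq. (32).
Qi = np.linalg.inv(Q)
G = Qi - Qi @ X @ np.linalg.inv(X.T @ Qi @ X) @ X.T @ Qi

# Sparse equivalent, Eqs. (34)-(35) and (37): G = (1/(2*theta)) * L^T L.
L = np.diff(np.eye(m), axis=0)          # 4 x 5 first-difference matrix
print(np.allclose(G, L.T @ L / (2 * theta)))   # True
```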
7. Some practical issues

One of the issues in applications is that inversions usually produce overly smooth or "blurred" images. In other words, what could easily have been an abrupt change in conductivity measured through a borehole flowmeter [7] becomes a gradual one as part of the inversion. One of the principal reasons for this blurring is in the nature of the inverse problem. Due to lack of sufficient information, there are many possible solutions. If one plots the ensemble average, as is often effectively done in practice, the process of averaging different solutions becomes responsible for the blurring. For example, if the abrupt change is at a different location in each conditional realization, then the average shows a gradual transition that may appear unrealistic given what is known from geology or is seen in other site-specific data, such as from core samples. The point is that the plot represents the ensemble average.

However, there is another reason, associated with the use of quadratic norms, which correspond to Gaussian priors, that may make it difficult to see abrupt changes even in realizations. One characteristic of quadratic norms is that they result in assigning a higher a priori probability to possible solutions with many small changes in the unknown function as opposed to a few large ones. Of course, these methods are sufficiently flexible to accommodate abrupt changes through more or less manual intervention in a specific application, such as by introducing multiple zones, each with a different mean value [7]. However, this procedure is ad hoc and may be appropriate for simple cases or when there is additional information, such as from geophysical surveys, to guide the selection of the zones. Other methods are possible that focus on finding boundaries when it is known that heterogeneity can be represented through zones (e.g., [3]).

An extreme example where the Gaussian model may be inappropriate is the classification or indicator problem, which can be posed as follows. In a certain domain, such as the subsurface in the case of hydrogeologic applications, there are samples Z(x_i) at sampling locations x_1, x_2, \ldots, x_n that allow us to classify each location as belonging to one of M categories. For example, one may have H_1 = silty loam, H_2 = sand, and H_3 = gravel. Ideally, the samples should give error-free classification at sampled locations, but we do recognize that in actual applications some uncertainty may remain. The main challenge is how to classify at locations that were not sampled. In this case, we are not interested in small changes in the values of some attribute; rather, we want to identify the few boundaries between different facies; thus certain methods, like ordinary kriging, may be inappropriate.

We will illustrate some of these points through a simple interpolation problem. Consider a function s(x) that we have measured
without error at x = 0 and x = 10, with s_0 = 1 and s_{10} = 0. We limit our attention to solutions that reproduce the data exactly so that, in this example, the focus is only on the prior (which in this special case is the same as the posterior) that guides the evaluation of solutions at the intermediate locations where there are no data. There are many possible solutions, which are values at the nine uniformly spaced intermediate locations x = 1, \ldots, 9. Let us compare several different possibilities, which we will evaluate according to the following criteria g, which involve norms like those in Eq. (14):

g_2 = (s_1 - s_0)^2 + (s_2 - s_1)^2 + \cdots + (s_{10} - s_9)^2    (38)

g_1 = |s_1 - s_0| + |s_2 - s_1| + \cdots + |s_{10} - s_9|    (39)

g_0 = i(s_1 - s_0) + i(s_2 - s_1) + \cdots + i(s_{10} - s_9)    (40)

where

i(x) = \begin{cases} 0 & x = 0 \\ 1 & x \neq 0 \end{cases}    (41)

The first is a familiar quadratic function that is the sum of squared steps (as already discussed, this is equivalent to using a geostatistical model with linear semivariogram); the second is the sum of absolute steps; the third is the count of how many times the step is nonzero. We consider five realizations, i.e., possible solutions, shown in Fig. 1. s_1, s_2, and s_3 involve small random steps, which were generated from three different distributions (Gaussian, symmetric exponential, asymmetric exponential). There are really no striking differences in appearance among the three functions. The next realization, s_4, has 9 small steps and one step that is much larger than the rest. The final realization, s_5, has a single nonzero step. Note that all realizations were selected to appear the same with respect to criterion g_2. The results are given in the following table:

            g_2     g_1     g_0
s_1         1       2.59    10
s_2         1       3.04    10
s_3         1       2.86    10
s_4         1       1.31    10
s_5         1       1.00    1

[Fig. 1. Five different solutions between two points of known values.]

The first four look the same with respect to criteria g_2 and g_0. With respect to criterion g_1, the fourth solution looks considerably better, i.e., more likely, than the first three. The reason is that criterion g_1 does not penalize a single large step, unlike the quadratic case, but only looks at the total length of the steps. The last case, s_5, is different from the others in that there is a single step. This feature is captured by the third criterion, g_0. Furthermore, on the same example, consider that we plot the maximum probability estimate, i.e., the one that maximizes the probability of intermediate values based on the prior we have selected. One arrives at the following minimization problems:

g_2: \quad \min_s \sum_{k=1}^{10} (s(k) - s(k-1))^2    (42)

g_1: \quad \min_s \sum_{k=1}^{10} |s(k) - s(k-1)|    (43)

g_0: \quad \min_s \sum_{k=1}^{10} i(s(k) - s(k-1))    (44)

where s(0) = 1 and s(10) = 0. The objective function is strictly convex for the g_2 criterion (quadratic case), and the unique solution is a straight line between the two measurement points. However, the problem is not strictly convex for the other two minimization problems. Regarding g_1 (the absolute-values case), there are infinitely many solutions; specifically, any non-increasing solution, 1 \geq s(1) \geq \cdots \geq s(9) \geq 0, sets the value of the objective function to 1, which is the minimum value. This is another indication that g_1 does not discriminate against large steps but does not favor them either, as there are many more solutions that have gradual changes than a single large change. Regarding g_0, there are exactly 10 possible solutions that correspond to the minimum value 1; s is either 1 or 0 and the switch from 1 to 0 occurs at exactly one step. In Fig. 2, we show the maximum probability solution corresponding to the g_2 norm, one of the many maximum probability solutions corresponding to g_1, and one of the ten solutions for g_0.

[Fig. 2. Maximum probability solutions for g_2 (unique), g_1 (one of many), and g_0 (one of ten).]

Even when the "most-probable value" approach fails to give a unique best estimate, one could use an alternative approach to summarize the results from the same probabilistic models. For example, one can use the "ensemble-mean value" approach to obtain a best estimate. For the g_2 criterion, the ensemble mean is the same as the most likely solution, so it is a straight line. Interestingly, for g_1 and g_0, the mean is the same as for g_2, i.e., a straight-line interpolation. Thus, depicting the ensemble average tends to obscure
differences between probabilistic models. The point is that the criterion that is used to select the representative solution is as important as the probability model that is chosen. Yet another way to depict the results is based on the perspective of classifying the results into a class with values equal to (or close to) 1 and a class with values equal to (or close to) 0. The classification problem is somewhat different from that of estimation in that, instead of seeking ensemble-mean or most probable values, one looks for a good classification rule and evaluates its performance through the probabilistic model, which allows us to evaluate the probability that the classification is right. For example, one may adopt the rule that a location is classified as belonging to 1 or 0 if the median of all realizations is larger or smaller than the midpoint 0.5. This results in a picture that looks like the interpolation that one obtains with the nearest-neighbor interpolation method. However, a picture depicting a classification is not exactly comparable to a picture depicting a most probable or ensemble-average solution.

This example helps to illustrate the following points:

1. If the variability of the actual function can be described in terms of many small and gradual variations, it is generally preferable to use the quadratic criterion, which also is well studied, has nice mathematical properties (like convexity), and may be amenable to efficient computational treatment. Nevertheless, if one were to use a criterion like absolute values, one may not get dramatically different results.

2. If one has reason to expect that the variability is dominated by a single change or a few changes, one may want to avoid the Gaussian distribution because it tends to break a large change into many small ones. Instead, one would be better off using criteria like g_1 that do not unduly penalize a large change.

3. For problems such as the classification (i.e., indicator) problem, where one definitely knows that there are very few changes and that small changes are to be disregarded, one should not use Gaussian prior distributions (which focus on small changes). Using the absolute-value formulation is not particularly helpful because it does not penalize large changes but does not favor them either. Instead, one should use alternative criteria, such as g_0, which penalize only the number of changes, so that a solution that fits the data with only a small number of changes, or "facies", can be identified. Furthermore, one should depict the classification based on a rule rather than an ensemble mean.

4. The differences in the solutions obtained using various norms can be seen mainly in the realizations. If one intends to only utilize or present the ensemble mean, as sometimes done in applications, the three criteria may yield similar results, despite the differences in structure that they represent. For g_1 and particularly g_0, the smearing is due to the averaging rather than a preference for smeared solutions. In other words, the smearing is due to lack of information and not because it was somehow imposed by the prior distribution. In fact, for g_0, averaging introduces smearing even though the prior distribution explicitly states that changes are abrupt.
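The table entries for the limiting cases are easy to reproduce. The sketch below (our own recreation, not the paper's code; the two candidate solutions are the straight line, which is the most probable solution under g_2, and a single-step function) evaluates the criteria of Eqs. (38)-(40):

```python
import numpy as np

def g2(s): return np.sum(np.diff(s) ** 2)        # Eq. (38), quadratic
def g1(s): return np.sum(np.abs(np.diff(s)))     # Eq. (39), absolute values
def g0(s): return np.count_nonzero(np.diff(s))   # Eq. (40), step counter

# Solutions on x = 0,...,10 with s(0) = 1 and s(10) = 0 reproduced exactly.
line = np.linspace(1.0, 0.0, 11)                 # most probable under g2
step = np.array([1.0] * 5 + [0.0] * 6)           # one of ten minimizers of g0

for name, s in [("straight line", line), ("single step", step)]:
    print(f"{name:14s}: g2 = {g2(s):.2f}, g1 = {g1(s):.2f}, g0 = {g0(s)}")
# The line minimizes g2 (0.10 < 1.00); both attain the g1 minimum of 1;
# the step attains the g0 minimum of 1 while the line scores 10.
```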
8. Concluding remarks

Within a Bayesian framework, the concept of a generalized prior allows one to use models that introduce information about the structure of an unknown function that may not be readily captured by more conventional geostatistical models. The generalized prior is a practical way to rank solutions within a probabilistic framework. Furthermore, one may choose formulations that can be solved more efficiently, because they may involve matrices that are sparse (e.g., [19]) or can be decomposed into
sub-matrices of low rank (e.g., [23]). A crucial consideration for large-scale problems is how the computational cost increases with the number of measurements and the number of nodes where a solution is to be computed. Generalized priors with special structure may be useful in producing inversion algorithms with better performance.

Acknowledgments

This research was funded by NSF Award 0934596, CMG Collaborative Research: Subsurface Imaging and Uncertainty Quantification.

Appendix A

The problem is to minimize the negative of the entropy functional (16) subject to the expectation constraints (17), where g_k(s) is a given scalar function of s and V_k is a known value. Assume that f(s) is the actual minimizing function, which satisfies both the objective and the constraints, and introduce a variation εη(s), where ε is a scalar variable and η(s) is a scalar function of the vector s. For reasons that will become obvious later, we limit attention to variations that satisfy
\int g_k(s)\, \eta(s)\, ds = 0, \quad k = 1, \ldots, K    (45)
We want to minimize, with respect to ε,
I(\varepsilon) = \int [f(s) + \varepsilon \eta(s)] \ln(f(s) + \varepsilon \eta(s))\, ds    (46)
subject to the constraints
\int g_k(s) [f(s) + \varepsilon \eta(s)]\, ds = V_k, \quad k = 1, \ldots, K    (47)
We form the Lagrangian
\mathcal{L}(\varepsilon, \lambda_1, \ldots, \lambda_K) = \int [f(s) + \varepsilon \eta(s)] \ln(f(s) + \varepsilon \eta(s))\, ds + \lambda_1 \left( \int g_1(s) [f(s) + \varepsilon \eta(s)]\, ds - V_1 \right) + \cdots + \lambda_K \left( \int g_K(s) [f(s) + \varepsilon \eta(s)]\, ds - V_K \right)    (48)
Set the derivative of the Lagrangian with respect to ε equal to 0,
\int [\ln(f(s) + \varepsilon \eta(s)) + 1]\, \eta(s)\, ds + \lambda_1 \int g_1(s)\, \eta(s)\, ds + \cdots + \lambda_K \int g_K(s)\, \eta(s)\, ds = 0    (49)
Combine the terms
\int [\ln(f(s) + \varepsilon \eta(s)) + 1 + \lambda_1 g_1(s) + \cdots + \lambda_K g_K(s)]\, \eta(s)\, ds = 0    (50)
Since the optimum is obtained when ε vanishes,
\int [\ln(f(s)) + 1 + \lambda_1 g_1(s) + \cdots + \lambda_K g_K(s)]\, \eta(s)\, ds = 0    (51)
Furthermore, since this condition must be satisfied for arbitrary η(s), it follows that the term that multiplies it must be 0,
\ln(f(s)) + 1 + \lambda_1 g_1(s) + \cdots + \lambda_K g_K(s) = 0    (52)
Setting the derivative of the Lagrangian with respect to each of the Lagrange multipliers λ_k equal to 0 reproduces the constraints (47); since f(s) itself satisfies (17), this reduces to Eq. (45).
Thus, from Eq. (52), we obtain that the distribution is given by the following equation:
f(s) = \exp[-1 - \lambda_1 g_1(s) - \cdots - \lambda_K g_K(s)]    (53)
The values of the k’s can be computed by substituting the form of f into the constraints and solving the system of the K equations with K unknowns. However, this is not needed in this work. Also, since we need the un-normalized function f(s) (i.e., within a multiplicative constant), we can just write:
f(s) \propto \exp[-\lambda_1 g_1(s) - \cdots - \lambda_K g_K(s)]    (54)
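The result can also be checked numerically on a discrete analogue. The sketch below (our own illustration, with an arbitrary grid and moment value, using scipy.optimize) maximizes the discrete entropy subject to one moment constraint and confirms that ln f is affine in g, as Eq. (54) requires:

```python
import numpy as np
from scipy.optimize import minimize

# Discrete analogue: maximize S = -sum f_i ln f_i subject to sum f_i = 1
# and sum g_i f_i = V, with g(s) = |s| on a grid.
s = np.linspace(-3.0, 3.0, 61)
g = np.abs(s)
V = 1.0                                    # assumed known moment, Eq. (17)

def neg_entropy(f):
    return np.sum(f * np.log(f))

cons = [{"type": "eq", "fun": lambda f: f.sum() - 1.0},
        {"type": "eq", "fun": lambda f: (g * f).sum() - V}]
f0 = np.ones_like(s) / s.size
res = minimize(neg_entropy, f0, constraints=cons,
               bounds=[(1e-12, 1.0)] * s.size, method="SLSQP")

# Eq. (54): the solution should satisfy ln f = const - lam*g, i.e.,
# ln f is an affine function of g. Check by linear regression:
slope, const = np.polyfit(g, np.log(res.x), 1)
resid = np.log(res.x) - (slope * g + const)
print("lambda ≈", -slope, "  max deviation from Eq. (54):", np.abs(resid).max())
```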
References

[1] Akaike H. The interpretation of improper prior distributions as limits of data dependent proper prior distributions. J R Stat Soc B 1980;42(1):46–52.
[2] Benjamin JR, Cornell CA. Probability, statistics, and decision for civil engineers. New York: McGraw-Hill; 1970.
[3] Cardiff M, Kitanidis PK. Bayesian inversion for facies detection: an extensible level set framework. Water Resour Res 2009;45(10):W10416.
[4] Carrera J, Neuman SP. Estimation of aquifer parameters under transient and steady state conditions. 1: Maximum likelihood method incorporating prior information. Water Resour Res 1986;22(2):199–210.
[5] Christakos G. Modern spatiotemporal geostatistics. New York: Oxford University Press; 2000.
[6] deGroot M. Optimal statistical decisions. New York: McGraw-Hill; 1970.
[7] Fienen MN, Kitanidis PK, Watson D, Jardine P. An application of Bayesian inverse methods to vertical deconvolution of hydraulic conductivity in a heterogeneous aquifer at Oak Ridge National Laboratory. Math Geol 2004;36(1):101–26.
[8] Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. 2nd ed. CRC Press; 2003.
[9] Jaynes ET. Probability theory: the logic of science. Cambridge, UK: Cambridge University Press; 2003.
[10] Kitanidis PK. Parameter uncertainty in estimation of spatial functions: Bayesian analysis. Water Resour Res 1986;22(4):499–507.
[11] Kitanidis PK. Generalized covariance functions in estimation. Math Geol 1993;25(5):525–40.
[12] Kitanidis PK. Introduction to geostatistics. Cambridge University Press; 1997.
[13] Kitanidis PK. Generalized covariance functions associated with the Laplace equation and their use in interpolation and inverse problems. Water Resour Res 1999;35(5):1361–7.
[14] Kitanidis PK. On stochastic inverse modeling. In: Hyndman DW, Day-Lewis FD, Singha K, editors. Subsurface hydrology: data integration for properties and processes. Geophysical Monograph, vol. 171. AGU; 2007. p. 19–30.
[15] Kitanidis PK. Bayesian and geostatistical approaches to inverse problems. In: Biegler L, Biros G, Ghattas O, Heinkenschloss M, Keyes D, Mallick B, et al., editors. Large-scale inverse problems and quantification of uncertainty. Wiley; 2010. p. 71–85.
[16] Kitanidis PK, Lane RW. Maximum likelihood parameter estimation of hydrologic spatial processes by the Gauss–Newton method. J Hydrol 1985;79:53–71.
[17] Lavrentiev MM. Some improperly posed problems of mathematical physics. Berlin: Springer; 1967.
[18] Lindley DV. Introduction to probability and statistics, part 2: inference. London: Cambridge University Press; 1965.
[19] Liu X, Kitanidis PK. Large-scale inverse modeling with an application in hydraulic tomography. Water Resour Res 2011;47(2):W02501.
[20] Matheron G. The intrinsic random functions and their applications. Adv Appl Probab 1973;5:439–68.
[21] Neuman SP. Calibration of distributed groundwater flow models viewed as a multiple-objective decision process under uncertainty. Water Resour Res 1973;9(4):1006–21.
[22] Richards JI, Youn HK. Theory of distributions. Cambridge University Press; 1990.
[23] Saibaba A, Kitanidis PK. Efficient methods for large-scale linear inversion for geostatistical applications, submitted for publication.
[24] Tikhonov AN, Arsenin VY. Solutions of ill-posed problems. New York: Halsted Press; 1977.
[25] Tribus M. Rational descriptions, decisions, and designs. New York: Pergamon Press; 1969.
[26] Ye M, Neuman SP, Meyer PD, Pohlmann K. Sensitivity analysis and assessment of prior model probabilities in MLBMA with application to unsaturated fractured tuff. Water Resour Res 2005;41(12):W12429.