Fisherian or Bayesian methods of integrating diverse statistical information?


Fisheries Research 37 (1998) 61–75

Tore Schweder
Department of Economics, University of Oslo, Box 1095 Blindern, 0317 Oslo, Norway

Abstract

Directly observed data are in many cases combined with diverse indirect information to draw inference on parameters of interest to the fishery scientist. The indirect information might be based on previous data, analogous data or the researcher's expert judgement. The Bayesian prior distribution is the most common concept for representing such indirect information, and the Bayesian paradigm is gaining popularity. An alternative methodology based on the likelihood principle is presented and compared to the Bayesian. In the tradition of R.A. Fisher, the method concentrates on the likelihood function, without bringing in prior distributions that are not based on data. To provide for the integration of relevant indirect statistical information into the likelihood function, the concept of indirect likelihood is proposed. The indirect likelihood is treated as an ordinary independent component of the likelihood. If the indirect likelihood of a parameter is based on previous data, the inclusion of the indirect likelihood in the new study amounts to combining the old and the new data. The two methods are explained and compared, and it is argued that the likelihood method often is advantageous in the scientific context. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: Likelihood analysis; Bayesian methodology; Direct and indirect data; Stock assessment

1. Introduction

Fishery science is a peculiar mix of science and engineering. The instrumental view of the engineer prevails: a management scheme is good as long as it works, regardless of the truth of its underlying assumptions. The opposite view also has support: only through solid science can management advice be given. There is agreement, however, on the need to base fishery management on scientific findings as far as possible, even when scenario experiments or other operations-analytic simulation techniques are necessary to investigate the operating characteristics of a given management scheme. This is reflected in the FAO document on the precautionary approach to fisheries (Anon., 1995, and elsewhere).

This paper is concerned with a basic methodological aspect of fishery science, and in fact, of science in general. There is often a need to integrate statistical information from diverse sources, and the question is how this is best done. The Bayesian methodology is capable of such integration, and is gaining popularity among fishery scientists. This popularity might partially be due to the perceived incapability of the other main statistical methodologies to allow satisfactory synthesis. A dominant paradigm of statistical inference is the frequentist, where focus is on the distribution of hypothetical conclusions if the statistical method were to be used repeatedly on data generated by the probabilistic model.


A frequentist technique, meta-analysis, for summarizing results from independent studies of the same parameter has been developed in recent years (Hedges and Olkin, 1985). Meta-analysis is not discussed in this paper, since the Bayesian method and the other contender to the Bayesian paradigm, the Fisherian one, seem to be conceptually cleaner and also more efficient in integrating diverse statistical data.

In the statistical tradition stemming from R.A. Fisher, the focus is on the likelihood function. The most common way to use the likelihood function is to use its mode as a point estimator for the parameter, and to characterize the associated uncertainty by the inverse curvature at the top of the log-likelihood function, interpreted as an estimate of the frequentist variance of the estimator. This is the method of maximum likelihood. This limited view of the likelihood function, and indeed of the Fisherian approach to statistics, seems to have taken focus away from the potential for integrating diverse statistical information by direct Fisherian use of the likelihood function. To provide for the integration of relevant indirect statistical information into the likelihood function, the concept of indirect likelihood is proposed.

The paper is organized as follows. First, the Bayesian methodology is sketched. Under the Bayesian paradigm, sampling variation and scientific uncertainty are both conceptualized in probabilistic terms. Before approaching the new data, the Bayesian specifies his knowledge/uncertainty about a parameter in terms of a probability distribution. By Bayes' formula, this prior distribution is combined with the information contained in the new data, as summarized in the likelihood function. The result is called the posterior distribution, and is a probability distribution representing the beliefs of the investigator after having carried out the analysis. Then, the likelihood principle and the backbone of the Fisherian approach are presented. According to the likelihood principle, all the information in the new data is contained in the likelihood function. I interpret this as guidance for statistical analysis. One aspect is that information and uncertainty should be expressed in likelihood terms, and not in probabilistic terms. The concepts and methods of likelihood analysis are briefly sketched.

This includes the new concept of indirect likelihood, which can be used to operationalize in likelihood terms the indirect information, say that obtained from previous studies. By including the indirect likelihood in the total likelihood to be investigated, indirect information is integrated with new data and a machinery for utilizing all available information is established. In Section 4, two examples are presented, both in view of the Bayesian and the Fisherian approach. The first example is simple and involves the binomial distribution. Its aim is to exemplify the basic Bayesian and Fisherian concepts, and to compare the two ways of analyzing the simple constructed data. The second example is more involved. It concerns the profile likelihood obtained for the maximum sustainable yield rate for north-eastern Atlantic minke whales. The purpose is, again, to illustrate the two approaches, and to investigate the possibly unintentional information content in a prior distribution. Finally, the Bayesian approach and the Fisherian approach to statistical inference are discussed and compared in general terms. The conclusion is that likelihood analysis is the more appropriate mode of scientific inference in parametric models. It is argued that there are Fisherian concepts and methods available for important tasks that previously have been thought to be handled best by Bayesian methods, and that conceptual shortcomings in the Bayesian approach are avoided.

The controversy between Bayesians and non-Bayesians has been heated throughout most of this century, but has perhaps cooled somewhat in recent years. Some arguments and counter-arguments are presented by Efron (1986). The picture is complicated by the non-Bayesian camp being a mixed bag, consisting basically of frequentists of the Neyman–Wald school and Fisherians (Efron, 1997). Most of my discussion follows traditional lines in the controversy, but in arguing in favor of the Fisherian paradigm I focus more than usual on the integration of diverse statistical data.

2. Bayesian methods

The most prominent statistician (and a most prominent biologist) of the 20th century, R.A. Fisher, devotes a good portion of his Statistical Methods and Scientific Inference (Fisher, 1973) to the Reverend Thomas Bayes and his An Essay Towards Solving a Problem in the Doctrine of Chances from 1763, and to the methodology that has flowed from this early contribution.


Fisher is sceptical towards the use of Bayesian methods in natural science proper, since they use the concept of probability in two different and often confusing ways. The first concept of probability is concerned with events that might occur in a future experiment. If the experiment is repeatable, like the flipping of a coin or the running of a transect leg in a survey, the probability of an event is most naturally interpreted frequentistically, as the limiting frequency of the event in the long run of identical experiments. In the Bayesian paradigm this concept is mixed with probability as degree of belief in statements about the state of nature.

The Bayesian paradigm is that our incomplete knowledge of the parameter in question can be expressed in a probability distribution. This distribution is called the prior distribution, since it represents the knowledge before the new data have been obtained. The beauty of the Bayesian method is then a conceptually simple way of combining the prior distribution with the information contained in the data to obtain an updated probability distribution representing the knowledge after the data have been seen. This updated probability distribution is called the posterior distribution. When faced with new data, the posterior from the previous study is used as the prior in the new study. The formula for combining a prior distribution with new data is called Bayes' formula.

A problem in the Bayesian approach is that information and uncertainty are different categories from randomness. Bayesians have tried to bridge this gap by insisting on the subjective nature of all information, and by equating both prior and posterior probability distributions with belief functions. But to many, there is still a distinction in category between probability and belief. And more seriously, many scientists find it rather 'unscientific' to regard scientific information as genuinely subjective.

On the other hand, the Bayesian method is an appealing way of accumulating evidence from different sources. In classical statistics, the focus is on the new data, and the aim is to analyze those, virtually in isolation from past data and possible indirect sources of relevant data. The cumulative nature of normal science is, however, essential.


That accumulation of evidence and integration of diverse data falls naturally within the Bayesian paradigm, while classical statistical methodology has little to offer in this respect, has attracted scientists in fisheries and other fields to Bayes and away from Fisher.

To explain the Bayesian method, think of a situation with a parameter of interest, θ, and data, x, obtained from a statistical experiment related to the parameter. That x provides information on θ means that the probability distribution of x depends on θ. This relationship is made precise in a statistical model for the experiment. The model can be formulated in terms of the model function, f(x; θ), which is to be understood as a probability density in x for a given value of θ. The model function can also be used the other way around: as a function of the parameter for a given value of the data. The function is then called the likelihood function. The likelihood function will be discussed further below and in the examples.

In addition to the model function, the Bayesian needs a prior probability distribution that represents his belief with respect to the state of nature. The probability density of this prior distribution is called the prior density, and is written f_pri(θ). The core of the Bayesian method is the calculation of the posterior probability density, f_post(θ), from the probability density of the observed data x for a given value of the parameter θ, f(x; θ), and the prior probability density. Bayes' formula for combining a prior distribution with the likelihood function of the observed data to obtain the posterior distribution follows from the rules of conditional probability, and is

$$f_{\mathrm{post}}(\theta) = \frac{f(x;\theta)\, f_{\mathrm{pri}}(\theta)}{\int f(x;t)\, f_{\mathrm{pri}}(t)\, \mathrm{d}t}. \qquad (2.1)$$

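Eq. (2.1) rarely needs a closed form; a minimal numerical sketch (added here for illustration, not part of the paper) approximates the denominator integral on a grid. The binomial model and the uniform prior below are stand-ins; any model function f(x; θ) and prior density could be used.

```python
import numpy as np
from scipy.stats import binom

# Illustrative stand-in: x successes in n binomial trials, parameter theta,
# uniform prior on (0, 1).
n, x = 30, 20
theta = np.linspace(0.001, 0.999, 999)      # grid over the parameter
prior = np.ones_like(theta)                 # f_pri(theta), here uniform
lik = binom.pmf(x, n, theta)                # f(x; theta) as a function of theta

# Bayes' formula (2.1): posterior = likelihood * prior / normalizing integral.
unnorm = lik * prior
post = unnorm / np.trapz(unnorm, theta)     # numerical integral in the denominator

post_mean = np.trapz(theta * post, theta)
post_mode = theta[np.argmax(post)]
print(post_mean, post_mode)                 # about 0.656 and 0.667 for these data
```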
Having obtained the posterior distribution, the question is how to report it. When the posterior distribution has a nice parametric form, as in Section 4.1, the posterior can be reported completely. However, the posterior distribution will often be available only in numerical form. Then, a picture of the distribution, usually its density, is reported, together with a point estimate and a probability interval. The point estimate is usually either the mean or the mode of the density, and the conventional 95% Bayes interval is the shortest interval of posterior probability 0.95. A basic problem in Bayesian reporting is, however, that the posterior and its characteristics are based both on the new data and on the investigator's choice of prior distribution.


If the next researcher has a different prior distribution but wants to build on the previous study, his problem is to get rid of the original prior. To facilitate scientific accumulation of knowledge, some Bayesians suggest that both the likelihood function of the new data and the prior and posterior distributions be reported, preferably the likelihood function in the results section of the paper and the prior and posterior in the discussion section (Spiegelhalter et al., 1996).

When the parameter has more than one dimension, both the prior and the posterior distributions are multivariate distributions. If, say, θ = (θ₁, θ₂), where the first component, θ₁, is of primary interest while the second is more of a nuisance parameter, then inference concerning θ₁ is obtained from the marginal posterior distribution, with θ₂ integrated out.

Mathematically speaking, there is nothing wrong with Bayes' formula and Bayesian analysis. The Bayesian method is certainly useful in a number of contexts, but it is questionable in the strict scientific context. The main objection of Fisher and others is that the probability distribution for the data is a model for the sampling variability and stochasticity in the data generating processes at hand. The prior (and posterior) distributions are, however, weight functions representing the degree of belief in various possible states of nature. Mixing the two is mixing apples and oranges, and it is not obvious that the mathematics of conditional probability applies, or what the meaning of the mixture is.

There is another serious objection: the prior distribution is bound to have a subjective element. For a given study, the prior might be a posterior from previous studies, and thus be based on data. Some study must, however, have been the first, and then the prior cannot have support from previous data. The use of so-called non-informative priors does, in some sense, minimize the subjective element, but it does not remove it. A problem here is that if the parameter is non-linearly transformed, the shape of the prior density is also transformed. If, say, p has a uniform distribution from 0 to 1, the transformed parameter θ = −log p has an exponential distribution. The exponential density is certainly not flat, and as such not non-informative. The property of being non-informative is thus unfortunately not transformation invariant, and some element of subjectivity will always be buried in the prior distribution.

A more practical objection to the Bayesian method has been that the calculation of the posterior might be difficult. Great advances have, however, been made in recent years. Computing power has increased tremendously, and techniques for evaluating the integral in the denominator in Bayes' formula have improved. The most potent method for carrying out this integration is the Markov chain Monte Carlo method (Gilks et al., 1995). Gelman et al. (1995) present a broad introduction to Bayesian data analysis.

Despite the methodological objections to Bayesian data analysis, it has experienced growing support among statisticians in recent years, and indeed among fishery scientists. Bayesian analysis has been felt to 'make a lot of sense' in fisheries, and is relatively easy to explain to other scientists and to managers. In the decision context of the manager, as opposed to the scientific context of the researcher, the Bayesian approach is more appropriate.

3. Likelihood methods

The Fisherian approach to statistical inference is, of course, best presented by Fisher (1973). Edwards (1992) and Royall (1997) provide refreshing introductions and discussions, and Casella and Berger (1990) give a text-book introduction to likelihood analysis.

The likelihood principle is a simple and most far-reaching prescription for statistical behavior (Berger and Wolpert, 1984). The likelihood principle is a consequence of the sufficiency principle and the conditionality principle. It is usually attributed to Birnbaum (1962), but was essentially known to Fisher. The essence of the principle is that in a parametric situation, statistical inference should be based on the likelihood function and on nothing else. This seems to weaken the popular frequentist approach usually associated with J. Neyman, where results over unobserved but potential data enter the inference. It also, in my view, demerits the Bayesian approach, since the prior distribution is not part of the likelihood function, and should thus have no role to play in the inference. The only statistical paradigm truly in tune with the likelihood principle is, in my view, the Fisherian, which is explained below.


The Fisherian paradigm is also extended with the new concept of indirect likelihood, which allows partially reported previous or indirect data to be integrated with new direct data.

In a parametric statistical model for a data structure x, a statistic t(x) is sufficient if the conditional distribution of x given t is independent of the unknown parameter. The sufficient statistic carries all the information which is in the data. If we are told the value of t, no additional information concerning the parameter is left in x. The sufficiency principle is simply that statistical inference concerning the parameter can be based on the sufficient statistic, without loss of information. The sufficiency principle is a self-evident axiom.

The conditionality principle has been more open to debate, but it seems impossible to escape it as another self-evident axiom. The conditionality principle is that experiments not actually performed should be irrelevant to conclusions. More precisely, assume your data are obtained from a two-stage process. In the first stage a coin is flipped. If heads come up, experiment A is carried out, and if tails you proceed with experiment B. If the flip of the coin leads to experiment A, you should base your inference concerning the unknown parameter on the observed data in the light of the statistical model for experiment A, and forget about experiment B, since it was not carried out. The conditionality principle is that data should be interpreted in the light of the experiment that was actually carried out. Other experiments not carried out are irrelevant for inference.

The sufficiency principle and the conditionality principle imply the likelihood principle. This principle must therefore be taken as an inescapable dictum for parametric statistical inference. The essence of the likelihood principle is that the observed likelihood function carries all the relevant information (given the data and the model). Inference from the data, given the model, should thus be based on the likelihood function and on nothing else. Although statistical practice in many cases is in breach of the likelihood principle, the likelihood function is central to both frequentist and Bayesian analysis. There is thus every reason to pursue the study of likelihood-related methods. We will do this in two directions. First we present the likelihood function and its standard use to provide maximum likelihood estimates of parameters. Next we argue that the likelihood function is the appropriate instrument for integrating information from independent sources.


3.1. The likelihood function and maximum likelihood estimation

A parametric statistical model can be regarded as a function, f(x; θ), of the data to be observed, x, and the parameter, θ. Here, x and θ are shorthand for a possibly complicated data structure and a parameter vector, respectively. For fixed θ, and as a function of x, f(x; θ) is the probability (density) for the outcome of the data generating experiment. The statistical model is also a model for inference from the observed data, x_obs, to the parameter, θ. It is a remarkable fact that the likelihood principle concludes that all the information in the observed data is contained in the likelihood function, defined as

$$L(\theta) = f(x_{\mathrm{obs}};\theta),$$

regarded as a function of θ for x kept at the observed data, x_obs.

The dual interpretation of the statistical model function implies that probability and likelihood are the two central and dual concepts of statistical modeling and statistical inference. Probability is the concept used when modeling a statistical experiment, such as carrying out a trawl survey or observing the stomach content in cod. Having carried out the experiment, the task is to draw inference on the unknown parameter, θ, from the observed data, x_obs. Any efficient inference must, by the likelihood principle, be based on the likelihood function and on nothing else.

This distinction between probability for stochastic experiments generating data for inference, and likelihood for inference given the observed data, is often blurred. There is a long and unfortunate tradition in statistics of also using probability as the key concept when drawing inference. The Bayesians let both the parameter and the data be stochastic elements with a probability interpretation. In the non-Bayesian camp, where the frequentist school dominates, the emphasis is also on probabilities in inference. Key concepts are the probability distribution of an estimator, the coverage probability of a confidence interval, and the significance level and test power in hypothesis testing. These probabilities refer to hypothetical replications of the experiment, and are understood as properties of the method used for estimation, confidence interval calculation, or testing, and do not refer to the observed data at hand.


Even Fisher, who introduced the likelihood function and who was well aware of the essentials of the likelihood principle, proposed a concept called fiducial probability for the purpose of interpreting the observed data. This concept has not been very successful. One problem is that Fisher proposed to use the fiducial density from one study as a prior density for the next, and that leads to inconsistencies.

Methods for extracting information from the likelihood function will often have frequentist properties that might be worth noting. One such method is the profile likelihood to be discussed below. That confidence intervals based on the profile likelihood are asymptotically efficient in the frequentist sense does, for example, support the use of profile likelihoods for the purpose of characterizing the likelihood function. But for given observed data, it is the method's ability to make sense of the likelihood function that matters, not its frequentist properties over potential data sets that could have been observed.

Two likelihood functions that are proportional contain exactly the same information. For this reason, it is convenient to work with the log-likelihood ratio, here simply called the log-likelihood function,

$$l(\theta) = \log L(\theta) - \log L(\hat\theta),$$

where θ̂ is a parameter point at which L(θ) attains its maximum, L(θ) ≤ L(θ̂). When the (log-)likelihood function has a unique maximum, which we will assume, the locus of the maximum, θ̂, is the maximum likelihood estimate of the unknown parameter. When data are informative for θ, the log-likelihood is a sharp and pointed function, while a flat function reflects lack of information. The curvature of l is measured by the second derivative, and minus the curvature at the maximum likelihood estimate is called the (observed) information matrix (when θ is a vector, the first derivative is a vector and the second derivative is a matrix):

$$i = -\frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\, l(\hat\theta).$$

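A minimal sketch (mine, not the paper's) of how the maximum likelihood estimate and the observed information can be computed numerically in a one-parameter model; the exponential model and the data are purely illustrative, and the second derivative is approximated by a central finite difference.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: i.i.d. exponential observations with unknown rate theta.
x_obs = np.array([1.2, 0.7, 2.5, 0.3, 1.9, 1.1])

def loglik(theta):
    # log L(theta): sum of log densities of the observed data
    return np.sum(np.log(theta) - theta * x_obs)

# Maximum likelihood estimate: maximize log L (minimize its negative).
res = minimize_scalar(lambda t: -loglik(t), bounds=(1e-6, 50.0), method="bounded")
theta_hat = res.x                           # analytically: 1 / mean(x_obs)

# Observed information i = -l''(theta_hat), by a central finite difference.
h = 1e-4
info = -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h**2
se = info ** -0.5                           # later used as a standard error
print(theta_hat, info, se)
```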
In normally distributed samples, or in large samples, the information matrix has the frequentist interpretation of being the inverse of the covariance matrix of θ̂, regarded as a stochastic variable in a hypothetical replication of the experiment. In addition, when the log-likelihood function is approximately quadratic in θ in a region around the maximum likelihood estimate θ̂, in matrix notation

$$l(\theta) \approx -\tfrac{1}{2}\,(\theta - \hat\theta)'\, i\, (\theta - \hat\theta),$$

the maximum likelihood estimate has the frequentist interpretation of being obtained by a method that yields estimates that are approximately unbiased, normally distributed and with covariance matrix Σ = i⁻¹. In that case, 2(l(θ̂) − l(θ_true)) has the frequentist interpretation of being approximately χ² distributed with as many degrees of freedom as there are free parameters in θ, and with θ_true the true value of the parameter. These statements should be qualified and made mathematically precise; see Casella and Berger (1990) for more information.

In the case of a quadratic log-likelihood, the contours of the log-likelihood function are ellipsoids. Due to the approximate frequentist interpretation, which holds in linear models with normally distributed data or in approximately linear models and large samples, the contour at l(θ) = −½χ²_ν(0.95) is often used together with θ̂ as a summary of the information in the data regarding θ. Here, χ²_ν(0.95) is the 95% point of the χ² distribution with ν degrees of freedom, and ν is the number of free parameters in the model. With normal data or large samples, the contour has the frequentist interpretation of being obtained by a method with (approximate) coverage probability 0.95. From the purist likelihood point of view, the contour tells us how sharp the log-likelihood function is in the interesting region of the parameter space, and also how the peak of the log-likelihood is orientated. It is a good idea to calculate a few contours, say at −½χ²_ν(0.5) and −½χ²_ν(0.8) in addition to the 95% contour.

An extremely important property of the likelihood approach is that the log-likelihood of combined independent data is the sum of the log-likelihoods of each individual data component. This follows from the fact that the joint probability density of independent stochastic variables is the product of their marginal probability densities. The log of a product is the sum of logs, and the additivity property of the log-likelihood follows. Thus, if several independent studies provide relevant information on a parameter, their combined information is expressed in the sum of their log-likelihoods. To the extent that fisheries science is cumulative, sufficient information concerning the log-likelihood function should thus be reported in each single study, to allow results to be efficiently integrated across studies.

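The contour levels referred to above are simply χ² quantiles divided by two; a one-line check with SciPy (added here for convenience, not in the original):

```python
from scipy.stats import chi2

# Log-likelihood drop -0.5 * chi2_nu(q) defining a likelihood contour.
for nu in (1, 2):                           # nu = number of free parameters
    for q in (0.5, 0.8, 0.95):
        print(nu, q, round(-0.5 * chi2.ppf(q, df=nu), 3))
# For nu = 1 and q = 0.95 the drop is about -1.92, the cut-off used for
# 95% likelihood intervals in the examples below.
```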

In fisheries, ecology and other fields, it is often useful to integrate information from various sources by way of a deterministic model. The deterministic model might be a population dynamics model, as used for assessing the status of whale stocks (Raftery et al., 1995), or simply an accounting model as used in VPA analyses to estimate year-class strengths. The basic idea is to derive parameters for the various relevant data to enable the likelihood to be calculated within statistical models for each data set. The derived parameters are functions of the basic parameter of the deterministic model. The separate likelihood components are multiplied together, provided the different data are independent. The resulting combined likelihood is thus a function of the basic parameter, and ordinary likelihood analysis is carried out. Likelihood analysis of direct and possibly indirect data in view of a deterministic model is discussed by Schweder and Hjort (1997), who call the method 'likelihood synthesis' since it has the potential of integrating data from diverse sources. Likelihood synthesis (and Bayesian synthesis, Raftery et al., 1995, 1997) allows the biologist or economist to concentrate on the substance-matter model, which often is deterministic, and the statistician on the probabilistic model necessary to obtain a likelihood function in suitable derived parameters.

3.2. Indirect likelihood

As opposed to the Bayesian, classical Fisherian methodology leaves no room for prior information of a gradual statistical type. Prior information can only come in the format of model specifications or as constraints on the parameter of the model, not as anything like Bayesian prior distributions. The indirect likelihood introduced by Schweder and Hjort (1996), and studied further by Schweder and Hjort (1997), provides, however, a Fisherian framework for integrating prior statistical information with direct data.

A useful distinction can be made between direct and indirect data (Bravington, 1996). The direct data in a study at hand are the new data that typically are explained in the material section and reported in the results section of the study report.


The direct data can be previously published, but when their likelihood function is available they will be treated as direct data. The indirect data are additional data gathered from the literature or from expert opinion. They would naturally be presented and discussed in the discussion section of the report, and the integrated analysis and results will appear in that section. The distinction between direct and indirect data is not sharp, but the guiding principle is to distinguish between what are new data and their summary on the one hand, and what is supplementary information from data in the public domain, or possibly from subjective judgement, on the other. The indirect likelihood to be introduced aims at summarizing the indirect data in a format that makes it easy to combine them with the direct data in a likelihood framework. As noted, the indirect likelihood could be based either on subjective judgement, or on past data completely void of subjective judgement other than the ordinary judgement that has entered the past studies through their choice of model and method of analysis.

Let the basic parameter be θ and let the log-likelihood of the new direct data be l_dir(θ). If the indirect log-likelihood is l_ind(θ), the combined log-likelihood is simply

$$l(\theta) = l_{\mathrm{dir}}(\theta) + l_{\mathrm{ind}}(\theta).$$

This follows from the additivity property of the log-likelihood function: the combined log-likelihood for independent data is the sum of the log-likelihoods of the individual data components. The indirect log-likelihood is thus regarded as the log-likelihood of data independent of the new direct data.

By the likelihood principle, in the light of the parametric statistical model, all information in the data is contained in the likelihood function. In addition to guiding the analysis of the data, the likelihood principle also tells us how results should be reported. If the study has focus on a parameter of general interest, the likelihood principle indicates that sufficient information should be reported to enable future scientists to reconstruct the likelihood of the current data. This could be done by reporting a parametric fit to the direct log-likelihood function. If this had been done consistently in past studies, the indirect log-likelihood of the current study would simply be the sum of these individual direct log-likelihood functions.


Past studies will often not be reported sufficiently. A report might give the maximum likelihood estimate and a 95% confidence interval, or perhaps an estimate accompanied by a standard error. If this is all the information that is provided, it will be impossible to reconstruct the log-likelihood function stemming from the observed data of the study. To recover the log-likelihood of a single parameter, we will assume that sufficient information is provided to make it possible to first reconstruct the whole range of confidence intervals, with degree of confidence ranging from 0 (the point estimate) to near 100%. Let the family of confidence intervals be specified by a cumulative distribution function, C, called the confidence distribution. The quantiles of the confidence distribution are endpoints of confidence intervals. A central interval with degree of confidence 1 − c would, for example, be (C⁻¹(c/2), C⁻¹(1 − c/2)).

Now, in large samples, and in small samples with normally distributed data, the confidence interval for a one-dimensional parameter, with degree of confidence 1 − c, is conventionally calculated from the log-likelihood by solving l(θ) = −½χ²₁(1 − c). Since the χ² distribution with one degree of freedom is the distribution of the square of a standard normal variable, the quantiles of the χ² distribution relate to the quantiles z of the standard normal distribution as χ²₁(1 − c) = z²_{1−c/2} = z²_{c/2}. From these relations, the log-likelihood obtained from the family of confidence intervals represented by the confidence distribution C is

$$l_{\mathrm{ind}}(\theta) = -\tfrac{1}{2}\, z^2_{C(\theta)}. \qquad (3.1)$$

The indirect log-likelihood is thus minus half the squared normal score of the cumulative confidence.

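A sketch of Eq. (3.1) in code (my illustration, not the author's). The previous study is assumed, hypothetically, to have been summarized by a point estimate and a standard error, giving a normal confidence distribution; any other confidence distribution could be plugged in instead.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical summary of a previous study: point estimate and standard error.
est, se = 0.42, 0.10

def confidence_distribution(theta):
    # C(theta): cumulative confidence, here a normal confidence distribution.
    return norm.cdf((theta - est) / se)

def indirect_loglik(theta):
    # Eq. (3.1): l_ind(theta) = -0.5 * z_{C(theta)}^2
    z = norm.ppf(confidence_distribution(theta))
    return -0.5 * z**2

theta = np.linspace(0.1, 0.8, 8)
print(indirect_loglik(theta))
# With a normal confidence distribution this reduces to the familiar
# quadratic -0.5 * ((theta - est) / se)**2.
```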
The concept of implied likelihood was introduced by Efron (1993). The implied likelihood is obtained from Efron's ABC confidence intervals based on bootstrapping, and is derived by an elegant Bayesian argument. The formula for the implied likelihood is identical to the formula for the indirect likelihood presented above.

Traditionally, the results of a likelihood analysis are reported in the format of the maximum likelihood estimate, the standard error calculated as

$$\mathrm{s.e.} = \left[-l''(\hat\theta)\right]^{-1/2},$$

and, perhaps, the 95% confidence interval. This information is sufficient when the log-likelihood is quadratic. If this is the case, it must be stated.

In the non-normal case, it is desirable to report a parametric approximation to the full log-likelihood function, perhaps in the format of the confidence distribution

$$C(\theta) = \Phi\!\left(\pm\sqrt{-2\,l(\theta)}\right).$$

In this formula, found by inverting Eq. (3.1), Φ is the standard normal distribution function, and the sign of the square root is that of θ − θ̂. When the parameter θ is multi-dimensional, and interest centers on a one-dimensional derived parameter ψ = γ(θ), the profile likelihood

$$l_{\mathrm{prof}}(\psi) = \max_{\theta:\, \gamma(\theta) = \psi} l(\theta) \qquad (3.2)$$

is useful.

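A sketch (not the author's code) of the profile log-likelihood in Eq. (3.2) for a two-parameter model: at each fixed value of the interest parameter, the joint log-likelihood is maximized over the nuisance parameter. The normal model, with the mean as interest parameter and the standard deviation as nuisance, and the data themselves are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Illustrative data: normal observations with unknown mean mu and sd sigma.
x_obs = np.array([4.1, 5.3, 3.8, 4.9, 5.6, 4.4, 5.0])

def loglik(mu, sigma):
    return np.sum(norm.logpdf(x_obs, loc=mu, scale=sigma))

def profile_loglik(mu):
    # Eq. (3.2): maximize the joint log-likelihood over the nuisance parameter.
    res = minimize_scalar(lambda s: -loglik(mu, s), bounds=(1e-3, 10.0),
                          method="bounded")
    return -res.fun

mu_grid = np.linspace(3.5, 6.0, 251)
prof = np.array([profile_loglik(m) for m in mu_grid])
prof -= prof.max()                          # report as a log-likelihood ratio
mu_hat = mu_grid[np.argmax(prof)]
inside = mu_grid[prof >= -1.92]             # approximate 95% likelihood interval
print(mu_hat, inside.min(), inside.max())
```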
Profile likelihoods are discussed by Barndorff-Nielsen and Cox (1989), and are shown to provide asymptotically efficient inference under mild conditions. However, for finite data the maximum likelihood estimator might be biased in the frequentist sense. Due to the presence of nuisance parameters, the profile likelihood may need modification to yield confidence intervals with correct coverage probabilities. The same problem arises in Bayesian analysis, but is seldom addressed.

In 'normal' situations, Bayesian and Fisherian analysis lead to exactly the same conclusion, and both analyses are straightforward to carry out. In the 'normal' situation, the log-likelihood function of the vector of parameters, θ, is quadratic, and the prior distribution is either multivariate normal or an improper unlimited uniform distribution. The corresponding indirect log-likelihood is either quadratic or flat. Under these conditions, the profile log-likelihood of a component of θ, say θ₁, agrees, in fact, with the marginal posterior distribution of θ₁ obtained by the Bayesian. If, however, the prior distribution is non-normal, the log-likelihood is non-quadratic, or the interest parameter is a non-linear function of θ, Bayesian and Fisherian analysis will typically lead to different conclusions, even when based on exactly the same information. The question is then which of the two methods of analysis ends up with the least bias in the results. Asked this way, 'bias' is an attribute of the method and not of the data, and the question is how much off the true value of the interest parameter the results from the respective methods are, on average, when applied repeatedly to the same type of data.


4. Comparative examples

The first example might help to see more clearly how the Bayesian method and the likelihood method, respectively, operate, and what constitutes the difference between the two. The second example is technically more demanding, but it essentially bears the same message as the first.

4.1. Example 1: repeated binomial sampling, constructed data

First, a random sample of n₁ = 3 mature female minke whales has been caught and x₁ = 2 turned out to be pregnant. Second, an independent random sample of n₂ = 27 mature female minke whales was caught and x₂ = 18 were pregnant. The pregnancy probability, p, of a randomly sampled female minke whale is the parameter of primary interest. This parameter is regarded as a characteristic of the whale population (under current conditions). There is nothing random about p in the usual sense. The parameter is what it is, but we have incomplete knowledge of its numerical value. What is random, in the sense of an outcome of a stochastic experiment, is the number of pregnant females in a sample of size n. This number of 'successes' is binomially distributed.

4.1.1. Bayesian analysis

Assuming no previous knowledge of p, most Bayesians would use a 'non-informative' prior density. Since p is a probability lying between 0 and 1, the prior will naturally be chosen as the uniform distribution. The Bayesian would justify his prior on the ground that he is equally inclined to believe that the parameter is within any interval of equal length (within the unit interval). In quantitative terms, he would have a 'belief' of 0.1 for the pregnancy probability being between 0 and 0.1, as well as between 0.5 and 0.6, etc. With some partial knowledge of p, he would have used a non-uniform prior density. If, say, around 80% of mature females are known to be pregnant in a comparable situation, one would use a prior that concentrates most of its mass around 0.8, say a distribution with mean 0.8.


The probability density representing the uniform prior distribution is

$$f_{\mathrm{pri}}(p) = \begin{cases} 1, & 0 < p < 1, \\ 0, & \text{otherwise.} \end{cases}$$

To frame the belief in probabilistic terms with a uniform distribution can be regarded as somehow thinking that nature has carried out its grand experiment and come out with a minke whale species whose females have a pregnancy frequency, at the time and place of sampling, drawn from the uniform distribution. This is, of course, only a thought experiment meant to ease the use of probability calculus.

The Bayesian is interested in calculating the probability distribution of p after he has observed his binomial sample of three mature female minke whales. This amounts to calculating the conditional distribution of p given the observed event. Thus, he needs to calculate the probability of each of the two events, A: observing X = 2 pregnant females in a sample of n = 3, and B: nature having drawn the pregnancy probability in the interval (p, p + dp). Since the first probability is the binomial when p is given, and since the second event has probability element dp, the joint event has probability Pr(A ∩ B) = Pr(A | B) Pr(B) = 3p²(1 − p) dp. The posterior distribution is now found from the reverse conditional probability element Pr(B | A) = Pr(A ∩ B)/Pr(A). This is the essence of Bayes' formula, Eq. (2.1), and the posterior density is

$$f_{\mathrm{post}}(p) = 12\,p^2(1 - p), \qquad 0 < p < 1.$$

When faced with the second sample, the Bayesian might update his posterior distribution obtained from the first sample by using that posterior as a prior distribution to be combined with the likelihood function of the second sample. Or he might first combine the two samples into one binomial sample of size n = 30, and then apply Bayes' formula with the same uniform prior distribution. The two calculations yield the same result, as is generally the case. This consistency is called the coherence of the Bayesian approach. The final posterior distribution is the beta distribution with shape parameters 21 and 11, which has a density proportional to p²⁰(1 − p)¹⁰. The Bayesian might summarize this posterior distribution by the point of maximum posterior density, p̂ = 0.667, and the shortest interval of probability 0.95, (0.493, 0.814).

4.1.2. Fisherian analysis

The model function in a binomial experiment with n = 3 and with p as the parameter is

$$f(x; p) = \binom{3}{x} p^x (1 - p)^{3 - x}.$$

For the first sample, with x_obs = 2, the parameter value of maximum likelihood is p̂ = 2/3, and the log-likelihood function is l(p) = 2 log(3p/2) + log(3(1 − p)), which is plotted in Fig. 1 as the wider curve. The contour at minus the 95% point of the χ²/2 distribution with one degree of freedom, −1.92, is the interval (0.16, 0.98). Together with the maximum likelihood estimate, this gives a summary of the information on p contained in the data and expressed in the likelihood. Due to the small sample size, this interval has not been obtained by a method that gives a frequentist coverage probability of 95%.

Fig. 1. Binomial log-likelihoods for n = 3 and n = 30 (dotted curve).

The second sample of 27 females, of which 18 are pregnant, can be combined with the first sample to yield a likelihood function based on all the data in two different ways. The two samples can be pooled, yielding a binomial sample of size n = 30 with x_obs = 20, which in turn gives the likelihood function. The other possibility is to multiply the two likelihood functions together, which amounts to adding the two log-likelihood functions. The two methods give the same result:

$$l(p \mid n = 30, x = 20) = l(p \mid n = 27, x = 18) + l(p \mid n = 3, x = 2).$$

The likelihood approach is thus coherent in the same way as the Bayesian approach. The combined log-likelihood is graphed in Fig. 1 as the sharper curve. The maximum likelihood estimate is again found at p̂ = 2/3, and the contour of the log-likelihood function at the 95% point of the χ²/2 distribution with one degree of freedom is the interval (0.489, 0.817). This gives a summary of the information in the data. From the near parabolic shape of the log-likelihood, one can argue that n is large enough to allow this interval to have been obtained by a method with approximately the frequentist coverage probability 0.95.

In the Bayesian analysis, a uniform prior distribution was used. Does this provide unintentional information on the parameter p? This question is not easy to answer within the Bayesian framework, but we can ask what effect it would have on the likelihood function to include an indirect likelihood corresponding to a uniform confidence distribution. This indirect log-likelihood is identical to that in Fig. 2 below, except that its domain now is the unit interval. The result is a combined log-likelihood curve slightly shifted to the left of the sharper curve in Fig. 1, and made slightly sharper. The maximum likelihood point is now p̂ = 0.568, and the 95% confidence interval (the contour of the log-likelihood function at −1.92) is the interval (0.485, 0.807).

The Bayesian analysis agrees somewhat better with the likelihood analysis when the indirect likelihood corresponding to the uniform confidence distribution is not included. The Bayesian 95% interval is contained in the 95% likelihood interval when the indirect likelihood is excluded, reflecting that the Bayesian conclusion is slightly more 'informative'. When the indirect likelihood is included, the 95% likelihood interval is 3% shorter than the 95% Bayesian interval. Thus, the Bayesian seems to make some use of the unintentional information in his uniform prior, but not full use of it.

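The figures quoted in this example can be checked with a short script (an editorial illustration, not part of the paper): the Beta(21, 11) posterior on the Bayesian side, and the combined binomial log-likelihood with the −1.92 cut-off on the Fisherian side.

```python
import numpy as np
from scipy.stats import beta

# Data: 2 of 3 and 18 of 27 pregnant females; pooled sample 20 of 30.
n, x = 30, 20
p = np.linspace(1e-4, 1 - 1e-4, 100_000)

# Bayesian side: a uniform prior gives the Beta(x + 1, n - x + 1) posterior.
post = beta.pdf(p, x + 1, n - x + 1)
p_mode = p[np.argmax(post)]                 # posterior mode, about 0.667

# Fisherian side: combined log-likelihood ratio and 95% likelihood interval.
p_hat = x / n
loglik = x * np.log(p / p_hat) + (n - x) * np.log((1 - p) / (1 - p_hat))
inside = p[loglik >= -1.92]                 # contour at -0.5 * chi2_1(0.95)
print(p_mode, inside.min(), inside.max())   # about 0.667, 0.489 and 0.817
```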

4.2. Example 2: minke whale assessment

In the catch limit algorithm of the International Whaling Commission (1993), a flat prior density over the interval (0, 0.05) was assumed for the parameter MSYR representing the reproductive potential of the stock. Note that this was a technical device in the black box of the algorithm, and not necessarily intended as a Bayesian prior interpreted as a belief function. For the sake of the argument, we will interpret this prior density as applying to the MSYR of the 1+ stock in the minke whale population, interpreted as a belief function. Except for the constraint 0 ≤ MSYR ≤ 0.05, this flat prior is usually interpreted as non-informative.

The standard interpretation of Bayesian priors and posteriors is that they associate probabilities to intervals. When these intervals are symmetric, with equally much probability lost at both ends, I will call these intervals confidence intervals. Taking the uniform distribution over the interval (0, 0.05) as representing such confidence intervals centered at the median 0.025, one has the confidence distribution

$$C(\mathrm{MSYR}) = \frac{\mathrm{MSYR}}{0.05}, \qquad 0 \le \mathrm{MSYR} \le 0.05.$$

Fig. 2. Indirect log-likelihood representing a uniform prior distribution over (0, 0.05).

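A small sketch (added here, not in the original) of the quantity displayed in Fig. 2: the uniform confidence distribution C(MSYR) = MSYR/0.05 inserted into Eq. (3.1).

```python
import numpy as np
from scipy.stats import norm

msyr = np.linspace(0.001, 0.049, 49)
conf = msyr / 0.05                    # uniform confidence distribution on (0, 0.05)
l_ind = -0.5 * norm.ppf(conf) ** 2    # Eq. (3.1)
print(l_ind[[0, 24, 48]])
# The curve is 0 at MSYR = 0.025 and falls away towards both endpoints, so in
# likelihood terms the 'non-informative' uniform prior is far from flat.
```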
With this interpretation, the indirect log-likelihood calculated from Eq. (3.1) is as shown in Fig. 2, and can hardly be called non-informative.

To illustrate the use of indirect likelihood, the indirect likelihood of Fig. 2 is combined with the profile likelihood for MSYR found by Schweder, Hagen and Hatlebakk (unpublished note: STAT/11/97, Norwegian Computing Center). The profile likelihood, calculated by Eq. (3.2), was obtained from the likelihood constructed by putting together an estimated series of relative abundance of minke whales in the Barents Sea from 1952 to 1983, and two absolute abundance estimates. The two data sources and the official catch series were brought together in a deterministic Leslie-type model used by the IWC. There are two free parameters to be estimated in the Leslie-type model: the carrying capacity, K, and the maximum sustainable yield rate, MSYR. The resulting likelihood is thus a function of two parameters, while the present interest focuses on MSYR only. The likelihood function for this single parameter is obtained from the two-dimensional likelihood as the profile seen when viewing the likelihood surface from far away on the K-axis, perpendicular to the MSYR-axis.

4.2.1. Bayesian analysis

Fig. 3 shows the posterior density for MSYR, as calculated from Bayes' formula, using the above uniform prior on MSYR and a so-called non-informative improper prior on K, a flat prior on the half-interval (0, ∞). Normal probability densities have been fitted to the tails of the posterior density by eye. The fit is excellent. In the left tail, the standard deviation of the fitted normal is 0.0064, while in the right tail the standard deviation is 0.0056. The posterior distribution is thus slightly skewed, with a longer tail to the left, but with that tail cut at zero.


Fig. 3. Posterior density for MSYR. The lines represent fitted normal densities, one for each tail.

The result of the Bayesian analysis is a point estimate of 0.0154, found as the mode of the posterior density for MSYR, and a 95% probability interval of (0.0035, 0.0262). The interval is the standard Bayesian probability interval, i.e. the shortest among intervals of probability 0.95.

4.2.2. Fisherian analysis

Fig. 4 shows the profile log-likelihood for MSYR obtained from the indirect likelihood for MSYR in Fig. 2, and no indirect information on K other than the restriction K > 0. The figure also shows the profile log-likelihood based on the direct data alone (dotted curve). The direct log-likelihood is skewed, with more width to the left. The combined log-likelihood is, however, nearly symmetric, at least in the high-likelihood region, say within the 95% contour.

Fig. 4. Profile log-likelihoods for MSYR. The sharper curve uses the uniform confidence distribution as an informative indirect log-likelihood; in the dotted curve the indirect log-likelihood is non-informative.

The likelihood analysis is based on exactly the same data as the Bayesian analysis, and the combined likelihood yields a point estimate for MSYR of 0.0165, and a 95% confidence interval of (0.0052, 0.027). At the outset, there should not be any discriminating prior information regarding MSYR between 0 and 0.05. The sharper curve in Fig. 4 uses the indirect log-likelihood in Fig. 2. The information introduced via the uniform prior distribution is thus spurious, and the sharper curve overstates the information in the data. The 95% confidence interval based on the direct data alone, (0.0023, 0.027), is therefore the correct one. This interval is wider than both the Bayesian interval and the combined likelihood interval that uses the spurious information in the uniform prior.

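The mechanics of the combined analysis can be sketched in code. The actual direct profile log-likelihood comes from the unpublished Schweder, Hagen and Hatlebakk analysis and is not reproduced here; the skewed curve below is a purely hypothetical stand-in, used only to show how the indirect log-likelihood of Fig. 2 is added and how the 95% contour is read off.

```python
import numpy as np
from scipy.stats import norm

msyr = np.linspace(1e-4, 0.0499, 500)

# Hypothetical stand-in for the direct profile log-likelihood of MSYR
# (skewed, with a longer tail to the left); NOT the curve estimated in the paper.
l_dir = (norm.logpdf(msyr, loc=0.020, scale=0.007)
         + norm.logcdf(-3.0 * (msyr - 0.020) / 0.007))
l_dir -= l_dir.max()

# Indirect log-likelihood from the uniform confidence distribution, Eq. (3.1).
l_ind = -0.5 * norm.ppf(msyr / 0.05) ** 2

# Combined log-likelihood and 95% likelihood intervals (cut-off -1.92).
l_comb = l_dir + l_ind
l_comb -= l_comb.max()
for name, l in (("direct only", l_dir), ("direct + indirect", l_comb)):
    inside = msyr[l >= -1.92]
    print(name, msyr[np.argmax(l)], inside.min(), inside.max())
# With this stand-in the indirect component mainly pulls up the lower end of
# the interval, qualitatively as in Fig. 4.
```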

5. Discussion

It is tempting to compare likelihood inference with Bayesian inference. In the context of scientific inference in parametric statistical models, I find that the likelihood approach is superior for the following reasons.

• According to the likelihood principle, inference should be based on the likelihood function and on nothing else. In Bayesian analysis there is always a prior distribution that influences the inference. This prior is not part of the observed likelihood, and Bayesian inference therefore does not follow the likelihood principle in the strict sense. Viewing the prior distribution as a weight function chosen on more or less solid grounds, along with the various assumptions underlying the statistical model, some Bayesians (Berger and Wolpert, 1984) argue that the posterior distribution is a weighted version of the likelihood function, and that as such, Bayesian analysis is in compliance with the likelihood principle.

• Likelihood, not probability, is arguably the key concept in inference. In the Bayesian approach both parameters and data are treated on the same footing, namely as stochastic elements. The distinction between stochastic variability, described in probability terms, and uncertainties, described in likelihood or confidence terms, is thus blurred. Further, if the Bayesian posterior is merely regarded as a weighted version of the likelihood function, then its denomination should be likelihood and not probability. It would then be a bit difficult to see how the posterior from one study could serve as a prior weight function for the next, when order is kept in the units of measurement. Further, why should inference on a component of the parameter vector then be based on the marginal of the posterior-weighted likelihood?

• To avoid subjective elements in the analysis, the Bayesian will often use what he calls non-informative priors. The concept of non-informativeness is difficult. Even a uniform prior over an interval carries information over and above the implied restriction, often unintentionally. The information in the prior is treated in a different way than that in the data. From the examples, it seems that the prior information is given intermediate weight, between nil and that of a corresponding hypothetical piece of data. In Fisherian analysis, indirect data are given the correct weight, and when no indirect information is available, no 'non-informative' information needs to be included in the analysis.

• Without a prior distribution for all the free parameters in the model, the Bayesian is unable to carry through his analysis. When there is information available for some but not all of the parameters, the Bayesian has to choose some of his prior distributions on weak grounds. This problem does not arise with the likelihood method. Bayesians argue that the difficulty with non-informative priors is unimportant, since there will usually be some information available, possibly of a judgemental nature. If the Bayesian believes in his prior, he should also believe in the associated confidence distribution. And if this piece of information is handled on the same footing as other information, e.g. data, one is led to the likelihood approach.

• If there are more independent sources of indirect data than there are free parameters in a deterministic model used for synthesis, as was the case considered by Raftery et al. (1995), the Bayesian approach gets difficult (Bravington, 1996; Schweder and Hjort, 1996). A version of Bayesian synthesis that handles an over-determined prior distribution was presented by Raftery et al. (1997). The likelihood approach will, however, work efficiently, particularly when the various likelihood components are compatible. With gross incompatibility, the likelihood function becomes multi-modal and difficult to interpret. Instead of regarding this as a weakness of the likelihood approach, it might be used as a diagnostic for the compatibility of the various sources of information.

• When the Bayesian has worked out his joint posterior distribution, his inference for the individual parameters is obtained from the marginals of the joint posterior distribution. The Fisherian, on the other hand, will typically base his inference for individual parameters on profile likelihoods computed from the joint likelihood. With many parameters and little data, the profile likelihood is known to be problematic. On this important account, the Bayesian approach seems easier to understand and to use. However, it is not clear that the marginal posterior distribution is unbiased in the face of nuisance parameters and weak data.


• Likelihood analyses are often computationally simpler to carry out than Bayesian analyses. Much statistical software uses the likelihood method, while less such software is available for Bayesian analysis. The calculation of profile likelihoods involves extensive optimization. In smooth models, optimization by way of automatic differentiation (Griewank and Corliss, 1991) has been shown to be extremely efficient in some multi-parameter fisheries models.

Although Bayesian methods, in my view, are less appropriate in the context of scientific inference in parametric models, they certainly have a role to play in fisheries science. In fisheries management, the challenge is to take the best possible regulatory decisions in situations usually characterized by substantial uncertainty. The Bayesian approach can be eminently suited to such decision situations. The internal statistical model in a management procedure might, for example, be cast in Bayesian terms, as is the case for the catch limit algorithm of the International Whaling Commission (1993).

In the management context, a comprehensive picture of the dynamics of the ecosystem, and also of the human system including fisheries and their regulation, is needed. To require this comprehensive model to be properly framed in Bayesian terms is often to ask too much. Both data and understanding are usually much too weak to allow the likelihood function or the posterior distribution of the parameters of the model to be obtained. In such situations, where neither the Bayesian nor the Fisherian method applies to the system as a whole, the method of scenario experimentation provides a possible approach. Schweder et al. (1998) present an application of this method. Scenario experimentation is a distinctly engineering approach. But as far as possible, the assumptions underlying the scenario model and the experimental design should be based on science.

By augmenting the Fisherian approach with the concepts of confidence distributions and indirect likelihood, much of what makes the Bayesian method so attractive is now incorporated in the Fisherian method. Although the two approaches might not seem radically different to the practicing fishery scientist, the conceptual and theoretical superiority should, as I have argued, make the Fisherian approach attractive in the scientific context.

References

Anon., 1995. Precautionary approach to fisheries. FAO Fisheries Technical Paper 350/1.
Barndorff-Nielsen, O., Cox, D.R., 1989. Asymptotic Techniques for Use in Statistics. Chapman & Hall, London.
Berger, J.O., Wolpert, R.L., 1984. The Likelihood Principle. Institute of Mathematical Statistics Lecture Notes – Monograph Series, vol. 6.
Birnbaum, A., 1962. On the foundations of statistical inference (with discussion). J. Am. Statist. Assoc. 57, 269–306.
Bravington, M.V., 1996. An appraisal of Bayesian synthesis, with suggested modifications and diagnostics. Rep. Int. Whal. Commn. 46, 531–540.
Casella, G., Berger, R.L., 1990. Statistical Inference. Duxbury Press.
Edwards, A.W.F., 1992. Likelihood (Expanded edition). Johns Hopkins University Press, London.
Efron, B., 1986. Why isn't everyone a Bayesian? (with discussion). Am. Statist. 40, 1–11.
Efron, B., 1993. Bayes and likelihood calculations from confidence intervals. Biometrika 80, 3–26.
Efron, B., 1997. R.A. Fisher in the 21st century. Technical Report, Department of Statistics, Stanford University.
Fisher, R.A., 1973. Statistical Methods and Scientific Inference, 3rd ed. Hafner, New York.
Gelman, A., Carlin, J., Rubin, D., Stern, H., 1995. Bayesian Data Analysis. Chapman & Hall, London.
Gilks, W.R., Richardson, S., Spiegelhalter, D.J. (Eds.), 1995. Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
Griewank, A., Corliss, G.F. (Eds.), 1991. Automatic Differentiation of Algorithms: Theory, Implementation and Application. SIAM, Philadelphia.
Hedges, L.V., Olkin, I., 1985. Statistical Methods for Meta-Analysis. Academic Press, Orlando.
International Whaling Commission, 1993. Rep. Int. Whal. Commn. 43, Annex H.
Raftery, A.E., Givens, G.H., Zeh, J.E., 1995. Inference from a deterministic population dynamics model for bowhead whales (with discussion). J. Am. Statist. Assoc. 90, 402–430.
Raftery, A.E., Poole, D., Givens, G.H., 1997. On a Proposed Assessment Method for Bowhead Whale. Paper SC/49/AS6 presented to the Scientific Committee of the International Whaling Commission.

Royall, R., 1997. Statistical Evidence: A Likelihood Paradigm. Chapman & Hall, London.
Schweder, T., Hjort, N.L., 1996. Bayesian synthesis or likelihood synthesis – what does Borel's paradox say? Rep. Int. Whal. Commn. 46, 475–479.
Schweder, T., Hjort, N.L., 1997. Indirect and direct likelihood and their synthesis. Statistical Research Report no. 12, 1997. Department of Mathematics, University of Oslo.


Schweder, T., Hagen, G.S., Hatlebakk, E., 1998. On the effect on cod and herring fisheries of retuning the Revised Management Procedure for minke whaling in the Greater Barents Sea. Fish. Res. 37, 77–96.
Spiegelhalter, D.J., Freedman, L.S., Parmar, M.K.P., 1996. In: Berry, D.A., Stangl, D.K. (Eds.), Bayesian Biostatistics. Marcel Dekker, New York.