PROPENSITIES, STATISTICS AND INDUCTIVE LOGIC

I. HACKING

Cambridge University, Cambridge, England
This symposium dedicated to the memory of Rudolf Carnap is called 'Probability as a dispositional property'. That sounds like a solecism. Carnap taught us about probability as degree of confirmation, P1 for short. He wrote much less about P2, the disposition or propensity to produce stable relative frequencies in the long run. Yet the announced combination of Carnap and propensities suits me well, for I shall maintain that all degrees of confirmation, P1, depend on statements of propensity using P2. If this position is correct, Carnap's program for inductive logic cannot be completed. His c(h, e) will be defined only when e explicitly mentions propensities. Probabilistic propensities are the subject matter of objective statistics. The scope of inductive logic will be limited to the kinds of problems statisticians investigate. Degrees of confirmation will have to do with statistical constraints on reasonable prediction, explanation, and discrimination among hypotheses. Unfortunately these constraints do not always tally. I shall conclude this paper by arguing that a fundamental and hitherto unnoticed divergence between explanation and prediction means that Carnap's degree of confirmation splits in an unforeseen way.

1. Propensities

I can be unpolemical about those two debated concepts, propensity and degree of confirmation, for most connections between the two need not depend on the exact elucidation of each. I shall accept the main points just presented by Giere. The root idea of propensities satisfying probability axioms is shared by a vast majority of workers. Some write more about its application to single cases, others about how it is displayed in stable relative frequencies, but there is less real conflict between Popper and von Mises than Giere seems to imply. The 'single-case propensity theorist' has much to learn from both.
I dissent from Giere on one point of emphasis. He tends to suggest that we best comprehend probabilistic propensities when we conceive them in terms of microphysics. If that were right we should be in a bad way, for philosophers of quantum mechanics notoriously disagree about the nature of probabilities. Moreover we constantly use propensity models in every kind of gross affair: agriculture, meteorology, medicine. We need urn models of epidemics to discover whether we have an epidemic and should quarantine the neighbors. It is mere gossip to suggest we know some connection between quanta and infection. Even if current microphysics turned out all wrong and were replaced by a deterministic theory we should still use urn models. For the propensity theorist to rest his arguments on quantum theory is to retain an equivocal joker while handing all the aces to the Bayesian personalist. Remember how the specter of determinism drove the thoroughgoing objectivist, Jacques Bernoulli, to say probability is subjective. His modern successor who escapes determinism by muttering microphysics succumbs to the same error. When I say that Bernoulli was wrong to be driven to subjectivism I do not thereby mean that subjectivism is wrong. De Finetti has provided the only successful piece of philosophical reductionism outside pure mathematics: all the reasoning I will express in terms of propensities can be reduced, via his analysis, to personal probabilities and exchangeability. None of what follows in this paper is incompatible with that reduction. Even if A can be reduced to B (chairs to sense data, say) we may still learn more quickly by thinking in terms of unreduced A (chairs) rather than reduced B (sense data). There are deeper reasons for not going along with de Finetti's reduction but they have little to do with probability. I have in mind more general philosophical considerations to the effect that we cannot think of the structure of science in other than realistic terms.
I believe we cannot speculate about whether cancer is contagious without admitting such propensities as viral contagion into our ontology. But that is a general question about epistemology, and has nothing specific to do with probable inference.

2. Degrees of confirmation

'c(h, e) = p' means that p is the inductive probability, degree of confirmation or what you will, of the hypothesis h in the light of the evidence e. 'c(h, e) < c(i, d)' means that e confirms h less well than d confirms i. Such comparative statements might be asserted even if an exact degree of confirmation c(h, e) does not exist. Comparisons of that sort are called 'probabilistic' if they obey the partial-ordering axioms of Koopman, of which Suppes reminds us in his talk today. How to read the 'c'? Let us be content with the familiar betting-rate interpretation. Let 'c' be what Giere has just called a 'correct betting rate'. In opening this symposium Stegmüller has said that at the end Carnap used the word credibility. That is also the terminology of RUSSELL (1948, Ch. 6) and I adopt it here for this idea. Russell said that 'credibility', as I mean it, "is objective in the sense that it is the degree of credence that a rational man would give" [1948, p. 359]. It is better to say that credibilities are the betting rates that we, who are often irrational, ought to acknowledge when we want to think or argue reasonably. The probability axioms that guarantee coherent betting rates are one canon of rationality. If there are any pairs (h, e) for which there exists a numerical c(h, e), then the value of c(h, e) is a further constraint on the credence function of anyone who wants to think about h and e and who wants to be reasonable. If there are any quartets (h, e, i, d) such that c(h, e) < c(i, d) then you must, on pain of being unreasonable, agree to stiffer odds on i (conditional on d) than on h (conditional on e). This interpretation makes inductive logic compatible with the personalist theory. Credibilities give some constraint on a personal credence function, and the rest is subjective, bounded only by the demands of coherence, or so it may be argued. Now in the present climate of opinion such a betting-rate interpretation is mandatory, but I think it is superficial. Personalism is a doctrine that is idealist (in the philosopher's sense of the word). It aims at policing one's own opinions and decisions, and has only a derivative application to public discourse.
Inductive logic should be primarily about reasoning in discussion, and only secondarily about private betting rates. But what is public reasoning? At this juncture let me give the merest sketch of one point of view. Those who do not sympathize may fall back on the betting-rate interpretation given above. First notice that deductive logic settles nothing outside pure mathematics. The connection between p, p => q, and q is a pivot on which reasonable discussion turns. One man uses it to conclude q, another to reject p. A third who cannot tolerate q and yet is convinced of p and p => q goes away in a quandary; so long as he grants that his position is unclear, he is still being reasonable. Inductive logic also has to do with pivots of rational discussion. For example, if c(h, e) is very high, one man will adduce e in support of h.
No more than modus ponens does this compel you to accept h, but it compels you to say something, on pain of being unreasonable. You might grant that h is a plausible hypothesis. Or introduce further data d, which you claim to be relevant to h, diminishing the effect of e. Or call e itself in question. And there are more complex gambits. What you cannot do, on pain of being unreasonable, is to grant e, to grant that c(h, e) is very high, and yet simply deny that there is good reason for h. You are not compelled to accept h, but if you want to stay in the discussion you have to say something more if you want to persist in rejecting h. In that sense, credibilities and inductive logic augment the pivots of rational discussion already given by deductive logic. A 'principle of total evidence' plays a great role in the application of Carnap's logic. It plays none in mine. The principle comes to this: in contemplating h, the c(h, e) that matters is that in which e = e_t, the total available evidence. But of course in general we cannot state e_t, and even if we could it is unlikely that c(h, e_t) would ever be defined. Carnap's principle may lead one on to a 'fallacy of total evidence'. One may judge that a credibility is of no use unless it is defined relative to the total evidence available. That is a mistake: a credibility c(h, e) may still be a valuable pivot of reasonable argument even when e falls short of the total available evidence, and even when e includes propositions that are not known for certain. A familiar example is when one reasons "c(h, e) is high; none of the information in addition to e is relevant, so we have good reason for believing h". Traditional inductive logic would say that since additional information i is irrelevant, c(h, ei) = c(h, e). But in my view judgment of irrelevance may be a far looser matter of discourse than any formal equality. We may judge that i is irrelevant even when c(h, ei) is undefined.
Hence we might use c(h, e) in discussion even though e is not all the evidence. We shall see later that the fallacy of total evidence underlies many an unsound criticism of confidence theory in statistics, including my own.

3. Are there any degrees of credibility?
Carnap's degree of confirmation was modeled on Keynes' 'probability relation' which, thought Keynes, sensible men can intuit. RAMSEY (1931, p. 161) acidly remarked that the fundamental objection against Keynes' whole theory (and hence that of Carnap) is that there are no probability relations between hypothesis and evidence to be intuited.
It is curious how papers sympathetic to inductive logic ignore Ramsey's fundamental challenge. The first obligation of the realist who does not like Berkeley or phenomenalism is to kick a stone or hold up a hand and insist that whatever be the external world, at least there is one. Inductive logicians have not honored this first obligation, of providing a prima facie case that there are any interpersonal relations of credibility that satisfy quantitative or even comparative probability axioms. Ramsey's remark is pressing when one recalls that Carnap's program takes for granted that if e states that a lot more P's than Q's have been observed, then lacking other data it is more probable that the next thing to be observed will be P and not Q. I know of no reason for that remarkable opinion, although so far as I know it has been questioned only obliquely, e.g., CARGILE (1966). Of course if we have a random sampling device of known or assumed propensities that delivers us the P's and Q's then we can draw the Carnapian conclusion, but that is quite another matter. In general our inductive experience is neither simple random sampling (as required for Carnap's early systems) nor stratified sampling (as required for Hintikka's more sophisticated versions). Yet Ramsey is not quite right. There are some quartets (h, e, i, d) such that d does confer more credibility on i than e confers on h. Almost any piece of familiar statistical reasoning would show this. It is instructive to make the point using actual detailed examples, but for reasons of space I shall not do so now. One could take, for example, the question of whether an unusual tropical disease is contagious. Using a model M of infection, one can exhibit observations e such that the null hypothesis of no contagion is far less credible, relative to eM, than the hypothesis that the disease is contagious. The point is trivial. Statisticians present such analyses all the time.
I mention them here chiefly to recall that, contrary to Ramsey, there are at least some objective comparisons of credibility that are commonly recognized. Leibniz, who first conceived of global inductive logic, did so in part because he was prepared, for metaphysical reasons, to ascribe physical propensities to all state descriptions. I believe that unless one does this one cannot go along with Carnap's program for global inductive logic. I have heard it suggested that if inductive logic were the analysis of all learning from experience, then we would have to own up to an implicit ascription of propensities to different possible worlds. I agree. Rather than own up to such a metaphysics, I conclude that inductive logic is not the analysis of all learning from experience.
Despite my skepticism about global inductive logic in the manner of Leibniz or Carnap, I have no doubt, from examination of particular statistical examples, that there are objective probabilistic comparisons of credibility of some hypotheses on some data. The data, I claim, must always include some assumptions about propensities. But not all statisticians accept even my modest claim; quite the contrary. Both main 'philosophical' schools on the foundations of statistics disagree, although they do so for opposite reasons. Savage's personalist theory claims there is only a semblance of objectivity which results from the way in which the likelihood function operates on purely personal prior probabilities. There are no genuinely objective 'credibilities' of the sort I envisage as part of a truncated local inductive logic. But at least the personalist school admits that there is evidence of an inconclusive sort that bears on hypotheses of interest. Neyman's school, followed strictly, maintains that there is no such thing as inconclusive evidence for hypotheses. We can only make decisions about hypotheses, following some pattern of decision-making with desirable characteristics. When we decide for an hypothesis, we do not do so because the evidence makes it credible. Yet in contrast to the personalist, Neyman would insist that there are objectively correct decisions relative to some probability model and some loss function. In what follows I shall often allude to these polar opposites. If we caricature Neyman as objectivity without evidence, and Savage as evidence without objectivity, then we aim at the middle ground, objective evaluation of evidence. Perhaps the best prima facie case that this is possible is simply that most working statisticians believe in it.

4. A correct betting rate

Giere reminds us of one correct betting rate or credibility.
Relative just to the information that there is a propensity p for event x to occur, the correct rate for betting on x is surely p. This ancient principle is most usefully expressed in a more general setting. A probability model M is a set of mutually exclusive statistical hypotheses purporting to describe some setup. A statistical hypothesis is a statement of the distribution of tendencies for different possible outcomes to occur. So a probability model is a model of the propensities possibly inherent in some setup. For brevity it is convenient to use M to stand also for the proposition that one of the simple statistical hypotheses in the model is true. Let M be a probability model according to which the probability (propensity) of an event x to occur on
some specified trial is p. That is, P_M(x) = p. Let x also stand for the hypothesis that x occurs. Then assuredly, as Giere says, c(x, M) = p. Unfortunately Giere only postulates this indubitable connection between credibility and propensity. He does not justify it. He rightly rejects the long-run justification, that p is the correct betting rate because if you repeatedly bet at rate p you will probably break even. But what is the justification? In recent times only RUSSELL (1948, p. 399 f.) has tried at length, and with some success, to provide one. Long ago Christian Huygens also worried about this question. I believe one can usefully combine these studies with the device of equivalent gambles that Ramsey taught us. But here I show instead that Giere's seemingly trivial 'correct betting rate' is worth thinking about. It is the basis of the most viable idea in statistical analysis, namely, what BIRNBAUM (1969) calls the 'confidence concept'. This concept can be traced back to Jacques Bernoulli and is used in much work derived from Neyman and Pearson, although its rationale differs from anything urged by Neyman.

5. The confidence concept

In a parametric model of the form M = (X, Ω) with sample space X and parameter space Ω, an interval estimator f of θ ∈ Ω is a function from an observation x to subsets of Ω. It is well known that many models allow estimators with a constant 'confidence coefficient' α such that for all θ ∈ Ω, P_θ(θ ∈ f(x)) = α. This means that if one value of θ is correct, then regardless of which value it is there is a propensity α for f to give a correct estimate, i.e., the setup has a propensity α to give an observation x such that f(x) includes the true value θ. So we say that

P_M(f is correct) = α.    (1)
In other models we may only obtain estimators such that P_M(f is correct) ≥ α.
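Statement (1) can be put to a numerical check. The following is only an illustrative sketch, not anything in the text: it assumes a normal model with known standard deviation 1, a sample of 20, and the 0.99 level, and estimates by simulation the propensity of the estimator to be correct.

```python
import math
import random

Z995 = 2.576  # approximate 0.995 quantile of the standard normal

def interval(xs):
    """A 0.99 interval estimator f for a normal mean with known sd 1."""
    m = sum(xs) / len(xs)
    half = Z995 / math.sqrt(len(xs))
    return (m - half, m + half)

def coverage(theta, n=20, trials=20000, seed=0):
    """Empirical propensity of f to be correct when theta is true:
    the fraction of repeated samples whose interval covers theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        lo, hi = interval(xs)
        if lo <= theta <= hi:
            hits += 1
    return hits / trials

# The propensity is the same whichever theta is true: about 0.99.
print(coverage(0.0), coverage(3.0))
```

The point of the sketch is only that (1) describes a propensity of the procedure, fixed before any particular observation is made.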
Neyman evolved well-known criteria for choosing among estimators with a given confidence coefficient. Suppose we have selected an estimator f with an exact confidence coefficient 0.99. Then we experiment and observe outcome x. Is it reasonable to suppose that the unknown value of θ is in the interval f(x)? Before making any observation, it is a good bet that we will get an x
such that θ ∈ f(x). Indeed, by our correct betting-rate principle, we conclude from (1) that:

c(f is correct, M) = 0.99.    (2)

After making the observation, is it a good bet that we did get an x such that θ ∈ f(x)? If it is, surely we can use f(x) as an estimate of θ. But unfortunately we cannot usually infer from (2) any value of

c(f is correct, Mx).    (3)
In actual practice experimenters do regularly use confidence intervals to estimate unknown parameters. Moreover, they assume that the confidence interval summarizes their evidence about the unknown θ. I shall now discuss four possible responses to this practice. I conclude this section by arguing that response (d) is correct.

(a) The 'official' rationale. According to Neyman, an observation x does not provide evidence that θ ∈ f(x). However, it is still reasonable to use f(x) to estimate θ, for if one regularly used f in estimating θ one would nearly always utter correct estimates. But, as has often been pointed out, if one is not going to use f over and over again, this rationale is obscure. To quote from one of many sympathetic but critical discussions: PRATT (1961), referring to tests of the efficacy of a medical treatment, says, "The experimenter who is interested not in the method [i.e., not in f] but in the treatment and this particular confidence interval would get cold comfort" from Neyman's rationale; "the confidence interval theory dodges the problem of what information is available about the treatment." It is worth adding a reminder about the sheer invalidity of the inference, "f is usually right, therefore, it is sensible to use f for this particular body of data." For a counterexample, let f be a standard estimator of the mean in a normal distribution. Let g be identical to f except that for sample sizes over 1000, if the sample mean is exactly ½, the estimate of the true mean is 0. Now g, like f, is usually right, but of course in the rare event that we did get a large sample with mean ½, it would be absurd to use g to estimate the true mean. Hence the inference "usually right, so suitable now" is invalid.
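The counterexample can be made concrete. In this sketch (the known-variance normal model and the special value ½ are assumptions of the illustration, not fixed by the text) g differs from f only on an event of probability zero, so the two share their long-run operating characteristics; yet following g's exceptional clause on the very data that trigger it would be absurd.

```python
import math
import random

Z995 = 2.576  # approximate 0.995 quantile of the standard normal

def f(xs):
    """Standard 0.99 interval estimator of a normal mean (known sd 1)."""
    m = sum(xs) / len(xs)
    half = Z995 / math.sqrt(len(xs))
    return (m - half, m + half)

def g(xs):
    """Identical to f except for the pathological clause: for samples of
    size over 1000 whose mean is exactly 1/2, estimate the mean as 0."""
    if len(xs) > 1000 and sum(xs) / len(xs) == 0.5:
        return (0.0, 0.0)
    return f(xs)

# On ordinary data the clause never fires, so g is "usually right":
rng = random.Random(1)
for _ in range(500):
    xs = [rng.gauss(0.4, 1.0) for _ in range(1200)]
    assert f(xs) == g(xs)

# But on the rare triggering sample, g gives an absurd estimate:
trigger = [0.5] * 1200
print(g(trigger))  # (0.0, 0.0), however strongly the data point to 1/2
```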
(b) The 'fiducial' rationale. The practice of using confidence intervals is to be justified by establishing values for (3). But as I have just said, this is in general impossible. At best one can get values for (3) using Fisher's fiducial argument, but at best that argument is valid only for location and scale parameters. The practice of using confidence intervals is far more widespread.
(c) The defeatist response. The usual practice of confidence intervals is wrong. This is, of course, the conclusion of all Bayesians, who would accommodate confidence theory only insofar as it fits a theory based on prior distributions. The philosopher not already committed to such doctrines should be chary of this. The fact is that the vast majority of working statisticians do take confidence intervals as measures of experimental information. Their practice deserves philosophical explication, not denunciation. This is Birnbaum's attitude, and we will follow his terminology: "when a confidence region estimate is interpreted as representing statistical evidence about a parameter point of interest, an investigator or expositor has adjoined to an application of the formal theory his concept of statistical evidence, which we refer to as the confidence concept" [1969, p. 122]. What is the rationale of the confidence concept? I think it must lie in what I have been calling inductive logic.

(d) Inductive logic. The justification for the confidence concept is not (2) but (1). True, we do not have a value for (2) and we do have observation x among our data. But to suppose that we therefore cannot use (1) in default of (2) is to commit what in Section 3 I called the 'fallacy of total evidence'. That is the fallacy of thinking credibilities are useful only if you can get a credibility conditional on your total evidence. As remarked in that section, so long as we judge other evidence irrelevant (a judgment not necessarily formalized as a credibility), we can still reason using credibilities conditioned on only part of the total evidence. In particular, we should see (1) as what I called a pivot of rational discourse. If you do not want to conclude that θ is in f(x), you must say something. One possible thing to argue is that x itself is evidence that, on this occasion, f, though usually right, has gone wrong.
That is precisely how we argue in the trifling counterexample cited at the end of paragraph (a) above. But when we cannot give reasons for thinking that a particular x is relevant to the question of f being right, then the fact summarized in (1) does provide ground for using that confidence interval. The only published attempt to formalize such judgments of irrelevance is in connection with the fiducial argument mentioned in (b). I very much doubt that any other general formal criteria of irrelevance will be forthcoming. But that does not matter. The great virtue of the confidence technique is that it is a rather general way of providing one pivot of rational discourse in the evaluation of data. That, I think, is the rationale of the confidence concept.
6. Legal likelihoods

Some principles for comparing inductive support may not result simply from equating credibility and propensity. Leibniz was fond of inviting us to search the law courts for examples of objective nondeductive principles of inference. Following his advice, consider the use of fingerprints. A complete print tallying at every point with my finger is surely mine. Thieves are seldom so obliging as to leave a complete fingerprint. Only a few lines are imprinted. If print fragment z agrees with A's finger at 12 points (and disagrees at none) a court will conclude that A left z. Seven points are deemed good but not conclusive evidence, while two points prove nothing. What principles of inference are in use here? The court is confronted by two observations, first n: a print exhibiting n points was found, and second q: this print and A's finger tally at every point. It has two hypotheses: A: the accused left the print in question, and C: there is a mere coincidence and the print agrees with A's finger by merest chance. In addition to the observation nq, the court also has some background information about fingerprints which we hope may be encoded in some theoretical model M. Apparently the court agrees that, with n = 12,

c(A, nqM) ≫ c(C, nqM).    (4)
The inequality decreases with n until at n = 2 it is not sufficient for any conviction. We cannot obtain (4) as a consequence of Giere's principle of a correct betting rate, for there is no sensible model on the basis of which to define a propensity P_M(A, nq). There would be such a propensity if everyone in a population left an equal number of prints, and we extracted the class nq, namely, those prints with n points agreeing at every point with A's finger, and then asked for the relative frequency with which prints in this class would have been left by A. No court in its right mind would regularly adopt this model of how the prints on the scene of the crime came into its possession! It follows that (4) does not arise from any direct equation of propensities and credibilities. But there are relevant propensities in the offing. For example,

P_A(q, n) = 1 for all n.    (5)

This is the tendency of a print left by A to agree with A's finger. There is also a natural model C of coincidental agreement. We can consider
any large population in which prints are obtained from members selected at random. Our background lore of fingerprints tells us that
P_C(q, 12) ≈ 0 but P_C(q, 2) > 0.    (6)
Neither (5) nor (6) gives us any probability of A or C, but we do obtain what Fisher called likelihoods. The likelihood of a statistical hypothesis h relative to evidence e is P_h(e), the probability of getting e if h is true. The likelihood of h relative to e, given observation d, is P_h(e, d), the conditional probability of getting e when d occurs. Naturally likelihoods are not additive, for even when P_h(e) and P_i(e) are defined for hypotheses h and i, there need be no P_h∨i(e) and hence no likelihood of the compound hypothesis h ∨ i. The simplest statistical model of interest to a court is M = A ∨ C, that is, either A left the print, or any resemblance between A's print and that discovered by the detectives is purely coincidental. Apparently a court whose reasoning conforms to (4) also conforms to a general law of likelihood: if h and i are statistical hypotheses in the model M, and the likelihood of h on e exceeds that of i, then c(h, eM) > c(i, eM). Moreover the court accepts that the greater the difference in likelihood, the greater the difference in the credibilities of h and i. Naturally the model M must state the population from which the mysterious print might have come. On a desert island with only eight inhabitants P_C(q, 2) might be 0, while in a large population it will be significantly positive. But even if we agree to a large potential population, the statement (4) is (as always in inductive logic) only a pivot of rational discussion. Hopefully the court admits far more than nqM as evidence or doctrine, and still other facts will have to bear on the question of whether, for example, the print was planted to incriminate A.

7. The sufficiency concept

Unfortunately the role of likelihoods in reasoning is far more obscure than our cursory legal example would suggest. I no longer think that the unqualified 'law of likelihood' there employed will do.
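Before examining those qualifications, the arithmetic behind the legal example can be sketched. The per-point coincidental match probability below is an invented figure for illustration, not anything in the text; on that assumption P_C(q, n) falls geometrically with n while P_A(q, n) stays at 1, so the likelihood ratio of A to C tracks the court's grading of two, seven and twelve points.

```python
P_MATCH = 0.2  # assumed chance that one imprinted point agrees by coincidence

def likelihood_A(n):
    """P_A(q, n): a print left by A always agrees with A's finger."""
    return 1.0

def likelihood_C(n):
    """P_C(q, n): under the coincidence model C, all n imprinted points
    must agree with A's finger independently by mere chance."""
    return P_MATCH ** n

for n in (2, 7, 12):
    print(n, likelihood_A(n) / likelihood_C(n))
# the ratio grows from modest (n = 2) to overwhelming (n = 12)
```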
BIRNBAUM (1962) gave the most important objectivist considerations for the importance of likelihood, but BIRNBAUM (1969) is far more cautious. In the latter paper the author distinguishes several likelihood-oriented 'concepts of statistical evidence' that contrast with the confidence concept. The strongest of these he calls the likelihood concept. Today, instead of considering this
concept, which does seem implicit in my legal example, I wish to examine a weaker likelihood-oriented idea, which Birnbaum calls the sufficiency concept. I choose the weaker principle for even it leads to problems. A statistic t is sufficient for observations x ∈ X if for all θ ∈ Ω, P_θ(x) = P_θ(t(x))·g(x, t), where g is independent of θ. Hence sufficient statistics are those statistics that preserve likelihood ratios between different members of Ω for given evidence x. The sufficiency concept says that sufficient statistics convey as much information about the model as the complete observations. Perhaps we have always used sufficient statistics with some sort of instinct: from early times it has been realized that on Bernoulli trials one learns just as much from the proportion of heads as from the actual sequence of heads and tails. The mean, of course, is sufficient for the complete sequence. Fisher's original rationale for sufficiency seems to have been based on an idea about information. Sufficiency is a consequence of the likelihood concept, and it is also immediate in Bayesian philosophy. It is almost as immediate for Neyman's school: see LEHMANN (1959, p. 17). CARNAP included something to this effect in his desiderata for inductive logic (1963, p. 976, Axiom A14). Birnbaum's 'sufficiency concept of statistical evidence' is a statement of this central component of so many different theories on the foundations of statistics. In terms of credibility the sufficiency concept comes to this: if in the model M, t is sufficient for an observation x, then for any hypothesis h in M, c(h, Mx) = c(h, Mt(x)). It would be nice if we could combine the sufficiency and confidence concepts to start constructing a serious inductive logic, but we cannot. They are not compatible.

8. Incompatibility of sufficiency and confidence concepts

An example of Birnbaum's will do here. Suppose that in Bernoulli trials one desires a lower bound on the unknown chance of getting heads.
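The Bernoulli instance of sufficiency mentioned above admits a direct numerical check before we proceed. This sketch (sample size and parameter values are arbitrary choices for illustration) verifies that the likelihood ratio between two hypotheses θ1 and θ2 is the same whether computed from the full sequence x or from the head-count t(x), since the factor g(x, t) cancels.

```python
import math
import random

def p_sequence(theta, seq):
    """P_theta(x): probability of the exact heads/tails sequence."""
    k = sum(seq)
    return theta ** k * (1.0 - theta) ** (len(seq) - k)

def p_count(theta, k, n):
    """P_theta(t(x)): binomial probability of k heads in n trials."""
    return math.comb(n, k) * theta ** k * (1.0 - theta) ** (n - k)

rng = random.Random(7)
n = 30
seq = [rng.random() < 0.6 for _ in range(n)]  # one simulated run of trials
k = sum(seq)

ratio_seq = p_sequence(0.3, seq) / p_sequence(0.8, seq)
ratio_cnt = p_count(0.3, k, n) / p_count(0.8, k, n)

# The binomial coefficient (the factor g(x, t)) cancels in the ratio:
print(abs(ratio_seq - ratio_cnt) < 1e-9 * ratio_cnt)  # True
```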
According to the confidence concept such a bound is provided by an estimator f which is a function of the outcome x of a series of trials, and whose value is a point 0 ≤ f(x) < 1, and such that P(f is correct) ≥ 0.99. Among estimators that are a function of the observations alone, there is an optimal f, giving the highest possible lower bound on θ consistent with a confidence bound of 0.99. Formally we are not restricted to estimators that are a function of the data alone. As we observe x, let us pick a number n at random from a set
of N integers. Then we can consider estimators that are a function of (x, n). Among these there is an estimator g with the properties (i) P_MN(g is correct) ≥ 0.99, and (ii) for all (x, n), g(x, n) ≥ f(x), with strict inequality for some (x, n). This is a consequence, of course, of the discrete nature of x. Our g is a 0.99 confidence estimator that is more informative than f. It is more informative in the sense that it never offers a lower bound on θ below that of f, and sometimes offers a higher one. Yet in the model MN, x is sufficient for (x, n). Hence according to the sufficiency concept, f tells just as much about the unknown θ as g. But according to the confidence concept, g can tell us more than f. In response to this fact, the Bayesian may say, "so much the worse for confidence estimators". The student of Neyman may say, "so much the worse for sufficiency as a universal tool of statistics". Birnbaum argues that we must use the sufficiency concept to constrain the confidence concept. He says we should use error probabilities in inference, but only so long as the sufficiency concept is satisfied. I have a more radical proposal.

9. Explanation and prediction

Philosophers of science are familiar with an extensive literature on the question whether explanation and prediction are symmetric. If A explains B, would knowledge of A have enabled one to predict B? I shall be neutral about the grand dispute. At first sight statistical explanation and prediction are symmetric. Given θ, the greater the value of P_θ(x), the better it is to predict that x will occur. Conversely, given x, the greater the value of P_θ(x), the greater the merit of θ as an explanation of the occurrence of x. So one would expect that inferences aiming at explaining statistical phenomena will be in harmony with inferences predicting statistical phenomena. The confidence concept is plainly founded on the predictive aspect.
Just because there is a great propensity for an estimator f to give the correct answer, it is credible that f will be right. After making an observation x that seems irrelevant to the question whether f is right, we find it credible that θ ≥ f(x). The likelihood concept and less powerful ideas like that of sufficiency seem equally to be founded on the explanatory aspect. We are to discriminate among θ contained in some model. The observed phenomenon is x. But if t is sufficient for x, it does not matter whether we discriminate among the θ on the basis of x or of t(x), because for any θ1 and θ2, Pθ1(x)/Pθ2(x) = Pθ1(t(x))/Pθ2(t(x)). That is, comparing the θ for power to explain x is the same as comparing the θ for power to explain t(x). The confidence concept relies on the predictive aspect, while the sufficiency concept relies on the explanatory aspect.

I propose, then, that the incompatibility of the sufficiency and confidence concepts is no passing phase in the evolution of statistical ideas. On the contrary it brings to light a deep divergence between two of our most important bases of reasoning, explanation and prediction. Sufficiency is a concept concerned with inferring to the best explanation. Confidence is a concept concerned with inferring to the best prediction. That in practice the concepts often coincide reminds us of the familiar fact that explanation and prediction are very often symmetric. But they do not always coincide. I suggest that in consequence we must abandon Carnap's idea of a single explication of probability, or credibility. I began this paper by saying that we should study c(h, e) only for selected pairs (h, e). That was a restriction of Carnap's global program to a local one. But at least I took for granted that there was a single qualitative relation of credibility which might sometimes be made quantitative. This is no longer tenable. There are different relations of credibility. Propositions may be made credible for different kinds of reasons. One kind of reason is predictive. Another is explanatory.

It may sound as if I am wantonly advocating chaos in confirmation theory. On the contrary, I believe this approach will remove some chaos. It will, I believe, resolve some statistical antinomies, of which the most famous is the problem of optional stopping. But instead of developing that lengthy theme, I conclude by applying my speculations to the problem with which Savage concludes his address to this Congress.
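The sufficiency identity invoked in the preceding section, Pθ1(x)/Pθ2(x) = Pθ1(t(x))/Pθ2(t(x)), can be checked directly by brute force. In this sketch (my own illustration) x is a sequence of four Bernoulli trials, t(x) is the number of successes, and θ1, θ2 are two arbitrarily chosen rival values of θ: all sequences sharing the same t(x) yield an identical likelihood ratio, so discriminating by x and by t(x) come to the same thing.

```python
from itertools import product

def p(x, theta):
    """P_theta of an exact Bernoulli sequence x (a tuple of 0s and 1s)."""
    t = sum(x)
    return theta**t * (1 - theta)**(len(x) - t)

theta1, theta2 = 0.3, 0.7  # two rival parameter values (arbitrary choice)

# Group the likelihood ratio P_theta1(x)/P_theta2(x) by the sufficient
# statistic t(x) = sum(x): within each group the ratio is constant.
ratios = {}
for x in product((0, 1), repeat=4):
    ratios.setdefault(sum(x), set()).add(p(x, theta1) / p(x, theta2))

assert all(len(group) == 1 for group in ratios.values())
print(sorted((t, min(group)) for t, group in ratios.items()))
```

The assertion holds because the likelihood of a Bernoulli sequence depends on the sequence only through its success count, which is just what sufficiency of t asserts.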
10. Testing unrivaled hypotheses
In eighteenth-century disquisitions on the probability of causes, one asked whether some astronomical fact (say the alignment of the planets) is mere chance or the result of some cause. Even today some of these considerations are very powerful. After recomputing probabilities for the old problems, one is forced to conclude that the alignment of the planets betrays such peculiar features that it cannot have arisen by chance. That is, one rejects the null hypothesis that the dispersion is purely random. Savage says that on his own theory such a rejection is hard to explain. For his theory requires a prior distribution of belief over all possible hypotheses. It is hard to descry any specific rival hypothesis to the hypothesis
of random distribution. Not merely are we unable to attach any definite credence to the rivals, we cannot even think of any rivals. So with the best Bayesian will in the world one cannot give a Bayesian analysis of the intuitively acceptable arguments of Daniel Bernoulli or Laplace.

Most problems bad for the Bayesian are good for the student of Neyman, and vice versa. But rejection of unrivaled hypotheses is just as odd for the Neyman analysis. We can certainly compute tests with only a 0.01 chance of mistakenly rejecting the null hypothesis. Unfortunately, there are too many; indeed for any outcome there exists some 0.01 rejection region which includes that outcome. Neyman and Pearson overcame this difficulty by selecting that test which has the highest power against rival hypotheses. This solution will not do when there are no rivals to test against. Statistical orthodoxy of different schools has become so entrenched that many now maintain it is impossible to test when rival hypotheses are lacking. Hence many curious things are said about such a valuable device as the χ² test. Moreover, everyone knows how to test in the case of straightforward distributions. When the distribution is Gaussian, for example, we prefer tests whose rejection region lies in the tail area. When there are rival hypotheses, this preference can be explained, but what if there are none?

Fisher, indifferent to theoretical objections, used tail area rejection tests with equanimity. But he never offered a satisfying reason for this natural-seeming practice. One obvious feature of tail areas is that outcomes in the tail areas confer the lowest likelihood on the hypothesis under test. Very recently GILLIES (1971) has said that this is the reason for using tail area tests among tests with a given significance level. But why is this a reason at all? I wish to argue that it is a reason.
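Gillies' observation can be illustrated numerically. The following sketch is my own construction, not an example from the text: the null hypothesis is 100 tosses of a fair coin, and a rejection region is built greedily from the outcomes to which the null assigns lowest likelihood, subject to a total null probability of at most 0.01. The region that results is exactly a two-tailed set.

```python
from math import comb

n, theta0 = 100, 0.5  # illustrative null: 100 tosses of a fair coin
pmf = [comb(n, x) * theta0**x * (1 - theta0)**(n - x) for x in range(n + 1)]

# Build a rejection region greedily from the outcomes the null explains
# worst, i.e. those of lowest likelihood, keeping total null
# probability at most 0.01.
order = sorted(range(n + 1), key=lambda x: pmf[x])
region, mass = [], 0.0
for x in order:
    if mass + pmf[x] > 0.01:
        break
    region.append(x)
    mass += pmf[x]
region.sort()

# The lowest-likelihood region is exactly a two-tailed set: x <= a or x >= b.
a = max(x for x in region if x < n // 2)
b = min(x for x in region if x > n // 2)
assert region == list(range(0, a + 1)) + list(range(b, n + 1))
print(a, b, round(mass, 4))
```

Because this unimodal pmf decreases monotonically toward both extremes, selecting outcomes of lowest likelihood necessarily fills the two tails inward, which is why the lowest-likelihood criterion and the tail-area practice coincide here without any rival hypothesis being mentioned.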
If we think of likelihoods and significance or confidence levels as being in the same line of business we shall never explain our preference for tail area tests. Hence the Neyman school, which deals in confidence levels, and the Bayesian school, which deals in likelihoods, are both unable to explain tail area tests without rival hypotheses. But if there are two distinct bases for comparing credibilities relative to statistical data, tail area tests are readily understood. Our preference for a good level of significance, for a low probability of mistaken rejection, is a preference for good predictive characteristics. But we also find an hypothesis better the greater its explanatory power: we should not reject h in the light of e, while retaining it in the light of d, if h has more likelihood relative to e than relative to d. Hence we wish a test for which there is a high 'predictive-credibility' that will not mistakenly
reject true hypotheses, and we also choose a test such that the null hypothesis is least 'explanation-credible' on the possible observations in the rejection region. Such a test is a tail area test, even when there are no rival hypotheses. The classic work of Daniel Bernoulli and Laplace, or indeed Karl Pearson, cannot be explicated by reference to explanation alone or to prediction alone. Their work must be seen as a deliberate synthesis of both bases of credibility studied in this paper.

References

BIRNBAUM, A., 1962, On the foundations of statistical inference, Journal of the American Statistical Association, vol. 57, pp. 269-326
BIRNBAUM, A., 1969, Concepts of statistical inference, in: Philosophy, Science and Method: Essays in Honor of Ernest Nagel, eds. S. Morgenbesser, P. Suppes and M. White (St Martin's Press, New York), pp. 112-143
CARGILE, J., 1966, On having reasons, Analysis, vol. 26, pp. 189-192
CARNAP, R., 1963, The philosophy of Rudolf Carnap, ed. P. A. Schilpp (Open Court, La Salle)
GILLIES, D. A., 1971, A falsifying rule for probability statements, British Journal for the Philosophy of Science, vol. 22, pp. 231-261
LEHMANN, E. L., 1959, Testing statistical hypotheses (Wiley, New York)
PRATT, J. W., 1961, Review of Lehmann 1959, Journal of the American Statistical Association, vol. 56, pp. 163-166
RAMSEY, F. P., 1931, Truth and probability, in: The Foundations of Mathematics, ed. R. B. Braithwaite (Routledge and Kegan Paul, London), pp. 156-198
RUSSELL, B., 1948, Human knowledge, its scope and limits (Allen and Unwin, London)