Statistics & Probability Letters 29 (1996) 297-305
Comparing two populations based on low stochastic structure assumptions

F.P.A. Coolen

University of Durham, Department of Mathematical Sciences, Science Laboratories, South Road, Durham, DH1 3LE, UK

Received July 1995; revised September 1995
Abstract
This paper presents a method to compare two populations without assuming conditional independence for the random quantities in each population. The approach is based on finite exchangeability assumptions per population and is predictive by nature. We present comparison based on imprecise previsions for future observations as well as comparison based on predictive imprecise probabilities. The method is inductive, and can be used when no relevant knowledge about the random quantities is available other than the information provided by the data, or when one explicitly does not want to use such knowledge.
Keywords: Exchangeability; Imprecise previsions; Imprecise probabilities; Low stochastic structure; Predictive inference
1. Introduction
In this paper we consider an elementary problem in statistics: comparison of real-valued random quantities corresponding to two populations (we prefer the word quantity to variable, following De Finetti (1974)). An often used approach is to test equality of parameters of assumed parametric models, or, in non-parametric approaches, to use the ranks of the observations (e.g. Wilcoxon's test). These methods are presented in many standard textbooks on statistics, e.g. Lindgren (1993) or Box et al. (1978). The usually applied methods assume that the random quantities per population are conditionally independent and identically distributed (ciid). This assumption is hard to justify in practice; in fact, attention is hardly ever paid to it. It may well be that there are relevant covariates that differ between members of the same population, and therefore affect the random quantity of interest, but whose values we do not know. De Finetti (1974, Chapter 11) makes it perfectly clear that the weaker assumption of (finite) exchangeability is the natural assumption from which to start many statistical analyses. Exchangeability can be assumed even if the populations are non-homogeneous, as long as we simply do not know any relevant characteristics of the individuals other than the random quantity of interest. Since many populations in problems of applied statistics are non-homogeneous, the ciid assumption may often be too strong. In this paper we assume exchangeability, which leads to a low stochastic structure situation as described by Geisser (1993, Section 2.1.2) and to predictive inferences for a future observation of a random quantity of interest. Geisser (1993) does not pay further attention to a predictive non-Bayesian approach related to the exchangeability assumption. Hill (1968, 1988, 1993) discusses the most important
aspects of the inferences related to the low stochastic structure case, making it clear how it can be used as a fully inductive method based on statistical data alone, and Hill (1968, 1988, 1993) also refers to other related or similar ideas presented in the literature. Hill (1968, 1988) tries to remain within a fully Bayesian framework, which is achieved in Hill (1993) when he shows that his inferences are related to a finitely additive prior, or splitting process. Our approach can be regarded as frequentist by nature, but we discuss its logical use in a subjective approach if further information about the random quantities is missing, or if one explicitly does not want to use any other information.

The assumption of exchangeability is not strong enough to arrive at precise probability statements. However, we are able to derive bounds for probabilities for a future event, which can be regarded as an application of De Finetti's fundamental theorem of probability (1974, Section 3.10). The concepts that we use to deal with such bounds are imprecise previsions and imprecise probabilities, see Walley (1991). In short, suppose that we are interested in an uncertain quantity $A$. In a subjective framework (Walley, 1991) that is a generalization of De Finetti's (1974) theory, your lower prevision $\underline{E}(A)$ for $A$ is the supremum of all 'prices' you are willing to pay to get the uncertain quantity $A$, and your upper prevision $\overline{E}(A)$ for $A$ is the infimum of all 'prices' for which you are willing to sell $A$ (some unit of linear utility is needed for the prices, see Walley (1991, Section 2.2)). If one is not familiar with these concepts, $\underline{E}(A)$ and $\overline{E}(A)$ can be considered as lower and upper bounds for the expected value of $A$. Imprecise probabilities are simply imprecise previsions for events, so with $A$ an indicator function that is 1 if the event occurs and 0 otherwise. Because we use both imprecise previsions and imprecise probabilities in this paper, we denote a lower probability for $A$ by $\underline{P}(A)$ and an upper probability for $A$ by $\overline{P}(A)$.

We compare two populations, say $X$ and $Y$, by making predictive inferences for the next observations per population, $X_{n+1}$ and $Y_{m+1}$ respectively. A natural way to model preference for $X_{n+1}$ over $Y_{m+1}$ is by $\underline{E}(X_{n+1} - Y_{m+1}) > 0$, which means that we would want to get $X_{n+1} - Y_{m+1}$ for free, or perhaps even for a positive price. Another way to model such preference is by $\underline{E}(X_{n+1}) > \overline{E}(Y_{m+1})$, which implies $\underline{E}(X_{n+1} - Y_{m+1}) > 0$ but is stronger than that. This second way of modelling preference has the advantage that we can analyse the populations on their own, which, for example, makes it easier to handle different sample sizes per population.

Section 2 introduces the predictive inference results based on low stochastic structure assumptions on which our methods are based. In Sections 3 and 4 we develop the methods to compare two populations, using imprecise previsions and imprecise probabilities respectively. In Section 5 we present two small examples to illustrate our approach.
2. Predictive inference based on low stochastic structure assumptions
We start with some well-known results on predictive inference for a low structure case, see Geisser (1993, Section 2.1.2). Suppose that $T_1, T_2, \ldots, T_n, T_{n+1}$ are real-valued, absolutely continuous and exchangeable random quantities. Let $Z_1 < Z_2 < \cdots < Z_n$ be the order statistics for the sample $T_1, T_2, \ldots, T_n$, and define $Z_0 = -\infty$, $Z_{n+1} = \infty$. Then

$$P(Z_j < T_{n+1} < Z_{j+1}) = \frac{1}{n+1}$$

and

$$P(Z_j < T_{n+1} < Z_{j+k}) = \frac{k}{n+1},$$

for $j = 0, \ldots, n$ and $k = 1, \ldots, n-j+1$. This is the probability that a random quantity lies in a random interval. Denoting observed values of random quantities by their corresponding lower cases, $(z_j, z_{j+k})$ is a $k/(n+1)$-confidence interval for the future $T_{n+1}$. This can be extended to $m$ future observations, where we have to be careful since we only assumed finite exchangeability for the past and future observations, so we cannot treat them as observations from a binomially distributed population.
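As a quick numerical illustration (ours, not the paper's), the following Python sketch estimates the interval probabilities by simulation. It draws iid samples, a special case of exchangeability; the choice of a normal distribution is an arbitrary assumption.

```python
# Monte Carlo illustration (ours, not the paper's) of
# P(Z_j < T_{n+1} < Z_{j+1}) = 1/(n+1) for exchangeable continuous quantities.
import random

def interval_frequencies(n, trials=100_000):
    counts = [0] * (n + 1)  # interval j is (Z_j, Z_{j+1}), Z_0 = -inf, Z_{n+1} = inf
    for _ in range(trials):
        t = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
        sample, future = t[:n], t[n]
        j = sum(z < future for z in sample)  # index of the interval containing T_{n+1}
        counts[j] += 1
    return [c / trials for c in counts]

print(interval_frequencies(4))  # each entry should be close to 1/5 = 0.2
```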
We can relate the following betting interpretation to the $k/(n+1)$-confidence interval above. Before observing the $T_j$, $j = 1, \ldots, n+1$, you can consider the bet that pays one (unit of linear utility, so to say) if $z_j < t_{n+1} < z_{j+k}$ and zero otherwise. The value (or fair betting price) of this bet is $k/(n+1)$, meaning that if you either buy or sell this bet for this price, your expected profit is zero. This is a frequentist-like setting, since the bet is set before actual observation of the $T_j$, exploring the idea of repeating this strategy over and over again, infinitely, for different trials; in this sense it is completely similar to the usual confidence interval interpretation. Therefore, this method also inherits the disadvantages of confidence interval theory, which are often emphasized by Bayesians and are mostly based on the idea that after observing $t_1, \ldots, t_n$ you may perhaps not feel happy about the betting deal that you set before the observations. However, to pay more attention to this 'not feeling happy' one would need more knowledge about the processes that determine the observations than the finite exchangeability assumption and the observed values only. Our frequentist-like setting seems to be generally applicable without more assumptions.

Hill (1968, 1988) discusses this low structure aspect in detail, illuminating its logic for situations where very little is known about the distributions of interest, but Hill (1968, 1988) tries to remain within the Bayesian framework by assuming a posterior distribution for the data not related to a prior, which seems to be in conflict with the generally accepted Bayesian concept of explicit specification of a prior instead of a posterior. In a recent paper Hill (1993) shows that his posterior is related to a finitely additive prior, or splitting process, and it may be an interesting topic for future research whether a prior, or a class of prior distributions related to imprecise inferences (Walley, 1991), can be found for which our inferences are related to the posteriors in the Bayesian context. In the Bayesian framework no argument for a particular model is given related to data; such an argument would be inductive and completely based on the data. In the situation that we do not have any additional information (or explicitly do not want to use any), the method suggested here is inductive, which strongly justifies adoption of predictive probabilities as used in this paper. The predictive probabilities that we use can also be regarded as direct probabilities as discussed by Dempster (1963). With regard to our exchangeability assumption we must be careful; for example, there may be patterns in the observations suggesting that the random quantities are not exchangeable.
3. Comparison based on imprecise previsions

In Section 1 we introduced imprecise previsions as a concept to model preferences. In this section we compare two populations $X$ and $Y$. We are interested in $n+1$ members of population $X$, denoting the random quantities of interest by $X_i$, $i = 1, \ldots, n+1$, and $m+1$ members of population $Y$, denoting the random quantities of interest by $Y_j$, $j = 1, \ldots, m+1$. For the method presented in this section we must restrict the values that the quantities can take to $l_x < X_i < r_x$, $i = 1, \ldots, n+1$, and $l_y < Y_j < r_y$, $j = 1, \ldots, m+1$, with real-valued $l_x, r_x, l_y, r_y$ assumed to be known. Combining this with the theory presented in Section 2 leads to similar predictive probability statements for $X_{n+1}$ and $Y_{m+1}$ based on observations $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ respectively. For ease of notation, and without loss of generality, we assume that $x_1 < x_2 < \cdots < x_n$ and $y_1 < y_2 < \cdots < y_m$. The low stochastic structure assumptions lead to predictive probabilities

$$P(l_x < X_{n+1} < x_1) = P(x_i < X_{n+1} < x_{i+1}) = P(x_n < X_{n+1} < r_x) = \frac{1}{n+1}, \quad \text{for } i = 1, \ldots, n-1,$$

and

$$P(l_y < Y_{m+1} < y_1) = P(y_j < Y_{m+1} < y_{j+1}) = P(y_m < Y_{m+1} < r_y) = \frac{1}{m+1}, \quad \text{for } j = 1, \ldots, m-1.$$

Using these predictive probabilities, the lower prevision for $X_{n+1}$ is easily derived by putting the mass $1/(n+1)$ as far as possible to the left in each interval, leading to

$$\underline{E}(X_{n+1}) = \frac{1}{n+1} \left( l_x + \sum_{i=1}^{n} x_i \right).$$
Analogously, the upper prevision is derived by putting the mass to the extreme right per interval,

$$\overline{E}(X_{n+1}) = \frac{1}{n+1} \left( r_x + \sum_{i=1}^{n} x_i \right).$$

Similarly, we get

$$\underline{E}(Y_{m+1}) = \frac{1}{m+1} \left( l_y + \sum_{j=1}^{m} y_j \right) \quad \text{and} \quad \overline{E}(Y_{m+1}) = \frac{1}{m+1} \left( r_y + \sum_{j=1}^{m} y_j \right).$$
As discussed in the introduction, we say that strong preference for $X_{n+1}$ when compared to $Y_{m+1}$ can be modelled by $\underline{E}(X_{n+1}) > \overline{E}(Y_{m+1})$, which leads to a sufficient condition for strong preference for $X_{n+1}$ over $Y_{m+1}$ given by

$$\frac{1}{n+1} \left( l_x + \sum_{i=1}^{n} x_i \right) > \frac{1}{m+1} \left( r_y + \sum_{j=1}^{m} y_j \right).$$

Analogously, a sufficient condition for strong preference for $Y_{m+1}$ over $X_{n+1}$ is given by

$$\frac{1}{m+1} \left( l_y + \sum_{j=1}^{m} y_j \right) > \frac{1}{n+1} \left( r_x + \sum_{i=1}^{n} x_i \right).$$
For all other situations we do not explicitly say which of the two next observations is preferred. If $m = n$ these sufficient conditions become $\sum_{i=1}^{n} x_i - \sum_{j=1}^{n} y_j > r_y - l_x$ and $\sum_{j=1}^{n} y_j - \sum_{i=1}^{n} x_i > r_x - l_y$, respectively, leaving the values $l_y - r_x \leq \sum_{i=1}^{n} x_i - \sum_{j=1}^{n} y_j \leq r_y - l_x$ for which we do not explicitly state preference for either group.

It is easy to see why we needed the restriction to bounded real-valued random quantities in this section. If $l_x = -\infty$, the lower prevision for $X_{n+1}$ would also be $-\infty$, while for $r_x = \infty$ the upper prevision would be $\infty$. Without these bounds we would never be able to express strong preference using these imprecise previsions. However, it seems that for practical applications one is always able to state bounds for the observations, although it may be hard to avoid taking them too wide. In Section 4 we consider comparison between two groups based on imprecise probabilities, where it is not necessary to state bounds for the random quantities.
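To make the bookkeeping concrete, here is a minimal Python sketch of this section's method; the function names and arguments are our own illustrative choices, not the paper's.

```python
# Minimal sketch of the Section 3 imprecise previsions; assumes all
# observations lie strictly between the stated bounds, as the paper requires.

def previsions(data, lower_bound, upper_bound):
    """Lower and upper prevision for the next observation."""
    n = len(data)
    total = sum(data)
    return (lower_bound + total) / (n + 1), (upper_bound + total) / (n + 1)

def prefers_x(x, y, lx, rx, ly, ry):
    """Sufficient condition for strong preference of X_{n+1} over Y_{m+1}:
    the lower prevision of X exceeds the upper prevision of Y."""
    e_x_lower, _ = previsions(x, lx, rx)
    _, e_y_upper = previsions(y, ly, ry)
    return e_x_lower > e_y_upper
```

Preference for $Y_{m+1}$ over $X_{n+1}$ follows by swapping the argument pairs.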
4. Comparison based on imprecise probabilities

In this section we do not have to restrict the possible values of the real-valued random quantities. If we knew bounds, as used in Section 3, these could be taken into account quite easily, but we restrict the presentation to $-\infty < X_i < \infty$ and $-\infty < Y_j < \infty$. Analogous to Section 3 we start from the assumed predictive probabilities

$$P(-\infty < X_{n+1} < x_1) = P(x_i < X_{n+1} < x_{i+1}) = P(x_n < X_{n+1} < \infty) = \frac{1}{n+1}, \quad \text{for } i = 1, \ldots, n-1,$$

and

$$P(-\infty < Y_{m+1} < y_1) = P(y_j < Y_{m+1} < y_{j+1}) = P(y_m < Y_{m+1} < \infty) = \frac{1}{m+1}, \quad \text{for } j = 1, \ldots, m-1.$$

For ease of notation we define $x_0 = -\infty$. Furthermore, we introduce $s_i$ as the number of observed $y$ values per interval bounded by consecutive $x$ values, so

$$s_i = \#\{y_j \mid x_i < y_j < x_{i+1}\}, \quad i = 0, \ldots, n-1, \quad \text{and} \quad s_n = \#\{y_j \mid x_n < y_j < \infty\}.$$
We derive lower and upper probabilities for the event $X_{n+1} > Y_{m+1}$ by looking at extreme positions of these random quantities given the predictive probabilities for the intervals. The lower probability $\underline{P}(X_{n+1} > Y_{m+1})$ is derived by putting the probability mass $1/(n+1)$ within each interval for $X_{n+1}$ at the infimum of the values per interval ($-\infty$ for the first interval), and the mass $1/(m+1)$ within each interval for $Y_{m+1}$ at the supremum of the values per interval ($\infty$ for the last interval). A lower bound for the probability of the event $X_{n+1} > Y_{m+1}$ is (taking into account that $P(Y_{m+1} < -\infty) = 0$):

$$P(X_{n+1} > Y_{m+1}) \geq \frac{1}{n+1} \sum_{i=1}^{n} P(Y_{m+1} < x_i) = \frac{1}{n+1} \sum_{i=1}^{n} \frac{1}{m+1} \sum_{j=0}^{i-1} s_j = \frac{1}{(n+1)(m+1)} \sum_{j=0}^{n-1} (n-j) s_j.$$

Since we can actually get arbitrarily close to this lower bound (see the construction that we used), we cannot justify a greater lower bound without additional assumptions, and this bound can be regarded as a lower probability

$$\underline{P}(X_{n+1} > Y_{m+1}) = \frac{1}{(n+1)(m+1)} \sum_{j=0}^{n-1} (n-j) s_j.$$
The upper probability $\overline{P}(X_{n+1} > Y_{m+1})$ is derived by putting the probability mass $1/(n+1)$ within each interval for $X_{n+1}$ at the supremum of the values per interval ($\infty$ for the last interval), and the mass $1/(m+1)$ within each interval for $Y_{m+1}$ at the infimum of the values per interval ($-\infty$ for the first interval). An upper bound for the probability of the event $X_{n+1} > Y_{m+1}$ is (taking into account that $P(Y_{m+1} < \infty) = 1$):

$$P(X_{n+1} > Y_{m+1}) \leq \frac{1}{n+1} \sum_{i=1}^{n} P(Y_{m+1} < x_i) + \frac{1}{n+1} = \frac{1}{n+1} \sum_{i=1}^{n} \frac{1}{m+1} \left( 1 + \sum_{j=0}^{i-1} s_j \right) + \frac{1}{n+1} = \frac{1}{(n+1)(m+1)} \left( n + m + 1 + \sum_{j=0}^{n-1} (n-j) s_j \right).$$
Again, since we can actually get arbitrarily close to this upper bound, we cannot justify a smaller upper bound without additional assumptions, and this bound can be regarded as an upper probability

$$\overline{P}(X_{n+1} > Y_{m+1}) = \frac{1}{(n+1)(m+1)} \left( n + m + 1 + \sum_{j=0}^{n-1} (n-j) s_j \right).$$
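These sums are easy to compute directly. The following Python sketch (ours, not from the paper; the names are illustrative) computes both imprecise probabilities from two samples. Note that $\sum_{j=0}^{n-1}(n-j)s_j$ equals the number of pairs $(i,j)$ with $x_i > y_j$ (each $y$ in interval $j$ lies below exactly $n-j$ of the $x$ values), the familiar Mann-Whitney count, consistent with the remark at the end of this section that only the ordering of the observations plays a role.

```python
# Sketch of the Section 4 imprecise probabilities; assumes continuous data
# (no ties), as in the paper.
import bisect

def comparison_probabilities(x, y):
    """Lower and upper probability for the event X_{n+1} > Y_{m+1}."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    # s[j] = number of y values in (x_j, x_{j+1}), with x_0 = -infinity;
    # bisect_left gives the count of x values below each y.
    s = [0] * (n + 1)
    for v in y:
        s[bisect.bisect_left(x, v)] += 1
    # sum_{j=0}^{n-1} (n - j) s_j, i.e. the number of pairs with x_i > y_j
    weighted = sum((n - j) * s[j] for j in range(n))
    denom = (n + 1) * (m + 1)
    return weighted / denom, (n + m + 1 + weighted) / denom
```

For the data of Example 1 below, this returns $(1/121, 2/11)$.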
These imprecise probabilities can be used for predictive comparisons between the two populations. For example, one could say that $X_{n+1}$ is preferred to $Y_{m+1}$ if the lower probability for the event $X_{n+1} > Y_{m+1}$ exceeds a certain value $1-\alpha$, while $Y_{m+1}$ is preferred to $X_{n+1}$ if the lower probability for the event $Y_{m+1} > X_{n+1}$ exceeds the same $1-\alpha$, where (assuming that $P(X_{n+1} = Y_{m+1}) = 0$) we can use $\underline{P}(Y_{m+1} > X_{n+1}) = 1 - \overline{P}(X_{n+1} > Y_{m+1})$ (this follows from the underlying symmetry). To design an experiment using such a form of preference one may be able to use the fact that the imprecision $\Delta$, the difference between the upper and lower probability for an event, does not depend on the data other than through $n$ and $m$:

$$\Delta(X_{n+1} > Y_{m+1}) = \overline{P}(X_{n+1} > Y_{m+1}) - \underline{P}(X_{n+1} > Y_{m+1}) = \frac{n+m+1}{(n+1)(m+1)}.$$

If we choose a value of $\alpha$ to express strong preference related to lower probabilities, then $X_{n+1}$ is preferred if $\underline{P}(X_{n+1} > Y_{m+1}) \geq 1-\alpha$ and $Y_{m+1}$ is preferred if $\underline{P}(Y_{m+1} > X_{n+1}) \geq 1-\alpha$. Therefore, an obvious necessary condition for strong preference of either $X_{n+1}$ or $Y_{m+1}$ is $\underline{P}(X_{n+1} > Y_{m+1}) + \underline{P}(Y_{m+1} > X_{n+1}) \geq 1-\alpha$, so

$$\underline{P}(X_{n+1} > Y_{m+1}) + \underline{P}(Y_{m+1} > X_{n+1}) = 1 - \Delta(X_{n+1} > Y_{m+1}) = \frac{nm}{(n+1)(m+1)} \geq 1 - \alpha$$

is a necessary condition (obviously not sufficient). For example, with $m = n$ we would need to take at least

$$n \geq \frac{\sqrt{1-\alpha}}{1 - \sqrt{1-\alpha}}.$$

For $\alpha = 0.1, 0.05, 0.01$ this implies that we can only have strong preference for either $X_{n+1}$ or $Y_{n+1}$ if $n$ is at least 19, 39, 199, respectively. Contrary to the method in Section 3, in this section the observed values $x_i$ and $y_j$ have not been used explicitly; only the ordering of these observations plays a role.
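A two-line computation (our own illustrative sketch) reproduces these minimal sample sizes for any $\alpha$:

```python
# Sketch (ours) of the design calculation: the smallest n = m for which the
# necessary condition n*m/((n+1)*(m+1)) >= 1 - alpha can possibly hold.
import math

def minimal_n(alpha):
    root = math.sqrt(1 - alpha)
    return math.ceil(root / (1 - root))

print([minimal_n(a) for a in (0.1, 0.05, 0.01)])  # [19, 39, 199]
```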
5. Examples

In this section we illustrate the methods presented in Sections 3 and 4 by two examples, using small data sets taken from elementary textbooks on statistics. The calculations needed for our methods are simple. Example 2 in particular illustrates the idea behind the low stochastic structure assumption made in this paper, and indicates that in many real-world applications of statistical methods we have more knowledge than just the data. If we want to use all our knowledge for the inferences, we can use a subjective concept as developed by Goldstein (1994). But even if there is more knowledge there may be reasons not to use it, e.g. in experiments to verify certain conjectures.

Example 1. The following data are presented by Box et al. (1978, p. 159). A chemical reaction was studied by making 10 runs with a standard method X, and 10 runs with a new, supposedly improved (that is, leading to higher results) method Y. Table 1 lists the ordered results that were obtained. We apply the methods of Sections 3 and 4 to derive predictive inferences for $X_{11}$ and $Y_{11}$, making the necessary exchangeability assumption for these methods.
Table 1. Ordered test results, Example 1

Method X: 40.1  45.8  48.6  50.7  51.5  52.6  54.6  56.3  57.4  64.5
Method Y: 58.7  64.7  66.5  68.1  73.5  73.7  74.9  78.3  80.4  81.0
For the method of Section 3 we need to assume bounds for the possible values of results according to methods X and Y, for the moment say $l_x < X_i < r_x$ and $l_y < Y_j < r_y$. The imprecise previsions for $X_{11}$ and $Y_{11}$ based on these data and bounds are

$$\underline{E}(X_{11}) = \tfrac{1}{11}(l_x + 522.1), \quad \overline{E}(X_{11}) = \tfrac{1}{11}(r_x + 522.1),$$
$$\underline{E}(Y_{11}) = \tfrac{1}{11}(l_y + 719.8), \quad \overline{E}(Y_{11}) = \tfrac{1}{11}(r_y + 719.8).$$
These suggest a strong preference for $Y_{11}$ compared to $X_{11}$, so $\underline{E}(Y_{11}) > \overline{E}(X_{11})$, if $r_x - l_y < 197.7$. We cannot judge the bounds without knowing more about the actual situation and the meaning of the figures, but if, for example, these numbers are percentages then $r_x - l_y$ would be at most 100, indeed suggesting strong preference according to the imprecise previsions method of Section 3.

Straightforward application of the method of Section 4 leads to imprecise probabilities

$$\underline{P}(X_{11} > Y_{11}) = \frac{1}{121} \approx 0.008, \quad \overline{P}(X_{11} > Y_{11}) = \frac{2}{11} \approx 0.182,$$

so $\underline{P}(Y_{11} > X_{11}) = 1 - 2/11 \approx 0.818$ and $\overline{P}(Y_{11} > X_{11}) = 1 - 1/121 \approx 0.992$. Whether or not these two bounds provide evidence strong enough to claim that method Y is better than method X is not a matter for the statistician. However, especially the combination of both methods in this example gives strong support to the claim that method Y will give a higher result at the next observation than method X.
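These values are easy to verify numerically; the following self-contained sketch (ours, not the paper's) uses the identity $\sum_{j=0}^{n-1}(n-j)s_j = \#\{(i,j) : x_i > y_j\}$, which follows by counting, for each $y_j$, the $x$ values above it.

```python
# Check of Example 1; u counts the pairs (i, j) with x_i > y_j.
x = [40.1, 45.8, 48.6, 50.7, 51.5, 52.6, 54.6, 56.3, 57.4, 64.5]
y = [58.7, 64.7, 66.5, 68.1, 73.5, 73.7, 74.9, 78.3, 80.4, 81.0]
n, m = len(x), len(y)
u = sum(xi > yj for xi in x for yj in y)   # u = 1 (only 64.5 > 58.7)
denom = (n + 1) * (m + 1)                  # 121
print(u / denom, (n + m + 1 + u) / denom)  # 1/121 = 0.008..., 22/121 = 0.1818...
```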
Example 2. We use data on birthweights for 12 male and 12 female babies as presented by Dobson (1983, p. 14), see Table 2.

Table 2. Ordered birthweights (g), Example 2

Male (X):   2625  2628  2795  2847  2925  2968  2975  3163  3176  3292  3421  3473
Female (Y): 2412  2539  2729  2754  2817  2875  2935  2991  3126  3210  3231  3317

Besides this information, the original data also provided estimated gestational ages for each baby, and there seemed to be a trend of increasing birthweight with gestational age, so this can be treated as an important covariate. For our method, let us just consider the birthweights without the additional information of the estimated gestational ages. Since we know that this is a significant covariate that is not equal for all babies, an assumption of identical distributions for the weights of all boys will be hard to justify. But if we do not
know the actual values of this covariate per baby, we can assume exchangeability of the weights of 13 male babies, and exchangeability of the weights of 13 female babies, before 12 weights of each actually become available. Under these assumptions, let us see what our methods tell us about the weights $X_{13}$ of the next boy and $Y_{13}$ of the next girl to be weighed. For the imprecise previsions method we assume again that there are bounds known for the possible values, using the same notation as before. We get

$$\underline{E}(X_{13}) = \tfrac{1}{13}(l_x + 36288), \quad \overline{E}(X_{13}) = \tfrac{1}{13}(r_x + 36288),$$
$$\underline{E}(Y_{13}) = \tfrac{1}{13}(l_y + 34936), \quad \overline{E}(Y_{13}) = \tfrac{1}{13}(r_y + 34936).$$
These imprecise previsions would strongly suggest that $X_{13}$ is greater than $Y_{13}$, so $\underline{E}(X_{13}) > \overline{E}(Y_{13})$, if $r_y - l_x < 1352$. The difference between the maximum and minimum observed weights is 1061, but in this case the experts would probably not want to assess bounds that are less than 1352 g apart. Suppose that bounds $l_x = l_y = 800$ and $r_x = r_y = 5000$ were acceptable; then these data do not strongly indicate a higher weight for the next boy than for the next girl.

The method of Section 4 gives imprecise probabilities

$$\underline{P}(X_{13} > Y_{13}) = \frac{86}{169} \approx 0.509, \quad \overline{P}(X_{13} > Y_{13}) = \frac{111}{169} \approx 0.657.$$
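The same pair-counting check as in Example 1 (again our own sketch) reproduces these values:

```python
# Check of Example 2; u counts the pairs with male weight > female weight.
x = [2625, 2628, 2795, 2847, 2925, 2968, 2975, 3163, 3176, 3292, 3421, 3473]
y = [2412, 2539, 2729, 2754, 2817, 2875, 2935, 2991, 3126, 3210, 3231, 3317]
n, m = len(x), len(y)
u = sum(xi > yj for xi in x for yj in y)   # u = 86
denom = (n + 1) * (m + 1)                  # 169
print(u / denom, (n + m + 1 + u) / denom)  # 86/169 = 0.509..., 111/169 = 0.657...
```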
These numbers do not indicate that the data provide very strong evidence for $X_{13} > Y_{13}$. However, if we were offered a bet that pays 1 if $X_{13} > Y_{13}$ and 0 otherwise, we would be willing to buy it for a price even slightly greater than 0.5, whereas we would only want to buy a similar bet on $Y_{13} > X_{13}$ for prices up to $1 - 0.657 = 0.343$. Combining the imprecise previsions and imprecise probabilities we could conclude that there is some evidence in favour of $X_{13} > Y_{13}$, but the evidence is not very strong.

There is an important remark to be made about this example, which may help to understand the low structure assumption used in this paper. Based on this assumption, we put a probability mass $1/13$ between consecutive observations before we have the actual values, and once we have the data these serve to actually create the intervals on the real line, not using any other information than the data. For the predictive distribution of $X_{13}$, the weight of the next boy, our inferences imply that the probability of the event $2625 < X_{13} < 2628$ is assumed to be $1/13$. It is quite likely that one objects to this inference, thinking that one's actual betting behaviour would not be reflected by this number. This is precisely the situation where one feels unhappy with the bet after seeing the data, and obviously this is related to the presence of some knowledge of birthweights. The essential argument of our approach is that inferences are based on the data only, and it indicates how we can learn from the data. As mentioned before, this feature is excellently discussed by Hill (1968, 1988). If one objects to this example, then abstract away from the nature of the numbers, or think of some situation where one would really not have more information than the data alone (for example, weights of green (X) and yellow (Y) creatures on the planet Mars); our methods then show that quite strong inferences can be based on few assumptions.
Acknowledgements

The author would like to thank a referee for helpful comments that improved the presentation.
References

Box, G.E.P., W.G. Hunter and J.S. Hunter (1978), Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building (Wiley, New York).
De Finetti, B. (1974), Theory of Probability, 2 volumes (Wiley, London).
Dempster, A.P. (1963), On direct probabilities, J. Roy. Statist. Soc. Ser. B 25, 100-110.
Dobson, A.J. (1983), An Introduction to Statistical Modelling (Chapman and Hall, London).
Geisser, S. (1993), Predictive Inference: An Introduction (Chapman and Hall, London).
Goldstein, M. (1994), Revising exchangeable beliefs: subjectivist foundations for the inductive argument, in: P.R. Freeman and A.F.M. Smith, eds., Aspects of Uncertainty (Wiley, London) pp. 201-222.
Hill, B.M. (1968), Posterior distribution of percentiles: Bayes' theorem for sampling from a population, J. Amer. Statist. Assoc. 63, 677-691.
Hill, B.M. (1988), De Finetti's theorem, induction and A(n) or Bayesian nonparametric predictive inference, in: J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds., Bayesian Statistics, Vol. 3 (Oxford) pp. 211-241.
Hill, B.M. (1993), Parametric models for A(n): splitting processes and mixtures, J. Roy. Statist. Soc. Ser. B 55, 423-433.
Lindgren, B.W. (1993), Statistical Theory, 4th ed. (Chapman and Hall, New York).
Walley, P. (1991), Statistical Reasoning with Imprecise Probabilities (Chapman and Hall, London).