Journal of Statistical Planning and Inference 11 (1985) 267-276
North-Holland

BACK-TO-BACK MIXTURES OF DISCRETE DISTRIBUTIONS

Paul S. HORN
Department of Mathematical Sciences, Mail Location #25, University of Cincinnati, Cincinnati, OH 45221, USA

Received 9 December 1983; revised manuscript received 4 September 1984
Recommended by V.P. Godambe

Abstract: A method for expanding the choices for fits of discrete data is given. The method is very simple: a breakpoint is chosen for the data set, on either side of which two separate discrete distributions are fit. Thus, the method is a mixture of two discrete distributions. The method is appealing in light of the ease with which the likelihood equations simplify. For illustrative purposes, the method is used on the data set that motivated its conception.

AMS Subject Classification: Primary 62P99; Secondary 62E10.

Key words and phrases: Discrete random variable; Mixture model.
1. Introduction

The motivation for the material presented here came from an analysis of residuals from a counting process. Since the process had occasional large jumps, the residuals tended to be skewed toward large values. The analysis consisted of coming up with an appropriate discrete model of the residuals. This was difficult since the residuals consisted of both positive and negative values and were long-tailed and skewed. For long-tailed skewed discrete data, finding a suitable distribution to describe the behavior may be difficult. It may be the case, however, that one distribution fits one side of the data well, and another fits the other side well, but no single distribution provides a good fit for the entire data set. We propose that this fitting of parts of the data separately, or back-to-back, can yield a good fit. The price that is paid is in terms of the degrees of freedom used for the fit: the total number of parameters for both discrete distributions, plus one for the mixture probability, plus one because the fits are taken separately. Degrees of freedom can be saved by allowing the two distributions to share some, or all, of the parameters. This will of course not provide as good a fit as in the case described above.
0378-3758/85/$3.30 © 1985, Elsevier Science Publishers B.V. (North-Holland)
2. Back-to-back mixtures

We wish to define a discrete probability function, f(·), with associated random variable X, on the integers, say, such that one side exhibits the behavior of discrete probability function f1(·), and the other side exhibits the behavior of another discrete probability function f2(·). The two sides of the domain will be separated at an integer k, the minimum value for f2(·). This number k will be referred to as the breakpoint. Specifically, let

P(X = x) = f(x) = p f1(x) + (1 - p) f2(x),    (1)

where

Σ_{i ≤ k-1} f1(i) = 1,  Σ_{i ≥ k} f2(i) = 1,  f1(i) = 0 for i ≥ k,  f2(i) = 0 for i < k,

and p = P(X < k). For example, with breakpoint k = 0, a Poisson distribution (parameter λ) reflected onto the negative integers and a geometric distribution (parameter r) on the non-negative integers give

f(x) = p e^{-λ} λ^{-(x+1)} / (-(x+1))! · I_{(-∞,-1]}(x) + (1 - p)(1 - r) r^x I_{[0,∞)}(x),  x = ..., -1, 0, 1, ...,

where I_{[·]}(x) is the indicator function. Thus, back-to-back mixing greatly increases our choices of discrete models.

Note that in the above back-to-back mixture model the breakpoint was chosen a priori. This will be the case when there is a natural point at which to separate the data. For example, with signed data a breakpoint of zero (or one) would make sense. However, an a priori choice of a breakpoint, though natural, need not provide the best fit. One could let the breakpoint k vary over the possible data points
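As a minimal sketch (not from the paper), the reflected-Poisson/geometric example above can be written out directly; the function name `back_to_back_pmf` and the parameter values below are my own illustrative choices, used only to confirm that the two sides carry masses p and 1 - p:

```python
import math

def back_to_back_pmf(x, p, lam, r, k=0):
    # Breakpoint k = 0: a Poisson(lam) reflected onto {..., -2, -1}
    # and a geometric(r) on {0, 1, 2, ...}, as in the example above.
    if x < k:
        y = k - 1 - x  # reflect the negative side onto 0, 1, 2, ...
        return p * math.exp(-lam) * lam ** y / math.factorial(y)
    return (1.0 - p) * (1.0 - r) * r ** (x - k)

# the two sides carry masses p and 1 - p, so the pmf sums to 1
total = sum(back_to_back_pmf(x, 0.4, 2.0, 0.7) for x in range(-40, 200))
```

The sum over a wide range of integers should be 1 up to a negligible tail.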
and choose that k that gives, say, the maximum value for the likelihood. This will yield likelihood equations much more difficult to solve than those derived from an a priori breakpoint. The breakpoint, k, can be estimated using criteria other than maximum likelihood. For instance, it can be chosen to minimize the chi-squared goodness-of-fit statistic, and the other parameters (functions of k) can be derived by maximum likelihood. Alternatively, a simpler a priori choice of the breakpoint is the sample mode. Of course when the data are used to estimate the breakpoint an extra degree of freedom is used. The rest of this study will be concerned with situations in which there is a natural breakpoint in the data. An attractive feature of this situation is the simplicity of the resulting likelihood equations.
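Choosing k by maximum likelihood amounts to a simple scan over candidate breakpoints. The sketch below is my own illustration, not code from the paper; it uses a geometric distribution on each (reflected) side purely for concreteness, with the profile MLE r̂ = ȳ/(1 + ȳ) for the pmf (1 - r)r^y:

```python
import math

def geom_profile_loglik(ys):
    """Geometric(r) log-likelihood on y = 0, 1, ..., pmf (1 - r) r^y,
    profiled at its MLE r_hat = ybar / (1 + ybar)."""
    n, s = len(ys), sum(ys)
    if n == 0 or s == 0:
        return 0.0  # empty side, or r_hat = 0 (all mass at y = 0)
    ybar = s / n
    r = ybar / (1.0 + ybar)
    return n * math.log(1.0 - r) + s * math.log(r)

def best_breakpoint(data, candidates):
    """Choose the breakpoint k maximizing the profile log-likelihood:
    the mixture term n log p + (N - n) log(1 - p) with p = n/N, plus the
    two side fits, with the negative side reflected as y = k - 1 - x."""
    N = len(data)
    def loglik(k):
        left = [k - 1 - x for x in data if x < k]
        right = [x - k for x in data if x >= k]
        n = len(left)
        ll = 0.0
        if 0 < n < N:
            p = n / N
            ll += n * math.log(p) + (N - n) * math.log(1.0 - p)
        return ll + geom_profile_loglik(left) + geom_profile_loglik(right)
    return max(candidates, key=loglik)

data = [-6, -3, -2, -1, -1, -1, 0, 0, 1, 1, 2, 4, 7, 12]
k_hat = best_breakpoint(data, range(-3, 4))
```

As the text notes, estimating k this way costs an extra degree of freedom.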
3. Estimation of parameters

We wish to fit the discrete distribution f(·) of Equation (1) to the data points x1, ..., xN, where f1(x) = f1(x; α) and f2(x) = f2(x; β), and where α = (α1, ..., αu) and β = (β1, ..., βv) are parameter vectors. Without loss of generality, we may assume that x1 ≤ x2 ≤ ... ≤ xN. Then the logarithm of the likelihood is as follows:

L(p, α, β) = n log p + Σ_{i=1}^{n} log f1(xi; α) + (N - n) log(1 - p) + Σ_{i=n+1}^{N} log f2(xi; β),

where n is the number of xi's less than k. This is the case since f1(x) = 0 for x ≥ k and f2(x) = 0 for x < k. Note that p̂, the maximum likelihood estimate of p, is the solution to the equation

∂L(p, α, β)/∂p = n/p - (N - n)/(1 - p) = 0,

that is, p̂ = n/N. Thus, p̂ is simply the sample proportion of observations less than k. This is reassuring in light of Equation (1), where it was stated that p = P(X < k).

An attractive feature of the back-to-back mixture is the fact that the likelihood equations for the remaining parameters are no more difficult to solve than the equations for a non-mixture model. This is because the resulting likelihood equations involving the αj's do not involve any βm's. Thus we simply fit the points less than k separately, as we would if these were the only data points. The same is true of the data points greater than or equal to k. The two probability functions that describe the data set are related only through the parameter p. This implies that we need not fit using maximum likelihood; we could fit using, for example, the method of moments. The point is that we can fit the two sides of the data separately.
The back-to-back mixture is thus not as difficult to use in practice as might first appear. For example, if f1(·) is a binomial and f2(·) is a negative binomial then there are a total of 5 parameters, including the mixture probability, p. As previously noted, the maximum likelihood estimate of p is trivial to compute since it does not involve any of the other parameters. The remaining four parameters are estimated through two sets of two equations, not one set of four equations. The former situation is, of course, much simpler.
4. An example

The data given in Table 1 are the residuals from a method of forecasting demand for special types of telecommunication equipment. Knowledge of the distribution of these residuals would be of help to those making future predictions. Note that negative residuals indicate that the forecast was too large, while positive residuals are indicative of under-forecasting. To fit the data from Table 1 we look to a heavy-tailed discrete distribution, say the negative binomial. To fit a negative binomial to these data we will shift it so that
Table 1
Residuals from forecast (f_i = frequency of residual i)
[Four column-pairs of (i, f_i) values for residuals ranging from -28 to 188; the individual entries were scrambled in extraction.]
the sample minimum, -28, corresponds to the point 0. It is important to note that this shift and subsequent fit will force P(X < -28) = 0. This will cause a logistical problem if there is no natural minimum for the data being examined. Since there is no reason, in this example, why an observation could not be -29, there is some problem with the single distribution fit. However, since it could be argued that such observations are infrequent, this problem is not worth troubling over.

Using the formulas found in Johnson and Kotz (1969), p. 132, we compute the maximum likelihood estimates N̂ and P̂ to be 22.65 and 1.308 respectively. Thus, our fitted model is

P(X = x) = Γ(50.65 + x) / (Γ(22.65)((28 + x)!)) · (1.308/2.308)^{x+28} (1/2.308)^{22.65},  x = -28, -27, -26, ....
In Figure 1, histograms of the empirical data and the fitted negative binomial are given (though not all 22 cells are shown). The fit is not very good, as shown by the diagram and the chi-squared statistic, which has a value over 6000.

Fig. 1. Negative binomial fit. (Chi-squared statistic = 6267; 19 degrees of freedom.)

We will now fit a back-to-back mixture. We must first decide on the breakpoint, k. Since the data are residuals, and thus signed, k = 0 is an appropriate point at which to split the data set. This has the interpretation of treating residuals from strictly over-forecasting differently from the other residuals. Thus, the negative (< 0) residuals will be used to give one fitted distribution, and the non-negative (≥ 0) residuals will be used to give another fitted distribution.

From Table 1, we compute the sample mean and variance of the negative data to be -3.402 and 9.163 respectively. We further compute the sample mean and variance of the non-negative data to be 5.317 and 142.375 respectively. Since for both parts of the data set the variances are much greater than the absolute values of the means, and we wish to use only a few parameters, we turn to suitable one-parameter distributions to describe each part of the data set. We will fit a geometric distribution, with parameter r, to the negative part of the data, and a (shifted) zeta distribution, with parameter ρ, to the non-negative part of the data (Johnson and Kotz, p. 240). This yields the fitted model

P(X = x) = 0.4224(1 - 0.7061)(0.7061)^{-x-1} I_{(-∞,-1]}(x) + 0.5776 (x + 1)^{-1.615}/2.251 · I_{[0,∞)}(x),  x = ..., -1, 0, 1, ....
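The fitted geometric parameter can be checked from the reported sample mean alone, which is a useful sanity check on the model above (this snippet is my own, not from the paper): reflecting x onto y = -x - 1 maps the negative residuals onto 0, 1, 2, ..., so the reflected mean is 3.402 - 1 = 2.402 and the geometric MLE for the pmf (1 - r)r^y is r̂ = ȳ/(1 + ȳ):

```python
# sanity-check r_hat against the reported negative-side mean of -3.402
ybar = 3.402 - 1.0            # mean of the reflected values y = -x - 1
r_hat = ybar / (1.0 + ybar)   # ~0.7061, matching the fitted model above
```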
In Figure 2, the empirical and fitted histograms are given along with the chi-squared statistic. Note the improvement of the back-to-back model over the previous model. With the single distribution fit the value of the chi-squared statistic was 6267 with 19 degrees of freedom (22 (cells) - 2 (parameters) - 1), while the above back-to-back fit has a chi-squared value of 966 with 17 degrees of freedom. (The negative data have 6 (8 - 1 - 1) and the non-negative data have 12 (14 - 1 - 1) degrees of freedom respectively. An extra degree of freedom is used to estimate the
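The chi-squared values quoted throughout are the usual Pearson goodness-of-fit sums over binned counts; as a reminder, a minimal implementation (my own sketch, with toy counts):

```python
def chi_squared(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

perfect = chi_squared([5, 10, 5], [5.0, 10.0, 5.0])   # 0.0 for an exact fit
off = chi_squared([6, 10, 4], [5.0, 10.0, 5.0])       # (1/5) + 0 + (1/5)
```

The degrees of freedom are then cells minus fitted parameters minus one, as computed in the text.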
Fig. 2. Back-to-back geometric/zeta fit. (Chi-squared statistic = 966; 17 degrees of freedom; ○: empirical, Δ: model.)
mixture probability.) Although both chi-squared statistics are highly significant, as will often be the case with a large data set, the back-to-back fit is an improvement over the single fit by a factor of almost 6.5, while only using two more degrees of freedom. The improvement in the fit, as shown by the diagrams and the chi-squared statistics, is worth the price of two degrees of freedom.

Furthermore, back-to-back mixtures do not suffer from the logistical restrictions of the single discrete distribution. With the back-to-back mixture there is no restriction on the minimum value of a residual. Even if there were a natural minimum in the above example, a back-to-back mixture would still have been appropriate: a truncated geometric distribution could have been used to model the negative residuals.

As a further example, let us fit a back-to-back model to these data with the negative values again described by a geometric (parameter r), but now with a negative binomial distribution (parameters N and P) to describe the non-negative values. This yields the fitted model

P(X = x) = 0.4224(1 - 0.7061)(0.7061)^{-x-1} I_{(-∞,-1]}(x) + 0.5776 · Γ(0.4243 + x)/(Γ(0.4243) x!) · (0.9261)^x (1 - 0.9261)^{0.4243} I_{[0,∞)}(x),  x = ..., -1, 0, 1, ....
Figure 3 gives the empirical and the fitted histograms along with the chi-squared statistic. Again we see a great improvement in the fit. Here the value of the chi-squared statistic is 274 with 16 degrees of freedom. Thus, for the price of three degrees of freedom we see an improvement over the single distribution fit by a factor of almost 23. (For one degree of freedom, this back-to-back mixture model is an improvement over the previous back-to-back mixture model by a factor of 3.5.)

Fig. 3. Back-to-back geometric/negative binomial fit. (Chi-squared statistic = 274; 16 degrees of freedom.)

As a final example let us fit the following back-to-back model:

P(X = x) = p1 f1(x) I_{(-∞,-1]}(x) + p2 I_{{0}}(x) + p3 f2(x) I_{[1,∞)}(x),  x = ..., -1, 0, 1, ...,

where p1 + p2 + p3 = 1. This model is an improvement over the previous back-to-back models in that by fitting the breakpoint, 0, separately, we do not have to decide whether to include it with either the strictly positive or strictly negative data points. Of course, by fitting the breakpoint separately the fitted frequency will necessarily equal the observed frequency at a cost of one degree of freedom. If we fit the above model to the data with the negative data described by a geometric distribution, and the positive data described by a negative binomial distribution, we get the fitted model

P(X = x) = 0.4224(1 - 0.7061)(0.7061)^{-x-1} I_{(-∞,-1]}(x) + 0.1701 I_{{0}}(x) + 0.4075 · Γ(x - 0.5340)/(Γ(0.4660)(x - 1)!) · (0.9344)^{x-1}(1 - 0.9344)^{0.4660} I_{[1,∞)}(x),  x = ..., -1, 0, 1, ....

Figure 4 gives the empirical and fitted histograms along with the chi-squared statistic, which is equal to 213 with 15 degrees of freedom. Thus, for the price of four degrees of freedom the above back-to-back mixture model is an improvement over the single distribution fit by a factor of almost 30. (For one degree of freedom this fit is better than the previous back-to-back fit by about 20%.)
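The negative binomial pmf with a real-valued shape parameter, as used in both fitted models above, is straightforward to evaluate via gamma functions. A small self-contained sketch (mine, not the paper's); the function name `neg_binomial_pmf` is an assumption:

```python
import math

def neg_binomial_pmf(y, n, p):
    # Negative binomial with real-valued shape n and success prob p,
    # on y = 0, 1, 2, ...: Gamma(n + y) / (Gamma(n) y!) * p^y * (1 - p)^n.
    return (math.gamma(n + y) / (math.gamma(n) * math.factorial(y))
            * p ** y * (1.0 - p) ** n)

# its mean is n p / (1 - p); the fitted n = 0.4243, p = 0.9261 reproduce
# the non-negative sample mean 5.317 quoted earlier
mean = 0.4243 * 0.9261 / (1.0 - 0.9261)
```

In the final model the non-negative side is shifted by one, i.e. evaluated at y = x - 1 for x ≥ 1.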
5. Iso-back-to-back mixtures

In the first example of a back-to-back mixture model, we used two different distributions, Poisson and geometric, with different parameters, λ and r. Similarly, in the first back-to-back fit we used two different distributions, geometric and zeta, with different parameters, r and ρ. The reason for this was that we believed that the negative residuals behaved differently from the non-negative ones. This was obvious by looking at Table 1, where the non-negative residuals are much more skewed.
Fig. 4. Back-to-back geometric/breakpoint/negative binomial fit. (Chi-squared statistic = 213; 15 degrees of freedom.)
We need not adhere to such generalities if we believe that the data on both sides of the breakpoint, k, have the same shape. Let f(·; β) be a discrete probability function defined on the non-negative integers. Now, let us define the iso-back-to-back mixture model of f(·) with breakpoint, k, as

P(X = x) = f(x) = p f(k - 1 - x; β) I_{(-∞,k-1]}(x) + (1 - p) f(x - k; β) I_{[k,∞)}(x),  x = ..., -1, 0, 1, ....

Note that, unlike the previous back-to-back model, the parameter vectors in this model are the same. The iso-back-to-back could even be used with different types of discrete distributions. For example, one could fit non-negative data using a binomial distribution with parameters N and p, and fit the negative data with a Poisson distribution with parameter λ = Np. This would save one degree of freedom. In general, one would use the iso-back-to-back mixture model if one believes the data exhibit similar shapes on either side of the breakpoint and/or there are not many degrees of freedom to give up.
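The iso model above can be sketched in a few lines (my own illustration; the shared geometric pmf and the parameter values are arbitrary assumptions, used only to verify the mixture still sums to one):

```python
def iso_pmf(x, p, f, k=0):
    # Iso-back-to-back: the SAME one-sided pmf f on both sides,
    # reflected below the breakpoint: p f(k-1-x) for x < k,
    # (1-p) f(x-k) for x >= k.
    return p * f(k - 1 - x) if x < k else (1.0 - p) * f(x - k)

def geom(y, r=0.6):          # illustrative shared pmf on y = 0, 1, 2, ...
    return (1.0 - r) * r ** y

total = sum(iso_pmf(x, 0.45, geom) for x in range(-80, 80))
```

Only one parameter vector is fitted, which is exactly the degrees-of-freedom saving the text describes.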
6. Conclusion

There is much to be gained by back-to-back mixing, especially with long-tailed, skewed, discrete data. For the price of a few degrees of freedom, substantial gains can be made in fitting a back-to-back mixture over a single fit of the whole data. Back-to-back mixtures may be especially appropriate if there is a natural 'break' in the data. For example, with signed data, like residuals, the point zero will often be appropriate. That is, treat the strictly negative residuals separately from those that are greater than or equal to zero.

If one believes that the data have similar shapes on either side of the breakpoint, and/or there are not many degrees of freedom in the first place, then iso-back-to-back mixtures would be appropriate. These models are similar to the ordinary back-to-back mixtures, except that the two distributions use some or all of the same parameters. Thus, not as many degrees of freedom are used for the fit and both sides of the data may be modeled with similarly shaped distributions.

Though not examined in this study, back-to-back mixtures may be appropriate with bimodal data, the obvious breakpoint lying between the two modes. Back-to-back mixtures that use more than one breakpoint are also possible; the likelihood equations, though more numerous, would be no more difficult to solve than in the single-breakpoint mixture. In the residual setting, for example, we could fit a separate parameter to P(X = 0) at the cost of one degree of freedom. This would preclude our having to group the zeroes with either the strictly positive or strictly negative residuals.

Finally, we note that all that has been presented for discrete models holds for continuous models as well. (There will be a discontinuity at the breakpoint(s), however.) This would be an interesting area for further study.
Acknowledgement

The author wishes to thank the referees for their helpful suggestions.
References

Blischke, W.R. (1962). Moment estimators for the parameters of a mixture of two binomial distributions. Ann. Math. Statist. 33, 444-454.
Blischke, W.R. (1964). Estimating the parameters of mixtures of binomial distributions. J. Amer. Statist. Assoc. 59, 510-528.
Johnson, N.L. and S. Kotz (1969). Discrete Distributions. John Wiley and Sons, New York.