Journal of Statistical Planning and Inference 11 (1985) 57-69
North-Holland

NUMERICAL ESTIMATION OF A PROBABILITY MEASURE

Dankmar BÖHNING

Department of Epidemiology, Free University Berlin, Kelchstr. 31, 1000 Berlin 41, Federal Republic of Germany

Received 26 February 1983; revised manuscript received 27 February 1984
Recommended by P.K. Sen

Abstract: In this paper an iterative solution to the problem of maximizing a concave functional φ defined on the set of all probability measures on a topological space is considered. Convergence aspects of this procedure are studied. A rapidly converging algorithm is discussed, along with the ones developed earlier by Wynn, Fedorov, Atwood, Wu and others. Computational examples are taken from the area of mixture likelihoods and optimal experimental design.

AMS Subject Classification: Primary 62G05; Secondary 62K05.

Key words and phrases: Iterative procedure; Nonparametric estimation; Maximum likelihood estimation; Mixture likelihood; Optimal experimental design.

1. The problem and examples

We will consider the following problem:

(P)  Find δ̂ ∈ Δ such that φ(δ̂) ≥ φ(δ) for all δ ∈ Δ.
φ: Δ → ℝ ∪ {−∞} is a concave functional defined in a neighborhood of the set Δ of all probability measures (p.m.'s) on the Borel sigma-algebra of a compact topological space T, often a subset of the p-dimensional Euclidean space ℝ^p. We will look at two examples for problem (P) which arise in different areas of statistical interest. Both can have continuous T.
Example 1.1 (optimal experimental design). In (approximate) optimal design theory we try to find an approximate design δ (e.g. a p.m.) on the design space T such that the information matrix J(δ) := ∫_T f f^T dδ is as 'big' as possible (Kiefer, 1974; Silvey, 1980), where f: T → ℝ^p is a known regression function vector. The 'bigness' of J(δ) is measured with respect to a concave functional such as

φ_D(δ) = log det(J(δ))    (1)

with the convention log 0 = −∞. A p.m. δ̂ which solves (P) for φ = φ_D is called an approximate D-optimal design. Other criteria are in use, so that we may consider a
criterion φ_Z(δ) = Z(J(δ)) for some concave functional Z defined on the set of all information matrices.

Example 1.2 (maximum likelihood estimation of a mixture distribution). We consider a parametric family of densities l(x, t) with respect to a sampling space X ⊂ ℝ^q and a parameter space T. The mixture distribution comes in if we want to determine the prior distribution of the parameter t. Then

l(x, δ) := ∫_T l(x, t) δ(dt)    (2)

also defines a family of densities with new parameter space Δ. For example, let x be Poisson distributed; then (2) becomes

l(x, δ) := ∫_0^∞ e^{−t} t^x / x! δ(dt).

The problem of finding the mixing distribution of a Poisson process is treated in Simar (1976) and Böhning (1982). A more general discussion of mixture likelihoods can be found in Lindsay (1983) and Everitt and Hand (1981). Often the problem of finding a mixing distribution is formulated in terms of finding the distribution of a latent variable rather than in terms of finding a prior distribution. For example, in medical diagnosis a situation arises where we have a symptom space X = {1, ..., m} and a disease space T = {1, ..., n}. Then l(i, j) gives the conditional probability that a person has symptom i given that he/she has disease j. In this case (2) reduces to

l(i, δ) = Σ_{j=1}^n l(i, j) δ(j)    (3)

and δ gives the distribution of the disease variable in the population. In other latent-variable approaches X is finite and T continuous (Bartholomew, 1980). Having observed x_1, ..., x_N, the problem of maximum likelihood estimation leads to (P) with

φ(δ) = φ_ML(δ) = Σ_{i=1}^N log ∫_T l(x_i, t) δ(dt).    (4)

Let n_i denote the number of those x_1, ..., x_N which are equal to i. It then follows that in the Poisson case, (4) becomes

φ_PO(δ) = Σ_i n_i log ∫_0^∞ e^{−t} t^i / i! δ(dt)    (5)
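For a finitely supported mixing p.m. δ = Σ_k α_k δ_{t_k}, (5) can be evaluated directly. The following Python sketch does so; the function name phi_po and the counts/support/weights representation are illustrative conventions for this sketch, not notation from the paper.

```python
import math

# A minimal sketch of the Poisson mixture log-likelihood phi_PO in (5) for a
# finitely supported mixing p.m. delta = sum_k weights[k] * delta_{support[k]};
# counts[i] holds n_i, the number of observations equal to i.

def phi_po(counts, support, weights):
    ll = 0.0
    for i, n in enumerate(counts):
        if n == 0:
            continue
        # l(i, delta) = sum_k alpha_k * exp(-t_k) * t_k^i / i!
        li = sum(a * math.exp(-t) * t**i / math.factorial(i)
                 for t, a in zip(support, weights))
        ll += n * math.log(li)
    return ll
```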
and in the medical diagnosis problem (see (3)), (4) becomes

φ_DIA(δ) = Σ_{i=1}^m n_i log Σ_{j=1}^n l(i, j) δ(j).    (6)
Notations. We now introduce the technical apparatus. Let Δ⁺ := {δ ∈ Δ | φ(δ) > −∞} and, for K ∈ ℝ, Δ_K := {δ ∈ Δ | φ(δ) ≥ K}. Also let Δ_D denote the set of all p.m.'s with finite support, that is, all those δ ∈ Δ which have a representation

δ = Σ_i α_i δ_{t_i}

with α_i > 0, Σ α_i = 1, t_i ∈ T, and δ_t denoting the unit measure at t ∈ T. Also

Δ_D⁺ := Δ⁺ ∩ Δ_D,   Δ_D^K := Δ_K ∩ Δ_D.

M denotes the set of all signed measures, and S(δ) the support of δ ∈ Δ.

Assumptions. (i) φ is continuous on Δ⁺ with respect to the vague topology, and if δ_n → δ ∈ Δ \ Δ⁺ then φ(δ_n) → −∞.

(ii) For all δ ∈ Δ⁺ and all γ ∈ M the directional derivative at δ in the direction γ,

∇φ(δ, γ) := lim_{β↓0} [φ(δ + βγ) − φ(δ)]/β,

exists and is linear in γ. Specifically, the directional derivative at δ in any vertex direction δ_t − δ exists and is equal to ∇φ(δ, δ_t) − ∇φ(δ, δ). We assume further that the latter is continuous in t. For δ ∈ Δ define

TMAX(δ) := {t ∈ T | sup_{t′∈T} ∇φ(δ, δ_{t′}) = ∇φ(δ, δ_t)},

TMIN(δ) := {t ∈ S(δ) | inf_{t′∈S(δ)} ∇φ(δ, δ_{t′}) = ∇φ(δ, δ_t)}.
Note that TMAX(δ) and TMIN(δ) are both subsets of T and nonempty.

(iii) For all γ > 0 and K ∈ ℝ there exists β₀ = β₀(γ, K) ∈ (0, 1] such that ∇φ(δ, δ_t − δ) ≥ γ implies

φ(δ + β δ(t′)(δ_t − δ_{t′})) − φ(δ) ≥ ½ δ(t′) β γ

for all β ∈ [0, β₀], all δ ∈ Δ_D^K, all t ∈ T, and all t′ ∈ TMIN(δ).

Remark. Assumption (i) ensures the existence of a solution of problem (P). T is compact, from which we can conclude that Δ is vaguely compact (Bauer, 1978, p. 224). Δ_K is vaguely closed, since φ is continuous, and, as a closed subset of a compact set, it is itself compact. Therefore φ attains its maximum on Δ_K. Because of the second part of assumption (i) there exists K ∈ ℝ such that

sup_{δ∈Δ_K} φ(δ) = sup_{δ∈Δ⁺} φ(δ).

Assumption (ii) allows the definition of our algorithm. Note that we have

∇φ(δ, δ̄ − δ) ≥ φ(δ̄) − φ(δ)    (7)

and

sup_{t∈T} ∇φ(δ, δ_t − δ) ≥ φ(δ̂) − φ(δ)    (8)

for δ, δ̄ ∈ Δ⁺ and δ̂ a solution of problem (P). From (8) follows the equivalence theorem of Whittle (1973), which confirms the equivalence of

φ(δ̂) ≥ φ(δ) for all δ ∈ Δ   and   sup_{t∈T} ∇φ(δ̂, δ_t − δ̂) ≤ 0.

Assumption (iii) is essential for the convergence (to an optimum) of our algorithm. It asserts that the ascent of the functional will be at least linear if a certain non-optimality condition is satisfied. If φ = ψ ∘ m, where m is any convex-linear mapping m: Δ → ℝ^N, i.e. m((1 − β)δ + βδ̄) = (1 − β)m(δ) + βm(δ̄) for β ∈ [0, 1] and δ, δ̄ ∈ Δ, and ψ: ℝ^N → ℝ ∪ {−∞} is continuously differentiable on M_K := {m(δ) | δ ∈ Δ, ψ(m(δ)) ≥ K}, K ∈ ℝ, then assumption (iii) is met. This can be worked out similarly to the proof of the corollary in Böhning (1982). Note that for all functionals considered thus far, assumptions (i)-(iii) are fulfilled.
2. The algorithm and its convergence

For problem (P) the Fedorov-Wynn algorithm (Fedorov, 1972; Wynn, 1970) provides a solution in the following way. Wu (1978) suggested that the name Vertex-Direction Method is more appropriate.

Vertex-Direction Method (VDM)
1. Select δ_1 ∈ Δ_D⁺. Set j = 1.
2. Choose t_j ∈ TMAX(δ_j).
3. Find β_j ∈ [0, 1] such that φ(δ_j + β_j(δ_{t_j} − δ_j)) = sup_{β∈[0,1]} φ(δ_j + β(δ_{t_j} − δ_j)).
4. Set δ_{j+1} := δ_j + β_j(δ_{t_j} − δ_j), set j := j + 1, and go to 2.
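On a finite space T the VDM is only a few lines of code. Below is a minimal Python sketch under the assumptions that T = {0, ..., n−1}, that phi evaluates the functional on a weight vector, and that dphi(delta, t) returns the vertex directional derivative ∇φ(δ, δ_t − δ); the crude grid search in step 3 is a stand-in for an exact line maximization (Section 3 develops a better one). All names are illustrative.

```python
import numpy as np

# Minimal sketch of the VDM on a finite T = {0, ..., n-1}. phi(delta) evaluates
# the concave functional on a weight vector; dphi(delta, t) returns the vertex
# directional derivative. The grid line search stands in for step 3.

def vdm(phi, dphi, n, iters=100):
    delta = np.full(n, 1.0 / n)                    # step 1: start at the uniform p.m.
    for _ in range(iters):
        t = max(range(n), key=lambda s: dphi(delta, s))   # step 2: t_j in TMAX(delta_j)
        e_t = np.eye(n)[t]                          # the vertex (unit measure at t)
        betas = np.linspace(0.0, 1.0, 101)
        beta = max(betas, key=lambda b: phi((1 - b) * delta + b * e_t))  # step 3
        delta = (1 - beta) * delta + beta * e_t     # step 4: move toward the vertex
    return delta
```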
Atwood (1973) suggested some refinements of the VDM. The Exchange Method of Böhning (1981) can also be viewed as an improvement of the VDM. A new concept was introduced by Atwood (1976) and Wu (1978). Atwood suggested a projected Newton algorithm (which we refer to here as NEWTON) and Wu discussed various iteration functions, including a projected gradient algorithm (which we call here PROGRAD). With respect to the VDM, PROGRAD and especially NEWTON have an improved convergence behavior. On the other hand, NEWTON has a high numerical complexity. Our aim has been to find an algorithm which retains a low numerical complexity while showing a 'good' convergence rate.

Algorithm 1
1. Select δ_1 ∈ Δ_D⁺. Set j = 1.
2. Choose t_j^(1) ∈ TMAX(δ_j), t_j^(2) ∈ TMIN(δ_j).
3. Find β_j ∈ [0, 1] such that

φ(δ_j + β_j δ_j(t_j^(2))(δ_{t_j^(1)} − δ_{t_j^(2)})) = sup_{β∈[0,1]} φ(δ_j + β δ_j(t_j^(2))(δ_{t_j^(1)} − δ_{t_j^(2)})).

4. Set δ_{j+1} := δ_j + β_j δ_j(t_j^(2))(δ_{t_j^(1)} − δ_{t_j^(2)}), set j := j + 1, and go to 2.

Whereas the VDM always leaves mass at the initial support points, Algorithm 1 can move the mass from one point to another in one step. It therefore supports the exchange of points if it is needed. On the other hand, Algorithm 1 varies only the weights of two points in one step while the others are kept constant.
Theorem 2.1. If φ meets assumptions (i), (ii), and (iii), then for any sequence (δ_j) created by Algorithm 1 we have

lim_{j→∞} φ(δ_j) = sup_{δ∈Δ} φ(δ).

Proof. By construction we have

... ≥ φ(δ_{j+1}) ≥ φ(δ_j) ≥ ... ≥ φ(δ_1) =: K

for all j. Therefore δ_j ∈ Δ_D^K for all j ∈ ℕ. Since (φ(δ_j)) is bounded above, it converges. Let φ⁺ be its limit. Assume

φ⁺ < φ(δ̂).

Then

∇φ(δ_j, δ_{t_j^(1)} − δ_j) ≥ φ(δ̂) − φ(δ_j) ≥ φ(δ̂) − φ⁺ =: 2γ > 0

for some suitable γ, using (8). Thus, by assumption (iii),

φ(δ_{j+1}) − φ(δ_j) ≥ φ(δ_j + β₀ δ_j(t_j^(2))(δ_{t_j^(1)} − δ_{t_j^(2)})) − φ(δ_j) ≥ ½ δ_j(t_j^(2)) β₀ γ > 0    (9)

for all j ∈ ℕ. Since Δ_D^K is vaguely compact there exists a subsequence (δ_{j_k}) with δ_{j_k} → δ̄ ∈ Δ_D^K. For the sake of simplicity let us call this new sequence (δ_j). Recall
that δ_j → δ̄ in the vague topology means that

∫ f dδ_j → ∫ f dδ̄

for every continuous f on T. Consequently, for each convergent sequence (t_j), t_j ∈ S(δ_j), there exists t ∈ S(δ̄) such that

lim t_j = t   and   lim_{j→∞} δ_j(t_j) = δ̄(t).

Now let (t_{j_k}^(2))_k be a convergent subsequence with t_{j_k}^(2) ∈ TMIN(δ_{j_k}). Then

lim_{k→∞} t_{j_k}^(2) ∈ S(δ̄)   and   lim_{k→∞} δ_{j_k}(t_{j_k}^(2)) = δ̄(lim t_{j_k}^(2)) =: v.

Thus we attain from (9) that there exists j₀ ∈ ℕ such that

φ(δ_{j+1}) − φ(δ_j) ≥ ¼ v β₀ γ > 0

for infinitely many j ≥ j₀, contradicting the Cauchy property of (φ(δ_j)). □
Before we give numerical demonstrations of Algorithm 1 we discuss a quadratically convergent algorithm for finding the line minimizer.

3. An algorithm for finding the line minimizer

In step 3 of the VDM and Algorithm 1 we have to solve a problem of the following type:

(LM)  Find β̂ ∈ [0, 1] such that f(1 − β̂, β̂) ≥ f(1 − β, β) for all β ∈ [0, 1].

f is the functional of interest, which we have here defined on ℝ² for reasons which will soon become clear. One can apply the Newton-Raphson iteration to find β̂. The main disadvantage of this concept lies in the fact that the iterated elements may leave the feasible region [0, 1]. An example of this behavior is discussed in Böhning (1983). We therefore have developed a technique which forces all iterated elements to stay in the feasible region and exploits the positivity of the gradient of the functional of interest, a concept introduced by Silvey, Titterington and Torsney (1978). Thus, we assume from here on that f has positive partial derivatives. An alternative formulation of (LM) is

(LM')  Find (β̂_1, β̂_2)^T ∈ Ω such that f(β̂_1, β̂_2) ≥ f(β_1, β_2) for all (β_1, β_2)^T ∈ Ω.

Here Ω := {(β_1, β_2)^T | β_1, β_2 ≥ 0, β_1 + β_2 = 1}, and g_1, g_2: (0, 1) → ℝ₊ are defined by
D. Biihning / Numerical estimation of a probability measure
s,(P) :=g
(A 1 -P),
g2m
:=g
1
(P,
63
1 -P).
2
Then a solution of problem (LM) can be characterized arbitrary A>O of F: (41) x R, -+(O, 1) given by Q/-7 A) = P(& (P))%/%
as the fixed point in /I for
(P)Y + (1 - P)(s2(P))A 1
We now want to find β̂ ∈ (0, 1) such that F(β̂, λ) = β̂ for all λ > 0.

The idea of the algorithm is as follows. We start at some initial value β and try to find a new value F(β, λ), dependent on λ, such that the difference

F(β̂, λ) − F(β, λ) = β̂ − F(β, λ)    (11)

is small. Of course, we do not know the fixed point β̂. Therefore we choose a first-order approximation of (11), namely (∂F/∂β)(β, λ)(β̂ − β). We now look for a value λ(β) such that

(∂F/∂β)(β, λ(β)) = 0.    (12)

It is not difficult to see that

(∂F/∂β)(β, λ) = (g_1 g_2)^λ {1 − λβ(1 − β)[g_2′/g_2 − g_1′/g_1]} / [β g_1^λ + (1 − β) g_2^λ]²,

in which we used the following abbreviations: g_i = g_i(β), g_i′ = g_i′(β) = (dg_i/dβ)(β), i = 1, 2.
(12) becomes zero if λ is chosen according to

λ(β) = {β(1 − β)[g_2′(β)/g_2(β) − g_1′(β)/g_1(β)]}^{-1},    (13)

assuming that the expression on the right side of (13) exists. Thus we are in the position to formulate the following.

Algorithm 2
1. Choose β^(1) ∈ (0, 1). Set j = 1.
2. If g_1(β^(j)) = g_2(β^(j)), stop.
3. Compute λ^(j) = λ(β^(j)) according to (13). If not possible, stop.
4. Set β^(j+1) = F(β^(j), λ^(j)), j = j + 1, and go to 2.

Remark. If Algorithm 2 terminates in step 2, we know that a maximizing value of β has been found. If it terminates in step 3, Algorithm 2 has failed. Note that this cannot occur if f is strictly concave in a neighborhood of β̂. If ∇²f denotes the Hessian of f, we have in this case

g_1′(β) − g_2′(β) = (1, −1) ∇²f(β, 1 − β) (1, −1)^T < 0

for all β in this neighborhood of β̂. Because g_1(β) = g_2(β) + o(1) as β → β̂, we find that (13) is well defined in a neighborhood of β̂. We now come to the main theorem of this section.
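As a concrete sketch, the following Python routine implements Algorithm 2 for the logarithmic series log-likelihood of Example 3.2 below, f(β_1, β_2) = x̄ log β_1 − log(−log β_2); for this f the λ(β) of (13) reduces to 1/{1 + β/log(1 − β)}, and the iterates reproduce Tables 1 and 2. The function and variable names are illustrative.

```python
import math

# Minimal sketch of Algorithm 2 for the logarithmic series likelihood of
# Example 3.2, with f(b1, b2) = xbar*log(b1) - log(-log(b2)).

def algorithm2(xbar, beta=0.5, tol=1e-10, max_iter=50):
    g1 = lambda b: xbar / b                            # df/d(beta_1) at (b, 1-b)
    g2 = lambda b: -1.0 / ((1 - b) * math.log(1 - b))  # df/d(beta_2) at (b, 1-b)
    lam = lambda b: 1.0 / (1.0 + b / math.log(1 - b))  # lambda(beta) from (13)
    for _ in range(max_iter):
        if abs(g1(beta) - g2(beta)) < tol:             # step 2: gradients equal, MLE found
            break
        l = lam(beta)                                  # step 3
        w = beta * g1(beta) ** l                       # step 4: beta <- F(beta, lambda)
        beta = w / (w + (1 - beta) * g2(beta) ** l)
    return beta

print(algorithm2(1.5))                 # ~0.5336, cf. Table 1
print(algorithm2(1.02, beta=0.9999))   # ~0.0387, cf. Table 2
```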
Theorem 3.1. Let f have continuous partial derivatives up to the order of three. In addition, let f be strictly concave in a neighborhood of β̂. Then there exists η > 0 such that if |β^(1) − β̂| ≤ η, the sequence (β^(j)) created by Algorithm 2 converges quadratically to β̂.
Proof. From the remarks following the formulation of Algorithm 2 we can conclude that there exists η₁ > 0 such that λ(β) is well defined if |β − β̂| ≤ η₁. Now, let us define B_{η₁}(β̂) := {β | |β − β̂| ≤ η₁} and note that

B_{η₁,Λ}(β̂) := B_{η₁}(β̂) × λ(B_{η₁}(β̂)) ⊂ [0, 1] × ℝ

is a compact set. We now apply Taylor's theorem:

β̂ − F(β, λ(β)) = F(β̂, λ(β)) − F(β, λ(β))
  = (∂F/∂β)(β, λ(β))(β̂ − β) + ½ (∂²F/∂β²)(β + α(β̂ − β), λ(β))(β̂ − β)²

with α ∈ [0, 1]. Because (∂F/∂β)(β, λ(β)) = 0 and ∂²F/∂β² is bounded on B_{η₁,Λ}(β̂), we arrive at

|β̂ − F(β, λ(β))| ≤ C |β̂ − β|²

for β ∈ B_{η₁}(β̂) and some constant C > 0, from which the theorem can easily be derived. □
Example 3.2. f is the log-likelihood function of the logarithmic series distribution, that is, f(β_1, β_2) = x̄ log β_1 − log(−log β_2), where x̄ is the mean of the observed frequencies.
Table 1
Convergence of Algorithm 2 for x̄ = 1.5

Step j   β^(j)     β^(j+1)   q^(j)     λ^(j)
1        0.50000   0.53489   0.03871   3.58870
2        0.53489   0.53359   0.00083   3.31964
3        0.53359   0.53359   -         3.32906
According to (13) we find λ(β) = 1/{1 + [β/log(1 − β)]}. We consider two examples. (i) x̄ = 1.5: this is the good-natured example; here the classical Newton procedure also leads to the solution β̂ = 0.5336. (ii) x̄ = 1.02: this is the ill-natured example; the classical Newton procedure converges to β = −138.14. Here β̂ = 0.0387. We now take both cases as test examples for our Algorithm 2. In example (i) the MLE is found in the second step; see Table 1. In the other example β̂ is found in the 8-th step, starting as far away from β̂ as we can; see Table 2. In column 4 is given

q^(j) := |β^(j+1) − β̂| / |β^(j) − β̂|.

This column shows nicely the superlinear convergence rate of Algorithm 2. Especially column 5 of Table 2 shows that there will not be a specific constant λ-value, but rather a wide range of possible values of optimal λ-parameters.
Table 2
Convergence of Algorithm 2 for x̄ = 1.02

Step j   β^(j)     β^(j+1)   q^(j)     λ^(j)
1        0.99990   0.80080   0.79286   1.12178
2        0.80080   0.40566   0.48150   1.98543
3        0.40566   0.17890   0.38204   4.53846
4        0.17890   0.08365   0.32056   10.82418
5        0.08365   0.04908   0.23082   23.56630
6        0.04908   0.03974   0.09934   40.40839
7        0.03974   0.03872   0.00954   49.98867
8        0.03872   0.03871   -         51.31539
9        0.03871   0.03871   -         51.33332

4. Computational considerations

4.1. Example with finite T

We consider φ_DIA defined in (6) with m = 4, T = {1, 2, 3}, N^{-1}(n_i)_i = (0.15, 0.1, 0.2, 0.55)^T, and
(l(i, j)) = … (a given 4 × 3 matrix).
To execute step 3 we used Algorithm 2 with the additional stopping rule to terminate if |β^(j+1) − β^(j)| < 10^{-7}.

Table 3 shows the results of Algorithm 1. We mention that a constant value of λ = 1 in Algorithm 2 defines an EM-algorithm (Dempster, Laird and Rubin, 1977). It should be mentioned that in this example the EM-algorithm based on the iteration

δ^(j+1)(t) = δ^(j)(t) Σ_i (n_i/N) l(i, t)/l(i, δ^(j)),   t ∈ T,

is not accurate to four decimal places after 1000 iterations!
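For comparison, the EM fixed-point iteration just mentioned is easy to state in code. In the sketch below the vector h holds the relative frequencies N^{-1}(n_i) from this example, while the matrix L is only a hypothetical stand-in, since the paper's (l(i, j)) did not survive reproduction.

```python
import numpy as np

# Sketch of the EM fixed-point iteration for phi_DIA in (6). L is a hypothetical
# 4x3 stand-in for the paper's matrix (l(i,j)): P(symptom i | disease j).

h = np.array([0.15, 0.10, 0.20, 0.55])   # N^{-1}(n_i), from Section 4.1
L = np.array([[0.7, 0.2, 0.1],           # hypothetical l(i, j)
              [0.1, 0.6, 0.1],
              [0.1, 0.1, 0.2],
              [0.1, 0.1, 0.6]])

delta = np.full(3, 1/3)                  # delta^(1): uniform on T = {1, 2, 3}
for _ in range(1000):
    mix = L @ delta                      # l(i, delta) = sum_j l(i,j) delta(j)
    delta = delta * (L.T @ (h / mix))    # delta(j) <- delta(j) sum_i h_i l(i,j)/l(i,delta)
print(delta)
```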
Table 3
Results of Algorithm 1 for φ = φ_DIA

Step j   δ^(j)   Time   Number of steps in Algorithm 2 to solve step 3   λ(β^(j))

[The individual rows of this table did not survive reproduction. Recoverable entries: the iteration starts from the uniform p.m. δ^(1) = (1/3, 1/3, 1/3)^T and terminates after 12 steps at δ ≈ (0.2103, 0.0424, 0.7473)^T, with the λ(β^(j))-values approaching 1.0000.]
4.2. Example with continuous T

We choose φ = φ_D defined in (1) and

f^T(t) = (1, t, t², t³, (t − a)³₊),   T = [−1, +1].

Because of inequality (8) we can use

∇̄φ_D(δ) := sup_{t∈T} ∇φ_D(δ, δ_t − δ)

as a criterion for the rate of convergence. Note that

∇φ_D(δ, δ_t − δ) = f^T(t) J(δ)^{-1} f(t) − 5.

Table 4 compares Algorithm 1 with three well-known algorithms for various a. As mentioned in Section 2, NEWTON (Atwood, 1976) is a fast converging algorithm, whereas PROGRAD (Wu, 1978) converges more slowly. These two algorithms along with the VDM are used as comparative algorithms. The data for NEWTON, PROGRAD, and VDM are taken from Table 7 in Böhning (1981, p. 71). Although Algorithm 1 is slower than NEWTON in all examples, in the second and third number it is again better than VDM and PROGRAD. It can be regarded as a strong competitor to NEWTON since Algorithm 1 needs less computational effort than NEWTON.
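The convergence criterion ∇̄φ_D(δ_j) is cheap to evaluate for a finitely supported design. A small Python sketch follows, with the grid over T = [−1, 1] and all names being assumptions of this illustration.

```python
import numpy as np

# Sketch of the criterion sup_t grad phi_D(delta, delta_t - delta) for the
# spline regression example; a is the knot, grid an assumed discretization of T.

def f(t, a):
    return np.array([1.0, t, t**2, t**3, max(t - a, 0.0)**3])

def sup_gradient(support, weights, a, grid=np.linspace(-1, 1, 2001)):
    # J(delta) = sum_i w_i f(t_i) f(t_i)^T for a finitely supported design
    J = sum(w * np.outer(f(t, a), f(t, a)) for t, w in zip(support, weights))
    Jinv = np.linalg.inv(J)
    # grad phi_D(delta, delta_t - delta) = f(t)^T J^{-1} f(t) - 5
    return max(f(t, a) @ Jinv @ f(t, a) - 5.0 for t in grid)
```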
Appendix

Here we offer some technical details to incorporate Algorithm 2 into step 3 of Algorithm 1. Let again φ be defined in a neighborhood of Δ⁺ and, in addition to ∇φ(δ, δ_t), let it possess the second-order directional derivative

∇²φ(δ, δ_t) := (∂²/∂β²) φ(δ + β δ_t)|_{β=0}

for all δ ∈ Δ⁺ and all t ∈ T. Step 3 of Algorithm 1 leads to the maximization of f defined by

f(β_1, β_2) := φ(δ_j + δ_j(t_j^(2))((β_1 − 1)δ_{t_j^(2)} + β_2 δ_{t_j^(1)})),

with g_i(β) = (∂f/∂β_i)(β, 1 − β), i = 1, 2, as in Section 3.
Table 4
Number of iterations such that ∇̄φ_D(δ_j) < 1.0 (first number), 0.1 (second number), 0.01 (third number)

a     VDM           PROGRAD     NEWTON      Algorithm 1
0.2   7, >50, >50   4, 26, 26   5, 8, 10    6, 11, 13
0.4   7, >50, >50   4, 15, 21   5, 6, 9     7, 12, 14
0.6   4, >50, >50   4, 9, 11    5, 8, 10    6, 10, 12
0.8   4, >50, >50   4, 10, 18   7, 10, 12   4, 9, 11
Table 5

φ       defined in   ∇φ(δ, δ_t)                       ∇²φ(δ, δ_t)
φ_D     (1)          u(t) := f^T(t)J(δ)^{-1}f(t)      −[f^T(t)J(δ)^{-1}f(t)]²
φ_ML    (4)          Σ_{i=1}^N l(x_i, t)/l(x_i, δ)    −Σ_{i=1}^N [l(x_i, t)/l(x_i, δ)]²
φ_PO    (5)          Σ_i n_i p_i(t)/p_i(δ)            −Σ_i n_i [p_i(t)/p_i(δ)]²
φ_DIA   (6)          Σ_{i=1}^m n_i l(i, t)/l(i, δ)    −Σ_{i=1}^m n_i [l(i, t)/l(i, δ)]²

with p_i(t) := e^{−t} t^i / i! and p_i(δ) := ∫ p_i(t) δ(dt).

Table 5 lists the directional derivatives ∇φ(δ, δ_t) and ∇²φ(δ, δ_t) for the various φ. From these expressions it is not difficult to compute g_1, g_1′, g_2, g_2′.
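As an illustration of how the Table 5 entries are used, the following sketch evaluates ∇φ_ML(δ, δ_t) and ∇²φ_ML(δ, δ_t) from (4) for a finitely supported δ; dens(x, t) stands for the component density l(x, t) and, like the other names, is an assumed convention of this sketch.

```python
import numpy as np

# Sketch of the Table 5 entries for phi_ML in (4): first and second directional
# derivatives at delta in direction delta_t, for a finitely supported delta.

def directional_derivatives(dens, data, support, weights, t):
    lx = np.array([sum(w * dens(x, s) for s, w in zip(support, weights))
                   for x in data])                   # l(x_i, delta)
    r = np.array([dens(x, t) for x in data]) / lx    # l(x_i, t) / l(x_i, delta)
    return r.sum(), -(r ** 2).sum()                  # grad and second derivative
```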
References

Atwood, C.L. (1973). Sequences converging to D-optimal designs of experiments. Ann. Statist. 1, 342-352.
Atwood, C.L. (1976). Convergent design sequences for sufficiently regular optimality criteria. Ann. Statist. 4, 1124-1138.
Bartholomew, D.J. (1980). Factor analysis for categorical data. J. Roy. Statist. Soc. Ser. B 42, 293-321.
Bauer, H. (1978). Wahrscheinlichkeitstheorie und Grundzüge der Maßtheorie. De Gruyter, Berlin.
Böhning, D. (1981). Numerische Verfahren in der optimalen Versuchsplanung. Inaugural-Dissertation am Fachbereich Mathematik der Freien Universität Berlin.
Böhning, D. (1982). Convergence of Simar's algorithm for finding the MLE of a compound Poisson process. Ann. Statist. 10, 1006-1008.
Böhning, D. (1983). Maximum likelihood estimation of the logarithmic series distribution. Statistische Hefte 24, 121-140.
Böhning, D. and K.-H. Hoffmann (1982). Numerical procedures for estimating probabilities. J. Statist. Comput. Simulation 14, 283-293.
Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood from incomplete data via the EM-algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1-38.
Everitt, B.S. and D.J. Hand (1981). Finite Mixture Distributions. Chapman and Hall, London.
Fedorov, V.V. (1972). Theory of Optimal Experiments (transl. and ed. by W.J. Studden and E.M. Klimko). Academic Press, New York.
Kiefer, J. (1974). General equivalence theory for optimum designs (approximate theory). Ann. Statist. 2, 849-879.
Lindsay, B.G. (1983). The geometry of mixture likelihoods: a general theory. Ann. Statist. 11, 86-94.
Silvey, S.D. (1980). Optimal Design. Chapman and Hall, London-New York.
Silvey, S.D., D.M. Titterington and B. Torsney (1978). An algorithm for optimal designs on a finite design space. Commun. Statist. A 7, 1379-1389.
Simar, L. (1976). Maximum likelihood estimation of a compound Poisson process. Ann. Statist. 4, 1200-1209.
Tsay, J.Y. (1976). On the sequential construction of D-optimal designs. J. Amer. Statist. Assoc. 71, 671-674.
Whittle, P. (1973). Some general points in the theory of optimal experimental design. J. Roy. Statist. Soc. Ser. B 35, 123-130.
Wu, C.F. (1978). Some iterative procedures for generating nonsingular optimal designs. Commun. Statist. A 7, 1399-1412.
Wynn, H.P. (1970). The sequential generation of D-optimum experimental designs. Ann. Math. Statist. 41, 1655-1664.