Journal of Statistical Planning and Inference 11 (1985) 57-69
North-Holland

NUMERICAL ESTIMATION OF A PROBABILITY MEASURE

Dankmar BÖHNING

Department of Epidemiology, Free University Berlin, Kelchstr. 31, 1000 Berlin 41, Federal Republic of Germany

Received 26 February 1983; revised manuscript received 27 February 1984
Recommended by P.K. Sen

Abstract: An iterative procedure for the computational solution of the problem of maximizing a concave functional $\phi$ defined on the set of all probability measures on a topological space is considered. Convergence aspects of this procedure are studied, and a rapidly converging algorithm is discussed. Examples of this algorithm, along with the ones developed earlier by Wynn, Fedorov, Atwood, Wu and others, are provided; they are taken from the area of mixture likelihoods and optimal experimental design.

AMS Subject Classification: Primary 62G05; Secondary 62K05.

Key words and phrases: Iterative procedure; Nonparametric estimation; Maximum likelihood estimation; Mixture likelihood; Optimal experimental design.

1. The problem and examples

In this paper we will consider the following problem:

(P)  Find $\hat\delta \in \Delta$ such that $\phi(\hat\delta) \ge \phi(\delta)$ for all $\delta \in \Delta$.

Here $\phi$ is a concave functional, with values in $\mathbb{R} \cup \{-\infty\}$, defined in a neighborhood of the set $\Delta$ of all probability measures (p.m.'s) on the Borel sigma-algebra of a compact topological space $T$, which is often a subset of the $p$-dimensional Euclidean space $\mathbb{R}^p$. We will look at two examples for problem (P) which arise in different areas of statistical interest. Both can have continuous $T$.

Example 1.1 (optimal experimental design). In (approximate) optimal design theory we try to find an approximate design $\delta$ (e.g. a p.m.) on the design space $T$ such that the information matrix
$$J(\delta) := \int_T f f^T \, d\delta$$
is as 'big' as possible (Kiefer, 1974; Silvey, 1980), where $f : T \to \mathbb{R}^p$ is a known vector of regression functions. The 'bigness' of $J(\delta)$ is measured with respect to a concave functional such as
$$\phi_D(\delta) = \log \det\bigl(J(\delta)\bigr) \qquad (1)$$
with the convention $\log 0 = -\infty$. A p.m. $\hat\delta$ which solves (P) for $\phi = \phi_D$ is called an approximate D-optimal design. Other criteria are in use, so that we may also consider a criterion $\phi_Z(\delta) = Z(J(\delta))$ for some concave functional $Z$ defined on the set of all information matrices.
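For a design with finite support the criterion (1) can be evaluated directly. The following is a minimal sketch, assuming Python with NumPy; the cubic regression function, the four support points and the equal weights are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the D-criterion (1) for a finite-support design,
# assuming NumPy; regression function, support points and weights are
# illustrative choices, not taken from the paper.
def information_matrix(f, points, weights):
    # J(delta) = sum_i alpha_i f(t_i) f(t_i)^T
    return sum(w * np.outer(f(t), f(t)) for t, w in zip(points, weights))

def phi_D(f, points, weights):
    # log det J(delta), with the convention log 0 = -infinity
    sign, logdet = np.linalg.slogdet(information_matrix(f, points, weights))
    return logdet if sign > 0 else -np.inf

f = lambda t: np.array([1.0, t, t ** 2, t ** 3])   # cubic regression on [-1, 1]
points = np.array([-1.0, -0.447, 0.447, 1.0])      # a 4-point design (illustrative)
weights = np.full(4, 0.25)
print(phi_D(f, points, weights))
```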

Example 1.2 (maximum likelihood estimation of a mixture distribution). We consider a parametric family of densities $f(x, t)$ with respect to a sampling space $X \subset \mathbb{R}$ and a parameter space $T$. The mixture distribution comes in if we want to determine the prior distribution of the parameter $t$. Then
$$l(x, \delta) := \int_T f(x, t)\, \delta(dt) \qquad (2)$$
also defines a family of densities with new parameter space $\Delta$. For example, let $x$ be Poisson distributed; then (2) becomes
$$l(x, \delta) := \int_0^\infty e^{-t} t^x / x!\ \delta(dt).$$
The problem of finding the mixing distribution of a Poisson process is treated in Simar (1976) and Böhning (1982). A more general discussion of mixture likelihoods can be found in Lindsay (1983) and Everitt and Hand (1981). Often the problem of finding a mixing distribution is formulated in terms of finding the distribution of a latent variable rather than in terms of finding a prior distribution. For example, in medical diagnosis a situation arises where we have a symptom space $X = \{1, \dots, m\}$ and a disease space $T = \{1, \dots, n\}$. Then $l(i, j)$ gives the conditional probability that a person has the symptom $i$ given that he/she has disease $j$. In this case (2) reduces to
$$l(i, \delta) = \sum_{j=1}^{n} l(i, j)\, \delta(j) \qquad (3)$$
and $\delta$ gives the distribution of the disease variable in the population. In other latent-variable approaches $X$ is finite and $T$ continuous (Bartholomew, 1980). Having observed $x_1, \dots, x_N$, the problem of maximum likelihood estimation leads to (P) with
$$\phi(\delta) = \phi_{ML}(\delta) = \sum_{i=1}^{N} \log \int_T f(x_i, t)\, \delta(dt). \qquad (4)$$
Let $n_i$ denote the number of those $x_1, \dots, x_N$ which are equal to $i$. It then follows that in the Poisson case (4) becomes
$$\phi_{PO}(\delta) = \sum_{i \ge 0} n_i \log \int_0^\infty e^{-t} t^i\, \delta(dt) \qquad (5)$$
(the constant factors $1/i!$ only shift $\phi$ by a constant and are omitted),


and in the medical diagnosis problem (see (3)), (4) becomes
$$\phi_{DIA}(\delta) = \sum_{i=1}^{m} n_i \log \sum_{j=1}^{n} l(i, j)\, \delta(j). \qquad (6)$$
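To make (6) concrete, here is a small sketch, assuming Python with NumPy; the matrix L and the counts n are hypothetical stand-ins, not the data of Section 4.1, and are chosen only so that the call runs.

```python
import numpy as np

# Sketch of the mixture log-likelihood (6) for finite symptom and disease
# spaces, assuming NumPy; L and n are hypothetical stand-ins.
def phi_DIA(L, n, delta):
    # phi_DIA(delta) = sum_i n_i log( sum_j l(i, j) delta(j) )
    mix = L @ delta                      # l(i, delta) for every symptom i
    if np.any(mix <= 0.0):
        return -np.inf                   # convention log 0 = -infinity
    return float(n @ np.log(mix))

L = np.array([[0.6, 0.2, 0.1],           # columns: P(symptom i | disease j)
              [0.2, 0.5, 0.1],
              [0.1, 0.2, 0.3],
              [0.1, 0.1, 0.5]])
n = np.array([15.0, 10.0, 20.0, 55.0])   # symptom counts n_1, ..., n_m
print(phi_DIA(L, n, np.array([1 / 3, 1 / 3, 1 / 3])))
```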

Notations. We now introduce the technical apparatus. Let $\Delta^+ := \{\delta \in \Delta \mid \phi(\delta) > -\infty\}$ and, for $K \in \mathbb{R}$, $\Delta_K := \{\delta \in \Delta \mid \phi(\delta) \ge K\}$. Also let $\Delta_D$ denote the set of all p.m.'s with finite support, that is, all those $\delta \in \Delta$ which have a representation
$$\delta = \sum_i \alpha_i\, \delta_{t_i}$$
with $\alpha_i > 0$, $\sum_i \alpha_i = 1$, $t_i \in T$, and $\delta_t$ denoting the unit measure at $t \in T$. Also
$$\Delta_D^+ := \Delta^+ \cap \Delta_D, \qquad \Delta_D^K := \Delta_K \cap \Delta_D.$$
$B$ denotes the set of all signed measures, and $S(\delta)$ the support of $\delta \in B$.

Assumptions. (i) $\phi$ is continuous on $\Delta^+$ with respect to the vague topology, and if $\delta_n \to \delta \in \Delta \setminus \Delta^+$ then $\phi(\delta_n) \to -\infty$.
(ii) For all $\delta \in \Delta^+$ and all $\gamma \in B$ the directional derivative at $\delta$ in the direction $\gamma$,
$$\nabla\phi(\delta, \gamma) := \frac{\partial}{\partial\beta}\, \phi(\delta + \beta\gamma)\Big|_{\beta = 0},$$
exists and is linear in $\gamma$. Specifically, the directional derivative at $\delta$ in any vertex direction $\delta_t - \delta$ exists and is equal to $\nabla\phi(\delta, \delta_t) - \nabla\phi(\delta, \delta)$. We assume further that the latter is continuous in $t$. For $\delta \in \Delta$ define
$$TMAX(\delta) := \Bigl\{\, t \in T \;\Big|\; \sup_{t' \in T} \nabla\phi(\delta, \delta_{t'}) = \nabla\phi(\delta, \delta_t) \,\Bigr\},$$
$$TMIN(\delta) := \Bigl\{\, t \in S(\delta) \;\Big|\; \inf_{t' \in S(\delta)} \nabla\phi(\delta, \delta_{t'}) = \nabla\phi(\delta, \delta_t) \,\Bigr\}.$$
Note that $TMAX(\delta)$ and $TMIN(\delta)$ are both subsets of $T$ and nonempty.
(iii) For all $\gamma > 0$ and $K \in \mathbb{R}$ there exists $\beta_0 = \beta_0(\gamma, K) \in (0, 1)$ such that $\nabla\phi(\delta, \delta_t - \delta) \ge \gamma$ implies
$$\phi\bigl(\delta + \beta\,\delta(t')\{\delta_t - \delta_{t'}\}\bigr) - \phi(\delta) \ge \tfrac{1}{2}\,\delta(t')\,\beta\,\gamma$$
for all $\beta \in [0, \beta_0]$, all $\delta \in \Delta_K$, all $t \in T$, and all $t' \in TMIN(\delta)$.

Remark. Assumption (i) ensures the existence of a solution of problem (P). $T$ is compact, from which we can conclude that $\Delta$ is vaguely compact (Bauer, 1978, p. 224). $\Delta_K$ is vaguely closed, since $\phi$ is continuous, and, as a closed subset of a compact set, it is itself compact. Therefore $\phi$ attains its maximum on $\Delta_K$. Because of the second part of assumption (i) there exists $K \in \mathbb{R}$ such that
$$\sup_{\delta \in \Delta_K} \phi(\delta) = \sup_{\delta \in \Delta_D^K} \phi(\delta).$$
Assumption (ii) allows the definition of our algorithm. Note that we have
$$\nabla\phi(\delta, \delta_t - \delta) = \nabla\phi(\delta, \delta_t) - \nabla\phi(\delta, \delta) \qquad (7)$$
and
$$\nabla\phi(\delta, \hat\delta - \delta) \ge \phi(\hat\delta) - \phi(\delta) \qquad (8)$$
for $\delta \in \Delta^+$ and $\hat\delta \in \Delta^+$ a solution of problem (P). From (8) follows the equivalence theorem of Whittle (1973), which confirms the equivalence of $\sup_{t \in T} \nabla\phi(\hat\delta, \delta_t - \hat\delta) \le 0$ and $\phi(\hat\delta) \ge \phi(\delta)$ for all $\delta \in \Delta$. Assumption (iii) is essential for the convergence (to an optimum) of our algorithm. It asserts that the ascent of the functional will be at least linear if a certain non-optimality condition is satisfied. If $\phi = \psi \circ m$, where $m$ is any convex linear mapping $m : \Delta \to \mathbb{R}^N$, i.e. $m\{(1-\beta)\delta + \beta\bar\delta\} = (1-\beta)m(\delta) + \beta m(\bar\delta)$, $\beta \in [0, 1]$, $\delta, \bar\delta \in \Delta$, and $\psi : \mathbb{R}^N \to \mathbb{R} \cup \{-\infty\}$ is continuously differentiable on $M_K := \{m(\delta) \mid \delta \in \Delta,\ \psi(m(\delta)) \ge K\}$, $K \in \mathbb{R}$, then assumption (iii) is met. This can be worked out similarly to the proof of the corollary in Böhning (1982). Note that for all functionals considered thus far, assumptions (i)-(iii) are fulfilled.

2. The algorithm and its convergence

For problem (P) the Fedorov-Wynn algorithm (Fedorov, 1972; Wynn, 1970) provides a solution in the following way. Wu (1978) suggested that the name Vertex-Direction Method is more appropriate.

Vertex-Direction Method (VDM)
1. Select $\delta_1 \in \Delta_D^+$. Set $j = 1$.
2. Choose $t_j \in TMAX(\delta_j)$.
3. Find $\beta_j \in [0, 1]$ such that $\phi\bigl(\delta_j + \beta_j(\delta_{t_j} - \delta_j)\bigr) = \sup_{\beta \in [0,1]} \phi\bigl(\delta_j + \beta(\delta_{t_j} - \delta_j)\bigr)$.
4. Set $\delta_{j+1} := \delta_j + \beta_j(\delta_{t_j} - \delta_j)$, set $j := j + 1$, and go to 2.
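For a finite candidate set the VDM can be sketched as follows, assuming Python with NumPy; `phi` and `grad_vertex` are user-supplied callables (the criterion value and the vertex directional derivatives), and the grid line search in step 3 is a crude stand-in for the exact line maximization, for which the paper uses Algorithm 2.

```python
import numpy as np

# Rough sketch of the VDM on a finite candidate set, assuming NumPy.
# `phi` maps a weight vector to the criterion value; `grad_vertex` returns
# nabla phi(delta, delta_t) for every candidate point t. Both are user-supplied.
def vdm(phi, grad_vertex, w0, iterations=50, grid=np.linspace(0.0, 1.0, 201)):
    w = w0.copy()
    for _ in range(iterations):
        d = grad_vertex(w)
        t_max = int(np.argmax(d))               # step 2: t_j in TMAX(delta_j)
        e = np.zeros_like(w)
        e[t_max] = 1.0                          # unit measure at t_max
        values = [phi((1.0 - b) * w + b * e) for b in grid]
        beta = grid[int(np.argmax(values))]     # step 3: approximate line search
        w = (1.0 - beta) * w + beta * e         # step 4
    return w
```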
Atwood (1973) suggested some refinements of the VDM. The Exchange Method of Böhning (1981) can also be viewed as an improvement of the VDM. A new concept was introduced by Atwood (1976) and Wu (1978). Atwood suggested a projected Newton algorithm (which we refer to here as NEWTON) and Wu discussed various iteration functions, including a projected gradient algorithm (which we call here PROGRAD). With respect to the VDM, PROGRAD and especially NEWTON have an improved convergence behavior. On the other hand, NEWTON has a high numerical complexity. Our aim has been to find an algorithm which retains a low numerical complexity while showing a 'good' convergence rate.

Algorithm 1
1. Select $\delta_1 \in \Delta_D^+$. Set $j = 1$.
2. Choose $t_j^{(1)} \in TMAX(\delta_j)$ and $t_j^{(2)} \in TMIN(\delta_j)$.
3. Find $\beta_j \in [0, 1]$ such that $\phi\bigl(\delta_j + \beta_j\,\delta_j(t_j^{(2)})(\delta_{t_j^{(1)}} - \delta_{t_j^{(2)}})\bigr) = \sup_{\beta \in [0,1]} \phi\bigl(\delta_j + \beta\,\delta_j(t_j^{(2)})(\delta_{t_j^{(1)}} - \delta_{t_j^{(2)}})\bigr)$.
4. Set $\delta_{j+1} := \delta_j + \beta_j\,\delta_j(t_j^{(2)})(\delta_{t_j^{(1)}} - \delta_{t_j^{(2)}})$, set $j := j + 1$, and go to 2.

Whereas the VDM always leaves mass at the initial support points, Algorithm 1 can move the mass from one point to another in one step. It therefore supports the exchange of points if it is needed. On the other hand, Algorithm 1 varies only the weights of two points in one step while the others are kept constant.
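A corresponding sketch of one iteration of Algorithm 1, under the same assumptions as the VDM sketch above (finite candidate set, Python with NumPy, user-supplied `phi` and `grad_vertex`); the weight of the worst support point is moved towards the best candidate point, so only two weights change per step.

```python
import numpy as np

# One iteration of Algorithm 1 on a finite candidate set, assuming NumPy.
# The mass delta_j(t^(2)) of the worst support point is moved, partially or
# fully, to the best candidate point t^(1).
def algorithm1_step(phi, grad_vertex, w, grid=np.linspace(0.0, 1.0, 201)):
    d = grad_vertex(w)
    support = np.where(w > 0.0)[0]
    t1 = int(np.argmax(d))                        # t^(1) in TMAX(delta_j)
    t2 = int(support[np.argmin(d[support])])      # t^(2) in TMIN(delta_j)
    direction = np.zeros_like(w)
    direction[t1] += w[t2]                        # delta_j(t^(2)) (delta_t1 - delta_t2)
    direction[t2] -= w[t2]
    values = [phi(w + b * direction) for b in grid]
    beta = grid[int(np.argmax(values))]           # step 3 (here: grid search)
    return w + beta * direction                   # step 4
```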
Theorem 2.1. If $\phi$ meets assumptions (i), (ii), and (iii), then for any sequence $(\delta_j)$ created by Algorithm 1 we have
$$\lim_{j \to \infty} \phi(\delta_j) = \sup_{\delta \in \Delta} \phi(\delta).$$

Proof. By construction we have
$$\cdots \ge \phi(\delta_{j+1}) \ge \phi(\delta_j) \ge \cdots \ge K_1 := \phi(\delta_1)$$
for all $j \in \mathbb{N}$. Therefore $\delta_j \in \Delta_D^{K_1}$ for all $j \in \mathbb{N}$. Since $(\phi(\delta_j))$ is bounded above, it converges. Let $\phi^+$ be its limit. Assume $\phi^+ < \phi(\hat\delta)$. Then
$$\nabla\phi\bigl(\delta_j, \delta_{t_j^{(1)}} - \delta_j\bigr) \ge \phi(\hat\delta) - \phi(\delta_j) \ge \phi(\hat\delta) - \phi^+ =: 2\gamma > 0$$
for some suitable $\gamma$, using (8). Thus, by assumption (iii),
$$\phi(\delta_{j+1}) - \phi(\delta_j) \ge \phi\bigl(\delta_j + \beta_0\,\delta_j(t_j^{(2)})(\delta_{t_j^{(1)}} - \delta_{t_j^{(2)}})\bigr) - \phi(\delta_j) \ge \tfrac{1}{2}\,\delta_j(t_j^{(2)})\,\beta_0\,\gamma > 0 \qquad (9)$$
for all $j \in \mathbb{N}$. Since $\Delta_D^{K_1}$ is vaguely compact there exists a subsequence $(\delta_{j_k})$ with $\delta_{j_k} \to \bar\delta \in \Delta_D^{K_1}$. For the sake of simplicity let us call this new sequence $(\delta_j)$. Recall that $\delta_j \to \bar\delta$ in the vague topology means that
$$\int f \, d\delta_j \to \int f \, d\bar\delta$$
for every continuous $f$ on $T$. Consequently, for each convergent sequence $(t_j)$, $t_j \in S(\delta_j)$, there exists $t \in S(\bar\delta)$ such that
$$\lim_{j \to \infty} t_j = t \quad\text{and}\quad \lim_{j \to \infty} \delta_j(t_j) = \bar\delta(t).$$
Now let $(t_{j_k}^{(2)})_k$ be a convergent subsequence with $t_{j_k}^{(2)} \in TMIN(\delta_{j_k})$. Then
$$\lim_{k \to \infty} t_{j_k}^{(2)} \in S(\bar\delta) \quad\text{and}\quad \lim_{k \to \infty} \delta_{j_k}\bigl(t_{j_k}^{(2)}\bigr) = \bar\delta\bigl(\lim_k t_{j_k}^{(2)}\bigr) =: v,$$
and $v > 0$ because this limit point lies in the support of $\bar\delta$. Thus we attain from (9) that there exists $j_0 \in \mathbb{N}$ such that
$$\phi(\delta_{j_k + 1}) - \phi(\delta_{j_k}) \ge \tfrac{1}{2}\, v\, \beta_0\, \gamma > 0$$
for infinitely many $j_k \ge j_0$, contradicting the Cauchy property of $(\phi(\delta_j))$.

Before we give numerical demonstrations of Algorithm 1 we discuss a quadratically convergent algorithm for finding the line minimizer.

3. An algorithm for finding the line minimizer
In step 3 of the VDM and of Algorithm 1 we have to solve a problem of the following type:

(LM)  Find $\hat\beta \in [0, 1]$ such that $f(1 - \hat\beta, \hat\beta) \ge f(1 - \beta, \beta)$ for all $\beta \in [0, 1]$.

Here $f$ is the functional of interest, which we have defined on $\mathbb{R}^2_+$ for reasons which will soon become clear. One can apply the Newton-Raphson iteration to find $\hat\beta$. The main disadvantage of this concept lies in the fact that the iterated elements may leave the feasible region $[0, 1]$. An example of this behavior is discussed in Böhning (1983). We therefore have developed a technique which forces all iterated elements to stay in the feasible region and exploits the positivity of the gradient of the functional of interest, a concept introduced by Silvey, Titterington and Torsney (1978). Thus, we assume from here on that $f$ has positive partial derivatives. An alternative formulation of (LM) is

(LM')  Find $(\hat\beta_1, \hat\beta_2)^T \in \Omega$ such that $f(\hat\beta_1, \hat\beta_2) \ge f(\beta_1, \beta_2)$ for all $(\beta_1, \beta_2)^T \in \Omega$.

Here $\Omega$ denotes the set of all $(\beta_1, \beta_2)^T \in \mathbb{R}^2_+$ with $\beta_1 + \beta_2 = 1$. Define $g_1, g_2 : (0, 1) \to \mathbb{R}_+$ by
$$g_1(\beta) := \frac{\partial f}{\partial\beta_1}(\beta, 1 - \beta), \qquad g_2(\beta) := \frac{\partial f}{\partial\beta_2}(\beta, 1 - \beta).$$
Then a solution of problem (LM) can be characterized as the fixed point in $\beta$, for arbitrary $\lambda > 0$, of $F : (0, 1) \times \mathbb{R}_+ \to (0, 1)$ given by
$$F(\beta, \lambda) = \frac{\beta\,(g_1(\beta))^\lambda}{\beta\,(g_1(\beta))^\lambda + (1 - \beta)\,(g_2(\beta))^\lambda}. \qquad (10)$$
We now want to find $\hat\beta \in (0, 1)$ such that $F(\hat\beta, \lambda) = \hat\beta$ for all $\lambda > 0$.

The idea of the algorithm is as follows. We start at some initial value $\beta$ and try to find a new value $F(\beta, \lambda)$, dependent on $\lambda$, such that the difference
$$F(\hat\beta, \lambda) - F(\beta, \lambda) = \hat\beta - F(\beta, \lambda) \qquad (11)$$
is small. Of course, we do not know the fixed point $\hat\beta$. Therefore we choose a first-order approximation of (11), namely $(\partial F/\partial\beta)(\beta, \lambda)\,(\hat\beta - \beta)$. We now look for a value $\lambda(\beta)$ such that
$$\frac{\partial F}{\partial\beta}\bigl(\beta, \lambda(\beta)\bigr) = 0.$$
It is not difficult to see that
$$\frac{\partial F}{\partial\beta}(\beta, \lambda) = \frac{(g_1 g_2)^\lambda \Bigl[\,1 + \lambda\,\beta(1 - \beta)\Bigl(\dfrac{g_1'}{g_1} - \dfrac{g_2'}{g_2}\Bigr)\Bigr]}{\bigl[\beta\, g_1^\lambda + (1 - \beta)\, g_2^\lambda\bigr]^2}, \qquad (12)$$
in which we used the following abbreviations: $g_i = g_i(\beta)$, $g_i' = g_i'(\beta) = (dg_i/d\beta)(\beta)$, $i = 1, 2$. (12) becomes zero if $\lambda$ is chosen according to
$$\lambda(\beta) = \Bigl[\,\beta(1 - \beta)\Bigl(\frac{g_2'(\beta)}{g_2(\beta)} - \frac{g_1'(\beta)}{g_1(\beta)}\Bigr)\Bigr]^{-1}, \qquad (13)$$
assuming that the expression on the right side of (13) exists. Thus we are in the position to formulate the following.

Algorithm 2
1. Choose $\beta^{(1)} \in (0, 1)$. Set $j = 1$.
2. If $g_1(\beta^{(j)}) = g_2(\beta^{(j)})$, stop.
3. Compute $\lambda^{(j)} = \lambda(\beta^{(j)})$ according to (13). If not possible, stop.
4. Set $\beta^{(j+1)} = F(\beta^{(j)}, \lambda^{(j)})$, $j = j + 1$, and go to 2.
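A sketch of Algorithm 2, assuming Python with NumPy; `g1`, `g2`, `g1p`, `g2p` are hypothetical names for the partial derivatives $g_1, g_2$ and their derivatives $g_1', g_2'$, which must be supplied for the functional at hand.

```python
import numpy as np

# Sketch of Algorithm 2, assuming NumPy. g1, g2 are the partial derivatives
# of f along (beta, 1 - beta); g1p, g2p are their derivatives in beta.
def algorithm2(g1, g2, g1p, g2p, beta, steps=25, tol=1e-10):
    for _ in range(steps):
        a, b = g1(beta), g2(beta)
        if abs(a - b) <= tol:                       # step 2: gradients balanced
            return beta
        denom = beta * (1.0 - beta) * (g2p(beta) / b - g1p(beta) / a)
        if denom <= 0.0:                            # step 3: (13) not usable
            return None
        lam = 1.0 / denom                           # lambda(beta) from (13)
        num = beta * a ** lam
        beta = num / (num + (1.0 - beta) * b ** lam)   # step 4: F(beta, lambda)
    return beta
```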
Remark. If Algorithm 2 terminates in step 2, we know that a maximizing value of $\beta$ has been found. If it terminates in step 3, Algorithm 2 has failed. Note that this cannot occur if $f$ is strictly concave in a neighborhood of $\hat\beta$. If $\nabla^2 f$ denotes the Hessian of $f$, we have in this case
$$g_1'(\beta) - g_2'(\beta) = (1, -1)\,\nabla^2 f(\beta, 1 - \beta)\begin{pmatrix} 1 \\ -1 \end{pmatrix} < 0$$
for all $\beta$ in this neighborhood of $\hat\beta$. Because $g_1(\beta) = g_2(\beta) + o(1)$ as $\beta \to \hat\beta$, we find that (13) is well defined in a neighborhood of $\hat\beta$. We now come to the main theorem of this section.

Theorem 3.1. Let $f$ be defined on $\Omega$ and have continuous partial derivatives up to the order of three. In addition, let $f$ be strictly concave in a neighborhood of $\hat\beta$. Then there exists $\eta > 0$ such that if $|\beta^{(1)} - \hat\beta| \le \eta$, the sequence produced by Algorithm 2 converges quadratically to $\hat\beta$.

Proof. From the remarks following the formulation of Algorithm 2 we can conclude that there exists $\eta_1 > 0$ such that $\lambda(\beta)$ is well defined if $|\beta - \hat\beta| \le \eta_1$. Now, let us define $B_{\eta_1}(\hat\beta) := \{\beta : |\beta - \hat\beta| \le \eta_1\}$; then $B_{\eta_1}(\hat\beta) \times \lambda(B_{\eta_1}(\hat\beta)) \subset [0, 1] \times \mathbb{R}$ is a compact set. We now apply Taylor's theorem:
$$F(\hat\beta, \lambda(\beta)) - F(\beta, \lambda(\beta)) = \frac{\partial F}{\partial\beta}(\beta, \lambda(\beta))\,(\hat\beta - \beta) + \frac{1}{2}\,\frac{\partial^2 F}{\partial\beta^2}\bigl(\beta + \alpha(\hat\beta - \beta), \lambda(\beta)\bigr)\,(\hat\beta - \beta)^2$$
with $\alpha \in [0, 1]$. Because $(\partial F/\partial\beta)(\beta, \lambda(\beta)) = 0$ and $\partial^2 F/\partial\beta^2$ is bounded on $B_{\eta_1}(\hat\beta) \times \lambda(B_{\eta_1}(\hat\beta))$, we arrive at
$$|\hat\beta - F(\beta, \lambda(\beta))| \le c\,|\hat\beta - \beta|^2$$
for $\beta \in B_{\eta_1}(\hat\beta)$ and some constant $c$, from which the theorem can easily be derived.

Example 3.2. $f$ is the log-likelihood function of the logarithmic series distribution, that is,
$$f(\beta_1, \beta_2) = \bar{x}\,\log\beta_1 - \log(-\log\beta_2),$$
where $\bar{x}$ is the mean of the observed frequencies. According to (13) we find
$$\lambda(\beta) = \frac{1}{1 + \beta/\log(1 - \beta)}.$$
We consider two examples. (i) $\bar{x} = 1.5$: this is the good-natured example; here the classical Newton procedure also leads to the solution $\hat\beta = 0.5336$. (ii) $\bar{x} = 1.02$: this is the ill-natured example; the classical Newton procedure converges to $\beta = -138.14$, whereas $\hat\beta = 0.0387$. We now take both cases as test examples for our Algorithm 2. In example (i) the MLE is found in the second step; see Table 1. In the other example $\hat\beta$ is found in the 8th step, starting as far away from $\hat\beta$ as we can; see Table 2. Column 4 gives the ratio $|\beta^{(j+1)} - \hat\beta| / |\beta^{(j)} - \hat\beta|$. This column shows nicely the superlinear convergence rate of Algorithm 2. Especially, column 5 of Table 2 shows that there will not be a specific constant $\lambda$-value but rather a wide range of possible values of optimal $\lambda$-parameters.

Table 1
Convergence of Algorithm 2 for $\bar{x} = 1.5$

Step j   $\beta^{(j)}$   $\beta^{(j+1)}$   $|\beta^{(j+1)} - \hat\beta|/|\beta^{(j)} - \hat\beta|$   $\lambda^{(j)}$
1        0.50000         0.53489           0.03871                                                   3.58870
2        0.53489         0.53359           0.00083                                                   3.31964
3        0.53359         0.53359           -                                                         3.32906

Table 2
Convergence of Algorithm 2 for $\bar{x} = 1.02$

Step j   $\beta^{(j)}$   $\beta^{(j+1)}$   $|\beta^{(j+1)} - \hat\beta|/|\beta^{(j)} - \hat\beta|$   $\lambda^{(j)}$
1        0.99990         0.80080           0.79286                                                   1.12178
2        0.80080         0.40566           0.48150                                                   1.98543
3        0.40566         0.17890           0.38204                                                   4.53846
4        0.17890         0.08365           0.32056                                                   10.82418
5        0.08365         0.04908           0.23082                                                   23.56630
6        0.04908         0.03974           0.09934                                                   40.40839
7        0.03974         0.03872           0.00954                                                   49.98867
8        0.03872         0.03871           -                                                         51.31539
9        0.03871         0.03871           -                                                         51.33332
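As a check of the closed form $\lambda(\beta)$ above, the following sketch (assuming Python with NumPy) runs the fixed-point iteration for the ill-natured case $\bar{x} = 1.02$, starting close to 1; the printed iterates fall towards $\hat\beta \approx 0.0387$, in line with Table 2.

```python
import numpy as np

# Fixed-point iteration for the logarithmic series likelihood, assuming NumPy;
# reruns the ill-natured case (x-bar = 1.02) of Table 2.
xbar = 1.02
g1 = lambda b: xbar / b                              # df/dbeta_1 at (beta, 1 - beta)
g2 = lambda b: -1.0 / ((1.0 - b) * np.log(1.0 - b))  # df/dbeta_2 at (beta, 1 - beta)

beta = 0.9999
for j in range(10):
    lam = 1.0 / (1.0 + beta / np.log(1.0 - beta))    # lambda(beta) from (13)
    num = beta * g1(beta) ** lam
    beta = num / (num + (1.0 - beta) * g2(beta) ** lam)
    print(j + 1, round(beta, 5), round(lam, 5))
```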
4. Computational considerations

4.1. Example with finite T

We consider $\phi_{DIA}$ defined in (6) with $m = 4$, $T = \{1, 2, 3\}$, relative frequencies $N^{-1}(n_i)_i = (0.15, 0.1, 0.2, 0.55)^T$, and a given $4 \times 3$ matrix $(l(i, j))$ of conditional probabilities. To execute step 3 we used Algorithm 2 with the additional stopping rule to terminate if $|\beta^{(j+1)} - \beta^{(j)}| < 10^{-7}$.

Table 3 shows the results of Algorithm 1. We mention that a constant value of $\lambda = 1$ in Algorithm 2 defines an EM algorithm (Dempster, Laird and Rubin, 1977). It should be mentioned that in this example the EM algorithm, based on the iteration
$$\delta_{j+1}(t) = \delta_j(t)\,\sum_{i=1}^{m} \frac{n_i}{N}\,\frac{l(i, t)}{l(i, \delta_j)}, \qquad t \in T,$$
is not accurate to four decimal places after 1000 iterations!
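For comparison, here is a sketch of this standard EM fixed-point update for the finite model (6), assuming Python with NumPy; `L`, `n`, `delta` are as in the earlier hypothetical sketch. The remark above identifies this type of update with the constant choice $\lambda = 1$ in Algorithm 2.

```python
import numpy as np

# Standard EM fixed-point update for the finite mixture model (6), assuming
# NumPy; each disease weight is rescaled by the average posterior probability
# of that disease given the observed symptoms.
def em_step(L, n, delta):
    mix = L @ delta                        # l(i, delta) for every symptom i
    resp = (L * delta) / mix[:, None]      # posterior of disease j given symptom i
    return (n / n.sum()) @ resp            # new weights; they still sum to one

# Example use with the hypothetical L and n from the earlier sketch:
# delta = np.full(3, 1 / 3)
# for _ in range(1000):
#     delta = em_step(L, n, delta)
```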
Table 3
Results of Algorithm 1 for $\phi = \phi_{DIA}$: for each step $j$ the table reports the current weights $\delta^{(j)}$, the value $\lambda(\beta^{(j)})$, the number of steps Algorithm 2 needed to solve step 3, and the computing time. Starting from the uniform weights $(1/3, 1/3, 1/3)$, the weights converge to approximately $(0.2103, 0.0424, 0.7473)$.

4.2. Example with continuous T

We choose $\phi = \phi_D$ defined in (1) and
$$f^T(t) = \bigl(1,\ t,\ t^2,\ t^3,\ (t - a)^3_+\bigr), \qquad T = [-1, +1].$$
Because of inequality (8) we can use
$$\bar\nabla\phi_D(\delta) := \sup_{t \in T} \nabla\phi_D(\delta, \delta_t - \delta)$$
as a criterion for the rate of convergence. Note that
$$\nabla\phi_D(\delta, \delta_t - \delta) = f^T(t)\, J(\delta)^{-1} f(t) - 5.$$
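This convergence criterion can be evaluated on a grid of candidate points as in the following sketch, assuming Python with NumPy; the knot value and the grid resolution are illustrative choices, and a nonsingular design (points, weights) is assumed.

```python
import numpy as np

# Sketch of the convergence criterion of this subsection, assuming NumPy:
# sup over a grid of t of  f(t)^T J(delta)^{-1} f(t) - p,  which by (8) bounds
# phi_D(delta-hat) - phi_D(delta_j). Knot a and grid resolution are illustrative.
a = 0.4
f = lambda t: np.array([1.0, t, t ** 2, t ** 3, max(t - a, 0.0) ** 3])

def d_criterion_gap(points, weights, grid=np.linspace(-1.0, 1.0, 401)):
    J = sum(w * np.outer(f(t), f(t)) for t, w in zip(points, weights))
    J_inv = np.linalg.inv(J)               # nonsingular design assumed
    variances = np.array([f(t) @ J_inv @ f(t) for t in grid])
    return float(variances.max()) - len(f(0.0))    # p = 5 regression functions
```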
Table 4 compares Algorithm 1 with three well-known algorithms for various $a$. As mentioned in Section 2, NEWTON (Atwood, 1976) is a fast converging algorithm, whereas PROGRAD (Wu, 1978) converges more slowly. These two algorithms, along with the VDM, are used as comparative algorithms. The data for NEWTON, PROGRAD, and VDM are taken from Table 7 in Böhning (1981, p. 71). Although Algorithm 1 is slower than NEWTON in all examples, in the second and third number it is again better than VDM and PROGRAD. It can be regarded as a strong competitor to NEWTON, since Algorithm 1 needs less computational effort than NEWTON.

Table 4
Number of iterations such that $\bar\nabla\phi_D(\delta_j) < 1.0$ (first number), $0.1$ (second number), $0.01$ (third number)

a      VDM            PROGRAD      NEWTON       Algorithm 1
0.2    7, >50, >50    4, 26, 26    5, 8, 10     6, 11, 13
0.4    7, >50, >50    4, 15, 21    5, 6, 9      7, 12, 14
0.6    4, >50, >50    4, 9, 11     5, 8, 10     6, 10, 12
0.8    4, >50, >50    4, 10, 18    7, 10, 12    4, 9, 11

Appendix

Here we offer some technical details to incorporate Algorithm 2 into step 3 of Algorithm 1. Let $\phi$ again be defined in a neighborhood of $\Delta^+$ and, in addition to $\nabla\phi(\delta, \delta_t)$, let it possess the second-order directional derivative
$$\nabla^2\phi(\delta, \delta_t) := \frac{\partial^2}{\partial\beta^2}\,\phi\bigl(\delta + \beta(\delta_t - \delta)\bigr)\Big|_{\beta = 0}$$
for all $\delta \in \Delta^+$ and all $t \in T$. Step 3 of Algorithm 1 then leads to the maximization of a functional $f$ of the type treated in Section 3, and from the expressions for $\nabla\phi$ and $\nabla^2\phi$ it is not difficult to compute $g_1$, $g_1'$, $g_2$, $g_2'$. Table 5 lists the first and second directional derivatives for the various $\phi$: for $\phi_D$ of (1), $\phi_{ML}$ of (4), $\phi_{PO}$ of (5) and $\phi_{DIA}$ of (6); for instance, for $\phi_D$ one has $\nabla\phi_D(\delta, \delta_t) = u(t) := f^T(t)\, J(\delta)^{-1} f(t)$.
References

Atwood, C.L. (1973). Sequences converging to D-optimal designs of experiments. Ann. Statist. 1, 342-352.
Atwood, C.L. (1976). Convergent design sequences for sufficiently regular optimality criteria. Ann. Statist. 4, 1124-1138.
Bartholomew, D.J. (1980). Factor analysis for categorical data. J. Roy. Statist. Soc. Ser. B 42, 293-321.
Bauer, H. (1978). Wahrscheinlichkeitstheorie und Grundzüge der Maßtheorie. De Gruyter, Berlin.
Böhning, D. (1981). Numerische Verfahren in der optimalen Versuchsplanung. Inaugural-Dissertation am Fachbereich Mathematik der Freien Universität Berlin.
Böhning, D. (1982). Convergence of Simar's algorithm for finding the MLE of a compound Poisson process. Ann. Statist. 10, 1006-1008.
Böhning, D. (1983). Maximum likelihood estimation of the logarithmic series distribution. Statistische Hefte 24, 121-140.
Böhning, D. and K.-H. Hoffmann (1982). Numerical procedures for estimating probabilities. J. Statist. Comput. Simulation 14, 283-293.
Dempster, A.P., N.M. Laird and D.B. Rubin (1977). Maximum likelihood from incomplete data via the EM-algorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39, 1-38.
Everitt, B.S. and D.J. Hand (1981). Finite Mixture Distributions. Chapman and Hall, London.
Fedorov, V.V. (1972). Theory of Optimal Experiments (transl. and ed. by W.J. Studden and E.M. Klimko). Academic Press, New York.
Kiefer, J. (1974). General equivalence theory for optimum designs (approximate theory). Ann. Statist. 2, 849-879.
Lindsay, B.G. (1983). The geometry of mixture likelihoods: a general theory. Ann. Statist. 11, 86-94.
Silvey, S.D. (1980). Optimal Design. Chapman and Hall, London-New York.
Silvey, S.D., D.M. Titterington and B. Torsney (1978). An algorithm for optimal designs on a finite design space. Commun. Statist. A 7, 1379-1389.
Simar, L. (1976). Maximum likelihood estimation of a compound Poisson process. Ann. Statist. 4, 1200-1209.
Tsay, J.Y. (1976). On the sequential construction of D-optimal designs. J. Amer. Statist. Assoc. 71, 671-674.
Whittle, P. (1973). Some general points in the theory of optimal experimental design. J. Roy. Statist. Soc. Ser. B 35, 123-130.
Wu, C.F. (1978). Some iterative procedures for generating nonsingular optimal designs. Commun. Statist. A 7, 1399-1412.
Wynn, H.P. (1970). The sequential generation of D-optimum experimental designs. Ann. Math. Statist. 41, 1655-1664.