Some prediction procedures with special reference to selection

Journal of Statistical Planning and Inference 24 (1990) 391-401. North-Holland.

Ravindra KHATTREE*

Department of Statistics, North Dakota State University, Fargo, ND 58105, U.S.A.

* Presently at Research Department, Technical Center, BFGoodrich Chemical Group, P.O. Box 122, Avon Lake, OH 44012, U.S.A.

Received 13 April 1987; revised manuscript received 14 February 1989.

Recommended by S. Panchapakesan.

Abstract: We consider the problem of prediction of an unobserved variable, or some other function of criterion variables, for a certain well defined subpopulation in which the mean of the criterion variable(s) is higher than that of any other subset of the same size. To select this particular subset, optimum decision rules - to be termed selection indices - are constructed. Optimum decision rules are also obtained when there are constraints requiring that, in the selected part of the population, the average values of some other criterion variables are above certain prespecified levels. More emphasis is given to two stage selection. We derive selection indices when selection is made in two or more stages.

AMS Subject Classification: 62J, 62H.

Key words: Prediction; selection indices; multi-stage selection; generalized Neyman-Pearson lemma.

1. Introduction, notations and some basic results

Selection of individuals is exercised on the basis of some concomitant variables (or predictors) to effect improvements in certain criterion variables which are unobservable at the time of selection. All predictors may not be available at one time, and in such a case selection needs to be done in several stages. This paper discusses these and some other related problems. For certain basic aspects of these problems, see the fundamental works of Cochran (1951), Curnow (1961) and Rao (1964, 1971, 1973). For a mathematical treatment in an abstract sample space setting, one is referred to Noda (1985). Also see Khattree (1987, 1988a, 1988b) for some recent developments.

Let $y, y_1, y_2, \ldots$ be the criterion variables, which are unobservable at the time of selection, and let $x$ be the vector of predictors, which are observed. The regression of $y_j$ on $x$ will be denoted by $\eta^j$, which, we assume, exists. We will drop the superscript $j$ if there is only one criterion variable in the context. We will denote the cumulative distribution function by $F(\cdot)$ in a generic way; which variables are involved should be obvious from what appears inside the parentheses following $F$.

Mathematically, the problem is to find an objective rule that can be applied unambiguously once the values of $x$ are available, so that a decision whether or not to select can be made. In other words, the problem can be viewed as that of determining a suitable region $\omega$ in $X$, the sample space of $x$, as our region of selection. In case one wants to select the 'best' $\alpha$-proportion subpopulation of $X$, $\omega$ is constrained to be of size $\alpha$. The concept which we have called 'best' depends on the choice of the decision maker. Whatever it may be, it should be well defined and we should be able to measure it quantitatively. We will denote it by $f(y)$. Let the joint cumulative distribution of $y = (y_1, \ldots, y_p)' \in Y$ and $x \in X$ be $F(y, x)$, and let $\omega$ denote a subset of $X$ of size $\alpha$, that is, $P_x(\omega) = \alpha$, where $0 < \alpha \le 1$ (for simplicity, we will drop the subscript $x$ and write it as $P(\omega) = \alpha$). Let $f(y)$ be an integrable function of the criterion variables and let $u(x) = E\{f(y) \mid x\}$. A subset $\omega^*$ is said to be an $\alpha$-size optimum region in $X$ if it maximizes $\int_{\omega} \int_{Y} f(y)\, dF(y, x)$ among all subsets of size $\alpha$.

Lemma 1.1. The subset $\omega^* = \{x: u(x) \ge k\}$, where $k$ is a constant such that $P(\omega^*) = \alpha$, is an $\alpha$-size optimum region in $X$.

Example 1.1. Let $f(y) = \chi$, the characteristic function of the set $\{y: y_i \ge a_i,\ i = 1, \ldots, p\}$, where $a' = (a_1, a_2, \ldots, a_p)$ is a given vector. If $p = 1$ and the conditional distribution of $y$ given $x$ has a location parameter $\eta_x$ and a scale parameter $\sigma$ independent of $x$, then $\omega^*$ is given by $\omega^* = \{x: \eta_x \ge k\}$, where $k$ is given by $P(\eta_x \ge k) = \alpha$.

Example 1.2. Let $p = 1$ and let $y$ be marginally distributed with mean 0. Let $f(y) = c - y^2$, where $c$ is a constant. An optimum region $\omega^*$, minimizing the dispersion of $y$, is given by $\omega^* = \{x: \sigma^2_{y \mid x} + \eta_x^2 \le k\}$, where $\sigma^2_{y \mid x}$ denotes the conditional variance of $y$. If $\sigma^2_{y \mid x}$ is independent of $x$, as in the case of the normal distribution, then $\omega^*$ reduces to a confidence interval on $\eta_x$.
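To make the rule of Example 1.1 concrete, the following is a minimal simulation sketch under an assumed linear-normal model (the model $y = \beta x + e$ and all numerical values of $\beta$, $\sigma$, $\alpha$ and the cutoff $a_1$ are illustrative assumptions, not taken from the paper).

```python
# Sketch of Example 1.1 (p = 1): select the alpha-proportion of the population
# with eta_x = E(y | x) >= k, where P(eta_x >= k) = alpha, then check how the
# selected group does on P(y >= a_1).  Assumed model: x ~ N(0,1), y = beta*x + e.
import numpy as np
from scipy.stats import norm

beta, sigma, alpha, a1 = 0.8, 0.6, 0.25, 0.5   # assumed illustrative values

tau = abs(beta)                                 # sd of eta_x = beta * x
k = tau * norm.ppf(1.0 - alpha)                 # cutoff so that P(eta_x >= k) = alpha

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = beta * x + rng.normal(scale=sigma, size=x.size)

selected = beta * x >= k                        # selection rule based on eta_x only
print("selected proportion   :", selected.mean())            # close to alpha
print("P(y >= a1 | selected) :", (y[selected] >= a1).mean())
print("P(y >= a1) overall    :", (y >= a1).mean())
```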

Multiple criteria

Consider the problem of finding a selection region where several objectives are maximized simultaneously. In particular, let $f_1(y), f_2(y), \ldots, f_r(y)$ be $r$ integrable functions and let $a_1, a_2, \ldots, a_r$ be given positive numbers. Let $\omega$ be an $\alpha$-size subset of $X$ such that

$\frac{1}{a_1} \int_{\omega} f_1(y)\, dF(y, x) = \cdots = \frac{1}{a_r} \int_{\omega} f_r(y)\, dF(y, x) = v$ (say).   (1.1)

A subset $\omega^*$ is said to be an $\alpha$-size optimum region in $X$ for the multiple criteria if $\omega^*$ is feasible, i.e., satisfies (1.1), and if $\omega^*$ maximizes $v$ among all $\alpha$-size $\omega$ which are feasible. Let $u_i(x) = E(f_i(y) \mid x)$, $i = 1, 2, \ldots, r$. The following lemma follows from an application of the generalized Neyman-Pearson Lemma.

Lemma 1.2. The subset $\omega^* = \{x: \sum_{i=1}^{r} b_i u_i(x) \ge l\}$, where $b_1, \ldots, b_r$ are constants such that $\sum_{i=1}^{r} a_i b_i > 0$, $P(\omega^*) = \alpha$ and the condition (1.1) is satisfied, is an $\alpha$-size optimum region for the multiple criteria.

Example 1.3. Let $(y': x')'$ be jointly normally distributed, where $y' = (y_1, y_2, \ldots, y_r)$, and let $E(y_i \mid x) = \eta_x^i$. Let $a' = (a_1, \ldots, a_r)$, $\eta' = (\eta_x^1, \eta_x^2, \ldots, \eta_x^r)$ and let $\Lambda$ denote the dispersion matrix of $\eta$. Then $\omega^* = \{x: b'\eta \ge b_0\}$, where $b' = a'\Lambda^{-1}/(a'\Lambda^{-1}a)^{1/2}$ and $b_0$ is given by $P(\omega^*) = \alpha$. It is easy to verify this result.
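A small numerical sketch of Example 1.3 follows; the dispersion matrix $\Lambda$, the weights $a$ and the intensity $\alpha$ are assumed illustrative values, not taken from the paper.

```python
# Sketch of Example 1.3: compute the index coefficients b = Lam^{-1} a / sqrt(a' Lam^{-1} a)
# and the cutoff b0 with P(b'eta >= b0) = alpha, for eta ~ N(0, Lam).
import numpy as np
from scipy.stats import norm

Lam = np.array([[1.0, 0.4],
                [0.4, 0.8]])       # assumed dispersion matrix of eta
a = np.array([1.0, 2.0])           # assumed weights a_1, a_2
alpha = 0.10

Lam_inv_a = np.linalg.solve(Lam, a)
b = Lam_inv_a / np.sqrt(a @ Lam_inv_a)            # index coefficients

var_index = b @ Lam @ b                           # equals 1 by construction
b0 = np.sqrt(var_index) * norm.ppf(1.0 - alpha)   # P(b'eta >= b0) = alpha
print("b =", np.round(b, 4), " b0 =", round(b0, 4))
```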

2. Multi-stage selection: formulation

Sometimes the measurements on various predictor variables become available at different time points. Consequently, a selection program may require repeated selection as more measurements accumulate. We will confine our attention to the procedure where selection at any stage is made from those individuals selected at the previous stage, after obtaining additional measurements on them. Thus, the finally selected group will consist of only those who successfully passed through all the screening stages. There could be various approaches to this problem. For details, see Khattree (1985). Also see Finney (1984).

Let $x' = (x_1', x_2', \ldots, x_r')$, where $x_1' = (x_1, \ldots, x_{q_1})$, $x_2' = (x_{q_1+1}, \ldots, x_{q_2})$, ..., are the predictors available at $r$ different stages. We will, as earlier, look at the problem as one of finding an $\alpha$-size optimum region $\omega$ in $X$ where some function $f(y)$ of the criterion variables is maximized. Further, let us assume that at stage $j$ an $\alpha_j$ proportion of the population remaining from the previous stage is to be selected. We denote the part of the population which is selected at the $j$-th stage by $\omega_j$. Thus $\omega_j$ is a subset of $\omega_{j-1}$, with $\omega_0 \equiv X$. For the systematic development of what follows, we need the following lemma, which gives the optimum selection index for the last stage in any multi-stage selection procedure. The proof is suppressed.

Lemma 2.1. Let $f(y)$ be an integrable function of criterion variables and suppose we want to find that portion of the population (say $\omega$) where $E(f(y) \mid \omega)$ is maximized. Suppose a multi-stage selection procedure with $r$ stages is to be followed. Let $\omega_1^* \supset \omega_2^* \supset \cdots \supset \omega_{r-1}^*$ be any choice of selection regions up to $r - 1$ stages, satisfying $P(\omega_i^*) = \prod_{j=1}^{i} \alpha_j$, $i = 1, 2, \ldots, r - 1$. Then the optimum region at the $r$-th stage, $\omega_r^* \subset \omega_{r-1}^*$, is given by

$\omega_r^* = \{x \in \omega_{r-1}^*: E(f(y) \mid x) \ge k_r\}$.   (2.1)

The constant $k_r$ in (2.1) is determined from the condition

$P(\omega_r^*) = \prod_{i=1}^{r} \alpha_i$.   (2.2)
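The following is a simulation sketch of the two-stage version of (2.1)-(2.2) under an assumed linear-normal model (all coefficients and intensities are illustrative assumptions; the first-stage rule used here is a simple one, not the optimum region of Lemma 2.2 below).

```python
# Two-stage thresholding in the spirit of Lemma 2.1: stage 1 keeps the top alpha1
# fraction on the stage-1 score; stage 2 keeps, among those, enough individuals on
# the stage-2 score E(f(y)|x) so that the overall proportion is alpha1 * alpha2.
import numpy as np

rng = np.random.default_rng(1)
n, alpha1, alpha2 = 50_000, 0.4, 0.5

x1 = rng.normal(size=n)                     # predictors available at stage 1
x2 = rng.normal(size=n)                     # predictors available only at stage 2
y = 0.7 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)   # assumed criterion

eta1 = 0.7 * x1                             # E(y | x1) under the assumed model
eta2 = 0.7 * x1 + 0.5 * x2                  # E(y | x1, x2)

# Stage 1: retain the alpha1 fraction with the largest eta1 (simple, not optimal).
k1 = np.quantile(eta1, 1.0 - alpha1)
stage1 = eta1 >= k1

# Stage 2 (Lemma 2.1): threshold eta2 among stage-1 survivors so that the
# overall selected proportion equals alpha1 * alpha2, as in (2.2).
k2 = np.quantile(eta2[stage1], 1.0 - alpha2)
stage2 = stage1 & (eta2 >= k2)

print("overall proportion :", stage2.mean())        # ~ alpha1 * alpha2
print("E(y | selected)    :", y[stage2].mean())
print("E(y) overall       :", y.mean())
```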

Next we try to find the selection region for the previous stage, namely the $(r-1)$-th. The answer to this is contained in the following lemma, which is a generalization of a result in Rao (1964).

Lemma 2.2. Given that at some stage (say the $j$-th) the rule given by (2.1) and (2.2) is to be implemented, the optimum region for the just previous stage (the $(j-1)$-th stage) is given by

$\omega_{j-1}^* = \Big\{(x_1', x_2', \ldots, x_{j-1}')': \int_{k_j}^{\infty} u \, dF(u \mid x_1, \ldots, x_{j-1}) + \lambda_1^{(j-1)} \int_{k_j}^{\infty} dF(u \mid x_1, \ldots, x_{j-1}) \ge \lambda_2^{(j-1)} \Big\}$   (2.3)

where $u$ is defined as the random variable

$u = E(f(y) \mid x_1, \ldots, x_j)$   (2.4)

and $\lambda_i^{(j-1)}$, $i = 1, 2$, are constants to be chosen such that

$P(\omega_{j-1}^*) = \prod_{t=1}^{j-1} \alpha_t$.   (2.5)

Proof. This involves the maximization of

$\int_{\omega_{j-1}} \Big\{ \int_{k_j}^{\infty} u \, dF(u \mid x_1, \ldots, x_{j-1}) \Big\}\, dF(x_1, \ldots, x_{j-1})$

subject to

$\int_{\omega_{j-1}} dF(x_1, \ldots, x_{j-1}) = \alpha_1 \cdots \alpha_{j-1}$

and

$\int_{\omega_{j-1}} \Big\{ \int_{k_j}^{\infty} dF(u \mid x_1, \ldots, x_{j-1}) \Big\}\, dF(x_1, \ldots, x_{j-1}) = \alpha_1 \cdots \alpha_j$.

Now an application of the Neyman-Pearson Lemma results in a region $\omega_{j-1}^*$ as given by (2.3) and (2.5).

As can be seen, these selection indices become very complicated if the number of stages is large. We will, from here on, consider the problem of two stage selection. We denote the regression of $y^i$ on $x_j, x_{j-1}, \ldots, x_1$ by $\eta_j^i$. The superscript is dropped if there is only one criterion variable under discussion. In the following, we consider the case when one desires to maximize a certain probability.

2.1. Maximization of probability (single criterion)

The objective here is the same as that in Raj (1954, Theorem 1), but selection is performed in two stages. Raj noted that under the assumption of multivariate normality this objective is equivalent to that of maximizing the mean. This, however, is not the case in two stage selection. For simplicity, we will consider the case when there is only one criterion variable. Similar results, in principle, can be stated for the case when we have a vector of criterion variables. Let $y_0$ be the cut off point for the criterion variable $y$. We define

$p_{y_0 \mid x} = P(y \ge y_0 \mid x)$.   (2.6)

We have the following theorem for the optimum selection index at the second stage. The proof is by choosing $f(y)$ as the characteristic function of the set $\{y: y \ge y_0\}$ in Lemma 2.1.

Theorem 2.1. For the objective of maximization of $P(y \ge y_0)$ in the two stage procedure, where $y_0$ is a known point on the $y$-axis, the optimum region at the second stage is

$\omega_2^* = \{x \in \omega_1^*: p_{y_0 \mid x} \ge k_2\}$,   (2.7)

where $k_2$ is such that

$P(p_{y_0 \mid x} \ge k_2) = \alpha_1 \alpha_2$,   (2.8)

whatever may be the choice of $\omega_1^*$ satisfying $P(\omega_1^*) = \alpha_1$; therefore the selection index is $p_{y_0 \mid x}$.

Similarly, with the same substitution in Lemma 2.2, we have the following theorem:

Theorem 2.2. Given that the optimum region for stage 2 is as in (2.7) and (2.8), the optimum region at stage 1 is

$\omega_1^* = \Big\{x_1: \int_{p_{y_0 \mid x} \ge k_2} p_{y_0 \mid x}\, dF(x_2 \mid x_1) + \lambda_1 \int_{p_{y_0 \mid x} \ge k_2} dF(x_2 \mid x_1) \ge \lambda_2 \Big\}$   (2.9)

where $\lambda_1$ and $\lambda_2$ are chosen so that

$P(\omega_1^*) = \alpha_1$,   (2.10)

$P(\omega_2^*) = \alpha_1 \alpha_2$.   (2.11)

Results become more definite and computationally simpler if we assume that all the variables are jointly distributed as multivariate normal with zero mean.

For notational convenience, we write

$(y, x')' = (y, x_1', x_2')' \sim N(0, \Sigma)$,   (2.12)

where

$\Sigma = \begin{pmatrix} \sigma_{yy} & \Sigma_{y x_1} & \Sigma_{y x_2} \\ \Sigma_{x_1 y} & \Sigma_{x_1 x_1} & \Sigma_{x_1 x_2} \\ \Sigma_{x_2 y} & \Sigma_{x_2 x_1} & \Sigma_{x_2 x_2} \end{pmatrix}$   (2.13)

is assumed to be known. We formally state the result in the following:

Result 2.1. Suppose a two stage selection strategy is followed with intensities of selection $\alpha_1$ and $\alpha_2$, respectively, and with the objective of finding a second stage selection region $\omega_2^*$ where $P(y \ge y_0)$ is maximum. Under the assumption in equation (2.12), an optimum selection region for two stage selection is given by

$\omega_1^* = \Big\{x_1: \int_{\eta_2 \ge k_2^*} p_{y_0 \mid x}\, dF(\eta_2 \mid x_1) + \lambda_1 \int_{\eta_2 \ge k_2^*} dF(\eta_2 \mid x_1) \ge \lambda_2 \Big\}$,

$\omega_2^* = \{x: \eta_2 \ge k_2^*\}$,

where $\lambda_1$, $\lambda_2$ and $k_2^*$ are such that

$P(\omega_1^*) = \alpha_1$,   $P(\eta_2 \ge k_2^*) = \alpha_1 \alpha_2$,

and $\eta_2$ is defined in the paragraph following the proof of Lemma 2.2.

Remark 2.1. Note that the choice of $\lambda_1$ and $\lambda_2$ may not be unique, leading to two different $\omega_1^*$.

2.2. Two stage selection for finding the most homogeneous part of the population

Here we state, without proof, the selection region when our objective is to find a region $\omega$ in the space of $x$ which is most homogeneous for a single criterion variable, in the sense that $E(y^2 \mid \omega)$ is minimum. It is assumed that the expected value of $y$ in the original population is zero.

Theorem 2.3. For the objective of minimization of $E(y^2)$ in the two stage procedure, the optimum selection regions are:

$\omega_2^* = \{(x_1', x_2')': V(y \mid x_1, x_2) + E^2(y \mid x_1, x_2) \le k_2\}$,   (2.14)

$\omega_1^* = \Big\{x_1: \int_{(\omega_2^*)_{x_1}} \{V(y \mid x) + E^2(y \mid x)\}\, dF(x_2 \mid x_1) + \lambda_1 \int_{(\omega_2^*)_{x_1}} dF(x_2 \mid x_1) \le \lambda_2 \Big\}$,   (2.15)

where $(\omega_2^*)_{x_1}$ is the $x_1$-section of $\omega_2^*$, that is,

$(\omega_2^*)_{x_1} = \{x_2: (x_1', x_2')' \in \omega_2^*\}$,

and $k_2$, $\lambda_1$ and $\lambda_2$ are to be chosen such that

$P(\omega_1^*) = \alpha_1$,   (2.16)

$P(\omega_2^*) = \alpha_1 \alpha_2$.   (2.17)

We can simplify the results (2.14) through (2.17) under the assumption of normality as given by (2.12) and (2.13). Under this assumption, we have the following result:

Result 2.2. If $y, x_1, x_2$ are jointly distributed as in (2.12) and a two stage selection strategy is followed with intensities of selection $\alpha_1$ and $\alpha_2$, respectively, then an optimum selection region in the space of $x$, which is most homogeneous for the single criterion variable $y$, is given by

$\omega_2^* = \{x: |\eta_2| \le k_2^*\}$,

$\omega_1^* = \Big\{x_1: \int_{t_1}^{t_2} (\eta_1 + \sigma_0 t)^2 \phi(t)\, dt + \lambda_1 \{\Phi(t_2) - \Phi(t_1)\} \le \lambda_2 \Big\}$,

where $\eta_1$ and $\eta_2$ are defined in the paragraph following the proof of Lemma 2.2,

$t_1 = (-k_2^* - \eta_1)/\sigma_0$,   $t_2 = (k_2^* - \eta_1)/\sigma_0$,   $\sigma_0^2 = \mathrm{var}(\eta_2 \mid x_1)$,

$\phi$ and $\Phi$ denote the standard normal density and distribution function, and $k_2^*$, $\lambda_1$ and $\lambda_2$ are determined so that

$P(\omega_1^*) = \alpha_1$,   $P(\omega_2^*) = \alpha_1 \alpha_2$.

It may be noted that Remark 2.1 (regarding the nonuniqueness of the first stage optimum region) also applies here.

3. Two stage selection with restrictions on other criterion variables

Certain problems of restricted prediction have been considered by Rao (1964) and Khattree (1987, 1988a) in the case of unistage selection. In this section, we generalize some of these ideas for a two stage selection strategy. Let $y' = (y_1, \ldots, y_p)$ be the vector of criterion variables, and suppose we want to maximize the mean of $y_1$ subject to the conditions that the means of $y_2, \ldots, y_p$ do not fall below certain specified levels $s_2, \ldots, s_p$. As earlier, we assume that the proportions selected at the two stages are predetermined and are $\alpha_1$ and $\alpha_2$, respectively. Further, we denote by $\eta_i$, $i = 1, 2$, the vector $(\eta_i^1, \eta_i^2, \ldots, \eta_i^p)'$. The next theorem provides the optimal selection region at stage 2 for this purpose. The proof follows from an application of the generalized Neyman-Pearson Lemma.

Theorem 3.1. With the notations as described earlier, the optimal selection region at stage 2 is given by

$\omega_2^* = \{(x_1', x_2')': b'\eta_2 \ge k_2\}$,   (3.1)

where $k_2$ is such that

$P(\omega_2^*) = \alpha_1 \alpha_2$,   (3.2)

whatever may be the choice of $\omega_1^*$ satisfying $P(\omega_1^*) = \alpha_1$. The vector $b$ in (3.1) is to be determined so that the restrictions on the criterion variables $y_2, y_3, \ldots, y_p$ are satisfied.

Given the region $\omega_2^*$ as above, the selection region at the first stage is given by the following theorem.

Theorem 3.2. With $\omega_2^*$ as given in (3.1), an optimal region at stage 1 is given by

$\omega_1^* = \Big\{x_1: \beta' I + \lambda_1 \int_{b'\eta_2 \ge k_2} dF(\eta_2 \mid x_1) \ge \lambda_0 \Big\}$   (3.3)

where $I$ is the vector of integrals defined by

$I = \int_{b'\eta_2 \ge k_2} \eta_2 \, dF(\eta_2 \mid x_1)$,   (3.4)

$k_2$ is such that (3.2) holds good, and $\beta$, $\lambda_1$, $\lambda_0$ are chosen so that all the restrictions are satisfied and $P(\omega_1^*) = \alpha_1$.

The regions described in the above two theorems may be of some practical utility if (3.1) and (3.3) can be further simplified. We do so under the assumption of multivariate normality of all the variables. If all the variables are distributed as multivariate normal as given in (2.12) and (2.13), then, following Rao (1964, pp. 40-43), $b$ can be found as the solution to the problem:

maximize $\Lambda_{\eta_2(1)} b$, subject to $\Lambda_{\eta_2(i)} b \ge s_i / Z(k_2)$, $i = 2, \ldots, p$, and $b'\Lambda_{\eta_2} b = 1$,   (3.5)

where

$Z(k_2) = \frac{1}{\sqrt{2\pi}} \int_{k_2}^{\infty} t\, e^{-t^2/2}\, dt$,

$\Lambda_{\eta_2} = D(\eta_2)$ and $\Lambda_{\eta_2(i)}$ is the $i$-th row of $\Lambda_{\eta_2}$. Problem (3.5) can easily be solved using the algorithm described in Khattree (1987). To simplify (3.3), it is enough to evaluate the integrals in $I$; these integrals have been evaluated in Khattree (1985).
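As an aside, a problem of the form (3.5) can also be handed to a general-purpose constrained optimizer; the sketch below uses scipy's SLSQP rather than the algorithm of Khattree (1987), and the dispersion matrix and right-hand sides $s_i/Z(k_2)$ are assumed illustrative values, not taken from the paper.

```python
# Sketch: maximize Lam_(1) b subject to Lam_(i) b >= c_i (i >= 2) and b' Lam b = 1.
import numpy as np
from scipy.optimize import minimize

Lam = np.array([[1.0, 0.3, 0.2],
                [0.3, 0.8, 0.1],
                [0.2, 0.1, 0.6]])        # assumed dispersion matrix D(eta_2)
c = np.array([0.15, 0.10])               # assumed values of s_i / Z(k_2), i = 2, 3

objective = lambda b: -(Lam[0] @ b)                        # maximize Lam_(1) b
constraints = [
    {"type": "ineq", "fun": lambda b: Lam[1:] @ b - c},    # Lam_(i) b >= c_i
    {"type": "eq",   "fun": lambda b: b @ Lam @ b - 1.0},  # b' Lam b = 1
]

b0 = np.linalg.solve(Lam, np.ones(3))
b0 = b0 / np.sqrt(b0 @ Lam @ b0)                           # normalized starting point
res = minimize(objective, b0, constraints=constraints, method="SLSQP")
print("b =", np.round(res.x, 4), " Lam_(1) b =", round(Lam[0] @ res.x, 4))
```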

In this case, we obtain the following result:

Result 3.1. Under the assumption of multivariate normality, in addition to the conditions described in Theorems 3.1 and 3.2, an optimum two-stage selection region is given by

$\omega_2^* = \{x: b'\eta_2 \ge k_2\}$,

with

$\eta_2 = E(y \mid x_1, x_2)$,   $\eta_1 = E(y \mid x_1)$,   $b'\Lambda_{\eta_2} b = 1$,

and $I$ as defined in (3.4). The coefficients $k_2$, $\lambda_1$, $\lambda_0$ and $\beta$ are chosen so that all restrictions and size conditions are satisfied.

As can be seen, the optimum region given by the above result is very complex and will be difficult to compute. As an alternative, one may try the following selection regions:

$\omega_1 = \{x_1: c'\eta_1 \ge k_1\}$,   $\omega_2 = \{(x_1', x_2')': b'\eta_2 \ge k_2\}$,

where $c$ and $b$ are to be chosen so as to satisfy

$P(c'\eta_1 \ge k_1) = \alpha_1$   and   $P(b'\eta_2 \ge k_2) = \alpha_1 \alpha_2$.

4. Other two stage selection indices

Rao (1964) suggested the use of various other two stage selection indices which are intuitive in nature but may be appealing for practical applications. Another suggested selection region may be of the type $\zeta_2 \ge k_2$ on the $\zeta_2$ axis and $\zeta_1 \ge a$, $\eta_1 \ge b$ in the $(\zeta_1, \eta_1)$ space, where $\zeta_2$ is a given linear selection index which may not be optimum and $\zeta_1$ is the regression of $\zeta_2$ on $x_1$. The constants $k_2$, $a$ and $b$ are to be chosen so that the conditions on the selection intensities are satisfied and the mean improvement is maximum. In the following, we evaluate the mean improvement due to application of the indices proposed above, with intensities $\alpha_2$ and $\alpha_1$, respectively. Let us for simplicity denote $(y, \eta_1, \zeta_1, \zeta_2)$ by $(z_0, z_1, z_2, z_3)$ and let us also assume that

$z = (z_0, z_1, z_2, z_3)' \sim N_4(0, R^*)$,

where

$R^* = \begin{pmatrix} 1 & \rho_{01} & \rho_{02} & \rho_{03} \\ \rho_{10} & & & \\ \rho_{20} & & R & \\ \rho_{30} & & & \end{pmatrix}$

is the correlation matrix of $z$. $z_0$, taken as a linear combination of the other variables, will be

$z_0 = t_1 z_1 + t_2 z_2 + t_3 z_3 + \text{error}$.

Computation of $E^*(z_0)$ directly may be difficult. Instead, we compute quantities $E^*(z_1 - B_1 z_{(23)})$ etc., where $B_1 = (\rho_{12}, \rho_{13})$ and $z_{(23)} = (z_2, z_3)'$. The other quantities are similarly defined. Now

$E^*(z_1 - B_1 z_{(23)}) = \text{const.} \int\!\!\int\!\!\int (z_1 - B_1 z_{(23)})\, |R|^{-1/2} \exp\{-\tfrac{1}{2}(z_1, z_2, z_3) R^{-1} (z_1, z_2, z_3)'\}\, dz_1\, dz_2\, dz_3 = \gamma_1$, say,

where the integration is over the selection region and $N_2((c_1, c_2) \mid \varrho)$ stands for the mass of the rectangle $x_1 > c_1$, $x_2 > c_2$ for the standard bivariate normal distribution with correlation $\varrho$. We can similarly evaluate the other terms. We then have

$E^*(z_1 - B_1 z_{(23)}) = \gamma_1$,   $E^*(z_2 - B_2 z_{(31)}) = \gamma_2$,   $E^*(z_3 - B_3 z_{(12)}) = \gamma_3$.

Now substitution of $B_1$ and $z_{(23)}$ etc. in the above and a little simplification leads to

$E^*(z_1, z_2, z_3)' = (2I - R)^{-1} \gamma$,   where $\gamma' = (\gamma_1, \gamma_2, \gamma_3)$.

Therefore, the mean improvement in $y = z_0$ would be $t'(2I - R)^{-1}\gamma$, where $k_2$, $a$ and $b$ are to be found such that

$\int_{\eta_1 \ge b} \int_{\zeta_1 \ge a} \int_{\zeta_2 \ge k_2} dF(\eta_1, \zeta_1, \zeta_2) = \alpha_1 \alpha_2$

and

$\int_{\eta_1 \ge b} \int_{\zeta_1 \ge a} dF(\eta_1, \zeta_1) = \alpha_1$,

and where $t = (t_1, t_2, t_3)'$. A comparison of the mean improvement due to the index given here with those suggested by Rao will throw some more light on the relative merits of these indices. A theoretical comparison, however, seems impossible. Through simulation, one can determine the relative advantages of using any of the indices discussed above in particular cases.
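The following is a minimal sketch of how such a simulation comparison could be set up; the data-generating model, the two competing index pairs and the selection intensities are all assumed for illustration and are not taken from the paper.

```python
# Compare two two-stage selection rules by the realized mean of y among the
# finally selected individuals (larger is better for the maximization objective).
import numpy as np

rng = np.random.default_rng(3)
n, alpha1, alpha2 = 200_000, 0.4, 0.5

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.7 * x1 + 0.5 * x2 + rng.normal(scale=0.6, size=n)   # assumed model

def two_stage_mean(score1, score2):
    """Keep the top alpha1 fraction on score1, then the top alpha2 fraction of
    those on score2; return the mean of y in the finally selected group."""
    keep1 = score1 >= np.quantile(score1, 1.0 - alpha1)
    cut2 = np.quantile(score2[keep1], 1.0 - alpha2)
    keep2 = keep1 & (score2 >= cut2)
    return y[keep2].mean()

# Rule A: regression-based indices eta_1 = E(y|x1) and eta_2 = E(y|x1,x2).
print("eta-based rule :", round(two_stage_mean(0.7 * x1, 0.7 * x1 + 0.5 * x2), 4))
# Rule B: an ad hoc linear index zeta_2 = x1 + x2, with zeta_1 = its regression on x1.
print("zeta-based rule:", round(two_stage_mean(x1, x1 + x2), 4))
```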

Acknowledgements

The author wishes to thank Profs. C.R. Rao, M.B. Rao and the three anonymous referees for their extremely helpful comments and suggestions, which improved the presentation of the paper considerably.

References

Cochran, W.G. (1951). Improvement by means of selection. Proc. 2nd Berkeley Symp. Math. Statist. and Probab., 449-470.

Curnow, R.N. (1961). Optimal programmes for varietal selection. J. Roy. Statist. Soc. Ser. B 23(2), 282-318 (with discussion).

Finney, D.J. (1984). Improvement by planned multi-stage selection. J. Amer. Statist. Assoc. 79, 501-509.

Khattree, R. (1985). Some contributions to the statistical theory of prediction: Selection indices and their construction. Ph.D. Thesis, University of Pittsburgh.

Khattree, R. (1987). On selection with restriction. Comm. Statist. - Simulation Comput. 16(4), 1093-1103.

Khattree, R. (1988a). Inequality restricted linear selection indices. Comm. Statist. - Theory Methods 17(9), 2959-2980.

Khattree, R. (1988b). Asymmetric selection indices. Comm. Statist. - Theory Methods 17(11), 3981-3993.

Noda, K. (1985). Optimal construction of a selection of a subpopulation. Ann. Inst. Statist. Math. 37, 415-435.

Raj, D. (1954). On optimum selections from multivariate populations. Sankhya 14, 363-366.

Rao, C.R. (1964). Problems of selection involving programming techniques. In: Proc. of IBM Sci. Comp. Symp. on Statist., 29-51.

Rao, C.R. (1971). Advanced Statistical Methods in Biometric Research. Wiley/Hafner, New York.

Rao, C.R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.
New York.