Journal of Statistical Planning and Inference 24 (1990) 391-401
North-Holland

SOME PREDICTION PROCEDURES WITH SPECIAL REFERENCE TO SELECTION

Ravindra KHATTREE*
Department of Statistics, North Dakota State University, Fargo, ND 58105, U.S.A.

Received 13 April 1987; revised manuscript received 14 February 1989
Recommended by S. Panchapakesan

Abstract: We consider the problem of prediction of an unobserved, well defined function of criterion variables of a population, in which the mean of this function for a certain subpopulation is higher than that of any other subset of the same size. To select this particular subpopulation, optimum decision rules, to be termed selection indices, are constructed. Optimum decision rules are also obtained when there are constraints requiring that, in the selected part of the population, average values of some other criterion variables are above certain prespecified levels. More emphasis is given to two stage selection; we derive selection indices when selection is made in two or more stages.

AMS Subject Classification: 62J, 62H.

Key words: Prediction; selection indices; multi-stage selection; generalized Neyman-Pearson lemma.

* Presently at Research Department, Technical Center, BFGoodrich Chemical Group, P.O. Box 122, Avon Lake, OH 44012, U.S.A.
1. Introduction, notations and some basic results
Selection of individuals is exercised on the basis of some concomitant variables (or predictors) to effect improvements in certain criterion variables that are unobservable at the time of selection. Not all predictors may be available at one time, and in such a case selection needs to be done in several stages. This paper discusses these and some other related problems. For certain basic aspects of these problems, see the fundamental works of Cochran (1951), Curnow (1961) and Rao (1964, 1971, 1973). For a mathematical treatment in an abstract sample space setting, one is referred to Noda (1985). Also see Khattree (1987, 1988a, 1988b) for some recent developments.
Let $y, y^1, y^2, \dots$ be the criterion variables, which are unobservable at the time of selection, and let $x$ be the vector of predictors, which are observed. The regression of $y^j$ on $x$ will be denoted by $\eta^j$, which, we assume, exists. We will drop the superscript $j$ if there is only one criterion variable in the context. We will denote the cumulative distribution function by $F(\cdot)$ in a generic way; the variables involved should be obvious from what appears inside the parentheses following $F$.

Mathematically, the problem is to find an objective rule that can be unambiguously applied once the values of $x$ are available, so that a decision whether or not to select can be made. In other words, the problem can be viewed as that of determining a suitable region $\omega$ in $X$, the sample space of $x$, as our region of selection. In case one wants to select the 'best' $\alpha$-proportion subpopulation of $X$, $\omega$ is constrained to be of size $\alpha$. The concept which we have called 'best' depends on the choice of the decision maker; whatever it may be, it should be well defined and we should be able to measure it quantitatively. We will denote it by $f(y)$.

Let the joint cumulative distribution of $y = (y_1, \dots, y_p)' \in Y$ and $x \in X$ be $F(y, x)$ and let $\omega$ denote a subset of $X$ of size $\alpha$, that is, $P_x(\omega) = \alpha$ where $0 < \alpha \le 1$ (for simplicity, we will drop the subscript $x$ and write it as $P(\omega) = \alpha$). Let $f(y)$ be an integrable function of the criterion variables and let $u(x) = E\{f(y) \mid x\}$. A subset $\omega^*$ is said to be an $\alpha$-size optimum region in $X$ if it maximizes $\int_\omega \int_Y f(y)\, dF(y, x)$ among all subsets of size $\alpha$.
Lemma 1.1. The subset $\omega^* = \{x: u(x) \ge k\}$, where $k$ is a constant such that $P(\omega^*) = \alpha$, is an $\alpha$-size optimum region in $X$.
Example 1.1. Let $\chi_c$ be the characteristic function of the set $\{y: y_i \ge a_i,\ i = 1, \dots, p\}$, where $a' = (a_1, a_2, \dots, a_p)$ is a given vector. If $p = 1$, and the conditional distribution of $y$ given $x$ has a location parameter $\eta_x$ and a scale parameter $\sigma$, independent of $x$, then $\omega^*$ is given by $\omega^* = \{x: \eta_x \ge k\}$, where $k$ is given by $P(\eta_x \ge k) = \alpha$.

Example 1.2. Let $p = 1$, and let $y$ be marginally distributed with mean 0. Let $f(y) = c - y^2$ where $c$ is a constant. An optimum region $\omega^*$, minimizing the dispersion of $y$, is given by $\omega^* = \{x: \sigma^2_{y|x} + \eta_x^2 \le k\}$, where $\sigma^2_{y|x}$ denotes the conditional variance of $y$. If $\sigma^2_{y|x}$ is independent of $x$, as in the case of the normal distribution, then $\omega^*$ reduces to a confidence interval on $\eta_x$.
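To make Lemma 1.1 concrete, the following is a minimal numerical sketch (not part of the original paper) for a single criterion variable with $f(y) = y$: it assumes a standardized bivariate normal pair $(y, x)$ with a made-up correlation, takes $u(x) = E(y \mid x)$, estimates the cutoff $k$ as the upper $\alpha$-quantile of $u(x)$, and checks by simulation that the resulting region of size $\alpha$ has a higher selected mean of $y$ than a random subset of the same size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho, alpha, n = 0.6, 0.10, 200_000   # assumed correlation and selection proportion

# Simulate a standardized bivariate normal (y, x) with correlation rho.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# u(x) = E(y | x) = rho * x under this model; Lemma 1.1 selects {x : u(x) >= k}.
u = rho * x
k = np.quantile(u, 1 - alpha)               # empirical cutoff of size alpha
k_theory = rho * stats.norm.ppf(1 - alpha)  # exact cutoff, since u ~ N(0, rho^2)

selected = u >= k
random_subset = rng.random(n) < alpha       # a size-alpha region ignoring x

print("cutoff k (empirical, exact):", k, k_theory)
print("mean of y in optimum region:", y[selected].mean())
print("mean of y in random region :", y[random_subset].mean())
```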
Multiple criteria

Consider the problem of finding a selection region where several objectives are maximized simultaneously. In particular, let $f_1(y), f_2(y), \dots, f_r(y)$ be $r$ integrable functions and let $a_1, a_2, \dots, a_r$ be given positive numbers. Let $\omega$ be an $\alpha$-size subset of $X$ such that

$\frac{1}{a_1}\int_\omega \int f_1(y)\, dF(y, x) = \dots = \frac{1}{a_r}\int_\omega \int f_r(y)\, dF(y, x) = v$ (say).   (1.1)
A subset $\omega^*$ is said to be an $\alpha$-size optimum region in $X$ for the multiple criteria if $\omega^*$ is feasible and if $\omega^*$ maximizes $v$ among all $\alpha$-size $\omega$ which are feasible. Let $u_i(x) = E(f_i(y) \mid x)$, $i = 1, 2, \dots, r$. The following lemma follows from an application of the generalized Neyman-Pearson lemma.
Lemma 1.2. The subset $\omega^* = \{x: \sum_{i=1}^r b_i u_i(x) \ge l\}$, where $b_1, \dots, b_r$ are constants such that $\sum_{i=1}^r a_i b_i > 0$, $P(\omega^*) = \alpha$ and the condition (1.1) is satisfied, is an $\alpha$-size optimum region for the multiple criteria.

Example 1.3. Let $(y': x')'$ be jointly normally distributed, where $y' = (y_1, y_2, \dots, y_r)$, and let $E(y_i \mid x) = \eta_x^i$. Let $a' = (a_1, \dots, a_r)$, $\eta' = (\eta_x^1, \eta_x^2, \dots, \eta_x^r)$, and let $\Lambda$ denote the dispersion matrix of $\eta$. Then $\omega^* = \{x: b'\eta \ge b_0\}$, where $b' = a'\Lambda^{-1}/(a'\Lambda^{-1}a)^{1/2}$ and $b_0$ is given by $P(\omega^*) = \alpha$. It is easy to verify this result.
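As a quick numerical illustration of Example 1.3 (a sketch added here, not from the original), one can compute $b = \Lambda^{-1}a/(a'\Lambda^{-1}a)^{1/2}$ and the cutoff $b_0$ for a chosen $\alpha$. With this normalization $b'\Lambda b = 1$, so if $\eta$ has mean zero (as under the zero-mean normality assumption used later in (2.12)) then $b'\eta$ is standard normal and $b_0$ is simply its upper $\alpha$-quantile. The matrix $\Lambda$ and vector $a$ below are made-up values.

```python
import numpy as np
from scipy.stats import norm

# Assumed (illustrative) dispersion matrix of eta and target vector a.
Lam = np.array([[1.0, 0.3],
                [0.3, 0.5]])
a = np.array([2.0, 1.0])
alpha = 0.15

Lam_inv_a = np.linalg.solve(Lam, a)
b = Lam_inv_a / np.sqrt(a @ Lam_inv_a)   # b = Lambda^{-1} a / (a' Lambda^{-1} a)^{1/2}

print("b       :", b)
print("b'Lam b :", b @ Lam @ b)           # equals 1 by construction
print("b0      :", norm.ppf(1 - alpha))   # cutoff so that P(b'eta >= b0) = alpha
```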
2. Multi-stage selection: formulation
Sometimes the measurements on various predictor variables become available at different time points. Consequently, a selection program may require repeated selection as more measurements accumulate. We will confine our attention to the procedure where selection at any stage is made from those individuals selected at the previous stage, after obtaining additional measurements on them. Thus, the finally selected group will consist of only those who successfully passed through all the screening stages. There could be various approaches to this problem; for details, see Khattree (1985). Also see Finney (1984).

Let $x' = (x_1', x_2', \dots, x_r')$, where $x_1' = (x_1, \dots, x_{q_1})$, $x_2' = (x_{q_1+1}, \dots, x_{q_2}), \dots$ are the predictors available at $r$ different stages. We will, as earlier, look at the problem as one of finding an $\alpha$-size optimum region $\omega$ in $X$, where some function $f(y)$ of the criterion variables is maximized. Further, let us assume that at stage $j$ a proportion $\alpha_j$ of the population remaining after the previous stage is to be selected. We denote the part of the population which is selected at the $j$-th stage by $\omega_j$. Thus $\omega_j$ is a subset of $\omega_{j-1}$, with $\omega_0 \equiv X$. For the systematic development of what follows, we need the following lemma, which gives the optimum selection index for the last stage in any multi-stage selection procedure. The proof is suppressed.

Lemma 2.1. Let $f(y)$ be an integrable function of criterion variables and suppose we want to find that portion of the population (say $\omega$) where $E(f(y) \mid \omega)$ is maximized. Suppose a multi-stage selection procedure with $r$ stages is to be followed. Let $\omega_1^* \supset \omega_2^* \supset \dots \supset \omega_{r-1}^*$ be any choice of selection regions up to $r - 1$ stages, satisfying $P(\omega_i^*) = \prod_{j=1}^i \alpha_j$, $i = 1, 2, \dots, r - 1$. Then the optimum region at the $r$-th stage, $\omega_r^* \subset \omega_{r-1}^*$, is given by

$\omega_r^* = \{x \in \omega_{r-1}^*: E(f(y) \mid x) \ge k_r\}$.   (2.1)
The constant $k_r$ in (2.1) is determined from the condition

$P(\omega_r^*) = \prod_{i=1}^r \alpha_i$.   (2.2)
Next we try to find the selection region for the previous stage, namely the $(r-1)$-th. The answer to this is contained in the following lemma, which is a generalization of a result in Rao (1964).

Lemma 2.2. Given that at some stage (say the $j$-th) the rule given by (2.1) and (2.2) is to be implemented, the optimum region for the just previous stage (the $(j-1)$-th stage) is given by

$\omega_{j-1}^* = \left\{(x_1', x_2', \dots, x_{j-1}')': \int_{k_j}^\infty u\, dF(u \mid x_1, \dots, x_{j-1}) + \lambda_1^{(j-1)} \int_{k_j}^\infty dF(u \mid x_1, \dots, x_{j-1}) \ge \lambda_2^{(j-1)}\right\}$   (2.3)

where $u$ is defined as the random variable

$u = E(f(y) \mid x_1, \dots, x_j)$   (2.4)

and $\lambda_i^{(j-1)}$, $i = 1, 2$, are constants to be chosen such that

$P(\omega_{j-1}^*) = \prod_{t=1}^{j-1} \alpha_t$.   (2.5)
Proof. This involves the maximization of

$\int_{\omega_{j-1}} \left[\int_{k_j}^\infty u\, dF(u \mid x_1, \dots, x_{j-1})\right] dF(x_1, \dots, x_{j-1})$

subject to

$\int_{\omega_{j-1}} dF(x_1, \dots, x_{j-1}) = \alpha_1 \cdots \alpha_{j-1}$

and

$\int_{\omega_{j-1}} \left[\int_{k_j}^\infty dF(u \mid x_1, \dots, x_{j-1})\right] dF(x_1, \dots, x_{j-1}) = \alpha_1 \cdots \alpha_j$.

Now an application of Neyman-Pearson's lemma results in a region $\omega_{j-1}^*$ as given by (2.3) and (2.5).
As can be seen, these selection indices become very complicated if the number of stages is large. We will, from here on, consider the problem of two stage selection. We denote the regression of $y^i$ on $x_j, x_{j-1}, \dots, x_1$ by $\eta_j^i$. The superscript is dropped if there is only one criterion variable under discussion. In the following, we will consider the case when one desires to maximize a certain probability.
2.1. Maximization of probability (single criterion)
The objective here is the same as that in Raj ((1954), Theorem 1) but selection is performed in two stages. Raj noted that under the assumption of multivariate normality, this objective was equivalent to that of maximizing the mean. This, however, is not the case in two stage selection. For simplicity, we will consider the case when there is only one criterion variable; similar results, in principle, can be stated for the case when we have a vector of criterion variables. Let $y_0$ be the cut-off point for the criterion variable $y$. We define

$P_{y_0|x} = P(y \ge y_0 \mid x)$.   (2.6)
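As a worked special case (added here for orientation; the original moves directly to the theorems), suppose $y$ given $x$ is normal with mean $\eta_x$ and constant variance $\sigma^2$, so that

$P_{y_0|x} = P(y \ge y_0 \mid x) = \Phi\!\left((\eta_x - y_0)/\sigma\right)$,

which is increasing in $\eta_x$. Thresholding $P_{y_0|x}$ is then equivalent to thresholding $\eta_x$, which is why, under the normality assumption (2.12) below, selection rules stated in terms of $P_{y_0|x}$ reduce to rules of the form $\eta_2 \ge k_2^*$.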
We have the following theorem for the optimum selection index at the second stage. The proof is by choosing $f(y)$ as the characteristic function of the set $\{y: y \ge y_0\}$ in Lemma 2.1.

Theorem 2.1. For the objective of maximization of $P(y \ge y_0)$ in the two stage procedure, where $y_0$ is a known point on the $y$-axis, the optimum region at the second stage is

$\omega_2^* = \{x: P_{y_0|x} \ge k_2\}$   (2.7)

where $k_2$ is such that

$P(P_{y_0|x} \ge k_2) = \alpha_1\alpha_2$,   (2.8)
whatever may be the choice of $\omega_1^*$ satisfying $P(\omega_1^*) = \alpha_1$; therefore the selection index is $P_{y_0|x}$.

Similarly, with the same substitution in Lemma 2.2, we have the following theorem:
Theorem 2.2. Given that the optimum region for stage 2 is as in (2.7) and (2.8), the optimum region at stage 1 is

$\omega_1^* = \left\{x_1: \int_{P_{y_0|x} \ge k_2} P_{y_0|x}\, dF(x_2 \mid x_1) + \lambda_1 \int_{P_{y_0|x} \ge k_2} dF(x_2 \mid x_1) \ge \lambda_2\right\}$   (2.9)

where $\lambda_1$ and $\lambda_2$ are chosen so that

$P(\omega_1^*) = \alpha_1$,   (2.10)

$P(\omega_2^*) = \alpha_1\alpha_2$.   (2.11)

Results become more definite and computationally simpler if we assume that all the variables are jointly distributed as multivariate normal with zero mean. For notation's sake, we write

$(y, x')' = (y, x_1', x_2')' \sim N(0, \Sigma)$   (2.12)

where

$\Sigma = \begin{pmatrix} \sigma_{yy} & \Sigma_{yx_1} & \Sigma_{yx_2} \\ \Sigma_{x_1 y} & \Sigma_{x_1 x_1} & \Sigma_{x_1 x_2} \\ \Sigma_{x_2 y} & \Sigma_{x_2 x_1} & \Sigma_{x_2 x_2} \end{pmatrix}$   (2.13)

is assumed to be known. We formally state the result in the following:
Result 2.1. Suppose a two stage selection strategy is followed with intensities of selection $\alpha_1$ and $\alpha_2$, respectively, and with the objective of finding a second stage selection region $\omega_2^*$ where $P(y \ge y_0)$ is maximum. Under the assumption in equation (2.12), an optimum selection region for two stage selection is given by

$\omega_1^* = \left\{x_1: \int_{\eta_2 \ge k_2^*} P_{y_0|x}\, dF(\eta_2 \mid x_1) + \lambda_1 \int_{\eta_2 \ge k_2^*} dF(\eta_2 \mid x_1) \ge \lambda_2\right\}$,

$\omega_2^* = \{x: \eta_2 \ge k_2^*\}$,

where $\lambda_1$, $\lambda_2$ and $k_2^*$ are such that

$P(\omega_1^*) = \alpha_1$,  $P(\eta_2 \ge k_2^*) = \alpha_1\alpha_2$,

and $\eta_2$ is defined in the paragraph following the proof of Lemma 2.2.
Remark 2.1. Note that the choice of $\lambda_1$ and $\lambda_2$ may not be unique, leading to two different $\omega_1^*$.
2.2. Two stage selection for finding the most homogeneous part of the population

Here we state, without proof, the selection region when our objective is to find a region $\omega$ in the space of $x$ which is most homogeneous for a single criterion variable, in the sense that $E(y^2 \mid \omega)$ is minimum. It is assumed that the expected value of $y$ in the original population is zero.

Theorem 2.3. For the objective of minimization of $E(y^2)$ in two stage procedures, the optimum selection region is:

$\omega_2^* = \{(x_1', x_2')': V(y \mid x_1, x_2) + E^2(y \mid x_1, x_2) \le k_2\}$,   (2.14)

$\omega_1^* = \left\{x_1: \int_{(\omega_2^*)_{x_1}} [V(y \mid x) + E^2(y \mid x)]\, dF(x_2 \mid x_1) + \lambda_1 \int_{(\omega_2^*)_{x_1}} dF(x_2 \mid x_1) \le \lambda_2\right\}$,   (2.15)

where $(\omega_2^*)_{x_1}$ is the $x_1$-section of $\omega_2^*$, that is,

$(\omega_2^*)_{x_1} = \{x_2: (x_1', x_2')' \in \omega_2^*\}$,

and $k_2$, $\lambda_1$ and $\lambda_2$ are to be chosen such that

$P(\omega_1^*) = \alpha_1$,   (2.16)

$P(\omega_2^*) = \alpha_1\alpha_2$.   (2.17)
We can simplify the results (2.14) through (2.17) under the assumptions of normality, as given by (2.12) and (2.13). Under these assumptions, we have the following result:

Result 2.2. If $y, x_1, x_2$ are jointly distributed as in (2.12) and if a two stage selection strategy is followed with intensities of selection $\alpha_1$ and $\alpha_2$, respectively, then an optimum selection region in the space of $x$, which is most homogeneous for the single criterion variable $y$, is given by

$\omega_2^* = \{x: -k_2^* \le \eta_2 \le k_2^*\}$,

$\omega_1^* = \left\{x_1: \int_{-k_2^*}^{k_2^*} [V(y \mid x) + u^2]\, dF(u \mid x_1) + \lambda_1 [\Phi(t_2) - \Phi(t_1)] \le \lambda_2\right\}$,

where $\eta_1$ and $\eta_2$ are defined in the paragraph following the proof of Lemma 2.2, $u$ denotes $\eta_2$ regarded as a random variable given $x_1$,

$t_1 = (-k_2^* - \eta_1)/\sigma_0$,  $t_2 = (k_2^* - \eta_1)/\sigma_0$,  $\sigma_0^2 = \mathrm{var}(\eta_2 \mid x_1)$,

and where $k_2^*$, $\lambda_2$ and $\lambda_1$ are determined so that

$P(\omega_1^*) = \alpha_1$,  $P(\omega_2^*) = \alpha_1\alpha_2$.

It may be noted that Remark 2.1 (regarding the nonuniqueness of the first stage optimum region) also applies here.
3. Two stage selection with restrictions on other criterion variables
Certain problems of restricted prediction have been considered by Rao (1964) and Khattree (1987, 1988a) in the case of unistage selection. In this section, we generalize some of these ideas for a two stage selection strategy. Let $y' = (y_1, \dots, y_p)$ be the vector of criterion variables, and suppose we want to maximize the mean of $y_1$ subject to the conditions that the means of $y_2, \dots, y_p$ do not fall below certain specified levels $s_2, \dots, s_p$. As earlier, we assume that the proportions selected at the two stages are predetermined and are $\alpha_1$ and $\alpha_2$, respectively. Further, we denote by $\eta_i$, $i = 1, 2$, the vector $(\eta_i^1, \eta_i^2, \dots, \eta_i^p)'$. The next theorem provides the optimal selection region at stage 2 for this purpose. The proof follows from an application of the generalized Neyman-Pearson lemma.

Theorem 3.1. With the notations as described earlier, the optimal selection region at stage 2 is given by

$\omega_2^* = \{(x_1', x_2')': b'\eta_2 \ge k_2\}$   (3.1)

where $k_2$ is such that

$P(\omega_2^*) = \alpha_1\alpha_2$,   (3.2)

whatever may be the choice of $\omega_1^*$ satisfying $P(\omega_1^*) = \alpha_1$. The vector $b$ in (3.1) is to be determined so that the restrictions on the criterion variables $y_2, y_3, \dots, y_p$ are satisfied.

Given the region $\omega_2^*$ as above, the selection region at the first stage will be given by the following theorem.
Theorem 3.2. With $\omega_2^*$ as given in (3.1), an optimal region at stage 1 is given by

$\omega_1^* = \left\{x_1: \mu' I + \lambda_1 \int_{b'\eta_2 \ge k_2} dF(\eta_2 \mid x_1) \ge \lambda_0\right\}$   (3.3)

where $I$ is the vector of integrals defined by

$I = \int_{b'\eta_2 \ge k_2} \eta_2\, dF(\eta_2 \mid x_1)$,   (3.4)
$k_2$ is such that (3.2) holds good, and $\mu$, $\lambda_1$, $\lambda_0$ are chosen so that all the restrictions are satisfied and that $P(\omega_1^*) = \alpha_1$.

The regions described in the above two theorems may be of some practical utility if (3.1) and (3.3) can be further simplified. We do so under the assumption of multivariate normality of all the variables. In case all the variables are distributed as multivariate normal as given in (2.12) and (2.13), then following Rao (1964, pp. 40-43), $b$ can be found as the solution to the problem:

max $\Lambda_{\eta_2(1)} b$   (3.5)

subject to

$Z(k_2)\,\Lambda_{\eta_2(i)} b \ge s_i$, $i = 2, \dots, p$,  $b'\Lambda_{\eta_2} b = 1$,

where

$Z(k_2) = \frac{1}{\sqrt{2\pi}} \int_{k_2}^\infty t\, e^{-t^2/2}\, dt$,

$\Lambda_{\eta_2} = D(\eta_2)$ and $\Lambda_{\eta_2(i)}$ is the $i$-th row of $\Lambda_{\eta_2}$. Problem (3.5) can easily be solved using the algorithm described in Khattree (1987). To simplify (3.3), it is enough to evaluate the integrals in $I$; these integrals have been evaluated in Khattree (1985). In this case, we obtain the following result:
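Problem (3.5) is a linearly constrained maximization with a quadratic normalization, and Khattree (1987) gives a dedicated algorithm for it. As a rough numerical alternative (a sketch added here, not the paper's algorithm), one can hand it to a general-purpose nonlinear programming routine. All numbers below (the matrix $\Lambda_{\eta_2}$, the levels $s_i$ and the value of $Z(k_2)$) are assumed for illustration, with $p = 3$.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed illustrative inputs.
Lam = np.array([[1.0, 0.4, 0.2],     # Lambda_{eta_2} = D(eta_2)
                [0.4, 0.8, 0.3],
                [0.2, 0.3, 0.6]])
s = np.array([0.1, 0.05])            # required levels s_2, ..., s_p
Zk2 = 0.8                            # assumed value of Z(k_2)

objective = lambda b: -(Lam[0] @ b)  # maximize Lambda_{eta_2(1)} b

constraints = (
    # Z(k2) * Lambda_{eta_2(i)} b >= s_i, i = 2, ..., p
    {"type": "ineq", "fun": lambda b: Zk2 * (Lam[1:] @ b) - s},
    # normalization b' Lambda b = 1
    {"type": "eq", "fun": lambda b: b @ Lam @ b - 1.0},
)

res = minimize(objective, x0=np.ones(3) / np.sqrt(Lam.sum()),
               constraints=constraints, method="SLSQP")
print("b =", res.x, " objective =", -res.fun, " success:", res.success)
```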
Result 3.1. Under the assumption of multivariate normality, in addition to the conditions described in Theorems 3.1 and 3.2, an optimum two-stage selection region is given by the first stage region (3.3), together with

$\omega_2^* = \{x: b'\eta_2 \ge k_2\}$,

with

$\eta_2 = E(y \mid x_1, x_2)$,  $\eta_1 = E(y \mid x_1)$,  $b'\Lambda_{\eta_2} b = 1$,

and $I$ as defined in (3.4). The coefficients $k_2$, $\lambda_1$, $\lambda_0$ and $\mu$ are chosen so that all restrictions and size conditions are satisfied.

As can be seen, the optimum region given by the above result is very complex and will be difficult to compute. As an alternative, one may try the following selection regions:

$\omega_1^* = \{x_1: c'\eta_1 \ge k_1\}$,  $\omega_2^* = \{(x_1', x_2')': b'\eta_2 \ge k_2\}$,

where $c$ and $b$ are to be chosen so as to satisfy

$P(c'\eta_1 \ge k_1) = \alpha_1$  and  $P(b'\eta_2 \ge k_2) = \alpha_1\alpha_2$.

4. Other two stage selection indices

Rao (1964) suggested the use of various other two stage selection indices, which are intuitive in nature but may be appealing for practical applications. Another suggested selection region may be of the type $\zeta_2 \ge k_2$ on the $\zeta_2$ axis and $\zeta_1 \ge a$, $\eta_1 \ge b$ in the $(\zeta_1, \eta_1)$ space, where $\zeta_2$ is a given linear selection index which may not be optimum and $\zeta_1$ is the regression of $\zeta_2$ on $x_1$. The constants $k_2$, $a$ and $b$ are to be chosen so that the conditions on the selection intensities are satisfied and the mean improvement is maximum. In the following, we evaluate the mean improvement due to application of the indices proposed above, with intensities $\alpha_2$ and $\alpha_1$, respectively. Let us for simplicity denote $(y, \eta_1, \zeta_1, \zeta_2)$ by $(z_0, z_1, z_2, z_3)$ and let us also assume that

$z = (z_0, z_1, z_2, z_3)' \sim N_4(0, R^*)$,

where
$R^* = \begin{pmatrix} 1 & \rho_{01} & \rho_{02} & \rho_{03} \\ \rho_{10} & & & \\ \rho_{20} & & R & \\ \rho_{30} & & & \end{pmatrix}$

is the correlation matrix of $z$. $z_0$, taken as a linear combination of the other variables, will be
$z_0 = t_1 z_1 + t_2 z_2 + t_3 z_3 + \text{Error}$.

Computation of $E^*(z_0)$ directly may be difficult. Instead, we compute quantities $E^*(z_1 - B_1 z_{(23)})$ etc., where $B_1 = (\rho_{12}, \rho_{13})$ and $z_{(23)} = (z_2, z_3)'$. The other quantities are similarly defined.
The other quantities
NOW, z(23))
const.
=
-
exp
(RI -1’2(z1 - 4 24
dz, dz2 dz3
c
Lz3JJ
-~<4~2)-~1 = yl,
PII
say,
where
and, where N,(c), c2 101I) stands for the mass of the rectangle standard bivariate normal distribution. We can similarly evaluate the other E*(q-B,zc23))
= Y1,
terms.
E*(%-&z(31))
x1 > cl, x21 c2 for the
We then have = Y2,
E*(z3-&~(12))
= Y3.
leads Now substitution of B, and ~(23) etc. in the above and a little simplification to E*(z,, z2,z3)‘= (21- R)-' y where y’= (y,, y2, y3). Therefore, the mean improvement in y=zo would be t’(21- R)-‘y, where k2, a and b are to be found such that
$\int_{k_2}^\infty dF(\zeta_2) = \alpha_1\alpha_2$  and  $\int_b^\infty \int_a^\infty dF(\eta_1, \zeta_1) = \alpha_1$,

and where $t = (t_1, t_2, t_3)'$. Comparison of the mean improvement due to the index given here and those suggested by Rao will throw some more light on the relative merits of these indices. A theoretical comparison, however, seems impossible. Through simulation, one can determine the relative advantages of using any of the indices discussed above in particular cases.
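The simulation comparison suggested above can be organized along the following lines. This is a minimal sketch (not from the paper) under an assumed four-variate normal correlation matrix $R^*$ for $(z_0, z_1, z_2, z_3) = (y, \eta_1, \zeta_1, \zeta_2)$: it estimates the mean improvement $E(y \mid \text{selected})$ for the two stage rule $\eta_1 \ge b$, $\zeta_1 \ge a$ followed by $\zeta_2 \ge k_2$, with the constants tuned empirically so that the intensity conditions hold approximately.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha1, alpha2 = 0.4, 0.5
n = 400_000

# Assumed correlation matrix R* of z = (z0, z1, z2, z3)' = (y, eta1, zeta1, zeta2)'.
R_star = np.array([[1.0, 0.6, 0.5, 0.7],
                   [0.6, 1.0, 0.6, 0.5],
                   [0.5, 0.6, 1.0, 0.6],
                   [0.7, 0.5, 0.6, 1.0]])
z = rng.multivariate_normal(np.zeros(4), R_star, size=n)
z0, z1, z2, z3 = z.T

# Stage 1 region {eta1 >= b, zeta1 >= a}: for the sketch take a common cutoff c = a = b,
# chosen by bisection so that the first stage retains about alpha1 of the population.
lo, hi = -5.0, 5.0
for _ in range(60):
    c = (lo + hi) / 2
    if np.mean((z1 >= c) & (z2 >= c)) > alpha1:
        lo = c
    else:
        hi = c
stage1 = (z1 >= c) & (z2 >= c)

# Stage 2: among stage-1 selections keep the top alpha2 fraction of zeta2, so that the
# final region has size approximately alpha1 * alpha2.
k2 = np.quantile(z3[stage1], 1 - alpha2)
final = stage1 & (z3 >= k2)

print("size (target, realized):", alpha1 * alpha2, final.mean())
print("mean improvement in y  :", z0[final].mean())
```

Repeating the same calculation for a competing index (for instance the simpler regions of Section 3) on the same simulated data gives the kind of empirical comparison the text has in mind.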
Acknowledgements

The author wishes to thank Profs. C.R. Rao, M.B. Rao and the three anonymous referees for their extremely helpful comments and suggestions, which improved the presentation of the paper considerably.
References

Cochran, W.G. (1951). Improvement by means of selection. Proc. 2nd Berkeley Symp. Math. Statist. and Probab., 449-470.
Curnow, R.N. (1961). Optimal programmes for varietal selection. J. Roy. Statist. Soc. Ser. B 23(2), 282-318 (with discussion).
Finney, D.J. (1984). Improvement by planned multi-stage selection. J. Amer. Statist. Assoc. 79, 501-509.
Khattree, R. (1985). Some contributions to the statistical theory of prediction: Selection indices and their construction. Ph.D. Thesis, University of Pittsburgh.
Khattree, R. (1987). On selection with restriction. Comm. Statist. - Simulation Comput. 16(4), 1093-1103.
Khattree, R. (1988a). Inequality restricted linear selection indices. Comm. Statist. - Theory Methods 17(9), 2959-2980.
Khattree, R. (1988b). Asymmetric selection indices. Comm. Statist. - Theory Methods 17(11), 3981-3993.
Noda, K. (1985). Optimal construction of a selection of a subpopulation. Ann. Inst. Statist. Math. 37, 415-435.
Raj, D. (1954). On optimum selections from multivariate populations. Sankhya 14, 363-366.
Rao, C.R. (1964). Problems of selection involving programming techniques. In: Proc. of IBM Sci. Comput. Symp. on Statist., 29-51.
Rao, C.R. (1971). Advanced Statistical Methods in Biometric Research. Wiley/Hafner, New York.
Rao, C.R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.