Error rates for classifying observations based on binary and continuous variables with covariates

Error rates for classifying observations based on binary and continuous variables with covariates

COMPUTATIONAL STATISTICS & DATAANALYSIS ELSEVIER Computational Statistics & Data Analysis 21 (1996) 627-645 Error rates for classifying observations...

1MB Sizes 1 Downloads 73 Views

COMPUTATIONAL STATISTICS & DATAANALYSIS ELSEVIER

Computational Statistics & Data Analysis 21 (1996) 627-645

Error rates for classifying observations based on binary and continuous variables with covariates Chi-Ying Leung Department of Statistics, The Chinese Universityof Hong Kong, Hong Kong Received June 1994; revised May 1995

Abstract The location linear discriminant function for classifying an observation consisting of both binary and continuous variables is extended to the case where parts of the continuous data are generated from the covariates common to all the underlying populations. Asymptotic expansion of the distribution of the studentized location linear discriminant adjusted for covariates under each population is used to construct the plug-in estimator of the expected error rate. Performance of the plug-in rule is evaluated by both the plug-in estimator of the overall expected error rate and the cross-validation estimator of the overall actual error rate based on the leave-one-out method of Lachenbruch and Mickey (1968) in a simulation study. Results of a Monte Carlo study are presented.

Keywords: Location linear discriminant adjusted for covariates; Plug-in estimator; Cross-validation; Simulation

1. Introduction

In this article, we consider estimation of the overall expected and actual error rates for classifying an individual into one of two given populations Ht and /I2 based on both binary and continuous variables in the presence of continuous covariates. The problem of classification based on mixed discrete and continuous variables without covariates has drawn a lot of attention recently. The references in McLachlan (1992) have listed some of the latest advances on this topic. To trace the development of the problem, refer in particular to Krzanowski (1975), Tu and Han (1982), Vlachonikolis (1985, 1990) and more recently Leung (1994a). The framework of our investigation is based on the location model of Olkin and Tate (1961) 0167-9473/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved SSDI 0 1 6 7 - 9 4 7 3 ( 9 5 ) 0 0 0 3 6 - 4

628

C-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

modified for classification by Krzanowski (1975) and generalized by Leung (1994b) to the case where continuous covariates are present. T o formulate the problem, let U' = (~Z ' , Y') where Z ' = (Z1 ... ,Z,), r = 2 e and Y = (Y1, Y2, ..., Yp+q). Here Z denotes the incidence vector corresponding to the r possible states of : binary variables X ' = (X~, X2 ..., X:). Each Zs takes the value one for a particular state of each of X~, . . . , X : and takes the value zero elsewhere, for s = 1,2, ..., :. The probability function for Z = z under/7~ is given by f ( z IPl) = l--I p~,'~ where p~ = (pll, p2i,... ,p•i), m=l

O < prni= E(ZmlHi) < l,

i

zm= l,

m=l

~" Prni = l rn=l

and

YIHi,

Zm=l,

ZR----0,

m~k=l,

2,...,r,,,Np+q

,

,

where Y' = ( ytl)', yt2)'), y t l r = ( Y 1 , . . . , Yp) are the major discriminators consisting of the first p continuous variables with E(Ytl)l/7i, Z,n = 1, Zk----0, m ~ k = 1, ..., r) = / ~ i and yt2)' = (yp+ 1 , . . . , Yp+q) are the covariates consisting of the last q continuous variables with E(y(2) I/7/, Zrn = 1, Z k = 0, m ~ k -- 1,..., r) -- 2, k, m -- 1, ..., r; i = 1, 2. Z is nonsingular and partitioned as P

q

P[Ell

2712]

q

Z21 Z2z "

2 is k n o w n and is assumed to be zero without loss of generality. With k n o w n priors for/71 a n d / 7 2 , assuming all the parameters are k n o w n and Z , = 1, the Bayes rule allocates U t o / / i if and only if U m > t, where t ~ ( - - ~ , + ~ ) and Urn ~_ [y(1)

_ fly(2) _ ½(/3,m 1 .jr_ ~rn2)]tS ~-.21(~ml _ ~-~m2) -- log(prn2/Pml)

for m = 1, 2 , . . . , r where fl -- 271227221 and 271.2 -- 2711 - f12722fl'. W h e n the parameters are unknown, we assume that data based on large independent training samples of size nl and n2 from H1 and/72 respectively are available. Suppose that nrn~observations belong to location m f r o m / 7 , m = 1, 2, ..., r; i = 1, 2. To construct a sample-based allocation rule in the absence of continuous covariates, each of the three competitive approaches, namely the estimative approach in Krzanowski (1975), the testimative approach in Krzanowski (1982) and the predictive approach in Vlachonolis (1990), can be adopted depending on the required computational effort. Amongst these three approaches, the estimative approach is more appealing because it is less c o m p u t i n g intensive. In the absence of binary variables. Leung (1983) has shown that the estimative allocation rule obtained by plugging in covariates adjusted unbiased estimate formed by the continuous variables has asymptotically smaller

629

C.-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

overall expected error rates. When binary variables are present, the classification between H1 a n d / / 2 can be broken down into several locations, each corresponding to the observed values of the binary variables. Covariates adjusted unbiased parameter estimates in each location can be constructed in the usual manner, thereby a sample-based allocation rule is constructed. To be specific, given Zm = 1, m = 1,..., r, the sample-based allocation rule, say 0m, is constructed by substituting in Um the following estimates, namely /~mi= n=i/n~, /~i = f ~ ~ 7(2) mi Z ' j = I .vth)., %mj,, . mj,! Y m i = (v17(1 ~ = n..-1,-.,., h = 1, 2, with Y~,ji: (y~],, y(2),~ and "-' _ ) ' ]7(2),~ i = 1, 2; m = 1,...,r; _

L

£ = n*-'

/YlI,

2 n~

2 2 (Y-J,-

f.i)(Y.j,_ _

Y.i)-'

-

m=l i = l j = l

P

=,*-' :L

q

m=t

q

F{"

Lz I

with

n*=~,~(n,l+n,2-2)=nl+n2-2r, ]~=,~12,~22 ~ and Z I . 2 = n * (~,11 - F.12Z221F.21)/n where n = n* - q. The plug-in classification rule, say Urn, is the covariates adjusted version of the usual location linear discriminant function. Given Zm = 1, U is allocated to H1 if and only if j~y(2) _ ½(/~1 + / ~ , 2 ) ] ' Z i - . 2 1 (/~,1 - - / & , 2 ) - - l o g ~ m 2 / / ~ . l ) m = 1,2,..., r

0., = [Y °) -

> t,

for the cut-off point t, t e ( - ~ , + ~ ) .

2. Distribution of studentized 0m In this section, we consider the distribution of 0= under/-/i, m = 1,..., r; i = 1, 2. To facilitate the derivation of the two probabilities of misclassification associated with 0~, for location m, m = 1, 2, ..., r, an asymptotic expansion of the distribution up to the second-order term of z]~, 1(0 m "~ (--1)i,d~/2 + Iog~m2//~ml)) where A2 = (/~1 - / ~ 2 ) ' , ~ 1 . 2 ( / ~ , - / ~ 2 ) , Am = (z/2) 1/2 for i = 1, 2; m = 1, 2 , . . . , r , is derived. The result is given in Leung (1994b). The two probabilities of misclassification can be derived by using a similar technique. Detailed derivation can be found in the appendix in Leung (1994b) and will be ski~ped. The result is stated in the following: --1

A

^

Theorem (Leung, 1994b). Suppose that the followino conditions hold: C1"

p

lim

nm2/nml = k,. > O, m = 1, 2,...,r.

n l , n2--~ oo

C2:

p lim n~l/nml=ks.m>O,m,s=l,2,...,r. n l ---~ 0 0

C-Y. Leung/ ComputationalStatistics& DataAnalysis21 (1996)627-645

630

Let e,m(t) = P r { ( - 1)'Urn > ( - 1)'tlH,}, i = 1, 2, c = q/(n - 1), a = [(n - 1)/ (n + q - 1)] 1/2 and let p lim.,..2-, oo n2/nl = k > 0. Then, given Zm = 1, for m = 1, 2, . . . , r, el=(t) = ~(almt) + n-l(blm

(1) (2)

' + dl,.t)~b(almt) + O(n-2),

e2m(t) = 4~(a2=t) + n-l(b2.,t + d2,.,)dp(a2m,) + O(n-2), where al.,, = aA= 1 t + log(Pm2/Pml) --

k)A~,lP~,~(1- Pro1){ a3 } 2 a + --f -- a3A~n2[t + log(Pm2/Pml)]

b l . , = (1 +

--

dlmt =

(1

+ k-1)Amlpm21(1Pm2){a 2

I - 1 p -~

a3 } -2 + a3Am2[t + log(p,n2/P.1)]

3a%-1 + c)[3( 1 + k)pm~ - (1 + k-1)p~d] -~ -JaA. + ( p - 1)a(1 4Am

-- a(1 + c)[t + log(p=2/pml)]

~" 3 ( p - - 1)

~2(i-77)-A= +

(p-

3)

- 2A3., -

x [(1 + k)p~,~ + (1 + k-~)Pm2~]

3a%-s- "~ a3(l + c)[t +log(p.,2/P=l) - A-~-~]

4(1 + c)a.,J

2a.,

a4c_d~ (1 + k)pa# + (1 + k-a)p~,21 x (8(1 + c) + 4 + ,4,7,2 (1 + k - ' ) p g ~ - ( 1 + k)p~,~

a'ca=, l 20¥7)3

x [t + log(p,~2/pml)-I + d,~4I(1 + k)p~,~ + (1 + k- l )p~,1 +

+ 207c)J I-ta%aa= -1

+

2,4 2 (1 + c)

log(Pm2/Pral)]2}

a2mt = --aAmXlt+log(pm2/Pml)+d-~ ] b2mt

=

(I + k-i)AT"lP~2t(1- p'2) { a3 aa } 2 a + -~ + - ~ [ t + log(p.,2/pml)] --

(l + k)ATnIp~'21(1-- Pml) { 2 a

a3 2

03 } -~ [t + log(p.2/P.1)]

C.- Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

d2,,n I . p - 1 =

4

631

3a4c -]

+ c)I-3(1 + k-~)pT~2~ - ( 1 + k)p[,(] -~ J a A . + ( p - 1)a(1 4Am

3 ( p - 1)

+ a(1 + c)[t + log(p,.2/pm~)] 2-O-+c)-A,. +

(p - 3) 2A 3

x [(1 + k)p~,~ + (1 + k-~)p~,~]

3a4c } a3(1-Fc)[t+log(Pm2/Pml)q-d-d---~-'2] 2A.

4(1 + c)Am +

(1 + k)p~,~ + (1 + k-~)p~,2~

×

8(1 + c) +

+ a~

a4cA21

(1 + k-1)p;d - (1 + k)p;.; + 2-6 ¥ c ) J

x [t + log(pm2/Pmt)] + A,:4[(1 + k)p~,~ + (1 + k-~)p~,2~ 2A2m aacA 2 -] + ~ + 2-0 ~ c)J It + l°g(P'2/Pml)]2

}

and ¢~(. ) and ~b(') denote respectively the cumulative distribution function and the probability density function for the standard normal distribution.

3. Overall expected and actual error rates The result stated in the above theorem can be used to derive the overall expected error rate associated with the estimative classification rule. Assuming equal prior probabilities for H1, a n d / / 2 , the overall expected error rate for the cut-off point t, t e ( - oo, + ~), is given by

~.(t) = ~"

p.,{~(a,.,) + n-l(b,., +

pmiei,~(t)/2 ~--

i=lm=l

a,.,)ep(a,.,)}/2,

i=lm=l

where ain,t, bi,~,, dim,, ~b(a~mt),~b(ai~), m = 1, 2, ..., r; i = 1, 2, are given in the above theorem. For varying, t, ~(t) gives a direct measure of the long-term performance of the classification rule for the cut-off point t. In practice, it is also of primary concern to investigate how well the rule performs in classifying an observation given the training samples. An appropriate measure in this regard is the overall actual error rate given by

"i

6a(t) = ~.

i=lm=l

Pmieaim(t)/2,

C.-Y Leung/ ComputationalStatistics& DataAnalysis21 (1996)627-645

632 where

e~im(t) = P r { ( - 1)iUm >

(-1)'tl/-/,,/~ml,/~.2, ~ml, ]~n2, £1.2, £11, £12},

m = 1,2,...,r; i = 1,2. Notice that

e.~,.(t) = ~(Dm I It -+- log(Pm2/Pml) -- ~ral]) ea2,.(t) = ~(Dm a [ - t - l o g (Pm2/Pml) + (,.2]) where

~mi

[/J'mi -- ½(~ml + ~ra2)]'£12

(//'ml --

]~n2)

and D . = [(/ira1-- ~m2) ^ ,~V'l.2~V'1.2~V'1.2(/2,ml ^-1 ^-x ^ -- /~m2)] 1/2,

i = 1, 2; m = 1, 2,..., r.

Hence •a(t) = ( m ~= l Pmx~(Dml[t q- log(~m2/~ml)--~ml]) +m=l~ P m z ~ ( D m l [ - - t - - l o g ( P m 2 / P m x ) + ~ m 2 ] ) ) / 2 " Estimates of both ~(t) and E,(t) provide a direct assessment of the performance of the classification rule. The traditional plug-in estimator $(t) of 6(t) given by b(t) = 2 2 lyre= " l pmle,.i(t)/2, ^ ^ ^ where e,.i(t) is obtained by plugging in suitable estimates in expressions (1) and (2) omitting the remainder term, has been recommended as reviewed in the literature. See McLachlan (1992, Ch. 10, pp. 367-370). In contrast, to estimate ~.(t), the leave-one-out cross-validation estimator advocated by Lachenbruch and Mickey (1968) rather than the simple biased estimator given by ^

A--

1

^

^

(m~=l pml(~2)(A,. [ t + log(Pm2/Pma)-- ^A-~-~I)

is preferred. The leave-one-out cross-validation estimator ecv.(t) of ~.(t) which adjusts the apparent bias is constructed as follows. For m = 1, 2,..., r, let tQm(i,j ) denote the value computed to test-classify the j t h observation from the ith sample in location m based on the plug-in classification statistic constructed from the remaining nx + n2 - 1 observations in the training sample for j = 1, 2,..., n,.i; i = 1, 2; m = 1, 2,..., r. Then tlm~

~,~m(t) = n,:~1 )-', I{i-1)'o,<,,j~>{-~l',/, i = 1, 2,

j=l

C-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

633

where

IE=

1

event E occurs,

0,

otherwise. "

=

^

^

x E~ = x pm~eolm(t)/2 = Y 2= I Ri/2ni w h e r e

~

nm~

nl =

~ I{{-1)'O.ti,j~>{-,)'t} ra=l j=l

= t o t a l n u m b e r o f m i s c l a s s i f i e d o b s e r v a t i o n s i n H~, i = 1, 2.

Remark: It should be emphasized that the training samples based rule 0,,, m = 1, 2, ... , r , t h e r e s u l t i n t h e t h e o r e m as well as t h e c o n s t r u c t i o n o f ~cva(t) s u g g e s t e d a b o v e r e l y o n t h e a s s u m p t i o n t h a t b o t h nx a n d n2 a r e r e a s o n a b l y l a r g e

Table 1(a) Values of asJo, bm~o,dmio, i = 1, 2; m = 1, 2, and the leading term L(0) and correction term C(0) in #(0) for p = 4, q = 1; 0 = 0.25, 0.50, 1.00, 1.50, and N = 400, 800, 1200

0

1.50

400 800 1200

1.00

400 800 1200

0.50

400

a120

a210

a220

dllo

d120

d210

b11o

blzo

bzlo

b22o

L(0)

C(0)

~(0)

d220

- 1.267 0.976 -1.268 0.977 - 1.268 0.978

-1.729 -1.684 0.020 0.003 -1.730 -1.685 0.021 0.004 - 1.730 -1.686 0.021 0.004

-1.312 0.637 -1.313 0.638 -1.313 0.638

6.777 0.063 6.777 0.063 6.777 0.063

6.331 0.002 6.334 0.001 6.335 0.001

5.687 0.065 5.690 0.064 5.690 0.064

-0.653 1.320 -0.653 1.322 -0.653 1.322

-1.345 -1.278 0.174 0.080 -1.345 -1.279 0.176 0.081 1.346 -1.279 0.176 0.081

-0.719 0.881 -0.720 0.882 -0.720 0.882

6.567 0.148 6.564 0.148 6.564 0.148

6.396 0.004 6.398 0.002 6.399 0.001

5.633 0.152 5.635 0.150 5.636 0.149

- 1.192 - 1.058 1.901 0.966 -1.192 -1.059 1.907 0.969 -1,193 -1.059 1.909 0.970

0.060 0.956 0.060 0.956 0.060 0.956

6.106 0.277 6.098 0.277 6.095 0.277

13.383 0.007 13.385 0.003 13.386 0.002

10.184 0.284 10.185 0.280 10.185 0.279

5.037

- 1.634 - 1.368 0.868 16.232 8.382 -4.540 - 1.635 - 1.368 0.869 16.267 8 . 4 0 1 -4.551 - 1.636 - 1.369 0.869 16.279 8.407 -4.555

- 50.553 0.338 - 50.586 0.388 - 50.596 0.338

99.809 0.010 99.835 0.005 99.844 0.003

55.069 0.348 55.078 0.343 55.080 0.341

- 20.972

0.193 1.087 0.193 1.087 0.193 1.087

800 1200 0.25

allO N

400 -

800 -

1200 -

1.135 10.255 1,136 10.279 1.136 10.286

5.632 5.663 5.633

5.232 5.232

5.031 5.029

- 20.988 - 20.933

634

C.-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

Table 1(b) Values of am~o,bmio,dmio, i = 1, 2; m = 1, 2, and the leading term L(0) and correction term C(0) in 4(0) for p = 5, q = 1; 0 = 0.25, 0.50, 1.00, 1.50, and N = 400, 800, 600

0

1.50

400 800 1200

1.00

400 800 1200

0.50

400 800 1200

0.25

a~o

al20

a210

a220

dllo

dl20

d210

bll 0

b120

b210

b220

L(0)

C(0)

4(0)

d220

N

400 800 1200

- 1.267 0.976 - 1.268 0.977 - 1.268 0.978

- 1.729 - 1.684 - 1.312 0.020 0 . 0 0 3 0.637 - 1.730 - 1.685 - 1.313 0.021 0.004 0.638 - 1.730 - 1.686 - 1.313 0.021 0.004 0.638

8.443 0.063 8.442 0.063 8.442 0.063

7.832 0.002 7.836 0.001 7.837 0.001

7.095 0.065 7.098 0.064 7.099 0.064

-0.653 1.320 -0.653 1.322 -0.653 1.322

-1.345 -1.278 -0.719 0.174 0.080 0.881 -1.345 -1.279 -0.720 0.175 0 . 0 8 1 0.882 -1.346 -1.279 -0.720 0.175 0 . 0 8 1 0.883

8.199 0.148 8.196 0.148 8.195 0.148

8.265 0.005 8.268 0.002 8.269 0.001

7.273 0.153 7.276 0.150 7.276 0.149

- 1.192 - 1.058 1.901 0.966 - 1.192 - 1.059 1.907 0.969 - 1.193 - 1.059 1.909 0.970

6.019 0.277 6.010 0.277 6.007 0.277

18.976 0.009 18.976 0.004 18.976 0.003

14.367 0.286 14.367 0.281 14.367 0.280

131.441 0.012 131.449 0.006 131.452 0.004

76.269 0.350 76.266 0.344 76.265 0.342

0.193 1.087 0.193 1.087 0.193 1.087 1.135 -10.255 1.136 -10.279 1.136 - 10.286

0.060 0.956 0.060 0.956 0.060 0.956

-1.634 -1.368 0 . 8 6 8 -71.923 1 6 . 2 3 2 8 . 3 8 2 -4.540 0.338 --1.635 -1.368 0.869 -71.943 1 6 . 2 6 7 8.401 -4.551 0.338 -1.636 -1.369 0 . 8 6 9 -71.951 1 6 . 2 7 9 8 . 4 0 7 -4.555 0.338

7.033 7.034 7.034 6.559 6.556 6.556 5.286 5.280 5.278 -34.055 -34.064 -34.068

d u e to the large n u m b e r of u n k n o w n p a r a m e t e r s in a given l o c a t i o n m o d e l . W h e n nl a n d n2 a r e o n l y m o d e r a t e , s o m e o f the p a r a m e t e r e s t i m a t e s r e q u i r e d for the c o n s t r u c t i o n o f the p l u g - i n classification statistic m a y n o t be available. K r z a n o w s k i (1975) h a d a n i n - d e p t h i n v e s t i g a t i o n o f this p r o b l e m in the a b s e n c e o f c o n t i n u o u s c o v a r i a t e s . A t w o - s t a g e e s t i m a t i o n p r o c e d u r e is a d o p t e d to deal w i t h this situation. I n the first stage, the values of the b i n a r y v a r i a b l e s a r e used to f o r m t w o s e p a r a t e r - w a y c o n t i n g e n c y tables. T w o a p p r o p r i a t e s e c o n d - o r d e r l o g - l i n e a r m o d e l s are fitted to d a t a via i t e r a t i v e scaling as s u g g e s t e d in H a b e r m a n (1972) w h e r e b y a n e s t i m a t e of n,,,i, s a y n,,i, is o b t a i n e d a n d a n e s t i m a t e of P,,i is given by/~,,~ = ~,,,ffnu m = 1, 2 , . . . , r; i = 1, 2. I n the s e c o n d stage, for the t r a i n i n g s a m p l e s f r o m Hi, i = 1, 2, a n a p p r o p r i a t e full r a n k s e c o n d - o r d e r m u l t i v a r i a t e n o r m a l linear r e g r e s s i o n m o d e l o f Y o n the b i n a r y v a r i a b l e s X I , X2 . . . . . X t is fitted to d a t a . A n e s t i m a t e o f p~,~, say /~,i, is g i v e n b y the fitted v e c t o r r e s p o n s e /~(Y [H~, Zm = 1, Zk = 0,

635

C. - Y. Leung/ Computational Statistics & Data Analysis 21 (1996) 627-645

m # k = 1, 2 .... , r), i = 1, 2; m, k = 1, 2, ..., r. The estimate of Z, say ~, is obtained by pooling the two matrices of sums of squares and products of residuals after fitting the two regression models. The method can be easily applied to cope with the situation where continuous covariates are present. For details, see Leung (1994b).

4. A simulation study To appraise the performance of the training samples based covariates adjusted classification rules as well as the effect of covariates adjustment, a small-scale

Table 2(a) Values of e(t) (without brackets) and ~(t) (bracketed) in three decimal places based on a single simulation for q = 1, p = 4; 0 = _+ 0.25, _+0.50, _+ 0.50, _+ 1.0, _+ 1.5, N = 400, 800, 1200, and

t = 0.00, + 0.50, + 1.0 t 0

N - 1.0

-0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.087 (0.075) 0.069 (0.074) 0.079 (0.073)

0.064 (0.067) 0.058 (0.066) 0.076 (0.066)

0.068 (0.065) 0.072 (0.064) 0.056 (0.064)

0.074 (0.067) 0.068 (0.066) 0.079 (0.066)

0.093 (0.074) 0.071 (0.074) 0.073 (0.074)

- 1.00

400 800 1200

0.143 (0.177) 0.191 (0.175) 0.164 (0.175)

0.163 (0.158) 0.150 (0.156) 0.144 (0.156)

0.149 (0.151) 0.147 (0.149) 0.137 (0.149)

0.151 (0.158) 0.160 (0.156) 0.153 (0.156)

0.155 (0.177) 0.183 (0.176) 0.167 (0.175)

-0.50

400 800 1200

0.343 (0.339) 0.317 (0.338) 0.354 (0.337)

0.336 (0.299) 0.300 (0.296) 0.294 (0.295)

0.286 (0.284) 0.275 (0.280) 0.296 (0.279)

0.294 (0.299) 0.304 (0.296) 0.298 (0.295)

0.330 (0.340) 0.349 (0.339) 0.322 (0.339)

-0.25

400 800 1200

0.414 (0.421) 0.405 (0.421) 0.420 (0.421)

0.382 (0.366) 0.346 (0.362) 0.370 (0.361)

0.349 (0.347) 0.338 (0.342) 0.344 (0.341)

0.297 (0.369) 0.362 (0.366) 0.368 (0.365)

0.416 (0.432) 0.414 (0.432) 0.418 (0.432)

0.25

400 800 1200

0.400 (0.421) 0.387 (0.421) 0.414 (0.421)

0.382 (0.366) 0.358 (0.362) 0.359 (0.361)

0.314 (0.347) 0.336 (0.342) 0.331 (0.341)

0.355 (0.369) 0.334 (0.366) 0.364 (0.365)

0.399 (0.432) 0.397 (0,432) 0.409 (0.432)

0.50

400 800 1200

0.332 (0.339) 0.329 (0.338) 0.327 (0.337)

0.297 (0.299) 0.288 (0.296) 0.302 (0.295)

0.235 (0.284) 0.279 (0.280) 0.291 (0.279)

0.275 (0.299) 0.295 (0.296) 0.312 (0.295)

0.374 (0.340) 0.342 (0.339) 0.328 (0.339)

1.00

400 800 1200

0.158 (0.177) 0.169 (0.175) 0.182 (0.175)

0.163 (0.158) 0.123 (0.156) 0.153 (0.156)

0.162 (0.151) 0.141 (0.149) 0.146 (0.149)

0.168 (0,158) 0.144 (0.156) 0.152 (0.156)

0.135 (0.177) 0.198 (0.176) 0.160 (0.175)

1.50

400 800 1200

0.097 (0.075) 0.064 (0.074) 0.072 (0.073)

0.069 (0.067) 0.068 (0.066) 0.073 (0.066)

0.055 (0.065) 0.068 (0.064) 0.063 (0.064)

0.080 (0.067) 0.066 (0.066) 0.058 (0.066)

0.093 (0.074) 0.085 (0.074) 0.075 (0.074)

C-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

636

Table 2(b) Values of e(t) (without brackets) and ~(t) (bracketed) in three decimal places based on a single simulation for q = 1, p = 5; 0 = + 0.25, + 0.50, + 1.0, + 1.5, N = 400, 800, 1200, and t = 0.0, _+ 0.50, + 1.0 t

0

N - 1.0

-0.5

0.0

0.5

1.0

-1.50

400 800 1200

0.068 (0.075) 0.077 (0.074) 0.063 (0.074)

0.068 (0.068) 0.072 (0.067) 0.063 (0.066)

0.066 (0.065) 0.061 (0.064) 0.069 (0.064)

0.088 (0.068) 0.073 (0.067) 0.061 (0.066)

0.061 (0.075) 0.071 (0.074) 0.076 (0.074)

- 1.00

400 800 1200

0.182 (0.178) 0.174 (0.176) 0.166 (0.175)

0.158 (0.159) 0.164 (0.157) 0.160 (0.156)

0.137 (0.152) 0.156 (0.150) 0.150 (0.149)

0.138 (0.159) 0.165 (0.157) 0.157 (0.156)

0.182 (0.178) 0.166 (0.176) 0.190 (0.176)

--0.50

400 800 1200

0.320 (0.339) 0.329 (0.338) 0.342 (0.337)

0.278 (0.300) 0.281 (0.297) 0.290 (0.296)

0.274 (0.285) 0.290 (0.281) 0.271 (0.280)

0.319 (0.300) 0.271 (0.297) 0.312 (0.296)

0.314 (0.340) 0.386 (0.339) 0.350 (0.339)

-0.25

400 800 1200

0.440 (0.420) 0.404 (0.420) 0.424 (0.420)

0.294 (0.368) 0.356 (0.363) 0.319 (0.362)

0.295 (0.349) 0.321 (0.343) 0.335 (0.341)

0.361 (0.371) 0.388 (0.367) 0.379 (0.365)

0.391 (0.432) 0.429 (0.432) 0.456 (0.432)

0.25

400 800 1200

0.443 (0.420) 0.434 (0.420) 0.411 (0.420)

0.344 (0.368) 0.362 (0.363) 0.366 (0.362)

0.328 (0.349) 0.344 (0.343) 0.335 (0.341)

0.319 (0.371) 0.333 (0.367) 0.358 (0.366)

0.403 (0.432) 0.427 (0.432) 0.438 (0.432)

0.50

400 800 1200

0.326 (0.339) 0.310 (0.338) 0.335 (0.337)

0.308 (0.300) 0.259 (0.297) 0.297 (0.296)

0.279 (0.285) 0.288 (0.281) 0.296 (0.280)

0.263 (0.300) 0.275 (0.297) 0.286 (0.296)

0.379 (0.340) 0.340 (0.339) 0.335 (0.339)

1.00

400 800 1200

0.137 (0.178) 0.170 (0.176) 0.163 (0.175)

0.146 (0.159) 0.156 (0.157) 0.154 (0.156)

0.168 (0.152) 0.158 (0.150) 0.145 (0.149)

0.176 (0.159) 0.148 (0.157) 0.154 (0.156)

0.137 (0.178) 0.168 (0.176) 0.166 (0.176)

1.50

400 800 1200

0.065 (0.075) 0.070 (0.074) 0.083 (0.074)

0.057 (0.068) 0.063 (0.067) 0.067 (0.066)

0.073 (0.065) 0.066 (0.064) 0.061 (0.064)

0.065 (0.068) 0.060 (0.067) 0.066 (0.066)

0.066 (0.075) 0.066 (0.074) 0.076 (0.074)

s i m u l a t i o n s t u d y is c a r r i e d o u t t o c o m p a r e t h e v a l u e s o f ~(t) a n d ~cva(t) w i t h t h e i r t r u e p o p u l a t i o n v a l u e s f o r v a r i o u s c u t - o f f p o i n t s . F o r simplicity, o n l y the cases w h e r e d = 1, q = 1, p = 4, 5 are c o n s i d e r e d . It is a s s u m e d t h a t t h e l o c a t i o n p r o b a b i l i t i e s for H , a n d / 7 2 a r e Px, = 0.3, p21 = 0.7; P12 = 0.6, P22 = 0.4. F o r i = 1, 2, let nl o b s e r v a t i o n s be g e n e r a t e d f r o m / 7 i w i t h n , , o b s e r v a t i o n s f r o m p

N,+,\IL0 J' 1L0!

1 w i t h p r o b a b i l i t y P,,i,

m = 1,2;

637

C.-Y. Leung/ Computational Statistics & Data Analysis 21 (1996) 627-645

where lp is a p x I column vector consisting of ones and ~c t~0, .o' p - l J ,v

~,~1

t

m t

~(ov_l, - 0 ) ,

--

- 0 , ov_l ), '

1

m= 2

~m2

~

t

,

t

~ (or_l,0),

m= 1

m= 2

op_ 1 is a (p -- 1) x 1 column vector consisting of zeros. A total of N = nl + n2 (nl = n2 = n) observations from both/I1 and/I2 are generated. Values of ~(t), ~(t), ~a(t) and ~,~(t) for n = 200, 400, 600; 0 = +0.25, +0.50, + 1.00, + 1.50; t = 0.0, +0.5, + 1.0 are computed using the appropriate expressions in Sections 2 and 3. Findings based on a single simulation study indicate that values of ~(t), 3a(t) and ~¢v~(t) exhibit some variability. Values of ~(t) are fixed and can be computed in a rather straightforward manner by omitting the remainder terms in (1) and (2) in the

Table 3(a) Values of ec,a(t) (without brackets) and ~a(t) (bracketed) in three decimal places based on a single simulation for q = 1, p = 4; 0 = __+0.25, __+0.50, + 1.0, __+1.5, N = 400, 800, 1200, and t = 0.0, __+0.50, ___1.0 t

0

N

- 1.0

- 0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.070 (0.075) 0.076 (0.074) 0.068 (0.075)

0.058 (0.067) 0.068 (0.067) 0.059 (0.066)

0.068 (0.066) 0.069 (0.064) 0.051 (0.063)

0.070 (0.068) 0.073 (0.065) 0.059 (0.068)

0.080 (0.077) 0.080 (0.072) 0.063 (0.076)

- 1.00

400 800 1200

0.163 (0.166) 0.185 (0.177) 0.162 (0.175)

0.150 (0.161) 0.174 (0.156) 0.140 (0.155)

0.163 (0.151) 0.146 (0.149) 0.140 (0.149)

0.178 (0.159) 0.164 (0.159) 0.144 (0.155)

0.180 (0.169) 0.186 (0.174) 0.166 (0.177)

-0.50

400 800 1200

0.360 (0.337) 0.330 (0.322) 0.363 (0.346)

0.305 (0.305) 0.299 (0.300) 0.313 (0.298)

0.283 (0.284) 0.279 (0,278) 0.298 (0.277)

0.313 (0.303) 0.290 (0.301) 0.316 (0.296)

0.340 (0.330) 0.354 (0.340) 0.378 (0.330)

-0.25

400 800 1200

0.410 (0.415) 0.410 (0.422) 0.423 (0.425)

0.350 (0.375) 0.366 (0.362) 0.373 (0.361)

0.358 (0.345) 0.349 (0.341) 0.345 (0.343)

0.403 (0.347) 0.381 (0.358) 0.371 (0.360)

0.453 (0.420) 0.441 (0.426) 0.434 (0.423)

0.25

400 800 1200

0.398 (0.404) 0.411 (0.403) 0.413 (0.406)

0.358 (0.372) 0.351 (0.365) 0.348 (0.357)

0.310 (0.349) 0.323 (0.341) 0.338 (0.339)

0.340 (0.358) 0.361 (0.360) 0.356 (0.368)

0.370 (0.418) 0.439 (0.405) 0.435 (0.417)

0.50

400 800 1200

0.280 (0.341) 0.344 (0.338) 0.358 (0.330)

0.260 (0.267) 0.289 (0.296) 0.305 (0.302)

0.243 (0.283) 0.274 (0.280) 0.285 (0.278)

0.223 (0.299) 0.293 (0.298) 0.313 (0.295)

0.280 (0.365) 0.335 (0.341) 0.356 (0.336)

1.00

400 800 1200

0.208 (0.172) 0.153 (0.171) 0.163 (0.179)

0.170 (0.158) 0.130 (0.153) 0.138 (0.154)

0.170 (0.152) 0.126 (0.150) 0.134 (0.148)

0.178 (0.158) 0.135 (0.155) 0.143 (0.156)

0.188 (0.171) 0.161 (0.180) 0.175 (0.172)

1.50

400 800 1200

0.073 (0.076) 0.090 (0.072) 0.075 (0.070)

0.065 (0.067) 0.078 (0.066) 0.069 (0.066)

0.063 (0.065) 0.076 (0.064) 0.065 (0.064)

0.058 (0.068) 0.073 (0.068) 0.073 (0.065)

0.073 (0.082) 0.079 (0.076) 0.071 (0.077)

638

C.- Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

Table 3(b) Values of e¢,,(t) (without brackets) and ~a(t) (bracketed) in three decimal places based on a single simulation for q = 1, p = 5; 0 = + 0.25, + 0.50, + 1.0, + 1.5, N = 400, 800, 1200, and t = 0.0, __+0.50, _ 1.0 t 0

N -

1.0

-

0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.083 (0.075) 0.066 (0.072) 0.084 (0.072)

0.073 (0.065) 0.058 (0.066) 0.079 (0.066)

0.070 (0.066) 0.058 (0.064) 0.074 (0.063)

0.083 (0.068) 0.066 (0.065) 0.074 (0.068)

0.083 (0.076) 0.069 (0.076) 0.080 (0.075)

- 1.00

400 800 1200

0.153 (0.185) 0.185 (0.174) 0.179 (0.172)

0.155 (0.159) 0.165 (0.154) 0.153 (0.157)

0.158 (0.151) 0.160 (0.150) 0.152 (0.149)

0.155 (0.160) 0.178 (0.163) 0.162 (0.154)

0.178 (0.178) 0.188 (0.176) 0.180 (0.180)

-0.50

400 800 1200

0.335 (0.343) 0.349 (0.327) 0.335 (0.343)

0.303 (0.299) 0.324 (0.293) 0.284 (0.299)

0.303 (0.283) 0.303 (0.281) 0.280 (0.278)

0.310 (0.303) 0.308 (0.286) 0.286 (0.299)

0.340 (0.331) 0.348 (0.361) 0.333 (0.347)

- 0.25

400 800 1200

0.435 (0.439) 0.389 (0.408) 0.420 (0.426)

0.305 (0.357) 0.346 (0.361) 0.358 (0.356)

0.300 (0.346) 0.331 (0.341) 0.339 (0.340)

0.310 (0.381) 0.351 (0.367) 0.370 (0.373)

0.403 (0.426) 0.428 (0.433) 0.424 (0.455)

0.25

400 800 1200

0.418 (0.450) 0.456 (0.433) 0.405 (0.414)

0.350 (0.361) 0.406 (0.360) 0.367 (0.357)

0.340 (0.342) 0.340 (0.344) 0.343 (0.341)

0.363 (0.361) 0.360 (0.353) 0.350 (0.363)

0.425 (0.424) 0.416 (0.431) 0.432 (0.440)

0.50

400 800 1200

0.328 (0.328) 0.375 (0.330) 0.359 (0.338)

0.280 (0.303) 0.325 (0.288) 0.332 (0.296)

0.275 (0.305) 0.288 (0.279) 0.304 (0.282)

0.313 (0.301) 0.295 (0.295) 0.307 (0.295)

0.358 (0.363) 0.339 (0.347) 0.363 (0.337)

1.00

400 800 1200

0.198 (0.163) 0.186 (0.171) 0.155 (0.177)

0.193 (0.161) 0.164 (0.161) 0.144 (0.154)

0.160 (0.151) 0.144 (0.150) 0.141 (0.148)

0.178 (0.158) 0.154 (0.153) 0.160 (0.157)

0.205 (0.164) 0.171 (0.175) 0.181 (0.173)

1.50

400 800 1200

0.068 (0.073) 0.065 (0.070) 0.068 (0.075)

0.068 (0.067) 0.056 (0.065) 0.058 (0.067)

0.063 (0.065) 0.054 (0.064) 0.059 (0.064)

0.070 (0.067) 0.063 (0.065) 0.063 (0.067)

0.083 (0.074) 0.074 (0.069) 0.068 (0.076)

theorem given the values of n, 0 and t. Notice that ~(t) remains unchanged when 0 is switched to - 0 . T o highlight the effect of covariates adjustment and give a clear picture of the magnitudes of aimt, bi,,t, direr, i = 1, 2; m = 1, 2,..., r, for a given value of t in the evaluation of ~(t), only cases corresponding to t = 0; r = 2; p = 4, 5; q = 1; 0 = 0.25, 0.5, 1.00, 1.50 are considered to reduce the computational burden. F o r t = 0, along with ~(0), the leading term L(0) and the correction term C(0) where L(O) = y2= 1 ~2= ~ pm,~(aimo)/2, C(O) = ~2= ~ , =2 ~ Pro,(b ,too + di,o)dp(a,mo)/2n, and ~ ( 0 ) - L ( 0 ) + C(0) are tabulated in Tables l(a) and (b) for n = 200, 400, 600; 0 = 0.25, 0.50, 1.00, 1.50. Tables 2(a) and (b) contain values of ~(t) (without brackets) and ~(t) (with brackets) for n = 200, 400, 600; 0 = _ 0.25, __+0.50, + 1.00,

C-K Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

639

_ 1.50; t = 0, ___0.5, _ 1.0 for q = 1, p = 4 a n d q = 1, p = 5 respectively in a single s i m u l a t i o n study. Tables 3(a) a n d (b) t a b u l a t e values of ee~,(t) (without brackets) a n d ~,(t) (with brackets) for the same c o m b i n a t i o n s of values of n, 0, t, q a n d p. The above s t u d y is replicated 50 i n d e p e n d e n t times. T h e replicate m e a n of ~(t) based on 50 replications a n d its s t a n d a r d deviation (in brackets) are t a b u l a t e d in Tables (4a) a n d (b). T h e values for the replicate m e a n of Y,(t) a n d its s t a n d a r d deviation (in brackets) are t a b u l a t e d in Tables 5(a) a n d (b). The c o r r e s p o n d i n g values for the replicate m e a n of ecv,(t) a n d its s t a n d a r d deviation (in brackets) are t a b u l a t e d in Tables 6(a) a n d (b). Replications reduce the variability a n d p i n p o i n t the sensitivity of the results based on a single simulation.

Table 4(a)

Values of of the replicate mean (without brackets) and replicate standard deviation (with brackets) of ~(t) based on 50 replications of a single simulation in three decimal places. The tabulated values are based on p = 4, p = 1; 0 = ± 0.25, ± 0.50, ± 1.0, ± 1.5, N = 400, 800, 1200, and t = 0.0, __ 0.50, ± 1.0 t 0

N -

1.0

-

0.5

0.0

0.5

1.0

- 1.50

400 0.072(0.023) 800 0.072(0.012) 1200 0.074(0.012)

0.065(0.013) 0.066(0.015) 0.066(0.008)

0.064(0.014) 0.063(0.011) 0.064(0.007)

0.067(0.020) 0.066(0.013) 0.066(0.013)

0.075(0.019) 0.073(0.013) 0.073(0.011)

- 1.00

400 0.174(0.028) 800 0.172 (0.024) 1200 0.175(0.014)

0.156(0.035) 0.154(0.021) 0.155(0.015)

0.150(0.032) 0.149(0.015) 0.149(0.013)

0.156(0.023) 0.154(0.019) 0.154(0.013)

0.176(0.034) 0.172(0.023) 0.175(0.022)

-0.50

400 0.329(0.050) 800 0.330(0.045) 1200 0.333 (0.027)

0.291(0.053) 0.292(0.033) 0.294(0.021)

0.276(0.043) 0.276(0.023) 0.273(0.017)

0.301(0.030) 0.293(0.027) 0.291(0.025)

0.330(0.058) 0.332(0.030) 0.336(0.021)

- 0.25 ,

400 0.410(0.064) 800 0.410(0.039) 1200 0.420(0.030)

0.357(0.040) 0.359(0.041) 0.357(0.032)

0.338(0.041) 0.336(0.033) 0.337(0.026)

0.363(0.049) 0.363(0.042) 0.358(0.035)

0.414(0.045) 0.422(0.035) 0.430(0.036)

0.25

400 0.408 (0.065) 800 0.407(0.032) 1200 0.415 (0.049)

0.355(0.052) 0.359(0.028) 0.361(0.028)

0.340(0.036) 0.335(0.029) 0.336(0.028)

0.355(0.043) 0.360(0.034) 0.357(0.042)

0.424(0.066) 0.427(0.039) 0.426(0.030)

0.50

400 0.333 (0.050) 800 0.335 (0.023) 1200 0.336(0.024)

0.295(0.031) 0.294(0.021) 0.290(0.028)

0.277(0.038) 0.275(0.027) 0.278(0.019)

0.293(0.042) 0.291(0.031) 0.295(0.025)

0.335(0.047) 0.333(0.038) 0.336(0.030)

1.00

400 0.174(0.028) 800 0.173 (0.019) 1200 0.175 (0.017)

0.157(0.031) 0.153(0.023) 0.155(0.013)

0.149(0.028) 0.147(0.015) 0.146(0.019)

0.151(0.031) 0.153(0.019) 0.155(0.013)

0.173(0.034) 0.171(0.020) 0.174(0.018)

1.50

400 0.075 (0.021) 800 0.073 (0.014) 1200 0.074(0.011)

0.067(0.023) 0.065(0.008) 0.066(0.011)

0.063(0.015) 0.063(0.009) 0.064(0.010)

0.068(0.016) 0.066(0.009) 0.065(0.008)

0.074(0.013) 0.074(0.011) 0.073(0.011)

640

C.-Y. Leungl Computational Statistics & Data Analysis 21 (1996) 627-645

Table 4(b) Values of the replicate mean (without brackets) and replicate standard deviation (with brackets) d(t) based on 50 replications of a single simulation in three decimal places. The tabulated values are based on p = 5, q = 1; 6 = j, 0.25, f 0.50, f 1.0, _+ 1.5, N = 400, 800, 1200, and t = 0.0, + 0.50, f 1.0

t N

0

-1.0

-0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.076 (0.024) 0.072 (0.009) 0.073 (0.010)

0.066 (0.016) 0.065 (0.011) 0.066 (0.007)

0.065 (0.018) 0.063 (0.012) 0.063 (0.010)

0.067 (0.014) 0.065 (0.016) 0.065 (0.008)

0.073 (0.016) 0.074 (0.016) 0.073 (0.010)

- 1.00

400 800 1200

0.173 (0.432) 0.173 (0.027) 0.175 (0.017)

0.153 (0.032) 0.154 (0.023) 0.155 (0.020)

0.147 (0.031) 0.147 (0.018) 0.148 (0.012)

0.156 (0.030) 0.154 (0.020) 0.155 (0.014)

0.175 (0.038) 0.172 (0.030) 0.177 (0.026)

-0.50

400 800 1200

0.333 (0.061) 0.332 (0.034) 0.332 (0.026)

0.288 (0.034) 0.291 (0.025) 0.291 (0.025)

0.276 (0.03 1) 0.278 (0.026) 0.275 (0.024)

0.292 (0.038) 0.289 (0.032) 0.296 (0.023)

0.334 (0.042) 0.336 (0.037) 0.335 (0.028)

- 0.25

400 800 1200

0.407 (0.044) 0.409 (0.051) 0.418 (0.032)

0.353 (0.048) 0.356 (0.035) 0.359 (0.026)

0.343 (0.039) 0.336 (0.025) 0.337 (0.026)

0.353 (0.051) 0.361 (0.035) 0.364 (0.032)

0.408 (0.063) 0.427 (0.035) 0.432 (0.041)

0.25

400 800 1200

0.402 (0.056) 0.414 (0.038) 0.420 (0.032)

0.356 (0.049) 0.353 (0.034) 0.363 (0.032)

0.338 (0.053) 0.359 (0.036) 0.334 (0.024)

0.360 (0.031) 0.359 (0.046) 0.359 (0.027)

0.415 (0.047) 0.422 (0.046) 0.425 (0.035)

0.50

400 800 1200

0.330 (0.044) 0.335 (0.031) 0.333 (0.034)

0.289 (0.042) 0.294 (0.037) 0.292 (0.022)

0.273 (0.035) 0.275 (0.034) 0.276 (0.018)

0.291 (0.044) 0.295 (0.023) 0.293 (0.020)

0.325 (0.043) 0.334 (0.027) 0.333 (0.029)

1.00

400 800 1200

0.171 (0.030) 0.172 (0.025) 0.172 (0.019)

0.157 (0.031) 0.154 (0.022) 0.154 (0.020)

0.147 (0.022) 0.149 (0.021) 0.149 (0.013)

0.156 (0.023) 0.157 (0.020) 0.154 (0.017)

0.175 (0.030) 0.173 (0.027) 0.173 (0.019)

1.50

400 800 1200

0.074 (0.019) 0.073 (0.019) 0.073 (0.012)

0.067 (0.015) 0.065 (0.011) 0.066 (0.008)

0.065 (0.019) 0.065 (0.013) 0.063 (0.010)

0.067 (0.014) 0.066 (0.011) 0.066 (0.006)

0.073 (0.016) 0.076 (0.016) 0.073 (0.010)

5. Discussion From the tabulated values in Tables l(a), l(b), 2(a), and 2(b) based on a single simulation, the following facts concerning e(t) and Z(t) are observed: 1. Reading across the values in Tables l(a) and (b) for the most popular cut-off point t = 0, up to the order stated in the theorem, c(O) is dominated by the leading term L(0). The correction term C(0) affects E(0) only slightly. Although the values of airno, binto, dinto, m = 1, 2; i = 1, 2, vary from populations and locations, a(O), L(0) and C(0) change very little for fixed 8 as N increases. 2. For fixed 8, N and t, Z(t) does not decrease as p increases. In contrast, 2(t) fluctuates as p increases.

641

C.-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

3. For fixed p, 0 and N, ~(t) is smallest a t = 0 and is symmetric about t = 0. In general, ~(t) assumes smaller values in the vicinity of t = 0. The behaviour of ~(t) is not quite symmetric about t = 0. 4. Both ~(t) and ~(t) are relatively insensitive to slight departures from the zero cut-off point. The increase in both ~,(t) and ~(t) for It[ > 1 becomes noticeable for N > 800. 5. ~(t) is a function of 101. Covariance adjustment is essential for 101 > 0.5 where correlation with the covariates is high. When 0 < 101 _ 0.5 and the correlation with the covariates is moderate, ~(t) assumes larger values. Some bias is present in ~(t)

Table 5 (a) Values of the replicate mean (without brackets) and replicate standard deviation (with brackets) of ~a(t) based on 50 replications of a single simulation in three decimal places. The tabulated values are based of p = 4, q = 1; 0 = _ 0.25, _ 0.50, _ 1.0, + 1.5, N = 400, 800, 1200, and t = 0.0, + 0.50, _ 1.0

t 0

N - 1.0

-0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.075 (0.006) 0.074 (0.007) 0.074 (0.005)

0.068 (0.005) 0.066 (0.003) 0.066 (0.001)

0.065 (0.002) 0.064 (0.001) 0.064 (0.001)

0.067 (0.005) 0.067 (0.03) 0.066 (0.002)

0.074 (0.006) 0.073 (0.005) 0.074 (0.006)

- 1.00

400 800 1200

0.177 (0.016) 0.176 (0.012) 0.175 (0.008)

0.158 (0.008) 0.155 (0.005) 0.156 (0.003)

0.151 (0.003) 0.149 (0.001) 0.149 (0.001)

0.157 (0.007) 0.156 (0.004) 0.156 (0.006)

0.177 (0.013) 0.175 (0.008) 0.175 (0.009)

-0.50

400 800 1200

0.339 (0.026) 0.336 (0.018) 0.337 (0.013)

0.299 (0.017) 0.296 (0.010) 0.296 (0.007)

0.284 (0.06) 0.280 (0.005) 0.279 (0.002)

0.300 (0.015) 0.297 (0.009) 0.295 (0.007)

0.339 (0.022) 0.338 (0.013) 0.339 (0.017)

-0.25

400 800 1200

0.422 (0.042) 0.418 (0.028) 0.422 (0.024)

0.364 (0.018) 0.364 (0.019) 0.362 (0.012)

0.347 (0.016) 0.342 (0.009) 0.340 (0.003)

0.370 (0.024) 0.367 (0.012) 0.365 (0.013)

0.427 (0.030) 0.430 (0.026) 0.434 (0.026)

0.25

400 800 1200

0.422 (0.053) 0.416 (0.023) 0.419 (0.040)

0.365 (0.028) 0.363 (0.014) 0.361 (0.011)

0.346 (0.010) 0.342 (0.005) 0.340 (0.003)

0.369 (0.018) 0.365 (0.017) 0.363 (0.014)

0.437 (0.063) 0.433 (0.033) 0.431 (0.022)

0.50

400 800 1200

0.338 (0.030) 0.338 (0.015) 0.338 (0.014)

0.300 (0.012) 0.297 (0.009) 0.295 (0.006)

0.283 (0.008) 0.280 (0.004) 0.279 (0.002)

0.300 (0.013) 0.297 (0.010) 0.295 (0.008)

0.341 (0.029) 0.337 (0.022) 0.338 (0.016)

1.00

400 800 1200

0.177 (0.012) 0.176 (0.009) 0.175 (0.09)

0.159 (0.007) 0.157 (0.008) 0.155 (0.003)

0.151 (0.003) 0.149 (0.002) 0.149 (0.001)

0.157 (0.008) 0.156 (0.006) 0.155 (0.004)

0.177 (0.017) 0.175 (0.008) 0.175 (0.009)

1.50

400 800 1200

0.075 (0.007) 0.074 (0.005) 0.073 (0.004)

0.068 (0.004) 0.066 (0.003) 0.066 (0.002)

0.065 (0.003) 0.064 (0.001) 0.064 (0.001)

0.068 (0.004) 0.066 (0.003) 0.066 (0.002)

0.075 (0.006) 0.074 (0.005) 0.074 (0.005)

642

C.-Y. Leungl Computational Statistics & Data hudysis

21 (1996) 627-645

Table 5 (b) Values of the replicate mean (without brackets) and replicate standard deviation (with brackets) of E,(t) based on 50 replications of a single simulation in three decimal places. The tabulated values are based of p = 5, q = 1; 6 = + 0.25, f0.50, f 1.0, k 1.50, N = 400, 800, 1200 and t = 0.0, f 0.50, * 1.0

t

e

N -1.0

-0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.076 (0.008) 0.074 (0.005) 0.073 (0.003)

0.068 (O.OJO3) 0.067 (0.002) 0.066 (0.002)

0.066 (0.004) 0.064 (0.001) 0.064 (0.001)

0.068 (0.007) 0.067 (0.002) 0.066 (0.002)

0.075 (0.007) 0.074 (0.005) 0.074 (0.003)

-1.00

400 800 1200

0.176 (0.012) 0.176 (0.011) 0.175 (0.006)

0.158 (0.007) 0.155 (0.005) 0.156 (0.005)

0.152 (0.005) 0.150 (0.002) 0.149 (0.001)

0.160 (0.008) 0.157 (0.004) 0.156 (0.003)

0.177 (0.013) 0.175 (0.010) 0.176 (0.009)

-0.50

400 800 1200

0.339 (0.030) 0.336 (0.021) 0.337 (0.011)

0.300 (0.011) 0.296 (0.008) 0.296 (0.006)

0.285 (0.013) 0.281 (0.005) 0.280 (0.002)

0.301 (0.013) 0.296 (0.008) 0.296 (0.006)

0.341 (0.022) 0.339 (0.020) 0.339 (0.014)

- 0.25

400 800 1200

0.422 (0.033) 0.419 (0.040) 0.420 (0.024)

0.365 (0.022) 0.362 (0.018) 0.364 (0.010)

0.348 (0.011) 0.343 (0.005) 0.341 (0.003)

0.369 (0.020) 0.365 (0.017) 0.366 (0.012)

0.425 (0.038) 0.435 (0.03 1) 0.436 (0.033)

0.25

400 800 1200

0.418 (0.045) 0.422 (0.028) 0.424 (0.025)

0.368 (0.023) 0.363 (0.019) 0.364 (0.011)

0.348 (0.011) 0.343 (0.005) 0.341 (0.003)

0.372 (0.018) 0.367 (0.021) 0.366 (0.011)

0.432 (0.045) 0.432 (0.035) 0.431 (0.022)

0.50

400 800 1200

0.338 (0.026) 0.338 (0.016) 0.337 (0.017)

0.300 (0.014) 0.297 (0.011) 0.295 (0.001)

0.285 (0.006) 0.281 (0.005) 0.279 (0.002)

0.299 (0.012) 0.296 (0.009) 0.296 (0.008)

0.338 (0.024) 0.339 (0.016) 0.338 (0.015)

1.00

400 800 1200

0.177 (0.012) 0.175 (0.001) 0.175 (0.008)

0.159 (0.010) 0.156 (0.006) 0.155 (0.003)

0.152 (0.007) 0.150 (0.002) 0.149 (0.001)

0.159 (0.007) 0.157 (0.004) 0.156 (0.003)

0.178 (0.011) 0.175 (0.009) 0.176 (0.007)

1.50

400 800 1200

0.075 (0.010) 0.074 (0.005) 0.073 (0.004)

0.068 (0.004) 0.067 (0.003) 0.066 (0.002)

0.066 (0.003) 0.064 (0.001) 0.064 (0.001)

0.068 (0.004) 0.067 (0.002) 0.066 (0.004)

0.075 (0.008) 0.074 (0.005) 0.073 (0.003)

and z(t) is not symmetric with respect to 8. Covariance adjustment has a similar effect on Z(t). 6. c(t) does not increase as the squared Mahalanobis distance given by LIP = 402, m = 1, 2 (in our simulation) in each of the population increases. The behaviour of b(t) with respect to the squared Mahalanobis distance is similar. 7. As N increases, e(t) decreases rather slowly. The behaviour of Z(t) is quite stable as N increases. The slow convergence of s(t) to its true value indicates the presence of bias. The behaviour of i,,,(t), in contrast, is not quite stable as N increases.

C.-Y. Leung/ Computational Statistics & Data Analysis 21 (1996) 627-645

643

Further examination of the value of }(t) in Tables 4(a) and (b) based on 50 replications of a single simulation reveals a similar pattern for ~(t) reported above. The application of replications has removed the apparent asymmetry and variability in }(t). The replicate mean of ~(t) demonstrates more symmetry with respect to t and is closer to the true value ~(t). The findings are consistent with the results based on a single simulation. To summarize, ~(t) behaves reasonably well in relation to ~(t) for N > 800. The values of}(t) and ~(t) are well below 0.5 for all the cases considered. When q and d increase, both ~(t) and ~(t) are likely to increase for fixed p, 0, N and t. The values of }eva(t) and ~a(t) in Tables 3(a) and (b) based on a single simulation suggest a similar pattern for ~cv,(t) in relation to ~,(t) and analogous conclusions Table 6(a) Values of the replicate mean (without brackets) and replicate standard deviation (with brackets) of ecv,(t) based on 50 replications of the leave-one-out cross-validation estimate in a single simulation in three decimal places. The tabulated values are based of p -- 4, q = 1; 0 = +_ 0.25, __.0.50, _+ 1.00,

_ 1.50, N = 400, 800, 1200, and t = 0.0, _ 0.50, _ 1.0 t 0

N - 1.0

-0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.074 (0.039) 0.073 (0.020) 0.075 (0.018)

0.065 (0.025) 0.069 (0.016) 0.065 (0.015)

0.065 (0.023) 0.063 (0.018) 0.061 (0.016)

0.066 (0.032) 0.067 (0.019) 0.064 (0.013)

0.075 (0.023) 0.074 (0.021) 0.070 (0.013)

- 1.00

400 800 1200

0.179 (0.045) 0.176 (0.033) 0.174 (0.041)

0.155 (0.027) 0.157 (0.020) 0.156 (0.022)

0.150 (0.030) 0.151 (0.031) 0.148 (0.022)

0.155 (0.031) 0.155 (0.030) 0.154 (0.023)

0.179 (0.042) 0.177 (0.033) 0.173 (0.025)

- 0.50

400 800 1200

0.338 (0.066) 0.332 (0.061) 0.339 (0.031)

0.300 (0.049) 0.295 (0.033) 0.297 (0.026)

0.284 (0.042) 0.281 (0.029) 0.280 (0.029)

0.296 (0.051) 0.295 (0.046) 0.296 (0.025)

0.348 (0.066) 0.339 (0.036) 0.340 (0.030)

-0.25

400 800 1200

0.424 (0.055) 0.418 (0.068) 0.420 (0.041)

0.369 (0.044) 0.367 (0.038) 0.360 (0.037)

0.342 (0.044) 0.342 (0.032) 0.341 (0.033)

0.368 (0.052) 0.367 (0.046) 0.362 (0.030)

0.434 (0.101) 0.434 (0.047) 0.440 (0.040)

0.25

400 800 1200

0.425 (0.073) 0.417 (0.074) 0.421 (0.038)

0.359 (0.072) 0.364 (0.038) 0.360 (0.026)

0.346 (0.041) 0.340 (0.035) 0.337 (0.032)

0.363 (0.062) 0.366 (0.043) 0.363 (0.030)

0.433 (0.073) 0.432 (0.051) 0.430 (0.040)

0.50

400 800 1200

0.343 (0.044) 0.336 (0.042) 0.340 (0.037)

0.298 (0.060) 0.294 (0.031) 0.295 (0.028)

0.282 (0.043) 0.275 (0.032) 0.276 (0.032)

0.297 (0.054) 0.295 (0.034) 0.294 (0.030)

0.344 (0.032) 0.333 (0.044) 0.338 (0.026)

1.00

400 800 1200

0.173 (0.047) 0.172 (0.033) 0.172 (0.030)

0.154 (0.036) 0.156 (0.031) 0.154 (0.024)

0.148 (0.046) 0.148 (0.023) 0.149 (0.015)

0.157 (O.04O) 0.154 (0.036) 0.155 (0.017)

0.175 (O.043) 0.175 (0.030) 0.174 (0.022)

1.50

400 800 1200

0.073 (0.023) 0.074 (0.034) 0.073 (0.024)

0.070 (0.026) 0.066 (0.018) 0.065 (0.012)

0.064 (0.018) 0.063 (0.013) 0.064 (0.018)

0.067 (0.023) 0.065 (0.018) 0.067 (0.018)

0.075 (0.027) 0.073 (0.019) 0.074 (0.021)

644

C.-Y. Leungl Computational

Statistics & Data Analysis 21 (1996) 627-645

Table 6(b) Values of the replicate mean (without brackets) and replicate standard deviation (with brackets) of &.(t) based on 50 replications of the leave-one-out cross-validation estimate in a single simulation in three decimal places. The tabulated values are based of p = 5, q = 1; 0 = + 0.25, f 0.50, k 1.00, + 1.50, N = 400, 800, 1200, and t = 0.0, f 0.50, + 1.0

t

e

N - 1.0

-0.5

0.0

0.5

1.0

- 1.50

400 800 1200

0.074 (0.022) 0.073 (0.018) 0.075 (0.013)

0.067 (0.025) 0.066 (0.018) 0.066 (0.011)

0.065 (0.020) 0.065 (0.017) 0.064 (0.013)

0.067 (0.019) 0.067 (0.017) 0.067 (0.017)

0.076 (0.024) 0.076 (0.021) 0.073 (0.021)

-1.00

400 800 1200

0.177 (0.042) 0.171 (0.037) 0.175 (0.025)

0.159 (0.035) 0.157 (0.034) 0.153 (0.034)

0.154 (0.030) 0.150 (0.023) 0.150 (0.023)

0.161 (0.039) 0.156 (0.030) 0.157 (0.018)

0.178 (0.032) 0.177 (0.027) 0.177 (0.018)

-0.50

400 800 1200

0.344 (0.045) 0.335 (0.039) 0.338 (0.026)

0.302 (0.046) 0.301 (0.025) 0.295 (0.023)

0.292 (0.042) 0.282 (0.032) 0.279 (0.027)

0.301 (0.040) 0.296 (0.034) 0.294 (0.033)

0.341 (0.041) 0.341 (0.038) 0.338 (0.029)

-0.25

400 800 1200

0.424 (0.068) 0.420 (0.047) 0.419 (0.062)

0.369 (0.057) 0.365 (0.053) 0.362 (0.048)

0.354 (0.036) 0.344 (0.037) 0.340 (0.028)

0.376 (0.044) 0.368 (0.039) 0.366 (0.020)

0.433 (0.059) 0.436 (0.060) 0.438 (0.047)

0.25

400 800 1200

0.424 (0.072) 0.420 (0.058) 0.420 (0.030)

0.369 (0.083) 0.364 (0.026) 0.359 (0.031)

0.349 (0.059) 0.345 (0.032) 0.343 (0.028)

0.368 (0.057) 0.364 (0.036) 0.364 (0.029)

0.425 (0.066) 0.429 (0.051) 0.429 (0.039)

0.50

400 800 1200

0.342 (0.058) 0.334 (0.030) 0.339 (0.043)

0.300 (0.048) 0.304 (0.031) 0.296 (0.036)

0.284 (0.053) 0.276 (0.026) 0.280 (0.03 1)

0.296 (0.056) 0.297 (0.031) 0.295 (0.029)

0.338 (0.061) 0.334 (0.037) 0.339 (0.047)

1.00

400 800 1200

0.176 (0.049) 0.175 (0.026) 0.178 (0.027)

0.157 (0.045) 0.156 (0.025) 0.156 (0.022)

0.151 (0.043) 0.149 (0.029) 0.148 (0.027)

0.158 (0.041) 0.157 (0.024) 0.154 (0.018)

0.176 (0.049) 0.176 (0.030) 0.173 (0.019)

1.50

400 800 1200

0.075 (0.025) 0.074 (0.019) 0.074 (0.016)

0.069 (0.026) 0.067 (0.019) 0.066 (0.016)

0.066 (0.026) 0.062 (0.018) 0.065 (0.016)

0.069 (0.022) 0.066 (0.014) 0.065 (0.010)

0.076 (0.026) 0.072 (0.022) 0.073 (0.013)

can be drawn. Despite the fluctuations and asymmetry in both &,(t) and e,(t) for fixed 8 and t as both N and p increase, there is no apparent bias in &(t) relative to C=(t).The results presented in Tables 5(a), 5(b), 6(a) and 6(b) based on 50 replications give further support for the findings from a single simulation. In summary, the plug-in classification rule C?,,,,m = 1,2, . . . , r’, should be used in situations where the correlation between the major discriminators and covariates is strong. Without prior information, a cut-off point preferably equal to zero is essential in a practical situation. For an arbitrary cut-off point t, the plug-in estimator g(t) and the leave-one-out cross-validation estimator &(t) can be used to estimate the overall expected error rate e(t) and the overall actual error rate Za(t) respectively.

C.-Y. Leung / Computational Statistics & Data Analysis 21 (1996) 627-645

645

Acknowledgements The author wishes to thank Professor P. Naeve, Co-Editor, for handling the paper and an anonymous referee for many constructive comments on an earlier draft.

References Anderson, T.W., An asymptotic expansion of the distribution of the studentized classification statistic W, Ann. Statist., 1 (1973) 964-972. Anderson, T.W., An introduction to multivariate statistical analysis, 2nd edn. (Wiley, New York, 1984). Chang, P.C. and A.A. Afifi, Classification based on dichotomous and continuous variables, J. Amer. Statist. Assoc., 69 (1974) 336-339. Cochran, W.G., Comparison of two methods of handling covariates in discriminatory analysis, Ann. Inst. Statist. Math., 16 (1964) 43-53. Cochran, W.G. and C.T. Bliss, Discriminant function with covariance, Ann. Math. Statist., 19 (1948 151-176. Finenberg, S.E., An iterative procedure for estimation in contingency tables, Ann. Math. Statist., 41 (1970) 907-917. Haberman, S.J., Log linear fit for contingency tables, Algorithm, AS 51, Appl. Statist., 21 (1972) 218-225. Krzanowski, W.J., Discrimination and classification using both binary and continuous variables, J. Amer. Statist. Assoc. 70 (1975) 782-790. Krzanowski, W.J., Mixtures of continuous and categorical variables in discriminant analysis: a hypothesis-testing approach, Biometrics, 38 (1982) 991--1002. Lachenbruch, P.A., Covariance adjusted discriminant function, Ann. Inst. Statist. Math., 29 (1977) 247-257. Lachenbruch, P.A. and M.R. Mickey, Estimation of error rates in discriminant analysis, Technomettics, 10 (1968) 1-11. Leung, C.Y., Asymptotic comparison of two discriminants used in normal covariate classification, Comm. Statist. Theory Methods, 12 (1983) 1637-1646. Leung, C.Y., The studentized location linear discriminant function, Comm. Statist. Theory Methods, 18, (1989) 3977-3990. Leung, C.Y., Classification of dichotomous and continous variables with incomplete samples, Comm. Statist. Theory Methods, 23 (1994a) 1581-1592. Leung, C.Y., The location linear discriminant with covariates, Comm. Statist. Simulation Comput. 23 (1994b) 1027-1046. McLachlan, G.J., Discriminant analysis and statistical pattern recognition (Wiley, New York, 1992). Memon, A.Z. and M. Okamoto. The classification statistic W * in covariate discriminant analysis, Ann. Math. Statist., 41 (1970) 1491-1499. Olkin, I. and R.F. Tate, Multivariate correlation models with mixed discrete and continuous variables, Ann. Math. Statist., 32 (1961) 448-465, with correction in 36 (1965) 343-344. Tu, C.T. and C.P. Hail, Discriminant analysis based on binary and continuous variables, J. Amer. Statist. Assoc., 77 (1982) 447-454. Vlachonikolis, I., On the asymptotic distribution of the location linear discriminant function, J. Roy. Statist. Soc. B, 47 (1985) 498-509. Vlachonikolis, I, Predictive discrimination and classification with mixed binary and continuous variables, Biometrika, 77 (1990) 657-662.