Statistics & Probability Letters 5 (1987) 361-365
North-Holland

AN IMPROVED CHI-SQUARED TEST FOR A PRINCIPAL COMPONENT

James R. SCHOTT

Department of Statistics, University of Central Florida, Orlando, FL, USA

Received November 1986

Abstract: The expected value is computed for a statistic which is used to test that a specified unit vector is actually the first principal component vector. This suggests an adjusted statistic with a better chi-squared approximation.

Keywords: asymptotic distribution, latent root, latent vector.

1. Introduction

Suppose $z$ represents a random vector having an $m$-variate normal distribution with mean vector $\mu$ and covariance matrix $\Lambda$. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ denote the latent roots of $\Lambda$, and $\gamma_1, \gamma_2, \ldots, \gamma_m$ the corresponding normalized latent vectors. The new set of variables $y_1, y_2, \ldots, y_m$, defined by $y_i = \gamma_i' z$, are called the principal components of $z$ and play an important role in statistical analyses, particularly as a means of variable reduction. When the variance of the first principal component is sufficiently large, an important test of hypothesis is the test of $H_0\colon \gamma_1 = \gamma_0$, where $\gamma_0$ is some known unit vector. Actually, in practice, one would rarely specify $\gamma_0$ a priori. However, after obtaining a sample estimate $\hat\gamma_1$ of $\gamma_1$, one may wish to use another vector $\gamma_0$, closely resembling $\hat\gamma_1$ but easier to use or interpret. The closeness of these two vectors can be statistically quantified through the p-value of the test mentioned above.

Let $z_1, z_2, \ldots, z_N$ be a random sample from the distribution of $z$ and let $S$ be the sample covariance matrix; that is,

$$S = (1/n) \sum_{i=1}^{N} (z_i - \bar z)(z_i - \bar z)',$$

where $n = N - 1$. The statistic proposed by Anderson (1963) for testing $H_0$ is

$$T_1 = n \{ l_1 \gamma_0' S^{-1} \gamma_0 + l_1^{-1} \gamma_0' S \gamma_0 - 2 \},$$

where $l_1 > l_2 > \cdots > l_m$ are the latent roots of $S$. If $H_0$ is true and $\lambda_1$ is a distinct latent root of $\Lambda$, the asymptotic distribution of $T_1$ is chi-squared with $m - 1$ degrees of freedom. As will be seen later, this chi-squared approximation is not very good unless $n$ is quite large. The purpose of this note is to give an adjustment to $T_1$ so that it better fits the chi-squared approximation. The approach we use here is one originated by Bartlett (1937), in which $T_1$ is adjusted so that its expected value, up to terms of order $n^{-1}$, is identical to that of the chi-squared distribution with $m - 1$ degrees of freedom.
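For concreteness, here is a minimal sketch of how $T_1$ could be computed with NumPy. It is an illustration rather than code from the paper; the data matrix `Z` and the hypothesized unit vector `gamma0` are placeholders supplied by the user.

```python
import numpy as np
from scipy.stats import chi2

def anderson_T1(Z, gamma0):
    """Anderson's (1963) statistic for H0: gamma_1 = gamma0.

    Z is an N x m data matrix, gamma0 a unit m-vector.  Under H0,
    T1 is asymptotically chi-squared with m - 1 degrees of freedom
    when lambda_1 is a simple root.
    """
    N, m = Z.shape
    n = N - 1
    S = np.cov(Z, rowvar=False)            # divisor n = N - 1, as in the text
    l1 = np.linalg.eigvalsh(S)[-1]         # largest latent root l_1 of S
    q_inv = gamma0 @ np.linalg.solve(S, gamma0)   # gamma0' S^{-1} gamma0
    q = gamma0 @ S @ gamma0                       # gamma0' S gamma0
    T1 = n * (l1 * q_inv + q / l1 - 2.0)
    return T1, 1.0 - chi2.cdf(T1, m - 1)          # statistic and p-value
```

The returned p-value then quantifies the closeness of $\gamma_0$ to $\hat\gamma_1$ in the sense described above.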


2. Computation of the first moment

Here we will obtain the expected value of $T = T_1/n$ up to terms of order $n^{-2}$. It is easily verified that we may assume, without loss of generality, that $\gamma_0 = (1, 0, \ldots, 0)'$ and $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$. Let $x_1, x_2, \ldots, x_m$ denote the latent vectors of $S$ corresponding to $l_1, l_2, \ldots, l_m$, respectively, and let $x_{1i}$ denote the first component of $x_i$. Then $T$ can be expressed as

$$
T = \sum_{i=1}^{m} l_1^{-1} l_i x_{1i}^2 + \sum_{i=1}^{m} l_1 l_i^{-1} x_{1i}^2 - 2
  = \sum_{i=2}^{m} (l_1^{-1} l_i - \lambda_1^{-1}\lambda_i)\, x_{1i}^2
  + \sum_{i=2}^{m} (l_1 l_i^{-1} - \lambda_1 \lambda_i^{-1})\, x_{1i}^2
  + \sum_{i=2}^{m} \frac{(\lambda_1-\lambda_i)^2}{\lambda_1\lambda_i}\, x_{1i}^2,
$$

where the second equality uses $\sum_{i=1}^{m} x_{1i}^2 = 1$ (the $x_i$ are orthonormal) together with the identity $\lambda_1^{-1}\lambda_i + \lambda_1\lambda_i^{-1} - 2 = (\lambda_1-\lambda_i)^2/(\lambda_1\lambda_i)$. Now the terms $l_i$ and $x_{1i}$ can each be expanded in terms of the elements $a_{jk} = s_{jk} - \lambda_k \delta_{jk}$, where $s_{jk}$ is the $(j,k)$th element of $S$ and $\delta_{jk}$ is the Kronecker delta. For the case in which $\lambda_i$ is simple, an expansion for $l_i$ is given by Lawley (1956a), and one for $x_{1i}$ by Sugiura (1976). Consequently, we get the following expressions, which are accurate up to fourth-degree terms in the $a_{jk}$'s:

$$
(l_1^{-1} l_i - \lambda_1^{-1}\lambda_i)\, x_{1i}^2
= \frac{(\lambda_1 a_{ii} - \lambda_i a_{11})\, a_{i1}^2}{\lambda_1^2 (\lambda_1 - \lambda_i)^2} + \cdots,
$$

$$
(l_1 l_i^{-1} - \lambda_1 \lambda_i^{-1})\, x_{1i}^2
= \frac{(\lambda_i a_{11} - \lambda_1 a_{ii})\, a_{i1}^2}{\lambda_i^2 (\lambda_1 - \lambda_i)^2} + \cdots,
$$

$$
\frac{(\lambda_1-\lambda_i)^2}{\lambda_1\lambda_i}\, x_{1i}^2
= \frac{a_{i1}^2}{\lambda_1\lambda_i} + \cdots,
$$

where each $\cdots$ collects the remaining terms through fourth degree in the $a_{jk}$'s. These terms are ratios of monomials in the $a_{jk}$'s to products of powers of $\lambda_1$, $\lambda_i$ and of the root separations $(\lambda_1 - \lambda_i)$, $(\lambda_j - \lambda_i)$ and $(\lambda_1 - \lambda_j)$ for $j \neq i$, and they include double sums over $j \neq i$ and $k \neq i$; only their expectations are needed in what follows.
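The leading term of the first expansion can be checked numerically: for a small symmetric perturbation $\varepsilon A$ of $\Lambda$, the left side is $O(\varepsilon^3)$ while left and right sides differ by $O(\varepsilon^4)$. The sketch below is an illustrative check with arbitrary roots; it is not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
lam = np.array([4.0, 2.0, 1.0])          # arbitrary distinct roots
m = lam.size
A = rng.standard_normal((m, m))
A = (A + A.T) / 2.0                      # fixed symmetric perturbation direction

for eps in (1e-1, 1e-2, 1e-3):
    S = np.diag(lam) + eps * A           # so a_jk = eps * A[j, k]
    vals, vecs = np.linalg.eigh(S)
    vals, vecs = vals[::-1], vecs[:, ::-1]   # roots in decreasing order
    i = 1                                    # check the i = 2 summand
    lhs = (vals[i] / vals[0] - lam[i] / lam[0]) * vecs[0, i] ** 2
    a = eps * A
    lead = ((lam[0] * a[i, i] - lam[i] * a[0, 0]) * a[i, 0] ** 2
            / (lam[0] ** 2 * (lam[0] - lam[i]) ** 2))
    print(eps, lhs - lead)               # difference shrinks like eps**4
```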

In order to evaluate the expectation of these expressions, we will need the following expectations in the $a_{jk}$'s:

$$
E(a_{21}^2) = \lambda_1\lambda_2/n, \qquad
E(a_{11} a_{21}^2) = 2\lambda_1^2\lambda_2/n^2, \qquad
E(a_{21} a_{31} a_{32}) = \lambda_1\lambda_2\lambda_3/n^2,
$$
$$
E(a_{21}^2 a_{31}^2) = \lambda_1^2\lambda_2\lambda_3/n^2, \qquad
E(a_{21}^4) = 3\lambda_1^2\lambda_2^2/n^2, \qquad
E(a_{11}^2 a_{21}^2) = 2\lambda_1^3\lambda_2/n^2.
$$

All other terms arising have expectations of order $n^{-3}$ or less.
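These moment formulas are easy to spot-check by simulation. The sketch below (an illustrative check, not part of the paper) estimates $E(a_{21}^2)$ for an arbitrary diagonal $\Lambda$ and compares it with $\lambda_1\lambda_2/n$; the roots, sample size, and replication count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([5.0, 1.0, 1.0])   # Lambda = diag(5, 1, 1), arbitrary
N = 26
n = N - 1

sq = []
for _ in range(20000):
    Z = rng.normal(size=(N, lam.size)) * np.sqrt(lam)  # covariance diag(lam)
    S = np.cov(Z, rowvar=False)
    sq.append(S[1, 0] ** 2)        # a_21 = s_21, since the delta term vanishes

print(np.mean(sq))                 # Monte Carlo estimate of E(a_21^2)
print(lam[0] * lam[1] / n)         # theoretical value lambda_1*lambda_2/n = 0.2
```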

Using the results above, we obtain, after some simplification,

$$
E(T) = \frac{1}{n}(m-1)
+ \frac{1}{n^2}\left[\, 3(m-1) + \tfrac{1}{2}(m-1)(m-2)
+ 3\sum_{j=2}^{m}\frac{\lambda_1\lambda_j}{(\lambda_1-\lambda_j)^2}
+ \sum_{j>i>1}\frac{\lambda_i\lambda_j+\lambda_1^2}{(\lambda_1-\lambda_i)(\lambda_1-\lambda_j)}\right]
+ O(n^{-3}).
$$

Hence, if we use the statistic $T_2 = c_1 T$, where

$$
c_1 = n - \left[\, 3 + \tfrac{1}{2}(m-2)
+ \frac{3}{m-1}\sum_{j=2}^{m}\frac{\lambda_1\lambda_j}{(\lambda_1-\lambda_j)^2}
+ \frac{1}{m-1}\sum_{j>i>1}\frac{\lambda_i\lambda_j+\lambda_1^2}{(\lambda_1-\lambda_i)(\lambda_1-\lambda_j)}\right],
$$

we get $E(T_2) = m - 1 + O(n^{-2})$.
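The correction factor $c_1$ is a closed-form function of the latent roots. The paper defines it in terms of the population roots, which are known by design in the simulations of Section 3; how to proceed when they must be estimated is not addressed here, though substituting the sample roots $l_i$ would be an obvious plug-in option. A minimal sketch, taking the roots as given:

```python
import numpy as np

def bartlett_c1(lam, n):
    """Correction factor c1, so that T2 = c1 * (T1 / n) has expectation
    m - 1 + O(1/n^2).  `lam` lists the roots lambda_1 > ... > lambda_m,
    taken here as known quantities."""
    lam = np.asarray(lam, dtype=float)
    m = lam.size
    l1, rest = lam[0], lam[1:]
    s1 = np.sum(l1 * rest / (l1 - rest) ** 2)
    s2 = sum((rest[i] * rest[j] + l1 ** 2) / ((l1 - rest[i]) * (l1 - rest[j]))
             for i in range(m - 1) for j in range(i + 1, m - 1))
    return n - (3.0 + 0.5 * (m - 2) + (3.0 * s1 + s2) / (m - 1))

# As lambda_1 grows, c1 approaches n - m - 1 (the c2 of Section 3):
print(bartlett_c1([5, 1, 1, 1, 1], n=25))      # 17.125 with these hypothetical inputs
print(bartlett_c1([1e6, 1, 1, 1, 1], n=25))    # approaches 25 - 5 - 1 = 19
```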

3. A simulation study

The adjusted statistic $T_2$ was obtained so that its first moment agrees, to order $n^{-1}$, with that of the asymptotic distribution. Ideally, if possible, we would like all moments to agree to the same order with the corresponding moments of the asymptotic distribution. This would certainly assure that the adjusted statistic is in fact an improvement over the unadjusted statistic. Lawley (1956b) has shown that this agreement of moments will occur when the statistic in question is a likelihood ratio statistic. However, his result does not apply here, since we are not dealing with a likelihood ratio statistic. Instead of computing some of the higher moments of $T$, we conducted a simulation study to compare the performances of $T_1$ and $T_2$. In addition, we consider a third statistic, $T_3 = c_2 T$, with $c_2 = \lim c_1 = n - m - 1$, where the limit is taken as $\lambda_1 \to \infty$. For $m = 5$, 10, and 15, and for various values of $n$ and $\lambda_1$, estimates of the actual type I error probability were obtained when the nominal level is 0.05. Each simulated probability is based on 5000 random samples and, for the sake of simplicity, the latent roots $\lambda_2, \lambda_3, \ldots, \lambda_m$ were all taken to be 1. Some of the results are summarized in Table 1.


Table 1
Simulated type I error probabilities when the nominal significance level is 0.05

m = 5
        λ₁ = 5                    λ₁ = 15                   λ₁ = 25                   λ₁ = 100
  n    T1      T2      T3        T1      T2      T3        T1      T2      T3        T1      T2      T3
 10  0.3986  0.0410  0.1162    0.3590  0.0722  0.0960    0.3542  0.0828  0.0946    0.3414  0.0874  0.0886
 15  0.2588  0.0570  0.0992    0.2104  0.0660  0.0762    0.2218  0.0740  0.0786    0.2064  0.0732  0.0738
 20  0.2032  0.0690  0.0972    0.1658  0.0688  0.0752    0.1618  0.0678  0.0706    0.1650  0.0646  0.0656
 25  0.1594  0.0618  0.0818    0.1432  0.0678  0.0744    0.1382  0.0666  0.0700    0.1242  0.0628  0.0632
 50  0.0982  0.0602  0.0702    0.0878  0.0598  0.0620    0.0872  0.0570  0.0580    0.0794  0.0550  0.0550
100  0.0750  0.0582  0.0614    0.0694  0.0536  0.0548    0.0608  0.0482  0.0488    0.0620  0.0514  0.0514

m = 10
        λ₁ = 10                   λ₁ = 20                   λ₁ = 30                   λ₁ = 100
  n    T1      T2      T3        T1      T2      T3        T1      T2      T3        T1      T2      T3
 15  0.7074  0.0584  0.1368    0.6860  0.0840  0.1188    0.6809  0.0934  0.1168    0.6668  0.1074  0.1124
 20  0.5166  0.0750  0.1174    0.4970  0.0882  0.1056    0.4880  0.0910  0.1040    0.4718  0.0928  0.0956
 25  0.3950  0.0854  0.1156    0.3732  0.0794  0.0948    0.3642  0.0840  0.0938    0.3508  0.0874  0.0898
 30  0.3112  0.0696  0.0902    0.2896  0.0802  0.0890    0.2998  0.0856  0.0920    0.2872  0.0836  0.0852
 50  0.1780  0.0684  0.0786    0.1762  0.0702  0.0734    0.1764  0.0710  0.0750    0.1770  0.0724  0.0730
100  0.1148  0.0678  0.0722    0.1006  0.0578  0.0608    0.1006  0.0570  0.0582    0.1076  0.0656  0.0660

m = 15
        λ₁ = 15                   λ₁ = 25                   λ₁ = 35                   λ₁ = 100
  n    T1      T2      T3        T1      T2      T3        T1      T2      T3        T1      T2      T3
 20  0.8846  0.0678  0.1432    0.8694  0.0970  0.1394    0.8712  0.1080  0.1390    0.8694  0.1140  0.1244
 25  0.7284  0.0912  0.1406    0.7146  0.1056  0.1264    0.7178  0.1100  0.1282    0.7101  0.1078  0.1124
 30  0.6022  0.0886  0.1150    0.5932  0.0924  0.1088    0.5706  0.0952  0.1066    0.5764  0.1052  0.1098
 35  0.4944  0.0936  0.1166    0.4852  0.0910  0.1034    0.4998  0.0868  0.0962    0.4774  0.0976  0.0994
 50  0.3170  0.0698  0.0822    0.3218  0.0806  0.0910    0.3186  0.0792  0.0840    0.3086  0.0832  0.0866
100  0.1570  0.0674  0.0730    0.1550  0.0670  0.0706    0.1532  0.0620  0.0646    0.1428  0.0692  0.0694

We see that all three statistics, in general, yield inflated type I error probabilities. Regardless of the value of $\lambda_1$, $T_1$ can lead to extremely large type I error probabilities if the sample size is not large. On the other hand, $T_2$ yields probabilities that generally fall between 0.05 and 0.10 for all values of the sample size and $\lambda_1$ considered. As is to be expected, the simplification gained in using $T_3$ instead of $T_2$ is achieved at the cost of slightly higher error probabilities.
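In outline the study is straightforward to reproduce. The following is a hedged sketch of one cell of Table 1 ($m = 5$, $\lambda_1 = 5$, $n = 25$), using 5000 replications as in the text; the seed and helper structure are arbitrary, so the simulated rates will only approximate the tabled values. The correction factors are recomputed inline (cf. the `bartlett_c1` sketch in Section 2).

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1987)

def type1_rates(m, n, lam1, reps=5000, alpha=0.05):
    """Estimate type I error of T1, T2, T3 for H0: gamma_1 = e_1,
    with Lambda = diag(lam1, 1, ..., 1) as in Section 3."""
    lam = np.r_[lam1, np.ones(m - 1)]
    gamma0 = np.eye(m)[0]                      # true (and hypothesized) gamma_1
    crit = chi2.ppf(1 - alpha, m - 1)
    # Correction factors from the known roots.
    l1, rest = lam[0], lam[1:]
    s1 = np.sum(l1 * rest / (l1 - rest) ** 2)
    s2 = sum((rest[i] * rest[j] + l1 ** 2) / ((l1 - rest[i]) * (l1 - rest[j]))
             for i in range(m - 1) for j in range(i + 1, m - 1))
    c1 = n - (3.0 + 0.5 * (m - 2) + (3.0 * s1 + s2) / (m - 1))
    c2 = n - m - 1
    rej = np.zeros(3)
    for _ in range(reps):
        Z = rng.normal(size=(n + 1, m)) * np.sqrt(lam)   # N = n + 1 normal draws
        S = np.cov(Z, rowvar=False)
        l1_hat = np.linalg.eigvalsh(S)[-1]
        T1 = n * (l1_hat * (gamma0 @ np.linalg.solve(S, gamma0))
                  + (gamma0 @ S @ gamma0) / l1_hat - 2.0)
        T = T1 / n
        rej += [T1 > crit, c1 * T > crit, c2 * T > crit]
    return rej / reps

print(type1_rates(m=5, n=25, lam1=5.0))   # compare with the n = 25 row of Table 1
```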

4. Discussion

Throughout this paper we have concentrated on inferences concerning the first principal component vector, $\gamma_1$. If the first principal component is not sufficient, one may be interested in identifying the first $p$ principal components for some $p > 1$. In this case, a test of $H_0\colon \gamma_h = \gamma_0$ against $H_1\colon \gamma_h \neq \gamma_0$, for some $h > 1$, may be of interest. This can be carried out by using the adjusted statistic $T_2$ after some obvious changes in subscripts. However, if we are interested in a variance-preserving reduction in the number of variables, then it is not necessary to actually identify the principal components. For instance, if the first two principal components are sufficient and the two vectors $\gamma_{01}$ and $\gamma_{02}$ span the same subspace as $\gamma_1$ and $\gamma_2$, then we could use the new variables $y_1 = \gamma_{01}' z$ and $y_2 = \gamma_{02}' z$. Consequently, instead of considering two separate tests of the form $H_{0i}\colon \gamma_i = \gamma_{0i}$ against $H_{1i}\colon \gamma_i \neq \gamma_{0i}$, it may be more appropriate to consider the one test of $H_0$: $\gamma_1$ and $\gamma_2$ span the same subspace as $\gamma_{01}$ and $\gamma_{02}$. Tests of this type have been considered by Tyler (1981, 1983).

References

Anderson, T.W. (1963), Asymptotic theory for principal component analysis, Ann. Math. Statist. 34, 122-148.
Bartlett, M.S. (1937), Properties of sufficiency and statistical tests, Proc. Roy. Soc. London Ser. A 160, 268-282.
Lawley, D.N. (1956a), Tests of significance for the latent roots of covariance and correlation matrices, Biometrika 43, 128-136.
Lawley, D.N. (1956b), A general method for approximating to the distribution of likelihood ratio criteria, Biometrika 43, 295-303.
Sugiura, N. (1976), Asymptotic expansions of the distributions of the latent roots and the latent vector of the Wishart and multivariate F matrices, J. Multivariate Anal. 6, 500-525.
Tyler, D.E. (1981), Asymptotic inference for eigenvectors, Ann. Statist. 9, 725-736.
Tyler, D.E. (1983), A class of asymptotic tests for principal component vectors, Ann. Statist. 11, 1243-1250.
