Testing for the redundancy of variables in principal components analysis

James R. Schott
Department of Statistics, University of Central Florida, Orlando, FL 32816, USA

Statistics & Probability Letters 11 (1991) 495-501, North-Holland, June 1991
Received November 1989; revised August 1990

Abstract: When all of the important principal components have zero coefficients on the same original variables, then those variables are redundant and may be eliminated. Tyler (1981) derived a statistic suitable for testing such a hypothesis. An asymptotic expansion for the mean of this statistic is obtained and used to calculate a Bartlett adjustment factor. The performances of the unadjusted and adjusted statistics are investigated in a simulation.

Keywords: Asymptotic expansion, Bartlett adjustment factor, dimensionality reduction.

1. Introduction

Many studies, particularly in their initial stages, involve the collection of a large number of variables on different subjects or objects. Principal components analysis is a technique useful in reducing the dimensionality of the resulting data set without the loss of much information. Let the p × 1 vector x represent the vector of observed values of the variables mentioned above for one randomly selected subject. If its covariance matrix Σ has latent roots λ_1 ≥ ... ≥ λ_p and corresponding latent vectors γ_1, ..., γ_p, then the ith principal component of x is given by v_i = γ_i'x, has variance λ_i, and is uncorrelated with the other p − 1 principal components. Consequently, if for some k the first k principal components account for most of the variation in x, attention may be restricted to these components; if, in addition, each of them has zero coefficients on the same q original variables, those variables are redundant and may be eliminated. Taking the redundant variables to be the last q, and letting e_i denote the ith column of the p × p identity matrix, this situation corresponds to the hypothesis

  H_0(k, q): the vectors e_{p−q+1}, ..., e_p lie in the subspace generated by γ_{k+1}, ..., γ_p.

Let l_1 > l_2 > ... > l_p be the latent roots of a sample covariance matrix with n degrees of freedom, and γ̂_1, ..., γ̂_p the corresponding latent vectors. Put E = (e_{p−q+1}, ..., e_p), Y_2 = (γ̂_{k+1}, ..., γ̂_p), and

  D_j = diag{l_j l_{k+1}/(l_{k+1} − l_j)², ..., l_j l_p/(l_p − l_j)²}.

(This research was supported in part by an In-House Research Award from the Division of Sponsored Research, University of Central Florida.)

0167-7152/91/$03.50 © 1991 Elsevier Science Publishers B.V. (North-Holland)

Then Tyler's test rejects H_0(k, q) if r = rank(Y_2Y_2'E) < q, or if r = q and

  T_{k,q} = n Σ_{j=1}^k tr{E'γ̂_jγ̂_j'E (E'Y_2D_jY_2'E)⁻¹}

exceeds the 1 − α quantile of the chi-squared distribution with kq degrees of freedom. This is an asymptotic α level test for H_0(k, q) since, under H_0(k, q), as n → ∞, r → q in probability and T_{k,q} → χ²_{kq} in distribution. The purpose of this paper is to investigate this null approximating distribution and improve it through the use of a Bartlett adjustment factor.
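Computationally, the test only needs the eigendecomposition of the sample covariance matrix. The following NumPy sketch (the function name and data layout are our own, and the preliminary rank check on Y_2Y_2'E is omitted) evaluates T_{k,q} when the candidate redundant variables are the last q:

```python
import numpy as np

def tyler_statistic(S, n, k, q):
    """Tyler's T_{k,q} for H_0(k, q): the last q coordinate vectors lie in
    the span of the last p - k latent vectors of the covariance matrix."""
    p = S.shape[0]
    vals, vecs = np.linalg.eigh(S)           # eigh returns ascending order
    l, G = vals[::-1], vecs[:, ::-1]         # l_1 >= ... >= l_p, columns g_1, ..., g_p
    E = np.eye(p)[:, p - q:]                 # e_{p-q+1}, ..., e_p
    Y2 = G[:, k:]                            # g_{k+1}, ..., g_p
    T = 0.0
    for j in range(k):
        # D_j = diag{ l_j l_u / (l_u - l_j)^2 : u = k+1, ..., p }
        Dj = np.diag(l[j] * l[k:] / (l[k:] - l[j]) ** 2)
        M = E.T @ Y2 @ Dj @ Y2.T @ E
        gj = G[:, j:j + 1]
        T += np.trace(E.T @ gj @ gj.T @ E @ np.linalg.inv(M))
    return n * T

# Toy data simulated under H_0(2, 2): Sigma = diag(10, 5, 1, 1, 1)
rng = np.random.default_rng(1)
lam = np.array([10.0, 5.0, 1.0, 1.0, 1.0])
X = rng.normal(size=(101, 5)) * np.sqrt(lam)
S = np.cov(X, rowvar=False)                  # n = 100 degrees of freedom
T = tyler_statistic(S, n=100, k=2, q=2)      # refer to chi-square with kq = 4 df
```

Under H_0 the value should look like a draw from a χ² variate with kq degrees of freedom; comparing it with the 0.95 quantile gives the asymptotic 5% level test.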

2. The Bartlett adjustment factor

The basic idea behind Bartlett adjustment factors is a rather simple one. If a test statistic T has a mean which can be expressed as

  E(T) = a{1 + c/n + O(n^{-3/2})},   (1)

then the mean of the adjusted statistic T* = (1 − c/n)T can be expressed as

  E(T*) = a + O(n^{-3/2});
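The order claim is a one-line expansion: multiplying out and noting that the O(1/n) terms cancel exactly gives

```latex
E(T^{*}) = \left(1 - \frac{c}{n}\right) E(T)
         = a\left(1 - \frac{c}{n}\right)\left(1 + \frac{c}{n} + O(n^{-3/2})\right)
         = a\left(1 - \frac{c^{2}}{n^{2}} + O(n^{-3/2})\right)
         = a + O(n^{-3/2}).
```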

that is, the mean of T* comes closer to the mean, a, of the asymptotic null distribution than does the mean of T. Some general results concerning the effect of this adjustment on other moments, and its connection to the normalizing constant for the conditional distribution of a maximum likelihood estimator, can be found in Lawley (1956b) and Barndorff-Nielsen and Cox (1979, 1984). Although these results may very well extend to a more general setting, the papers above have only shown their application to the adjustment of a likelihood ratio statistic.

We will obtain the expected value of T_{k,q} in the form of (1) by first getting an asymptotic expression for T_{k,q} in terms of the elements of the matrix Δ = Γ'SΓ − Λ. Here S is the sample covariance matrix, Γ = (γ_1, ..., γ_p), and Λ = diag(λ_1, ..., λ_p). We will assume throughout this paper that we are sampling from a normal distribution; if this is not the case it may not be advisable to use a sample covariance matrix-based principal components analysis in the first place (Devlin, Gnanadesikan and Kettenring, 1981). Our main result is given in the following theorem.

Theorem.

Let Γ₁₂ be the q × (p − k) matrix containing the last q rows and last p − k columns of Γ, let

  Δ_j = diag{λ_jλ_{k+1}/(λ_{k+1} − λ_j)², ..., λ_jλ_p/(λ_p − λ_j)²},

and put Ω_j = Γ₁₂Δ_jΓ₁₂'. Define φ^j_{uv} to be the (u, v)th element of the (p − k) × (p − k) matrix Φ_j = Γ₁₂'Ω_j⁻¹Γ₁₂. Then the mean of T_{k,q} can be expressed as

  E(T_{k,q}) = kq{1 + c_{k,q}/n + O(n^{-3/2})},

where c_{k,q} is an explicit, though lengthy, function of the latent roots λ_1, ..., λ_p and the quantities φ^j_{uv}: it is built from sums, over j = 1, ..., k and u, v = k + 1, ..., p, of rational terms in the latent roots weighted by products of the φ^j_{u−k,v−k}.

Proof. If u_1, ..., u_p are the latent vectors of Γ'SΓ and U_2 = (u_{k+1}, ..., u_p), then, if H_0(k, q) is true, T_{k,q} can be written as

  T_{k,q} = n Σ_{j=1}^k tr{E'Γu_ju_j'Γ'E (E'ΓU_2D_jU_2'Γ'E)⁻¹}
          = n Σ_{j=1}^k Σ_{l=k+1}^p Σ_{m=k+1}^p Σ_{α=1}^q Σ_{β=1}^q γ_{p−q+α,l} γ_{p−q+β,m} u_{lj} u_{mj} z_{αβ}(j),   (2)

where z_{αβ}(j) is the (α, β)th element of (E'ΓU_2D_jU_2'Γ'E)⁻¹. For our purposes we will need an expression for T_{k,q} which is accurate up through fourth-order terms in the elements of Δ. Since the asymptotic expression (Sugiura, 1976) for u_{ij} (i ≠ j) involves first- and higher-order terms, we will need to specify z_{αβ}(j) up through second-order terms. Let D_j* = D_j − Δ_j, and let U_{22} be the q × (p − k) matrix containing the last q rows of U_2. If we put U_{22}* = U_{22} − I_{p−k}, we find that under H_0(k, q),

  (E'ΓU_2D_jU_2'Γ'E)⁻¹
    = {Ω_j + Γ₁₂(D_j* + Δ_jU_{22}*' + U_{22}*Δ_j + U_{22}*Δ_jU_{22}*' + D_j*U_{22}*' + U_{22}*D_j* + U_{22}*D_j*U_{22}*')Γ₁₂'}⁻¹
    = Ω_j⁻¹ − Ω_j⁻¹Γ₁₂(D_j* + Δ_jU_{22}*' + U_{22}*Δ_j + U_{22}*Δ_jU_{22}*' + D_j*U_{22}*' + U_{22}*D_j*)Γ₁₂'Ω_j⁻¹
      + Ω_j⁻¹Γ₁₂(D_j* + Δ_jU_{22}*' + U_{22}*Δ_j)Γ₁₂'Ω_j⁻¹Γ₁₂(D_j* + Δ_jU_{22}*' + U_{22}*Δ_j)Γ₁₂'Ω_j⁻¹,

where this last equality is correct up through second-order terms. If we denote the (α, β)th element of Ω_j⁻¹ by ω^j_{αβ} and the ith diagonal elements of Δ_j and D_j* by δ^j_i and d^j_i*, respectively, it then follows that, to the same accuracy,

z_{αβ}(j) equals ω^j_{αβ} plus correction terms that are linear and quadratic in the d^j_u* and in the elements of U_{22}*, with coefficients formed from the elements γ_{p−q+α,u} of Γ and the ω^j_{αβ}.

Using the asymptotic expansions of Lawley (1956a) and Sugiura (1976) for l_i and u_{ij}, respectively, we obtain an expression for z_{αβ}(j) in terms of the elements of Δ accurate up through second-order terms. This is then used in (2) to get an expansion for T_{k,q} accurate up through fourth-order terms. Since n(Δ + Λ) = nΓ'SΓ ~ W_p(Λ, n), the various moments in the elements of Δ needed to compute the mean of this expansion are obtained using the moments of the Wishart distribution. After some simplification the asymptotic expansion for E(T_{k,q}) is obtained in the form given above. □

If q = p − k, the hypothesis H_0(k, q) states that the subspace generated by γ_1, ..., γ_k is identical to that generated by e_1, ..., e_k. In this special case the expression for c_{k,q} simplifies further since Φ_j = Δ_j⁻¹. It follows from Schott (1989) that this simplified version, c_{k,p−k}, can be written in closed form as a function of k, p − k and sums, over j = 1, ..., k and u, v = k + 1, ..., p, of rational terms in the latent roots alone.

3. A simulation study

The performances of the adjusted and unadjusted statistics were compared in a simulation. The actual probabilities of type I error were estimated for the unadjusted statistic, T_{k,q}, the adjusted statistic, T*_{k,q} = (1 − c_{k,q}/n)T_{k,q}, and a third statistic, T**_{k,q} = (1 − c*_{k,q}/n)T_{k,q}. This last statistic was adjusted using c*_{k,q} = k + q + 1, which approximates c_{k,q} when λ_k is large relative to λ_{k+1} since c_{k,q} → c*_{k,q} as λ_k/λ_{k+1} → ∞. In computing the quantity c_{k,q}, sample estimates were used in place of the parameters. The nominal significance level used was 0.05 and each estimate of the type I error probability was computed from 1000 simulations. For simplicity, attention was restricted to covariance matrices Σ which are diagonal and have λ_{k+1} = ... = λ_p = 1. Some of the results are presented in Table 1.

We find that, in general, the adjusted and unadjusted statistics yield inflated type I error probabilities. The adjustment can result in a dramatic improvement, particularly if k and q are not both small.

Table 1
Simulated probabilities of type I error when the nominal significance level is 0.05. Under each set of roots the three columns give, in order, the unadjusted statistic T_{k,q}, the simple adjustment T**_{k,q} and the full adjustment T*_{k,q}; in every case λ_{k+1} = ... = λ_p = 1.

(p, k, q) = (10, 2, 4)

  n    (λ1, λ2) = (10, 1.5)    (10, 2)                 (10, 5)                 (25, 10)
  20   0.615  0.446  0.119     0.541  0.348  0.079     0.321  0.130  0.058     0.271  0.103  0.083
  25   0.594  0.451  0.129     0.483  0.324  0.085     0.231  0.087  0.044     0.223  0.102  0.086
  30   0.559  0.430  0.143     0.402  0.269  0.075     0.200  0.108  0.074     0.184  0.083  0.071
  50   0.447  0.375  0.115     0.268  0.196  0.066     0.117  0.070  0.054     0.120  0.068  0.065
  75   0.376  0.330  0.102     0.183  0.142  0.058     0.119  0.080  0.068     0.097  0.064  0.060

(p, k, q) = (10, 2, 1)

  n    (λ1, λ2) = (10, 1.5)    (10, 2)                 (10, 5)                 (25, 10)
  20   0.155  0.112  0.064     0.125  0.087  0.055     0.102  0.062  0.065     0.101  0.062  0.062
  25   0.156  0.130  0.078     0.136  0.107  0.080     0.073  0.044  0.043     0.079  0.049  0.050
  30   0.132  0.104  0.064     0.085  0.061  0.051     0.087  0.063  0.063     0.077  0.059  0.059
  50   0.121  0.097  0.059     0.073  0.059  0.047     0.051  0.036  0.038     0.073  0.061  0.063
  75   0.106  0.096  0.052     0.073  0.066  0.055     0.059  0.051  0.052     0.064  0.056  0.056

(p, k, q) = (15, 2, 1)

  n    (λ1, λ2) = (15, 1.5)    (15, 2)                 (15, 10)                (35, 15)
  25   0.091  0.064  0.059     0.099  0.066  0.061     0.058  0.078  0.064     0.076  0.075  0.052
  30   0.092  0.065  0.051     0.094  0.073  0.064     0.062  0.077  0.051     0.072  0.078  0.056
  35   0.087  0.066  0.050     0.071  0.053  0.054     0.038  0.059  0.045     0.057  0.066  0.043
  50   0.103  0.091  0.061     0.064  0.053  0.059     0.049  0.063  0.052     0.058  0.066  0.055
  75   0.074  0.068  0.037     0.056  0.048  0.047     0.053  0.051  0.047     0.050  0.050  0.045

(p, k, q) = (10, 4, 5)

  n    (λ1, ..., λ4) = (10, 10, 5, 1.5)   (10, 10, 5, 2)          (10, 10, 5, 5)          (25, 10, 10, 10)
  20   0.949  0.732  0.445               0.917  0.564  0.251     0.800  0.292  0.065     0.680  0.166  0.059
  25   0.913  0.522  0.222               0.813  0.522  0.222     0.633  0.206  0.051     0.564  0.154  0.095
  30   0.880  0.660  0.389               0.744  0.453  0.191     0.508  0.189  0.077     0.479  0.143  0.090
  50   0.705  0.563  0.333               0.468  0.276  0.106     0.298  0.133  0.083     0.239  0.098  0.073
  75   0.576  0.482  0.234               0.307  0.196  0.081     0.217  0.099  0.069     0.158  0.065  0.056

(p, k, q) = (10, 4, 2)

  n    (λ1, ..., λ4) = (10, 10, 5, 1.5)   (10, 10, 5, 2)          (10, 10, 5, 5)          (25, 10, 10, 10)
  20   0.521  0.301  0.083               0.456  0.266  0.075     0.298  0.128  0.070     0.286  0.100  0.080
  25   0.468  0.305  0.096               0.383  0.224  0.077     0.257  0.111  0.063     0.205  0.085  0.069
  30   0.411  0.280  0.089               0.303  0.186  0.057     0.224  0.106  0.072     0.188  0.089  0.073
  50   0.289  0.219  0.092               0.198  0.137  0.062     0.111  0.061  0.046     0.113  0.066  0.060
  75   0.221  0.176  0.067               0.148  0.100  0.054     0.102  0.066  0.054     0.094  0.064  0.060

(p, k, q) = (15, 4, 2)

  n    (λ1, ..., λ4) = (15, 15, 10, 1.5)  (15, 15, 10, 2)         (15, 15, 10, 10)        (30, 15, 15, 15)
  25   0.310  0.166  0.040               0.287  0.137  0.042     0.207  0.087  0.075     0.197  0.071  0.069
  30   0.313  0.191  0.046               0.267  0.129  0.037     0.180  0.081  0.078     0.153  0.058  0.055
  35   0.277  0.179  0.055               0.237  0.149  0.055     0.125  0.064  0.059     0.135  0.066  0.064
  50   0.207  0.150  0.038               0.186  0.121  0.046     0.109  0.063  0.063     0.114  0.051  0.051
  75   0.172  0.133  0.033               0.111  0.079  0.043     0.095  0.069  0.069     0.097  0.063  0.064

Regardless of the sample size, T*_{k,q} produces probabilities that fall, with few exceptions, between 0.05 and 0.10 when the kth and (k + 1)th roots are well separated; from our limited simulations, λ_k/λ_{k+1} > 3 seems to be sufficient. If k + q is not too large, say less than or equal to ½p, T*_{k,q} generally continues to perform well for small samples and ratios λ_k/λ_{k+1} as small as 1.5. As k + q increases, however, increasingly large sample sizes are required for small values of λ_k/λ_{k+1}. The computationally simpler adjusted statistic T**_{k,q} does not perform nearly as well as T*_{k,q} unless both k and q are small or the sample size is quite large.
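A Monte Carlo check of the type I error, in the spirit of the study above but on a much smaller scale, can be sketched as follows (the function names, dimensions and 500-replication budget are our own choices; 5.991 is the 0.95 quantile of the chi-squared distribution with kq = 2 degrees of freedom):

```python
import numpy as np

def tyler_T(S, n, k, q):
    # Unadjusted statistic T_{k,q}, computed from the eigendecomposition of S
    p = S.shape[0]
    vals, vecs = np.linalg.eigh(S)
    l, G = vals[::-1], vecs[:, ::-1]         # roots descending, matching vectors
    E = np.eye(p)[:, p - q:]                 # last q coordinate vectors
    Y2 = G[:, k:]
    T = 0.0
    for j in range(k):
        Dj = np.diag(l[j] * l[k:] / (l[k:] - l[j]) ** 2)
        M = E.T @ Y2 @ Dj @ Y2.T @ E
        gj = G[:, j:j + 1]
        T += np.trace(E.T @ gj @ gj.T @ E @ np.linalg.inv(M))
    return n * T

def type1_rate(lam, k, q, n, crit, reps=500, seed=0):
    # Fraction of simulated null samples on which T_{k,q} exceeds crit
    rng = np.random.default_rng(seed)
    p = len(lam)
    rej = 0
    for _ in range(reps):
        X = rng.normal(size=(n + 1, p)) * np.sqrt(lam)   # N(0, diag(lam)) data
        S = np.cov(X, rowvar=False)                      # n degrees of freedom
        if tyler_T(S, n, k, q) > crit:
            rej += 1
    return rej / reps

# H_0(2, 1) holds: Sigma = diag(10, 2, 1, 1, 1, 1); 5.991 = chi^2 quantile, 2 df
rate = type1_rate([10.0, 2.0, 1.0, 1.0, 1.0, 1.0], k=2, q=1, n=50, crit=5.991)
```

With a poorly separated λ_k/λ_{k+1} the estimated rate typically lands above the nominal 0.05, mirroring the inflation of the unadjusted statistic seen in Table 1.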

4. An example

To illustrate the hypothesis testing problem discussed in this paper, we use the data of Jackson and Morris (1957) from a study of the quality of pictures produced by a photographic process. The data were generated as follows. A film strip was given a graded series of exposures to white light and was processed. Optical densities of the resultant steps were then measured through red, green and blue filters having narrow transmission bands. In the study three steps were included: one at a high density level, one at an average density level and one at a low density level. As a result, there are nine measurements, three colors at each level. The sample covariance matrix, based on 108 degrees of freedom, can be found in the paper referenced above. The latent roots of this matrix are tabulated in Table 2(a), while the first two principal component vectors are given in Table 2(b). Since the first two principal components account for 75% of the total sample variance for the nine variables, we will concentrate on these two components. We find that the sample coefficients corresponding to the three low density variables have small values in both principal components. This naturally leads to the question of whether these three variables can be eliminated. Consequently, we consider the

Table 2
Data on the quality of pictures produced by a photographic process

(a) Latent roots of the sample covariance matrix

  878.5, 196.1, 128.6, 103.4, 81.3, 37.8, 7.0, 5.7, 3.5

(b) Principal component vectors

               Average density            High density               Low density
  Component    red     green    blue     red     green    blue      red     green    blue
  1            0.305   0.653    0.483    0.261   0.324    0.271     0.002   0.006    0.014
  2           -0.486  -0.151    0.588   -0.491  -0.038    0.373     0.057   0.054    0.088


hypothesis H_0(2, 3). Although the second and third sample roots are not that far apart, it appears from our simulations that the adjusted statistic should perform adequately. For example, in testing H_0(2, 4) when p = 10 and n = 75, the estimated significance level was 0.102 for λ_2 = 1.5 and 0.058 for λ_2 = 2. Routine calculation yields the unadjusted statistic T_{2,3} = 3.45. When compared to the chi-squared distribution with 6 degrees of freedom, we see that H_0(2, 3) is not rejected at any reasonable significance level. To compute the adjusted statistics, we note that c_{2,3} = 10.3 and c*_{2,3} = 6. These then yield the adjusted statistics T*_{2,3} = 3.12 and T**_{2,3} = 3.26, so that for this particular example the adjusted statistics lead to the same conclusion as the unadjusted statistic.
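The example's arithmetic is easy to verify; for even degrees of freedom the chi-squared upper-tail probability has a closed form, used in this small check (the variable and function names are ours):

```python
import math

def chi2_sf_6df(x):
    # P(X > x) for X ~ chi-squared with 6 df:
    # exp(-x/2) * (1 + x/2 + (x/2)^2 / 2), the closed form for even df
    h = x / 2.0
    return math.exp(-h) * (1.0 + h + h * h / 2.0)

n, T = 108, 3.45                    # degrees of freedom and unadjusted T_{2,3}
T_star = (1 - 10.3 / n) * T         # full adjustment, c_{2,3} = 10.3  -> 3.12
T_dstar = (1 - 6.0 / n) * T         # simple adjustment, c*_{2,3} = 6  -> 3.26

p_values = [chi2_sf_6df(t) for t in (T, T_star, T_dstar)]
```

All three p-values come out around 0.75 to 0.80, so none of the statistics comes close to rejecting H_0(2, 3).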

Acknowledgement

This paper has benefited from some helpful comments and suggestions from the referee.

References

Barndorff-Nielsen, O. and D.R. Cox (1979), Edgeworth and saddlepoint approximations with statistical applications (with discussion), J. Roy. Statist. Soc. Ser. B 41, 279-312.
Barndorff-Nielsen, O. and D.R. Cox (1984), Bartlett adjustments to the likelihood ratio statistic and the distribution of the maximum likelihood estimator, J. Roy. Statist. Soc. Ser. B 46, 483-495.
Devlin, S.J., R. Gnanadesikan and J.R. Kettenring (1981), Robust estimation of dispersion matrices and principal components, J. Amer. Statist. Assoc. 76, 354-362.
Jackson, J.E. and R.H. Morris (1957), An application of multivariate quality control to photographic processing, J. Amer. Statist. Assoc. 52, 186-199.
Lawley, D.N. (1956a), Tests of significance for the latent roots of covariance and correlation matrices, Biometrika 43, 128-136.
Lawley, D.N. (1956b), A general method for approximating to the distribution of likelihood ratio criteria, Biometrika 43, 295-303.
Schott, J.R. (1989), An adjustment for a test concerning a principal component subspace, Statist. Probab. Lett. 7, 425-430.
Sugiura, N. (1976), Asymptotic expansions of the distributions of the latent roots and latent vector of the Wishart and multivariate F matrices, J. Multivariate Anal. 6, 500-525.
Tyler, D.E. (1981), Asymptotic inference for eigenvectors, Ann. Statist. 9, 725-736.
