A note on subset selection for matrices

Linear Algebra and its Applications 434 (2011) 1845–1850

F.R. de Hoog (a), R.M.M. Mattheij (b,*)

(a) CSIRO Mathematics, Informatics and Statistics, GPO Box 664, Canberra, ACT 2601, Australia
(b) Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

* Corresponding author. E-mail addresses: [email protected] (F.R. de Hoog), [email protected] (R.M.M. Mattheij).
doi:10.1016/j.laa.2010.11.053

Article history: Received 6 October 2008; accepted 22 November 2010; available online 11 January 2011. Submitted by V. Mehrmann.

AMS classification: 15A; 65F
Keywords: Determinant; Submatrix; Overdetermined

Abstract. In an earlier paper the authors examined the problem of selecting rows of a matrix so that the resulting matrix is as "non-singular" as possible. However, the proof of the key result in that paper is not constructive. In this note we give a constructive proof of that result. In addition, we examine a case where "as non-singular as possible" means maximizing a determinant, and we provide a new bound and a constructive proof for this case also.
1. Introduction

In [3], the problem of selecting k rows from an m × n matrix such that the resulting matrix is as non-singular as possible was examined. That is, for X ∈ R^{m×n} with n < m and rank(X) = n, find a permutation matrix P ∈ R^{m×m} so that

P X = \begin{bmatrix} A \\ B \end{bmatrix}, \qquad A \in \mathbb{R}^{k \times n},    (1)

where A is the matrix in question, n ≤ k < m and rank(A) = n.

To motivate this problem, consider regression, where we have a vector of k observations y = Aθ + δ. Here, A ∈ R^{k×n} is a design matrix whose rows are a subset of the rows of X ∈ R^{m×n}, θ ∈ R^n is a vector of unknown parameters that is to be determined, and δ ∈ R^k is a vector whose components are independent and identically normally distributed. Such problems occur when observations are expensive and only a subset of all possible measurements is feasible.

The least squares estimate of the unknown parameters is θ̂ = A⁺y, where A⁺ is the Moore–Penrose inverse. In optimal design (see, for example, [6]), the design matrix is chosen to be as non-singular as possible, and the exact meaning of this depends somewhat on the application. For example, minimizing ‖A⁺‖_F = (Trace((AᵀA)^{-1}))^{1/2} ensures that the expected mean squared error of θ̂ is minimized. E-optimal designs maximize the smallest singular value of A (or, equivalently, minimize ‖A⁺‖₂), and D-optimal designs minimize the volume of the confidence ellipsoid for θ, which is equivalent to maximizing det(AᵀA).
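To make these criteria concrete, the following minimal sketch (our illustration, not part of the paper; the helper name design_criteria and the random data are ours) computes the three quantities for a candidate design matrix:

```python
import numpy as np

def design_criteria(A):
    """Return the three design criteria discussed above for a full-column-rank A:
    trace((A^T A)^{-1}) = ||A^+||_F^2 (smaller is better), the smallest singular
    value of A (E-optimality: larger is better), and det(A^T A) (D-optimality:
    larger is better)."""
    G = A.T @ A
    return (np.trace(np.linalg.inv(G)),
            np.linalg.svd(A, compute_uv=False).min(),
            np.linalg.det(G))

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 3))   # m = 12 candidate rows, n = 3 parameters
print(design_criteria(X[:5]))      # criteria for one subset of k = 5 rows
```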

Further applications of subset selection are described in [3]. It should be noted, however, that the term subset selection is also used in the context of finding sparse approximations to regression problems (see, for example, [4]) and in the context of finding low-rank approximations to X (see, for example, [1]). These are related but quite different problems from the one considered in the present note.

Row selection is often implemented using a QR decomposition of Xᵀ with column interchanges chosen to maximize the size of the pivots (see [2] and also [4, Section 12.2]). This algorithm usually works well, but there are examples [5, p. 31] where the pivot size does not adequately reflect the size of the singular values. As a consequence, bounds from the analysis of such algorithms would lead to poor bounds for the singular values and for related quantities such as det(AᵀA).

In [3], the present authors derived upper bounds for ‖A⁺‖_F and the singular values of A. In this note, we extend these results by giving a constructive derivation of the bounds on ‖A⁺‖_F and new lower bounds for det(AᵀA).
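For reference, the pivoted-QR row selection mentioned above can be sketched as follows. This is an illustration using SciPy's qr with column pivoting, not code from [2] or [4]; select_rows_qr is our own helper name.

```python
import numpy as np
from scipy.linalg import qr

def select_rows_qr(X, k):
    """Select k rows of X (m x n, m > n) via a pivoted QR factorization of X^T.

    Column pivoting on X^T picks, at each step, the remaining column of X^T
    (i.e. row of X) with the largest pivot; we keep the first k pivot indices.
    """
    _, _, piv = qr(X.T, pivoting=True)
    rows = np.sort(piv[:k])
    return rows, X[rows]

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 4))
rows, A = select_rows_qr(X, k=6)
print(rows, np.linalg.det(A.T @ A))
```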

2. Results

We can rewrite (1) as

P X = \begin{bmatrix} A \\ B \end{bmatrix} = \begin{bmatrix} Q \\ Y \end{bmatrix} (A^T A)^{1/2},    (2)

where Q := A (AᵀA)^{-1/2} and Y := B (AᵀA)^{-1/2}. It follows that

X^T X = (A^T A)^{1/2} \left( I + Y^T Y \right) (A^T A)^{1/2},    (3)

and, by applying the usual variational formulation for singular values to (3), we obtain

\sigma_l^2(A) \le \sigma_l^2(X) \le \left( 1 + \|Y\|_2^2 \right) \sigma_l^2(A), \qquad l = 1, \dots, n.    (4)

Here σ_l(A) and σ_l(X) are the singular values of A and X, respectively. Thus, the singular values of A will not be small if ‖Y‖₂ is not large. In addition,

\det(X^T X) = \det\!\left( I + Y^T Y \right) \det(A^T A),    (5)

and so maximizing det(AᵀA) is equivalent to minimizing det(I + YᵀY). Thus, in terms of the singular values σ_l(A) and the determinant det(AᵀA), choosing a permutation P for which ‖Y‖ is not too large should result in a matrix A that is reasonably well conditioned.
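As a quick numerical check of (2)–(5) (our addition, not in the paper), the sketch below forms Y for one partition of a random X and verifies the determinant identity (5) and the two-sided singular value bound (4):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 15, 4, 6
X = rng.standard_normal((m, n))
A, B = X[:k], X[k:]                         # one particular partition (1)

w, V = np.linalg.eigh(A.T @ A)              # (A^T A)^{-1/2} via an eigendecomposition
G_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
Y = B @ G_inv_half                          # Y = B (A^T A)^{-1/2}, as in (2)

# Determinant identity (5): det(X^T X) = det(I + Y^T Y) det(A^T A)
print(np.isclose(np.linalg.det(X.T @ X),
                 np.linalg.det(np.eye(n) + Y.T @ Y) * np.linalg.det(A.T @ A)))

# Two-sided singular value bound (4)
sA = np.linalg.svd(A, compute_uv=False)
sX = np.linalg.svd(X, compute_uv=False)
y2 = np.linalg.norm(Y, 2) ** 2              # ||Y||_2^2 (spectral norm squared)
print(np.all(sA**2 <= sX**2 + 1e-9), np.all(sX**2 <= (1 + y2) * sA**2 + 1e-9))
```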

We now show that a permutation indeed exists for which ‖Y‖ is not large. This result was established in [3] by assuming that P is chosen to maximize det(AᵀA); that proof, however, is not constructive.


Here, we give a construction based on a greedy algorithm in which rows of X are deleted, one at a time, so as to minimize the Frobenius norm of Y at each step.

Theorem 1. There is a permutation matrix P so that (2) holds with

\|Y\|_F^2 \le \frac{(m-k)\, n}{k-n+1}, \qquad n \le k < m.    (6)

Proof. We proceed by induction and first show that Theorem 1 is true when k = m − 1. Let x_jᵀ, j = 1, …, m, denote the rows of X, let p := arg min_j x_jᵀ(XᵀX)^{-1}x_j and B = x_pᵀ. Since

m\, x_p^T (X^T X)^{-1} x_p \le \sum_{j=1}^{m} x_j^T (X^T X)^{-1} x_j = n,

it follows that

\|Y\|_F^2 = x_p^T \left( X^T X - x_p x_p^T \right)^{-1} x_p
          = \frac{x_p^T (X^T X)^{-1} x_p}{1 - x_p^T (X^T X)^{-1} x_p}
          \le \frac{n}{m-n}.

Thus the result is true when k = m − 1.

Now, suppose that Theorem 1 is true when k = l > n. Our aim is to show that Theorem 1 is then also true for k = l − 1, and our approach is to modify the partitioning given in (1) by deleting a row from A and appending it to B. Specifically, we let A = [a_1, …, a_l]ᵀ, Q = [q_1, …, q_l]ᵀ, Y = [y_1, …, y_l]ᵀ and define A_j := [a_1, …, a_{j−1}, a_{j+1}, …, a_l]ᵀ, B_j := [a_j, Bᵀ]ᵀ. Then, for some permutation matrix P_j, we have

P_j X = \begin{bmatrix} A_j \\ B_j \end{bmatrix}, \qquad A_j \in \mathbb{R}^{(l-1) \times n}, \quad B_j \in \mathbb{R}^{(m-l+1) \times n}.

Not all choices of j will necessarily result in a matrix A_j that is of full rank, but since

\det(A_j^T A_j) = \det\!\left( A^T A - a_j a_j^T \right) = \left( 1 - \|q_j\|_2^2 \right) \det(A^T A)

and the columns of Q are orthonormal, it follows that ‖q_r‖₂ ≤ 1, r = 1, …, l, and that there are at least l − n choices of j that result in a full rank matrix. For ‖q_j‖₂ < 1, A_j is a full rank matrix and it follows that

P_j X = \begin{bmatrix} A_j \\ B_j \end{bmatrix} = \begin{bmatrix} Q_j \\ Y_j \end{bmatrix} (A_j^T A_j)^{1/2},
\qquad \text{where } Q_j = A_j (A_j^T A_j)^{-1/2}, \quad Y_j = B_j (A_j^T A_j)^{-1/2}.

We have

\|Y_j\|_F^2 = \operatorname{Trace}\!\left( (A_j^T A_j)^{-1/2} B_j^T B_j (A_j^T A_j)^{-1/2} \right)
            = \operatorname{Trace}\!\left( B_j^T B_j (A_j^T A_j)^{-1} \right)
            = \operatorname{Trace}\!\left( (B^T B + a_j a_j^T)(A^T A - a_j a_j^T)^{-1} \right)
            = \operatorname{Trace}\!\left( (Y^T Y + q_j q_j^T)(I - q_j q_j^T)^{-1} \right)
            = \|Y\|_F^2 + \left( 1 - \|q_j\|_2^2 \right)^{-1} \left( \|Y q_j\|_2^2 + \|q_j\|_2^2 \right).

Now let p := arg min_{ \{j : \|q_j\|_2 < 1\} } ‖Y_j‖_F². Then,

\left( 1 - \|q_j\|_2^2 \right) \|Y_p\|_F^2 \le \left( 1 - \|q_j\|_2^2 \right) \|Y_j\|_F^2
  = \left( 1 - \|q_j\|_2^2 \right) \|Y\|_F^2 + \|Y q_j\|_2^2 + \|q_j\|_2^2.

Clearly, this last inequality also holds when ‖q_j‖₂ = 1. On summing the two sides of this inequality over j and noting that Σ_{j=1}^{l} ‖q_j‖₂² = n and Σ_{j=1}^{l} ‖Y q_j‖₂² = ‖Y‖_F², we obtain

(l - n)\, \|Y_p\|_F^2 \le (l - n + 1)\, \|Y\|_F^2 + n \le \frac{(l-n+1)(m-l)\, n}{l-n+1} + n = (m - l + 1)\, n.

Thus,

P_p X = \begin{bmatrix} A_p \\ B_p \end{bmatrix} = \begin{bmatrix} Q_p \\ Y_p \end{bmatrix} (A_p^T A_p)^{1/2},
\qquad A_p \in \mathbb{R}^{(l-1) \times n}, \quad B_p \in \mathbb{R}^{(m-l+1) \times n},

and

\|Y_p\|_F^2 = \left\| B_p (A_p^T A_p)^{-1/2} \right\|_F^2 \le \frac{(m-l+1)\, n}{l-n}.

Hence (6) also holds when k = l − 1, and Theorem 1 now follows by induction. □
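The proof of Theorem 1 is constructive and translates directly into a greedy deletion procedure: at each step, delete the row of the current A whose removal increases ‖Y‖_F² the least, using the update formula derived above. The sketch below is our illustration (the paper gives no code; greedy_min_Y is our naming), followed by a comparison of the resulting ‖Y‖_F² with the bound (6).

```python
import numpy as np

def greedy_min_Y(X, k):
    """Greedy row deletion from the proof of Theorem 1.

    At each step the row a_j of the current A is deleted that minimizes
    ||Y_j||_F^2 = ||Y||_F^2 + (||Y q_j||^2 + ||q_j||^2) / (1 - ||q_j||^2),
    i.e. the row with the smallest increment. Returns the indices (into X)
    of the k retained rows.
    """
    keep = list(range(X.shape[0]))
    while len(keep) > k:
        A = X[keep]
        B = np.delete(X, keep, axis=0)            # rows deleted so far
        w, V = np.linalg.eigh(A.T @ A)
        G_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        Q = A @ G_inv_half                        # rows are q_j^T
        Y = B @ G_inv_half
        q2 = np.sum(Q**2, axis=1)                 # ||q_j||^2
        Yq2 = np.sum((Y @ Q.T)**2, axis=0)        # ||Y q_j||^2
        incr = np.full(len(keep), np.inf)         # inf: deletion would make A rank deficient
        ok = q2 < 1.0 - 1e-12
        incr[ok] = (Yq2[ok] + q2[ok]) / (1.0 - q2[ok])
        keep.pop(int(np.argmin(incr)))
    return np.array(keep)

rng = np.random.default_rng(3)
m, n, k = 20, 4, 7
X = rng.standard_normal((m, n))
idx = greedy_min_Y(X, k)
A, B = X[idx], np.delete(X, idx, axis=0)
Y_frob2 = np.trace(B.T @ B @ np.linalg.inv(A.T @ A))   # ||Y||_F^2
print(Y_frob2, (m - k) * n / (k - n + 1))              # compare with the bound (6)
```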

Having established a bound for ‖Y‖_F², it is straightforward to establish bounds for ‖A⁺‖_F.

Corollary 1. There is a permutation matrix P such that (1) holds with

\|A^{+}\|_F \le \left( \frac{n\,(m-n+1)}{k-n+1} \right)^{1/2} \|X^{+}\|_2.

Proof. See [3] for a proof of this and other results on bounds for the singular values of A. □

We can also use the bound for ‖Y‖_F² to establish a bound for det(AᵀA).

Corollary 2. There is a permutation matrix P such that

\det(A^T A) \ge \left( \frac{k-n+1}{m-n+1} \right)^{\! n} \det(X^T X).    (7)

Proof. From the arithmetic–geometric mean inequality we have

\det\!\left( I + Y^T Y \right) \le \left( 1 + \tfrac{1}{n} \|Y\|_F^2 \right)^{\! n}.

The result then follows by applying this inequality to (5) and then using Theorem 1. □
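For completeness, the arithmetic–geometric mean step can be written out as follows (this elaboration is ours and is not spelled out in the paper). The eigenvalues of I + YᵀY are 1 + σ_i²(Y), i = 1, …, n, and their sum is n + ‖Y‖_F², so

\det(I + Y^T Y) = \prod_{i=1}^{n} \left( 1 + \sigma_i^2(Y) \right)
  \le \left( \frac{1}{n} \sum_{i=1}^{n} \left( 1 + \sigma_i^2(Y) \right) \right)^{\! n}
  = \left( 1 + \tfrac{1}{n} \|Y\|_F^2 \right)^{\! n}
  \le \left( \frac{m-n+1}{k-n+1} \right)^{\! n},

where the last inequality uses (6); combining this with (5) gives (7).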

A somewhat tighter bound can be obtained by analyzing a greedy algorithm where det(AᵀA) is maximized at each step.


Theorem 2. There is a permutation matrix P ∈ R^{m×m} such that (1) holds with

\det(A^T A) \ge \left( \prod_{j=k+1}^{m} \frac{j-n}{j} \right) \det(X^T X).    (8)

Proof. We proceed by induction and first show that Theorem 2 is true when k = m − 1. Let p := arg max_j det(XᵀX − x_j x_jᵀ), A := [x_1, …, x_{p−1}, x_{p+1}, …, x_m]ᵀ and B := x_pᵀ. Then,

\det(A^T A) = \det\!\left( X^T X - x_p x_p^T \right)
  \ge \frac{1}{m} \sum_{j=1}^{m} \det\!\left( X^T X - x_j x_j^T \right)
  = \frac{1}{m} \sum_{j=1}^{m} \left( 1 - x_j^T (X^T X)^{-1} x_j \right) \det(X^T X)
  = \frac{m-n}{m}\, \det(X^T X).

Thus the result is true when k = m − 1.

Now, suppose that (8) holds when k = l > n. Let A_j and B_j, j = 1, …, l, be defined as in the proof of Theorem 1 and let p := arg max_j det(AᵀA − a_j a_jᵀ). Then,

\det(A_p^T A_p) = \det\!\left( A^T A - a_p a_p^T \right)
  \ge \frac{1}{l} \sum_{j=1}^{l} \det\!\left( A^T A - a_j a_j^T \right)
  = \frac{1}{l} \sum_{j=1}^{l} \left( 1 - a_j^T (A^T A)^{-1} a_j \right) \det(A^T A)
  = \frac{l-n}{l}\, \det(A^T A)
  \ge \frac{l-n}{l} \left( \prod_{j=l+1}^{m} \frac{j-n}{j} \right) \det(X^T X)
  = \left( \prod_{j=l}^{m} \frac{j-n}{j} \right) \det(X^T X).

Thus,

P_p X = \begin{bmatrix} A_p \\ B_p \end{bmatrix}, \qquad A_p \in \mathbb{R}^{(l-1) \times n}, \quad B_p \in \mathbb{R}^{(m-l+1) \times n},

and

\det(A_p^T A_p) \ge \left( \prod_{j=l}^{m} \frac{j-n}{j} \right) \det(X^T X).

Hence (8) also holds when k = l − 1, and Theorem 2 now follows by induction. □
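The greedy step analysed in Theorem 2 amounts to deleting, at each stage, the row a_j of the current A with the smallest value of a_jᵀ(AᵀA)^{-1}a_j, since det(AᵀA − a_j a_jᵀ) = (1 − a_jᵀ(AᵀA)^{-1}a_j) det(AᵀA). A minimal sketch (our illustration; greedy_max_det is our naming):

```python
import numpy as np

def greedy_max_det(X, k):
    """Greedy row deletion analysed in Theorem 2.

    At each step the row a_j of the current A with the smallest value of
    a_j^T (A^T A)^{-1} a_j is deleted; by the rank-one determinant identity
    this maximizes det(A^T A) over all single-row deletions.
    """
    keep = list(range(X.shape[0]))
    while len(keep) > k:
        A = X[keep]
        G_inv = np.linalg.inv(A.T @ A)
        lev = np.einsum('ij,jk,ik->i', A, G_inv, A)   # a_j^T (A^T A)^{-1} a_j
        keep.pop(int(np.argmin(lev)))
    return np.array(keep)

rng = np.random.default_rng(4)
m, n, k = 20, 4, 6
X = rng.standard_normal((m, n))
A = X[greedy_max_det(X, k)]
bound = np.prod([(j - n) / j for j in range(k + 1, m + 1)]) * np.linalg.det(X.T @ X)
print(np.linalg.det(A.T @ A) >= bound)                # bound (8) of Theorem 2
```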


3. Discussion

We now compare the bounds given in Corollary 2 and Theorem 2, which are the same for n = 1. We have

\sum_{j=k+1}^{m} \log\frac{j-n}{j}
  = \log\frac{k+1-n}{k+1} - \log\frac{m+1-n}{m+1} + \sum_{j=k+2}^{m+1} \log\frac{j-n}{j}
  \ge \log\frac{k+1-n}{k+1} - \log\frac{m+1-n}{m+1} + \int_{k+1}^{m+1} \log\frac{x-n}{x}\, dx
  = n \log\frac{k+1-n}{m+1-n} + m \log\!\left( 1 - \frac{n}{m+1} \right) - k \log\!\left( 1 - \frac{n}{k+1} \right).

Thus, for n ≥ 2,

\sum_{j=k+1}^{m} \log\frac{j-n}{j} - n \log\frac{k+1-n}{m+1-n}
  \ge m \log\!\left( 1 - \frac{n}{m+1} \right) - k \log\!\left( 1 - \frac{n}{k+1} \right) \ge 0.

This demonstrates that the bound (8) given in Theorem 2 is superior to the bound given by (7) in Corollary 2. The difference can be substantial when k is relatively small. For example, if m is large relative to n and k = n, then

\prod_{j=k+1}^{m} \frac{j-n}{j} \left( \frac{m+1-n}{k+1-n} \right)^{\! n}
  \gtrsim \left( 1 - \frac{n}{m+1} \right)^{m} \left( 1 - \frac{n}{k+1} \right)^{-k}
  \approx \left( \frac{n+1}{e} \right)^{\! n}.
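To see the difference concretely, the following snippet (our numerical illustration) evaluates the lower-bound factors appearing in (7) and (8) for a few values of m, n and k:

```python
import numpy as np

def factor_cor2(m, n, k):
    """Factor ((k - n + 1) / (m - n + 1))^n from Corollary 2, bound (7)."""
    return ((k - n + 1) / (m - n + 1)) ** n

def factor_thm2(m, n, k):
    """Factor prod_{j=k+1}^{m} (j - n) / j from Theorem 2, bound (8)."""
    return np.prod([(j - n) / j for j in range(k + 1, m + 1)])

for m, n, k in [(100, 2, 2), (100, 5, 10), (1000, 5, 5)]:
    c, t = factor_cor2(m, n, k), factor_thm2(m, n, k)
    print(m, n, k, c, t, t / c)   # for k = n the ratio is at least about ((n+1)/e)^n
```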

It should also be noted that

\prod_{j=k+1}^{m} \frac{j-n}{j} \ge \frac{k!\,(m-k)!}{m!},

and hence the bound (8) in Theorem 2 is also superior to the value of det(AᵀA) averaged over all permutations P.

Acknowledgements

The authors are grateful to an anonymous referee whose detailed comments have resulted in a substantially improved presentation of the results.

References

[1] C. Boutsidis, M.W. Mahoney, P. Drineas, An improved approximation algorithm for the column subset selection problem, in: Proceedings of the Nineteenth Annual ACM–SIAM Symposium on Discrete Algorithms, New York, January 4–6, 2009, Society for Industrial and Applied Mathematics, Philadelphia, PA, pp. 968–977.
[2] P.A. Businger, G.H. Golub, Linear least squares solutions by Householder transformations, Numer. Math. 7 (1965) 269–276.
[3] F.R. de Hoog, R.M.M. Mattheij, Subset selection for matrices, Linear Algebra Appl. 422 (2007) 349–359.
[4] G.H. Golub, C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 1983.
[5] C.L. Lawson, R.J. Hanson, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ, 1974.
[6] S.D. Silvey, Optimal Design, Chapman and Hall, London, 1980.