Linear Algebra and its Applications 434 (2011) 1845–1850
Contents lists available at ScienceDirect
Linear Algebra and its Applications journal homepage: w w w . e l s e v i e r . c o m / l o c a t e / l a a
A note on subset selection for matrices F.R. de Hoog a , R.M.M. Mattheij b,∗ a
CSIRO Mathematics, Informatics and Statistics, GPO Box 664, Canberra, ACT 2601, Australia Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, P.O. Box 513, 5600 MB Eindhoven, The Netherlands b
ARTICLE INFO
ABSTRACT
Article history: Received 6 October 2008 Accepted 22 November 2010 Available online 11 January 2011
In an earlier paper the authors examined the problem of selecting rows of a matrix so that the resulting matrix is as “non-singular" as possible. However, the proof of the key result in that paper is not constructive. In this note we give a constructive proof for that result. In addition, we examine a case where as non-singular as possible means maximizing a determinant and provide a new bound and a constructive proof for this case also. © 2010 Elsevier Inc. All rights reserved.
Submitted by V. Mehrmann AMS Classification: 15Z 65F Keywords: Determinant Submatrix Overdetermined
1. Introduction In [3], the problem of selecting k rows from an m × n matrix such that the resulting matrix is as non-singular as possible was examined. That is, for X ∈ R m×n with n < m and rank (X ) = n, find a permutation matrix P ∈ R m×m so that ⎤
⎡ PX
A
= ⎣ ⎦ , A ∈ R k ×n ,
(1)
B
where A is the matrix in question, n k < m and rank (A ) = n. To motivate this problem, consider the problem of regression where we have a vector of n observations y = A θ + δ . Here, A ∈ R k×n is a design matrix whose rows are a subset of the rows of X ∈ R m×n , θ ∈ R n is a vector of unknown parameters that is to be determined and δ ∈ R k is a ∗ Corresponding author. E-mail addresses:
[email protected] (F.R. de Hoog),
[email protected] (R.M.M. Mattheij). 0024-3795/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.laa.2010.11.053
1846
F.R. de Hoog, R.M.M. Mattheij / Linear Algebra and its Applications 434 (2011) 1845–1850
vector whose components are independent and identically normally distributed. Such problems occur when observations are expensive and only a subset of all possible measurements is feasible. The least
squares estimate of the unknown parameters is θˆ = A + y where A + is the Moore–Penrose inverse. In optimal design (see, for example, [6]), the design matrix is chosen to be as non-singular as possible and the exact meaning of this depends somewhat on the application. For example, minimizing A + F = −1 Trace A T A ensures that the expected mean squared error of θ is minimized. E-optimal designs maximize the smallest singular value of A (or equivalently, maximize A + 2 ) and D-optimal designs
minimize the volume of the confidence ellipsoid for θ and is equivalent to minimizing det A T A . Further applications of subset selection are described in [3]. It should be noted, however, that the term subset selection is also used in the context of finding a sparse approximations (see, for example, [4]) to regression problems and is also used in the context of finding low rank approximations to X (see, for example, [1]). These are related but quite different problems from that considered in the present note. Row selection is often implemented using a QR-decomposition of X T with column interchange to maximize the size of the pivots (see [2] and also [4], Section 12.2). This algorithm usually works well but there are examples [5, p. 31] where the pivot size does not adequately reflect the size of the singular values. As a consequence bounds from the analysis of such algorithms would lead to poor bounds for
the singular values and related quantities such as det A T A . In [3], the present authors derived upper bounds for A + F and the singular values of A. In this note, we extend these results by deriving a constructive derivation for the bounds on A + F and new
lower bounds for det A T A . 2. Results We can rewrite (1) as ⎤
⎡ PX
Q
B
Y
=⎣ ⎦=⎣
where Q
⎤
⎡
A
:= A A T A
− 1 2
1 ⎦ AT A 2
(2)
and Y
:= B A T A
− 1 2
.
It follows that
= AT A
1
(3)
and, by applying the usual variational formulation for singular values to (3), we obtain
σl2 (A ) σl2 (X) 1 + Y 22 σl2 (A ) , l = 1, . . . , n.
(4)
2
I + YT Y
1 2 AT A
,
XT X
Here σl (A ) and σl (X ) are the singular values of A and X, respectively. Thus, the singular values of A will not be small if VertYVert 2 is not large. In addition,
(5) det X T X = det I + Y T Y det A T A .
and so maximizing det A T A is equivalent to minimizing det I + Y T Y . Thus, in terms of singular
values σl (A ) and the determinant det A T A , choosing a permutation P for which Y is not too large, should result in a matrix A which is reasonably well conditioned. We now show that a permutation exists so that Y is not large indeed. This result was established
in [3] by assuming that P is chosen to maximize det A T A ; that proof, however, is not constructive.
F.R. de Hoog, R.M.M. Mattheij / Linear Algebra and its Applications 434 (2011) 1845–1850
1847
Here, we give a construction based on a greedy algorithm where rows of X are deleted, one at a time, so as to minimize the Frobenius norm of Yat each step. Theorem 1. There is a permutation matrix P so that (2) holds with
(m − k) n , n k < m. k−n+1
Y 2F
(6)
Proof. We proceed by induction and first show that Theorem 1 is true when k = m − 1. Let xjT , j =
−1 −1
1, . . . , m denote the rows of X , p := arg min xjT X T X xj and B = xpT . Since mxpT X T X xp j
−1 T T m xj = n, it follows that j=1 xj X X
Y 2F xpT XT X − xpT x T
−1
xp
−1
xpT X T X xp −1 1 − xpT X T X xp
=
n
m−n
.
Thus the result is true whenk = m − 1. Now, suppose that Theorem 1 is true when k = l > n. Our aim is to show that Theorem 1 is then also true for k = l − 1 and our approach is to modify the partitioning given in (1) by deleting a row from A and appending it to B. Specifically, we let A = [a1 , . . . , al ]T , Q = [q1 , . . . , ql ]T , Y = [y1 , . . . , yl ]T T T and define Aj := a1 , . . . , aj−1 , aj+1 , . . . , al , Bj := aj , BT . Then, for some permutation matrix Pj , we have ⎡ Pj X
=⎣
⎤ Aj Bj
⎦,
Aj
∈ R(l−1)×n , Bj ∈ R(m−l+1)×n .
Not all choices of j will necessarily result in a matrix Aj that is full rank, but since det AjT Aj =
2 det A T A − aj ajT = 1 − qj det A T A and the columns of Q are orthonormal, it follows that
qr 2 1, r = 1, . . . , l and that there are at least l − n choices for j that result in a full rank matrix. For qj < 1, Aj is a full rank matrix and it follows that ⎡ Pj X
=⎣
⎤ Aj
⎦
Bj
⎡
=⎣
⎤ Qj Yj
1 ⎦ A T Aj 2 j
,
where
Qj
= Aj AjT Aj
− 1 2
We have 2 Y j F
= Trace
, Yj = Bj AjT Aj
− 1 2
.
− 1
− 1 2 2 AjT Aj BjT Bj AjT Aj
= Trace BjT Bj AjT Aj
−1
1848
F.R. de Hoog, R.M.M. Mattheij / Linear Algebra and its Applications 434 (2011) 1845–1850
= Trace = Trace = Y 2F Now let p
:= arg
BT B + aj ajT
AT A
− aj ajT
−1
−1 I − qj qjT 2
−1 2 + 1 − qj 22 Yqj + qj 2 .
+ qj qjT
YT Y
2
min
{j|qj <1 }
2 2 1 − q j 2 Y p F
Yj 2 . Then, F
2 1 − qj 22 Y 2F + Yqj + qj 22 . 2
Clearly, this last inequality holds also when qj 2 = 1. On summing the two sides of this inequality 2 2 overjand noting that lj=1 qj 2 = n and lj=1 Yqj = Y 2F , we obtain 2
(l − n) Yp 2 F
Thus, Pp X
⎡
⎤ Ap
=⎣
and 2 Y p F
=
(l − n + 1) (m − l) n (l − n + 1) Y 2F + n + n = (m − l + 1) n. l−n+1
⎦
Bp
⎡
=⎣
⎤ Qp Yp
1 ⎦ A T Ap 2 p
− 1 2 2 T Bp Ap Ap
F
Hence (6) also holds when k
, Ap ∈ R(l−1)×n , Bp ∈ R(m−l+1)×n
(m − l + 1) n (l − n)
= l − 1 and Theorem 1 now follows by induction.
Having established a bound for Y 2F it is straightforward to establish bounds for A + F . Corollary 1. There is a permutation matrix P such that (1) holds with + A
F
n (m − n + 1) + X . 2 k−n+1
Proof. See [3] for a proof of this and other results on bounds for the values of A.
singular We can also use the bound for Y 2F to establish a bound fordet A T A .
Corollary 2. There is a permutation matrix P such that
det A T A
k−n+1 m−n+1
n
det X T X .
(7)
Proof. From the arithmetic–geometric mean inequality we have n
1 det I + Y T Y 1 + Y 2F . n The result then follows by applying this inequality to (5) and then using Theorem 1.
A somewhat tighter bound can be obtained by analyzing a greedy algorithm where det A T A is maximized at each step.
F.R. de Hoog, R.M.M. Mattheij / Linear Algebra and its Applications 434 (2011) 1845–1850
Theorem 2. There is a permutation matrix P
T
det A A
⎛
⎝
m j−n
j
j=k+1
∈ Rm×m such that (1) holds with
⎞
⎠ det X T X .
(8)
Proof. We by induction and first show that Theorem 2 is true when k = m − 1. Let p proceed
T T arg max det X T X − xj xjT , A := x1 , . . . , xp−1 , xp+1 , . . . , xm and B := xp . Then, j
det A T A
= det XT X − xp xpT = =
m 1
m j=1
det X T X
m 1
m j=1 m−n m
det
ApT Ap
− xj xjT
−1
xj det X T X 1 − xjT X T X
det X T X .
j
= det A A − ap apT l j=1
det A T A
l 1
=
l j=1
= 1, . . . , l be defined as in the proof
− aj ajT
−1
aj det A T A 1 − ajT A T A
det A T A l ⎛ ⎞ m l−n j−n ⎝ ⎠ l j j=l+1 l−n
= =
T
l 1
m j−n j=l
.
j
Thus, ⎡ Pp X
=⎣
⎤ Ap Bp
⎦,
Ap
∈ R(l−1)×n , Bp ∈ R(m−l+1)×n
and
det
ApT Ap
⎛
⎝
m j−n j=l
:=
Thus the result is true when k = m − 1. Now, suppose that (6) holds when k = l > n. Let Aj and Bj , j
of Theorem 1 and let p := arg max det A T A − aj ajT . Then,
1849
j
Hence (8) also hold when k
⎞
⎠ det X T X .
= l − 1 and Theorem 2 now follows by induction.
1850
F.R. de Hoog, R.M.M. Mattheij / Linear Algebra and its Applications 434 (2011) 1845–1850
3. Discussion We now compare the bounds given in Corollary 2 and Theorem 2 which are the same for n We have m m +1 k+1−n m+1−n j−n j−n = log − log + log log j k+1 m+1 j j=k+1 j=k+2 m + 1 k+1−n m+1−n x−n log − log + log dx k+1 m+1 x k+1 n n k+1−n = n log + m log 1 − − k log 1 − . m+1−n m+1 k+1 Thus, for n
2
m
log
= 1.
j−n j
j=k+1
− n log
k+1−n m+1−n
m log 1 −
n m+1
− k log 1 −
n
k+1
0.
This demonstrates that the bound (8) given in Theorem 2 is superior to the bound given by (7) in Corollary 2. This difference can be substantial when kis relatively small. For example, if mis large relative to n and k = n, then m
j−n
!
m+1−n
j
j=k+1
k+1−n
n
1−
n m+1
m "
1−
n k+1
k
≈
n+1 e
n
.
It should also be noted that m j−n j=k+1
j
k! (m − k)! m!
,
and hence the bound (8) in Theorem 2 is also superior to the value of det A T A averaged over all permutations P. Acknowledgements The authors are grateful to an anonymous referee whose detailed comments have resulted in a substantially improved presentation of the results. References [1] C. Boutsidis, M.W. Mahoney, P. Drineas, An improved approximation algorithm for the column subset selection problem, in: Proceedings of the Nineteenth Annual ACM – SIAM Symposium on Discrete Algorithms, New York, New York, January 04–06, 2009, Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, pp. 968–977. [2] P.A. Businger, G.H. Golub, Linear least squares solution by Householder transformations, Numer. Math. 7 (1965) 269–276. [3] F.R. de Hoog, R.M.M. Mattheij, Subset selection for matrices, Linear Algebra Appl. 422 (2007) 349–359. [4] G.H. Golub, C.F. van Loan, Matrix Computations, John Hopkins University Press, Baltimore, 1983. [5] C.L. Lawson, R.J. Hanson, Solving Least Squares Problems, Prentice Hall, Englewood Cliffs, 1974. [6] S.D. Silvey, Optimal Design, Chapman and Hall, London, 1980.