The SVD and reduced rank signal processing

Louis L. Scharf

Department of Electrical and Computer Engineering, University of Colorado, Boulder, CO 80309-0425, USA

Received 13 June 1990
Revised 18 June 1991

Signal Processing 25 (1991) 113-133, Elsevier

Abstract. The basic ideas of reduced-rank signal processing are evident in the original work of Shannon, Bienvenu, Schmidt, and Tufts and Kumaresan. In this paper we extend these ideas to a number of fundamental problems in signal processing by showing that rank reduction may be applied whenever a little distortion may be exchanged for a lot of variance. We derive a number of quantitative rules for reducing the rank of signal models that are used in signal processing algorithms.


Keywords. ARMA, analysis and synthesis, bandlimited, bias and variance, block quantizer, complex exponential, detector, distortion, eigenvalues and eigenvectors, Grammian, least squares, linear constraints, linear models, low-rank, matrix approximation, order selection, orthogonal decomposition, orthogonal subspace, parsimony, projection, pseudo-inverse, quadratic minimization, rank reduction, rate-distortion, signal processing, signal subspace, singular values and singular vectors, subspace splitting, SVD, Wiener filter.

This paper is dedicated to Professor Dean W. Lytle in recognition of his thirty years of teaching at the University of Washington in Seattle.

1. Introduction

In Volume X of his OEuvres, under Rule V for Direction of the Mind, René Descartes said "Method consists entirely in properly ordering and arranging the things to which we should pay attention". With a touch of license we can claim this rule for statistical signal processing, wherein singular values (or eigenvalues) are properly ordered to determine the singular vectors to which we should pay attention. These singular vectors then assume the role of 'modes' that are used to construct reduced-rank linear transformations for signal processing.

In this review of the singular value decomposition (SVD) and its application to reduced-rank signal processing, we begin with a number of philosophical comments about modeling, in general, and the importance of the distortion-variance trade in reduced-rank signal modeling, in particular. We introduce a class of linear models and illustrate their richness with a number of familiar examples from the theory of statistical signal processing. We


establish that the SVD is the natural linear algebraic tool for determining the dominant structure of a linear model and for constructing low-rank signal processing algorithms. From here on we proceed by example to illustrate the power of the SVD (or, in some cases, the EVD) for solving problems in reduced-rank signal processing. For every example we derive a full-rank solution, based on a principle of optimality, and then show that rank can be reduced to produce a solution which is actually superior in many ways to the original solution. For example, a minimum variance unbiased (MVUB) estimator, whose mean-squared error equals its variance, can often be replaced with a reduced-rank, biased estimator whose mean-squared error is smaller than that of the MVUB estimator. All that is sacrificed is a little bias... in exchange for a lot of variance saved. In fact, this is the key idea: a little distortion can often be introduced into a solution in exchange for a big savings in variance. The exchange pays off whenever the sum of distortion plus variance is smaller than the variance of the undistorted solution. We shall find that each problem brings its own natural definitions of distortion and variance and, consequently, its own rules for selecting the order of the best reduced-rank solution.

2. General comments on modeling

The traditional approach to modeling has been to write down a mathematical model that perfectly represents the signal or system under study. Typically, the model is a linear one for which a difference equation, impulse response, transfer function, covariance sequence, or power spectrum is given. As a practical matter, the exact order of the model may be unknown and some of the parameters of the model may be unknown. If the model itself is to be used in a signal processing algorithm, then the order and the unknown parameters must be estimated (or identified). There is a generally-held view that economical models (or parsimonious models, as they are called) are better than luxury models, because luxury models often produce spurious effects. These effects are artifacts of the procedure used to (imperfectly) identify the model, rather than actual characteristics of the signal or system under study.

The principle of parsimony lies at the very heart of the scientific method, where we state and test hypotheses in terms of mathematical models that are just complicated enough to model what we can measure, but no more complicated. Such a fundamental principle should have a set of quantitative rules for its application. The AIC rules of Akaike [1] and the CAT rule of Parzen [12] are notable examples of rules for choosing the order of a parsimonious autoregressive (AR) model for a time series. But my favorite example of the principle at work is Shannon's rate distortion theory [19]. Shannon established the mathematical connection between the complexity of a source (its bit-rate) and its distortion (Shannon's distortion is actually what we would call distortion, or model bias, plus variance) and derived formulas for trading one against the other. Our aim in this paper is to extend the philosophies of Akaike, Parzen and Shannon to a number of basic problems in statistical signal processing.

With these remarks as preamble, I would like to suggest that the principle of parsimony applied to signal and system modeling says that a model should be complicated enough to reproduce the most important properties of a signal or system but simple enough to resist the spurious effects that are associated with the use of the model in a signal processing application. Sometimes these spurious effects can be directly associated with coefficient quantization, parameter uncertainty or model order mismatch, but usually they show up in a measure of performance that is intimately tied to the problem under study.

When we set about to apply the principle of parsimony to signal modeling, we find that rank reduction is the appropriate procedure for reducing the complexity of a model. As we proceed from the most complicated model to the simplest model we find that complicated models are distortion-free


and variance-limited, whereas simple models are distortion-limited and variance-free. The trick is to find just the right tradeoff between simplicity and complexity to produce the best compromise between distortion and variance. This requires that we determine the appropriate definitions for distortion and variance and derive order selection rules for trading one against the other.

3. Linear models

Throughout this paper we will be concerned with linear models for signals. These models take the form

x = Hθ,    (1)

where x = [x₀ x₁ ... x_{N−1}]ᵀ is a sequence of signal samples, H is a model matrix and θ = [θ₁ θ₂ ... θ_p]ᵀ is a vector of coefficients. There are two ways to write out the terms in this linear model. First, if we denote the matrix H by its columns,

H = [h₁ h₂ ... h_p],    (2)

then the signal x may be written out as a linear combination of modes:

x = Σ_{n=1}^{p} hₙθₙ.    (3)

That is, we think of the columns of the model matrix H as modes and the entries in θ as mode weights. Second, if we denote the matrix H by its rows,

H = [ c₀ᵀ ; c₁ᵀ ; ⋮ ; c_{N−1}ᵀ ],    (4)

then the nth measurement xₙ is the correlation between the nth row cₙᵀ and the parameter vector θ:

xₙ = cₙᵀθ.    (5)

In the linear model we say that the model is (i) overdetermined if N > p (measurements exceed parameters), (ii) determined if N = p, and (iii) underdetermined if N < p (parameters exceed measurements). The overdetermined case describes filtering, enhancement, deconvolution and identification problems. The underdetermined case describes inverse and extrapolation problems.

If the coefficients θ are unknown but not drawn from a probability distribution, then the model for x is said to be deterministic. If the coefficients are drawn from a multivariate distribution, then the signal x inherits a multivariate distribution and we say that the signal model is stochastic. For example, if the coefficients θ are drawn from a normal distribution with mean zero and covariance matrix R_θθ, which we denote θ : N[0, R_θθ], then the signal x : N[0, HR_θθHᵀ] is drawn from a normal distribution with mean zero and covariance matrix HR_θθHᵀ. The nth measurement xₙ is distributed as xₙ : N[0, cₙᵀR_θθcₙ].

A linear model brings an algebraic characterization for the signal, as well. In this algebraic characterization we say that the columns, or modes, of H = [h₁ h₂ ... h_p] span a signal subspace ⟨H⟩. This signal subspace contains all measurements that can be constructed from the modes of H, nothing more and nothing less. We assume that the modes of H are linearly independent. Then the dimension of the signal subspace is p, meaning that every vector that lies within it is a linear combination of just p linearly independent vectors. To say that an N-vector lies within a subspace of dimension p < N is to say that the signal is constrained by the modes that construct it to have fairly predictable characteristics.

From the linearly independent modes that span the signal subspace ⟨H⟩, we can construct N − p modes that are orthogonal to them. These modes, which we organize into the matrix A = [a₁ a₂ ... a_{N−p}], span an orthogonal subspace ⟨A⟩ of dimension N − p. Any vector constructed from the modes of A lies inside the subspace ⟨A⟩ and outside the signal subspace ⟨H⟩. Taken together, the subspaces ⟨H⟩ and ⟨A⟩ span Euclidean N-space ℝᴺ, meaning that any vector that lies


in ℝᴺ may be decomposed into a part that lies in the signal subspace ⟨H⟩ and a part that lies in the orthogonal subspace ⟨A⟩. We call ⟨S⟩ = ⟨H⟩ ⊕ ⟨A⟩ a direct sum of subspaces that spans ℝᴺ. The orthogonal subspace ⟨A⟩ is usually called the noise subspace to distinguish it from the signal subspace. But, really, this is a misnomer because noise typically fills up all of Euclidean space. This algebraic language may be given the geometric picture of Fig. 1. In the figure, the subspaces ⟨H⟩ and ⟨A⟩ are orthogonal, but the modes within each individual subspace are only linearly independent and not necessarily orthogonal. The vector y in the figure is a linear combination of a signal x that lies in the subspace ⟨H⟩ and a noise vector n that has no respect for subspaces. The trick is to identify the subspace ⟨H⟩, determine how much of the measurement lies within it, and use this information in a signal processing procedure.

Fig. 1. Signal and orthogonal subspaces.
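The decomposition just described can be made concrete with a few lines of numpy. The sketch below is illustrative only: the dimensions N = 8, p = 3, the random mode matrix and the noise level are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 8, 3

H = rng.standard_normal((N, p))            # modes spanning the signal subspace <H>
P_H = H @ np.linalg.solve(H.T @ H, H.T)    # projection onto <H>: H (H^T H)^{-1} H^T
P_A = np.eye(N) - P_H                      # projection onto the orthogonal subspace <A>

theta = rng.standard_normal(p)
x = H @ theta                              # signal lies in <H>
n = 0.3 * rng.standard_normal(N)           # noise "has no respect for subspaces"
y = x + n

y_H = P_H @ y                              # part of the measurement in <H>
y_A = P_A @ y                              # part in <A>
print(np.allclose(y, y_H + y_A))           # True: <H> (+) <A> spans R^N
print(np.linalg.norm(P_A @ x))             # ~0: the signal has no component in <A>
```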

3.1. Polynomial models

Suppose a signal x_t is a polynomial of the form

x_t = θ₁ + θ₂t + ⋯ + θ_p t^{p−1}.    (6)

The samples of the signal may be organized into the vector x = [x₀ x₁ ... x_{N−1}]ᵀ:

x = [ x₀ ; x₁ ; ⋮ ; x_{N−1} ]
  = [ 0⁰  0¹  ⋯  0^{p−1} ;
      1⁰  1¹  ⋯  1^{p−1} ;
      2⁰  2¹  ⋯  2^{p−1} ;
      ⋮ ;
      (N−1)⁰  (N−1)¹  ⋯  (N−1)^{p−1} ] [ θ₁ ; θ₂ ; ⋮ ; θ_p ] = Hθ.    (7)

Any polynomial signal lies in the subspace spanned by the modes, or columns, of H.

3.2. Complex exponentials

Let x_t be a signal that is composed of complex exponential modes:

x_t = Σ_{n=1}^{p} B_n r_nᵗ cos(ω_n t + φ_n).    (8)

The samples of x_t may be organized into the linear model

x = [ x₀ ; x₁ ; ⋮ ; x_{N−1} ] = Vθ,    (9)

where V and θ are the following matrix and vector:

V = [ 1  ⋯  1  0  ⋯  0 ;
      r₁ cos ω₁  ⋯  r_p cos ω_p  r₁ sin ω₁  ⋯  r_p sin ω_p ;
      ⋮ ;
      r₁^{N−1} cos(N−1)ω₁  ⋯  r_p^{N−1} cos(N−1)ω_p  r₁^{N−1} sin(N−1)ω₁  ⋯  r_p^{N−1} sin(N−1)ω_p ],    (10)

θ = [ B₁ cos φ₁ ; ⋮ ; B_p cos φ_p ; −B₁ sin φ₁ ; ⋮ ; −B_p sin φ_p ].

In this linear model, the columns of V are the damped sine and cosine modes we associate with the study of rational dynamical systems. This


brings us to the alternative characterization of the next example.
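Both model matrices are easy to construct numerically. The following sketch builds the polynomial matrix H of (7) and the damped sine/cosine matrix V of (10); the sizes, damping factors, frequencies and weights are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

N = 16
t = np.arange(N)

# Polynomial model (6)-(7): H[t, n] = t^n, n = 0, ..., p-1 (a Vandermonde-type matrix).
p_poly = 3
H = np.vander(t, p_poly, increasing=True)

# Damped-exponential model (8)-(10): columns r^t cos(w t) and r^t sin(w t).
r = np.array([0.95, 0.90])
w = np.array([0.3, 1.1])
V = np.hstack([(r ** t[:, None]) * np.cos(np.outer(t, w)),
               (r ** t[:, None]) * np.sin(np.outer(t, w))])

theta = np.array([2.0, -0.5, 1.0, 0.3])    # weights [B cos(phi); -B sin(phi)], illustrative
x = V @ theta                              # a signal built from damped sine/cosine modes
print(H.shape, V.shape, x.shape)           # (16, 3) (16, 4) (16,)
```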

3.3. ARMA impulse responses

Let H(z) denote the transfer function of an autoregressive moving average (ARMA) system with p − 1 zeros and p poles:

H(z) = B(z)/A(z) = (Σ_{n=0}^{p−1} b_n z⁻ⁿ) / (Σ_{n=0}^{p} a_n z⁻ⁿ),   a₀ = 1.    (11)

The impulse response of this ARMA system obeys the recursion

A(z){x_t} = B(z){δ_t},    (12)

which may be written out more explicitly as

Σ_{n=0}^{t} a_n x_{t−n} = { 0, t < 0;  b_t, 0 ≤ t < p;  0, t ≥ p }.    (13)

When written out for t = 0, 1, ..., N − 1, these recursions produce the matrix equation

[ 1 ;
  a₁  1 ;
  ⋮       ⋱ ;
  a_p  ⋯  a₁  1 ;
       ⋱       ⋱ ;
  0  a_p  ⋯  a₁  1 ] [ x₀ ; x₁ ; ⋮ ; x_{N−1} ] = [ b₀ ; b₁ ; ⋮ ; b_{p−1} ; 0 ; ⋮ ; 0 ],

that is,

G⁻¹x = [ b ; 0 ].    (14)

We call this the analysis form of the linear model, because the analysis matrix G⁻¹ analyzes the vector x to produce the moving average coefficients b, followed by a tail of zeros. In fact, it is the southern part of G⁻¹, which we denote by the (N − p) × N prediction error matrix Aᵀ, that annihilates x:

Aᵀx = 0,   Aᵀ = [ a_p  ⋯  a₁  1  0  ⋯  0 ;  ⋱  ⋱ ;  0  ⋯  0  a_p  ⋯  a₁  1 ].    (15)

The matrix A is determined exclusively by the autoregression coefficients for the filter H(z) = B(z)/A(z). The columns of A are orthogonal to every x that can be produced by the difference equation A(z){x_t} = B(z){δ_t}, independently of how the moving average coefficients are chosen. Therefore the columns of A form a span for the orthogonal subspace ⟨A⟩.

The synthesis form of the linear model is

x = Hb,    (16)

where H consists of the first p columns of G:

G = [ g₀ ;
      g₁  g₀ ;
      ⋮        ⋱ ;
      g_{N−1}  ⋯  g₁  g₀ ] = [H | *].    (17)

The g_t are simply the impulse response of the autoregressive filter 1/A(z):

Σ_{n=0}^{t} a_n g_{t−n} = { 0, t < 0;  1, t = 0;  0, t > 0 }.    (18)

The columns of H span the signal subspace ⟨H⟩ and, of course, H and A are orthogonal:

AᵀH = 0.    (19)
Examples 3.2 and 3.3 are, in fact, equivalent. Therefore A T v = o , meaning that the prediction error matrix A T determines the mode matrix V and vice versa. This fact underlies Prony's method and all modern variations on it [8]. There is one other way to characterize the linear model for an A R M A impulse response. In this characterization, the homogeneous part of the analysis model is rewritten, with the signal values organized into a data matrix and the autoregressive Vol. 25, No. 2, November 1991

L.L. Scharf / Reduced rank signal processing

118

parameters organized into a parameter vector:

Xp+ 1

• • •

Xl

Xa

p

"'"

XN_

l

LIIJ

3•4• Bandlimited signals

1, S(eJ°)= 0,

-iln-~< 0~.
(21)

The covariance sequence for x, is the sinc sequence r, = ( i l / n ) sinc(ilnt). The covariance matrix for the signal vector x = [x0 x, . . . XN- d T is the Toeplitz matrix

I ro r rl

- 1

F1

" " •

.

,o

FN-- 1 1

...

" • •

(22) ro

This covariance matrix has the eigenvalue decomposition (EVD) T,

X2=diag[o'~Z cr~ . . . o'~¢], (23)

where the columns of the orthogonal matrix U are the Slepian sequences. The number of dominant eigenvalues in the diagonal matrix X: equals Signal Processing

(25)

with I(r) a rank r identity and r = [Nil]. The meansquared error in this reduced-rank approximation is 0.2 = tr E [ x - x,][x - xr] v = t r [ I - l(r)]X2[I - I(r)] N

=

~

cr2.

(26)

n=r+l

This is a small number when the rank r is r = Nil.
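A short numerical sketch of (21)-(26) follows. The values of N and Ω, and the particular sinc convention used for the covariance, are illustrative assumptions; the point is only that the eigenvalue spectrum of the bandlimited covariance splits at roughly NΩ/π.

```python
import numpy as np

N, Omega = 64, 0.3 * np.pi                         # band |theta| <= Omega (illustrative)
t = np.arange(N)
r = (Omega / np.pi) * np.sinc(Omega * t / np.pi)   # r_t = sin(Omega t)/(pi t); r_0 = Omega/pi
R = r[np.abs(t[:, None] - t[None, :])]             # Toeplitz covariance matrix (22)

sig2, U = np.linalg.eigh(R)                        # EVD (23); eigh returns ascending eigenvalues
sig2, U = sig2[::-1], U[:, ::-1]                   # reorder so sigma_1^2 >= ... >= sigma_N^2

r_rank = int(round(N * Omega / np.pi))             # ~ time-bandwidth product
mse = sig2[r_rank:].sum()                          # mean-squared error (26) of the rank-r model (25)
print(r_rank, mse, sig2[:r_rank].sum() / sig2.sum())
```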

4. Why the SVD?

For every problem we treat in this paper there exists an implicit or explicit linear model for the signal component of the measurement. For deterministic signal models the mean of the measurement obeys a linear model, and for stochastic signal models the covariance obeys a linear model of sorts that it inherits from the linear model for the signal. The resulting algorithms for drawing inference about the signal are linear transformations on the measurements or quadratic forms in the measurements. Why, therefore, should the SVD play such an important role in the theory of inference on linear models? There are three reasons. (i) The SVD is the appropriate linear algebraic device for approximating a measurement matrix by a low-rank matrix; when the measurement matrix is a noisy version of a signal matrix and the signal matrix is constructed from a linear signal model, then the signal matrix is low-rank and we expect the reduced-rank measurement matrix to approximate the signal matrix better than the noisy measurement matrix. (ii) The SVD is the natural way


to split a space into dominant and subdominant subspaces; if we view a measurement matrix as a span for a subspace, then the SVD decomposes the span into an equivalent orthogonal span from which the dominant and subdominant subspaces can be identified. (iii) The SVD of a model matrix may be used to orthogonally decompose the pseudo-inverse, the Grammian, and the projection operator that are constructed from the model matrix; these decompositions uncouple the modes of a solution so that the relative effects of each orthogonal mode may be studied without interference from other modes. Let us elaborate.

4.1. Matrix approximation

In our study of ARMA impulse responses in Section 3.3, we noted that the signal vector x and the signal matrix X obey the homogeneous equations

Aᵀx = 0,   Xa = 0.    (27)

The second form of these equations shows that the signal matrix X is rank deficient. However, if the signal matrix X is replaced by the noisy measurement matrix Y = X + N, the resulting matrix will be full-rank with probability one for any reasonable probability distribution on N. The SVD of Y, namely Y = UΣVᵀ, produces a sequence of approximations to Y of the form Y(r) = UΣ(r)Vᵀ, where Σ(r) is the rank r version of Σ obtained by setting the trailing p − r singular values to 0. The rth approximation is the rank r matrix that best approximates Y in the sense of least squares. This argument is not as convincing as it looks at first blush, because it ignores the fact that the pattern of Y (which is the same as the pattern of X) is destroyed by the approximation Y(r). So, really, the low-rank approximation of a matrix by its reduced-rank SVD is quite primitive. Nonetheless, this basic idea, applied to an overfitted version of (20), brings enormous improvements to linear prediction techniques when applied to model identification [8]. A more sophisticated approximation would be one that replaced the measurement matrix by a low-rank matrix that shared the pattern of the original matrix Y. Typically such approximations are obtained by successive approximations. For example, SVDing, averaging along diagonals, SVDing, and so on produces a low-rank Toeplitz approximation of YYᵀ. This approximation should equal the least squares fit of a complex exponential model, with non-negative coefficients, to a correlation sequence, because every rank-deficient Toeplitz matrix has a spectral representation in which the spectrum is a line spectrum. As far as I know, there is no proof that the sequence of SVD-Toeplitz approximations converges to a complex exponential model for the correlation sequence. I doubt whether it does.
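The two ideas in this paragraph are easy to prototype. The sketch below shows the plain rank-r SVD truncation Y(r) = UΣ(r)Vᵀ and a single pass of the "SVD, then average along diagonals" idea for restoring Toeplitz structure; the matrix sizes, rank and noise level are illustrative assumptions, and no convergence claim is made.

```python
import numpy as np

def rank_r(Y, r):
    """Best rank-r least squares approximation of Y via the SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def average_diagonals(Y):
    """Restore Toeplitz structure by averaging along each diagonal of Y."""
    m, n = Y.shape
    Z = np.empty_like(Y)
    for k in range(-(m - 1), n):
        mask = np.eye(m, n, k=k, dtype=bool)
        Z[mask] = Y[mask].mean()
    return Z

rng = np.random.default_rng(1)
m, n, r = 20, 10, 3
X = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # a rank-r "signal" matrix
Y = X + 0.1 * rng.standard_normal((m, n))                       # noisy, full rank w.p. 1

Yr = rank_r(Y, r)
print(np.linalg.matrix_rank(Yr), np.linalg.norm(Yr - X) < np.linalg.norm(Y - X))

# One SVD / diagonal-averaging pass; iterating the two steps is the successive
# approximation described in the text (its convergence is an open question there).
Yt = average_diagonals(rank_r(Y, r))
```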

4.2. Subspace splitting

The SVD of a measurement matrix Y produces a dominant subspace ⟨U(r)⟩ and a subdominant subspace ⟨U(p − r)⟩, where ⟨U⟩ denotes the subspace spanned by the columns of the matrix U. That is, in the decomposition

Y = UΣVᵀ = [U(r) | U(p − r)] [ Σ(r)  0 ; 0  Σ(p − r) ] [ Vᵀ(r) ; Vᵀ(p − r) ],    (28)

the subspaces ⟨U(r)⟩ and ⟨U(p − r)⟩ may be used in a direct sum to produce the subspace ⟨Y⟩:

⟨Y⟩ = ⟨U(r)⟩ ⊕ ⟨U(p − r)⟩.    (29)

This decomposition of subspaces, illustrated in Fig. 2, is equivalent to the least squares approximation of Y by a low-rank matrix Y(r). In fact, by approximating Y with the SVD Y(r), we automatically split the subspace ⟨Y⟩ into two orthogonal subspaces. The dominant one, namely ⟨Y(r)⟩, equals ⟨U(r)⟩, and the subdominant one, namely the complement of ⟨Y(r)⟩ in ⟨Y⟩, equals ⟨U(p − r)⟩. The orthogonality of the dominant and subdominant subspaces is a consequence of the orthogonality principle for least squares solutions.

Fig. 2. Orthogonal decomposition of the measurement space ⟨Y⟩ into dominant and subdominant spaces ⟨U(r)⟩ and ⟨U(p − r)⟩.

Similarly, the subspaces ⟨V(r)⟩ and ⟨V(p − r)⟩ span ℝᵖ and decompose the Grammian of Y into its dominant and subdominant modes:

YᵀY = VΣ²Vᵀ = [V(r) | V(p − r)] [ Σ²(r)  0 ; 0  Σ²(p − r) ] [ Vᵀ(r) ; Vᵀ(p − r) ].    (30)

These orthogonal decompositions are of inestimable value in the study of least squares solutions and their low-rank approximations.

4.3. Orthogonal decomposition of pseudo-inverses, Grammians and projections

In our study of least squares problems we encounter three matrices that trace their heritage to the model matrix H in the linear model x = Hθ: the pseudo-inverse H# = (HᵀH)⁻¹Hᵀ, the Grammian G = HᵀH, and the projection P(H) = H(HᵀH)⁻¹Hᵀ. When H is replaced by its SVD H = UΣVᵀ, the matrices decompose as follows:

H# = (HᵀH)⁻¹Hᵀ = VΣ⁻¹Uᵀ,
G = HᵀH = VΣ²Vᵀ,    (31)
P(H) = H(HᵀH)⁻¹Hᵀ = UUᵀ.

We say that the SVD is a doubly orthogonalizing decomposition that orthogonalizes the inner product HᵀH and the outer product HHᵀ. As a consequence, it orthogonalizes the matrix H, the Grammian of H, and the projection onto ⟨H⟩.

5. Identification of low-rank subspaces

Perhaps the most fundamental problem in statistical signal processing is the subspace identification problem. Suppose Y is a measurement matrix, each of whose columns consists of a signal and noise:

Y = [y₁ y₂ ... y_N],   y_t = x_t + n_t.    (32)

The vector x_t is known to lie in a rank p subspace, but the subspace is unknown. That is,

x_t = Hθ_t.    (33)

The problem is to identify the matrix H or, equivalently, the signal subspace ⟨H⟩. The squared error between Y and the matrix HΘ may be written as

e² = tr[Y − HΘ][Y − HΘ]ᵀ,    (34)

where Θ is the coefficient matrix Θ = [θ₁ θ₂ ... θ_N]. The least squares solution for Θ is

Θ = (HᵀH)⁻¹HᵀY,    (35)

and the corresponding value of e² is

e² = tr[I − P(H)]YYᵀ[I − P(H)].    (36)

5.1. Unstructured subspace

When the subspace H is not required to have any particular structure, then the fitting error e² is minimized by P(H) = UI(p)Uᵀ, where I(p) is a rank p identity matrix and U is the orthogonal matrix in the EVD of YYᵀ = UΣ²Uᵀ. Equivalently, it is minimized by choosing H = UΣ(p)Vᵀ and Θ = VΣ⁻¹(p)UᵀY, where Y = UΣVᵀ is the SVD of Y and Σ(p) is the reduced-rank version of Σ obtained by zeroing the trailing N − p singular values. So, the least squares approximation of the rank p subspace that best describes ⟨Y⟩ is the subspace ⟨U(p)⟩, where U(p) is the dominant subspace in the EVD YYᵀ = UΣ²Uᵀ or in the SVD Y = UΣVᵀ. The resulting squared approximation error is

e² = tr[I − UI(p)Uᵀ]UΣ²Uᵀ[I − UI(p)Uᵀ] = Σ_{n=p+1}^{N} σ_n².    (37)

This solution decomposes the measurement matrix Y into two orthogonal components UΣ(p)Vᵀ and UΣ(N − p)Vᵀ, and also decomposes the sample correlation matrix YYᵀ into the orthogonal components UΣ²(p)Uᵀ and UΣ²(N − p)Uᵀ. This is also the maximum likelihood solution when each of the measurements y_t is independent of the others and is distributed as an N[Hθ_t, σ²I] random vector. The result extends to the identification of rank p subspaces when the distribution of y_t is N[Hθ_t + βs, σ²I] and βs is a rank one impulsive interference of unknown amplitude β. Then the problem is to minimize

e² = tr[I − P(H)][I − P(s)]YYᵀ[I − P(s)][I − P(H)],    (38)

where P(s) is the rank one projector P(s) = s(sᵀs)⁻¹sᵀ. The solution is to choose H to be the left singular vectors in the SVD of [I − P(s)]Y. If s itself is known only to lie in a finite set S, then the SVD of [I − P(s)]Y is computed for each s contained in S. The choice of s that minimizes the sum of trailing squared singular values in (38) is the best choice of s, and the left singular vectors of the SVD for [I − P(s)]Y determine the best rank p subspace. This and related problems are treated in more detail in [3].
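A compact numerical sketch of the unstructured solution follows: the dominant left singular vectors of Y estimate ⟨H⟩, with and without a known rank-one interference direction projected out. The dimensions, rank and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, p = 12, 40, 2

H_true = rng.standard_normal((N, p))
Y = H_true @ rng.standard_normal((p, M)) + 0.1 * rng.standard_normal((N, M))

# Unstructured solution: dominant left singular vectors of Y span the estimate of <H>.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
U_p = U[:, :p]                                  # estimated signal subspace
e2 = (s[p:] ** 2).sum()                         # squared fitting error (37)

# With a known interference direction s_vec, work with [I - P(s)] Y instead (38).
s_vec = rng.standard_normal(N)
P_s = np.outer(s_vec, s_vec) / (s_vec @ s_vec)
U2, s2, _ = np.linalg.svd((np.eye(N) - P_s) @ Y, full_matrices=False)
U_p_interf = U2[:, :p]
print(e2, (s2[p:] ** 2).sum())
```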

5.2. Structured subspace

The solution of the previous paragraphs proceeds in the absence of any prior information about the structure of the rank p subspace that is to be identified. What if we know that the subspace has a special structure, such as the structure that is associated with autoregressive moving average impulse responses or, equivalently, sums of damped, complex exponential modes? Then the SVD is inappropriate, for it ignores the side information that the subspace has a very special structure. The problem becomes one of identifying the parameters in a parametric model for the subspace. Let us illustrate with the prototypical example: identification of an ARMA subspace.

In our study of ARMA impulse responses, we found that the vector of impulse responses x obeys the analysis and synthesis equations

G⁻¹x = [ b ; 0 ],   x = Hb,    (39)

where

G⁻¹ = [ 1 ;  a₁  1 ;  ⋮  ⋱ ;  a_p  ⋯  a₁  1 ;  ⋱  ⋱ ;  0  a_p  ⋯  a₁  1 ] = [ * ; Aᵀ ],
G   = [ g₀ ;  g₁  g₀ ;  ⋮  ⋱ ;  g_{N−1}  ⋯  g₁  g₀ ] = [H | *].

The problem of identifying the ARMA model is one of observing y = Hb + n and minimizing the squared error

e² = tr[y − Hb][y − Hb]ᵀ    (40)

with respect to a = [a_p ... a₁ 1]ᵀ and b = [b₀ ... b_{p−1}]ᵀ. The minimization with respect to b produces the solution

b = (HᵀH)⁻¹Hᵀy,   Hb = P(H)y.    (41)

The resulting mean-squared error is

e² = tr[I − P(H)]yyᵀ[I − P(H)] = yᵀ[I − P(H)]y.    (42)

But the subspaces ⟨H⟩ and ⟨A⟩ are orthogonal decompositions of ℝᴺ, so P(H) + P(A) = I. This means that we may rewrite the problem of identifying a as one of minimizing

e² = yᵀP(A)y,    (43)

where P(A) is the projector onto ⟨A⟩:

P(A) = A(AᵀA)⁻¹Aᵀ,   Aᵀ = [ a_p  ⋯  a₁  1  0  ⋯  0 ;  ⋱  ⋱ ;  0  ⋯  0  a_p  ⋯  a₁  1 ].    (44)

This problem has been studied in [5, 6, 9], where iterative algorithms are proposed for its solution. Once A is determined, H is determined from the AR impulse response of 1/A(z). Then the solution for b in (41) is known as Shanks' method [20]. Note that, when the projector P(A) is approximated by AAᵀ, the problem is

min yᵀAAᵀy = min aᵀYᵀYa,    (45)

which is the covariance method of linear prediction. There remains work to be done on the minimization of e² = yᵀP(A)y. In particular, one would like to solve this problem for a sequence of increased-rank approximations to P(A), corresponding to reduced-rank approximations to P(H) and reduced-order approximations to A(z).
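The covariance-method approximation in (45) is simple to prototype: build a data matrix from the observations and minimize ‖Ya‖² with the trailing entry of a pinned to 1. The sketch below does this for an AR(2)-type impulse response; the pole locations, data length and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 60, 2
a_true = np.array([0.81, -1.6, 1.0])              # [a_2, a_1, 1], a stable AR(2) with pole radius 0.9

# Synthesize the impulse-response-like signal 1/A(z) plus a little noise.
x = np.zeros(N); x[0] = 1.0
for t in range(1, N):
    x[t] = 1.6 * x[t - 1] - 0.81 * x[t - 2] if t >= 2 else 1.6 * x[t - 1]
y = x + 0.01 * rng.standard_normal(N)

# Data matrix: row t is [y_{t-p}, ..., y_{t-1}, y_t], t = p, ..., N-1, as in (20).
Y = np.column_stack([y[k:N - p + k] for k in range(p + 1)])

# Minimize ||Y a||^2 with the last entry of a pinned to 1 (covariance method).
a_head, *_ = np.linalg.lstsq(Y[:, :p], -Y[:, p], rcond=None)
a_hat = np.append(a_head, 1.0)
print(a_hat, a_true)                              # a_hat should be close to a_true
```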

6. Reducing the rank of least squares solutions

Rank reduction is a principle for reducing the order of a prior linear model in the theory of linear least squares. Reducing order introduces model distortion, meaning noise-free data cannot be perfectly modeled, but it saves mightily on estimator variance when noise is present. In this section we review the theory of rank reduction for filtering in overdetermined linear models. More details may be found in [14, Chapter 9; 14, 15], where the theory is extended to filtering and inversion in overdetermined and underdetermined problems.

We assume that observations of the form

y = x + n,   x = Hθ,   n : N[0, σ²I],   H ∈ ℝ^{N×p}    (46)

are available to the experimenter. The problem, generally, is to invert for θ and filter for x. If N < p, the problem is underdetermined and solutions for θ and x are the minimum norm solutions

θ̂ = Hᵀ(HHᵀ)⁻¹y = C(CᵀC)⁻¹y,   Hᵀ = C,    (47)
x̂ = Hθ̂ = y.

If N > p, the problem is overdetermined, and the least squares solutions are

θ̂ = (HᵀH)⁻¹Hᵀy,   x̂ = Hθ̂ = P(H)y.    (48)

In the overdetermined problem, the distribution of x̂ is normal,

x̂ : N[x, σ²P(H)],    (49)

meaning x̂ is an unbiased estimator of x = Hθ, with error covariance matrix E(x̂ − x)(x̂ − x)ᵀ = σ²P(H), and mean-squared error e²(p) = tr σ²P(H) = σ²p. (The trace of a rank p projector is p.)

6.1. Reduced-rank estimator

If the rank p projector P(H) is replaced by the rank r approximating projector P(r), then the resulting estimator x̂(r) is

x̂(r) = P(r)y : N[P(r)x, σ²P(r)].    (50)

The mean-squared error of the reduced-rank estimator x̂(r) is the sum of distortion plus variance:

e²(r) = E(x̂(r) − x)ᵀ(x̂(r) − x) = xᵀ[P(H) − P(r)]x + σ²r.    (51)

This compares with the mean-squared error e²(p) = σ²p for the full-rank estimator. So, rank reduction will reduce variance whenever e²(r) < σ²p. However, as x is unknown, the model bias [I − P(r)]x is unknown and the distortion xᵀ[P(H) − P(r)]x is unknown. This means that we must have a way to estimate mean-squared error in order to estimate when and how rank may be reduced.

6.2. Order selection

The mean-squared error of the reduced-rank estimator x̂(r) may be written out as

e²(r) = Σ_{i=r+1}^{p} |u_{(i)}ᵀx|² + rσ²,    (52)

where the u_{(i)} are ordered eigenvectors in the orthogonal decomposition of P(H):

P(H) = Σ_{i=1}^{p} u_{(i)}u_{(i)}ᵀ.    (53)

The basic problem is to determine the rank r, and the ordering of eigenvectors u_{(i)}, that minimizes an estimate of e²(r). The value of rσ² is assumed known. It is the model bias b(r) = [P(H) − P(r)]x and its corresponding distortion xᵀ[P(H) − P(r)]x that are unknown. Two statistics come to mind as estimators of model bias b(r):

(i) ñ(r) = [I − P(r)]y : N[b(r), σ²(I − P(r))];
(ii) b̂(r) = [P(H) − P(r)]y : N[b(r), σ²(P(H) − P(r))].

The estimator ñ(r) is the fitting error between y and the low-rank estimator x̂(r), and the estimator b̂(r) is the difference between the full-rank estimator x̂ and the low-rank estimator x̂(r). These two estimators are illustrated in Fig. 3. The estimator b̂(r) is clearly superior, as it operates entirely in the subspace ⟨H⟩ where its covariance is σ²[P(H) − P(r)]. Conversely, the estimator ñ(r) operates in and out of the subspace, where its covariance is σ²[[I − P(H)] + [P(H) − P(r)]]. The component of ñ(r) outside the subspace is useless because it contains no information about x and it contributes extra covariance σ²[I − P(H)]. In fact, the statistic b̂(r) is just ñ(r) projected onto ⟨H⟩:

b̂(r) = P(H)ñ(r) = [P(H) − P(r)]y.    (54)

Fig. 3. Contrasting the two estimators of bias, b̂(r) and ñ(r).

The statistic b̂(r) is an unbiased minimum variance estimator of the model bias b(r). The mean-squared error of the estimator is

E(b̂(r) − b(r))ᵀ(b̂(r) − b(r)) = tr σ²[P(H) − P(r)] = σ²(p − r).    (55)

This result shows that the estimator b̂(r) is a low-variance estimator of b(r) when r is near p. The result also shows that the statistic b̂(r)ᵀb̂(r) is a biased estimate of the distortion b(r)ᵀb(r). That is,

E(b̂(r) − b(r))ᵀ(b̂(r) − b(r)) = E b̂(r)ᵀb̂(r) − b(r)ᵀb(r) − b(r)ᵀb(r) + b(r)ᵀb(r)
                              = E b̂(r)ᵀb̂(r) − b(r)ᵀb(r) = σ²(p − r)    (56)

or

E b̂(r)ᵀb̂(r) = b(r)ᵀb(r) + σ²(p − r).    (57)

This is a very significant result because it shows that the estimator b̂(r)ᵀb̂(r) must be corrected by −σ²(p − r) if it is to be an unbiased estimator of the distortion b(r)ᵀb(r). For our estimator of e²(r), we propose the unbiased estimator

ê²(r) = b̂(r)ᵀb̂(r) − σ²(p − r) + rσ² = yᵀ(P(H) − P(r))y + (2r − p)σ².    (58)

From our study of the statistic b̂(r)ᵀb̂(r), we know that ê²(r) is unbiased:

E ê²(r) = b(r)ᵀb(r) + rσ² = e²(r).    (59)

The order selection rule for rank reduction is r* = argmin_r ê²(r). The estimated mean-squared error ê²(r) is actually computed as follows:

ê²(r) = yᵀ(P(H) − P(r))y + (2r − p)σ² = Σ_{i=r+1}^{p} |u_{(i)}ᵀy|² + (2r − p)σ².    (60)

This form for the estimated mean-squared error shows that the eigenvectors of P(H) should be ordered according to the rule

|u_{(1)}ᵀy|² ≥ |u_{(2)}ᵀy|² ≥ ⋯ ≥ |u_{(p)}ᵀy|²,    (61)

and the dominant u_{(i)}, i = 1, 2, ..., r, used to construct the rank r projector P(r). Beginning with r = 0, the estimated mean-squared error ê²(r) is computed and plotted versus r, as in Fig. 4. The selected rank r* for the low-rank estimator x̂(r) is the value of r that minimizes ê²(r). The reduced-rank model for H is H(r) = UΣ(r)Vᵀ, where UΣVᵀ is the SVD of H and Σ(r) is the rank r version of Σ selected by the order selection rule. That is, Σ(r) = diag[σ_{(1)}, σ_{(2)}, ..., σ_{(r)}], where σ_{(i)} is the singular value corresponding to the u_{(i)} in (61).

Fig. 4. Typical plot of ê²(r) versus r for order selection.
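The order selection rule (58)-(61) is a few lines of numpy. In the sketch below the model sizes, the noise level and the coefficient pattern are illustrative assumptions; an orthonormal basis for ⟨H⟩ plays the role of the u_{(i)}.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, sigma = 50, 8, 1.0

H = rng.standard_normal((N, p))
U, _, _ = np.linalg.svd(H, full_matrices=False)       # orthonormal basis for <H> (p columns)

theta = np.zeros(p); theta[:3] = [5.0, 4.0, 3.0]      # only a few modes carry real energy
x = H @ theta
y = x + sigma * rng.standard_normal(N)

c2 = np.sort((U.T @ y) ** 2)[::-1]                    # |u_(i)^T y|^2 in decreasing order (61)
e2_hat = np.array([c2[r:].sum() + (2 * r - p) * sigma ** 2
                   for r in range(p + 1)])            # estimated mean-squared error (60)
r_star = int(np.argmin(e2_hat))                       # order selection rule r* = argmin_r
print(r_star, e2_hat.round(2))
```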

7. Reduced-rank Wiener filters

Let x and y be zero-mean random vectors whose composite covariance matrix is

R = E [ x ; y ][ xᵀ  yᵀ ] = [ R_xx  R_xy ; R_yx  R_yy ].    (62)

We would like to exploit this covariance (or correlation) in order to estimate x from y. From this second-order characterization of the random vectors x and y, we may manufacture a linear model of the form

[ x ; y ] = [ I  0 ; H  I ][ x ; u ],    (63)

where H is the matrix H = R_yx R_xx⁻¹ and u is a random vector with correlation E uuᵀ = Q = R_yy − R_yx R_xx⁻¹ R_xy. We call this a synthesis or channel model for x and y, because it shows how to synthesize the measurement y from x and the artificial channel H. See Fig. 5(a). It is easy to check that this decomposition of x and y produces a block Cholesky factorization of R:

R = [ I  0 ; R_yx R_xx⁻¹  I ][ R_xx  0 ; 0  Q ][ I  R_xx⁻¹ R_xy ; 0  I ].    (64)

The matrix Q is the Schur complement of R_yy. The alternative model for x and y is the analysis or inverse channel model

[ x ; y ] = [ I  K ; 0  I ][ n ; y ],    (65)

where K is the matrix K = R_xy R_yy⁻¹ and n is a random vector with correlation E nnᵀ = P = R_xx − R_xy R_yy⁻¹ R_yx. This model, illustrated in Fig. 5(b), shows how the inverse channel K produces x from y. This model produces a block Cholesky factorization of R⁻¹:

R⁻¹ = [ I  0 ; −R_yy⁻¹R_yx  I ][ P⁻¹  0 ; 0  R_yy⁻¹ ][ I  −R_xy R_yy⁻¹ ; 0  I ],   P = E nnᵀ = R_xx − R_xy R_yy⁻¹ R_yx.    (66)

The matrix P is the Schur complement of R_xx.

Fig. 5. (a) Channel model, (b) inverse channel model and (c) filtering model.

7.1. The Wiener filter

The estimator x̂ = Ky produces errors that are orthogonal to y:

E[x − x̂]yᵀ = R_xy − K R_yy = 0.    (67)

From the principle of orthogonality as it applies to linear minimum mean-squared error estimators, we know that x̂ is the LMMSE estimator of x from y. The difference between x̂ and x is the vector n in the analysis model of (65). Therefore we can compute the Wiener filtered vector x̂ and the noise n from x and y as follows:

[ n ; x̂ ] = [ I  −K ; 0  K ][ x ; y ].    (68)

This filtering model produces a block diagonalization of R:

[ I  −K ; 0  K ][ R_xx  R_xy ; R_yx  R_yy ][ I  0 ; −Kᵀ  Kᵀ ] = [ P  0 ; 0  K R_yy Kᵀ ].    (69)

The analysis and filtering representations may be concatenated as in Fig. 5(c). For the moment, ignore the parenthetical (K(r)) in the figure. Evidently the filtering diagram provides the following decomposition:

[ n ; x̂ ] = [ I  −K ; 0  K ][ I  K ; 0  I ][ n ; y ] = [ I  0 ; 0  K ][ n ; y ].    (70)

If we take the Wiener filter as we have derived it, then x̂ = Ky and the corresponding mean-squared error is

e² = tr P.    (71)

7.2. Reducing rank

What if we replace the matrix K on the filtering side of Fig. 5(c) by the approximating matrix K(r)? The resulting diagram uses the parenthetical (K(r)) in Fig. 5(c). The new map from the vectors (n, y) to the vectors (n(r), x̂(r)) is

[ n(r) ; x̂(r) ] = [ I  −K(r) ; 0  K(r) ][ I  K ; 0  I ][ n ; y ] = [ I  K − K(r) ; 0  K(r) ][ n ; y ].    (72)

The implications here are profound: the approximating estimator x̂(r) has covariance K(r)R_yyK(r)ᵀ and the approximating error n(r) has a new component [K − K(r)]y with covariance [K − K(r)]R_yy[K − K(r)]ᵀ. In fact, the composite covariance structure for (n(r), x̂(r)) is

E [ n(r) ; x̂(r) ][ n(r)ᵀ  x̂(r)ᵀ ]
  = [ I  K − K(r) ; 0  K(r) ][ P  0 ; 0  R_yy ][ I  0 ; (K − K(r))ᵀ  K(r)ᵀ ]
  = [ P + [K − K(r)]R_yy[K − K(r)]ᵀ   [K − K(r)]R_yyK(r)ᵀ ; K(r)R_yy[K − K(r)]ᵀ   K(r)R_yyK(r)ᵀ ].    (73)

The rank r Wiener gain matrix K(r) introduces extra variance in the amount

e² = tr [K − K(r)]R_yy[K − K(r)]ᵀ = tr [K R_yy^{1/2} − K(r)R_yy^{1/2}][K R_yy^{1/2} − K(r)R_yy^{1/2}]ᵀ.    (74)

The matrix K R_yy^{1/2} = R_xy R_yy⁻¹ R_yy^{1/2} = R_xy R_yy^{−T/2} has an SVD of the form

K R_yy^{1/2} = UΣVᵀ.    (75)

The rank r version of this matrix that best approximates it (by minimizing e²) is

K(r) R_yy^{1/2} = UΣ(r)Vᵀ,   K(r) = UΣ(r)Vᵀ R_yy^{−1/2}.    (76)

The extra variance is then

e² = Σ_{i=r+1}^{N} σᵢ²,    (77)

where the σᵢ are the singular values of K R_yy^{1/2}. It is not hard to show that

K(r) = U diag[ σ₁²/(σ₁² + σ²), ..., σ_r²/(σ_r² + σ²), 0, ..., 0 ] Uᵀ

and

e² = Σ_{n=r+1}^{N} σ_n⁴/(σ_n² + σ²)

when R_xx = UΣ²Uᵀ and R_yy = U(Σ² + σ²I)Uᵀ. Our treatment of this problem is not quite complete because, at the time of this writing, we do not have a convincing argument for introducing the extra distortion in exchange for a reduction of something else. Nonetheless, the result seems compelling.
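A minimal sketch of the reduced-rank Wiener filter (75)-(77) follows. The covariance model (a random R_xx plus white noise) and the retained rank are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma2 = 6, 0.5

A = rng.standard_normal((n, n))
Rxx = A @ A.T                                    # a valid covariance for x
Ryy = Rxx + sigma2 * np.eye(n)                   # y = x + white noise, so R_xy = R_xx
Rxy = Rxx

K = Rxy @ np.linalg.inv(Ryy)                     # full-rank Wiener filter

# Symmetric square root of R_yy and the SVD of K R_yy^{1/2}.
w, V = np.linalg.eigh(Ryy)
Ryy_half = V @ np.diag(np.sqrt(w)) @ V.T
Ryy_half_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

U, s, Wt = np.linalg.svd(K @ Ryy_half)
r = 3
Kr = (U[:, :r] * s[:r]) @ Wt[:r, :] @ Ryy_half_inv   # rank-r Wiener gain (76)

extra_var = (s[r:] ** 2).sum()                   # extra variance (77): trailing sigma_i^2
print(np.linalg.matrix_rank(Kr), extra_var)
```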

8. Quadratic minimization with linear constraints

A standard problem in multisensor array processing is to construct a beamformer which passes the amplitude of a design wave undistorted while producing a small output in response to noise. This problem is illustrated in Fig. 6, where the output of the beamformer is y = θᵀu. When u = cᵢ, with cᵢ the ith design wave, then y = θᵀcᵢ = cᵢᵀθ = xᵢ and we say that the response is distortionless. When u = n, a zero-mean noise with covariance E nnᵀ = R, then the mean-squared value of y is e² = Ey² = EθᵀnnᵀΘ = θᵀRθ. The beamformer problem is therefore

min θᵀRθ   subject to   Hθ = x,    (78)

where H is the constraint matrix, θ is the beamformer and x is the vector of constrained responses:

H = [ c₁ᵀ ; c₂ᵀ ; ⋮ ; c_Nᵀ ],   x = [ x₁ ; x₂ ; ⋮ ; x_N ].    (79)

Fig. 6. Constrained beamformer.

We shall call H the matrix Cᵀ and the matrix Hᵀ the constraint matrix C, with corresponding constraint space ⟨C⟩. The dimension of C is p × N with p > N. Therefore this problem is equivalent to underdetermined least squares, with a weighted squared error criterion for θ, rather than the usual unweighted criterion θᵀθ. The solution to this problem is

θ₀ = R⁻¹C(CᵀR⁻¹C)⁻¹x,    (80)

and the corresponding minimum variance is

e₀² = xᵀ(CᵀR⁻¹C)⁻¹x.    (81)

The beamformer θ₀ is called a minimum variance, distortionless look (MVDL) beamformer.

8.1. Subspace decomposition of the beamformer

As it stands, the analytical solution for θ₀ does not bring much insight. Let us see if we can develop an alternative algebraic characterization and a corresponding geometrical picture. To this end, let us observe that any beamformer θ that hopes to compete for the title 'MVDL' must produce the minimum norm solution to the constraint equations when projected onto the constraint space ⟨C⟩. That is,

P(C)θ = C(CᵀC)⁻¹Cᵀθ = C(CᵀC)⁻¹x = θ_m.    (82)

The constraint space ⟨C⟩ is just the linear subspace spanned by the design waves {cᵢ}₁ᴺ. Therefore, we can illustrate the projection property of (82) in Fig. 7 and argue that any competing solution (including of course the minimum norm and optimum solutions) may be decomposed into a component that lies in the N-dimensional constraint subspace ⟨C⟩ and a component that lies in a (p − N)-dimensional free subspace ⟨A⟩ that is orthogonal to ⟨C⟩. That is,

θ = P(C)θ + P(A)θ = θ_m + P(A)θ.    (83)

Fig. 7. Illustrating constraints, the minimum norm solution θ_m and the optimum solution θ₀.

Let us call U₁U₁ᵀ the orthogonal decomposition of P(C) and U₂U₂ᵀ the orthogonal decomposition of the projector P(A) onto the free subspace ⟨A⟩. Then θ may be represented as

θ = θ_m + U₂z,    (84)

where z ∈ ℝ^{p−N}. The matrix U₂ is called a blocking matrix because U₂ᵀC = 0, meaning design waves are blocked by U₂. With this characterization of θ we can redraw the beamformer of Fig. 6 as the parallel combinations of subspace beamformers in Fig. 8(a-c). Figure 8(c) comes from writing y = θᵀu as

y = [θ_m + U₂z]ᵀu = n₁ − n̂₁,   n₁ = θ_mᵀu,   n̂₁ = −zᵀU₂ᵀu.    (85)

Then n̂₁ is an estimator of n₁, y is the estimator error, and Ey² is the variance of the estimator error. When u = n, this variance is

e² = Ey² = [θ_m + U₂z]ᵀR[θ_m + U₂z].    (86)

This quadratic form is minimized with respect to z to produce the solution

z = −(U₂ᵀRU₂)⁻¹U₂ᵀRθ_m.    (87)

This decomposition of θ into its subspace components has been used by Applebaum and Chapman [2] and Griffiths and Jim [7] to design adaptive beamformers where adaptation takes place only in the subspace ⟨U₂⟩ that is orthogonal to the subspace ⟨C⟩ spanned by the constraints. The corresponding value of e₀² is

e₀² = θ_mᵀRQθ_m,    (88)

where Q is the oblique projector

Q = I − U₂(U₂ᵀRU₂)⁻¹U₂ᵀR.    (89)

Fig. 8. Subspace decompositions of the MVDL beamformer: (a) subspace decomposition, (b) blocking matrix decomposition, (c) estimator interpretation.

8.2. Reducing rank in the constraint and free subspaces

When the constraint space ⟨C⟩ is low-dimensional, then the free space ⟨A⟩ is high-dimensional, and vice versa. In the first case, the span of the free space, namely U₂, may be reduced from its full rank of p − N. Then the blocking matrix U₂ in Fig. 8 is replaced by the reduced-rank blocking matrix U₂(r), r < p − N, and the optimum solutions for z and e² are

z(r) = −[U₂ᵀ(r)RU₂(r)]⁻¹U₂ᵀ(r)Rθ_m,
e₀²(r) = θ_mᵀRQ(r)θ_m,    (90)
Q(r) = I − U₂(r)[U₂ᵀ(r)RU₂(r)]⁻¹U₂ᵀ(r)R.

The trick is to find the best rank r blocking matrix U₂(r) to minimize e₀²(r). Conversely, if the constraint space is large, the dimensions of the constraint space can be reduced by replacing the rank N projector P(C) by the rank r projector P(r). The constraints are then missed, but there are more dimensions to be used in the free subspace to minimize the variance. The difference between x and x(r) is [P(C) − P(r)]x and the new variance is xᵀ[Cᵀ(r)R⁻¹C(r)]⁻¹x. A weighted measure of distortion plus variance is

γ xᵀ[P(C) − P(r)]x + (1 − γ) xᵀ[Cᵀ(r)R⁻¹C(r)]⁻¹x.    (91)

The trick is to find the rank r constraint matrix C(r) that produces the best trade for distortion (missed constraints) and variance. A dimension of ⟨C⟩ is discarded only when it produces a modest miss in constraints and a large reduction in variance.
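The closed-form MVDL solution (80) and its blocking-matrix form (84)-(87) are easy to check against each other numerically. In the sketch below the beamformer length, the constraints and the noise covariance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
p_dim, N_con = 10, 2                                  # beamformer length and number of constraints

C = rng.standard_normal((p_dim, N_con))               # design waves c_i (columns of C)
x = np.array([1.0, 0.0])                              # constrained responses C^T theta = x
B = rng.standard_normal((p_dim, p_dim))
R = B @ B.T + np.eye(p_dim)                           # noise covariance

# Closed-form MVDL solution (80) and its minimum variance (81).
theta0 = np.linalg.solve(R, C) @ np.linalg.solve(C.T @ np.linalg.solve(R, C), x)
e2 = x @ np.linalg.solve(C.T @ np.linalg.solve(R, C), x)

# Subspace form (82)-(87): minimum-norm solution plus a free-subspace correction.
theta_m = C @ np.linalg.solve(C.T @ C, x)             # P(C) theta = theta_m
Q, _ = np.linalg.qr(C, mode='complete')
U2 = Q[:, N_con:]                                     # blocking matrix: U2^T C = 0
z = -np.linalg.solve(U2.T @ R @ U2, U2.T @ R @ theta_m)   # (87)
theta = theta_m + U2 @ z

print(np.allclose(C.T @ theta, x), np.allclose(theta, theta0), e2, theta @ R @ theta)
```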

9. Rate distortion and reduced-rank block quantizers

In this section we take up the question of approximating a random vector x with a low-rank, quantized, random vector x̂(r). We begin with the simpler problem of approximating x with an infinite precision, low-rank approximation x(r) and then proceed to the harder problem of quantizing the approximation.

9.1. Low-rank approximation of a random vector

Let x denote a zero-mean random vector with covariance matrix E xxᵀ = R_xx. A low-rank approximation of x is

x(r) = Kx,    (92)

where K is an N × N rank r matrix. The mean-squared error between x and x(r) is

e² = tr E[x − x(r)][x − x(r)]ᵀ = tr[I − K]R_xx[I − K]ᵀ.    (93)

The solution for K that minimizes e² is

K = UI(r)Uᵀ,    (94)

where R_xx = UΣ²Uᵀ is the EVD of R_xx. The corresponding mean-squared error is

e² = tr[I − I(r)]Σ²[I − I(r)] = Σ_{i=r+1}^{N} σᵢ²,   Σ² = diag[σ₁² σ₂² ... σ_N²].    (95)

The matrix K is a rank r projection and the σᵢ², i ≥ r + 1, are the trailing eigenvalues of R_xx. With this solution we redraw Fig. 9(a, b), where the rank r matrix K has been decomposed into a coder Uᵀ, an editor I(r) and a decoder U. The process of rank reduction may be interpreted as coding x with Uᵀx, editing the elements of Uᵀx for the dominant modes I(r)Uᵀx, followed by decoding with U: x(r) = U[I(r)(Uᵀx)].

Fig. 9. Approximating a random vector by a low-rank random vector: (a) approximating x, (b) coder, editor, decoder, (c) noisy problem.

Typically the pattern of eigenvalues for R_xx splits the eigenvectors of U into dominant and subdominant sets. Then the choice of r for a low-rank approximant x(r) is more or less obvious. But there is another way to proceed that brings additional insight. Assume, as illustrated in Fig. 9(c), that white noise n, with covariance σ²I, is introduced into a channel between the two halves of the projector P(r). The resulting approximant to x is x̂(r) = UI(r)(Uᵀx + n). The error between x and x̂(r) is x − x̂(r) = U[I − I(r)]Uᵀx − UI(r)n, and the mean-squared error is

e²(r) = tr U[I − I(r)]UᵀR_xxU[I − I(r)]Uᵀ + tr UI(r)σ²I(r)Uᵀ = Σ_{i=r+1}^{N} σᵢ² + rσ².    (96)

The best choice of rank minimizes e²(r).

9.2. Rate distortion

Now consider a variation on the problem of approximating a random vector x with a low-rank approximant x(r). In this variation, the editor I(r) of Fig. 9, which selects the components of Uᵀx that have dominant eigenvalues, is replaced by a quantizer Q(r) that independently quantizes the components of Uᵀx that have dominant eigenvalues. We call this quantizer a reduced-rank block quantizer and illustrate it as in Fig. 10(a). The quantizer Q(r) is treated as a noise source that introduces independent white noise n, whose mean is zero and whose covariance is

R_nn = diag[ν₁², ν₂², ..., ν_r²],   ν_n² = σ_n² 2^{−2B_n}.    (97)

The variance ν_n² is the variance associated with a uniform rounding quantizer that uses B_n bits for a zero-mean variable whose variance is σ_n². See Fig. 10(b).

Fig. 10. Reduced-rank block quantizer: (a) block quantizer, (b) equivalent noise model.

The mean-squared error between x and x̂(r) is

e²(r) = tr E[x − x̂(r)][x − x̂(r)]ᵀ
      = tr[I − UI(r)Uᵀ]R_xx[I − UI(r)Uᵀ] + tr U(r)R_nnU(r)ᵀ
      = Σ_{n=r+1}^{N} σ_n² + Σ_{n=1}^{r} σ_n² 2^{−2B_n}.    (98)

There are two problems of interest: (i) fix the number of bits B = Σ_{n=1}^{r} B_n and then find B_n to minimize the mean-squared error e²(r); this is the rate-distortion problem; (ii) fix the mean-squared error e²(r) and find B_n to minimize B = Σ_{n=1}^{r} B_n; this is the distortion-rate problem.

The rate-distortion and distortion-rate problems are solved as follows. Minimize e²(r) with respect to B_n under the constraint that Σ_{n=1}^{r} B_n = B:

∂/∂B_n [ Σ_{n=r+1}^{N} σ_n² + Σ_{n=1}^{r} σ_n² 2^{−2B_n} + 2θ( Σ_{n=1}^{r} B_n − B ) ] = −2σ_n² 2^{−2B_n} + 2θ = 0.    (99)

The result for B_n is

B_n = ½ log₂(σ_n²/θ).    (100)

This is non-negative for σ_n²/θ ≥ 1. So, evidently the rank r is r = max n : σ_n²/θ ≥ 1. The resulting solutions for rank, bit allocation, and mean-squared error are

r = max n : σ_n²/θ ≥ 1,
B_n = ½ log₂(σ_n²/θ),
B = Σ_{n=1}^{N} B_n = Σ_{n=1}^{N} max(0, ½ log₂(σ_n²/θ)),    (101)
e²(r) = Σ_{n=r+1}^{N} σ_n² + Σ_{n=1}^{r} θ = Σ_{n=1}^{N} min(θ, σ_n²).

The pair [B, e²(r)] is the rate-distortion pair and the pair [e²(r), B] is the distortion-rate pair. If B is fixed, then the formula for B is used to determine θ, which in turn determines r, B_n and e²(r). If e²(r) is fixed, then the formula for e²(r) is used to determine θ, which in turn determines r, B_n and B. You can see that the solutions to these two problems are determined by the Lagrange multiplier θ, which slices eigenvalues into dominant ones and subdominant ones for rank reduction according to the formula σ₁² > ⋯ > σ_r² > θ > σ_{r+1}² > ⋯ > σ_N². Of course, the slicing levels depend on B in the rate-distortion problem and on e²(r) in the distortion-rate problem. This is Shannon's basic result, interpreted in terms of rank reduction as a trade of variance (Σ_{n=1}^{r} θ) for 'distortion' (Σ_{n=r+1}^{N} σ_n²). See [17; 14, Chapter 9] for similar discussions.
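The slicing-level solution (99)-(101) can be sketched with a simple bisection on θ for a fixed bit budget B. The eigenvalue spectrum below is an illustrative assumption, not data from the paper.

```python
import numpy as np

def allocate_bits(sig2, B_total, tol=1e-9):
    """Bisect on the slicing level theta so that sum_n max(0, 0.5*log2(sig2/theta)) = B_total."""
    lo, hi = tol, sig2.max()
    for _ in range(200):
        theta = 0.5 * (lo + hi)
        B = np.maximum(0.0, 0.5 * np.log2(sig2 / theta)).sum()
        lo, hi = (lo, theta) if B < B_total else (theta, hi)   # smaller theta -> more bits
    Bn = np.maximum(0.0, 0.5 * np.log2(sig2 / theta))          # bit allocation (100)
    mse = np.minimum(theta, sig2).sum()                        # e^2(r) = sum_n min(theta, sigma_n^2)
    return theta, Bn, mse

sig2 = np.exp(np.linspace(3, -4, 16))                          # a decaying eigenvalue spectrum
theta, Bn, mse = allocate_bits(sig2, B_total=20.0)
r = int((Bn > 0).sum())                                        # retained rank: sigma_n^2 > theta
print(r, theta, Bn.round(2), mse)
```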

10. Reduced-rank Gauss-Gauss detectors

The classical binary detection problem is to decide which of two models for a system has produced a set of measurements. In the Gauss-Gauss case, the problem is to measure x and test

H₀ : x : N[0, R₀]   versus   H₁ : x : N[0, R₁].    (102)

The log-likelihood ratio test is

φ(x) = { 1 ∈ H₁, l(x) > η ;  0 ∈ H₀, l(x) ≤ η },    (103)

where the log-likelihood is the quadratic form

l(x) = xᵀ(R₀⁻¹ − R₁⁻¹)x.    (104)

This log-likelihood may be written in two other illuminating forms, by defining the 'signal-to-noise ratio' matrix S and by finding the EVD of S:

(i) S = R₀^{−1/2} R₁ (R₀^{−1/2})ᵀ,   R₀ = R₀^{1/2}(R₀^{1/2})ᵀ;
(ii) S = UΛUᵀ,   Λ = diag[λ₁ λ₂ ... λ_N],   λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_N.

Then the log-likelihood ratio l(x) may be rewritten as

(i) l(x) = yᵀ(I − S⁻¹)y,   y = R₀^{−1/2}x;
(ii) l(x) = yᵀU(I − Λ⁻¹)Uᵀy.

These two structures are illustrated in Fig. 11. Under the hypothesis H₀, the transformed measurement y = R₀^{−1/2}x is distributed as y : N[0, I], and under hypothesis H₁, y is distributed as y : N[0, S]. The quadratic form l(x) is therefore distributed as a 'chi-squared-like' random variable. It is not hard to show [14, Chapters 2 and 4] that the characteristic function for l(x) is the following under H₀ and H₁:

H₀ : Φ(ω) = 1 / [det[I + 2jω(I − Λ⁻¹)]]^{1/2},    (105)
H₁ : Φ(ω) = 1 / [det[I + 2jω(Λ − I)]]^{1/2}.

These results show that unity eigenvalues of the signal-to-noise ratio contribute nothing to log-likelihood, meaning that they may be discarded. The resulting reduced-rank likelihood ratio detector is illustrated in Fig. 11(c). Perhaps we can discard other eigenvalues that are not quite identity. What seems to matter from investigation of characteristic functions are the differences λᵢ − 1 and 1 − 1/λᵢ.

Fig. 11. Log-likelihood structures: (a) full-rank quadratic detector, (b) full-rank quadratic detector in orthogonal form, (c) low-rank quadratic detector in orthogonal form.

One way to proceed with a study of which eigenvalues are important and which are unimportant is to study the divergence that l(x) brings about H₁ and H₀:

J = E_{H₁} l(x) − E_{H₀} l(x) = tr(S + S⁻¹ − 2I) = tr(Λ + Λ⁻¹ − 2I)
  = Σ_{n=1}^{N} (λ_n + 1/λ_n − 2) = Σ_{n=1}^{N} (λ_n^{1/2} − 1/λ_n^{1/2})².    (106)

So, to the extent that divergence characterizes detector performance, an eigenvalue is unimportant whenever λ_n + 1/λ_n − 2 is small. If we agree that the detector of Fig. 11(c) can have rank no larger than r, then the following algorithm selects the appropriate λ and corresponding u for the set G of 'good detector modes' [16]:

Let i = 1, j = N
For k = 1 to r do
  If (λᵢ ≥ λ_j⁻¹) then
    i ∈ G
    i = i + 1
  Else if (λᵢ < λ_j⁻¹) then
    j ∈ G
    j = j − 1
  End if
  Increment k.

Example. Suppose s is an N[0, R_ss] signal with R_ss the rank-deficient covariance R_ss = σ_s²UI(r)Uᵀ = σ_s²P(r) and noise n is N[0, σ_n²I]. Under H₀, x = n and under H₁, x = s + n. Then the signal-to-noise ratio matrix S is

S = UΛUᵀ,   Λ = I + (σ_s²/σ_n²)I(r),   I − Λ⁻¹ = (σ_s²/(σ_s² + σ_n²))I(r).    (107), (108)

The prewhitened vector y = R₀^{−1/2}x is σ_n⁻¹x, and the log-likelihood ratio becomes the following quadratic form in the rank r projection matrix P(r):

l(x) = yᵀU(I − Λ⁻¹)Uᵀy = (σ_s²/σ_n²)/(σ_s² + σ_n²) · xᵀUI(r)Uᵀx = (σ_s²/σ_n²)/(σ_s² + σ_n²) (P(r)x)ᵀ(P(r)x),   P(r) = UI(r)Uᵀ.    (109)

The likelihood ratio is the power in the projection P(r)x, normalized by the ratio of σ_s²/σ_n² and σ_s² + σ_n².
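The reduced-rank detector can be sketched numerically as follows: whiten by R₀^{−1/2}, eigendecompose the signal-to-noise matrix S, keep the modes whose divergence terms λ + 1/λ − 2 are largest, and evaluate the reduced-rank log-likelihood. The covariances R₀ and R₁ and the kept rank are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
N, r = 8, 3

sigma_n2, sigma_s2 = 1.0, 4.0
R0 = sigma_n2 * np.eye(N)
Us, _, _ = np.linalg.svd(rng.standard_normal((N, r)), full_matrices=False)
R1 = R0 + sigma_s2 * Us @ Us.T                            # rank-r signal plus white noise

# Whiten by R0^{-1/2}, then eigendecompose the signal-to-noise matrix S.
w0, V0 = np.linalg.eigh(R0)
R0_mhalf = V0 @ np.diag(w0 ** -0.5) @ V0.T
S = R0_mhalf @ R1 @ R0_mhalf.T
lam, U = np.linalg.eigh(S)
keep = np.argsort(np.abs(lam + 1.0 / lam - 2.0))[::-1][:r]   # modes ranked by divergence terms

def loglik_reduced(x):
    y = R0_mhalf @ x
    c = U[:, keep].T @ y
    return float(c @ ((1.0 - 1.0 / lam[keep]) * c))       # y^T U (I - Lambda^{-1}) U^T y, kept modes

x = Us @ (np.sqrt(sigma_s2) * rng.standard_normal(r)) + rng.standard_normal(N)  # an H1-like draw
print(loglik_reduced(x))
```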

11. Concluding remarks

In this paper, I have tried to present, in a tutorial manner, the principles of rank reduction for signal processing. There are two ideas that fit together to produce successful applications of these principles. The first idea is that reduced-rank (or parsimonious) signal processing algorithms can be derived by trading distortion for variance in a signal processing solution. The second is that distortion and variance may often be characterized in terms of SVDs and EVDs for data matrices, model matrices, correlation matrices, projections and the like. We have illustrated the application of the principle to a number of problems that arise over and over again in signal processing. Many of the results have been reported in previous papers, but a few are new. Among the latter we mention the reduced-rank Wiener filter of Section 7 and the reduced-rank quadratic minimizer of Section 8.

All of the results of this paper relate to linear estimators. But the principles apply to quadratic estimators as well. In [11, 24] similar ideas have been applied to spectrum analysis and to interpretation and extension of Thomson's multiple-window spectrum estimators [21].

There are many antecedents for the ideas advocated in this paper. Without presuming to present anything like a comprehensive account of them, I would like to acknowledge the work that has influenced my own thinking. In mathematical physics, 'regularized' solutions to inverse problems are always employed. Model reduction has been an important issue in the identification and approximation of dynamical systems, and the SVD found some of its earliest applications in the field of control and system identification [25]. Tom Mullis first alerted me to the equivalence between model reduction and rank reduction, and he used this equivalence to study the approximation of discrete-time systems in [10]. Emmanuel Parzen recognized the importance of parsimonious autoregressive modeling in the late '60s and early '70s and advocated rules based on asymptotic properties of estimators [12]. The order selection rules of Akaike reflect the same concern for controlling model complexity [1]. Don Tufts recognized Shannon's rate distortion theory as a rank reduction theory and encouraged me to view it this way. The resulting insight led us to a number of successes for what I have termed the distortion-variance trade [15, 22]. I remember Ronald Pyke remarking in his Mathematical Statistics class at the University of Washington in the late '60s that unbiasedness was not as important to statistical inference as one


would infer from its prominent role in the literature. I now see unbiasedness (or distortionlessness, to coin an equally awkward term) as a burdensome constraint that should almost always be traded away for something akin to variance reduction.


Acknowledgments

This work was supported in part by the Office of Naval Research, Arlington, VA 22217, under contract N87-J-1070 and the National Science Foundation under Contract 8622236 to the Center for Optoelectronic Computing Systems at the University of Colorado.


References

[1] H. Akaike, "A new look at the statistical model identification", IEEE Trans. Automat. Control, Vol. AC-19, December 1974, pp. 716-723.
[2] S.P. Applebaum and D.J. Chapman, "Adaptive arrays with main beam constraints", IEEE Trans. Antennas and Propagation, Vol. AP-24, September 1976, pp. 650-662.
[3] R.T. Behrens, "Identification of subspaces", Ph.D. Dissertation, University of Colorado at Boulder, December 1990.
[4] G. Bienvenu and L. Kopp, "Principe de la goniométrie passive adaptative", Proc. Colloque GRETSI, Nice, May 1979, pp. 106-1-106-10.
[5] Y. Bressler and A. Macovski, "Exact maximum likelihood identification of superimposed exponentials in noise", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-34, October 1986, pp. 1081-1089.
[6] A.G. Evans and R. Fischl, "Optimal least squares time domain synthesis of recursive digital filters", IEEE Trans. Audio Electroacoust., Vol. AU-21, February 1973, pp. 61-65.
[7] L.J. Griffiths and C.W. Jim, "An alternative approach to linearly constrained adaptive beamforming", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-30, January 1982, pp. 27-34.
[8] R. Kumaresan and D.W. Tufts, "Estimating the parameters of exponentially-damped sinusoids and pole-zero modeling in noise", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-30, December 1982, pp. 833-840.
[9] R. Kumaresan, L.L. Scharf and A.K. Shaw, "An algorithm for pole-zero modeling and spectral analysis", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-34, June 1986, pp. 637-640.
[10] C.T. Mullis, "Rational approximation with internal projections", Proc. Twenty-First Asilomar Conf. on Signals, Systems, and Computers, Pacific Grove, CA, November 1987.


[11] C.T. Mullis and L.L. Scharf, "Quadratic estimators of the power spectrum", in: S. Haykin, ed., Advances in Spectrum Analysis and Array Processing, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[12] E. Parzen, "Some recent advances in time series modeling", IEEE Trans. Automat. Control, Vol. AC-19, December 1974, pp. 723-730.
[13] L.L. Scharf, "Topics in statistical signal processing", in: J.L. Lacoume, T.S. Durrani and R. Stora, eds., Signal Processing, Elsevier, Amsterdam, 1987, Course 2.
[14] L.L. Scharf, Statistical Signal Processing, Addison-Wesley, Reading, MA, 1990.
[15] L.L. Scharf and D.W. Tufts, "Rank reduction for modeling stationary signals", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-35, March 1987, pp. 350-355.
[16] L.L. Scharf and B.D. Van Veen, "Low-rank detectors for Gaussian random vectors", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-35, November 1987, pp. 1579-1582.
[17] L.L. Scharf and J. Storey, "Rank reduction and order selection for parametric spectrum models", Proc. Third ASSP Workshop on Spectrum Analysis, Boston, November 1986.
[18] R.O. Schmidt, "Multiple emitter location and signal parameter estimation", Proc. RADC Spectrum Estimation Workshop, October 1979.
[19] C.E. Shannon, "Coding theorems for a discrete source with a fidelity criterion", IRE National Convention Record, Part 4, 1959, pp. 142-163.
[20] J.L. Shanks, "Recursive filters for digital processing", Geophysics, Vol. 32, 1967, pp. 32-51.
[21] D.J. Thomson, "Spectrum estimation and harmonic analysis", Proc. IEEE, Vol. 70, 1982, pp. 1055-1096.
[22] A. Thorpe and L.L. Scharf, "Reduced rank methods for solving least squares problems", Proc. Twenty-Third Asilomar Conf. on Signals, Systems, and Computers, October 1989, pp. 609-613.
[23] D.W. Tufts and R. Kumaresan, "Singular value decomposition and improved frequency estimation using linear prediction", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-30, August 1982, pp. 671-675.
[24] B.D. Van Veen and L.L. Scharf, "Estimation of structured covariance matrices and multiple window spectrum analysis", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-38, August 1990, pp. 1467-1472.
[25] H.P. Zeiger and A.J. McEwen, "Approximate linear realizations of given dimensions via Ho's algorithm", IEEE Trans. Automat. Control, Vol. AC-19, April 1974, p. 153.
