Linear prediction and maximum entropy methods in NMR spectroscopy

Progress in NMR Spectroscopy, Vol. 20, pp. 515-626, 1988. Printed in Great Britain. All rights reserved. Copyright © 1988 Pergamon Press plc.

LINEAR PREDICTION AND MAXIMUM ENTROPY METHODS IN NMR SPECTROSCOPY

DAVID S. STEPHENSON

Institute of Organic Chemistry, University of Munich, Karlstrasse 23, 8000 Munich 2, Germany

(Received 28 July 1988)

CONTENTS

1. Introduction
   1.1. Overview
   1.2. Basic Concepts
2. NMR Experiment
3. Fourier Transformation
   3.1. Fourier Series
   3.2. The Fourier Transform
   3.3. The Two Domains
   3.4. Discrete Fourier Transformation
4. Attributes of Fourier Transformation
   4.1. Advantageous Properties of the DFT
   4.2. Disadvantageous Properties of the DFT
5. The Problem Restated
6. Modelling
   6.1. Autoregressive Methods and Linear Prediction
      6.1.1. The Spectrum of a Predictive Time Series
      6.1.2. Yule-Walker Equations
      6.1.3. The Levinson Recursion
      6.1.4. Burg's Modification
      6.1.5. Least-Squares Methods
      6.1.6. Prediction Order Selection
      6.1.7. Eigenanalysis Methods
      6.1.8. Summary of AR and LP Methods
   6.2. Functional Models
      6.2.1. Damped Exponential Methods
      6.2.2. Kumaresan and Tufts's Approach
      6.2.3. Avoiding Polynomial Rooting
      6.2.4. Summary of Functional Modelling
   6.3. Extension to Two Dimensions
7. Spectral and Temporal Discriminating Functions
   7.1. Maximum Entropy
      7.1.1. Probability Density of One Random Variable
      7.1.2. Frieden's Entropy
         7.1.2.1. Gull and Daniell's Implementation
         7.1.2.2. Skilling and Bryan's Implementation
      7.1.3. Probability Density of Several Random Variables
      7.1.4. Autocorrelation and Entropy Rate of a Stochastic Process
      7.1.5. Burg's Entropy
      7.1.6. Comments on Maximum Entropy
      7.1.7. Summary of Maximum Entropy Methods
   7.2. Maximum Smoothness
   7.3. Extension to Two Dimensions
8. Demonstrations
   8.1. One Dimensional
   8.2. Two Dimensional
Appendix A. Some Statistical Concepts relating to Time Series
Appendix B. The z Transform and the Complex Plane
Postscript
Acknowledgements
References



1. INTRODUCTION

1.1. Overview

This review has been written to provide the general NMR spectroscopist with an opportunity to understand several alternatives to direct Fourier Transformation (FT) which have recently found application in NMR. These new methods all seem to do the impossible: they promise high resolution from very short records and excellent sensitivity from noisy signals. Indeed it would seem they fulfil a spectroscopist's wildest dreams. One disadvantage is, in some cases, a not insignificant increase in computation. However, if previous advances in computer power can be extrapolated into the future this penalty should become negligible. A second drawback is a possible distrust by the analyst in the validity of their results. This is partly due to their novelty, but also because some methods which have been published have been shown to produce spurious spectral lines and intensity distributions which vary greatly from those expected. I hope to be able to clarify this situation.

The review was planned as an excursion into Maximum Entropy (ME), which has interested me ever since I read Sibisi's original paper.(1) But we have come a long way since then and spectral estimation methods of a completely different kind also find their place here. This has two unfortunate consequences: firstly the review becomes longer and secondly, once the floodgates are opened, a profusion of techniques culled from other disciplines pours in. It is necessary to set a limit and I have chosen to treat only those methods which have appeared in the NMR literature, pointing out modifications which may be available. The methods seem to me to fall into two classes: those which work conceptually in the frequency domain and seek a spectrum with particular general characteristics (such as a maximum of the configurational entropy or a minimum of the overall second derivative), and those which work conceptually in the time domain and seek the spectrum of a time series which has the same statistical properties as the measured time series. The division is not clear, particularly as regards the major contender in the field at present, namely Maximum

Entropy, which unfortunately is represented, by name at least, in both classes, a fact which shall be explained later.

In many places I have sacrificed mathematical rigour to achieve a coherent story-line.

Indeed, I have gone even further. The bits I have found difficult to understand have been reinterpreted (particularly contour integration; surprisingly, I'm not alone in not understanding contour integration(2)). The bits I haven't understood have been left out: I make no apology for this. I am not a mathematician and neither probably are you. But should you wish to fight your own way through the jungle of equations you will be admirably served by the reference list. I have chosen a tutorial approach for two reasons. In the first place the methods are completely novel in NMR and thus need a more comprehensive explanation than mere references to the original literature, although several excellent reviews do exist. Secondly the development of these methods has not yet reached such a stage that their limits can be recognised, and in this respect I feel that a basic instruction in the fundamentals

provides a good foundation for future developments. The whole paper can be considered as an up-to-date Appendix to the excellent treatment by Lindon and Ferrige.(3)

Throughout the article I have nearly always employed small italic letters for functions of time x(t), and for sequences of values obtained from such functions by discretisation the notation {xk} is used. The corresponding functions of frequency, derived by transformation, are denoted by capital letters X(ν), and the sequences formed by sampling at discrete frequencies as {Xj}. To be more specific: measured time data including noise are represented as {dk}, which are composed of the data we are interested in {xk} plus the unavoidable noise {nk}. The autoregressive methods of Section 6.1 use another type of "noise" {ek}, which can be regarded either as a white noise sequence which "drives" an autoregressive process or as a vector whose elements are a measure of the quality of a linear prediction (the prediction error).

1.2. Basic Concepts

The principal purpose of doing an NMR experiment is to obtain information which will be of use in answering questions about the structure and dynamics of the substances under investigation. Because the primary data produced during a pulse experiment only rarely cast light on the relevant questions it is usual to transform these data into a spectral form since this is more easily assimilated. The concept of a spectrum as a plot of intensity against wavelength, frequency or energy(4) is well known, but a more substantial definition is necessary right at the beginning since our chief concern is the second stage of the NMR experiment, namely the spectral analysis of the primary data after the pulsing is complete and before the interpretation is begun.

We can immediately distinguish between two complementary ways of considering the problem in front of us. Although the dividing line is vague, this division is related to the above mentioned partition into two classes. The one takes as its outset the signal which comes from the NMR probe and treats this as a random signal; indeed it is such, being a single-valued function of time which conveys information and about which there is some degree of uncertainty before it occurs.(4) The uncertainty arises from noise in the system and from unwanted interference. The amplitude of this signal at a particular time is regarded as a random variable(5,6) (that is to say the amplitude may take any of a set of values and each value has associated with it a specific probability) and the whole signal is regarded as a stochastic process.(5,6) More precisely: any one measured signal is a member of an ensemble comprised of functions of time and it is the ensemble which is the stochastic process. In dealing with such processes it is of particular interest to know at which frequencies the power occurs, and it is in this sense that the term spectrum is usually defined, namely as a function whose domain is frequency and whose range is the expectation value of the power of a stochastic process at a given frequency. Judging by the literature relating to the modelling methods of Section 6 this seems to come naturally to those with an engineering background, who regard the data as being of primary importance and the spectrum merely a useful way of parameterising these data.

To those of us, however, brought up with the chemical applications of NMR, particularly those who remember CW instruments, the emphasis lies with the spectrum, which is the frequency distribution of phased intensity. The data here play a subordinate role, and we assume that there is one spectrum which can be regarded as a concrete entity and that this spectrum sends its information to us in the form of an FID which is corrupted by noise having specific well-known characteristics. The aim of the analysis is to obtain this single spectrum, and the task is made difficult firstly because of the noise and secondly because not enough data may be available, i.e. the FID may be too short. The job before us is to choose from a multitude of suitable spectra just one which fulfils our purpose.

A bridgehead between these two attitudes may be the review by Robinson,(7) which provides valuable insights into many areas of physics and which is in a volume of the Proceedings of the IEEE devoted solely to spectral estimation. I must admit that both


viewpoints are related by a transformation in consideration space: the first will certainly appeal to those well versed in classical signal processing. The second, however, avoids this necessity and is closely related to the field of image enhancement and picture processing.(8)

To emphasise this dichotomy, to demonstrate that data may be presented in several spectral forms none of which has an absolute right to be called "correct", and to provide a foretaste of what is to come, I present the 1H spectra of orthodichlorobenzene at 100 MHz in Fig.1. Figures 1a and 1d illustrate conventional Fourier transformation and display noise, low resolution and truncation artefacts. The spectra produced by two "modern" methods (Figs 1b, 1c, 1e and 1f) do not show these defects. You may by now have the uncomfortable feeling that this is going to be terribly complicated (you're right), so let us return to something more familiar.

2. NMR EXPERIMENT

The immediate goal of an NMR experiment is the spectrum corresponding to the spin system being studied. Of course the general structure of the spectrum is dependent on the type of experiment performed, but to exemplify the arguments presented in this article it suffices to deal initially with the simplest of all pulse sequences.

To understand the relationship between a spin system and its spectrum consider the one-pulse experiment illustrated in Fig.2. The equilibrium magnetisation M0, initially along the z-axis (Fig.2a), is brought parallel to the y'-axis of the rotating frame(9) by an r.f. pulse (Fig.2b). The individual spin isochromats(9) then start to fan out in this frame at speeds given by the differences between their characteristic Larmor frequencies and the frequency of the frame (Fig.2c).

FIG. 2. Stages of evolution of the equilibrium magnetisation in a simple pulse experiment. In (a) the system is at equilibrium and is perturbed by a 90° pulse in (b), whilst (c) represents behaviour in the x'y' plane after a short time interval.

Without loss of generality the response of the spin system to the pulse can be described in terms of these J isochromats, each with its own particular amplitude Xj and offset frequency νj Hz, and this is the spectral description in the frequency domain pictured in Fig.3a, where the ordinate represents amplitude and the abscissa frequency. The amplitudes of the respective isochromats, portrayed as vertical bars in Fig.3a, may be brought together to form the spectral vector {Xj}, j = 0, ..., J-1.

FIG. 3. (a) The spectrum and (b) the FID corresponding to the experiment in Fig.2.

This representation is what we seek but not what we measure. To see this consider Fig.2c again. In precessing about the z-axis the spin isochromats may give rise to voltages induced in coils whose axes are parallel to the x and y directions of the fixed laboratory frame. The amplitudes of these voltages oscillate at the Larmor frequencies mentioned above, i.e. in the radiofrequency range, and are reduced to audiofrequency signals by combining, in a Phase Sensitive Detector, with the frequency of rotation of the frame.(9,10) If one receiver coil is employed a single analogue signal may be obtained as a function of time, and this signal (or FID) contains information about the intensities and offset frequencies of the spin isochromats. If two receiver coils orthogonal to each other were to be used (this conceptually simple method is almost(11) always replaced by the constructionally simpler method of employing one receiver coil and phase sensitive detecting with two frequencies shifted 90° with respect to each other) the two recordable analogue signals can be regarded as the real and imaginary parts of a complex signal. This quadrature mode of detection delivers also the sign of the offset frequency,(9) in addition to the intensity and


frequency information. (There are several different practical ways of realising quadrature detection;(12) the one chosen here is didactically the simplest.) After detection this continuous time signal x(t) is uniformly sampled at a rate of 1/Δt Hz to give rise to the discrete time series {xk}, k = 0, ..., K-1, which is stored as a vector whose elements may be real or complex depending on the mode of detection. Just such a real vector is displayed in Fig.3b, where the ordinate again represents amplitude but the abscissa now stands for time.

To see how the spectrum and FID are linked together let us consider how the induced voltage is produced. Each isochromat generates a sinusoidally varying signal given by y_j(t) = X_j \exp(i 2\pi \nu_j t), where quadrature detection is explicitly assumed, and thus the continuous FID, which is the sum over all isochromats, is

x(t) = \sum_{j=0}^{J-1} X_j \exp(i 2\pi \nu_j t) .

This signal is sampled at times t_k = kΔt, thus the elements of the measured FID vector are

x_k = \sum_{j=0}^{J-1} X_j \exp(i 2\pi \nu_j k \Delta t) .   (1)

If, furthermore, the frequencies of the isochromats are given by ν_j = jΔν, i.e. the spectrum represents a uniform sampling in frequency, eqn (1) is converted to

x_k = \sum_{j=0}^{J-1} X_j \exp(i 2\pi j k \Delta t \Delta \nu) .   (2)

This last equation shows that if the J elements of the spectral vector {Xj} are known, and if the sampling rates in time and frequency are also known, then the FID vector {xk} is fully determined. To put it another way, if we know the spectrum we need no mental gymnastics to calculate the FID. Equation (2) acts a bit like a one-way street, and if our aim is to calculate the spectrum from the measured FID then several detours are necessary before we arrive at our destination. Our progress will be impeded by noise which falsifies the values xk, and may be made more difficult still if the number K of observed data is small. The standard approach to this spectral analysis problem is to use direct Fourier Transformation, a technique which was introduced into NMR in 1966,(13) and whose application is adequately described in references 3, 14 and 15. Because the equations of Fourier transformation are essential to an understanding of all spectral analysis techniques, the next section is devoted to their explanation.

3. FOURIER TRANSFORMATION

3.1. Fourier Series

It is well known that a complex periodic signal d(t), where I have explicitly used the notation "d" to emphasise that it is a measured signal including noise, may be represented as a series expansion using the set of orthogonal functions exp(i2πjt/T), j = 0, ±1, ±2, ..., where T is the period in seconds.(16) Thus

d(t) = \sum_{j=-\infty}^{+\infty} D_j \exp(i 2\pi j t / T)   (3)

with complex coefficients Dj is the Fourier series representation of the signal d(t). Since the functions exp(i2πjt/T) are orthogonal, i.e.

T^{-1} \int_{-T/2}^{+T/2} \exp(-i 2\pi j t / T) \exp(i 2\pi k t / T) \, dt = \delta_{jk} ,   (4)

the coefficients are easily determined to be

D_j = T^{-1} \int_{-T/2}^{+T/2} d(t) \exp(-i 2\pi j t / T) \, dt .   (5)
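The recipe of eqn (5) can be checked numerically. The sketch below (plain Python; the amplitudes, period and grid size are illustrative choices made here, not values from the text) builds a two-component periodic signal and recovers its coefficients by a Riemann-sum approximation of the integral:

```python
import cmath

# Numerical check of eqn (5) for a synthetic periodic signal
# d(t) = 2 exp(i 2*pi*3 t/T) + 0.5 exp(-i 2*pi*1 t/T).
T = 2.0      # period in seconds (illustrative)
N = 4096     # integration grid points (illustrative)

def d(t):
    return 2.0 * cmath.exp(2j * cmath.pi * 3 * t / T) \
         + 0.5 * cmath.exp(-2j * cmath.pi * 1 * t / T)

def fourier_coeff(j):
    # D_j = T^-1 * integral_{-T/2}^{+T/2} d(t) exp(-i 2*pi*j*t/T) dt,
    # approximated by a midpoint Riemann sum on N points.
    dt = T / N
    total = 0j
    for n in range(N):
        t = -T / 2 + (n + 0.5) * dt
        total += d(t) * cmath.exp(-2j * cmath.pi * j * t / T) * dt
    return total / T

print(abs(fourier_coeff(3)))   # close to 2.0
print(abs(fourier_coeff(-1)))  # close to 0.5
print(abs(fourier_coeff(2)))   # close to 0.0
```

Only the two coefficients actually present in d(t) come out non-zero, which is the statement that {Dj} is a complete spectral representation of the periodic signal.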

FIG. 4. Discrete spectrum of a periodic function.

These coefficients Dj embody the distribution of intensity of the original signal amongst the basic frequency 1/T and its harmonics 2/T, 3/T, ..., i.e. the set of complex numbers {Dj} constitutes a spectral representation of d(t). Plotting Dj against j (Fig.4) shows that a periodic function has a discrete spectrum. Equally well, and this will be used later, a discrete function has a periodic spectrum.

3.2. The Fourier Transform

Whilst Fourier series are applicable only to the spectral analysis of periodic signals we are primarily concerned with aperiodic signals. One way to deal with such functions is to consider them periodic with a period T which tends to infinity. Such a treatment leads naturally to the derivation of the Fourier transform.(16,17) Starting from eqn (5) and multiplying both sides by T we get

T D_j = \int_{-T/2}^{+T/2} d(t) \exp(-i 2\pi j t / T) \, dt .   (6)

Now the frequency corresponding to the basis function exp(-i2πjt/T) is j/T, and as T becomes arbitrarily large so the spacing between the frequencies j/T and (j+1)/T becomes arbitrarily small and the frequency approaches a continuous variable ν. Similarly the fundamental frequency 1/T becomes a differential frequency dν. Equation (6) thus becomes

\lim_{T \to \infty} T D_j = \int_{-T/2}^{+T/2} d(t) \exp(-i 2\pi \nu t) \, dt ,   (7)

and the right-hand side of eqn (7) is defined as the Fourier transform of d(t), denoted by D(ν), i.e.

D(\nu) = \int_{-\infty}^{+\infty} d(t) \exp(-i 2\pi \nu t) \, dt .   (8)


Consider now the effect of letting T tend to infinity on eqn (3), which can be rewritten as follows:

d(t) = \sum_{j=-\infty}^{+\infty} T D_j \exp(i 2\pi j t / T) (1/T) .   (9)

As T tends to infinity the summation in eqn (9) becomes an integral, and TDj and 1/T may be replaced by D(ν) and dν respectively, yielding the inverse transformation

d(t) = \int_{-\infty}^{+\infty} D(\nu) \exp(i 2\pi \nu t) \, d\nu   (10)

which allows the original signal d(t) to be recovered from its spectrum D(ν). Since eqns (8) and (10) are rather unwieldy it is convenient to employ the concept of a transformation operator F, where D(ν) = F-[d(t)] with the subscript indicating the minus-i transform(18) of eqn (8), and d(t) = F+[D(ν)] with the subscript now indicating the positive sign of the exponential in eqn (10). In comparison to the periodic signal, the spectrum D(ν) of an aperiodic function is continuous.

3.3. The Two Domains

It was realised quite early on that the spectrum obtained in continuous wave mode and the response to a pulse were related to one another by a Fourier transform,(19) eqn (8), where D and d are continuous functions of frequency and time respectively. D and d form what is known as a transform pair, and although D and d may have different functional forms each has the same information content.

There are manifold reasons for going to the trouble of applying eqn (8) to a function d(t). For instance the mathematical operations to which d(t) is to be subjected may be less formidable in the sister domain; a notable example of this is convolution, which reduces to mere multiplication in the transform domain. But the application with which we are concerned here is spectral analysis, the purpose being to extract from the signal d(t) information concerning the natural resonance frequencies of the spin system. In practice the integration in eqn (8) is not carried out in an analogue manner on a continuous signal d(t). Instead, as described in Section 2, the FID is sampled and stored in the form of a vector {dk}, and this finite sequence of values is the starting point for the spectral analysis. The practical application of this method was demonstrated first by Ernst and Anderson,(13) but only after a fast method of carrying out the transformation of a discrete time sequence had become available.(20)

3.4. Discrete Fourier Transformation

The discrete Fourier transform (DFT) {Dj} of a sequence {dk} of K terms is defined by

D_j = \sum_{k=0}^{K-1} d_k \exp(-i 2\pi j k / K) .   (11)

Usually the sequence {dk} represents K consecutive samples d(kΔt) of a continuous signal d(t), and the sequence {Dj} represents K consecutive samples D(jΔν) in the frequency domain. Equation (11) can be seen as a discrete analogue of the Fourier integral (8), and its derivation is superbly described by Brigham.(21)
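Equations (1) and (11) can be exercised together in a few lines. The sketch below (illustrative values chosen here, not code from the review) synthesises a noise-free FID whose offset frequencies lie exactly on the sampling grid and then evaluates eqn (11) directly:

```python
import cmath

def dft(d):
    # Eqn (11): D_j = sum_k d_k exp(-i 2*pi*j*k/K)
    K = len(d)
    return [sum(d[k] * cmath.exp(-2j * cmath.pi * j * k / K) for k in range(K))
            for j in range(K)]

# Build a noise-free FID from eqn (1): two isochromats with amplitudes X_j
# at offset frequencies nu_j = j/(K*dt) that fall exactly on the grid.
K, dt = 64, 0.01                 # number of samples and dwell time (illustrative)
lines = {5: 1.0, 12: 0.5}        # {grid index: amplitude} (illustrative)
fid = [sum(X * cmath.exp(2j * cmath.pi * (j / (K * dt)) * k * dt)
           for j, X in lines.items())
       for k in range(K)]

spec = dft(fid)
# Each on-grid frequency delivers all its intensity to a single point:
# |D_j| = K * X_j at index j, and (numerically) zero elsewhere.
print(abs(spec[5]) / K, abs(spec[12]) / K)
```

Because each frequency coincides with a grid point j/(KΔt), the whole intensity of every isochromat lands in one Dj; Section 4.2 deals with what happens when this is not the case.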


An important attribute of the DFT depends on the cyclic properties of the basis functions exp(-i2πjk/K). Let j in eqn (11) lie outside the range 0 to K-1, i.e. j = mK + n where m ≠ 0 and 0 ≤ n < K; then

\exp(-i 2\pi j k / K) = \exp(-i 2\pi m K k / K) \exp(-i 2\pi n k / K)
                      = \exp(-i 2\pi m k) \exp(-i 2\pi n k / K)
                      = [\exp(-i 2\pi)]^{mk} \exp(-i 2\pi n k / K)
                      = \exp(-i 2\pi n k / K) .   (12)

Substituting eqn (12) into eqn (11) we see that D_j = D_n. Thus there are only K independent values in the sequence {Dj}. This means that the spectrum of a discrete time series is periodic. The consequences of this will be dealt with in the next section. Another property of the DFT depends on the orthogonality of the basis functions:

K^{-1} \sum_{m=0}^{K-1} \exp(-i 2\pi j m / K) \exp(i 2\pi k m / K) = \delta_{jk} .   (13)

Equations (11) and (13) together lead to the definition of an inverse transform

d_k = K^{-1} \sum_{j=0}^{K-1} D_j \exp(i 2\pi j k / K) .   (14)
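The transform pair of eqns (11) and (14) can be verified directly. The sketch below (stdlib Python only; data and sizes are illustrative) round-trips random complex data and also shows how a negative offset frequency appears at the top of the index range:

```python
import cmath, random

def dft(d):
    # Eqn (11)
    K = len(d)
    return [sum(d[k] * cmath.exp(-2j * cmath.pi * j * k / K) for k in range(K))
            for j in range(K)]

def idft(D):
    # Eqn (14): d_k = K^-1 sum_j D_j exp(+i 2*pi*j*k/K)
    K = len(D)
    return [sum(D[j] * cmath.exp(2j * cmath.pi * j * k / K) for j in range(K)) / K
            for k in range(K)]

random.seed(0)
K = 32
d = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(K)]
d2 = idft(dft(d))
# The round trip recovers the data to machine precision.
print(max(abs(a - b) for a, b in zip(d, d2)))

# Indices j > K/2 correspond to negative frequencies: a signal with one
# negative cycle per record appears at index K-1, not at -1.
sig = [cmath.exp(-2j * cmath.pi * k / K) for k in range(K)]
D = dft(sig)
print(abs(D[K - 1]) / K)
```

The second half of the sketch is the index bookkeeping discussed below: choosing 0 ≤ j < K simply wraps the negative-frequency half of the spectrum to the upper indices.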

If eqn (14) is compared with eqn (2) it can be seen that they have the same form; apart from a numerical factor it only needs that we set Δν = 1/(KΔt) and they are identical. This demonstrates that the spectrum represented by the vector {Dj} may be obtained from the FID {dk} by discrete Fourier transformation, and when we do that the frequency sampling rate is the inverse of the product KΔt. Another point to be noted here can be seen if we again apply the orthogonality relations eqn (12), but this time to eqn (14). Then it becomes apparent that only K independent data dk are defined by the transform, i.e. the data are periodic!

Although the two transforms defined by eqns (8) and (11) appear similar there are some significant differences in interpretation. Of practical interest is the relationship between index j and frequency ν. Whereas ν is a continuous variable ranging from -∞ to +∞, j is discrete and only K values of Dj are defined by eqn (11). Since {Dj} is thus periodic we are free to choose the range of the K values, and whilst the values 0 ≤ j < K are convenient, the set -K/2 < j ≤ K/2 is equally valid and has the advantage that the index corresponds directly to a frequency ν between -νN and +νN, where νN is the Nyquist frequency. If the range 0 ≤ j < K is chosen it should be understood that the values of Dj for j > K/2 correspond to continuous transformation with ν < 0; this is treated in greater detail by Brigham.(21)

Discrete Fourier transformation

is no longer carried out by blind application of eqn (11); this would require far too many computational operations. Instead a legion of so-called "fast" transforms (FFTs) is available which mightily reduce the effort involved and which are all based on a very simple idea. Equation (11) can be expressed in vector notation as

D = W d   (15)

where D = {Dj}, W is a K×K matrix having elements W_{jk} = [\exp(-i 2\pi / K)]^{jk} and d = {dk}. The trick behind the fast transforms is to express W as a product of matrices, each having a large number of zero elements. This drastically reduces the number of


non-trivial multiplications and may be done if K fulfils certain criteria; for instance K = 2^n or K = 4^n, where n is an integer, leads to a radix-2 or radix-4 Cooley-Tukey(20) algorithm. This interesting technical subject is fully explained by Brigham and by Elliott and Rao.(22)

4. ATTRIBUTES OF FOURIER TRANSFORMATION

The foregoing treatment has described the DFT as applied classically to the conversion of an FID to a spectrum, and in the following I set out some of the consequences of the use of this method. These can be divided into advantageous and disadvantageous aspects.

4.1. Advantageous Properties of the DFT

One positive aspect of the DFT as implemented using a fast algorithm is its speed. A radix-2 computation of a K-point DFT requires 2K log2 K - 7K + 12 non-trivial real multiplications and 3K log2 K - 3K + 4 real additions if complex multiplications are done with 4 real multiplications and 2 real additions.(22) A radix-4 computation is faster still, needing only (3K/2) log2 K - 5K + 8 multiplications and about (11K/4) log2 K real additions. Direct computation using eqn (11) needs 4(K - 1)^2 non-trivial real multiplications and about the same number of real additions.

A further advantage of the DFT is its easy modification to allow the treatment of multidimensional data. The discussion so far has concerned one-dimensional (1D) data and therefore 1D spectra. By this is meant that the data vector {dk} and the spectral vector {Dj} are both functions of one index only. But for over a decade the technique of two-dimensional (2D) NMR has been known and its importance seems to grow from year to year.

Without going into the details of the methods we can define them as experiments which acquire a signal as a function of two times t1 and t2; thus the data to be stored {d_{lk}} can be regarded as a function of two indices and the spectral presentation {D_{ij}} of the information is also a function of two indices. Just as the 1D NMR spectrum is related to the 1D FID by eqn (8), so is the 2D spectrum given by a similar expression(15)

D(\nu_1,\nu_2) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} d(t_1,t_2) \exp(-i 2\pi \nu_1 t_1) \exp(-i 2\pi \nu_2 t_2) \, dt_1 dt_2 ,

and this can be rewritten as

D(\nu_1,\nu_2) = \int_{-\infty}^{+\infty} \Big[ \int_{-\infty}^{+\infty} d(t_1,t_2) \exp(-i 2\pi \nu_2 t_2) \, dt_2 \Big] \exp(-i 2\pi \nu_1 t_1) \, dt_1 ,   (16)

so that the transformation can be completed in two steps: firstly an integration over t2 and then over t1.

The discrete analogue of eqn (16) is

D_{ij} = \sum_{l=0}^{L-1} \Big[ \sum_{k=0}^{K-1} d_{lk} \exp(-i 2\pi j k / K) \Big] \exp(-i 2\pi i l / L) ,

and a 2D DFT can be regarded as two sequential sets of 1D DFTs; the first set operates on the columns of {d_{lk}} to produce an intermediate matrix and the second set operates on the rows of this to yield the {D_{ij}}.

This separation allows the implementation of fast algorithms based on standard 1D transforms.(21) Although other schemes claiming greater computational efficiency are possible,(23,24) they have not yet found application in NMR.
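The two-step recipe can be sketched as follows (an illustrative implementation written for this review's equations, not the author's own code): the 1D transform of eqn (11) is applied first along one time axis and then along the other:

```python
import cmath

def dft(d):
    # Eqn (11) applied to one row or column
    K = len(d)
    return [sum(d[k] * cmath.exp(-2j * cmath.pi * j * k / K) for k in range(K))
            for j in range(K)]

def dft2(d):
    # Two sequential sets of 1D DFTs: first each row (t2 axis),
    # then each column (t1 axis) of the intermediate matrix.
    step1 = [dft(row) for row in d]
    cols = list(zip(*step1))                    # transpose
    step2 = [dft(list(col)) for col in cols]
    return [list(row) for row in zip(*step2)]   # transpose back

# A single 2D complex exponential lands in a single element of {D_ij}.
L, K = 8, 8
p, q = 2, 3                                     # frequency grid indices (illustrative)
d = [[cmath.exp(2j * cmath.pi * (p * l / L + q * k / K)) for k in range(K)]
     for l in range(L)]
D = dft2(d)
print(abs(D[p][q]) / (L * K))  # 1.0: all intensity in one point
```

The separability is what makes the 2D case cheap: an L×K data matrix needs only L transforms of length K plus K transforms of length L, each of which can be a standard fast 1D routine.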

4.2. Disadvantageous Properties of the DFT

One disadvantage associated with using an FFT of any form is that K is restricted to a limited set of values. Thus the radix-2 algorithm can be implemented only when K is a power of 2, and in the radix-4 case K must be a power of 4. In general the fast algorithms rely on K being highly composite, and the simplest implementations are found when K is the power of an integer.(22) Although this restriction applies only if the DFT is determined using a fast algorithm, the sole advantage of the 1D DFT is its speed when the full factorizability of the transform matrix W is taken into account. Thus we may safely assume that this limitation is always with us, for to calculate the DFT using eqn (11) in a straightforward manner is not feasible. Since in many cases the measured sequence {dk} may not contain such a special number of elements we are forced to do one of two things in order to apply the DFT: either we delete some of the elements, thus rounding K down to a nice number, or we can add elements, thus rounding K up. Both alternatives, forced upon us by the DFT, are bad. In the former case measured data are ignored, and in the latter unmeasured data are included.

Another feature of the restraints on the length K of the sequence {dk} can be seen by considering eqns (11) and (12), which demonstrate that the spectral vector {Dj} contains only K independent elements. Furthermore, comparison of eqn (14) with eqn (2) yields the frequency sampling rate or spectral point resolution as Δν = (KΔt)^{-1} Hz. If we want a higher resolution in the frequency domain, which is equivalent to having more points, then we must interpolate. The easiest way to do this is to "zero-fill" in the time domain,(14,15) i.e. extend the data record to a larger number than K samples by appending zeros before transformation. Whilst this technique increases the point resolution in the spectrum, it fails to make the lines any sharper because the frequency resolution is inversely proportional to the time duration of the signal. It can, however, clear up uncertainty as to how many lines are present in the spectrum, as demonstrated in Fig.5.

FIG. 5. Improvement in spectrum appearance achieved by zero-filling. The data are transformed using a 16-point FT (a) and it is uncertain how many spectral lines are present; by zero-filling to 32 points (b) or to 64 points (c) the indeterminacy is cleared up. (The real parts of the transforms are shown.)
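The effect on point resolution is easy to reproduce. The sketch below (invented line positions and decay constant; the 16-to-64-point numbers merely echo the figure) zero-fills a 16-point record to 64 points, quadrupling the number of spectral points without changing the information content:

```python
import cmath

def dft(d):
    # Eqn (11)
    K = len(d)
    return [sum(d[k] * cmath.exp(-2j * cmath.pi * j * k / K) for k in range(K))
            for j in range(K)]

# A decaying FID with two close frequencies, truncated after 16 samples.
K, dt = 16, 1.0
nu1, nu2, T2 = 0.20, 0.26, 10.0   # Hz and decay constant (illustrative)
fid = [(cmath.exp(2j * cmath.pi * nu1 * k * dt) +
        cmath.exp(2j * cmath.pi * nu2 * k * dt)) * cmath.exp(-k * dt / T2)
       for k in range(K)]

spec16 = dft(fid)                  # 16 points: delta_nu = 1/(16*dt)
spec64 = dft(fid + [0.0] * 48)     # zero-filled to 64: delta_nu = 1/(64*dt)

# Four times as many spectral points, but the line width is still set by
# the 16*dt record length, so the lines themselves are no sharper.
print(len(spec16), len(spec64))
```

Note that the appended zeros contribute nothing to any Dj sum; the interpolation comes entirely from evaluating the same finite record on a finer frequency grid.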

In general zero-filling is applied to improve the appearance of NMR spectra at very little extra computational cost, although care must be taken when the FID has not decayed to zero before the end of the measurement period. This topic shall be considered next.

A major handicap of the DFT is encountered in the treatment of short signals. Normally in NMR the FID decays exponentially with a time constant T2, so that if the sequence {dk} is acquired over a sufficiently long time the information-bearing part of the signal will have vanished. Should this not be so then the spectrum obtained by DFT will be distorted. To see this more clearly we must go back to the continuous FT given by eqn (8). The integral in eqn (8) runs from -∞ to +∞, but our measurement of the signal


which starts at t = 0 must necessarily cease at a finite time t = t0, so that the measured signal is the measurable signal multiplied by a rectangle function. Take as an example the complex exponential whose real part is illustrated in Fig.6a. The corresponding transform using eqn (8) is the delta function of Fig.6b. If the signal d(t) of Fig.6a is multiplied by the rectangle function
r(t) of Fig.6c, the result transforms according to

F_-[d(t) r(t)] = F_-[d(t)] \otimes F_-[r(t)] = D(\nu) \otimes R(\nu) ,

where the notation D(\nu) \otimes R(\nu) is defined by

D(\nu) \otimes R(\nu) = \int_{-\infty}^{+\infty} D(f) R(\nu - f) \, df .


FIG. 6. Time domain truncation and its effects on the continuous Fourier transform. A time-unlimited complex exponential (a) yields a delta function (b) on transformation. A rectangle (c) transforms to a "sinc" function (d). Multiplying (a) and (c) gives (e), which transforms to (f), the convolution of (b) and (d).


Thus

D'(\nu) = F_-[d'(t)] = D(\nu) \otimes R(\nu) .

The transform R(\nu) of the rectangle function, whose real part is shown in Fig.6d, can be

seen to have significant sidelobes, and therefore its convolution with the delta function of Fig.Gb leads to a continuous spectrum D’(V) (Fig.Gf) which also displa+ sidelobes. Before returning to the DFT it is instructive to investigate this distortion a little more closely. Although the measured signal runs only from t - 0 to t - to, the integral in eqn (8) runs relentlessly from - 0~ to + OD,and the sidelobes in Fig.Gf can be seen as corrections to the original frequency of Fig.Gb necessary to accommodate the discontinuity at to. From this standpoint it can be easily understood why the normal NMR FID which decays to zero within the measurement time to and thus shows no such discontinuity in the time domain is not distorted on transformation.
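The truncation effect can be reproduced numerically. The sketch below (plain Python, illustrative values only) transforms a complex exponential that completes an integral number of periods in the window, and the same signal shifted half a period off that grid; the latter has a discontinuity on periodic extension, and its intensity is spread over many points of the discrete spectrum.

```python
import cmath

def dft(x):
    """Plain O(K^2) discrete Fourier transform, eqn (11) style."""
    K = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / K) for k in range(K))
            for j in range(K)]

K = 64
# Exactly 10 periods fit the window: no discontinuity on periodic
# extension, so a single spectral point carries all the intensity.
ongrid = [cmath.exp(2j * cmath.pi * 10 * k / K) for k in range(K)]
# Half a period "off grid": the implicit truncation produces sidelobes
# spread over the whole spectrum.
trunc = [cmath.exp(2j * cmath.pi * 10.5 * k / K) for k in range(K)]

clean = sum(1 for X in dft(ongrid) if abs(X) > 1.0)
spread = sum(1 for X in dft(trunc) if abs(X) > 1.0)
print(clean, spread)   # one significant point vs many
```
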


FIG.7. (a) Fourier transformation of a measured data vector containing an integer number of periods of a frequency component leads to exact specification of this component. (b) Should this not be the case, "leakage" results and the neighbouring components become non-zero.

The translation of this effect to the DFT is relatively simple. Equation (11) forms the basis by which a discrete signal can be analysed into its Fourier components exp(−i2πk/K) and harmonics. If the K measured values of the signal cover an integral number of periods then the harmonic analysis using eqn (11) leads to one D_j being non-zero and the rest, corresponding to the zero crossing points of the continuous Fourier transform, being zero (Fig.7a). However, if this is not the case there is no one D_j which describes the signal exactly, and the intensity is shared or "leaked" into neighbouring D_j's (Fig.7b). Zero-filling won't help here; indeed it makes matters worse, since the spectrum approaches the continuous transform (Fig.6f). The only remedy within the boundaries set by the DFT is to modify the original FID in such a way that it gradually goes to zero at t = t_0. This operation of apodisation(14,15) or windowing, which is necessary to reduce the sidelobes, always broadens the signals, sometimes significantly, and thus lowers the resolution of the spectrum. It was to get around this intrinsic problem of short signals that the "modern" methods of spectral estimation were introduced, and it is to be expected that they will have their maximum impact in this area. This is particularly true in 2D spectroscopy, where the length of an experiment is directly linked to the number of points acquired in t_1.

The spectral phase presents yet another problem. In the one-dimensional case the discrete transform of {d_k} is the vector {D_j} whose elements are, in general, complex with non-zero imaginary components. The only exception is if the periodic extension of the time data is Hermitian about t = 0, i.e. unless d_k = d*_{−k} or unless d_k = d*_{K−k}, where K is the vector length used in the DFT. This can easily be seen from eqn (14) by setting D_j real. Since this latter situation is hardly ever encountered in practice, the normal result of applying the 1D DFT to an FID is to obtain a complex spectrum whose real and imaginary parts do not correspond to the absorption and dispersion components of the spin system.


This hurdle can be overcome in two ways. Firstly the real and imaginary parts may be combined to give a power spectrum or its square root; this is very easy but leads to a spectrum which is always positive and in which either the intensities are distorted (power spectrum) or the line shapes are distorted (magnitude spectrum). Secondly the real and imaginary components may be combined using two phase parameters φ_0 and φ_1 to produce a corrected spectrum whose real part is pure absorption and whose imaginary part pure dispersion. For 1D spectra this presents no insurmountable barrier: either the two phase parameters may be supplied by the spectroscopist using an interactive display, or an automatic phasing routine can be used. But for the 2D spectrum a novelty is introduced, namely that it may be impossible to select phase parameters such that the lines are everywhere in pure absorption. This arises whenever the signal is phase modulated in t_1, and the result is the so-called "phase-twist" lineshape caused by the superposition of double absorption and double dispersion lines. In coherence transfer experiments one way to provide quadrature detection in ν_1 is to add the results of two experiments which differ in the phases of some of their pulses,(29) thus selecting either the p- or n-type peaks.(30) This addition certainly allows discrimination between positive and negative frequencies in ν_1, but brings about the abovementioned phase-twist, which has traditionally been counteracted by displaying the spectrum in magnitude mode (square root of the power spectrum). This has two disadvantages. Firstly the widespreading dispersion components reduce the spectral resolution, although clever application of weighting functions can ameliorate this. Secondly the information inherent in antiphase multiplets is inexorably lost in the magnitude mode. Indeed, if the point resolution is low even the correct multiplicity is lost. This can be overcome if the results of the two experiments are not added but kept separate(29) and are subjected to a "hypercomplex" Fourier transformation, which yields a set of four spectral components (real-real, real-imaginary, imaginary-real, imaginary-imaginary). The final phasing of the 2D spectrum requires the selection of the correct linear combination of these four components, which is no easy task and is usually achieved by phasing slices through the spectrum.(29) Even when this is accomplished and the phase-twist is removed, the lines are not all in the convenient double absorption mode. Specifically, in a COSY experiment for example, one may choose cross peaks to be in double absorption, in which case the diagonal peaks are in double dispersion form, or one can make the opposite choice. The spectrum still has peaks with dispersion characteristics, with all their disadvantages. At least one of the "modern" methods can be used to avoid the hypercomplex transform and produce pure absorption lineshapes everywhere with automatically correct phases, as we shall see in Section 7.

A further prerequisite of using eqn (11) to determine the spectrum {D_j} is that all the elements of {d_k} must have been measured and are available. The vector {d_k} corresponds to a uniform sampling of the FID at time intervals Δt. Should any of the elements not be measured then eqn (11) cannot be applied; we cannot just set these elements to zero. This is usually no problem in 1D NMR: the FID is there anyway and we might just as well measure all the d_k.
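A zero- and first-order phase correction of the kind just described can be sketched in a few lines (plain Python; the linear-in-frequency convention below is an assumption, and spectrometer software may parameterise φ_1 differently):

```python
import cmath

def phase_correct(spec, phi0, phi1):
    """Rotate point j of a complex spectrum by phi0 + phi1*(j/N) radians.
    (One common convention; the phi1 range differs between packages.)"""
    N = len(spec)
    return [s * cmath.exp(-1j * (phi0 + phi1 * j / N)) for j, s in enumerate(spec)]

# Synthetic spectrum: a pure-absorption Lorentzian line in the real part.
N = 128
spec = [complex(1.0 / (1.0 + ((j - 64) / 4.0) ** 2), 0.0) for j in range(N)]

# Mis-phase it, as an uncorrected transform would be, then undo the damage.
bad = phase_correct(spec, -0.7, -1.3)      # dispersive admixture appears
fixed = phase_correct(bad, 0.7, 1.3)       # ... and is removed again
err = max(abs(f - s) for f, s in zip(fixed, spec))
print(err)
```
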
But for 2D NMR the number of measurements in t_1 should be kept as low as possible to reduce the length of the experiment. Application of the DFT means that the resolution in ν_1 is limited by the restriction that the samples be measured at uniform time intervals. If we remove this restriction then the resolution is limited by the maximum value of t_1 used, and one could conceive an experiment where the successive values of t_1 follow another rule than mere contiguous integer multiples of Δt.

Even in 1D NMR the inclusion of all the elements of an FID might be detrimental to the spectral form. For instance if the first few points are corrupted by pulse breakthrough or acoustic ringing then the spectrum usually shows severe "baseline roll". Setting these initial defective points to zero rarely alleviates the problem, and ignoring them by starting the measurement of the FID later introduces extreme phase problems. These difficulties can be avoided by using better conceived spectral analysis methods.

Finally, the data are usually corrupted by noise. Thus the data {d_k} which are measured consist of a part {x_k} which comes from the spin system, the signal, plus a part {n_k} which comes from the electronic components, the noise. (This noise need not be additive; it could be multiplicative noise or phase noise, but this introduces unwarranted complications.) The distinction between signal and noise lies physically in the desirability of the signal and the undesirability of the noise, and mathematically in that their autocorrelation functions(35) are different. Because the Fourier transform is a linear operation, so that F[x + n] = F[x] + F[n], the noise necessarily appears in the frequency domain after transformation, and any method designed to suppress the noise would distort the signal.

5. THE PROBLEM RESTATED

We have seen that many obstacles are encountered when a spectrum {D_j} is obtained by discrete Fourier transformation of the data vector {d_k}. These obstacles cannot be surmounted without sacrificing something valuable, be it resolution or sensitivity. One way out of this dilemma is to attempt to counteract the two main evils, namely insufficient data and noisy data, by accepting their existence and trying to determine the underlying structure of the data by a predictive technique whose main aim is to extend the length of the "data" before transformation, but which should also be able to emphasise the important spectral components and suppress the noise. This approach is explained in the next section and relates to the statistical methods mentioned in the introduction which have the logical direction "time-series → spectrum".

An alternative approach, whose logical direction is exactly the opposite, is made possible if we free ourselves of the notion of having to work deductively which is imposed on us by applying eqn (11), for this equation says nothing else than that the spectrum can be deduced from the data: leaving no room for doubt about whether the measured d_k are noisy or not, leaving no room for doubt about values of d_k that might have been measured, and leaving no room whatsoever to include valid information in the structure of D_j. Equation (11) is too rigid: it makes no distinction between signal and noise and it treats the artificial boundaries imposed on the data by the limitations of time as an inherent component of the data themselves. What is needed is a certain sloppiness, a form of give-and-take between spectrum and data. This liberalization cannot go too far, however, otherwise chaos would be the sole outcome of any analysis. Thus a strong principle is needed to hold the spectrum in check even when the data constraints are relaxed. Several alternatives are conceivable and are discussed in Section 7.

6. MODELLING

This section is based on a very unpretentious idea which is perhaps best illustrated by a simple example. A signal represented by a sine wave x(t) = sin(ωt) is measured at K intervals of time to give the data vector {x_k}, k = 0, ..., K − 1 (Fig.8). We have seen that if the data vector is not very long then {X_j}, the Fourier transform of {x_k}, exhibits artefacts

(Fig.6f) caused by the truncation of {x_k}. These artefacts could be reduced if the data vector could be extended in a sensible way. A reasonable question is thus 'is it possible to estimate or predict the value of x_K given all the previous values of {x_k}?'. The answer for a single sine wave is obviously 'yes', and a good "guesstimate" would be the point marked "•" in Fig.8. Obviously if this works for the Kth point we could extend the procedure to the (K + 1)th point and so on, until the new data vector has sufficient elements to allow Fourier transformation to a spectrum devoid of artefacts. Indeed, if we use this method then the Fourier transformation step is unnecessary, since the extension of the data series has used information inherent in the original K elements, and the extended points are redundant. We can just as well use the information obtained from the original elements to form a spectrum.

FIG.8. A sine wave measured at the points marked "·". Is it possible to predict the point marked "•"?
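For the sine wave of Fig.8 the prediction is in fact exact. A sketch (plain Python, arbitrary illustrative values) using the standard two-term recurrence satisfied by any sampled sinusoid:

```python
import math

# A sampled sine obeys the exact two-term recurrence
#   x_k = 2*cos(w*dt)*x_{k-1} - x_{k-2},
# so the "next" point is fully determined by the previous two.
w, dt = 2.0, 0.1
x = [math.sin(w * dt * k) for k in range(20)]        # the measured points

predicted = 2.0 * math.cos(w * dt) * x[-1] - x[-2]   # guess for the 20th point
actual = math.sin(w * dt * 20)
err = abs(predicted - actual)
print(err)   # essentially zero: the recurrence is exact
```
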

The salient point to be made here is that we may have enough information, even in a very short time series, to allow good estimation of the underlying frequencies. The problem is to extract and use this information. This leads us to a discussion of how to represent a time series so that its information content may be parameterised. The concept behind this approach is that a particular time series may be considered as a representative of a class of functions, the model, and a set of parameters which have to be estimated. For instance the model in Fig.8 is obviously one sine wave, and the information contained in the data vector {x_k} may be concentrated into the single parameter ω. If we are to treat a time series correctly the model must be chosen with care, for otherwise the number of parameters becomes large and some of the parameters may become spurious, particularly if we have to contend with noise. There seem to be two main paths we can follow: either we could treat each sequential datum as being linked in some way to its neighbours, which leads to the concept of autoregressive modelling, or the underlying structure of the data could be introduced at the very start, which leads to the less flexible but better tailored approach of functional modelling.

6.1. Autoregressive Methods and Linear Prediction

Since the time series we are dealing with are deterministic it seems reasonable to suppose that any one measured point x_k may be expressed as a function of the preceding points, where the notation x_k is used to emphasise that I am considering, for the moment, time series with no additive observation noise. The simplest function which can be chosen is linear in the x_{k−m}; thus we could predict x_k by the relationship

\[
x_k = -\sum_{m=1}^{M} a_m x_{k-m} + e_k , \qquad (17)
\]

where e_k (sometimes called the innovation) is the kth element of a zero-mean white noise vector {e_k} and the minus sign is chosen for later convenience. This method of treating a time series may be called Linear Prediction (LP) for obvious reasons, and the idea of representing a time series by eqn (17) goes back to Yule,(36) who reasoned that the periodicities hidden in the time series are disturbed by noise, and that if their number M is known it suffices to determine the coefficients a_m; the spectrum may then be obtained as shown in the next section. The method to solve for the coefficients is regression analysis, and because the series is regressed onto its past the procedure is alternatively known as an Autoregressive (AR) model. It forms one part of a more general treatment in which the time series is given by

\[
x_k = -\sum_{m=1}^{M} a_m x_{k-m} + \sum_{n=0}^{N} b_n e_{k-n} , \qquad (18)
\]

where the original AR model of eqn (17) has been extended to include a second summation over the noise vector. The second summation can be regarded as an averaging of the individual elements of {e_k} and is thus called a Moving Average (MA) model. The whole model of eqn (18) is thus known by the acronym ARMA. I shall not, however, consider MA models or the more general ARMA treatment. Firstly, they have not yet been used in NMR; secondly, the MA approach is not too well suited to the determination of spectra consisting of sharp lines, as shall become evident in the next section; and thirdly, the algorithms for AR parameter estimation are considerably more efficient than those for MA or ARMA models.

Let us go back to the AR model and consider eqn (17) more closely. The system described by this equation can be portrayed by the diagram in Fig.9, where the kth element of the series is given as the sum of the innovation e_k and the preceding M elements of the series multiplied by the appropriate weights −a_m. It is obvious that we are dealing with a system involving feedback, including feedback of the previous innovations, and, as pointed out by Yule(36) and implicit in his derivation, this is not the same as treating the series as consisting of a set of natural frequencies perturbed by additive white noise.

FIG.9. The feedback scheme of an AR process. The kth element of a time series is made up of the sum of the innovation e_k and the previous M elements of the series x_{k−m} multiplied by the appropriate coefficients −a_m.

This is an important point to bear in mind, for the innovation e_k in the AR model is part of the model and cannot be interpreted as additive noise. The aim, though, of spectral analysis is to extract the deterministic part of the signal, and whilst such a signal may be well represented by a model having relatively few parameters a_m and a large random component {e_k}, an alternative with a greater number of parameters and a smaller noise component is obviously to be preferred. It is in this sense that we employ Linear Prediction: a good predictor being one which predicts x_k as exactly as possible, which means that e_k should be a minimum. This is indeed the basis of several treatments (for instance Burg's method), but this again takes no account of the actual presence of observation noise.

The derivation of the AR equations also rests heavily on this assumption that no observation noise is present, so that the presence of such noise will reduce the applicability of these equations. We shall see in Section 6.1.8 that the correct treatment of observation noise leads to an ARMA model which, as pointed out before, greatly increases the computational effort. Be this as it may, there are two problems to be solved in using the AR method. On the one hand the coefficients a_m have to be found, and then they must be used to calculate a spectrum. Since there are many distinct ways of determining the coefficients but only a few basic formulae for their conversion into a spectrum I shall treat the latter problem first.

6.1.1. The Spectrum of a Predictive Time Series. The discrete measurement of the elements of the time signal {x_k} enforces a periodicity on the continuous spectrum X(ν) and allows a series expansion of the spectrum in orthogonal functions exp(−i2πkνΔt) similar to that of eqn (3). Thus

\[
X(\nu) = \sum_{k=-\infty}^{+\infty} x_k \exp(-i 2\pi k \nu \Delta t) , \qquad (19)
\]

where 1/Δt is the frequency period in Hz. This relationship assumes that the measured values of the time series are available for the whole range −∞ ≤ k ≤ +∞, a condition which is obviously not fulfilled; but if the elements of {x_k} are related by eqn (17) and if the coefficients are known then it is a

simple matter to predict the unmeasured values x_k and to use these instead. This method of data extension could certainly be used, but it is equivalent to a more elegant method which allows one to analyse its drawbacks with greater rigour. It is expedient at this stage to introduce the compact z transform notation (see Appendix B), and by setting z = exp(i2πνΔt) and substituting eqn (17) into eqn (19) we get

\[
X(z) = \sum_{k=-\infty}^{+\infty} \Bigl( -\sum_{m=1}^{M} a_m x_{k-m} + e_k \Bigr) z^{-k}
     = -\sum_{m=1}^{M} a_m z^{-m} X(z) + E(z) ,
\]

and thus, by rearranging and setting a_0 = 1,

\[
X(z) = E(z) \Big/ \sum_{m=0}^{M} a_m z^{-m} , \qquad (20)
\]

so that the spectrum of the time series is the spectrum of the noise E(z) divided by a polynomial in z which must be evaluated on the unit circle since z = exp(i2πνΔt). Now eqn (20) has two important features. Firstly, the noise appears exclusively in the numerator, so that at first glance it seems that we have accomplished a perfect separation of signal and noise. Indeed spectra estimated by such methods do show an almost noiseless baseline. This is because the noise spectrum in eqn (20) corresponds to the innovation {e_k}, which is assumed white; thus the spectrum E(z) may be replaced by a constant. This should be borne in mind if the intensities of different spectra are to be compared; we shall return to this topic later when we consider the merits of these methods. The second feature of eqn (20) is the polynomial in z which appears in the denominator. This polynomial has at most M roots scattered across the complex z plane, and these correspond to the singularities or "poles" of X(z) where this function becomes infinite. This aspect of the AR spectra leads to the alternative name of an "all-pole" model, and also accounts for the success of this approach for the representation of sharp-lined spectra. The poles are important from another point of view, namely that the spectrum X(z) is determined on the unit circle in the z plane, so that for stability (see Appendix B) the poles must all lie within the unit circle. It can be shown that an MA process has a spectrum defined by an expression similar to eqn (20), but having the polynomial in the numerator, and thus the MA approach leads to an "all-zero" model. Such a model is, as pointed out previously, not well adapted to the requirements of high-resolution NMR where the spectra contain sharp peaks, although their effect can be taken care of by drastically increasing the model order M. Further discussion of these points and of ARMA models in general can be found in the reviews by Cadzow, Kay and Marple, and Ulrych and Ooe, as well as the recent book by Marple.(38) In the main body of the literature relating to spectral analysis the emphasis is always on the estimation of the power spectrum, which is easily obtained from eqn (20), so that the function of interest is |X(ν)|²:

\[
P(\nu) = |X(\nu)|^2 = \sigma^2 \Big/ \Bigl| \sum_{m=0}^{M} a_m \exp(-i 2\pi m \nu \Delta t) \Bigr|^2 , \qquad (21)
\]

where σ² is the noise variance or expectation value of the power of the noise. The standard derivation of eqn (21) can be found in any of the above works, which deal
exclusively with power spectra. The NMR spectroscopist, however, chooses the power spectrum only under protest: on the one hand important information is lost by ignoring the phase relationships of the various signals, whilst on the other hand the intensity distribution is distorted. It is thus of interest to see if a time series representable by eqn (17) may be converted to a phased spectrum. In general one could simply apply eqn (20) were it not for the fact that in its derivation the implicit assumption has been made that the time series extends from −∞ to +∞. We, however, are concerned solely with causal signals for which x_k = 0, k < 0.
These problems are not so serious for the determination of power spectra, since an equivalent representation of P(ν) is the Fourier transform of the autocorrelation function of the data, so that truncation of these data to negative time merely introduces end effects.
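As an illustration of eqn (21), the sketch below (plain Python, arbitrary illustrative values) evaluates the all-pole power spectrum for a single prediction coefficient whose pole lies just inside the unit circle; the spectrum peaks sharply at the pole's frequency.

```python
import cmath

def ar_power(a, sigma2, nu):
    """Eqn (21): all-pole power spectrum from the prediction coefficients
    (a[0] must be 1), with unit dwell time."""
    den = sum(am * cmath.exp(-2j * cmath.pi * m * nu) for m, am in enumerate(a))
    return sigma2 / abs(den) ** 2

# One coefficient whose pole sits just inside the unit circle at a
# frequency of 0.2 cycles per dwell time (an arbitrary choice).
pole = 0.98 * cmath.exp(2j * cmath.pi * 0.2)
a = [1.0, -pole]                 # A(z) = 1 - pole * z**-1

grid = [f / 512 for f in range(512)]
peak_nu = max(grid, key=lambda nu: ar_power(a, 1.0, nu))
print(peak_nu)   # very close to the pole frequency 0.2
```
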

The circumvention of these difficulties for phased spectra has, however, been elegantly described by Tang and Norris(42) and further elaborated by Ni and Scheraga, whose treatments I shall now paraphrase. If the time series is causal the spectrum is given by

\[
X(z) = \sum_{k=0}^{+\infty} x_k z^{-k} , \qquad (22)
\]

where again z = exp(i2πνΔt). In order to allow for the separate treatment of the initial and subsequent elements of {x_k} we can split the summation in eqn (22) into two parts:

\[
X(z) = \sum_{k=0}^{N-1} x_k z^{-k} + \sum_{k=N}^{+\infty} x_k z^{-k} . \qquad (23)
\]

Thus the first part consists of a sum over elements which have been measured, and the second part consists of a sum over elements which may be related to previous elements using eqn (17). This sets a restriction on the value of N, and it is easy to see that K ≥ N ≥ M. Tang and Norris set N = M, whereas Ni and Scheraga suggest setting N = K. Substituting eqn (17) into eqn (23) gives

\[
X(z) = \sum_{k=0}^{N-1} x_k z^{-k} \;-\; \sum_{k=N}^{+\infty} \sum_{m=1}^{M} a_m x_{k-m} z^{-k} \;+\; \sum_{k=N}^{+\infty} e_k z^{-k}
\]
\[
\phantom{X(z)} = \sum_{k=0}^{N-1} x_k z^{-k} \;-\; \sum_{m=1}^{M} a_m z^{-m} \Bigl[ X(z) - \sum_{k=0}^{N-1-m} x_k z^{-k} \Bigr] \;+\; E'(z) ,
\]

where E'(z) is the phase-shifted spectrum of the noise. Setting a_0 = 1 as before and rearranging then produces

\[
X(z) \sum_{m=0}^{M} a_m z^{-m} = \sum_{m=0}^{M} a_m z^{-m} \sum_{k=0}^{N-1-m} x_k z^{-k} + E'(z) . \qquad (24)
\]

It is at this point that the two treatments diverge.

If we follow Tang and Norris and set N = M, the order of summation in eqn (24) may be so changed to yield

\[
X(z) \sum_{m=0}^{M} a_m z^{-m} = \sum_{m=0}^{M-1} z^{-m} \sum_{n=0}^{m} a_n x_{m-n} + E'(z) , \qquad (25)
\]

i.e.

\[
X(z) = \bigl[ C(z) + E'(z) \bigr] \big/ A(z) , \qquad (26)
\]

where

\[
A(z) = \sum_{m=0}^{M} a_m z^{-m} , \qquad
C(z) = \sum_{m=0}^{M-1} c_m z^{-m} , \qquad
c_m = \sum_{n=0}^{m} a_n x_{m-n} ,
\]

with z = exp(i2πνΔt). Comparing eqn (26) with eqn (20) we see that the causality has introduced the corrective term C(z) into the numerator and phase shifted the noise. Equation (26) is equivalent to the LPZ spectral function of Tang and Norris(42) except for the introduction of the noise E'(z). Once the prediction coefficients {a_m} have been determined, eqn (26) may successfully be employed to derive a phased spectrum, with the added advantage that if the z transform is performed with z = exp(kΔt + i2πνΔt), where k is a positive constant, the spectral resolution is improved.

entropy

535

If N is chosen to be greater than M the summation order in eqn (25) employed and Ni and Scheraga show that eqn (24) may be converted to

cannot

be

N-l X(z)

a,zmm

f

-

t

xkzmk 5’

k-0

m-0

amzmm

+ E’(z)

where

M’ is the lesser of M and N - 1 - k, and normally measured data. Thus the spectrum is given by K-l

w-4 *

1

xkz -k

These

amzmm / A(z)

+

E’(z)

N -

K to employ

/ A(z) ,

all the

(27)

m-0

k-0

where M’

5

,

m-0

= min(M,K

- 1 -k).

These two ways of approaching the summation in eqn (24) are illustrated in Fig.10. Equation (24) represents summation over the rows in Fig.10, Tang and Norris' method uses summation over the diagonals, and Ni and Scheraga suggest summing over the columns.


FIG.10. The terms which have to be summed in eqn (24). Summation over the rows gives the inner summation of eqn (24); by summing over the diagonals, eqn (25), Tang and Norris were able to change the summation order for N = M; summation over the columns, eqn (27), was suggested by Ni and Scheraga.

Whilst eqn (26) is suitable for implementation using fast transforms if the spectrum is to be evaluated at a set of equidistant frequencies and if the effects of the noise E'(z) are ignored, eqn (27) is computationally more expensive, even ignoring noise. Ni and Scheraga have also used the linear prediction eqn (17) to extrapolate the FID beyond the measured K elements before Fourier transformation. This is also a valid approach and probably that which is intuitively obvious. However, for noisy data the extra elements x_k which are introduced for k ≥ K are derived from the noisiest part of the FID, and a spectrum derived from straightforward extrapolation shows a higher noise level than that produced by the LPZ function. This would also be the case if eqn (27) were to be implemented. If the prediction given by eqn (17) is good and e_k can be neglected these are valid results, but since the methods to be described only minimise e_k and do not reduce it to zero, the results of applying eqns (26) or (27) without the noise should be treated with caution: the intensities tend not to be reliable at low signal-to-noise ratios. Finally I wish to emphasise once again that the noise we have been discussing is not observation noise but is prediction noise, related to the accuracy with which x_k can be calculated based on the previous M values of x. The interaction of observation noise with prediction noise is difficult to evaluate. This small but subtle point is the principal logical difference between AR and LP. In AR the existence of observation noise is neglected and the white noise sequence e_k is treated as part of the model. This noise, being white, has a constant spectrum which should be included in the equations. In LP the vector {e_k} has elements which represent prediction errors. These may not necessarily be members of a white noise sequence, but anyway the intention of LP is to apply methods to predict the time sequence {x_k} as exactly as possible whilst averaging out the observation noise. This works well at high signal-to-noise levels, but the assumptions are no longer valid when significant noise is present.

FIG.11. Demonstration of the effectiveness of the LPZ formulation of Tang and Norris to calculate a spectrum containing lines with arbitrary phases from linear prediction coefficients. (a) 1024 point FFT of a noiseless synthetic signal. (b) 1024 point FFT of the same signal but with added noise. (c) 1024 point LPZ spectrum of the same noisy data as in (b), obtained from 200 prediction coefficients. Note the improved sensitivity and also the correct reproduction of the phases. (Reproduced from Ref.42, with permission.)

Figure 11 shows that the LPZ treatment of the prediction coefficients does actually result in a correctly phased spectrum.

6.1.2. Yule-Walker Equations. Let us start the discussion of the determination of {a_m} by assuming once again that no observation noise is present. The most obvious way to calculate these coefficients, a method introduced by Walker,(45) leads to normal equations known as the Yule-Walker equations. This approach is based on the autocorrelation function of the data (see Appendix A), so we have to start by estimating the autocorrelation coefficients from the data {x_k}. If we assume that the time series is stationary and zero-mean, the autocorrelation coefficients are given by

\[
R_n = R^{*}_{-n} = A^{-1} \sum_{k=0}^{K-n-1} x_{k+n} x_k^{*} , \qquad n \ge 0 , \qquad (28)
\]

where the factor A is chosen to be either K or K − n. The first alternative is obviously a biased estimate, since the number of terms in the summation is K − n, but it is nearly always used because the matrix of autocorrelation coefficients R is thus ensured to be positive semidefinite. This is a very important feature and a property of value in spectral estimation, since the z transform A(z) of the prediction coefficients {a_m} then has all its zeros inside the unit circle.(39) Because A(z) appears in the denominator of eqns (20), (26) and (27), this means that the poles of X(z) also lie within the unit circle and we have a stable system where X(ν) can be evaluated in the region of convergence of X(z) (see Appendix B). Using biased estimates, however, does tend to reduce the spectral resolution for short data records.(38,39) As the length K of the data sequence becomes large, though, the difference between both alternatives becomes small, so the choice is critical only for short sequences. In spite of this bias the coefficients obtained from eqn (28) are some sort of estimate of the expectation

values E[x_k x^{*}_{k-n}], which can also be obtained by multiplying eqn (17) by x^{*}_{k-n} and taking the expectation:

\[
E[x_k x^{*}_{k-n}] = E\Bigl[ -\sum_{m=1}^{M} a_m x_{k-m} x^{*}_{k-n} \Bigr] + E[e_k x^{*}_{k-n}] .
\]

Since the data elements x_{k-n} are uncorrelated with the innovation e_k for n > 0 we get

\[
R_n = -\sum_{m=1}^{M} a_m R_{n-m} , \qquad (29)
\]

and, for n = 0,

\[
R_0 = -\sum_{m=1}^{M} a_m R_{-m} + \sigma^2 , \qquad (30)
\]

where σ² = E[e_k e_k^{*}] is the variance of the white noise sequence {e_k}. Combining eqns (29) and (30) and setting a_0 = 1 allows us to write

\[
\sum_{m=0}^{M} a_m R_{n-m} = \begin{cases} \sigma^2 , & n = 0 \\ 0 , & n > 0 , \end{cases} \qquad (31)
\]

which are known as the Yule-Walker equations.
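As a numerical illustration (a sketch with arbitrary values, not the text's implementation): for a noiseless cosine, the biased autocorrelation estimates of eqn (28) fed into the order-2 equations (29) recover coefficients whose poles sit essentially on the unit circle at the line frequency.

```python
import cmath, math

# A noiseless cosine, analysed with the Yule-Walker route at order M = 2.
w, K = 0.7, 400
x = [math.cos(w * k) for k in range(K)]

def autocorr(n):
    """Biased autocorrelation estimate of eqn (28): divide by K, not K - n."""
    return sum(x[k + n] * x[k] for k in range(K - n)) / K

R0, R1, R2 = autocorr(0), autocorr(1), autocorr(2)

# The two Yule-Walker equations (29) for n = 1, 2:
#   R0*a1 + R1*a2 = -R1
#   R1*a1 + R0*a2 = -R2
# solved directly by Cramer's rule (the Levinson recursion of the next
# section does this far more efficiently for large M).
det = R0 * R0 - R1 * R1
a1 = (-R1 * R0 + R1 * R2) / det
a2 = (-R0 * R2 + R1 * R1) / det

# Poles: roots of z**2 + a1*z + a2 = 0.  Their angle estimates the line
# frequency; a radius close to 1 reflects the undamped signal.
pole = (-a1 + cmath.sqrt(a1 * a1 - 4 * a2)) / 2
print(abs(pole), abs(cmath.phase(pole)))   # radius near 1, angle near w
```
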

the variance

(ek).

eqns (29) and (30) and setting a0 = 1 allows us to write

The Levinson Recursion.

now have to solve eqns (31)

equations.

Given the estimates of the autocorrelation

for (a(z)}

denotes that the autocorrelation

(31)

matrix

coefficients we

and ot

where the M as subscript or superscript used is of order M. Equations (31) which can be

expressed in matrix form as 1 al”)

could be solved by standard

matrix

methods,

but a much more effkient technique taking

into account the Hermitian Toeplitz structure of the autocorrelation first by Levinson(*) and improved by Durbin.(“)

matrix R was described

538

DAVID

SSTEPHENSON

Details are given in various texts (e.g. references 26 and 38), but since the derivation is simple and the algorithm forms the basis of a good many spectral estimation techniques I repeat it here. Let us apply the simple trick of taking the complex conjugate of eqn (31). Noting that σ_M² is real and R*_n = R_{-n}, we obtain, after rearranging the indices,

$$\sum_{m=0}^{M} a_{M-m}^{(M)*} R_{n-m} = \begin{cases} 0, & n < M \\ \sigma_M^2, & n = M \end{cases}$$

which may also be expressed in matrix form as

$$\begin{bmatrix} R_0 & R_{-1} & \cdots & R_{-M} \\ R_1 & R_0 & \cdots & R_{1-M} \\ \vdots & \vdots & & \vdots \\ R_M & R_{M-1} & \cdots & R_0 \end{bmatrix} \begin{bmatrix} a_M^{(M)*} \\ \vdots \\ a_1^{(M)*} \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ \sigma_M^2 \end{bmatrix}. \qquad (33)$$

These two sets of M + 1 equations, (32) and (33), may be combined, without loss of generality, to produce one set of M + 2 equations

$$\begin{bmatrix} R_0 & R_{-1} & \cdots & R_{-M-1} \\ R_1 & R_0 & \cdots & R_{-M} \\ \vdots & \vdots & & \vdots \\ R_{M+1} & R_M & \cdots & R_0 \end{bmatrix} \left( \begin{bmatrix} 1 \\ a_1^{(M)} \\ \vdots \\ a_M^{(M)} \\ 0 \end{bmatrix} + \rho_{M+1} \begin{bmatrix} 0 \\ a_M^{(M)*} \\ \vdots \\ a_1^{(M)*} \\ 1 \end{bmatrix} \right) = \begin{bmatrix} \sigma_M^2 \\ 0 \\ \vdots \\ \Delta_{M+1} \end{bmatrix} + \rho_{M+1} \begin{bmatrix} \Delta_{M+1}^{*} \\ 0 \\ \vdots \\ \sigma_M^2 \end{bmatrix} \qquad (34)$$

where ρ_{M+1} and Δ_{M+1} are unknowns which have to be determined. Comparing eqn (34) with eqn (32) expanded to M + 2 equations, which has M + 1 new coefficients a_m^{(M+1)}:

$$\begin{bmatrix} R_0 & R_{-1} & \cdots & R_{-M-1} \\ R_1 & R_0 & \cdots & R_{-M} \\ \vdots & \vdots & & \vdots \\ R_{M+1} & R_M & \cdots & R_0 \end{bmatrix} \begin{bmatrix} 1 \\ a_1^{(M+1)} \\ \vdots \\ a_{M+1}^{(M+1)} \end{bmatrix} = \begin{bmatrix} \sigma_{M+1}^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix},$$

we see immediately that ρ_{M+1} = a_{M+1}^{(M+1)} and Δ_{M+1} + ρ_{M+1} σ_M² = 0, so that

$$a_{M+1}^{(M+1)} = -\Delta_{M+1}/\sigma_M^2. \qquad (35)$$

Obviously Δ_{M+1} is given by the summation

$$\Delta_{M+1} = \sum_{m=0}^{M} a_m^{(M)} R_{M+1-m}, \qquad (36)$$

where a_0 = 1, and thus combining eqns (35) and (36) we can express the new coefficient a_{M+1}^{(M+1)} in terms of the old values, i.e.

$$a_{M+1}^{(M+1)} = -\Big[\sum_{m=0}^{M} a_m^{(M)} R_{M+1-m}\Big] \Big/ \sigma_M^2. \qquad (37)$$


The other coefficients of order M + 1 can be similarly obtained, since

$$a_m^{(M+1)} = a_m^{(M)} + \rho_{M+1}\, a_{M+1-m}^{(M)*}, \qquad (38a)$$

yielding

$$a_m^{(M+1)} = a_m^{(M)} + a_{M+1}^{(M+1)}\, a_{M+1-m}^{(M)*}. \qquad (38b)$$

Finally the new noise variance is given by

$$\sigma_{M+1}^2 = \sigma_M^2 \big(1 - \big|a_{M+1}^{(M+1)}\big|^2\big). \qquad (39)$$

These equations thus provide a method for recursively determining the coefficients a_m. Starting with M = 0, eqn (31) provides the initialisation σ_0² = R_0; thus a_1^{(1)} can be found from eqn (37), σ_1² from eqn (39), and so on up to the desired order M. It should be noted that this recurrence algorithm provides not only the AR parameters at order M, but also those at all lower orders.

This feature is valuable because in general the correct order is not known beforehand, and some criterion, usually based on the noise variance, must be employed in selecting M. Obviously an algorithm which can be used to generate successively higher-order models is useful in this respect. The Levinson algorithm is in any case a powerful method of solving linear equations, such as eqns (31), involving Hermitian Toeplitz matrices.

6.1.4. Burg's Modification. The beautifully conceived recursion exemplified by eqns (37) to (39) has a serious defect: it solves eqns (31) with estimates of the autocorrelation coefficients R_i. These of course are not known; they can only be estimated from the data, for instance using eqn (28). Forgetting for the moment the problem of whether we choose A = K or A = K - n, we can see that if K is reasonably large eqn (28) may work quite well for small values of n, where the summation has sufficient terms to average out observation noise. But the approximation gets progressively worse as n increases, and for small K the estimates will presumably be considerably in error. Using the data to produce poor estimates of the autocorrelation coefficients is a bad way of utilising the information at our disposal, and Burg(18) has proposed an interesting method which obviates the need for prior estimates of the values of R and which operates solely on the data. To understand

this modification to the original algorithm one has to accept the key role played by ρ_M = a_M^{(M)}, often called the reflection coefficient, in the recursion. This parameter is determined in eqn (37) from the old prediction coefficients, the autocorrelation coefficients and the noise variance, and is, indeed, the only parameter in the whole recursion which depends directly on the autocorrelation function, the other prediction coefficients being derived using eqn (38). An equally valid and perhaps better way, which avoids the need for the autocorrelation estimates, would be to choose ρ_M such that the noise variance is minimised at each stage of the iteration. This seems a logical thing to do if our aim is to build a good predictor, since we know that a measure of its quality is σ_M², and it would furthermore operate directly on the available data {x_k}. Thus the aim of the Burg modification is to choose ρ_M such that

$$\partial \sigma_M^2 / \partial \rho_M = 0, \qquad (40)$$

and the next step suggested by Burg was a new way of determining σ_M². Equation (17)


allows us to write the prediction error as

$$e_k = \sum_{m=0}^{M} a_m x_{k-m}$$

with a_0 = 1, so that an estimate of the prediction noise variance or error power is given by summing over the available |e_k|² to produce

$$\sigma_{f,M}^2 = (K - M)^{-1} \sum_{k=M}^{K-1} \Big| \sum_{m=0}^{M} a_m x_{k-m} \Big|^2. \qquad (41)$$

Because the elements x_k are forward predicted from the M previous elements (denoted by the "f" subscript), the outer summation must be unsymmetrical with respect to the data sequence to avoid running the inner summation off the available data elements. In order to symmetrise this, Burg suggested using a backward predictor on the same data to obtain another estimate of the error power σ_{b,M}², and minimising the average of both with respect to ρ_M. To see what the backward predictor is, substitute eqn (38) into eqn (17), modified to include the "f" subscript and the prediction order superscript; thus

$$e_{M,k}^{f} = e_{M-1,k}^{f} + \rho_M\, e_{M-1,k-1}^{b}, \qquad (42)$$

where

$$e_{M,k}^{b} = x_{k-M} + \sum_{m=1}^{M} a_m^{(M)*} x_{k-M+m}. \qquad (43)$$

In eqn (43) the error in element x_{k-M} is related to the M following elements in the time series. Thus it can be regarded as the error associated with a backward predictor. The backward prediction coefficients are thus seen to be the complex conjugates of the forward coefficients. This is a consequence of the assumption of stationarity and no longer applies if the time series is non-stationary (see Appendix A). Equation (42) expresses the forward prediction error of order M in terms of the forward and backward predictors of order M - 1.

Equally well, the backward prediction error of order M is described by

$$e_{M,k}^{b} = e_{M-1,k-1}^{b} + \rho_M^{*}\, e_{M-1,k}^{f}. \qquad (44)$$

Now the forward prediction error power is given by eqn (41), which can be reformulated using eqn (42) as

$$\sigma_{f,M}^2 = (K - M)^{-1} \sum_{k=M}^{K-1} \big| e_{M,k}^{f} \big|^2,$$

and we can write an analogous equation for the backward prediction error power as

$$\sigma_{b,M}^2 = (K - M)^{-1} \sum_{k=M}^{K-1} \big| e_{M,k}^{b} \big|^2.$$


The total power is given as the average of forward and backward powers:

$$\sigma_M^2 = \tfrac{1}{2}\big(\sigma_{f,M}^2 + \sigma_{b,M}^2\big). \qquad (45)$$

Applying eqn (40) to the definition of the prediction error power given by eqn (45), and employing eqns (42) and (44), yields

$$\rho_M = \frac{-2 \sum_{k=M}^{K-1} e_{M-1,k}^{f}\, e_{M-1,k-1}^{b*}}{\sum_{k=M}^{K-1} \Big( \big|e_{M-1,k}^{f}\big|^2 + \big|e_{M-1,k-1}^{b}\big|^2 \Big)}, \qquad (46)$$

in which the new reflection coefficient is given in terms of the old errors. This is a completely different way of obtaining the reflection coefficient from that previously described in eqns (35) or (37), but the remaining coefficients at any order are still calculated using the Levinson recursion of eqn (38) and thus retain its advantage of computational efficiency. The reflection coefficient ρ_M in eqn (46) is not dependent on the choice of K or K - M as denominator in eqn (45). Thus the effects of bias, which is introduced into the estimates of the autocorrelation function eqn (28) to ensure stability and which leads to a degradation in resolution, are not evident in the Burg procedure, which generally yields very high resolution. The stability of the Burg algorithm rests on the magnitude of the reflection coefficient's being less than unity, which causes the zeros of the z transform of the prediction coefficients {a_m} to lie within the unit circle. As with the straightforward solution of the Yule-Walker equations, the recursive technique employed provides the prediction coefficients at all lower orders. The iteration is started for M = 0 by setting the error power

$$\sigma_0^2 = K^{-1} \sum_{k=0}^{K-1} x_k x_k^{*},$$

and thus eqn (46) may be used to determine ρ_1, which can then be substituted into eqn (38) to give the prediction coefficients of order 1. These may then be used to find the prediction errors of order 1, and eqn (46) is employed again to find ρ_2, and so on. The whole iteration is rather neat and works exceedingly fast. It can be seen that the method builds up its own implicit estimates of the autocorrelation function without assuming anything about the availability or non-availability of data outside that which has been measured; the sole criterion used is that the prediction error at each stage should be minimised. (A parenthetical note: Anderson derived a recursive relationship for the denominator of eqn (46) which has been quoted in reference 40. Faber has demonstrated that eqn (2.74) of reference 40 is incorrect and has given a derivation of Anderson's result. In Marple's book(38) the recursion (eqn (8.16) of reference 38) is given correctly.)
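Putting the pieces together, the whole Burg procedure of eqns (38), (42), (44) and (46) can be sketched in a few lines. This is my own illustration (assuming Python with numpy), not an optimised implementation:

```python
import numpy as np

def burg(x, M):
    """Burg recursion: reflection coefficients from eqn (46), coefficients
    updated with the Levinson step of eqn (38); works on the data only."""
    x = np.asarray(x, dtype=complex)
    f = x.copy()                            # forward errors, order 0
    b = x.copy()                            # backward errors, order 0
    a = np.array([1.0 + 0j])
    sigma2 = np.mean(np.abs(x) ** 2)        # error power at order 0
    for _ in range(M):
        ff, bb = f[1:], b[:-1]              # keep both predictors on the data
        # eqn (46): minimise the averaged forward/backward error power
        rho = -2.0 * np.dot(ff, np.conj(bb)) / (
            np.dot(ff, np.conj(ff)).real + np.dot(bb, np.conj(bb)).real)
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + rho * np.conj(a_ext)[::-1]          # eqn (38)
        f, b = ff + rho * bb, bb + np.conj(rho) * ff    # eqns (42) and (44)
        sigma2 *= 1.0 - abs(rho) ** 2
    return a, sigma2
```

Note that only the data enter: no explicit autocorrelation estimates are formed, which is precisely the point of the modification.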

6.1.5. Least-Squares Methods. Although Burg's procedure seems to be an enormous improvement over the straightforward solution of the Yule-Walker equations, several problems have been noted by various authors. One difficulty is the appearance of two lines where only one should be,(51) and another appears to be a frequency shifting depending on the phase of the signal.(52,53) Both effects can be reduced by introducing more flexibility into the algorithm designed to determine the coefficients a_m. The frequency shifting has been ascribed to errors in the autocorrelations caused by using short data records, so the situation could be improved by windowing to reduce end effects. Another alternative, which avoids this data-destructive procedure, is to eliminate the Levinson recursion from the Burg algorithm and to solve for all the a_m^{(M)} at each stage of the calculation. This concept was first introduced by Ulrych and Clayton, and fast algorithms to implement it have been published by Marple.(38,56) Let us start by considering eqn (17) again, which can be rewritten in matrix form as

$$\mathbf{e} = \mathbf{X}\mathbf{a}, \qquad (47)$$

where e is an error vector, X a rectangular data matrix, and a the vector of sought coefficients with a_0 = 1. A measure of the error power is the sum of the squares of the elements of e, so we seek a solution of eqn (47) which minimises e^H e, where e^H is the Hermitian transpose of the vector e. This can be recast as the normal equations

$$\mathbf{X}^{H}\mathbf{X}\,\mathbf{a} = \mathbf{s}, \qquad (48)$$

where s = [σ_M², 0, 0, ..., 0]^T, with σ_M² being the minimum error power.

What still need to be defined are the dimensions of the vectors and matrices in eqns (47) and (48). If the full range of available data is used, eqn (47) can be written

$$\begin{bmatrix} e_0 \\ \vdots \\ e_M \\ \vdots \\ e_{K-1} \\ \vdots \\ e_{K-1+M} \end{bmatrix} = \begin{bmatrix} x_0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ x_M & x_{M-1} & \cdots & x_0 \\ \vdots & & & \vdots \\ x_{K-1} & x_{K-2} & \cdots & x_{K-1-M} \\ \vdots & & & \vdots \\ 0 & \cdots & 0 & x_{K-1} \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_M \end{bmatrix} \qquad (49)$$

and we see that the matrix X has zeros in the upper right and lower left corners. This means that the prediction is running off the available data, a situation which Burg tried to avoid. Furthermore, the product X^H X is just the biased autocorrelation matrix multiplied by K, so this method (known as the autocorrelation method of linear prediction) leads simply to the Yule-Walker equations, which can be solved by the Levinson recursion but which yield a poorly resolved spectral estimate. If the dimension of X is reduced so that the prediction remains on the data, eqn (49) is transformed to

$$\begin{bmatrix} e_M \\ e_{M+1} \\ \vdots \\ e_{K-1} \end{bmatrix} = \begin{bmatrix} x_M & x_{M-1} & \cdots & x_0 \\ x_{M+1} & x_M & \cdots & x_1 \\ \vdots & \vdots & & \vdots \\ x_{K-1} & x_{K-2} & \cdots & x_{K-1-M} \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_M \end{bmatrix} \qquad (50)$$

and this least-squares approach is known as the covariance method of linear prediction. In this case the product X^H X is no longer Toeplitz and the Levinson recursion cannot be employed. The solution of this least-squares linear prediction problem could be found using standard techniques,(57) as has been done in NMR,(58) but a fast algorithm similar to that of Levinson can be developed using the Toeplitz structure of the factor matrices. In this case the assumption is no longer made that the time signal is stationary, and the forward prediction coefficients {a_m} are not related to the backward coefficients, which are also obtained from this fast algorithm by solving an equation similar to eqn (50).
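To make the covariance method concrete, here is a plain least-squares solve of eqn (50) (my own sketch, assuming Python with numpy; the fast Toeplitz-structured algorithms cited above are not reproduced):

```python
import numpy as np

def covariance_lp(x, M):
    """Covariance method of linear prediction, eqn (50): the prediction is
    kept on the measured data, and the forward coefficients a_1..a_M are
    found by an ordinary least-squares solve (no Toeplitz speed-up here)."""
    x = np.asarray(x, dtype=complex)
    K = len(x)
    # row k of the data matrix holds [x_{k-1}, ..., x_{k-M}], k = M..K-1
    X = np.array([x[k - M:k][::-1] for k in range(M, K)])
    a, *_ = np.linalg.lstsq(X, -x[M:], rcond=None)
    sigma2 = np.mean(np.abs(x[M:] + X @ a) ** 2)   # residual error power
    return a, sigma2
```

For a noiseless damped exponential the pole is recovered essentially exactly, since no stationarity is forced on the data; this is the advantage over the autocorrelation method.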

As with the Burg modification, the recursive program for solving eqn (50) may provide prediction coefficients at all lower orders. If the time series is non-stationary it is advantageous to keep forward and backward prediction separate to avoid distortions, but for autoregressive spectral analysis of stationary time series a better estimate can be obtained by combining the two to produce one set of coefficients, as in the Burg algorithm. This is called the modified covariance method, which again results in a set of normal equations in which the forward coefficients a_m, related to the backward coefficients by complex conjugation, are found by minimising the forward and backward prediction errors jointly. Again the matrix R^{(M)} is not the autocorrelation matrix and does not have Toeplitz form (although these properties appear as K becomes large), so the neat Levinson recursion is once more not appropriate. However, as Marple has pointed out(56) and successfully employed, R has a special structure composed of Toeplitz matrices. In common with all the previously considered algorithms, the modified covariance recursion provides prediction coefficients at all lower orders.

The details of both covariance and modified covariance procedures are much too involved to relate here, but are fully explained in Marple's recent book.(38) The algorithms have an efficiency close to that of Burg, and the deletion of the Levinson recursion solves the problems of line splitting and frequency shifting to a great extent. These advantages must be bought at a price, however: the loss of the Levinson recursion, which ensures stability when the reflection coefficient is less than unity, means that some of the zeros of A(z) may occur outside the unit circle. This has been noted by Marple.(56) A further point concerns the efficiency of these recursive algorithms, for they are really nothing more than solutions to the least-squares problem of eqns (47) and (48). The special form of the R^{(M)} matrix means that a total of KM + 6M² operations is needed to find the vector {a_m}; the standard approaches, not taking the structure of R^{(M)} into account, require considerably more effort, of the order of M³.

The line-splitting phenomenon, which is associated primarily with short data records, has been attributed by Fougere to the incomplete minimisation of the prediction error power at any one order. He devised a solution based on retaining the Levinson recursion, eqn (38), which links the prediction coefficients of any order together, while the minimum of σ_M² is sought by varying ρ_1, ρ_2, ..., ρ_M independently. The drawback to this approach is that no recursive algorithm exists at present, so the implementation is expensive. Details of this method are given in the original article and also by Chen, who lists a computer program of the algorithm.

6.1.6. Prediction Order Selection. One question which has not yet been posed relates to the value of M, the order of the prediction. The problem of order selection is common to all the autoregressive and linear prediction methods discussed here. The expected information content of the spectrum is obviously a function of M, because the polynomial in eqn (20) has M roots, so the spectrum will have M spectral features. Obviously the right choice of M is crucial to the correct implementation of the AR or LP method. If M is too small then the resolution will be poor, and one might expect that increasing M would improve the resolution up to some value of M and then leave it constant. In general this is what one observes; however, as M gets larger and larger the algorithm will attempt to fit the observation

noise, and spurious detail will become apparent in the spectrum. This is a major problem which has not been satisfactorily solved, and although there are criteria for choosing an optimal M, the AR/LP methods do suffer from the significant drawback that an incorrect choice may lead, via the inherent non-linearity, to noise in the data appearing as interpretable signals in the spectrum.

Apart from the obvious way of varying M until the spectrum looks "nice", which although not objective has certain merits, the usual criteria are based on the behaviour of the prediction error variance. To select M such that σ_M² is a minimum will not work, because all these methods have a prediction error variance which is a monotonically decreasing function of M. So some other function is required. Akaike has introduced the final prediction error (FPE) criterion, based on a combination of the error variance σ_M² and the statistical deviation of the coefficients a_m^{(M)}. Since this latter quantity increases with increasing M whereas σ_M² decreases, the FPE, defined as

$$\mathrm{FPE}(M) = \sigma_M^2\, \frac{K + (M+1)}{K - (M+1)},$$

should be a minimum at some value of M. Further details are given by Ulrych and Bishop.

Another criterion suggested by Akaike,(65) called the Akaike information criterion (AIC), is defined by

$$\mathrm{AIC}(M) = K \log \sigma_M^2 + 2M,$$

where the term 2M can be regarded as a penalty for increasing the model order. Ulrych and Ooe(41) provide a justification for this measure which, as with the FPE, should be minimised to select M. Yet another criterion was introduced by Parzen and is known as the criterion autoregressive transfer (CAT) function,

$$\mathrm{CAT}(M) = \frac{1}{K} \sum_{m=1}^{M} \frac{K - m}{K\,\sigma_m^2} \;-\; \frac{K - M}{K\,\sigma_M^2},$$

which again should be minimised to produce the correct order M. In general none of these criteria works well: all tend to underestimate the order, a feature which has also been noted in the analysis of NMR signals. This underestimation is particularly serious for low signal-to-noise ratios,(68) and Ulrych and Ooe(41) suggest the empirical method of selecting M between K/3 and K/2. Viti et al. show, however, that acceptable spectra may be obtained without choosing such extreme model orders. Using too high a value for M leads, as could be expected, to spurious peaks in the spectrum. This has been shown by Currie(70) to occur when M > K/3, which seems to contradict the experience of Ulrych and Ooe. The conclusion must be that the choice of the correct order is very dependent on the type of data and the signal-to-noise ratio.
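Once a recursion has delivered σ_M² at every order, these criteria are trivial to evaluate. A sketch (my own, assuming Python with numpy; the function names are hypothetical):

```python
import numpy as np

def fpe(sigma2, M, K):
    """Akaike's final prediction error for an order-M model from K points."""
    return sigma2 * (K + (M + 1)) / (K - (M + 1))

def aic(sigma2, M, K):
    """Akaike information criterion; the 2M term penalises the model order."""
    return K * np.log(sigma2) + 2 * M

def select_order(sigma2_by_order, K, criterion=fpe):
    """Pick the order minimising the criterion; sigma2_by_order[M] is the
    prediction error variance at order M, as delivered by the recursions."""
    vals = [criterion(s2, M, K) for M, s2 in enumerate(sigma2_by_order)]
    return int(np.argmin(vals))
```

Because the error variances decrease monotonically, the minimum of the criterion, rather than of σ_M² itself, marks the candidate order.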

6.1.7. Eigenanalysis Methods. So far we have been concerned with signals containing no observation noise; our aim has been to build a good predictor. It is possible to accommodate the noise by increasing the number M of coefficients a_m, but one really should acknowledge that predicting the noise is like predicting the weather! Furthermore, by increasing M we are introducing spurious detail into the spectrum, and this is one reason for the failure of AR/LP methods at low signal-to-noise ratios.

It should be apparent, though, that there is a significant difference between noise and signal. The noise contributes to the diagonal elements only of the autocorrelation matrix: this is the definition of white noise. Thus it should be possible to separate signal from noise on the basis of their differing statistics. The way to do this is to decide which eigenvalues of the autocorrelation matrix R, or of the equivalent data matrix products X^H X, belong to the signal and which belong to the noise. After we have made the distinction, the

spectrum is constructed from the signal part of the matrix and the noise part is discarded.

Let us go back to eqn (32). I said it could be solved by standard matrix methods, and this is not only true but opens up completely new aspects. One could approach the problem by trying to remove the observation noise variance from the diagonal elements of the autocorrelation matrix,(71) but the obvious difficulty here is that we do not know exactly how much to remove. Is it possible to use the matrix itself to provide the information somehow? The autocorrelation matrix of order M can be written as

$$\mathbf{R}^{(M)} = \sigma^2 \mathbf{I}^{(M)} + \big[\mathbf{R}^{(M)} - \sigma^2 \mathbf{I}^{(M)}\big] = \sigma^2 \mathbf{I}^{(M)} + \mathbf{R}'^{(M)}, \qquad (51)$$

where I^{(M)} is the unit matrix of order M and σ² is the variance of the observation noise. If the eigenvalues of R^{(M)} are

$$\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_M,$$

which are all greater than zero because R^{(M)} is positive definite, then the eigenvalues of R'^{(M)} are

$$\lambda_0 - \sigma^2 \geq \lambda_1 - \sigma^2 \geq \cdots \geq \lambda_M - \sigma^2,$$

and these must all be positive or zero for R'^{(M)} to remain positive semidefinite, so that σ² must lie in the range 0 ≤ σ² ≤ λ_M, the minimum eigenvalue of R^{(M)}.

What we could do is determine the eigenvalues of R^{(M)}, subtract σ² to yield R'^{(M)}, and use this in the Yule-Walker equations to determine better AR parameters. As σ² is chosen closer and closer to the minimum eigenvalue, the spectrum gets sharper until, as σ² approaches λ_M, we have a line spectrum (R'^{(M)} is singular and the poles of A(z) lie on the unit circle). This method is equivalent to a spectral estimation technique introduced by Pisarenko for analysing signals consisting of sinusoids in white noise.(26,38,41) (I once ordered this in an expensive restaurant and was violently sick afterwards.) As noted previously, better AR/LP parameters can be obtained by using the data directly instead of going through the autocorrelation estimates, so it is of interest to know whether the same eigenanalysis technique can be applied to such methods.

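The inequality 0 ≤ σ² ≤ λ_min can be turned into a quick numerical estimate of the noise level. A sketch (my own, assuming Python with numpy) which builds the order-M autocorrelation matrix and reads off its smallest eigenvalue:

```python
import numpy as np

def noise_floor(x, M):
    """Upper bound on the white-noise variance sigma^2: the smallest
    eigenvalue of the order-M autocorrelation matrix, cf. eqn (51)."""
    x = np.asarray(x, dtype=complex)
    K = len(x)
    # biased autocorrelation estimates, lags 0..M-1
    r = np.array([np.dot(x[n:], np.conj(x[:K - n])) / K for n in range(M)])
    R = np.array([[r[n - m] if n >= m else np.conj(r[m - n])
                   for m in range(M)] for n in range(M)])
    return np.linalg.eigvalsh(R).min()      # eigvalsh: Hermitian eigenvalues
```

For p sinusoids in white noise and M > p, the M - p smallest eigenvalues all cluster around σ², which is the observation the Pisarenko technique exploits.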

The least-squares linear prediction algorithms are all based on data matrix products of the form X^H X, which are similar to the autocorrelation matrix, so the intuitive way to proceed would be to choose M large enough for the prediction to work well, including the noise, then determine the eigenvalues of X^H X, and finally solve a new system of equations with a modified data matrix product from which the minimum eigenvalue has been subtracted. This would be one way of approaching the problem, but it is not necessary to re-evaluate a modified data matrix product, since the determination of the eigenvalues and eigenvectors of X^H X has already provided us with the solution. To see this, return to eqn (17) yet again; assuming that the prediction order is chosen so large that even the noise is accounted for, we can neglect e_k and write

$$d_k = -\sum_{m=1}^{M} a_m d_{k-m}, \qquad (52)$$

where d_k is used rather than x_k to show that observation noise is present. Equation (52) can be expressed in matrix form as

$$\mathbf{d} = -\mathbf{D}\mathbf{a}, \qquad (53)$$

where d is the vector of predicted data points and D the rectangular matrix whose rows contain the preceding data values. The formal solution to eqn (53) is

$$\mathbf{a} = -\mathbf{D}^{-1}\mathbf{d}, \qquad (54)$$

but since D is not a square matrix we have recourse to the pseudoinverse, which needs a bit of explaining.

The pseudoinverse of a rectangular matrix is based on its decomposition into singular values,(57,79) which can be regarded as an extension of the eigenvalue concept for square matrices. In general, for an m × n complex matrix D of rank k, where k is necessarily less than or equal to the smaller of m and n, there exist positive real numbers λ_0 ≥ λ_1 ≥ ... ≥ λ_{k-1} > 0 (the singular values of D), an m × m unitary matrix U = [u_0, u_1, ..., u_{m-1}] and an n × n unitary matrix V = [v_0, v_1, ..., v_{n-1}] such that

$$\mathbf{D} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^{H} = \sum_{i=0}^{k-1} \lambda_i\, \mathbf{u}_i \mathbf{v}_i^{H}, \qquad (55)$$

where the m × n matrix Λ has the diagonal elements λ_i. Since D^H = VΛ^H U^H, and U^H U = I_m and V^H V = I_n, where I_m and I_n are unit matrices of dimension m × m and n × n respectively, then

$$\mathbf{D}^{H}\mathbf{D} = \mathbf{V}\big(\boldsymbol{\Lambda}^{H}\boldsymbol{\Lambda}\big)\mathbf{V}^{H} \quad \text{and} \quad \mathbf{D}\mathbf{D}^{H} = \mathbf{U}\big(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^{H}\big)\mathbf{U}^{H},$$

which demonstrates that the product matrices D^H D and D D^H have the same non-zero real eigenvalues λ_i², but that the eigenvectors of D^H D are v_i and those of D D^H are u_i.

The pseudoinverse D^# of D is defined using this decomposition as

$$\mathbf{D}^{\#} = \mathbf{V}\boldsymbol{\Lambda}^{\#}\mathbf{U}^{H} = \sum_{i=0}^{k-1} \lambda_i^{-1}\, \mathbf{v}_i \mathbf{u}_i^{H},$$

where the n × m matrix Λ^# has diagonal elements λ_i^{-1}. Using this definition, eqn (54) can be written properly as

$$\mathbf{a} = -\mathbf{D}^{\#}\mathbf{d}.$$

The n × n product matrix D^# D has k non-zero eigenvalues, so D^# D is of rank k.
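In matrix-library terms both the pseudoinverse and its reduced-rank variant are a few lines. A sketch (my own, assuming Python with numpy; `truncated_pinv` is a hypothetical name):

```python
import numpy as np

def truncated_pinv(D, k):
    """Pseudoinverse from the SVD of eqn (55): D# = sum_i v_i u_i^H / lambda_i.
    Keeping only the k largest singular values gives the reduced-rank
    (signal-subspace) pseudoinverse; the discarded terms span the noise
    subspace."""
    U, s, Vh = np.linalg.svd(D, full_matrices=False)
    return (Vh[:k].conj().T / s[:k]) @ U[:, :k].conj().T
```

With k equal to the full rank this coincides with the ordinary pseudoinverse; with k smaller it implements the rank reduction discussed next.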

Now if the product D^H D can be compared with the autocorrelation matrix R of eqn (51), then the minimum eigenvalue, which is related to the minimum singular value of D, could be subtracted from the others and the calculation of the AR/LP parameters improved. But we can go a step further. The order M has been chosen so that eqn (52) is nearly exact, and presumably a lot of noise has been fitted. The singular values of D correspond to the lines in the spectrum, so that for high order a lot of lines appear, many corresponding to the fitted noise. The decomposition of eqn (55) allows the separation into a signal subspace spanned by the k' principal eigenvectors u_i, v_i corresponding to the k' largest singular values, and a noise subspace spanned by the remaining eigenvectors corresponding to the weakest singular values. Based on this separation we can construct reduced-rank approximations to D and D^# as

$$\mathbf{D}_{k'} = \sum_{i=0}^{k'-1} \lambda_i\, \mathbf{u}_i \mathbf{v}_i^{H}, \qquad \mathbf{D}_{k'}^{\#} = \sum_{i=0}^{k'-1} \lambda_i^{-1}\, \mathbf{v}_i \mathbf{u}_i^{H}.$$

This separation is clear if an AR process of order k' without observation noise is analysed with an M-order predictor where M > k', for then only k' singular values are non-zero. As the noise level increases, however, the distinction becomes unclear, and problems are encountered in selecting the correct signal order k'. We shall come back to this theme in Section 6.2, where we consider functional modelling.

I have introduced singular value decomposition (SVD) in this section because it is an ideal way to treat the pseudoinverse and to understand the concept of rank reduction, but the actual determination of the singular values as such is only one possible approach. The terminology of SVD has come into the NMR literature via the work of Barkhuijsen et al.,(75) which shall be explained in Section 6.2. However, although these authors used the idea of SVD, they approached the eigenanalysis problem by diagonalising the square data matrix product D^H D and thus avoided determining the singular values directly, as would have been possible using extant subroutines. The reason for this choice is the superior efficiency to be expected,(57) since the elements of the data matrix product may be calculated very cheaply, and the subsequent diagonalisation requires fewer operations than does the straightforward SVD. Regardless of the practical implementation, however, the basic principles remain the same: the rectangular data matrix or its square product is decomposed to yield the singular values or their squares, and a selection is made based on the relative sizes of these numbers. Which algorithm is chosen is merely a question of computational efficiency and stability.

The one direct application of AR, or rather LP, eigenanalysis in NMR is the LPZ method of Tang and Norris,(58) who approach the least-squares problem, eqn (53), by using a QR algorithm to reduce D to the product of an orthogonal matrix Q and an upper triangular

matrix R, followed by truncation to reduce the rank in the calculation of a. This QRD method is implemented using routines from the LINPACK(79) library. QRD requires of the order of mn² - n³/3 operations, where m = max(M, K - M) and n = min(M, K - M), and since this is considerably larger than the KM + 6M² needed for the covariance method of Section 6.1.5, a heavy computational penalty must be paid for such noise compensation methods, although the QRD approach is probably more stable for large M. It must be admitted, though, that one of the real benefits of these methods is their tackling of the noise problem. Within the limitations of the AR model this can be successfully accomplished only by using matrix rank reduction schemes. It remains to be seen whether fast algorithms can be devised for this purpose.

6.1.8. Summary of AR and LP Methods. The AR methods which I have considered may be concisely summarised starting from eqn (17), which represents an element of the time series {x_k} as a linear function of M preceding elements plus an error. This can be called forward linear prediction. A similar equation could be set up for backward prediction, in which an element is linearly related to the following elements of the series. The second-order statistics of a stationary time series are represented by a Hermitian autocorrelation matrix R; furthermore the prediction coefficients {a_m} of eqn (17) and the matrix R are related by the Yule-Walker equations (31) to the variance of the prediction error. These equations may be solved by the Levinson algorithm, which has the advantage of computational efficiency and which enforces stability by ensuring that the zeros of the z transform of {a_m} all lie within the unit circle.

Considerable improvement in spectral estimation is obtained by employing Burg's modification, which retains the Levinson recursion linking the individual prediction coefficients at any one order and which determines the reflection coefficient ρ_M at order M such that the combined forward and backward prediction error is minimised. Burg's algorithm operates directly on the data and makes no assumptions about elements of the time series outside the range 0 to K - 1. This avoids the resolution-destructive windowing of the autocorrelation matrix and has given rise to the alternative (and somewhat confusing) description of this approach as a Maximum Entropy method (see Section 7.1). Stability is ensured because Burg's method retains the Levinson recursion.

One step further toward perfection is represented by the least-squares methods, where the normal equations (48) are solved directly to yield the prediction coefficients. These algorithms utilise the special structure of the data matrix to achieve an efficiency close to that of Burg's method, but do not ensure the stability associated with the Levinson recursion.

They do, however, have the advantage that a non-stationary time series can be analysed by linear prediction: an important point in NMR (see Appendix A). For all these algorithms the most important parameter is the prediction order M, whose

selection may be based on the several criteria presented in Section 6.1.6.

A further point concerns the treatment of observation noise. Starting from noisy data {d_k}, the spectrum is represented, using the linearity of the Fourier transform and applying eqn (20), as

$$D(z) = E(z) \Big/ \sum_{m=0}^{M} a_m z^{-m} \;+\; N(z), \qquad (56)$$

where N(z) is the spectrum of the observation noise and z = exp(i2πνΔt). Equation (56) can equally well be written

$$D(z) = \Big[ E(z) + N(z) \sum_{m=0}^{M} a_m z^{-m} \Big] \Big/ \sum_{m=0}^{M} a_m z^{-m},$$

and because E(z) and N(z) represent the spectra of white noise processes this reduces to

$$D(z) = \sum_{m=0}^{M} b_m z^{-m} \Big/ \sum_{m=0}^{M} a_m z^{-m}, \qquad (57)$$

showing that the correct treatment of noisy data leads to a spectrum given by eqn (57) and characterised by both poles, corresponding to the roots of the denominator, and zeros, corresponding to the roots of the numerator. The number of poles equals the number of zeros; thus {d_k} is an ARMA process. This being so, it should be no surprise if the analysis of low signal-to-noise FIDs using the pure AR approach leads to imperfect results.

In general one could summarise by saying that the best method for damped sinusoids in white noise is represented by the covariance least-squares algorithm. Both the straightforward solution of the Yule-Walker equations and the Burg algorithm suffer from the

assumption of stationarity. Whilst the Burg method yields, in general, a higher resolution than Yule-Walker, it succumbs to instabilities in the data, which lead to spurious line splitting and unreliable intensity ratios. Another way of treating the observation noise is based on the dissimilar statistics of white noise and signal proper, the aim being to remove the noise bias by isolating its contribution to the diagonal terms of the autocorrelation matrix(‘l) as shown in Section 6.1.7. This may be generalised to divide the autocorrelation matrix or data matrix into two vector subspaces, one being a signal subspace and the other a noise subspace which is discarded. Whilst the general ARMA treatment has not yet been applied to NMR, noise compensation which forms a mainstay of several methods to be considered in Section 6.2, has been employed in the LPZIQRD The

advantages

analysis, i.e.

of AR

method of Tang and Norris.c5*) methods

are common

to all the modern

the defects of the Fourier transform may be avoided.

methods

of spectral

This largely means that
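Equation (57) is straightforward to exercise numerically. The sketch below evaluates a pole-zero (ARMA) power spectrum on the unit circle from given coefficient sets {a_m} and {b_m}; the coefficients here are invented for illustration, not fitted to any data set.

```python
import numpy as np

# Evaluate the rational spectrum of eqn (57),
#   D(z) = sum_m b_m z^-m / sum_m a_m z^-m,   z = exp(i*2*pi*nu*dt),
# for illustrative (not fitted) coefficient sets a and b.

def arma_spectrum(a, b, nu, dt):
    z = np.exp(1j * 2 * np.pi * nu * dt)          # points on the unit circle
    num = sum(bm * z ** (-m) for m, bm in enumerate(b))
    den = sum(am * z ** (-m) for m, am in enumerate(a))
    return np.abs(num / den) ** 2                  # power spectrum

a = np.array([1.0, -0.9])            # one pole near z = 0.9 (a_0 = 1)
b = np.array([1.0, 0.2])             # numerator (MA part) of equal order
nu = np.linspace(-0.5, 0.5, 512)     # frequency in units of 1/dt
spec = arma_spectrum(a, b, nu, 1.0)
```

The pole near z = 0.9 on the positive real axis produces a sharp peak at zero frequency; the zero merely shapes the baseline, which is exactly the pole/zero division described above.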

very short data records may be treated to give well resolved spectra: an important advantage over FT. The main disadvantages seem to me to be, on the one hand, the introduction of the prediction order, a parameter whose correct choice is crucial to the whole operation but which seems to be hard to pin down, and secondly the non-linearity of the processes which, whilst leading to high resolution, also causes severe intensity distortions for spectra determined from low signal-to-noise data. This distortion which, as shown by Hoch,(*') is particularly severe for spectra containing signals of grossly different linewidths, is related to the statistical assumption of stationarity (see Appendix A), which is not applicable to damped sinusoids.(**) It is only in the Linear Prediction sense that this approach is a "natural"(@g) one, because the typical FID, represented as a sum of damped sinusoids, is a possible solution of the difference equation (17),('*) and it is then valid only if the forward and backward coefficients are not related by complex conjugation. A final, and not insignificant,

point is the observation that, just as with the DFT, the assumption behind all the AR methods is the existence of a data vector {d_k} consisting of measurements taken at equal intervals of time \Delta t. If the time intervals are not rationally related then the DFT cannot be applied, and neither can any of the methods considered in this section. If the intervals are equal but several measurements are missing then, as we have


seen for truncation, the DFT leads to gross distortion, since we must assume that the missing elements of {d_k} are zero. For AR methods using estimates of the autocorrelation matrix it should be possible to overcome this difficulty, although the loss of several elements of {d_k} will mean that the estimates of R will be degraded, leading to poor quality spectra. However, as we have seen, better spectra may be obtained by employing the data directly using Burg's algorithm or a least-squares approach; again it might be possible to extend such methods to disregard missing data, but to the best of my knowledge this has not been attempted, although in the related technique of temporal Maximum Entropy (see Section 7.1) an iterative algorithm exists,(@) which, in principle, operates on non-uniformly sampled data. The restriction to a regular measurement interval is of little significance for one-dimensional data, but for two-dimensional experiments the ability to obtain well resolved spectra from non-regular data would be advantageous.

FIG. 12. 31P power spectra of human colon adenocarcinoma cells obtained from a 2048 point FID; the chemical shift scale is referred to 85% H3PO4. (a) DFT of original data. (b) DFT of original data after exponential multiplication; the noise is reduced but the lines are broadened. (c) DFT of the first 512 points after exponential multiplication. (d) Burg spectrum of the first 512 points of the FID. (e) Burg spectrum after smoothing the FID. (Upper traces in (d) and (e) are vertical expansions, and the roman numerals relate to species present in the sample.) (Reproduced from ref. 67 with permission.)

Viti et al.(67) have used both the straightforward solution of the Yule-Walker equations and the Burg modification in the analysis of the 31P NMR signals of human cells. The coefficients {a_m} calculated using either algorithm are employed in eqn (21) to reproduce the power spectra. The main point of the work is the comparison of the two approaches with the DFT, together with a discussion of the selection of the correct prediction order M.

As would be expected, the resolution of the DFT method is degraded as the number of points in a real FID with fairly good signal-to-noise ratio is reduced from 2048 to 512, as it is when a windowing function is applied to the original 2048 points to reduce the noise.

The Burg algorithm, however, applied to 2048 points reduces the noise considerably without at the same time causing a loss of resolution, and several humps in the Burg spectrum may be attributed to species known to be present in the sample, signals which are not obvious in the DFT spectrum. This is shown in Fig.12 which is rather difficult to interpret since some signals are folded, but which does illustrate the power of this method. A comparison of Burg spectra for different data lengths K ranging from 4096 to 512 shows that resolution is lost as K decreases, but in all cases the resolution is higher than for the DFT.

The straight solution of the Yule-Walker equations is shown to lead to poorer spectra both for this first sample and for a sample having a much lower signal-to-noise ratio (note that the captions of Figures 5 and 6 of reference 67 should be exchanged; furthermore their reference to Hacneau and Quikk should presumably read "E.J. Hannan and B.G. Quinn, J. R. Statist. Soc. B41, 190 (1979)", even though these authors do not quote a BIC criterion). This conclusion is in agreement with the rest of the literature dealing with

AR spectral estimation. Viti et al.(67) also discuss the choice of the prediction order M. They compare several criteria (see Section 6.1.6) but conclude that the FPE is as good as any. In spite of this, the spectral resolution may be improved by increasing the order beyond that calculated using the FPE. The general conclusion is that Burg's method may be applied to produce good

quality power spectra from low signal-to-noise ratio data, provided the emphasis is on the determination of the existence of peaks and not so much on their shape or intensity. In a further paper(@') these same authors again consider 31P NMR signals and use Burg's method to calculate estimates of the power spectra. As in their first paper, only real data are treated. The main topic covered here is again the correct determination of the prediction order: the illustrations all show dramatically that better resolved spectra may be obtained by selecting a value of M greater than that suggested by the FPE, sometimes considerably greater. However, choosing an order close to that suggested by the FPE yields spectra having correct intensity ratios for high signal-to-noise FIDs. For signals containing a large amount of noise the situation is not so simple, and the FPE is a poor test of the order. To offset the poor performance of the FPE, Viti et al. suggest filtering the FID before analysis to reduce the high frequency noise component, and this procedure seems to result in a better choice for M when applying the FPE. The conclusion of this paper seems to be that Burg's method may be applied to both noisy signals and short data records to produce spectra superior to those obtained by DFT. The crucial choice of the prediction order should be made with the help of the FPE, and this should be supported by filtering out the high frequency noise from the FID. These conclusions are illustrated in Fig.13. Ni and Scheraga(4') also have used Burg's method and demonstrate that well resolved phased spectra may be obtained free of truncation artefacts. The phased spectra were obtained by extrapolating the FID using forward linear prediction, eqn (17), and employing the relationship (27). Burg's method was modified to derive the reflection coefficient in eqn (46) using a weighted least-squares fit,(84) proposed to counteract the defects of Burg's approach and which has been shown(8') to be almost as good as the covariance least-squares algorithm. The authors of this modification, Kaveh and Lippert,(84) attribute the line splittings and frequency shifting associated with Burg's method to its inapplicability to non-stationary signals. Significantly, Ni and Scheraga(4') also claim that AR modelling is inadequate for exponentially decaying FIDs in practical NMR spectroscopy.
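For reference, the Burg recursion discussed throughout this section is compact. The sketch below is written from the standard textbook formulation (not from any of the implementations cited above); the test signal is an invented noiseless complex sinusoid, for which the first-order predictor is known exactly.

```python
import numpy as np

# Minimal sketch of Burg's recursion for AR coefficients of order M
# (standard textbook form, for illustration).  f and b are the forward
# and backward prediction errors; k is the reflection coefficient chosen
# at each order to minimise their combined power.

def burg(x, M):
    f = np.asarray(x[1:], dtype=complex)   # forward error, order 0
    b = np.asarray(x[:-1], dtype=complex)  # backward error, order 0
    a = np.array([1.0 + 0j])               # a_0 = 1
    for _ in range(M):
        k = -2 * np.sum(f * np.conj(b)) / np.sum(np.abs(f) ** 2 + np.abs(b) ** 2)
        a_ext = np.concatenate([a, [0]])
        a = a_ext + k * np.conj(a_ext[::-1])          # Levinson-type update
        f, b = f[1:] + k * b[1:], b[:-1] + np.conj(k) * f[:-1]
    return a

# a pure complex sinusoid at 0.1 of the sampling rate:
x = np.exp(1j * 2 * np.pi * 0.1 * np.arange(128))
a = burg(x, 1)   # ideal first-order predictor: a = [1, -exp(i*2*pi*0.1)]
```

For this noiseless sinusoid the recursion returns the exact predictor; the intensity and splitting artefacts described above only appear for damped, noisy signals, where the stationarity assumption behind the reflection-coefficient formula fails.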


In a series of papers Tang and Norris have presented not only methods for determining the coefficients {a_m} but also the means of their conversion into a phased spectrum with improved resolution. In their first paper(5s) the suggestion is made that the coefficients be found using the eigenanalysis methods of Section 6.1.7. The prediction order M is first selected, then the (K-M) x M data matrix is decomposed using a QRD(57,79) algorithm, and finally the strongest eigenvalues are retained and used to determine the M coefficients {a_m}. This method, which Tang and Norris call LPZ, is elaborated in a second paper, and they go on to suggest improving the resolution by evaluating the spectrum on a circle inside the unit circle.(#) This, of course, results in sharper lines but does not increase the noise level or introduce truncation artefacts, as would be the case if the original FID were to be multiplied by an increasing exponential.(W) Tang and Norris conclude that the LPZ method simultaneously enhances sensitivity and resolution.


Eigenvalue methods require large amounts of computer time and large amounts of computer storage, and Tang and Norris(4') have also employed the straightforward solution of the Yule-Walker equations to determine the prediction coefficients before calculating the spectrum using eqn (26). The one significant defect of the method is the assumption of stationarity, which leads to intensity distortions. To overcome this, Tang and Norris suggest applying the method iteratively.(0') Example spectra of this method are shown in Fig.11. In a recent paper(s5) these same authors extend the concept of the LPZ spectral function to allow for different summation ranges in the numerator and denominator of eqn (26). The advantages or disadvantages of this approximation of the spectrum X(z) as a quotient of two polynomials, which is known as a Pade rational function,(74) have not yet been

demonstrated.

6.2. Functional Models

The autoregressive method is a very general approach to use in order to fit a time series which we expect to consist of sinusoids. A better treatment would surely take this information into account and use it to good advantage. We are consequently brought to consider a set of models in which the expected functional form of the data is included in the analysis, and by choosing this approach we might expect to reduce the problems associated in the AR method with selection of the prediction order.

6.2.1. Damped Exponential Methods. The obvious model to choose for NMR data is a sum of complex exponentials, because this takes into account both the frequencies and the widths of the individual spectral lines. Thus the kth element of an ideal noiseless FID would be given as

    x_k = \sum_{m=1}^{M} b_m z_m^k    (58)

where z_m = exp((u_m + i 2\pi \nu_m) \Delta t), b_m is a complex amplitude, u_m is a damping factor, \nu_m is a frequency and the sampling interval is \Delta t. The idea of modelling experimental data as a sum of exponentials goes back to Baron de Prony (see references 38, 40) in 1795, although several changes have been made since then. The original idea was to fit K real exponentials to K data points. The modern version seeks a least-squares solution with M < K exponentials, and proceeds in three stages: firstly the prediction coefficients are determined, secondly these define a polynomial whose roots must be found, and thirdly these roots are used to determine the complex amplitudes.
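The model of eqn (58) is easy to exercise numerically. The sketch below synthesises a two-line FID from assumed values of b_m, u_m and \nu_m (all invented for illustration) and shows that the frequencies and damping factors are recoverable from the base parameters z_m alone.

```python
import numpy as np

# Synthesise a noiseless FID according to eqn (58),
#   x_k = sum_m b_m z_m^k,  z_m = exp((u_m + i*2*pi*nu_m)*dt).
# All parameter values below are invented for illustration.

dt = 0.001                                  # sampling interval (s)
u = np.array([-50.0, -80.0])                # damping factors (s^-1)
nu = np.array([120.0, 310.0])               # frequencies (Hz)
b = np.array([1.0 + 0j, 0.4 - 0.1j])        # complex amplitudes
z = np.exp((u + 1j * 2 * np.pi * nu) * dt)  # base parameters z_m

k = np.arange(512)
x = (b[None, :] * z[None, :] ** k[:, None]).sum(axis=1)

# line parameters are recoverable from z_m alone:
nu_back = np.angle(z) / (2 * np.pi * dt)    # frequencies
u_back = np.log(np.abs(z)) / dt             # damping factors
```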

To begin with it can be shown that eqn (58) leads to a difference equation for the data similar to eqn (17). To see this we need to set up a polynomial in z having roots z_m, and then multiply this out:

    \Phi(z) = \prod_{m=1}^{M} (z - z_m) = \sum_{m=0}^{M} a_m z^{M-m}    (59)

where a_0 = 1. Using eqn (58) we can write

    x_{k-m} = \sum_{n=1}^{M} b_n z_n^{k-m}

and multiplying this by a_m and summing over m gives

    \sum_{m=0}^{M} a_m x_{k-m} = \sum_{m=0}^{M} a_m \sum_{n=1}^{M} b_n z_n^{k-m} = \sum_{n=1}^{M} b_n z_n^{k-M} \sum_{m=0}^{M} a_m z_n^{M-m}    (60)

We can see that the last summation in eqn (60) is the polynomial eqn (59) evaluated at its nth root and is thus zero, showing that

    x_k = -\sum_{m=1}^{M} a_m x_{k-m}

The next step is to consider the effect of noise, where the measured data are given by d_k = x_k + n_k, so that

    d_k = -\sum_{m=1}^{M} a_m x_{k-m} + n_k = -\sum_{m=1}^{M} a_m d_{k-m} + \sum_{m=0}^{M} a_m n_{k-m}    (61)

Equation (61) represents an important intermediate stage in the development of the method. If we compare it with eqn (18) for a general ARMA process we see that it forms a special case where the orders of the AR and MA branches are equal and furthermore the coefficients of both branches are identical. As pointed out earlier, this general treatment has found no application in NMR yet, and I shall not consider it any more except to note that one way to solve eqn (61) would be to vary the parameters a_m iteratively('*) to minimise the noise variance \sum |n_k|^2, although other methods have been mooted.(*@) To make things easier we can simplify eqn (61) somewhat by collecting the last summation into just one noise value

    e_k = \sum_{m=0}^{M} a_m n_{k-m}

so that

    d_k = -\sum_{m=1}^{M} a_m d_{k-m} + e_k    (62)

This is known as the extended Prony technique, and the coefficients a_m are chosen to minimise \sum |e_k|^2. A comparison of eqn (62) with eqn (17) shows that this is mathematically equivalent to an AR estimation approach, and the methods of Section 6.1 may be employed to determine the a_m. However, an important difference between eqns (62) and (17) should be noted. The innovation e_k of eqn (17) was taken to be a sample of white noise, and this feature, with its consequences for the autocorrelation function of e_k, was taken into account in deriving the Yule-Walker equations. On the other hand, e_k in eqn (62) is no longer a sample of white noise and the previous derivation breaks down. Furthermore, eqn (62) is a forward prediction relationship, so that methods such as Burg's algorithm, which relates forward and backward prediction coefficients by complex conjugation as a result of the stationarity assumption, are not ideally suited to solve for {a_m}: the covariance method is the better choice.

Leaving aside for the moment the problem of determining {a_m}, the next step is to calculate the roots of the polynomial in eqn (59). This may be accomplished using standard techniques(87), and a program is listed in reference 38. This rooting procedure yields the frequencies \nu_m and damping factors u_m, but a further step is needed to obtain the complex amplitudes b_m, which contain the intensity and phase of each line.

To recapitulate: the first step determines the LP parameters {a_m}, the second step employs eqn (59) to find the roots {z_m} of the polynomial \Phi, and the third and final step, which we come to now, uses these {z_m} to evaluate the {b_m} of eqn (58). Since we are dealing with a least-squares technique, eqn (58) is not exact and must be replaced by

    d_k = \sum_{m=1}^{M} b_m z_m^k + n_k

expressible in matrix form as

    d = Zb + n    (63)

where

    Z = | 1          1          ...  1          |
        | z_1        z_2        ...  z_M        |
        | ...                        ...        |
        | z_1^{K-1}  z_2^{K-1}  ...  z_M^{K-1}  |

and b = (b_1, ..., b_M)^T, n = (n_0, ..., n_{K-1})^T, d = (d_0, ..., d_{K-1})^T. A least-squares solution to eqn (63) which minimises the squared error n^H n is given by

    b = [Z^H Z]^{-1} Z^H d    (64)
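The three steps just described can be strung together in a few lines. The sketch below (test signal and noise level invented; the first step is a plain least-squares fit rather than any particular published algorithm) estimates {a_m} from eqn (62), roots the polynomial of eqn (59), and solves eqn (64) for the amplitudes.

```python
import numpy as np

# Extended Prony in three steps, on an invented two-line signal.
rng = np.random.default_rng(0)
K, M = 128, 2
z_true = np.array([0.97 * np.exp(1j * 0.6), 0.95 * np.exp(1j * 1.9)])
b_true = np.array([1.0 + 0j, 0.5 - 0.2j])
k = np.arange(K)
d = (b_true[None, :] * z_true[None, :] ** k[:, None]).sum(axis=1)
d = d + 1e-4 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))

# step 1: least-squares estimate of {a_m} in d_k = -sum_m a_m d_{k-m} + e_k
D = np.column_stack([d[M - 1 - m : K - 1 - m] for m in range(M)])
a = np.linalg.lstsq(-D, d[M:], rcond=None)[0]

# step 2: roots of the polynomial of eqn (59) give the poles z_m
z_est = np.roots(np.concatenate([[1.0], a]))

# step 3: Vandermonde least squares, eqn (64), gives the amplitudes b_m
Z = z_est[None, :] ** k[:, None]
b_est = np.linalg.solve(Z.conj().T @ Z, Z.conj().T @ d)
```

From z_m and b_m the frequency, damping factor, intensity and phase of each line follow directly, as described above.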

which can be solved by a variety of techniques, for example Cholesky decomposition. Alternatively, SVD could be used to solve eqn (63), as it was to solve eqn (53), and this approach also allows a noise reduction step(@e) akin to that proposed in Section 6.1.7.

6.2.2. Kumaresan and Tufts's Approach. One can see that the above method will do a good job of determining the spectrum if the FID is noiseless, although it will be beset by the usual AR problem of order determination. If M is too low the spectrum will be distorted. If noise is introduced the problems increase, and unfortunately no satisfactory treatment is available for this situation. If M exponentials are fitted using M order parameters in the presence of noise, the estimates of the signal zeros will be biased, i.e. shifted from their true positions in the complex plane, the degree of bias being determined by the signal-to-noise ratio. To avoid bias we can increase the order of the prediction, but then extra noise zeros will be introduced. It was this problem which was tackled by Kumaresan and Tufts using a combination of the modified Prony method and the eigenanalysis methods of Section 6.1.7.


These authors have published several papers(89,90,91,92,93) dealing with the solution of the above problem. They start by treating the measured signal, represented by eqn (58), as being noiseless and then consider the addition of noise in an ad hoc fashion. In order to discriminate the exponentials from the noise, a large prediction order is chosen to allow the noise to be fitted, and a combination of forward and backward prediction is used to determine the roots {z_m}. Furthermore, SVD is employed as described in Section 6.1.7. As we have seen in Section 6.2.1,

forward prediction is expressed by the relationship

    \sum_{m=0}^{M} a_{f,m} x_{k-m} = 0

for which the characteristic polynomial

    \Phi_f(z) = \sum_{m=0}^{M} a_{f,m} z^{M-m}    (65)

exists. Since we are dealing with decaying exponentials, the roots of \Phi_f(z) all lie within the unit circle (see Appendix B). A backward predictor for these data can also be set up,

    \sum_{m=0}^{M} a_{b,m} x_{k-M+m} = 0

and for this a polynomial

    \Phi_b(z) = \sum_{m=0}^{M} (a_{b,m})^* z^{M-m}    (66)

can be formulated which for forward decaying exponentials has roots outside the unit circle.
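The reciprocal root positions can be demonstrated for a single damped exponential (parameters invented): the forward predictor root falls at z_0, inside the unit circle, while the conjugated backward polynomial has its root at 1/conj(z_0), outside, on the same radius.

```python
import numpy as np

# Forward and backward predictor roots for one damped exponential.
z0 = 0.9 * np.exp(1j * 0.7)          # assumed pole, |z0| < 1
x = z0 ** np.arange(32)

# forward prediction x_k = -a_f x_{k-1}
a_f = np.linalg.lstsq(-x[:-1, None], x[1:], rcond=None)[0][0]
# backward prediction x_{k-1} = -a_b x_k
a_b = np.linalg.lstsq(-x[1:, None], x[:-1], rcond=None)[0][0]

root_f = np.roots([1, a_f])[0]             # = z0, inside the unit circle
root_b = np.roots([1, np.conj(a_b)])[0]    # = 1/conj(z0), outside
```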


If the prediction order is chosen to be large, then many of these roots correspond to noise and the spectrum will show spurious lines, but a selection can be made on the behaviour of the roots of \Phi_f(z) and \Phi_b(z). The noise roots of both polynomials will lie predominantly within the unit circle, since the statistics of a stationary random process do not change on time reversal, whereas the exponential signal roots will occur in reciprocal positions along a common radius for \Phi_f(z) and \Phi_b(z). This is shown in Fig.14 for the case of two damped exponentials, where the length of the data vector K has been chosen as 32. Figures 14a and 14b depict the positions of the roots of \Phi_f(z) and \Phi_b(z) respectively for a noiseless signal analysed with order 2. When noise is introduced, the estimates of the frequency (angular coordinate of the root) and damping factor (radial coordinate of the root) are disturbed, the degree of disturbance depending on the noise. To illustrate this, Figs 14c and 14d show the positions of the roots of \Phi_f(z) and \Phi_b(z) for many different realisations of white noise, again analysing the signal with M = 2. We see that the estimates are not very reliable: there are both frequency shifts and poor estimates of the damping. The noise can be accommodated by increasing the prediction order, in the case of Figs 14e and 14f to M = 8. There are many more extraneous zeros scattered within the unit circle (and some just outside it!) but the signal zeros are clustered where they belong. Figure 14f shows clearly that a good (but not perfect) discrimination is achieved. The backward prediction of the exponentials puts the signal roots outside the unit circle, and nearly always leaves the noise roots inside.

To improve the discrimination even more, the idea of SVD can be introduced into the determination of the forward and backward prediction coefficients, based on the eigenanalysis methods of Section 6.1.7. Taking eqn (17) for the forward prediction and eqn (43) for the backward, we can express these in matrix form as

    x^f = -X^f a^f + e^f    and    x^b = -X^b a^b + e^b

where X^f and X^b are (K-M) x M matrices of data samples,

    X^f = | x_{M-1}  x_{M-2}   ...  x_0        |      X^b = | x_1      x_2        ...  x_M      |
          | ...                     ...        |            | ...                      ...      |
          | x_{K-2}  x_{K-3}   ...  x_{K-M-1}  |            | x_{K-M}  x_{K-M+1}  ...  x_{K-1}  |

and the other vectors are defined by

    x^f = (x_M, ..., x_{K-1})^T,  x^b = (x_0, ..., x_{K-M-1})^T,
    a^f = (a_{f,1}, ..., a_{f,M})^T,  a^b = (a_{b,1}, ..., a_{b,M})^T,

with e^f and e^b the corresponding error vectors.


The data matrices can be decomposed using eqn (55) as

    X^f = \sum_{i=0}^{M-1} \lambda_i^f u_i^f (v_i^f)^H,    X^b = \sum_{i=0}^{M-1} \lambda_i^b u_i^b (v_i^b)^H

the \lambda_i being the singular values. If the signal consists of M' exponentials in additive noise, then the M' eigenvectors corresponding to the largest eigenvalues will provide a reduced rank approximation to X which corresponds to the exponentials alone, i.e.

    X'^f = \sum_{i=0}^{M'-1} \lambda_i^f u_i^f (v_i^f)^H,    X'^b = \sum_{i=0}^{M'-1} \lambda_i^b u_i^b (v_i^b)^H    (67)

The least-squares solution for a will then be given by

    a^f = -(X'^f)^# x^f,    a^b = -(X'^b)^# x^b

where

    (X'^f)^# = \sum_{i=0}^{M'-1} (\lambda_i^f)^{-1} v_i^f (u_i^f)^H,    (X'^b)^# = \sum_{i=0}^{M'-1} (\lambda_i^b)^{-1} v_i^b (u_i^b)^H    (68)

Applying this technique to our previous example of two exponentials in noise, analysed with a prediction order M = 8 but selecting only the two largest eigenvalues, i.e. M' = 2, yields the results shown in Figs 14g and 14h for forward and backward prediction respectively.
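The rank-reduction step of eqns (67) and (68) amounts to a truncated-SVD pseudoinverse. A minimal sketch follows (test signal, order and rank are all invented; this is not any author's published code).

```python
import numpy as np

# Solve for prediction coefficients using only the M' largest singular
# values of the data matrix, as in eqns (67)-(68).

def truncated_solve(X, rhs, Mp):
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    # pseudoinverse built from the M' dominant singular triplets
    return sum(Vh[i].conj() * (U[:, i].conj() @ rhs) / s[i] for i in range(Mp))

rng = np.random.default_rng(1)
k = np.arange(48)
x = (0.96 * np.exp(1j * 0.5)) ** k + (0.93 * np.exp(1j * 1.7)) ** k
x = x + 0.01 * (rng.standard_normal(48) + 1j * rng.standard_normal(48))

M = 8                                             # deliberately too large
X = np.column_stack([x[M - 1 - m : 47 - m] for m in range(M)])
a = -truncated_solve(X, x[M:], 2)                 # keep only rank 2
```

Although M = 8 coefficients are returned, only the rank-2 signal subspace contributes, so the prediction residual stays near the noise level rather than fitting the noise.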

Figure 14h shows dramatically a clear distinction between signal zeros clustered close to the true positions outside the unit circle and extraneous noise zeros which now all lie within the unit circle.

A further improvement advocated by Kumaresan and Tufts is the correction of the M' singular values retained in eqns (67) and (68) by subtraction of the average of the noise singular values. This might be expected to reduce the residual bias in the damping factors of the signal zeros which can be seen in Fig.14h, or more clearly in Fig.15a, which again portrays the analysis of two exponentials in additive noise but this time with M = 24 and M' = 2. The dots corresponding to the exponential with the larger damping tend to cluster on the inside of the true location, and by subtracting the average of the six noise singular values from the other two (note that the (K-M) x M data matrix has maximum rank K-M = 8, so that it has only eight singular values) Fig.15b is obtained. The zeros now have a bias in the other direction, i.e. they now tend to lie on the outside of the true location. This overcorrection comes about because the signal power is proportional to the eigenvalues of the data matrix product X^H X which, as shown in Section 6.1.7, are the squares of the singular values, so that subtracting the mean of the squares of the noise singular values from the squares of the signal singular values(75,g4) results in the improved correction seen in Fig.15c.
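The corrected version of this step is a one-liner; in sketch form (the singular value spectrum below is invented for illustration):

```python
import numpy as np

# Correct the retained singular values by subtracting, from their
# squares, the mean of the squares of the noise singular values.
s = np.array([10.0, 6.0, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7])  # assumed spectrum
Mp = 2                                    # number of retained values
noise_power = np.mean(s[Mp:] ** 2)        # mean squared noise singular value
s_corrected = np.sqrt(s[:Mp] ** 2 - noise_power)
```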


FIG. 15. Locations of the zeros of \Phi_b(z) for a signal with a data length of 32 consisting of two exponentials in additive noise. The true positions are indicated by crosses and the dots show the results of analysing the signal for 50 realisations of white noise. (a) SVD method with M = 24 and M' = 2. (b) SVD with M = 24 and M' = 2, but correction of the two main singular values by subtraction of the mean of the other 6. (c) SVD with M = 24 and M' = 2, but correction of the two main singular values by subtracting from their squares the mean of the squares of the other 6.

The distinction between signal roots and noise roots based on their relative positions with respect to the unit circle becomes difficult as the signal-to-noise ratio decreases. Even with backward prediction and the rank reduction represented by eqns (67) and (68), some signal roots may move inside the unit circle and some noise roots outside. Recently Delsuc et al.(95) have applied a method suggested by Porat and Friedlander(96) to overcome this problem. This approach uses both forward and backward prediction to give two sets of roots. Reflection of the backward roots inside the unit circle should make the signal roots superimposable, whereas the noise roots for backward prediction will now lie predominantly outside the unit circle. The distances between roots taken pairwise, one from each set, yield a criterion for distinguishing between signal and noise, signal roots being those whose separation is a minimum. This technique has the disadvantage that the number of lines present in the spectrum should be known or be estimated.
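The pairing criterion can be sketched as follows (both root sets are invented for illustration; the reflection z -> 1/conj(z) maps the backward signal roots back onto the forward ones while throwing the backward noise roots outside):

```python
import numpy as np

# Pair forward roots with reflected backward roots; minimal separation
# marks a signal root.  One line is assumed present in this example.
fwd = np.array([0.90 * np.exp(1j * 0.70), 0.50 * np.exp(-1j * 2.0)])
bwd = np.array([(1 / 0.90) * np.exp(1j * 0.70), 0.45 * np.exp(1j * 1.1)])

refl = 1 / np.conj(bwd)                  # reflect backward roots
dist = np.abs(fwd[:, None] - refl[None, :]).min(axis=1)
signal_root = fwd[np.argmin(dist)]
```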

6.2.3. Avoiding Polynomial Rooting. The extended Prony method is an elegant way of analysing NMR data; in particular the treatment of noise using eigenvalue analysis, which was introduced into magnetic resonance by Barkhuijsen et al.,(75) is worth noting. This approach, which has been described in the preceding subsections, suffers from one main computational problem: the central step, namely the polynomial rooting, is slow. Remember that Prony's method consists of three main stages: firstly, determination of the prediction coefficients, which may be accomplished using an efficient LP algorithm if no eigenanalysis is desired, or by an SVD attack on the data matrix or diagonalisation of the data matrix product if an eigenanalysis is suggested. The resultant prediction coefficients are then used in eqns (65) or (66) to determine the roots {z_m}, which may then be used in the third stage, represented by eqn (64), to produce the complex amplitudes b. Recently Barkhuijsen et al.(97) have proposed an alternative which obviates the polynomial rooting step and which I shall now describe. We start from the basic description of the signal as a sum of complex exponentials, eqn (58), and note that this can be written in matrix form as

    x = Zb


where

    x = (x_0, x_1, ..., x_{K-1})^T,  b = (b_1, ..., b_M)^T

and

    Z = | 1          1          ...  1          |
        | z_1        z_2        ...  z_M        |
        | ...                        ...        |
        | z_1^{K-1}  z_2^{K-1}  ...  z_M^{K-1}  |

Note the similarity to eqn (63). Next we build a data matrix X from the elements x_k, its rows being successive segments of the FID (so that X has Hankel structure, X_{kl} = x_{k+l-2}), and see that

    X = | 1              1              ...  1              |
        | z_1            z_2            ...  z_M            |
        | ...                                ...            |
        | z_1^{K-M-1}    z_2^{K-M-1}    ...  z_M^{K-M-1}    | B

or

    X = ZB    (69)

where Z now has K-M rows and B_{ml} = b_m z_m^{l-1}. Z has the special form of a Vandermonde matrix, i.e. the elements of Z are given by

    Z_{km} = z_m^{k-1}

and it is the base parameters {z_m} in which we are interested. We are faced with two problems if we wish to find {z_m}: the data matrix must be decomposed into a product such as eqn (69), and then {z_m} must be extracted from the matrix Z. We have seen that SVD will decompose a matrix such as X into a product of three matrices, eqn (55), i.e.

    X = U \Lambda V^H

and we can, without loss of generality, include the diagonal matrix \Lambda in V to give

    X = U V'^H.

Comparing this with eqn (69) we see that the form is similar, but there is no justification for setting U = Z. Instead, eqn (69) must be generalised by including an arbitrary square matrix Q with dimension M x M, to give

    X = ZQQ^{-1}B

so we can write

    U = ZQ

A simple relationship allows us to extract {z_m}. First we define the matrix Z_b as that matrix with K-M-1 rows formed from Z by removing the bottom row, and we define Z_t as that matrix formed from Z by removing the top row. We can see that Z_b and Z_t are related by

    Z_t = Z_b \Lambda_z

where \Lambda_z is the diagonal matrix

    \Lambda_z = diag(z_1, z_2, ..., z_M)

Then it is easy to see that

    U_b = Z_b Q    and    U_t = Z_t Q

where U_b and U_t are formed from U by removing the bottom and top rows respectively, so that

    U_t = U_b Q^{-1} \Lambda_z Q    (70)

and our task reduces to finding Q^{-1} \Lambda_z Q and then diagonalising it to determine {z_m}. The solution to eqn (70) is the well known least-squares solution(8') given by

    Q^{-1} \Lambda_z Q = (U_b^H U_b)^{-1} U_b^H U_t    (71)

The matrices on the right hand side of eqn (71) are all known from the SVD treatment of the data matrix X, and it suffices to carry out the multiplications indicated in eqn (71) and then diagonalise the resultant M x M matrix, whose eigenvalues are the sought parameters {z_m}.

As Barkhuijsen et al.(97) point out, the inversion of the matrix product (U_b^H U_b) in eqn (71) may be simplified by noting that U is unitary, thus U^H U = E, the unit matrix, and since U_b is formed from U by removing the bottom row, which we can denote as the row vector u_b^H, then

    U_b^H U_b = U^H U - u_b u_b^H = E - u_b u_b^H

so that, using the augmented matrix inversion lemma,(@s,tO) which for matrices A(nxn), B(nxm), C(mxm) and D(mxn) states that

    (A + BCD)^{-1} = A^{-1} - A^{-1} B (D A^{-1} B + C^{-1})^{-1} D A^{-1},

we get

    (U_b^H U_b)^{-1} = E + u_b u_b^H / (1 - u_b^H u_b)

so that

    Q^{-1} \Lambda_z Q = [ E + u_b u_b^H / (1 - u_b^H u_b) ] U_b^H U_t
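This rank-one simplification is easy to verify numerically; the check below uses a random matrix with orthonormal columns in place of the SVD output, purely for illustration.

```python
import numpy as np

# Check: for U with orthonormal columns, deleting the bottom row u_b^H
# leaves U_b with (U_b^H U_b)^{-1} = E + u_b u_b^H / (1 - u_b^H u_b).
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3))
U = np.linalg.qr(A)[0]                   # 6x3, U^H U = E
Ub = U[:-1]
ub = U[-1].conj()                        # column vector u_b

lhs = np.linalg.inv(Ub.conj().T @ Ub)
rhs = np.eye(3) + np.outer(ub, ub.conj()) / (1 - ub.conj() @ ub)
```

This is the Sherman-Morrison special case of the lemma quoted above (A = E, B = u_b, C = -1, D = u_b^H), which is why only a scalar division is needed instead of a matrix inversion.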

The eigenvalues of this are the sought poles. Barkhuijsen et al.,(97) who have named this method HSVD because the data matrix X has Hankel structure, then go on to determine the amplitudes b_m by the normal least-squares method represented by eqn (64).
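The whole idea fits in a few lines. The sketch below (a compact illustration of the HSVD principle on an invented noiseless two-line signal, not the original implementation) builds the Hankel matrix, truncates the SVD, and obtains the poles as eigenvalues of the matrix mapping U_b onto U_t, with no polynomial rooting.

```python
import numpy as np

# HSVD-style pole extraction: poles as eigenvalues of the shift map.
z_true = np.array([0.96 * np.exp(1j * 0.5), 0.90 * np.exp(1j * 2.1)])
x = (z_true[None, :] ** np.arange(64)[:, None]).sum(axis=1)  # b_m = 1

M = 8
X = np.array([x[i : i + M] for i in range(64 - M + 1)])      # Hankel matrix
U = np.linalg.svd(X, full_matrices=False)[0][:, :2]          # rank 2: two lines
Ub, Ut = U[:-1], U[1:]                   # remove bottom / top row
F = np.linalg.lstsq(Ub, Ut, rcond=None)[0]                   # solves eqn (70)
z_est = np.linalg.eigvals(F)             # the poles z_m
```

The amplitudes would then follow from the Vandermonde least-squares step of eqn (64), exactly as in the extended Prony method.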

6.2.4. Summary of Functional Modelling. We started out by considering the FID as composed of a sum of complex exponentials, eqn (58). The solution of this problem leads to an approach known as the extended Prony method, which consists of three basic steps. Firstly the linear prediction parameters must be determined, secondly the roots of the polynomial (59) must be found, and thirdly these roots are used in a final least-squares procedure to evaluate the complex amplitudes b_m in eqn (58).

The first step can be accomplished in a variety of ways. The most obvious is to use one of the methods described in Section 6.1 to solve eqn (62) for {a_m}. This has been applied by Barone et al.,(**) who used the Burg method to evaluate the prediction coefficients. The advantages of this approach are the availability of well-defined criteria to determine the optimal order M. These authors use the FPE (Section 6.1.5) to select a prediction order that is small in relation to the data length, and thus the noise should be eliminated. However, the assumption of stationarity inherent in the Burg method severely curtails the efficiency of this approach, and a superior way might be to apply the covariance least-squares algorithm of Section 6.1.5. The prediction coefficients may also be obtained by solving eqn (48) directly by Cholesky decomposition. This mode of attack has been chosen by Gesmar and Led,(80) who solve for the backward prediction coefficients which, as shown in Section 6.2.2, allow the discrimination of the signal and noise roots. In this case the order of the prediction M is chosen to be large to accommodate the noise, which is then eliminated at the end of the second stage. This again is not an optimum approach, for an expensive computational procedure has been employed.

The eigenvalue methods also allow the selection of a very large prediction order (M = 3K/4) and noise elimination in both the first stage, using rank reduction (Section 6.1.7), and the second stage, by employing backward prediction. This combination is the method propounded by Kumaresan and Tufts, and this was used by Barkhuijsen et al.,(75) who formed the data matrix product XX^H and diagonalised it to yield the eigenvalues and an eigenvector matrix. This step was followed by selection of the most significant eigenvalues (i.e. rank reduction of XX^H) before completion of the calculation of {a_m}. Whereas Barkhuijsen et al.,(75) who named their method LPSVD, employed tridiagonalisation of the product XX^H followed by bisection to perform the eigenanalysis, Tang et al.(100) used a QR procedure,(57) claiming it to be more efficient.

SVD

calculation

of the data

matrix

of the product

possible if M=3W4,

is employed.

As

2CXH followed by diagonalisation

particularly

This is the case only if

Barkhuijsen

as the elements of the XXH

et a1.(‘*) have

is the most matrix

shown,

efficient way

may be calculated

recursively.(s8~80~g8) The second stage of the analysis involves finding the roots of the polynomial (59).

This is accomplished by Barkhuijsen et al.(75) and by Tang et al.(100) using a routine by Steiglitz and Dickinson. Gesmar and Led have used another from Svejgard(102) to solve for the roots, whilst Barone et al.(88) recast this as an eigenvalue problem. The first and second stages may be elegantly combined to avoid the time-consuming rooting, as shown by Barkhuijsen et al.,(97) whose method is described in Section 6.2.3. The final stage of the extended Prony method involves taking the roots {z_m} and using these in a least-squares calculation, eqn (64), to determine the complex amplitudes {b_m}. This is done by Barone et al.(88) using an SVD, whereas Barkhuijsen et al. employ either the routine of Beppu and Ninomiya(99) for real data or take a standard routine (SIMQ) from the IBM Scientific Subroutine Package for complex data.
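The three-stage procedure just described can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the function name `extended_prony` and all parameter choices are assumptions, and a plain truncated-SVD pseudoinverse, `numpy.roots` and `numpy.linalg.lstsq` stand in for the cited eigenanalysis, rooting and least-squares routines.

```python
import numpy as np

def extended_prony(d, M, rank):
    """Sketch of the three-stage extended Prony (LPSVD-style) analysis.

    d    : complex time-domain signal of length K
    M    : prediction order (chosen much larger than the number of signals)
    rank : number of singular values retained (rank reduction)
    """
    K = len(d)
    # Backward prediction is equivalent to forward prediction on the
    # reversed data, where decaying signal poles z become growing poles 1/z.
    y = d[::-1]
    A = np.array([y[k - M:k][::-1] for k in range(M, K)])  # (K-M) x M data matrix
    rhs = y[M:]
    # Stage 1: minimum-norm prediction coefficients via truncated SVD.
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    s_inv = np.where(np.arange(len(s)) < rank, 1.0 / s, 0.0)
    a = Vh.conj().T @ (s_inv * (U.conj().T @ rhs))
    # Stage 2: root the prediction polynomial.  Signal roots lie outside
    # the unit circle, while the extraneous roots of the minimum-norm
    # solution fall inside, so the two are easily separated.
    roots = np.roots(np.concatenate(([1.0], -a)))
    z = 1.0 / roots[np.abs(roots) > 1.0]        # poles of the original signal
    # Stage 3: complex amplitudes from the least-squares problem d = Z b.
    Z = z[np.newaxis, :] ** np.arange(K)[:, np.newaxis]
    b, *_ = np.linalg.lstsq(Z, d, rcond=None)
    return z, b
```

On noiseless test data consisting of two damped complex exponentials the two signal poles and amplitudes are recovered essentially exactly; with noisy data the retained rank controls the noise rejection.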

FIG. 16. 31P spectra of a mouse. (a) Real FFT of a 768-point FID after optimal filtering and setting the initial three points to zero. (b) LPSVD spectrum obtained from points 4 to 128 of the original FID. (Reproduced from Ref. 75, with permission.)

Barkhuijsen et al. have applied their LPSVD method both to synthetic time data(75) consisting of two exponentials in noise and to 31P NMR data.(75,94) The synthetic data are used to show that for high signal-to-noise ratios two closely spaced Lorentzian lines with different linewidths and intensities can be resolved, the spectrum being presented in tabular form as a list of parameters {b_m} and {z_m}. Noise rejection here is made by selecting only a very few significant singular values, which in this case is easy because of the high signal-to-noise ratio. For the 31P investigation Barkhuijsen et al. studied a mouse! Because of experimental difficulties the total number of data points was limited and the first few points of the FID unreliable. This sort of problem results in severe artefacts if the DFT method is employed. Using only K = 125 points and setting M = 88, the 37 singular values (the number of non-zero singular values equals min(M, K - M)) could be

divided into significant and non-significant by scrutiny, yielding a reduced rank of M' = 9. The resulting spectrum looks nicer than the DFT of the original data; in particular the rolling baseline has been removed (Fig. 16). The ability of LPSVD to extract the correct frequencies, phases and damping factors, even when the initial part of the time signal is missing, is demonstrated by these authors for an electron spin echo signal. Furthermore, they show that by manipulating the damping factor an improved resolution may be obtained. One difficulty becomes apparent in this study, namely the question of whether a signal found by LPSVD is "really there" or whether it appears by some happy chance of noise.

A method for solving this problem is to take many realisations of the noise and determine estimates of the standard deviations of the signals found. The roots whose amplitudes lie below the standard deviation could then be rejected. This unrealistic approach to estimating the variance of a parameter can be avoided by using the Cramér-Rao inequality, which sets a lower bound on the variance,(88,105) and whose application is described both by Kumaresan and Tufts and by Barkhuijsen et al.(75) These latter authors compare this statistical technique with the aforementioned usage of many noise realisations and conclude that the Cramér-Rao theory is justifiable and provides a means of determining whether peaks determined by LPSVD are significant: those whose amplitudes are less than the calculated Cramér-Rao lower bound are rejected.

FIG. 17. (Figure not fully recovered from the original; spectra plotted against a ppm axis, 70.00 to 10.00 PPM.)
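The rank structure quoted in the mouse study is easy to confirm: a prediction data matrix built from K points with order M has K - M rows and M columns, so it can have at most min(M, K - M) non-zero singular values. A quick check with the values K = 125, M = 88 from the text, with random data standing in for the FID:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 125, 88                       # values quoted in the text
d = rng.standard_normal(K) + 1j * rng.standard_normal(K)
# (K - M) x M prediction data matrix built from the signal
X = np.array([d[k + 1:k + 1 + M] for k in range(K - M)])
s = np.linalg.svd(X, compute_uv=False)
print(X.shape, s.size)               # (37, 88): at most 37 singular values
```

For noisy data all 37 singular values are non-zero, and the significant ones must be separated from the rest by scrutiny, exactly as described above.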

Tang et al.(100) have used their LPQRD method in the spectral analysis of synthetic data consisting of noisy damped sinusoids. They demonstrate the very good resolution characteristics of Prony's method for high signal-to-noise ratios, and particularly that small signals may accurately be determined in the presence of large signals. Delsuc et al.(95) have treated data with low signal-to-noise using the Porat and Friedlander modification described at the end of Section 6.2.2, and have compared these results with those obtained using DFT, Maximum Entropy (Section 7.1) and LPSVD. They show that all the modern methods are capable of producing better spectra than the DFT (Fig. 17), but that Maximum Entropy, whilst determining the line positions well, produces many extraneous signals at low signal-to-noise ratios. In contrast, LPSVD gives a clean spectrum but fails to find all the signals, since the high noise level has moved some of the signal roots inside the unit circle.

As a compromise, the modified LPSVD method of Delsuc et al. produces lines near their correct positions at the cost of being told how many lines are present in the spectrum.

FIG. 18. 13C spectra of porcine insulin. (a) FFT of an 8192-point FID. (b) The same FID treated by Prony's method; 2400 backward prediction coefficients were calculated, of which 124 signal roots lay outside the unit circle. Intensities smaller than three times the corresponding standard deviations were ignored, as were signals whose phases are inconsistent with the rest of the spectrum. (c) FFT of the residual between the original FID and the Prony-reconstructed FID, showing only noise. Starred signals have intensity only slightly higher than the corresponding standard deviations. (Reproduced from Ref. 80, with permission.)

Gesmar and Led(80) have applied Prony's method to synthetic data consisting of damped sinusoids in noise, to 1H NMR data of a heptapeptide and to the 13C FID of insulin. The LP treatment of the high signal-to-noise ratio synthetic data yields a higher resolution and,

of course, a cleaner spectrum than the DFT. These authors show that the quality of the spectrum improves as the number of prediction coefficients is increased, although this only applies in high signal-to-noise situations. The LP calculation using 2400 prediction coefficients on 8192 data points of a high signal-to-noise 1H FID of a heptapeptide gives a result almost identical to the DFT. More interesting is the application to a low signal-to-noise 13C FID of insulin. Again 2400 coefficients were used on 8192 data points, and 124 signal roots were found to lie outside the unit circle. To improve the rejection of extraneous signals, only those with intensity greater than three times the corresponding standard deviations were selected, and signals having phases not compatible with neighbouring signals were also eliminated. The

results shown by Gesmar and Led (Fig. 18) demonstrate the superiority of their LP approach over DFT. The LP spectrum has good resolution, and particularly noticeable is the distinguishability of signals which in the DFT spectrum lie beneath the noise. The only drawback seems to be the enormous computing time needed. Barone et al.(88) have used the straightforward Prony method, with Burg's procedure for the prediction coefficients, to analyse 31P signals with good and moderately good signal-to-noise ratios. For high signal-to-noise the agreement between DFT and Prony is excellent. For moderately good signal-to-noise the Prony method yields a better-looking spectrum than DFT whilst at the same time returning all the relevant features. A final test of the Prony method was made by Barone et al. on a single line spectrum at various signal-to-noise

ratios. As this ratio decreases from 24:1 down to 2:1 the line position remains unaltered, but the line intensity and particularly the linewidth are degraded. LPSVD, HSVD and their siblings have the advantage that the spectrum may be presented in tabular form. This enables the extraction of Lorentzian lines having vastly different linewidths and intensities, as shown by Barkhuijsen et al.,(75) and is ideally suited to allow further processing of the spectrum by a program designed to find correlations in lists of peak positions. However, this method does have the drawback that the assumption of Lorentzian lines together with a non-Lorentzian experimental lineshape leads to numerous components having differing phases and intensities for each spectral feature. The main thrust of Prony's method is the determination firstly of the frequencies and damping factors, followed by their usage to calculate the amplitudes {b_m}. If the frequencies and damping factors are known beforehand, then the least-squares eqn (64) may be used to find b from the data d, since the matrix Z is also known. This has been put to use by Noorbehesht et al.,(104) who have analysed the 31P spectra of a rabbit. The frequencies and linewidths of lines corresponding to known species were available, and the least-squares fit allowed reliable estimates of the amplitudes at low signal-to-noise ratios. Of course, this is a partial spectral analysis only, since significant assumptions must be made about the position and shape of the spectral lines. However, within these limitations, the method does seem to be of value in clinical applications.
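This kind of partial analysis is easy to sketch: with the poles fixed from prior knowledge, eqn (64) is linear in the amplitudes. The function below is an illustration only; its name, parameters and damping convention are mine, not taken from the cited work.

```python
import numpy as np

def fit_amplitudes(d, freqs, dampings, dt):
    """Least-squares amplitudes for known frequencies and damping factors:
    d_k ~ sum_m b_m z_m^k, with the poles z_m fixed in advance (eqn (64))."""
    z = np.exp((-np.asarray(dampings) + 2j * np.pi * np.asarray(freqs)) * dt)
    Z = z[np.newaxis, :] ** np.arange(len(d))[:, np.newaxis]
    b, *_ = np.linalg.lstsq(Z, d, rcond=None)
    return b
```

For a noiseless two-line test signal the amplitudes are recovered exactly; with noise the fit degrades gracefully, which is the behaviour reported for the in vivo data.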

6.3. Extension to Two Dimensions

Because the 2D Fourier transform is separable (Section 4.1), the spectral analysis may proceed in two distinct stages and there is no need to employ the same method in both. Specifically, since the t2 dimension usually has sufficient points for good resolution, a 1D DFT could be applied here and a high resolution modelling method used afterwards in the t1 dimension. This leads to the concept of a "hybrid 2D spectral estimator", for which the theory of the preceding sections is adequate. An alternative approach would be to employ a high resolution method initially in t1 and to use the results to extrapolate

the time data in t1 to such an extent that normal 2D Fourier transformation could be applied. This extrapolation is most probably a better alternative to the usual zero-filling. Hoch has applied the first method, i.e. 1D DFT in t2 followed by high resolution estimation in t1, to COSY data of tryptophan in D2O. He chose Burg's algorithm, yielding a power spectrum, to process 256 complex points in the t1 domain obtained after Fourier transforming the t2 domain with a data length of 1024; the prediction order was set at 100. This hybrid spectrum was compared with 2D Fourier transformation using 256 or 512 points in the t1 domain, and all spectra were symmetrised. The hybrid spectrum shows superior resolution, superior noise rejection and is devoid of the sidelobe artefacts

which are present in the 256x1024 2D DFT spectrum. In particular, the broad tail associated with the residual water signal is successfully suppressed by this method. Schussheim and Cowburn have employed LPSVD(75) as the high resolution estimator in the t1 domain, having first used DFT in t2. This application of the extended Prony method yields lists of signal parameters {b_m} and {z_m} for each value of f2, and these lists can be scanned for correlations with the purpose of selecting only the "true" signals and avoiding artefacts. These authors demonstrate the improved resolution and elimination of truncation artefacts available with LPSVD. Furthermore, the phase problems associated with t1 phase modulation may be eliminated.
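The hybrid idea can be sketched as follows: an ordinary DFT along t2, then a linear-prediction extrapolation of each t1 interferogram before the second DFT (the extrapolation variant mentioned above). Plain least-squares forward prediction stands in for Burg's algorithm here, and the function name and all sizes are illustrative assumptions.

```python
import numpy as np

def hybrid_2d(data, order, nextend):
    """Hybrid 2D estimate for a (t1, t2) array of complex points:
    DFT along t2, LP extrapolation along t1, then DFT along t1."""
    f2 = np.fft.fft(data, axis=1)                 # ordinary DFT in t2
    n1, n2 = f2.shape
    out = np.empty((n1 + nextend, n2), dtype=complex)
    for col in range(n2):
        y = f2[:, col]
        # forward LP coefficients from least squares (Burg's algorithm
        # or the covariance method could be substituted here)
        A = np.array([y[k - order:k][::-1] for k in range(order, n1)])
        a, *_ = np.linalg.lstsq(A, y[order:], rcond=None)
        ext = list(y)
        for _ in range(nextend):                  # extrapolate the t1 data
            ext.append(np.dot(a, np.array(ext[-order:][::-1])))
        out[:, col] = ext
    return np.fft.fft(out, axis=0)                # DFT in the extended t1
```

For a single 2D exponential truncated in t1, the extrapolated spectrum places the peak at the expected frequency bin in both dimensions without the truncation sidelobes of straight zero-filling.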

It would also be possible to employ a high resolution estimator in both dimensions, and this has been done by Barkhuijsen et al. for 2D electron spin echo data. However, this still makes use of the separability of the 2D Fourier transform, and is a far cry from a full 2D LP analysis, which has not yet been attempted in NMR. Details of 2D spectral estimation in general are given in the article by McClellan(106) and in the standard text of Dudgeon and Mersereau, as well as by Marple, who shows that 2D Yule-Walker equations may be set up and solved using a fast algorithm. Furthermore, the covariance and modified covariance least-squares methods of Section 6.1.5 may also be extended to two dimensions.(108) Functional modelling in the sense of Prony's method has also been suggested(109) for 2D spectral estimation. Manassen et al. have proposed a method similar to that of Noorbehesht et al.(104) for dealing with 2D spectra. The amplitudes b are to be found by solving the least-squares eqn (64) in the t1 dimension, after having determined the frequencies and damping factors in the t2 dimension by Fourier transformation and spectral fitting. Cross sections perpendicular to peaks of interest in the one-dimensional spectrum may then be calculated from a reduced data set to yield spectra similar in quality to those obtained by 2D FT.

7. SPECTRAL AND TEMPORAL DISCRIMINATING FUNCTIONS

With the methods of the previous section we sought to parameterise the time signal in order to convert the large number of individual data into a handy amount which contains all the relevant information. This compressed form could then be used to determine the spectrum. In this section I consider methods which use a basically different approach to seek a unique spectrum {X_j}, j = 0, ..., J - 1, which is the frequency domain representation of the data {d_k}, k = 0, ..., K - 1. The fundamental idea is to select that spectrum which best fits the data. This apparently simple idea is fraught with dangers, for our problem now lies with the definition of "best". One difficulty is emphasised by my choice of "X" for the spectrum, which should be noiseless, and of "d" for the data, which are noise corrupted.

The ideal spectrum should be noiseless even if the data are noisy. Furthermore, the resolution of the spectrum, represented ultimately by the number of spectral points J, should not be limited by the length of the data K, so that truncation artefacts of the type demonstrated in Section 4.2 should not be evident. This means that J is not necessarily equal to K. Let us start by considering the type of spectrum we want.

Normally the DFT produces a complex spectrum whose real and imaginary parts are combined by phase correction to give a real spectrum. If we take the real spectrum as our goal, the time series corresponding to {X_j} is

x_k = Σ_{j=0}^{J-1} X_j exp(i2πjk/J) ,   (72)

where the index j is summed from 0 to J - 1 because there are J spectral points, and where the factor 1/J (or 1/K in eqn (14)) has been dropped for convenience. Since {X_j} is real, whereas {x_k} is in general complex, we have the relationship

x_k = (x_{J-k})* ,   (73)

as

pointed out in Section 4.2. Thus there are only J/2 + 1 independent values of x_k in the time domain: the two real values x_0 and x_{J/2}, together with the J/2 - 1 complex values x_k for k = 1, ..., J/2 - 1. How do we compare the {x_k}, k = 0, ..., J/2, with the measured data {d_k}, k = 0, ..., K - 1? (i) The first case to consider is where the data are noiseless and where K ≥ J/2 + 1. Obviously the problem is either overdetermined or exactly determined and only the values of d_k from k = 0 to J/2 need be used, and since these data are noiseless the best procedure is to set x_k equal to d_k for k = 0, ..., J/2.

If this is done then we can see that

x_k = d_k = Σ_{j=0}^{J-1} X_j exp(i2πjk/J) ,   k = 0, ..., J/2 ,

and thus by using the orthogonality relations eqn (13) and the Hermiticity eqn (73) we arrive at

X_j = (1/J) { d_0 + (-1)^j d_{J/2} + 2Re[ Σ_{k=1}^{J/2-1} d_k exp(-i2πjk/J) ] } ,   j = 0, ..., J - 1 .   (74)
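Equation (74) can be verified numerically: starting from an arbitrary real spectrum {X_j}, the J/2 + 1 time-domain values x_0, ..., x_{J/2} do indeed determine all J real spectral points. A sketch (the toy size J = 16 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 16
X = rng.standard_normal(J)                  # an arbitrary real spectrum {X_j}
jj = np.arange(J)
# eqn (72): x_k = sum_j X_j exp(i 2 pi j k / J)
x = np.array([np.sum(X * np.exp(2j * np.pi * jj * k / J)) for k in range(J)])
# eqn (73), Hermiticity: x_k = (x_{J-k})*
assert np.allclose(x[1:], np.conj(x[1:][::-1]))
# eqn (74): recover all J real points from x_0 .. x_{J/2} only
d = x[:J // 2 + 1]
X_rec = (d[0].real + (-1.0) ** jj * d[J // 2].real
         + 2.0 * sum((d[k] * np.exp(-2j * np.pi * jj * k / J)).real
                     for k in range(1, J // 2))) / J
assert np.allclose(X_rec, X)
```

The same result is obtained in practice by zero filling the J/2 complex data points to length J and taking a J-point DFT, which is the point made in the next paragraph.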

This little exercise in formula fiddling has shown that the real spectral elements X_j are fully determined by the data and, furthermore, that we can obtain J real spectral points from J/2 complex data points by zero filling to a data length of J and using a J-point DFT. This should be no surprise, since it is the standard everyday tool of NMR spectroscopists; J real pieces of data in the time domain determine J real pieces of data in the frequency domain. (ii) Things get a bit more interesting if we consider a case where J/2 + 1 > K, for now the vector {x_k}, k = 0, ..., J/2, is longer than the vector {d_k}, k = 0, ..., K - 1. Certainly we can compare the two vectors up to the element x_{K-1}, but there are no measured elements d_k for k > K - 1, so eqn (74) cannot be applied. The spectrum is underdetermined and an infinity of spectra will all equally well fit the data x_k = d_k for k = 0, ..., K - 1. These differ only in their values of x_k for k ≥ K. Let us look at the situation where the spectrum


contains one extra point, i.e. J/2 = K. In a very general manner we can think of the compatible spectra as corresponding to the points on a line in "frequency space", shown in Fig. 19a. Obviously we must select just one of the spectra as representative, and the question is: how? The solution is to apply a discriminating function D(X_j) and to select that spectrum which corresponds to a maximum or minimum of D along the line of compatible spectra. The function must be chosen with care, but many possibilities are available. This situation may be generalised to cases where J/2 = K + n, n > 0, and the above considerations remain valid.

(iii) Another case of importance is when J/2 + 1 = K but the data are no longer noiseless. If a spectrum is sought which fits the data exactly, then the noise in the time domain will be carried over into the frequency domain. A better approach is to seek a spectrum which fits the data within the expectation value of the noise. This can be expressed easily in terms of a fit value

C = Σ_{k=0}^{K-1} |x_k - d_k|² ,   (75)

and C should be Kσ², where σ is the standard deviation of the noise. This problem is also underdetermined, and the range of compatible spectra is illustrated in Fig. 19b as a closed curve in frequency space.


The hatched area within the curve corresponds to spectra which fit the data better than would be expected taking the noise into account, i.e. the noise is, for these spectra, carried over from the time domain to the frequency domain, one of these spectra being an exact fit. But from those spectra which lie on the curve one must again be selected, and the procedure is the same: to choose that spectrum with a maximum or minimum of some discriminating function D(X_j) along the curve. (iv) A final case obtains when both J/2 + 1 > K and the data are noisy. This is illustrated in Fig. 19c, where the line AB has now expanded to a "sausage". The filling


corresponds to spectra having C < Kσ², and the skin to the set of spectra from which we have to make a choice by application of the discriminating function. In the above three cases illustrating underdetermination of the spectrum, the discriminating function was applied in the frequency domain. But the selection procedure can just as well be carried out in the time domain, since {X_j} and {x_k} are related by the Fourier transform. Applying such a discriminating function D(x_k) leads us to a situation analogous

to the above, and the pictures in Fig. 19 are equally valid. In general the method of approach is clear: we firstly introduce sloppiness, or to be more precise, we accept that our experiment is imperfect, and then, to ameliorate the consequences of having too many solutions, we choose just one by seeking the extremum of a discriminating function in the frequency or time domain.

7.1. Maximum Entropy

I start the discussion of discriminating functions by considering that function which has been most widely published. In order to do justice to this concept of entropy it is necessary to apply some thought to the meanings of probability and uncertainty, a rather nebulous area where philosophy and mathematics merge. As a handy guide through the morass I recommend Papoulis' monograph,(6) which contains an excellent treatment of entropy. The various works of Jaynes, particularly references 111, 112 and 113, are also of inestimable value. There are two forms of the Maximum Entropy (ME) method currently in use. The one which I regard as "true" ME, and which uses

the entropy of the spectrum as the discriminating function, will be dealt with first. The second, which uses the entropy rate of the time series, leads to equations identical to those of the AR/LP method in 1D spectral analysis, and I include it merely to clear up the confusion on this point.

7.1.1. Probability Density of One Random Variable

In this section and in Section 7.1.2 I shall deal with the development of an entropy discriminating function in the frequency domain. Let us consider a random variable x which can assume any one of a set of numbers as a result of the outcome of an experiment. The standard introductory example is the throwing of a six-faced die, for which only six outcomes are possible, and for each outcome we can assign to x the number of dots showing on the uppermost face. If we are gambling with such a die it is vital for us to estimate which face will come up next. We need to know the probabilities associated with each of the six possible outcomes of one throw, for we would then surely bet on the outcome with the highest probability.

FIG. 20. Probability densities for the random variable x, which can take on the values 1 to 6, the number of dots showing on the die. (a) The probability that x = 5 is almost one; the associated entropy is close to zero (logarithms to base e). (b) The probability that x = 5 is almost zero. (c) All probabilities are equal; the entropy reaches its maximum value of log 6.

The probabilities could be estimated by throwing the die N times and seeing how many times N_i a particular face i appeared. The probability p_i is then approximated by p_i = N_i/N. This is a rather laborious procedure, and is inappropriate in many cases. For instance, if the random variable is no longer the number of dots on a die but is the frequency ν_j in the experiment of Section 2, and if X_j is interpreted as the probability of occurrence of this frequency, that is we regard Fig. 5a as a probability density, then the corresponding experiment is a discrete CW measurement of the individual X_j. Let us, however, remain with our six-faced die for the moment. Our problem in determining the probabilities lies in the uncertainty about the occurrence of a future event, for instance the turning up of five dots on throwing the die once. If p_5 = 0.999 and the other values of p_i are almost zero (Σp_i = 1), then we are almost certain that the five will appear; the probability density for this situation is shown in Fig. 20a. If p_5 = 0.001 (Fig. 20b) we are almost certain that this face will not appear. If, however, p_5 = 1/6 then we don't know either way and our uncertainty about the outcome of the experiment is large. Indeed, if all the probabilities are equal to 1/6 then our uncertainty takes its greatest value.

It is possible, though, to quantify the uncertainty represented by a probability density for one random variable, and this leads to the concept of Entropy. Equally well, a probability density could be derived from a knowledge of our uncertainty as to the outcome of an experiment. The classical way to obtain such a probability density is to employ the Principle of Insufficient Reason,(6,115) attributed to Bernoulli (1713), which says more or less that if we have no reason to prefer one outcome to another then the same probability must be assigned to each. When applied to our die experiment this means that p_1 = p_2 = ... = p_6 = 1/6, i.e. we assume that all outcomes are equally likely. The corresponding probability density is shown in Fig. 20c. This approach is intuitively appealing and is confirmed, at least for dice, by frequency analysis. This principle, however, does not allow us to determine a probability density whilst including relevant information in the form of constraints. To do this we need a functional measure of the uncertainty inherent in a probability density. Shannon(114,115) provided three conditions(116) which any such measure S should reasonably be expected to fulfil, and he showed that the only S which satisfied all three conditions was given by

S = -F Σ_{i=1}^{N} p_i log p_i ,   (76)

where F is an arbitrary factor. Shannon's conditions are not unique, but any set of reasonable conditions(117) will lead to a logarithmic form for S, which had first been propounded by Hartley(118) as a measure of information.
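Eqn (76) is easily explored numerically; taking F = 1 and natural logarithms, the densities below mirror those of Fig. 20.

```python
import numpy as np

def entropy(p):
    """Shannon entropy, eqn (76), with F = 1 and natural logarithms.
    Terms with p_i = 0 contribute nothing (x log x -> 0)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_certain = [0.001, 0.001, 0.001, 0.001, 0.995, 0.001]  # like Fig. 20a
p_uniform = [1 / 6] * 6                                  # like Fig. 20c
print(entropy(p_certain))   # small: the outcome is nearly certain
print(entropy(p_uniform))   # log 6, the maximum for six outcomes
```

The uniform density attains the maximum log 6, and any peaked density scores lower, which is the quantitative content of the die discussion above.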

The relationship between uncertainty and information is superficially simple. Before doing the experiment, for instance throwing a six-faced die, the uncertainty in the result is given by the logarithmic measure S. After the experiment is complete and the one outcome is obtained, the gain in information is given by S. This is particularly easy to appreciate for the probability density of Fig. 20a: here the result is almost certain, so S is nearly zero, showing both a low uncertainty before the experiment and a low gain in information upon completion. This topic, the relation between uncertainty and information, which has given rise to considerable discussion, is treated in greater depth by Brillouin,(119) Jaynes, and also by Denbigh and Denbigh.(120) Irrespective, however, of whether S is regarded as a measure of uncertainty or a measure of information, it is beyond doubt that any probability density has associated with it a statistic S given by eqn (76).


Shannon named this measure of uncertainty entropy, reputedly at the instigation of von Neumann,(120,121) and because of the similarity of eqn (76) to the classical entropy of statistical mechanics. Now that we have a measure of uncertainty S, let us see how it can be exploited. The fundamentals were laid down by Jaynes(116,122) in 1957 and have come to be known as Jaynes' Principle. The idea inherent in this principle is relatively simple: namely that the least biased estimate of the probabilities p_i is given by that probability density having the greatest S subject to any constraints on the p_i which are appropriate. This leads to an estimate which is maximally non-committal with regard to missing information, and since the function which is maximised is S it is also known as the Maximum Entropy (ME) Principle. If we apply this principle to the example of our six-faced die, including only the constraint that the sum of the probabilities is unity, the maximum entropy is attained when all the p_i are equal to 1/6, which is what we achieved by using the Principle of Insufficient Reason. In this case nothing has been gained; conceptually the Principle of Maximum Entropy is equivalent to the classical method, but operationally the analysis is drastically simplified, since the constrained maximisation of a well defined function lends itself to elegant computer implementation. I shall delay until later a justification of the use of ME, and in the next section sketch an

easily appreciable rationalisation of the Maximum Entropy Principle which was used by Frieden to derive a method for improving the resolution of blurred optical images.

7.1.2. Frieden's Entropy

In 1972, Frieden(123) introduced the idea of restoring optical objects from blurred and noisy images by employing the spatial entropy of the object as the discriminating function. In the following I shall reproduce his arguments translated into the language of NMR.

The object, or in our case the spectrum {X_j}, j = 0, ..., J - 1, is considered as an array of intensities associated with the frequencies ν_j = jΔν, where Δν is the desired resolution in the final spectrum. This corresponds to the description of the NMR experiment given in Section 2. The next step taken by Frieden was to assume that the total intensity in the spectrum (object) could be divided into a very large number of intensity units ΔX, such that each element X_j was represented as X_j = n_j ΔX, where n_j is an integer. Furthermore, the total intensity was assumed to be constant, i.e. Σn_j = N >> 1. Frieden posed the question as to how the total intensity is distributed amongst the X_j, and his answer was couched in the formalism of statistical mechanics. If we denote p_j as the probability of any one intensity unit ΔX being located at frequency ν_j, then for a very large number N we get p_j = n_j/N. Assuming no correlation between the elements X_j, the number of ways W_X in which a particular spectrum {X_j} can occur is(121)

W_X(n_0, n_1, ..., n_{J-1}) = N!/(n_0! n_1! ... n_{J-1}!) .   (77)

Taking the natural logarithm of this and employing Stirling's approximation(74) leads to

(1/N) log W_X = -Σ_{j=0}^{J-1} (n_j/N) log(n_j/N) = -Σ_{j=0}^{J-1} p_j log p_j .   (78)
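The Stirling step from eqn (77) to eqn (78) can be checked directly with exact log-factorials; the occupation vector below is a toy choice, and `math.lgamma(n + 1)` supplies log n! exactly.

```python
import math

n = [400, 1200, 2600, 800]          # toy occupation numbers n_j
N = sum(n)
# eqn (77) exactly: log W_X = log N! - sum_j log n_j!
logW = math.lgamma(N + 1) - sum(math.lgamma(nj + 1) for nj in n)
# eqn (78): the Shannon form with p_j = n_j / N
S = -sum((nj / N) * math.log(nj / N) for nj in n)
print(logW / N, S)                  # agree to a few parts per thousand
```

The agreement tightens as N grows, which is why the identification of log-multiplicity with entropy is safe in the limit N >> 1 assumed above.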

Linear prediction and maximum entropy This is Shannon’s

form of the entropy of a probability

573

density introduced

in the last

section and we can see that the entropy is related to the degeneracy of a distribution. Thus distributions having a higher entropy have a higher multiplicity and can be real&d in more ways, and this is a justification

for expecting them to be more likely to be observed,(**‘) i.e.

more probable. Frieden@) converted eqn (78) to J-1

log W,

-

-u

~ j-0

Xj log Xj + B

(79)

where the constants α and β are given by α = 1/ΔX and β = N log(NΔX), and then he applied the Principle of Maximum Entropy by seeking that solution which maximised the summation in eqn (79). But he also went one step further and took into account the noise e_k present in the measured data d_k = x_k + e_k by applying the ME method to determine the most probable noise vector {e_k}. This can be done using the foregoing treatment, but allowance must be made for the non-positivity of this noise, since the logarithmic formulae used so far apply only to positive quantities.

By adding a bias B to all noise values it can be ensured that all E_k = e_k + B are greater than zero. If, furthermore, a noise unit ΔE is introduced, then the noise E_k at each measured data point can be expressed as E_k = n_k ΔE, and the number of ways of realising a particular noise distribution is

W_E = N!/(n_0! n_1! ... n_{K-1}!) ,

where N = Σn_k. This leads, in an analogous fashion to the above, to

log W_E = -α' Σ_{k=0}^{K-1} E_k log E_k + β'   (α', β' constant).

The number of ways of realising a particular spectrum which fits the data with a particular noise distribution is thus given by W_X W_E, and thus the best solution to the problem is to maximise this subject to the constraints

d_k = x_k + e_k ,   k = 0, ..., K - 1 ,   and   Σ_{j=0}^{J-1} X_j = Y .   (80)

The standard way to achieve this is to use the method of Lagrange multipliers, by setting the sum

-Σ_{j=0}^{J-1} X_j log X_j - ρ Σ_{k=0}^{K-1} E_k log E_k - Σ_{k=0}^{K-1} λ_k (x_k + E_k - B - d_k) - μ( Σ_{j=0}^{J-1} X_j - Y )

to a maximum, where

x_k = Σ_{j=0}^{J-1} X_j exp(i2πjk/J) ,   (81)

and ρ, μ and {λ_k} represent the undetermined multipliers. The spectral solution to this problem is found by differentiation with respect to X_j, thus

X_j = exp[ -1 - μ - Σ_{k=0}^{K-1} λ_k exp(i2πjk/J) ] ,   (82)

and the undetermined multipliers are obtained by substituting eqn (82) into eqn (80).

Frieden(123) solved the resulting non-linear equations using a Newton-Raphson method to determine the unknowns {λ_k} and μ, and found convergence within 40 iterations. But the problem becomes rapidly intractable as K increases. A final point to note is that this formulation can be applied to one-, two-, or indeed n-dimensional data without any restrictions.

multipliers can, however,

ek are zero-mean

random

The exact fitting of the data using eqn (82) be avoided if the assumption is made that

the

errors

ew

(80) may be replaced by the single constraint

variables

with variance

ok2, then the constraints

K-l

1 bk-dk12/ok2 -

K

k-0

and the problem reduces(125s’26) to finding the maximum J-1

K-l

j=O

k-0

of

(84) with only one Lagrange

multiplier A.

If the differential of Q(A) with respect to Xj is taken

we obtain aQ/aXj

= q

=

-1OgXj

1 -2~Re[~‘~

-

axk*/axj

I

(85)

k-0

which,

for a maximum,

must be zero.

iterative technique to solve this problem: - 0 simplifies to Fj

Xj - up

-1

Gull and Daniell(126) now applied a rather neat since xk is given by eqn (72) then eqn (85) with

K-1 (Xk- dk)

-2ARe[x

7

exp( - i2ttjk/J)

(86)

k-0

where the multiplier A must be positive. For any given dk, ok and a this equation determines the vector {Xj} which is found iteratively by starting with a uniform spectrum and determining the xk using eqn (81). These xk are then inserted into eqn (86) (the summation can be done using the FFT by setting (xk - dk) - 0 for k 2 K) to obtain a new vector . Unfortunately eqn (86) provides the solution as the exponent of the discrepancy between xk and dk and as such is numerically unstable. Solutions can be obtained only by averaging successive iterates, a Ezct noted not only by Gull and Daniell but also by Collins.(‘*‘) However, the self-consistent approach is not the only way to determine {xj}. Since we are seeking a solution in which Fj is zero, this is equivalent to finding a root of eqn (85) and we can apply Newton’s method (“1. Xj and by employing eqn (72) we get K-l

aFjiaxj- - llXj

-2A

1

k-0

ok-2

This involves taking the differential with respect to

Linear prediction and maximum

entropy

which shows that for a positive A eqn (85) will have one unique root.

575
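The self-consistent loop of eqn (86) can be sketched numerically. The following toy implementation is only a sketch under stated assumptions, none of which come from the original text: eqn (72) is taken to be a plain DFT, the noise variances are uniform and absorbed into a fixed multiplier λ, the exponent is clipped as an extra guard against the instability described above, and all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
J, K, sigma = 64, 32, 0.05

# Synthesise a positive "true" spectrum and its first K time-domain samples.
X_true = np.full(J, 0.01)
X_true[[10, 25]] = 1.0
d = np.fft.fft(X_true)[:K] + sigma * (rng.standard_normal(K) + 1j * rng.standard_normal(K))

lam = 0.01                        # Lagrange multiplier, held fixed in this sketch
X = np.full(J, d[0].real / J)     # start from a uniform spectrum
for _ in range(100):
    x = np.fft.fft(X)                          # stand-in for eqn (72)
    r = np.zeros(J, dtype=complex)
    r[:K] = x[:K] - d                          # zero-padded residual, so one FFT does the sum
    grad = 2.0 * np.real(J * np.fft.ifft(r))   # the Re[...] term of eqn (86), variances absorbed
    # eqn (86); clip the exponent and average iterates to damp the instability
    X_new = np.exp(np.clip(-1.0 - lam * grad, -20.0, 1.0))
    X = 0.5 * (X + X_new)

print(X.argmax(), X.min() > 0)
```

In practice λ would be increased until the constraint eqn (83) is satisfied; here it is simply held fixed to illustrate the iteration.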

If the differentials ∂F_j/∂X_i for i ≠ j are neglected, then at any particular value of λ the iteration relates the new element X_j^n to the old element X_j^o by the equation

    X_j^n = X_j^o - F_j / (\partial F_j/\partial X_j).

Usually the differentials ∂F_j/∂X_i are not negligible and a suitable combination of damped Newton iterations and regula falsi is needed. Termination may be based on the size of

the corrections X_j^n − X_j^o. The procedure is similar to the self-consistent method in that a solution is found initially for a small value of λ and then λ is increased until the criterion eqn (83) is satisfied. This method is stable and fairly fast, but stability and speed do not always go hand-in-hand.

7.1.2.2. Skilling and Bryan's Implementation. An alternative to the foregoing would be to maximise Q by a direct search. This method has been applied to a similar problem by Wernecke and d'Addario(128) who used the method of conjugate gradients. However, any search algorithm will be confronted by immense problems in dealing with NMR data, for we are seeking the maximum of a function in a vector space of dimension J where J may be several million. Even the conjugate gradient method cannot help us here, for this requires the storage (not the inversion) of a J×J Hessian matrix ∇∇Q. The solution is to restrict our search to a subspace, and to choose the subspace with care. This idea, introduced by Skilling and Bryan(129) to counteract the problems associated with the self-consistent method, is very successful in practice. Since the aim is to maximise the entropy and at the same time to bring the fit given by eqn (75) to a selected value, these authors reasoned that the best search directions would be the gradients of these measures. Therefore in their general algorithm they chose a set of base vectors starting with ∇S and ∇C, the gradients of the entropy and the fit respectively, and expanded the set to include products of the form ∇∇S·∇S, ∇∇S·∇C, ∇∇C·∇S etc. to introduce greater flexibility.

S and C are defined by

    S = -\sum_{j=0}^{J-1} X_j\,(\log(X_j/A) - 1)        (87)

and

    C = \sum_{k=0}^{K-1} |x_k - d_k|^2        (88)

where the equation for S has been modified(129) to include the constant A, which can be regarded as the default level for X obtained by maximising S without the constraint operation. This can be seen by differentiating the modified entropy eqn (87) to give

    \partial S/\partial X_j = \log(A) - \log(X_j)        (89)
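A quick numerical check of eqns (87) and (89), using hypothetical values of A and X_j (a sketch only, not part of the original algorithm): the gradient of the modified entropy is log(A) − log(X_j), so its unconstrained maximum sits at X_j = A.

```python
import numpy as np

# Modified entropy of eqn (87) with default level A.
A = 0.5
S = lambda X: -np.sum(X * (np.log(X / A) - 1.0))

X = np.array([0.2, 0.5, 1.3])
h = 1e-6
num_grad = np.array([(S(X + h * e) - S(X - h * e)) / (2 * h) for e in np.eye(3)])
ana_grad = np.log(A) - np.log(X)            # eqn (89)

print(np.allclose(num_grad, ana_grad, atol=1e-5))   # True
print(S(np.full(3, A)) >= S(X))                      # True: maximum at X_j = A
```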

showing that X_j = A for a maximum of S. This approach allows a further improvement, for the original idea had been to maximise Q eqn (84) for a particular value of λ and then change λ and remaximise Q. Thus the fit C(λ) could be brought into agreement with the expectation value of the noise in eqn (83). However, by separating eqn (84) into its components it is possible to find a maximum of S subject to C = C_aim, where C_aim is a target value of the fit at each iteration. But before I describe this approach let me go back to cover what Skilling and Bryan(129) call 'the single most important key to the development of a robust algorithm', that is the idea of scaling

the spectral space. The two terms in eqn (84) tend to operate against one another, and particular difficulty is encountered with the fit C. The spectrum is necessarily positive because of the logarithmic form of S, whereas a good fit can always be found by setting elements of the spectrum to values less than zero. This can easily be seen from the form of the Fourier transform of the data, where each positive peak is flanked by sidelobes which are alternately negative and positive. A change in the spectrum along the gradient of the fit ∇C thus normally tends to make small values of X_j negative, and this must be prevented by setting such values to a very small positive value (essentially zero). But such action slows down the progress of the search tremendously, and a better approach would be to replace this ad hoc limitation by a method which allows large values of X_j to change more than small values. Just this effect may be achieved by locally scaling the space in such a way that the elements of the gradient ∇C_j are changed to (∇C_j)' = X_j·∇C_j. This is equivalent to transforming to a new space with a metric(74,130) given by g_ij = δ_ij/X_j. Once this has been done the new gradients (∇S)' and (∇C)' together with their transformed partners (∇∇C)'·(∇S)', (∇∇C)'·(∇C)' etc. and combinations thereof

can be used as the basis vectors e_j of a subspace in which the maximum of Q is sought.

The problem is simplified if the new subspace is made Cartesian by orthonormalising the basis vectors e_j such that e_i·e_j = δ_ij. In this subspace the functions S and C at a point q may be expanded in Taylor series about the origin to give

    S(q) = S(0) + S_\mu q_\mu + \tfrac{1}{2} S_{\mu\nu} q_\mu q_\nu + \ldots        (90)

    C(q) = C(0) + C_\mu q_\mu + \tfrac{1}{2} C_{\mu\nu} q_\mu q_\nu + \ldots        (91)

where the indices μ,ν run over the number of search directions (i.e. the dimensionality of the subspace) and I have written them as subscripts in spite of the convention for co- and contra-gredient components.(74,130) S_μ and C_μ are merely the derivatives of S and C respectively along e_μ, i.e. e_μ·∇S and e_μ·∇C, and S_μν and C_μν are the second derivatives with respect to the basis vectors, i.e. S_μν = e_μ·∇∇S·e_ν and C_μν = e_μ·∇∇C·e_ν. But in the scaled space the underlying metric g_ij = δ_ij/X_j is just minus the second derivative of the entropy, −∇∇S, so that eqn (90) reduces to

    S(q) = S(0) + S_\mu q_\mu - \tfrac{1}{2} q_\mu q_\mu.        (92)

By diagonalising C_μν, eqn (91) can also be simplified to give

    C(q) = C(0) + C_\mu q_\mu + \tfrac{1}{2}\gamma_\mu q_\mu q_\mu        (93)

where the γ_μ are the eigenvalues of C_μν and the rotation in the orthonormal subspace leaves eqn (92) unaltered. We are now set to look for that point q which has maximum entropy S and for which C = C_target, the desired final value of the fit, which would normally be chosen in accordance with eqn (83) to equal σ_k²K. This point might, however, not be contained in the subspace, and the best we can do is take a step towards the minimum of C in the subspace, which is found using eqn (93) to have a fit value given by

    C_{\min} = C(0) - \tfrac{1}{2}\,\gamma_\mu^{-1} C_\mu C_\mu.

The strategy invoked by Skilling and Bryan is to aim for a value of the fit C_aim between C_min and C_target, a reasonable value being

    C_{\mathrm{aim}} = \max(2C_{\min}/3 + C(0)/3,\; C_{\mathrm{target}}).
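The bookkeeping of one such subspace step can be sketched as follows. This is only an illustration under assumptions that are not in the text: the data operator is taken to be a plain truncated DFT, the fit is the unweighted sum of squared residuals, the subspace uses just the two metric-scaled gradients, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
J, K, A = 32, 12, 0.1
F = np.exp(-2.0j * np.pi * np.outer(np.arange(K), np.arange(J)) / J)  # stand-in data operator
d = F @ rng.uniform(0.05, 1.0, J)          # synthetic noiseless data, so C_target = 0
X = rng.uniform(0.1, 0.5, J)               # current positive spectrum

r = F @ X - d
C0 = np.sum(np.abs(r) ** 2)                # fit C at the current point
gS = np.log(A) - np.log(X)                 # entropy gradient, eqn (89)
gC = 2 * np.real(F.conj().T @ r)           # fit gradient

# Entropy-metric scaling (grad -> X * grad), then Gram-Schmidt orthonormalisation.
e1 = X * gS
e1 = e1 / np.linalg.norm(e1)
e2 = X * gC
e2 = e2 - (e2 @ e1) * e1
e2 = e2 / np.linalg.norm(e2)
E = np.vstack([e1, e2])                    # 2 x J basis of the search subspace

H = 2 * np.real(F.conj().T @ F)            # (constant) Hessian of C for this linear model
C_mu = E @ gC                              # first derivatives of C along e_mu
C_munu = E @ H @ E.T                       # second derivatives, as in eqn (91)
gam, U = np.linalg.eigh(C_munu)            # diagonalise, as for eqn (93)
Cp = U.T @ C_mu
C_min = C0 - 0.5 * np.sum(Cp ** 2 / gam)   # minimum of C within the subspace
C_aim = max(2 * C_min / 3 + C0 / 3, 0.0)   # Skilling-Bryan target for this iteration
print(C_min <= C0, C_min <= C_aim <= C0)
```

Because C is exactly quadratic in this toy model, the subspace minimum C_min always lies between 0 and C(0), and C_aim sits between C_min and C(0) as the text requires.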

What is sought is the maximum of S subject to C = C_aim. By redefining Q as Q = αS − C and using eqns (92) and (93), the maximisation of Q in the subspace results in finding that point q given by

    q_\mu = (\alpha S_\mu - C_\mu)/(\gamma_\mu + \alpha)

where α is chosen to fit C = C_aim by a simple iterative procedure involving eqns (92) and (93) only. For further details of the control procedure of this elegant algorithm the original article(129) should be consulted.

The main feature of the algorithm is a continual forward and backward transformation between time and frequency. Any one iterative step starts with a real spectrum {X_j} for which the entropy may be calculated using eqn (87) and the entropy gradient using eqn (89). This is then followed by transforming {X_j} to the time domain using eqn (72). The complex vector {x_k} is then employed in eqn (88) to determine the fit, and the gradient of the fit may be obtained by substituting eqn (72) into eqn (88) and differentiating with respect to X_j to give

    \partial C/\partial X_j = 2\,\mathrm{Re}\Big[\sum_{k=0}^{K-1} (x_k - d_k)\exp(-i2\pi kj/J)\Big].        (94)

Both transforms (72) and (94) are J-point DFTs, but since in general J is much greater than K, economies can be made by eliminating unnecessary calculations. Further transforms are necessary to determine the second derivative of the fit, and Skilling and Bryan(129) cite six transforms per iteration for a program version using three search directions. The initial spectrum can be a uniform spectrum or a positivised FT of the data {d_k}; it seems not to matter, for the algorithm takes all in its stride and produces a satisfactory result after 10 to 20 iterations.

The termination criterion is based on the angle between ∇S and ∇C. When these vectors are parallel and C = C_target the problem has been solved. Normally the algorithm works to bring C close to C_target and then iterates to maximise the entropy keeping C constant. The degree of non-parallelism can be measured by the quantity(129)

    \mathrm{TEST} = \tfrac{1}{2}\left|\frac{\nabla S}{|\nabla S|} - \frac{\nabla C}{|\nabla C|}\right|^2

which should be less than 0.1 for a good solution.

7.1.3. Probability Density and Entropy of Several Random Variables. In Sections 7.1.1 and 7.1.2 we were concerned with a single random variable x which could take on several values as a result of the outcome of one experiment. The probability with which these values appear, or are to be expected, has been given as the probability density function p(x). Let us now extend our view to include several random variables x_1, x_2, … and try to understand their collective behaviour. Our ultimate goal is the correct treatment of a time series, which may be considered as just such a set of random variables. My purpose here and in Section 7.1.4 is to provide the link between the entropy rate of a time series and its power spectrum. In this section I shall derive an expression for the entropy of several random variables. The path we take may become difficult to follow, but we can ease our journey by making several reasonable assumptions which act as milestones along the way. The most important of these is to suppose that each random variable has a normal distribution. For one such real Gaussian random variable the probability density(8,131) is given by

    p(x) = (2\pi)^{-1/2}\,\sigma^{-1}\exp\{-(x-\mu)^2/2\sigma^2\}.        (95)

The probability density is a function of two parameters alone, the mean μ, which is defined as the expectation value of x, μ = E[x], and the variance σ², which is defined as the expectation value of the square of x − μ, σ² = E[(x − μ)²]. If we extend this to two independent Gaussian random variables x_1 and x_2, then their

joint probability density p(x_1,x_2) is simply the product p(x_1)p(x_2). But if the two variables are not independent, that is if the outcome of the first experiment is correlated with the outcome of the second, then this correlation must be taken into account and their joint probability density is given by(8,74,131)

    p(x_1,x_2) = \Delta\,\exp\Big\{-\frac{1}{2(1-r^2)}\Big[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \frac{2r(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\Big]\Big\}        (96)

where Δ = (2πσ_1σ_2(1−r²)^{1/2})^{-1} and r is the correlation coefficient. The parameters σ_1, σ_2 and r which determine the probability density may conveniently be brought together in the form of a covariance matrix

    C = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & r\sigma_1\sigma_2 \\ r\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}

and eqn (96) rewritten in the form

    p(x_1,x_2) = (2\pi)^{-1}(\det C)^{-1/2}\exp\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}C^{-1}(\mathbf{x}-\boldsymbol{\mu})\}

where the boldface x and μ denote the column vectors (x_1,x_2)^T and (μ_1,μ_2)^T. With this matrix–vector notation it is simple to generalise to K real Gaussian random variables, which yields the joint probability density(8,74,131)

    p(\mathbf{x}) = p(x_1,x_2,\ldots,x_K) = (2\pi)^{-K/2}(\det C_x)^{-1/2}\exp\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}C_x^{-1}(\mathbf{x}-\boldsymbol{\mu})\}.        (97)

In Section 7.1.1 we saw that every probability density has an entropy given by eqn (76) for the discrete case. Extending this to continuous variables we obtain(8)

    S = -\int p(x)\log(p(x))\,dx        (98)

and so we obtain the entropy of one Gaussian random variable by substituting the probability density eqn (95) into eqn (98) to give S = ½ log(2πeσ²).

To determine the entropy of the multivariate probability density of eqn (97), the treatment is simplified by first transforming the general random variables x to zero-mean uncorrelated random variables y by means of the relation

    \mathbf{y} = A(\mathbf{x}-\boldsymbol{\mu})

where A is an orthogonal matrix. The covariance matrix Λ_y of the variables y, which is diagonal with elements Λ_k, is related to the covariance matrix of the variables x by(131) Λ_y = A C_x A^T, and the joint probability density becomes

    p(\mathbf{x}) = (2\pi)^{-K/2}(\det C_x)^{-1/2}\exp(-\tfrac{1}{2}\mathbf{y}^{\mathrm T}\Lambda_y^{-1}\mathbf{y}) = (2\pi)^{-K/2}(\det C_x)^{-1/2}\prod_k \exp(-y_k^2/2\Lambda_k).        (99)

Equation (99) can then be substituted into eqn (98), giving

    S = -F\int\cdots\int \prod_k \exp(-y_k^2/2\Lambda_k)\Big[\log(F) - \sum_k y_k^2/2\Lambda_k\Big]\,dy_0\cdots dy_{K-1}

where F = (2π)^{-K/2}(det C_x)^{-1/2}. Thus

    S = -F\log(F)\prod_k\int\exp(-y_k^2/2\Lambda_k)\,dy_k + F\sum_k\Big[\int \frac{y_k^2}{2\Lambda_k}\exp(-y_k^2/2\Lambda_k)\,dy_k \prod_{l\neq k}\int\exp(-y_l^2/2\Lambda_l)\,dy_l\Big]

      = -F\log(F)\,(2\pi)^{K/2}(\det C_x)^{1/2} + F\,(K/2)\,(2\pi)^{K/2}(\det C_x)^{1/2}

      = \tfrac{1}{2}\log(\det C_x) + (K/2)\log(2\pi e).        (100)

The important thing in eqn (100) is the dependency of the entropy on C_x. In the present context the other term can be considered as merely the reference level for measuring the entropy(39) and may be disregarded. This leads to an expression for the entropy of a real multivariate Gaussian probability density given by

    S = \tfrac{1}{2}\log(\det C_x).        (101)
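As a quick numerical sanity check on the single-variable case used above, the differential entropy −∫p log p of a Gaussian can be integrated on a grid and compared with ½ log(2πeσ²) (σ and μ are arbitrary illustrative values):

```python
import numpy as np

sigma, mu = 1.7, 0.3
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

S_num = -np.sum(p * np.log(p)) * (x[1] - x[0])       # grid approximation of eqn (98)
S_ana = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)  # closed form quoted in the text
print(S_num, S_ana)
```

The two values agree to well within 1e-6, and the mean μ drops out, exactly as the closed form predicts.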

7.1.4. Autocorrelation and Entropy Rate of a Stochastic Process. An important application of the foregoing is in the treatment of time series, which consist of a set of observations made sequentially in time.(39) For instance the discrete time series

    \{x_k\} = \{x_0, x_1, x_2, \ldots, x_{K-1}\}

may represent a set of observations made at equidistant time intervals 0, Δt, 2Δt, …, (K−1)Δt. Such a measured time series is always one particular realisation of a stochastic process(8) (Appendix A) whose individual elements may be treated as random variables. In this section I shall show how the entropy of such a process is related to its power spectrum or spectral density. Several assumptions are necessary; the first is to regard the random variables which comprise the series as being normally distributed. Thus the elements of the series, should they be real, have the multivariate Gaussian distribution of the preceding section, and the entropy of the series is given by eqn (101). This treatment of real Gaussian variables corresponds to that proposed originally by Burg,(38,132) who dealt with real series only. The extension to complex series, which should interest us more, is treated incompletely by Smylie et al.(133) Only recently do other authors(38,134) seem to have realised that complex series are slightly different.

In order to extend real series to complex ones and to employ the formalism introduced by Burg, further assumptions must be made. The aim is to obtain the entropy of a complex series in a form similar to eqn (101). The foundations have been laid by Goodman,(135) who considered multivariate complex Gaussian distributions, and are expounded both by Marple(8) and by Kay.(131) The problem is to deal with the covariance matrix of a complex series, which can be compared to that of a real series only under certain conditions. A complex Gaussian random variable x may in general be considered as the sum

    x = u + iv

where both u and v are real Gaussian random variables. Thus K complex Gaussian random variables consist of 2K real Gaussian random variables, and the K×K complex covariance matrix C^c of x could be represented by the 2K×2K real covariance matrix C^r of u, v taken jointly. That is, the K-element complex vector {x_0, x_1, x_2, …, x_{K−1}} could be represented by the 2K-element real vector {u_0, v_0, u_1, v_1, …, u_{K−1}, v_{K−1}} and the covariance matrix of x treated appropriately. Thus the complex element C^c_{jk} would be given by

    C^c_{jk} = E[x_j x_k^*],

i.e. a 2×2 block of C^r. But the elements of C^c must behave as complex numbers under multiplication, and complex numbers are isomorphic with 2×2 matrices only if these are antisymmetric.(135) Furthermore, the diagonal elements of C^c are real since C^c is Hermitian, so that the 2×2 blocks along the diagonal of C^r must themselves be diagonal. This theoretical discussion may be summarised by saying that for a K-variate complex Gaussian distribution it is necessary that the real and imaginary components of each complex random variable x are independent and have identical variances,(135) which seems not too unreasonable.

When this is so it can be shown(135) that

    \det(C^{\mathrm r}_{2K}) = |\det(C^{\mathrm c}_K)|^2

and eqn (101) may be extended to yield

    S = \log(\det C^{\mathrm c}_K)        (102)

for the entropy of a K-variate complex Gaussian distribution. This value is, of course, dependent on our choice of K, and it would be desirable to have a measure of the uncertainty of a process which is independent of the number of samples. If we are dealing with processes of infinite duration then the value of S in eqn (102) diverges as K tends to infinity; however, the quotient S/K tends to a limit as K → ∞ and can be accepted as the average uncertainty per sample.(8) This limit is called the entropy rate of the process, s̄, thus

    \bar{s} = \lim_{K\to\infty}\frac{1}{K}\,S(x_0,x_1,\ldots,x_{K-1}).

The next step in the argument is to assume that the process we are dealing with is zero-mean. This implies that the covariance matrix is identical with the autocorrelation matrix (Appendix A), and the entropy rate may then be expressed as

    \bar{s} = \lim_{K\to\infty}\frac{1}{K}\log(\det R_x).        (103)

It is now necessary to make the further assumption that the time series is stationary. On the one hand this allows a simplification of R_x, since the autocorrelation coefficients are no longer dependent on the time origin and are functions of one index only. Thus the R_x matrix takes on the particularly simple Toeplitz form

    R_x = \begin{bmatrix} R_0 & R_1 & \cdots & R_{K-1} \\ R_{-1} & R_0 & \cdots & R_{K-2} \\ \vdots & & \ddots & \vdots \\ R_{-(K-1)} & R_{-(K-2)} & \cdots & R_0 \end{bmatrix}

where the elements along each diagonal are identical and R_{−i} = R_i^*. Secondly, the power spectrum or power spectral density P(ν) in which we are ultimately interested is then related to the autocorrelation by the discrete-time Wiener–Khinchin theorem

    P(\nu) = \Delta t \sum_{k=-\infty}^{\infty} R_k \exp(-i2\pi\nu k\Delta t)        (104)

with the inverse transform

    R_k = \int_{-\nu_N}^{\nu_N} P(\nu)\exp(i2\pi\nu k\Delta t)\,d\nu        (105)
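Eqns (104) and (105) can be illustrated numerically. For a stationary AR(1) process x_k = a·x_{k−1} + w_k with Δt = 1 (so ν_N = ½), the truncated sum of eqn (104) over autocorrelations estimated from one long realisation approximates the known closed-form spectrum 1/(1 − 2a cos 2πν + a²). The process and all parameter values below are illustrative assumptions, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
a, N = 0.6, 1 << 16
w = rng.standard_normal(N)
x = np.zeros(N)
for k in range(1, N):                 # AR(1): x_k = a*x_{k-1} + w_k
    x[k] = a * x[k - 1] + w[k]

M = 100                               # number of estimated autocorrelation lags
R = np.array([np.dot(x[:N - m], x[m:]) / N for m in range(M)])

nu = np.linspace(-0.5, 0.5, 401)      # frequency in units of 1/Delta_t
# Truncated eqn (104) for a real series: P = R_0 + 2*sum_{m>0} R_m cos(2*pi*nu*m)
P_wk = R[0] + 2 * np.sum(R[1:, None] * np.cos(2 * np.pi * np.outer(np.arange(1, M), nu)), axis=0)
P_true = 1.0 / (1 - 2 * a * np.cos(2 * np.pi * nu) + a ** 2)
print(float(np.max(np.abs(P_wk - P_true) / P_true)))
```

The agreement is limited only by the statistical error of the estimated R_m; with more data the truncated correlogram converges to the true spectrum, which is the content of the theorem.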

where ν_N is the Nyquist frequency ν_N = 1/2Δt.

The final step in this analysis is to employ an expression for the determinant of a Toeplitz matrix such as R_x so that eqn (103) may be developed. Luckily we can use a result shown by Szegő,(136) who demonstrated that

    \lim_{K\to\infty}\frac{1}{K}\log(\det R_x) = \frac{1}{2\nu_N}\int_{-\nu_N}^{\nu_N}\log P(\nu)\,d\nu

where P and R are related by eqn (105). Thus using eqn (103) we see that the entropy rate of a zero-mean complex stationary Gaussian process of infinite duration is given by

    \bar{s} = \frac{1}{2\nu_N}\int_{-\nu_N}^{\nu_N}\log P(\nu)\,d\nu.        (106)

Having come so far it is worthwhile to ask why. In Sections 7.1.1 and 7.1.2 I showed that a valid approach to the determination of the spectrum of a time series was to maximise the configurational entropy of the spectrum given by the familiar ∫p log(p) formula. This was an example of a spectral discriminating function. But eqn (106) shows that the entropy rate of a time series is given by a ∫log(P) formula for its power spectrum. This has given rise to considerable discussion, and the layman may be forgiven for having received the impression that the general idea of maximum entropy is equivalent to maximum confusion. Equation (106) is not the entropy of the spectrum treated as a probability density. It is undeniably the entropy of a time series given the assumptions made, and as such is an example of a temporal discriminating function, for the best power spectrum is certainly that which leads to a maximum of s̄. We could, of course, use an iteration procedure to maximise s̄ in the frequency domain, but this is unnecessary since, as Burg first showed,(38) the solution exists in closed form for one-dimensional data.

7.1.5. Burg's Entropy. The problem Burg tried to solve was to estimate the power spectrum using only the limited number of autocorrelation coefficients available for a short time series. Equation (104) requires an infinite number of such coefficients. If only a few are known, what do we do about the remaining coefficients which are needed to define the power spectrum? His idea was to predict these unknown coefficients whilst making certain that no extra information was added. This could be done by ensuring that the uncertainty of the time series remained a maximum subject to the constraint that the measured autocorrelation coefficients were fitted. Since the uncertainty in the original time series is given by eqn (106) and P(ν) is given by eqn (104), it suffices to substitute eqn (104) in eqn (106) and differentiate with respect to the unknown autocorrelations. Thus

    \partial\bar{s}/\partial R_k = 0, \qquad |k| > K,

i.e.

    \int_{-\nu_N}^{\nu_N} \frac{1}{P(\nu)}\exp(-i2\pi\nu k\Delta t)\,d\nu = 0, \qquad |k| > K.

This

means none other than that the discrete Fourier transform of 1/P(ν) can be described with at the most 2K + 1 coefficients, i.e.

    1/P(\nu) = \sum_{k=-K}^{K} \theta_k \exp(i2\pi\nu k\Delta t)        (107)

with θ_k = θ_{−k}^* to guarantee reality of 1/P(ν). This eqn (107) can be inverted to yield

    P(\nu) = \frac{1}{\sum_{k=-K}^{K}\theta_k\exp(i2\pi\nu k\Delta t)}        (108)

and using eqn (105) we obtain

    R_m = \int_{-\nu_N}^{\nu_N} \frac{\exp(i2\pi\nu m\Delta t)}{\sum_{k=-K}^{K}\theta_k\exp(i2\pi\nu k\Delta t)}\,d\nu.        (109)

These last two equations form the basis of the temporal Maximum Entropy method. The spectrum is sought in the form of eqn (108) and the coefficients θ_k must be chosen so that eqn (109) is fulfilled. The problem is to find the values of θ_k, and the solutions given in the literature(39,133,137) use contour integration. I do not find this approach particularly illuminating and therefore provide here a slightly different derivation. Note that P(ν) is a power spectrum and can thus be expressed as the product of a phased spectrum X(ν) and its complex conjugate,

    P(\nu) = X(\nu)X^*(\nu).

Using this relation in eqn (108) allows us to write

    \frac{1}{P(\nu)} = G(\nu)G^*(\nu)        (110)

where

    G(\nu) = 1/X(\nu) = \sum_{k=0}^{K} g_k \exp(-i2\pi\nu k\Delta t)

    G^*(\nu) = 1/X^*(\nu) = \sum_{k=0}^{K} g_k^* \exp(i2\pi\nu k\Delta t)        (111)

and

    \theta_m = \sum_{k=0}^{K-m} g_k\,g_{k+m}^*, \qquad m \ge 0,        (112)

the Hermiticity of {θ_k} having been used to set g_k = 0 for k < 0 in eqns (111).

The equations (111) and (112) alone do not completely define the coefficients g_k, but this is possible in terms of the phased spectrum X(ν) if we expand X(ν) as a Fourier series:

    X(\nu) = \sum_{l=0}^{\infty} h_l \exp(-i2\pi\nu l\Delta t)

    X^*(\nu) = \sum_{l=0}^{\infty} h_l^* \exp(i2\pi\nu l\Delta t)        (113)

the reality of P(ν) having again been used to constrain the index l non-negative. From eqns (111) and (113) we can see that

    h_0 g_0 = 1 = h_0^* g_0^*.        (114)

Inserting eqn (110) into eqn (109) gives

    R_m = \int_{-\nu_N}^{\nu_N} \frac{\exp(i2\pi\nu m\Delta t)}{G(\nu)G^*(\nu)}\,d\nu

and this can be used to form the sum

    \sum_{k=0}^{K} g_k R_{m-k} = \int_{-\nu_N}^{\nu_N} \frac{\exp(i2\pi\nu m\Delta t)\sum_{k=0}^{K} g_k\exp(-i2\pi\nu k\Delta t)}{G(\nu)G^*(\nu)}\,d\nu = \int_{-\nu_N}^{\nu_N} \frac{\exp(i2\pi\nu m\Delta t)}{G^*(\nu)}\,d\nu = \int_{-\nu_N}^{\nu_N} X^*(\nu)\exp(i2\pi\nu m\Delta t)\,d\nu.        (115)

Substituting eqn (113) into eqn (115) and employing the relation (114) produces

    \sum_{k=0}^{K} g_k R_{m-k} = \begin{cases} 2\nu_N/g_0^*, & m = 0 \\ 0, & m > 0. \end{cases}        (116)
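Eqns (116) are a Hermitian Toeplitz linear system for the g_k, and together with eqns (108) and (110) they give the ME spectrum directly as P(ν) = 1/|G(ν)|². The sketch below solves them for a toy real AR(1) autocorrelation sequence with Δt = 1 (so 2ν_N = 1), fixing the scale from the m = 0 equation; the example process is an assumption chosen because its exact spectrum is known:

```python
import numpy as np

a, K = 0.6, 3
m = np.arange(K + 1)
R = a ** m / (1 - a ** 2)                   # exact AR(1) autocorrelations R_m
T = R[np.abs(m[:, None] - m[None, :])]      # Toeplitz autocorrelation matrix

h = np.linalg.solve(T, np.eye(K + 1)[0])    # T h = e_0
c = 1.0 / np.sqrt(h[0])                     # scale so the m = 0 equation gives 2*nu_N/g_0
g = c * h                                   # coefficients solving eqns (116)

nu = np.linspace(-0.5, 0.5, 101)
G = np.sum(g[:, None] * np.exp(-2j * np.pi * np.outer(m, nu)), axis=0)
P_me = 1.0 / np.abs(G) ** 2                 # eqn (108) via eqn (110)
P_true = 1.0 / (1 - 2 * a * np.cos(2 * np.pi * nu) + a ** 2)
print(np.allclose(P_me, P_true))
```

Because the test sequence is exactly AR(1), the solution collapses to g = (1, −a, 0, 0) and the recovered spectrum matches the true one to machine precision; for measured autocorrelations the same construction yields the maximum-entropy extrapolation.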

These equations (116) are the result of applying the Maximum Entropy principle plus the aforementioned constraints to a time series {x_k}. But we see that eqns (116) have exactly the same form as eqns (31), so that Burg's ME method is none other than an autoregressive predictive technique in disguise. The distinction must, however, be made between the ME aspect of Burg's method and the novel means he employed to calculate the AR coefficients, which was presented in Section 6.1.4. This latter is a valid approach to determining these coefficients, but we have seen, both in Section 6.1.4 and here, that one strong assumption made in both cases is that the time series be stationary. This restriction really makes the method inapplicable to NMR (see Appendix A), although a noiseless FID in which all the signals decay at the same rate could be made stationary by multiplication with a suitable increasing exponential.(138)

A further point to be considered when comparing the two forms of ME is the treatment of noise. Maximum Entropy applied in the frequency domain takes the presence of observation noise directly into account using the fit function eqn (75) and should thus give reasonable spectra even for noisy data. Maximum Entropy applied in the time domain fails to do this, although more sophisticated treatments(139) may overcome the problem. As we have seen in Section 6.1.8, the correct treatment of noise leads not to an AR model but to an ARMA model.

Finally, however, the ME approach to Burg's method provided Jaynes(140) with an elegant explanation for the line splitting problem. By pointing out that the Burg solution eqn (108) is an approximation to the situation with an infinite time series but the exact solution to a case where the time series is circular, he showed that two lines are to be expected in the spectrum instead of one when the signal contains a sinusoidal component which does not go through an integer number of cycles during the measurement period.

7.1.6. Comments on Maximum Entropy. It is perhaps warranted to ask why the Maximum Entropy Principle is justified. There seem to be two ways to counter such a question. One is based on the derivation presented in Section 7.1.2, the other introduces new aspects of consistency.

-

P(AIBC)

as well as the symmetrically P(ABIC) Combining

-

P(B(AC)

(117)

,

equivalent P(AIC)

.

(118)

eqn (117) and eqn (118) yields

P(AIBC)

= P(AIC)

This simple-looking probability

P(BIC)

P(AIBC)

VW) . WlC)

(119)

equation (119) is one form of Bayes’ theorem,@) it relates the posterior to the prior probability

P(AIC).

Thus the probability that A is true

Linear prediction and maximum given previous information that

A is true

leaming

C and new evidence B is related by eqn (119) to the probability

given only the previous

process

585

entropy

information.

in that the probability

It represents

that a hypothesis

an inductive(lW) or

A is true is reinforced

by

evidence B which fits the hypothesis, a measure of the fit being given by P(BIAC). Let us forget, for the moment, the influence of C which represents general background knowledge, and treat A as the proposition that a given spectrum is the unique spectrum we are seeking, and treat B as the measured FID. Then eqn (119) could be rephrased a#*‘) P(SpectrumlFID)

-

P(Spectrum)

P(FIDISpectrum)

.

(120)

The probability that the spectrum is “true” is thus the product of two probabilities, the prior probability of the spectrum and the probability that the FID agrees with the spectrum. This latter probability is fairly easy to determine if we assume that the measured data are corrupted solely by additive Gaussian noise, for in this case the variables xk - cfk are normally distributed and the measure of the fit is given by

(121) k

In eqn (121) σ_k² is the variance of the noise and χ² has a chi-squared probability density.(8) Thus the second probability in eqn (120) is proportional to exp(−χ²/2). The main problem lies with the determination of the prior probability P(Spectrum), and the solution is to set this proportional to the degeneracy of the spectrum given by eqn (77). Thus, from eqn (78), the prior probability is proportional to exp(NS), where N is the total number of intensity units ΔX in the spectrum. The posterior probability is therefore proportional to exp(NS − χ²/2), so that a solution is sought which maximises this quantity.

This approach, simple and appealing though it is, has been subject to criticism(142,143) which is based on the choice of the size of the intensity unit ΔX, or alternatively on the value of N. Gull and Daniell(126,141) varied N until the χ² of eqn (121) equalled the number of data points, which seems reasonable. But Livesey and Skilling(145) complain that N is thus chosen a posteriori, which is not consistent with exp(NS) being a prior probability. The statistical arguments linking entropy with probability have also been called into question by Frieden,(144) although his arguments are refuted by Musicus. The difficulties caused by equating maximum entropy with maximum degeneracy and maximum probability may be overcome by seeking a discriminating function based not on statistical arguments or on information theory but on the reasonable grounds of consistency. This approach was chosen by Shore and Johnson(146,147) and by Tikochinsky et al.(148) The rather formal mathematical treatment of these authors has been made more digestible by Skilling,(149) Livesey and Skilling,(145) Gull and Skilling(150) and by Laue et al.(151)
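The trade-off in maximising exp(NS − χ²/2), and the role played by N that is criticised above, can be seen in a deliberately tiny toy problem (all numbers are hypothetical): a two-channel spectrum (p, 1−p) with a single noisy observation of the first channel. Small N lets the data dominate; large N pulls the maximiser towards the flat spectrum.

```python
import numpy as np

sigma, d = 0.1, 0.8                # one noisy observation of the first channel
p = np.linspace(1e-6, 1 - 1e-6, 100001)
S = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # entropy of the spectrum (p, 1-p)
chi2 = (p - d) ** 2 / sigma ** 2                 # eqn (121) for this toy model

bests = {}
for N in (0.1, 10.0, 1000.0):
    bests[N] = p[np.argmax(N * S - chi2 / 2)]    # maximise the posterior exponent
    print(N, round(bests[N], 3))
```

For N = 0.1 the maximiser sits essentially at the data value 0.8, while for N = 1000 it is pulled close to the maximum-entropy value ½, which is exactly the behaviour that makes the choice of N contentious.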

Shore and Johnson(146) introduce four axioms which any consistent discriminating function should fulfil. These are uniqueness, invariance to coordinate transformation, system independence and subset independence. That the function we choose should be unique is clear; also that it should not depend on arbitrary transformations is obvious. The other two conditions are sufficient to fix the form of S. Rather than go through the mathematics of Shore and Johnson I can do no better than to paraphrase the approach of Livesey and Skilling.(145) These authors demonstrate that the requirement of subset independence is sufficient to impose the restriction

    (\partial/\partial p_j)(\partial S/\partial p_k) = 0 \quad \text{for all } j \neq k        (122)

where p_j is the proportion X_j/ΣX. This comes about by considering the spectrum being

broken into disjoint subsets, for each of which independent constraint information is available fixing the proportions in that subset. No information correlating the subsets is assumed available, so that the function S must have a maximum over the proportions of one subset independent of the proportions in the others. This means that the gradient ∂S/∂p_k of S at p_k must be independent of p_j, thus leading to eqn (122). One consequence of eqn (122) is that S is separable, thus

    S = \sum_j s_j(p_j)

and if no information exists favouring one element over another then the functional forms s_j(p_j) must all be identical, so that

    S = \sum_j s(p_j).        (123)

FIG.21. Schematic illustration of the two systems considered in the text.

To clarify the concept of system independence we can regard the spectrum as consisting of two overlapping systems, 1 and 2, which Livesey and Skilling illustrate as the two-dimensional matrix in Fig.21 (this in no way implies that the spectrum itself is two-dimensional; it is merely a useful construct). The two systems in this illustration correspond to the rows and the columns. The proportions in the first system are represented by u_i, those of the second system by v_j and those of the whole system as p_ij. If the systems are independent in the sense that linear equality constraints (including the constraint on the total intensity Σu = Σv = 1) of the form

    \sum_i u_i a_{ik} = A_k \quad\text{and}\quad \sum_j v_j b_{jl} = B_l

apply, the elementary proportions p_ij must be given by

    p_{ij} = u_i v_j.

In the present context the A_k and B_l could represent measured data, and the coefficients a_ik and b_jl could be the terms exp(−i2πik/I) and exp(−i2πjl/J) in the discrete Fourier transform. We have seen in Section 7.1.2 that the constrained maximum of S is found by using Lagrange multipliers, and by applying this technique to both systems independently we seek

multipliers, and by applying this technique to both systems independently we seek

Linear prediction and maximum

entropy

587

the maxima of the sums

    Σ_i s(u_i) + Σ_k λ_1k [ Σ_i u_i a_ik − A_k ]                (124)

and

    Σ_j s(v_j) + Σ_l λ_2l [ Σ_j v_j b_jl − B_l ] .              (124a)

These maxima are found by differentiating to give

    s′(u_i) + Σ_k λ_1k a_ik = 0                                 (125)

    s′(v_j) + Σ_l λ_2l b_jl = 0 .                               (125a)

The same method may be applied to the whole system, for which the constraints

    Σ_ij u_i v_j a_ik = A_k   and   Σ_ij u_i v_j b_jl = B_l

are effective. By forming a sum similar to eqn (124) and differentiating we obtain

    s′(u_i v_j) + Σ_k μ_1k a_ik + Σ_l μ_2l b_jl = 0 .           (126)

Combining eqns (125) and (126) yields

    s′(u_i v_j) − s′(u_i) − s′(v_j) = Σ_k (λ_1k − μ_1k) a_ik + Σ_l (λ_2l − μ_2l) b_jl ;

differentiating this first with respect to v_j and then with respect to u_i gives

    s‴(u_i v_j) u_i v_j + s″(u_i v_j) = 0 ,

or in terms of the elementary proportions

    s‴(p) p + s″(p) = 0 .                                       (127)

Equation (127) may be integrated to give the general solution

    s(p) = A p log(p) + B p + C                                 (128)

where A, B and C are constants. Substituting eqn (128) into eqn (123) yields

    S = A Σ_j p_j log(p_j) + B + C                              (129)

since Σ_j p_j = 1. The constants in eqn (129) are irrelevant, so we see that, on the basis of reasonable axioms and simple arguments, Shore and Johnson demonstrate that the only discriminating function which leads to consistent results must have the form

    S = − Σ_j p_j log(p_j)                                      (130)

where the minus sign is chosen to give S a maximum.

This has, furthermore, been done without invoking the concepts of maximum degeneracy or maximum probability. An amusing example of the consequences of not employing eqn (130) has been given by Gull and Skilling.(150)
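As a quick numerical check of eqn (130) — a modern Python sketch, not part of the original treatment — the discriminating function S = −Σ p_j log p_j is indeed maximised, under the normalization constraint alone, by the uniform distribution:

```python
import numpy as np

def entropy(p):
    """Shore-Johnson discriminating function S = -sum_j p_j log p_j, eqn (130)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))

J = 8
uniform = np.full(J, 1.0 / J)

rng = np.random.default_rng(0)
for _ in range(1000):
    trial = rng.random(J)
    trial /= trial.sum()                      # a random normalized distribution
    assert entropy(trial) <= entropy(uniform) + 1e-12

# with only the normalization constraint, the maximum of S is log J,
# attained by the uniform distribution
assert abs(entropy(uniform) - np.log(J)) < 1e-12
```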


The foregoing derivation has rested on the assumption that the sole information about the proportions came from the constraints. Should this not be the case and should prior knowledge in the form of a distribution mj be available then this can be included to give a revised form of the entropy

    S = − Σ_j p_j log( p_j / m_j ) .                            (131)

This expression, eqn (131), is a measure of the cross-entropy between two distributions p_j and m_j (strictly, the cross-entropy is the negative of S in eqn (131) and has to be minimised), and its derivation is described by Shore and Johnson. If no constraints are in operation, maximising the S of eqn (131) will cause the distribution p_j to equal the distribution m_j, which can thus be regarded as a default level(150) and which is related to the constant A of eqn (87).

One of the reasons for the success of the spectral maximum entropy method is that the logarithmic form enforces the populations p_j to be positive. This is of great value in image processing, where the object is necessarily positive and where positivity-constraining methods play a great role.(8) But for NMR spectroscopy it can be a nuisance, because some peaks may be negative. This difficulty may be overcome by a method to be described in the next Section, but Wright and Belton(152) have argued that the fundamental problem lies with the correct definition of entropy for reconstruction of a phased spectrum.

The underlying assumption of the maximum entropy method previously described is the validity of the direct correspondence between the isochromats x_i and the spectral amplitudes X_j illustrated in Section 2, but this assumes that all the signals are in phase. Faced with this problem, Wright and Belton(152) sought to derive an alternative form of the spectral entropy which better measured the information content of a complex signal obtained in coherent spectroscopy. They reasoned that the entropy should be invariant under certain transformations of the signal. Thus, for instance, neither the entropy of the signal nor that of the spectrum should change if either of these data sets is multiplied by an arbitrary complex constant. Principally, though, the entropy of the spectrum should equal that of the FID, since both entropies measure the information available, and mere Fourier transformation, which is a reversible process, neither creates nor destroys information.

By arguing that an analogy exists between quantum mechanics, where a state vector |ψ(t)⟩ may be expanded as a series of complete orthonormal basis vectors

    |ψ(t)⟩ = Σ_j c_j |φ_j(t)⟩                                   (132)

with |φ_j(t)⟩ being the jth basis vector and c_j a complex coefficient, and spectral analysis, where the time signal, sampled at times t = kΔt, may similarly be expanded as

    x_k = Σ_j X_j exp(i2πjk/J) ,                                (133)

Wright and Belton introduce the density matrix ρ, whose elements ρ_ij are the ensemble averages ⟨c_i c_j*⟩, and from which an ensemble entropy may be derived

    S = − k Tr(ρ log(ρ))

where k is Boltzmann's constant. The correspondence between eqns (132) and (133) allows us to define an "intensity matrix" P, with elements P_ij = X_i X_j*, which has the entropy

    S = − Tr(P log(P)) .


Wright and Belton go on to show how, as a consequence of the Fourier transform relationship eqn (72), the matrix P in the frequency domain may be transformed to the matrix p in the time domain by

    p = W P W⁻¹

where W is a unitary matrix with elements W_jk = exp(i2πjk/J)/√J, and where the elements of p are given by p_kl = x_k x_l*. Because W is unitary, transformation between frequency and time leaves the entropy unchanged,

    S = − Tr(P log(P)) = − Tr(p log(p)) ,                       (134)

thus fulfilling the main condition of invariance.

The last stage in this argument is to derive the elements of P or p from the spectrum or measured data. These elements are defined as ensemble averages over many elementary states, and the degree of correlation between these is not known. Wright and Belton set this correlation to zero by invoking the Principle of Maximum Entropy and thus conclude that the density matrix is diagonal with elements given by P_ii = |X_i|². Thus the entropy formula derived by this approach is

    S = − Σ_{j=0}^{J−1} |X_j|² log |X_j|²                       (135)

where the spectrum is scaled so that Σ |X_j|² = 1.

The realization of this novel form of the entropy using the method of Skilling and Bryan (Section 7.1.2.2) would not be so easy, because the total search space now includes the real and imaginary components of {X_j}. Since the real and imaginary parts of each individual element are no longer independent, one major feature of the algorithm, the diagonality of ∇∇S, is absent. This might preclude a practicable implementation. However, it should be possible to extend the method of Gull and Daniell (Section 7.1.2.1) to include an entropy in the form of eqn (135). Because of the relationship (134) the entropy can be maximised in either the time or the frequency domain, whichever is practically more efficient. However, eqn (135) relates to the power spectrum of the data, so that if entropy can be regarded as a measure of smoothness then it is the power spectrum which will be smooth, not necessarily the real and imaginary components. This is probably a serious defect since one aim of the method is to produce smooth phaseable spectra; this could only be achieved by adding extra smoothing constraints (see Section 7.2).
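The invariance claim of eqn (134) is easy to verify numerically. The following Python sketch (my construction, with an arbitrary random spectrum) builds the diagonal intensity matrix, transforms it with the unitary DFT matrix, and checks that −Tr(P log P) is unchanged and equals eqn (135):

```python
import numpy as np

rng = np.random.default_rng(1)
J = 16
X = rng.normal(size=J) + 1j * rng.normal(size=J)
X /= np.linalg.norm(X)                  # scale so that sum |X_j|^2 = 1

P = np.diag(np.abs(X) ** 2)             # diagonal intensity matrix, P_jj = |X_j|^2

jj, kk = np.meshgrid(np.arange(J), np.arange(J), indexing="ij")
W = np.exp(2j * np.pi * jj * kk / J) / np.sqrt(J)   # unitary DFT matrix

p = W @ P @ W.conj().T                  # "time-domain" intensity matrix

def trace_entropy(M):
    lam = np.linalg.eigvalsh(M)         # Hermitian matrix: real eigenvalues
    lam = lam[lam > 1e-15]
    return -np.sum(lam * np.log(lam))

# eqn (134): a unitary similarity transform preserves -Tr(M log M)
assert np.isclose(trace_entropy(P), trace_entropy(p))
# for diagonal P this is exactly eqn (135)
assert np.isclose(trace_entropy(P), -np.sum(np.abs(X) ** 2 * np.log(np.abs(X) ** 2)))
```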

7.1.7. Summary of Maximum Entropy Methods. I shall restrict this summary to the treatment of one-dimensional spectral entropy. Two-dimensional extensions are considered in Section 7.3, and one-dimensional temporal entropy has been shown to be equivalent to the AR/LP methods discussed in Section 6.1.

The first application of Maximum Entropy (ME) to NMR is undoubtedly that of Sibisi. The treatment is strongly based on that of Gull and Daniell,(126) and although it deals with a two-dimensional representation, the data are one-dimensional in the usual NMR sense of being measured as a function of one time variable only. As well as the normal frequency dimension in the spectrum, Sibisi introduced the linewidth (or equivalently the decay rate) as the second dimension. The ability of the method to produce well-resolved one-dimensional spectra devoid of truncation artefacts is demonstrated, together with the possibility of counteracting the effects of zero- and first-order phase distortion. The latter is accomplished by including both phase parameters φ0 and φ1 into eqn (72)


relating the spectrum to its transform to give

    x_k = exp(i2πφ0) Σ_{j=0}^{J−1} X_j exp(i2π(k+φ1)j′/J)       (136)

where j′ + J/2 = (j + J/2) modulo J (the use of the modulus is necessary if the data are complex and the index j runs from 0 to J−1, because spectral elements X_j with j ≥ J/2 correspond to negative frequencies; see Section 3.4 and Brigham). The entropy is then maximised over the two phase parameters φ0 and φ1 as well as the X_j.
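The index convention j′ + J/2 = (j + J/2) modulo J can be made concrete with a small sketch (Python; the function names are mine):

```python
import numpy as np

def jprime(j, J):
    # eqn (136) index convention: j' + J/2 = (j + J/2) mod J
    return (j + J // 2) % J - J // 2

J = 8
assert [jprime(j, J) for j in range(J)] == [0, 1, 2, 3, -4, -3, -2, -1]

def trial_spectrum_to_fid(X, phi0, phi1):
    """Phased transform of eqn (136): trial FID from a trial spectrum X."""
    J = len(X)
    k = np.arange(J)[:, None]
    jp = np.array([jprime(j, J) for j in range(J)])[None, :]
    return np.exp(2j * np.pi * phi0) * np.sum(
        X[None, :] * np.exp(2j * np.pi * (k + phi1) * jp / J), axis=1)

X = np.zeros(8); X[1] = 1.0
x = trial_spectrum_to_fid(X, phi0=0.0, phi1=0.0)
assert np.allclose(np.abs(x), 1.0)      # a single line gives a pure oscillation
```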

FIG.22. ¹³C spectra of 2-vinylpyridine in CDCl₃. (a) FT spectrum of a noisy FID (× indicates a spectrometer artefact). (b) FT spectrum of the same FID after multiplication with a negative exponential. (c) ME spectrum with automatic phasing. (d) FT spectrum of a FID obtained with a larger pulse width and offering higher signal-to-noise. Because of the weighting in eqn (137) the ME method actively discriminates against the noise and is much more successful at producing an interpretable spectrum at low signal-to-noise ratios; compare (a) and (c). The low-intensity line for C2 is hardly larger than some of the noise spikes in (a) but it is clearly distinguishable in the ME spectrum (c). (reproduced from Ref.151, with permission)

In 1985 Laue et al.(151) applied the ME method to very noisy ¹³C data and moderately noisy ¹H data and showed that the resulting spectra were much freer of noise and had better resolution. The main theme is to show that ME can be used to improve resolution and sensitivity simultaneously, something which cannot be accomplished by windowing or filtering in the time domain. Entropy is introduced as a discriminating function in this paper, and cogent reasons, based on the consistency arguments of the preceding Section, are given for its being chosen as the best. Furthermore, an outline is presented of the algorithm, that of Skilling and Bryan, which also finds the two phase parameters automatically and which converges within 100 Fourier transformations. Whereas Sibisi had introduced the decay rate into the algorithm in order to produce a


two-dimensional reconstruction, Laue et al.(151,155) use the decay rate to discriminate against the noise and to improve the resolution by modifying eqn (136) to give

    x_k = exp(−wk) exp(i2πφ0) Σ_{j=0}^{J−1} X_j exp(i2π(k+φ1)j′/J) .    (137)

Choosing w positive is equivalent to assuming the existence of Lorentzian lines, and the transformed vector {x_k} would have elements whose amplitudes decay exponentially.

However, the noise in the data {d_k} has a constant variance and will thus not be well fitted by a decaying signal. The result is to extract the "true" spectrum even in the presence of considerable noise. Comparison of the ¹³C spectra obtained by FT and by ME shows the improvement in clarity and suppression of artefacts with ME when dealing with noise-corrupted, non-truncated data (Fig.22). A similar approach of using a non-zero w in eqn (137) has been employed by Ni et al.(156) who first multiply the FID by a decreasing exponential to reduce the noise and then apply ME with a positive w to regain the resolution. Ni et al. state that only in this way may one obtain noise suppression and high resolution simultaneously.
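A minimal sketch of the forward model of eqn (137) (my implementation, with the phases set to zero by default) shows why a positive w discriminates against noise: the trial FID can only decay, whereas the noise amplitude is constant:

```python
import numpy as np

def trial_fid(X, w, phi0=0.0, phi1=0.0):
    """Eqn (137): damped, phased trial FID generated from a trial spectrum X."""
    J = len(X)
    k = np.arange(J)[:, None]
    jp = ((np.arange(J) + J // 2) % J - J // 2)[None, :]
    x = np.sum(X[None, :] * np.exp(2j * np.pi * (k + phi1) * jp / J), axis=1)
    return np.exp(-w * np.arange(J)) * np.exp(2j * np.pi * phi0) * x

J = 64
X = np.zeros(J); X[5] = 1.0           # one line in the trial spectrum
amp = np.abs(trial_fid(X, w=0.05))
# the model amplitudes decay monotonically, so flat (constant-variance)
# noise cannot be well fitted by any trial spectrum
assert np.all(np.diff(amp) < 0)
```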

FIG.23. ¹³C spectra of dioxin in benzene (d₆) obtained with an INEPT sequence. (a) FT spectrum of the FID. (b) ME spectrum of the same FID. The inserts show expansions of the low-field multiplet. (reproduced from Ref.157, with permission)

One of the defects of the spectral entropy procedure is that the p log(p) form of the entropy constrains p to be positive. Laue et al.(157) have overcome this difficulty by considering the sought spectrum to consist of the difference of two positive spectra. One of these, {X_j⁺}, corresponds to the normal positive signals, the other, {X_j⁻}, to the negative


signals. Since these two spectra are positive it is easy to apply the entropy formalism to both, yielding

    S = − Σ_{j=0}^{J−1} X_j⁺ log X_j⁺ − Σ_{j=0}^{J−1} X_j⁻ log X_j⁻

and the real spectrum may be obtained as

    X_j = X_j⁺ − X_j⁻                                           (138)

whence X_j may be incorporated in eqn (137) to yield the time series {x_k}. Laue et al.(157) demonstrate the use of this two-channel program on an INEPT spectrum with antiphase peaks and show the simultaneous improvement of resolution and suppression of noise obtainable with ME (Fig.23).

Martin(158) has used the method of Gull and Daniell,(126) described in Section 7.1.2.1, to obtain ME reconstructions of magnitude spectra. He chose a somewhat different approach to that of Laue et al. by seeking to fit a "correlation" function derived from the experimental FID. The data {d_k} were first transformed to the frequency domain using a normal DFT to give a normal spectrum {D_j}, from which the magnitude spectrum {|D_j|} was obtained and subsequently transformed back to the time domain to form the "correlation" function {d_k′} (this "correlation" function is not the same as the autocorrelation function of the data). The self-consistent algorithm of Gull and Daniell(126) was then employed to find a magnitude spectrum {X_j′} which fitted the {d_k′} within the expected variance of the noise. Martin showed that the method did not yield a higher signal-to-noise ratio than did matched filtering. But it should be borne in mind that this result cannot be directly translated to the method of Laue et al.(151) because of the completely different treatment of the data and because Martin employed no weighting in eqn (137).

The spectral ME method can also be applied to a truncated FID, as shown by Laue et al.(34) who applied it to ¹⁷O data in which acoustic ringing and pulse breakthrough had corrupted the initial data points in the time domain. Conventional FT of such data results in considerable baseline roll. This may be eliminated either by increasing the delay between the end of the pulse and the start of the acquisition, a technique which introduces severe phase distortions, or by setting the initial data points to zero, which introduces artefacts of the "sinc" function type. An ME spectrum obtained from such truncated data is, however, devoid of such artefacts,(34) although to achieve a clean baseline Laue et al. found it necessary to weight the initial points in the FID rather than to employ a sharp cutoff and set them to zero (Fig.24).
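Martin's "correlation" function described above is easily sketched (Python, with hypothetical toy data; this is not Martin's actual code):

```python
import numpy as np

rng = np.random.default_rng(2)
J = 128
t = np.arange(J)
d = np.exp(-t / 40.0) * np.exp(2j * np.pi * 0.11 * t)        # toy FID
d += 0.05 * (rng.normal(size=J) + 1j * rng.normal(size=J))   # noise

D = np.fft.fft(d)            # normal spectrum {D_j}
mag = np.abs(D)              # magnitude spectrum {|D_j|}
d_corr = np.fft.ifft(mag)    # the "correlation" function {d'_k}

# the magnitude spectrum is real and non-negative, so {d'_k} is Hermitian
assert np.allclose(d_corr[1:], np.conj(d_corr[1:][::-1]))
```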

FIG.24. ¹⁷O spectra of aspyrone. The FID was corrupted by acoustic ringing. (a) FT spectrum of the FID. (b) FT spectrum of the same FID with the first 18 points set to zero. (c) ME spectrum with the first 18 points ignored. (reproduced from Ref.34, with permission)


Laue et al. go on to consider the quantitative aspects of this method by applying it to a two-line spectrum and comparing the areas of the lines under various conditions. They show that the reliability of the areas decreased as the number of ignored points at the start of the FID increased. Furthermore, the reliability is also dependent on the decay rate w in eqn (137). If the spectrum is reconstructed with w set to zero the areas are least reliable; for maximum reliability the decay rate should be chosen close to the experimental T2*.

The spectral ME method also allows a spectrum to be estimated even when points are missing in the middle of the FID. As I pointed out in Section 4.2 this benefit is of little value for 1D NMR. However, the demonstration of this technique by Barna et al.(35) was made using 1D data, so its discussion belongs here. The FID is normally acquired at regular time intervals 0, Δt, 2Δt, etc. It is necessary to do this if the spectrum is to be obtained by DFT, for using a non-regular time basis precludes the DFT, and missing out some of the points leads to spectral distortion. But using ME there is no need for all the points to be present, although the data must still be taken at multiples of a basic interval Δt. The ME method will produce a spectrum best fitting the available data. The object of this approach is to achieve good resolution and high sensitivity with the minimum number of points.

The obtainable spectral resolution is dependent on the length of time that the FID is sampled. For the FT method the resolution is simply the inverse of this time, whereas for the methods considered in this review higher resolution is obtainable both in principle and in practice. However, even here the information which leads to high resolution is contained in the data points at large values of kΔt. Thus it makes sense to acquire such points. On the other hand, the bandwidth of the spectrum for complex data is the inverse of the sampling interval Δt, so that if the spectrum is to contain no folded lines this interval is fixed by the spectral width expected for the sample under study. A compromise must be reached in which some of the points are acquired close together with the sampling interval Δt to provide the correct spectral width, and some of the points must be acquired at high values of kΔt for good resolution. Because the signal decays exponentially, Barna et al. suggested sampling the FID on an exponential basis, with the sampling density being higher at the start of the FID, where the signal-to-noise ratio is greatest, in order to increase sensitivity. An example of the results obtained with this method is shown in Fig.25.
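The exponential sampling scheme can be sketched as follows (an assumed implementation, not that of Barna et al.: indices are drawn without replacement from a grid of multiples of Δt with exponentially decreasing density):

```python
import numpy as np

def exponential_sample(N, n, tau, seed=0):
    """Pick n distinct sample indices from 0..N-1 with density ~ exp(-k/tau)."""
    rng = np.random.default_rng(seed)
    w = np.exp(-np.arange(N) / tau)
    return np.sort(rng.choice(N, size=n, replace=False, p=w / w.sum()))

idx = exponential_sample(N=512, n=128, tau=150.0)
assert len(idx) == 128 and len(np.unique(idx)) == 128
# the early part of the FID, where signal-to-noise is best, is sampled
# more densely than the tail
assert np.sum(idx < 256) > np.sum(idx >= 256)
```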

7.2. Maximum Smoothness

One of the prominent aspects of ME spectra is the flatness of the baseline and apparent lack of noise. This can be rationalised in terms of the information concept of entropy. A spectrum in which numerous features such as wiggles and noise spikes occur is obviously one which has more information than a spectrum devoid of such artefacts. Because the maximum entropy method seeks that spectrum consistent with the data but having the least information, these features will be absent in a ME spectrum. So, in general, maximum entropy spectra are smooth. If smoothness can be equated with lack of information then it makes sense to seek a spectrum in which smoothness itself is the discriminating function. This approach was first advocated by Phillips(159) and is described by Frieden(8) and by Rosenfeld and Kak.(160)

Belton and Wright(161) have applied the method to NMR based on a treatment by Hunt.(162) Let us consider first a spectrum {g(k)} obtained by Fourier transformation of a FID. This spectrum, which may have broad lines because of short T2 times and may be contaminated by noise, can be considered as the convolution of an ideal spectrum {f(j)} with sharp lines and a lineshape or point spread function {h(j)}, plus additive noise {e(j)}, thus

    g(k) = Σ_{j=0}^{a−1} h(k−j) f(j) + e(k) ,   k = 0, …, a+b−2     (139)

where the sequence {f(j)} has a elements and {h(j)} has b elements. Equation (139) can also be written in matrix form as

    g = Hf + e                                                  (140)

where the elements of H are given by

    H(k,j) = h(k−j)   for 0 ≤ k−j ≤ b−1 ,
    H(k,j) = 0        otherwise ,

for k = 0, …, a+b−2 and j = 0, …, a−1.
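The convolution model of eqns (139) and (140) can be checked with a small sketch (my implementation):

```python
import numpy as np

def convolution_matrix(h, a):
    """H[k, j] = h(k - j) for 0 <= k - j <= b - 1, else 0 (eqn 140)."""
    b = len(h)
    H = np.zeros((a + b - 1, a))
    for k in range(a + b - 1):
        for j in range(a):
            if 0 <= k - j <= b - 1:
                H[k, j] = h[k - j]
    return H

f = np.array([0.0, 1.0, 0.0, 0.5, 0.0, 0.0])   # ideal spectrum, a = 6
h = np.array([0.25, 0.5, 0.25])                # lineshape, b = 3
H = convolution_matrix(h, len(f))

# g = H f reproduces the discrete convolution of eqn (139)
assert np.allclose(H @ f, np.convolve(h, f))
```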

In the absence of noise the least-squares solution of eqn (140) would be given by

    f = (H^H H)⁻¹ H^H g ,

but noise leads to ill-conditioning(8) and can be taken into account by finding a solution vector f such that

    e^H e = (g − Hf)^H (g − Hf) = (a+b−1) σ²                    (141)

where σ² is the variance of the measured noise, subject to the constraint that the smoothness of the spectrum {f(j)} be a maximum. The smoothness is defined in terms of the sum of squares of the second derivatives of the sought spectrum, i.e.

    smoothness = − Σ_j | f(j+1) − 2f(j) + f(j−1) |²             (142)

To make life easier we zero-fill the arrays {g}, {f}, {h} and {e} up to P elements, where P ≥ a + 2b − 2, to give new arrays {g_P}, {f_P}, {h_P} and {e_P} such that

    f_P(j) = f(j)   0 ≤ j ≤ a−1
    f_P(j) = 0      a ≤ j ≤ P−1


and similarly for {g_P}, {h_P} and {e_P}. Then eqn (140) can be expressed as

    g_P = H_P f_P + e_P

where the matrix H_P has the circulant form

    H_P = ⎡ h_P(0)     h_P(P−1)  ...  h_P(1) ⎤
          ⎢ h_P(1)     h_P(0)    ...  h_P(2) ⎥
          ⎢    ...                      ...  ⎥
          ⎣ h_P(P−1)   h_P(P−2)  ...  h_P(0) ⎦

and, furthermore, eqn (142) becomes

    smoothness = − (C_P f_P)^T (C_P f_P)                        (143)

where the circulant matrix C_P is given by

    C_P = ⎡ −2    1    0  ...   0    1 ⎤
          ⎢  1   −2    1  ...   0    0 ⎥
          ⎢          ...               ⎥
          ⎣  1    0    0  ...   1   −2 ⎦

The purpose of choosing P ≥ a + 2b − 2 is to introduce sufficient zeros into the vectors such that "wrap-around" problems are eliminated when the matrices H_P and C_P are written as above. The advantage of this formulation is that both matrices are circulant, and the eigenvalues of such a matrix are obtained by the Fourier transform of its first row. For instance the eigenvalues λ_H(k) of H_P are given by

    λ_H(k) = Σ_{j=0}^{P−1} H_P(0,j) exp(i2πkj/P)                (144)

and the elements of the eigenvectors w_k of both matrices are given by

    w_k(j) = exp(i2πkj/P)

so that both matrices H_P and C_P can be expressed as

    H_P = W Λ_H W⁻¹   and   C_P = W Λ_C W⁻¹                     (145)

where Λ_H and Λ_C are diagonal matrices with elements

    Λ_H(k,k) = λ_H(k)   and   Λ_C(k,k) = λ_C(k)                 (146)

and W is a matrix with elements

    W(j,k) = exp(i2πjk/P) .                                     (147)

The solution to this problem is to find the minimum of eqn (143) subject to the constraint eqn (141) by means of Lagrange multipliers, i.e. we must minimise

    γ (C_P f_P)^T (C_P f_P) + (g_P − H_P f_P)^T (g_P − H_P f_P) .


Differentiating this with respect to f_P and setting equal to zero yields

    f_P = (H_P^T H_P + γ C_P^T C_P)⁻¹ H_P^T g_P .               (148)

Then substituting eqns (145) and (147) into eqn (148) produces

    f_P = W (Λ_H* Λ_H + γ Λ_C* Λ_C)⁻¹ Λ_H* W⁻¹ g_P ,            (149)

and since Λ_H and Λ_C are given by the Fourier transforms (144) and (146), and since {F_P(k)} and {G_P(k)} are the Fourier transforms of f_P and g_P, we see that

    F_P(k) = λ_H*(k) G_P(k) / ( |λ_H(k)|² + γ |λ_C(k)|² ) .     (150)

Equation (150) can rapidly be solved at any particular value of γ, and this parameter can then be varied using Newton's method until the constraint eqn (141) is satisfied. Belton and Wright(161) have demonstrated that this method allows simultaneous enhancement of resolution and suppression of noise and is superior both to matched exponential filtering and to Gaussian resolution enhancement. Its applicability to truncated FIDs has, however, not been demonstrated.
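The whole procedure of eqn (150) amounts to a few FFTs. The following Python sketch (my implementation; the toy lineshape and the value of γ are arbitrary assumptions) illustrates it on a noiseless example:

```python
import numpy as np

def smooth_deconvolve(g_p, h_p, gamma):
    """Eqn (150): F(k) = conj(LH) G / (|LH|^2 + gamma |LC|^2), via the FFT."""
    P = len(g_p)
    c_p = np.zeros(P)
    c_p[0], c_p[1], c_p[-1] = -2.0, 1.0, 1.0     # circulant second difference
    LH, LC, G = np.fft.fft(h_p), np.fft.fft(c_p), np.fft.fft(g_p)
    F = np.conj(LH) * G / (np.abs(LH) ** 2 + gamma * np.abs(LC) ** 2)
    return np.real(np.fft.ifft(F))

# toy test: blur a two-line "spectrum" with a known lineshape and recover it
P = 128
f = np.zeros(P); f[30], f[34] = 1.0, 0.6
h = np.zeros(P); h[:5] = np.array([1, 4, 6, 4, 1]) / 16.0
g = np.real(np.fft.ifft(np.fft.fft(h) * np.fft.fft(f)))     # circular h * f
f_hat = smooth_deconvolve(g, h, gamma=1e-6)
assert np.argmax(f_hat) == 30       # the stronger line is recovered in place
```

In practice γ would be adjusted, as in the text, until the residual matches the noise constraint of eqn (141).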

7.3. Extension to Two Dimensions

The extension of the spectral ME method to two dimensions is simplicity itself. The expression for the entropy of a distribution, eqn (76), is independent of the organisation of the probabilities p_j, whether one-, two- or n-dimensional. The only major algorithmic change needed is in the formulae for the transformation between the two domains, time and frequency. Thus for two-dimensional NMR eqn (137) becomes

    x_kl = exp(−w(k+l)) exp(i2πφ0) Σ_{i=0}^{I−1} Σ_{j=0}^{J−1} X_ij exp(i2π(k+φ1)i′/I) exp(i2π(l+φ1)j′/J) .   (151)
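With the phase parameters set to zero, eqn (151) is a damped two-dimensional inverse DFT, as the following sketch (my implementation) illustrates:

```python
import numpy as np

def trial_fid_2d(X, w):
    """x_kl = exp(-w(k+l)) * (unnormalized 2D inverse DFT of X), phases zero."""
    I, J = X.shape
    x = np.fft.ifft2(X) * I * J
    k = np.arange(I)[:, None]
    l = np.arange(J)[None, :]
    return np.exp(-w * (k + l)) * x

X = np.zeros((8, 8)); X[2, 3] = 1.0
x = trial_fid_2d(X, w=0.1)
expected = np.exp(-0.1 * np.add.outer(np.arange(8), np.arange(8)))
assert np.allclose(np.abs(x), expected)   # one line, damped along both times
```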

The first explicit use of ME in two dimensions is that of Hore(32) who used it to tackle two problems in 2D NMR. The first problem is the existence of "sinc" wiggles caused by truncation of the FID. In general there is no great difficulty in acquiring sufficient points in the t2 dimension so that the FID has decayed to zero within the measurement period. The difficulty lies with t1, because the duration of the experiment is proportional to the number of points acquired in t1. Since spectrometer time is expensive this number must be kept as small as possible, meaning that either resolution be sacrificed or the t1 FID be truncated, or both. The conventional approach is to be satisfied with a low resolution and to counteract the "sinc" wiggles by weighting the FID with a Gaussian or sinebell function.(15,27,31)

The second problem is the "phase-twist" lineshape in certain 2D experiments. This can be avoided by forming the magnitude spectrum, but even here the dispersion components of the twisted lines interfere with the interpretation of complex spectra and, furthermore, the positive−negative intensity information is lost. Hore(32) applied the method of Gull and Daniell(126) to two-dimensional J-spectroscopy and demonstrated its ability to produce pure absorption spectra (Fig.26). This apparently mysterious process comes about because of the direction in which ME tackles the problem.

FIG.26. 2D ¹H J spectra of m-bromonitrobenzene. (a) 2D FT spectrum of a data array consisting of 32(t1)×64(t2) complex points which was zero-filled to 128(t1)×256(t2) complex points before transformation. (b) 2D ME spectrum with 128(t1)×256(t2) real points. (reproduced from Ref.32, with permission)

We can consider a measured signal, phase modulated in t1, as consisting of contributions from components of the form

    [ exp(−t1/T2) exp(i2πν1 t1) ] [ exp(−t2/T2) exp(i2πν2 t2) ]     (152)

where for simplicity the line with frequency coordinates ν1 and ν2 has equal linewidths 1/πT2 in both frequency dimensions. Complex Fourier transformation of this signal yields

    [ A1(ν1) + iD1(ν1) ] [ A2(ν2) + iD2(ν2) ]

the real part of which, A1A2 − D1D2, is the mixture of absorption and dispersion known as the "phase-twist" lineshape,(15,27) where A and D denote absorption and dispersion respectively. There is no way to obtain pure absorption lines from expression (152) using a complex FT, but the ME method starts by assuming that the real spectrum consists of double absorption lines A1(ν1)A2(ν2) and sets the imaginary parts to zero before transforming first with respect to f2 to give

    A1(ν1) [ exp(−t2/T2) exp(i2πν2 t2) ]

and then with respect to f1 to give expression (152). Setting the imaginary parts of the spectrum to zero causes the Fourier transform in the time domain to be Hermitian, so that only the first quadrant of the back-transformed FID may be compared with the measured data: the "phase-twist" having disappeared into the other quadrants! This method can be applied to any 2D experiment where the signal is phase modulated in t1 and will produce spectra having absorption lines only. By combining it with the two-channel method of Laue et al.(157) positive and negative lines, such as found in COSY spectra, may


be taken care of. Thus phase-sensitive COSY spectra may be produced from data obtained with echo or anti-echo selection(15) introduced to give quadrature detection in f1. Hore and Daniell have applied the method of Skilling and Bryan to rotating-frame zeugmatography data. Rather than using FT to determine the spectrum directly from the data, they start from an assumed distribution of absorptive intensity in the two-dimensional spectrum and use the Bloch equations to construct a theoretical time domain signal which could be compared with the measured data. The spectrum was then chosen as that with the maximum entropy consistent with these data. Hore and Daniell show that the ME method counteracts off-resonance effects which lead to frequency shifting and phase twists, and also eliminates "sinc" wiggles due to severe truncation.
and also eliminates “sine” wiggles due to severe truncation. Laue et a1.(16’) have also considered the application of ME to 2D NMR. But unlike Hore(s2) they took phase-sensitive data(*‘) as their starting point, i.e. they used the method of Marion and Wiithrich(lb5) to obtain the FID. In their first test they chose a doublequantum-filtered COSY spectrum where the signals are always in positive or negative double absorption mode. The ME method was shown to produce spectra which have a higher resolution and which are, to a large extent, devoid of truncation artebcts. The conclusion being that if the ME procedure is substituted for conventional FT then measuring time can be reduced by severely truncating the same time retaining good resolution.

the FID in the fi dimension,whilst

at
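The quadrant argument underlying Hore's method — that zeroing the imaginary parts of the spectrum makes the back-transformed FID Hermitian — can be verified directly (a modern Python sketch, not from the original):

```python
import numpy as np

rng = np.random.default_rng(4)
J = 32
spectrum = rng.normal(size=J) + 1j * rng.normal(size=J)
spectrum = spectrum.real + 0j          # set every imaginary part to zero

x = np.fft.ifft(spectrum)              # back-transformed trial "FID"

# a real spectrum gives a Hermitian time signal, x[k] = conj(x[-k]); the
# dispersive (phase-twist) information has moved into the mirror quadrant
assert np.allclose(x, np.conj(np.roll(x[::-1], 1)))
```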

FIG.27. ¹H 2D COSY spectra of the AX spin system of 4-toluene sulphonyl chloride in dimethyl sulphoxide (d₆). (a) Phase-sensitive FT spectrum with dispersive diagonal peaks and absorptive cross peaks. (b) Absorptive ME absorption subspectrum from the same data as in (a). (c) Absorptive ME dispersion subspectrum from the same data as in (a). (reproduced from Ref.164, with permission)

Laue et al.(164) also considered phase-sensitive 2D spectra containing double dispersion peaks as well as double absorption. For instance a straightforward phase-sensitive COSY spectrum can be adjusted such that the cross peaks, which are of interest, are in double absorption mode when the diagonal peaks are in double dispersion form. In the same way in which a 1D spectrum having positive and negative peaks can be considered as the difference of two positive spectra X_j⁺ and X_j⁻, eqn (138), so a 2D spectrum with positive and negative absorption and dispersion lines can be considered as the combination of four positive spectra (X_j⁺, X_j⁻, X_j^{+i}, X_j^{−i}) whose joint entropy is given by

    S = − Σ_{j=0}^{J−1} X_j⁺ log X_j⁺ − Σ_{j=0}^{J−1} X_j⁻ log X_j⁻ − Σ_{j=0}^{J−1} X_j^{+i} log X_j^{+i} − Σ_{j=0}^{J−1} X_j^{−i} log X_j^{−i}

and which, either before or after back transformation to the time domain, must be suitably combined according to whether the mode of acquisition is that of Marion and Wüthrich(165) or that of States et al.(166)


This approach allows the separation of the phase-sensitive spectrum into two subspectra: (X_j⁺ − X_j⁻), the absorption spectrum, and (X_j^{+i} − X_j^{−i}), the dispersion spectrum, which now has lines in pure absorption mode, the information about its dispersivity having been retained in the back transformation. The effects of this are shown in Fig.27. The ability to separate the cross peaks from the diagonal peaks, which are normally dispersive in phase-sensitive COSY and whose long tails interfere with the interpretation, promises to be a useful adjunct to the less sensitive method of double-quantum filtration. Laue et al.(164) point out that this method will also be of use in cases where multiplets can have mixed lineshapes. Finally, both Laue et al.(164) and Hore(32) have noted that "t1 noise"(15,167) is suppressed by the ME method of spectral analysis. Laue et al. attribute this effect to the greater intensity of absorption peaks in comparison to phase-twisted lines: the t1 noise consequently appears smaller in relation to the peaks. Laue et al. suggest further that since the t1 noise consists of low frequency information in the t1 dimension, it may be suppressed by omitting the first few points in this dimension: a treatment which is disastrous if FT is employed, but which, as we have seen in Section 7.1.7, is possible if ME is used. An example of this technique is shown in Fig.28.

FIG.28. ¹H 2D J spectra of the AX spin system of 4-nitrophenol in dimethyl sulphoxide (d₆). (a) FT spectrum. (b) ME spectrum. (c) ME spectrum with resolution enhancement (positive w in eqn (151)). (d) ME spectrum with resolution enhancement and with the first 4 points in t1 ignored. In (b) ME improves the signal to t1 noise ratio by producing absorptive peaks. This ratio can be further increased by resolution enhancement (c). Finally, omitting the first few points in t1 removes the low frequency components which contribute to t1 noise (d). (reproduced from Ref.164, with permission)


FIG.29. ¹H 2D NOESY spectra of BPTI; only a small region is shown. For all four spectra the data were transformed from t2 to f2 using conventional FT. (a) The t1 to f1 transform was performed with FT; the first 128 points in t1 were multiplied by a shifted sinebell function and zero-filled to 1024 points before transformation. (b) The same 128 points as in (a) were analysed with ME. (c) 128 points, exponentially sampled out of the first 256 points, analysed with ME. (d) 512 points in t1, sinebell filtered and zero-filled, analysed with FT. Note that spectrum (b) is intrinsically better than spectrum (a): ME produces far better resolution than does FT because the sinebell filtering is obviated. Spectrum (c) is better yet: if only 128 points are to be analysed it is wiser to sample them exponentially than regularly. Spectrum (c) should be compared with spectrum (a): in a practical situation the measuring times for both data sets are equal, but spectrum (c) is of far greater interpretive value. Barna and Laue compare spectrum (c) to spectrum (d), which would require four times as much measuring time. (reproduced from Ref.168, with permission)

Barna and Laue(168) have extended their earlier work(83) on exponential sampling to two dimensions. They start by showing that for noisy 1D data an optimum trade-off between sensitivity and resolution for a given number of measured data points is realised by sampling these points on an exponential basis and analysing them with ME. As a 2D example they employ the data from a NOESY experiment on BPTI (bovine pancreatic trypsin inhibitor), which is a small protein. Barna and Laue were limited by computing facilities and were thus forced to use a hybrid algorithm consisting of Fourier transformation in the t2 dimension followed by ME analysis in the t1 dimension at selected values of f2. The spectra, shown in Fig.29, indeed demonstrate that a superior resolution is obtained for a given number of points in t1 using ME, but these results for 2D are less convincing than those for 1D. This lies partly with the difficulty of presenting 2D spectra as contour plots. Barna and Laue note that this exponential sampling cannot be carried too far and that exponential sampling of 128 points out of 512 produces no better results than sampling 128 points out of 256. This seems to be in agreement with their previous work(83) on 1D spectra.
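The idea of exponentially biased sampling can be sketched in a few lines of Python. Everything here (the function name, the decay constant, the rejection scheme) is an illustrative assumption and not Barna and Laue's actual recipe; the point is only that most of the retained t1 increments fall at short times where the signal is strongest, with a sparse tail at long times for resolution.

```python
import numpy as np

def exponential_schedule(n_keep, n_total, seed=0):
    """Pick n_keep of n_total t1 increments, biased toward early times.

    Indices are drawn from an exponential density (decay constant tau is
    an assumed choice) and deduplicated; the first increment is always kept.
    """
    rng = np.random.default_rng(seed)
    idx = {0}                       # always keep the first increment
    tau = n_total / 3.0             # decay constant of the bias (assumed)
    while len(idx) < n_keep:
        i = int(rng.exponential(tau))
        if i < n_total:
            idx.add(i)
    return np.sort(np.array(list(idx)))

sched = exponential_schedule(16, 64)
print(sched)   # mostly small indices, a few large ones
```

Analysing such irregularly spaced points is exactly what rules out the FFT and makes ME (which only needs to compare a trial spectrum with whichever data points exist) the natural estimator.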


To round off this Section I include a short comment on the temporal Maximum Entropy approach to 2D spectral estimation. We saw in Section 7.1.5 that in one dimension temporal ME is equivalent to autoregressive modelling. This is not the case in two dimensions. In the same way that the entropy rate of a one-dimensional process is given by eqn (106) so the entropy rate of a 2D process can be defined as(82,134)

    H' ∝ ∫ from -vN1 to vN1 ∫ from -vN2 to vN2  log P(v1,v2) dv1 dv2                (152)

and it is possible to derive an expression for the power spectrum similar to eqn (108), i.e.

    P(v1,v2) = 1 / [ Σ from k=-K to K  Σ from l=-L to L  a_kl exp(i2π(kv1 + lv2)Δt) ]   (153)

But whereas for 1D the polynomial in the denominator of eqn (108) can be factored as in eqn (110), this is no longer generally the case for the two-dimensional polynomial of eqn (153).(82,134) Iterative algorithms designed to find the temporal ME solution have been proposed, but as these methods have not been applied to NMR I shall forego the pleasure of discussing them.
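Given a coefficient array a_kl, evaluating an eqn (153)-type spectrum is a direct double sum. The sketch below assumes a Hermitian coefficient array (a[-k,-l] equal to the conjugate of a[k,l]) so the denominator is real; the toy coefficients at the end are invented purely for illustration.

```python
import numpy as np

def me_spectrum_2d(a, v1, v2, dt=1.0):
    """Evaluate an eqn (153)-type 2D ME power spectrum at one point (v1, v2).

    a is the (2K+1) x (2L+1) array of coefficients a_kl, with the index k
    running -K..K along axis 0 and l running -L..L along axis 1; Hermitian
    symmetry of a is assumed so the denominator comes out real.
    """
    K, L = a.shape[0] // 2, a.shape[1] // 2
    k = np.arange(-K, K + 1)[:, None]
    l = np.arange(-L, L + 1)[None, :]
    denom = np.sum(a * np.exp(2j * np.pi * (k * v1 + l * v2) * dt))
    return 1.0 / denom.real

# toy Hermitian coefficients: a constant term plus one cosine pair in k
a = np.zeros((3, 3), dtype=complex)
a[1, 1] = 1.0                 # k = l = 0 term
a[2, 1] = a[0, 1] = 0.4       # k = +1 and k = -1, l = 0
print(me_spectrum_2d(a, 0.5, 0.0))   # denominator 1 - 0.8 = 0.2, so about 5.0
```

The non-factorability mentioned in the text means there is no 2D analogue of reading line positions off the polynomial roots: one can only evaluate P(v1,v2) pointwise, as here.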

8. DEMONSTRATIONS

In order to provide a comparison of the several methods introduced in Sections 6 and 7 I sought to concentrate on the two main problems which might be alleviated by these modern techniques, namely truncation of the FID and its possible noise corruption. To this end I conceived for the one-dimensional demonstrations a synthetic complex time signal whose spectrum consists of sixteen Lorentzian lines arranged as the eight doublets shown in Fig.30a (real part only). The intensities of the doublets vary linearly with frequency as do also the splittings. The largest splitting on the left corresponds to a frequency separation of 0.016 vN and the smallest "splitting" on the right to 0.002 vN, where vN is the Nyquist frequency, whilst the width at half-height of all the lines has been chosen as 0.003 vN. This basic signal was corrupted by Gaussian noise and was also truncated to test the signal discrimination and resolution capabilities of the several methods. The two-dimensional demonstrations, which deal only with the spectral maximum entropy method, are based on the phase-sensitive COSY spectrum of an AMX spin system to be described later. All calculations were performed on a Control Data Corporation (CDC) Cyber 180-995E computer

with a main memory of 128 Megabyte, working either under the NOS/VE operating system with a word length of 64 bits which, as with all virtual environment operating systems, probably offers an ease of programming at the expense of somewhat high overheads but which is representative of a class of work stations now being introduced into NMR, albeit faster, or under the NOS operating system with a 60-bit word and a limited amount of addressable memory. Computing times under each system using non-vectorised optimised FORTRAN code were similar: typically a 2048 to 2048 point complex Fourier transform with bit inversion took 0.05 sec., a 4096 to 4096 transform took 0.12 sec. and an 8192 to 8192 transform took 0.25 sec, using the code in Niederdrenk's book.(23)

8.1. One-Dimensional

Of the simple AR or LP techniques of Section 6.1, I have chosen to compare the Yule-Walker, Burg, Modified Covariance and Covariance methods. These are all easy to implement; indeed source code is given by Marple and needed only trivial modification before compilation. The prediction coefficients determined by these methods may be used in eqn (21) to calculate a power spectrum or alternatively could be employed in the LPZ formula of Tang and Norris(42), eqn (26). It is this latter approach which I have taken here. Tang and Norris have shown that a reliably phased spectrum may be constructed and this is of more interest to the majority of practitioners than is a power spectrum.
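For orientation, the eqn (21)-style route from prediction coefficients to a power spectrum can be sketched as follows. This is the standard textbook AR form, not a transcription of eqn (21) itself; the sign convention for the coefficients is an assumption of this sketch.

```python
import numpy as np

def ar_power_spectrum(a, sigma2, freqs, dt=1.0):
    """Textbook AR power spectrum from prediction coefficients.

    a      : coefficients a_1..a_M, with the (assumed) convention
             x_n = -sum_m a_m x_{n-m} + e_n
    sigma2 : prediction-error variance
    freqs  : frequencies (in units of 1/dt) at which to evaluate
    """
    m = np.arange(1, len(a) + 1)
    denom = 1.0 + np.exp(-2j * np.pi * np.outer(freqs, m) * dt) @ a
    return sigma2 * dt / np.abs(denom) ** 2

# a single damped cosine: an AR(2) model places a spectral peak near f0
f0, r, dt = 0.2, 0.95, 1.0
a = np.array([-2 * r * np.cos(2 * np.pi * f0 * dt), r * r])  # poles at r e^{+/-i 2 pi f0}
f = np.linspace(0.0, 0.5, 501)
p = ar_power_spectrum(a, 1.0, f, dt)
print(f[np.argmax(p)])   # close to 0.2
```

The LPZ formula goes further than this power spectrum: by carrying the data's z-transform along with the prediction polynomial it retains phase information, which is why it is preferred in what follows.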

FIG.30. The synthetic test signal: eight doublets with linearly varying intensities and with splittings decreasing from 0.016 vN on the left to 0.002 vN on the right.

The methods of Section 6.2 are represented by the packages LPSVD and HSVD, both available from the authors; the programs are excellently documented and I have implemented the complex versions of both. I was imprudent enough to write my own spectral maximum entropy program based on the prescription given by Skilling and Bryan. This program calculates the first derivatives of the entropy and the fit analytically, and the second derivative of the fit in the space defined by two search directions numerically. Six spectrum files are used and one FID file. Six Fourier transforms per iteration are needed if the phase is not included in the iteration. I also programmed both the self-consistent version and the Newton iteration version of Gull and Daniell's implementation, but these proved in practice to be neither so robust nor so fast as Skilling and Bryan's algorithm. As I pointed out in Section 7.1.7 the phase may be introduced into the problem via eqn (136). The entropy is not a function of the phase (in the sense that neither ∂S/∂φ0 nor ∂S/∂φ1 are non-zero) but the fit C is, and has first and second derivatives with respect to φ0 and a first derivative with respect to φ1 which are easily determined. Inclusion of the phase into the problem most easily involves one extra Fourier transformation per iteration.


Let us start with the almost noiseless truncated FID whose FT spectrum is shown in Fig.30b. If we compare the AR/LP methods then we see that Yule-Walker and the covariance method provide stable spectra without any computational tricks, examples being given in Figs.31 and 32. Increasing the filter length of the YW method improves the resolution (Figs.31a to 31d) and the splittings can be emphasised by evaluating the transforms of the various arrays in the LPZ function on a circle within the unit circle by a suitable choice of k in the expression for z = exp(kΔt + i2πvΔt), as shown by Tang and Norris.(42) This resolution enhancement has been applied in Figs.31e to 31h and the lines


are seen to be sharper, although care must be taken to avoid negative-going sidelobes as in Fig.31h where too much enhancement has been employed. The spectra in Fig.31 were obtained by using the YW procedure iteratively as suggested by Tang and Norris(42) to overcome the problem of non-stationarity; if this is not done the spectra do not look as good, and this seems to be a general necessity for the YW method in which the assumption of stationarity has to be counteracted. It is simple enough to implement: the filter coefficients of the first iteration are used to form a predicted time series which is subtracted from the original time series. The residuals are then subjected to a new analysis whose spectrum is added to that from the previous iteration and whose predicted time series is subtracted from the old residuals to form new residuals which are then treated in the same manner.
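The iterative procedure just described can be sketched in a few lines of Python. The Yule-Walker fit here uses the biased autocorrelation and a plain linear solve, and the "spectrum" of each predicted component is a simple FFT magnitude rather than the LPZ formula; all of these simplifications are assumptions of the sketch, not features of Tang and Norris's program.

```python
import numpy as np

def yw_coeffs(x, M):
    """AR(M) prediction coefficients from the Yule-Walker equations."""
    N = len(x)
    r = np.array([np.vdot(x[:N - k], x[k:]) / N for k in range(M + 1)])
    R = np.array([[r[i - j] if i >= j else r[j - i].conj()
                   for j in range(M)] for i in range(M)])
    return np.linalg.solve(R, r[1:])     # x_n ~ sum_m a_m x_{n-m}

def predicted_series(x, a):
    """One-step forward prediction of x from its own past."""
    M = len(a)
    xhat = np.zeros(len(x), dtype=complex)
    for n in range(M, len(x)):
        xhat[n] = np.dot(a, x[n - M:n][::-1])
    return xhat

def iterative_yw_spectrum(x, M, n_iter=3, nfft=256):
    """Fit, add the predicted series' spectrum to the running total,
    then re-analyse the residuals - the iteration described above."""
    spec = np.zeros(nfft)
    resid = np.asarray(x, dtype=complex).copy()
    for _ in range(n_iter):
        a = yw_coeffs(resid, M)
        xhat = predicted_series(resid, a)
        spec += np.abs(np.fft.fft(xhat, nfft))
        resid = resid - xhat
    return spec

# toy FID: one decaying complex exponential at 0.25 of the sampling rate
rng = np.random.default_rng(1)
n = np.arange(128)
x = (np.exp((-0.02 + 2j * np.pi * 0.25) * n)
     + 0.05 * (rng.standard_normal(128) + 1j * rng.standard_normal(128)))
s = iterative_yw_spectrum(x, M=8)
print(np.argmax(s))   # near bin 64 (= 0.25 * 256)
```

Each pass models only what the previous passes left behind, which is how the stationarity assumption built into the YW normal equations is gradually counteracted.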


FIG.32. Spectra produced by the covariance method from the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) Covariance method with a prediction filter length of M = 16. (b) Covariance method with M = 24. (c) Covariance method with M = 32.

In Fig.32 the results of the covariance method for different values of the filter length M are presented. This method is obviously superior to YW. The spectra for Fig.32 were all evaluated on the unit circle using the LPZ formula and we see that, although in general the spectra are good, anomalies sometimes occur as in the second line from the right of Fig.32b where one root obviously lies too close to the unit circle. All these "high-resolution" spectral estimators suffer from such occasional problems. Burg's method can sometimes produce weird spectra; examples are shown in Fig.33 for a filter length M = 32 where the FPE had a discontinuity. The value of M was not critical. Figure 33a is the power spectrum obtained using eqn (21) and it is clearly very distorted. Figure 33b shows the spectrum calculated from the LPZ formula. The spectrum obtained by extrapolating the data and then Fourier transforming was much the same. The lines in Fig.33b are all very sharp and indeed all the splittings are resolved (and some extra splittings as well!). The lines can be broadened by evaluating the spectrum outside the unit circle and this has been done in Fig.33c; the lines become broader and the intensities are more reliable in the phased spectrum. I shall call this method post-broadening because it is applied after the calculation of the filter coefficients. The defects of Burg's method could be attributed to the non-stationarity of the FID, and this can be counteracted by multiplication with an increasing exponential; this can be called pre-narrowing since it is applied before

the calculation of the coefficients. This has been done in Fig.33d and the spectrum calculated with the LPZ formula outside the unit circle, i.e. with post-broadening. Figure 33e shows the spectrum with a filter length of M = 40 employing pre-narrowing and post-broadening. It seems that regardless of whether one applies pre-narrowing or not it is imperative that post-broadening is used to avoid the distorted spectra of Fig.33b. Even then the spectra are not very impressive.
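Operationally, post-broadening is nothing more than evaluating the prediction polynomial on a circle of radius other than one, which is the same as exponentially damping the coefficients before the frequency sum. The sketch below demonstrates this on the textbook AR spectral form; the sign convention for the broadening parameter k is an assumption of the sketch.

```python
import numpy as np

def ar_spectrum_scaled(a, sigma2, freqs, k=0.0, dt=1.0):
    """AR spectrum evaluated on a circle of radius exp(-k*dt) instead of
    the unit circle.  Operationally this scales the m-th prediction
    coefficient by exp(-k*m*dt): with the convention assumed here,
    k > 0 broadens the lines (post-broadening) and k < 0 narrows them."""
    m = np.arange(1, len(a) + 1)
    a_scaled = a * np.exp(-k * m * dt)
    denom = 1.0 + np.exp(-2j * np.pi * np.outer(freqs, m) * dt) @ a_scaled
    return sigma2 * dt / np.abs(denom) ** 2

# AR(2) resonance at f0 = 0.2; broadened evaluation lowers and widens it
f0, r = 0.2, 0.98
a = np.array([-2 * r * np.cos(2 * np.pi * f0), r * r])
f = np.linspace(0.0, 0.5, 501)
sharp = ar_spectrum_scaled(a, 1.0, f, k=0.0)
broad = ar_spectrum_scaled(a, 1.0, f, k=0.05)
print(sharp.max() > broad.max())   # True: post-broadening lowers the peak
```

Pre-narrowing is the mirror image applied to the data instead of the coefficients: multiplying the FID by an increasing exponential moves the signal poles toward the unit circle before the fit, and post-broadening then moves the fitted poles back inside it.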


If the modified covariance method is applied directly to the FID it produces the nonsensical spectrum of Fig.34a when the LPZ is employed. By post-broadening the lines in the LPZ formula the spectrum is improved but spurious splittings remain (Fig.34b). Pre-narrowing alone does not help (Fig.34c), but by pre-narrowing and then post-broadening (Fig.34d) a perfect spectrum is obtained!

Both Burg's and the modified covariance method treat the FID as if it were stationary, so it might be thought that the iterative application of these methods as for YW would improve the spectra. I have found, however, that this is not the case for truncated data. For all of these methods the calculation time for the prediction coefficients was about 0.005 sec., so that the main cost in using such a program lies still with the Fourier transformation; thus applying the Yule-Walker method iteratively takes about 0.2 sec. to produce Fig.31h, which should be compared with the 0.05 sec. required to calculate Fig.30b.


FIG.34. Spectra produced by the modified covariance method from the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) Modified covariance method with M = 32. (b) Same as (a) but with post-broadening. (c) Same as (a) but the FID was pre-narrowed before calculation of the filter coefficients. (d) Pre-narrowing followed by post-broadening.

The programs LPSVD and HSVD should produce identical results, but small differences presumably due to rounding error are noticeable. Figure 35 shows results for both programs for M = 44, which is roughly an optimum; the number of singular values retained in the decomposition was set at 16. Variation of M within the range 32 ... LPSVD and HSVD notice that the second line on the right should also be a doublet and divide it into a correctly phased intense singlet and a weak component with incorrect phase and damping factor; this is visible in Fig.35a for the LPSVD program. Such artefacts could, of course, be suppressed before plotting the spectrum. The output of both programs can also be used to produce enhanced resolution spectra by suitably altering the spectral linewidth. Program times were about 0.3 sec. for LPSVD and 0.16 sec. for HSVD; additionally the FID must be reconstructed and Fourier transformed.

FIG.35. Prony spectra of the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) LPSVD spectrum with filter length M = 44 and 16 singular values retained. (b) HSVD spectrum with M = 44 and 16 singular values retained. Small differences are noticeable between the two spectra, presumably due to rounding error.

The ME method is the slowest of all for these truncated data because the run time is proportional to the spectrum size J, not the data length K. The results for these non-noisy data are also not too good: in no case does the algorithm of Skilling and Bryan, efficient though it is, achieve the target fit value C = Kσ² given by eqn (75), where σ² is estimated from the data, within a reasonable number of iterations. The program knows this and the value of TEST (Section 7.1.2) remains large. Furthermore, the resolution is not as high as that obtained with the other methods. Figure 36a shows the results of my one-channel program after 10 iterations. The intensities are good, the baseline is fairly clean, but the splittings are not evident. After 25 iterations (Fig.36b) the first line on the left has split, but progress is slow at about 0.16 sec. per iteration for a spectral length of 2048. This value is smaller than would be expected for 7 Fourier transformations because on the one hand neither bit inversion nor normalisation is needed in the FFT and, furthermore, with only 64 data points a good many "butterflies"(21) can be eliminated. Noiseless data form a hard test for ME programs, and better conceived codes employing more than just two search directions will certainly perform faster. Exponential multiplication of the type suggested by Laue et al.(151,155) (see eqn (137)) may be used to emphasise the splittings, but this slows down the progress towards a solution considerably, and even then the spectra do not have the resolution obtainable with linear prediction. The target value of the fit may also be set very much higher than the estimated value, whereupon the program rapidly achieves a maximum entropy solution (small value of TEST) but the changes to the spectrum are minimal. Because the algorithm is completely stable it is only a matter of computer time before the correct termination criteria are fulfilled; thus after several hundred iterations the estimated target value of the fit has been obtained and TEST becomes less than 0.1. But these electronic exertions are not worth the effort, for the spectrum changes only unappreciably from that shown in Fig.36.
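The structure of such a constrained-entropy calculation can be illustrated with a toy version in the spirit of Gull and Daniell's self-consistent iteration mentioned earlier: the positive spectrum f is repeatedly reset to A·exp(-λ·∂C/∂f), where C is the sum of squared residuals between the measured FID and the FID implied by f. Everything here is a pedagogical simplification - in particular the Lagrange multiplier λ is held fixed, whereas a real implementation adjusts it until C reaches its target Kσ², and Skilling and Bryan's algorithm chooses its search directions far more cleverly.

```python
import numpy as np

def me_fixed_point(fid, J, lam=20.0, A=1e-4, n_iter=50):
    """Toy spectral-ME reconstruction by the fixed point
    f_j = A * exp(-lam * dC/df_j), with a FIXED multiplier lam.

    f is a real positive spectrum of length J; C is the sum of squared
    residuals between the K measured FID points and the FID implied by f.
    A is the default intensity (cf. eqn (87)): where the data say nothing,
    f relaxes to A."""
    K = len(fid)
    f = np.full(J, A)
    for _ in range(n_iter):
        mock = np.fft.ifft(f)[:K]                      # FID implied by f
        resid = np.concatenate([mock - fid, np.zeros(J - K)])
        grad_C = 2.0 * np.real(np.fft.fft(resid)) / J  # adjoint of ifft + truncation
        f = A * np.exp(-lam * grad_C)
    return f

# noiseless single line at bin 40 of 128, truncated to K = 32 points
n = np.arange(32)
fid = np.exp(1j * 2 * np.pi * 40 * n / 128)
f = me_fixed_point(fid, 128)
print(np.argmax(f))   # 40
```

Even this crude version shows the characteristic ME behaviour discussed in the text: the baseline relaxes towards the default A, while intensity is placed only where the fit gradient demands it.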

FIG.36. ME spectra of the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) ME spectrum after 10 iterations. (b) ME spectrum after 25 iterations.

The value of the default intensity A in eqn (87) also had little effect on the spectrum, the smaller the better. But there was only a minimal difference between the spectrum for A = 10⁻⁶ and that with A = 10⁻² (both relative to a maximum spectral intensity of 1.0). However, the approach to a solution is impaired if A is chosen to be too small. The reasons for the poor resolution are not hard to find and have been described by Bryan and Skilling(171). The ME program finds the spectrum with the highest entropy consistent with the fit given by eqn (75) having a certain value. This constrains the second moment of the residuals but not their first moment, and thus the smoothing effect of the entropy will tend to take intensity away from the peaks and use this to fill in the valleys without increasing C. Bryan and Skilling(171) suggested employing a different constraint by fitting the ordered residuals to a Gaussian distribution, thus ensuring that the residuals


remained white. They showed that indeed this "E statistic" produced a better resolution than the standard χ² statistic of eqn (75) for the deconvolution of blurred photographs, but my experience suggests that it is of little value for truncated NMR data. For almost noiseless truncated data the conclusions must by now be clear: only two algorithms are worth considering. These are the covariance method applied to the data without any tricks (Fig.32c) and the modified covariance method with pre-narrowing and post-broadening applied iteratively (Fig.34d). Both approaches yield spectra which are as good as that of Fig.30a, and neither is considerably more expensive than FT. Next, the silver medal winners so to speak, are the programs LPSVD and HSVD, and the bronze medal is taken by Yule-Walker. I do not think that the Burg spectra (Fig.33) are of great value in this context, neither does the ME method produce the required resolution.

Of course, although the problem of truncated data seems to have been solved by the covariance and modified covariance methods, the real tests start with noisy data as shown in Fig.30c. When significant noise is present Burg's method improves its performance (Fig.37a), and here the iterative procedure of Tang and Norris(42) is advantageous; even so the intensities leave much to be desired. The YW method generally has a lower resolution for a given filter length M (Fig.37b) but a higher stability in that the filter length can be pushed higher without artefacts appearing. The modified covariance method is also similar (Fig.37c), as is the covariance method (Fig.37d). Whereas the Burg, YW and covariance spectra were all obtained without post-broadening, the modified covariance method needed this treatment to produce a sensible-looking spectrum. Pre-narrowing cannot be applied to these data, for to multiply the full FID by an increasing exponential would emphasise the noise and not convert the FID to a stationary time series. Neither is it required for the spectra in Fig.37, which all look better than the FT spectrum Fig.30b because the baseline noise has been removed. The rightmost splitting which is not visible in the FT spectrum is resolved by Burg's method (insert to Fig.37a), but if one knows that Burg's method is subject to spurious splittings not much weight can be put on this information. The splitting does not show up with the other three methods even with resolution enhancement. The calculation time depends strongly on the filter length M: for M = 192 the fastest method YW required 0.29 sec. to find the coefficients, whereas the modified covariance was the slowest at 0.67 sec. The modified covariance and covariance methods can, however, be used non-iteratively so that they tend to be faster overall. HSVD also has no difficulty in analysing such data although, as may be expected, the computing times are enormous: 200 sec. for a filter length of 1536 yielding a XX^H matrix of dimension 512x512, and 1600 sec. for a filter length of 1024 where the matrix to be diagonalised has dimension 1024x1024. In each case 32 singular values were retained and in neither case was the rightmost splitting resolved.

The ME method is a surprise considering its poor performance on truncated non-noisy data. The program now has no difficulty in reaching the estimated fit value within 5 iterations (time about 3 sec. for transforming 2048 complex points to 4096 real points) and after 10 iterations the TEST value is well below 0.1. The spectra in Fig.38 are true ME spectra. In Fig.38a no exponential correction was employed and we see that the baseline noise is still present because the program is not actively distinguishing between signal and noise. As more and more exponential multiplication is used (Figs.38b to 38d) the baseline noise disappears, the lines become sharper and the splitting on the right is resolved. The resolution in Fig.38d is not as good as that of Fig.37a for Burg's method, but I would tend to give it more credence. The problems increase for truncated noisy data. Let us consider YW first. Figure 39a shows a YW spectrum obtained with a filter length of M = 32 after 3 iterations. It can be


compared with Fig.31c. Strangely, the resolution appears to be better for these noisy data: the first and second lines on the left have noticeable splittings as does the fourth line. But the third line is unsplit. Post-narrowing the spectrum leads to Fig.39b; the splittings become more pronounced, but the third line still remains unsplit. This irregularity in the resolution is caused by the non-linear nature in which the algorithm tackles the noise, i.e. additive noise in the data appears no longer as additive noise in the spectrum as with FT, but leads to irregular intensities and splittings. The value of M was not very critical in this case, and pre-narrowing the data to enforce stationarity had also very little effect. The computation times, of course, remain the same as for the almost noiseless truncated FID.

FIG.39. Spectra produced by the Yule-Walker method from the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) Yule-Walker spectrum with prediction filter length M = 32 after 3 iterations. (b) Same as (a) but with post-narrowing in the LPZ formula.

The covariance method which performed so well for almost noiseless truncated data gave rise to the spectra in Fig.40 for various prediction filter lengths and with neither pre- nor post-treatment. If post-narrowing is applied the lines in Fig.40b become sharper, but the splitting of the second peak on the left is never resolved. Post-broadening can also be applied to improve Fig.40c and the spectrum then looks similar to that of Fig.40b. It would appear that the best the covariance method can do is shown in Fig.40b. Again the computation times were the same as for the almost noiseless truncated FID.

FIG.40. Spectra produced by the covariance method from the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) Covariance spectrum with filter length M = 16. (b) Covariance spectrum with M = 24. (c) Covariance spectrum with M = 32.

Burg's method is true to its name of a "high-resolution" estimator. Even for these noisy truncated data the splittings are well resolved. Although the minimum of the FPE lay at about M = 28, the spectra are improved by choosing a longer filter. The spectra in Fig.41 were obtained with a filter length of M = 40, although this value was not critical. In Fig.41a the spectrum was calculated with post-broadening, that is the transforms were calculated outside the unit circle; this seems always to be necessary with Burg's procedure to avoid instability. Figure 41b was calculated by initial pre-narrowing followed by final post-broadening; this should improve the spectrum because the FID then comes closer to being stationary, but obviously the influence of noise plays a hidden role here.

FIG.41. Spectra produced by Burg's method from the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) Burg spectrum with a filter length of M = 40 and post-broadening. (b) Pre-narrowing followed by post-broadening.

The modified covariance method gives similar results and, as in the case of almost noiseless truncated data, pre-narrowing is essential.

The programs based on Prony's method, LPSVD and HSVD, do not perform as well as might be expected and both are very dependent on the particular noise realisation if used to extract the splittings. Spectra from LPSVD for a filter length of M = 44 and 16 retained singular values are shown in Figs.42a to 42c for three different realisations of the additive noise. Comparable spectra from HSVD are shown in Figs.42e to 42g. Obviously HSVD is more stable with respect to noise, and splittings can be resolved as in Fig.42g. If the number of singular values which are to be retained is reduced to 8 then the programs cannot resolve the splittings and the spectra of Figs.42d and 42h are obtained; these spectra are now largely independent of the noise realisation.

FIG.42. Prony spectra of the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) LPSVD spectrum with filter length M = 44 and 16 singular values retained. (b) Same conditions as (a) but a different noise realisation. (c) Same as (a) and (b) but yet another noise realisation. (d) LPSVD spectrum with M = 44 and 8 retained singular values. (e) HSVD spectrum with filter length M = 44 and 16 singular values retained. (f) Same as (e) but a different noise realisation. (g) Same as (e) and (f) but yet another noise realisation. (h) HSVD spectrum with M = 44 and 8 retained singular values.


FIG.43. ME spectra of the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) ME spectrum after 10 iterations. (b) ME spectrum after 25 iterations.

The results of the ME method applied to the noisy truncated FID are almost identical to those in Fig.36 for the almost noiseless truncated FID. The splittings are not well resolved (Fig.43) but the baseline is clean and the intensities are good. With considerable noise present in the data Skilling and Bryan's algorithm rapidly finds a solution at the estimated fit value within 10 iterations. After 25 iterations (Fig.43b) the leftmost line has a very small splitting. The spectra in this case are not strongly dependent on whether or not exponential multiplication is employed.

FIG.44. Spectra of the full very noisy FID (K = 2048) whose FT spectrum is shown in Fig.30e. (a) FT spectrum after multiplication of the FID with a decreasing exponential (matched filter). (b) Burg spectrum with a prediction filter length of M = 384. (c) Burg spectrum with a filter length of M = 128, pre-broadened FID and after 3 iterations. (d) Burg spectrum with M = 32, pre-broadening, 3 iterations and post-narrowing.

All the AR/LP methods are pretty poor when applied to a full very noisy FID. They should be compared with the effects of a matched filter (Fig.44a). None of these methods is capable of extracting the splittings with any greater reliability than FT after matched filtering and at the same time providing good estimates of the intensities. To pull out the splittings a large filter length is required and this can lead to gross intensity distortions and, furthermore, considerable noise remains in the baseline. This is shown in Fig.44b for Burg's method with a filter length of 384. The effects of noise can be reduced by multiplying the FID by a decreasing exponential, i.e. pre-broadening, and then applying an AR/LP method. In this case it is advisable to use a shorter filter, but the results are not

much better (see Fig.44c). By reducing the filter length drastically the baseline noise and the splittings can be made to disappear and the spectrum becomes cleaner (Fig.44d), but the information loss is great when compared with matched filtering (Fig.44a) and no amount of post-narrowing is able to restore this. If anything the Yule-Walker method or the covariance methods are superior, but only in the rather negative sense that their results are closer to that of Fig.44a. The intensities and splittings in all the methods are very strongly dependent both on the particular noise distribution and on the choice of broadening functions and filter lengths; I think my only general recommendation can be: don't bother.

The failure of the AR/LP methods for very noisy data is to be expected; as I pointed out in Section 6.1.8 the AR model is wrong. The correct treatment of additive noise leads to an ARMA process and neglect of the MA branch cannot be warranted. The other two approaches, Prony's method and spectral ME, both take noise into account so one may expect better performance. There is one great difference, however, between these two methods and that is computing time. Employing a filter length of M = 1536 leads to a XX^H matrix of dimension 512x512 and the diagonalisation of this takes considerable time, of the order of 200 sec. on the Cyber. A shorter filter leads to a larger matrix and correspondingly greater calculation times. Spectral ME requires about 5 sec.

Because the results we have already seen suggest that the HSVD program is superior to LPSVD I tested only the former. In general the results improve for a larger filter and this is consistent with the recommendation of Kumaresan and Tufts. Furthermore, the spectra look better if the number of singular values which are retained is set higher than the number of signals expected. This introduces extra noise in the form of signals with incorrect damping factors and phases, but these can be eliminated before plotting the spectrum.

In Fig.45 I present the results of the HSVD analysis for a filter length of M = 1536 and with 32, 64 and 128 singular values retained respectively. In Fig.45a I have left in the spurious signals, which can in general easily be identified, as shown by Gesmar and Led, on the basis of their standard errors and also on the frequency dependency of their phase. A spectrum in which the weak signals are eliminated is shown in Fig.45b; here the error


analysis of the HSVD program has been used to identify those lines whose amplitudes are smaller than their corresponding estimated standard deviations. In Fig.45c this technique has been employed and also those signals whose phase lay more than 30° away from the expected phase at that frequency have been suppressed; this results in the elimination of all spurious signals. Another item characterises the three spectra in Fig.45. Figure 45a was obtained by retaining 32 singular values in the calculation, Fig.45b with 64 singular values and Fig.45c with 128. Apart from the better resolution as the number of retained singular values increases, it is a worrying feature of all three spectra that lines are missing. The second line on the right is just not found by HSVD if only 32 or 64 singular values are retained, so it would appear that a good strategy is to retain a large number of such values. In Fig.45c it is there but two other lines are absent. The reason for this lies in the way in which spurious signals are suppressed. The missing lines are shown in the insert to Fig.45c. It is visible that one is a close doublet; by investigating the tabular output of HSVD one can see that indeed both lines consist of two closely spaced signals with large amplitudes but also with considerable errors. Presumably when HSVD cannot quite distinguish whether two signals are present or only one it assumes two but gives both of them large errors.

If an automatic

sorting procedure

signals then lines which should be there recommend

that the original

output

might

is then used to weed-out

well be suppressed.

be scrutinized

and such selection

the weak

I would strongly be made by the

spectroscopist. Figure 46 presents ME spectra of the very noisy full FID, all after 10 iterations, a calculation time of about 5 sec. In Fig.46a the spectrum was obtained with only minimal exponential multiplication (equivalent to a line narrowing of 0.001 Hz) and we see that the program leaves a lot of baseline noise in the spectrum. For a greater degree of exponential multiplication (0.002 Hz) the program starts discriminating against the noise (Fig.46b) and, furthermore, the splittings are intensified; the inserts show that the second line on the right is well split, and if one employs some imagination the first line on the right is also! Figure 46c shows the results after using exponential multiplication equivalent to the Lorentzian line-width of the signals; the spectrum is not so impressive as is Fig.46b, because the program had not reached a true maximum entropy solution after 10 iterations, the value of TEST remaining at about 0.2.
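The automatic weeding criteria used for Fig.45 (amplitude smaller than its estimated standard deviation, or phase more than 30° from the expected value) can be sketched as follows. This is an illustrative reconstruction, not the HSVD code itself; the tuple layout and the zero expected phase are assumptions:

```python
def filter_lines(lines, phase_tol_deg=30.0):
    """Keep only fitted lines whose amplitude exceeds its estimated
    standard deviation and whose phase lies within phase_tol_deg of
    the expected phase (taken as zero here for simplicity)."""
    kept = []
    for amp, amp_std, phase_deg in lines:
        if amp <= amp_std:                   # amplitude not significant
            continue
        if abs(phase_deg) > phase_tol_deg:   # wrong phase, likely spurious
            continue
        kept.append((amp, amp_std, phase_deg))
    return kept

# Three candidate lines: one solid, one weak, one badly phased.
lines = [(10.0, 1.0, 5.0), (0.5, 0.8, 2.0), (8.0, 1.0, 75.0)]
print(filter_lines(lines))   # only the first survives
```

As the text warns, such automatic selection can also discard genuine lines, so the tabular output should still be inspected by the spectroscopist.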

To round off this section I show the results of applying the methods considered previously to the very noisy truncated FID. Figure 47a illustrates the YW method with M = 32, 3 iterations and post-narrowing to improve the resolution. The spectrum is not too dependent on the filter length, but M should not be chosen too large. Pre-narrowing was of no value, as might be expected for a high noise level. In Fig.47b we see the results of the Burg method for M = 32 with post-broadening, which is essential. Again M should not be chosen too large, and it seems that iteration helps here.

The modified covariance method is illustrated in Fig.47c; as for Burg's method, post-broadening is necessary and iteration may be helpful. Pre-narrowing, however, did not help. In this case and for the covariance method (Fig.47d) a fairly short prediction filter improves the spectrum. Of these AR/LP methods only the Yule-Walker algorithm seems to give generally reliable results whose dependence on the parameters is low.
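For orientation, the Yule-Walker step underlying the YW spectra of Fig.47a can be sketched as below. This is a minimal illustration of the normal equations for a real-valued AR(M) model, not the program used in the article; the test process is an assumption:

```python
import numpy as np

def yule_walker(x, M):
    """Estimate AR(M) prediction coefficients a_1..a_M from biased
    sample autocorrelations by solving the Yule-Walker system."""
    K = len(x)
    r = np.array([x[l:] @ x[:K - l] for l in range(M + 1)]) / K
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    return np.linalg.solve(R, r[1:M + 1])

# AR(1) test process x_k = 0.9 x_{k-1} + noise: the fit recovers ~0.9.
rng = np.random.default_rng(3)
x = np.zeros(20000)
for k in range(1, len(x)):
    x[k] = 0.9 * x[k - 1] + rng.normal()
a = yule_walker(x, 1)
print(abs(a[0] - 0.9) < 0.05)   # True
```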

In comparison, the HSVD spectrum of Fig.47e is considerably better, although if an attempt is made to resolve the splittings a considerable number of spurious signals is introduced. By using a large filter and by retaining a small number of singular values, reliable estimates of the intensities are obtained. Maximum entropy (Fig.47f) also does fairly well, although even with exponential multiplication the baseline noise is not removed. The intensities, though, are relatively good.

Several things become clear after considering these results. Firstly, the AR/LP methods are not reliable for full noisy FIDs. The best method here, considering the cost of computer time, is spectral Maximum Entropy, illustrated in Figs.38 and 46. The ME method does provide considerable advantage over matched filtering as regards resolution for very noisy data. For truncated data the decision should be made on the basis of the signal-to-noise ratio. For severely truncated data with good signal-to-noise the covariance method, the modified covariance method or HSVD can all be used successfully. Problems might be encountered with the modified covariance method if the line-widths of the individual signals differ significantly, since this method requires pre-narrowing to ensure stationarity. As the signal-to-noise ratio decreases, the decision should be made whether one wants high resolution or good intensities. A compromise solution is given by the Yule-Walker method. HSVD and ME give good intensities, whereas Burg's method yields higher resolution. For very low signal-to-noise ratios we can forget the splittings and concentrate on the intensities, and here again HSVD and, to a lesser extent, ME are successful. These conclusions are based solely on the foregoing results and can be taken only as very general recommendations.

8.2. Two-Dimensional

To demonstrate the application of spectral Maximum Entropy to two-dimensional NMR I chose the synthetic FID of an AMX spin system measured in a COSY experiment. Complex points were taken to provide quadrature detection in t2. To provide quadrature detection in t1 the results of two experiments differing in the phase of the second 90° pulse were combined to select the N peaks. This combination of two experiments introduces the phase-twist lineshape mentioned in Section 4.2. Furthermore, restricting the number of points in each dimension to 64 results in truncation artefacts when the FID is zero-filled to 256x256 and Fourier transformed. Both these effects can be seen in Figs.48a and 49a.

FIG.48. Two-dimensional COSY spectra of an AMX spin system. The contour plots of the real parts of the complex spectra are shown. Four positive and four negative contours are plotted at heights of ±0.5, ±0.25, ±0.125 and ±0.0625 times the highest peak; the negative contours are shown as dashed lines. (a) FT spectrum of a 64x64 complex FID zero-filled to 256x256 points before transformation. (b) ME spectrum of the same FID.

After 10 iterations of a two-channel ME program based on the algorithm of Skilling and Bryan and employing exponential multiplication to discriminate against the noise, the spectra shown in Figs.48b and 49b were obtained. The phase-twist is eliminated, as are the truncation artefacts, the lines are sharp and in the contour plot it is easy to identify positive and negative signals in the cross-multiplets. Since all the lines are now in pure absorption, the interference of dispersion mode tails from the diagonal peaks is no longer a problem, as in normal phase-sensitive COSY, and the sensitivity is of course much higher than that obtainable with a double-quantum filter. The price to be paid for this achievement is a


calculation time of 9.2 sec per iteration for this 256x256 point spectrum, which should be compared with a time of 0.81 sec for a straightforward 256x256 FFT using a complex version of the 2D FFT program in Niederdrenk's book(23), including bit-inversion and normalisation.
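The zero-filling and transformation step described above can be sketched with a general-purpose FFT; numpy's fft2 stands in here for the dedicated 2D FFT program, and the synthetic single-line FID is an assumption for illustration:

```python
import numpy as np

# Zero-fill a 64x64 complex FID to 256x256 and transform, as done for
# the FT spectra of Figs.48a and 49a (one damped line in each dimension;
# the frequencies 0.1 and 0.2 cycles/point are arbitrary choices).
t1 = np.arange(64)[:, None]
t2 = np.arange(64)[None, :]
fid = np.exp((2j * np.pi * 0.1 - 0.01) * t1) * np.exp((2j * np.pi * 0.2 - 0.01) * t2)

padded = np.zeros((256, 256), dtype=complex)
padded[:64, :64] = fid          # zero-filling: truncation artefacts remain
spectrum = np.fft.fft2(padded)
print(spectrum.shape)           # (256, 256)
```

The abrupt truncation at 64 points is what produces the sinc-like wiggles around each peak in the FT spectrum.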

FIG.49. Two-dimensional COSY spectra of an AMX spin system. The real parts of the complex spectra are shown. (a) FT spectrum of a 64x64 complex FID zero-filled to 256x256 points before transformation. (b) ME spectrum of the same FID.


APPENDIX A: SOME STATISTICAL CONCEPTS RELATING TO TIME SERIES

Without going too much into the details of the statistics of stochastic processes, which are extensively treated by Papoulis(6), Haykin, and by Gardner(172), as well as by Marple and by Kay, we can distinguish a few topics which are relevant to the proper treatment of time series. Firstly let us define a discrete time series as a set of observations made at equidistant time intervals Δt; thus

{x_k} = {x_0, x_1, x_2, ..., x_{K-1}}

is a time series composed of measurements of the variable x(t) made at times t = 0, Δt, 2Δt, ..., (K-1)Δt. Each of these measurements is in some sense a number drawn from a stock of possible numbers, and our chances of obtaining "x_k" at time kΔt are governed by the probability distribution of x_k. This defines x_k as a random variable, i.e. if we were to try to predict x_k we should be hard put to say what value would be obtained, although if we were to measure one and the same "x_k" many times the average of our measurements and their scatter would fit the probability distribution.

Since the time series consists of K such measurements we have a K-fold indeterminacy, each element of {x_k} having its own probability distribution. The ensemble of K probability distributions is called a stochastic process, and our time series is just one possible realisation of this process.

If the signal we are dealing with is completely known, the time series is described as deterministic in the sense that all the elements are predictable. Normally, however, we have noise to contend with and the process is non-deterministic. In this case, though, certain averages may be defined which lead to a reduced set of parameters describing the process: its statistics. Of these the most important are the first and second order statistics, i.e. the mean, the autocorrelation and the autocovariance. Although higher order statistics can also be defined, they are of no interest to us here.

The mean of the stochastic process which gives rise to a particular time series is defined as

μ_k = E[x_k]

where E is the expectation operator; the autocorrelation function of the process is

R_kl = E[x_{k+l} x_k*]

where again E stands for the expectation operator; and the autocovariance function is

C_kl = E[(x_{k+l} - μ_{k+l})(x_k - μ_k)*] .

These latter two statistics are functions which link together the probability distributions of the individual elements of {x_k}. Unfortunately the statistics themselves are not accessible to us: all we can do is estimate them by carrying out the appropriate mathematical operations on the available elements of the one measured time series at our disposal. A stochastic process all of whose statistics are independent of time origin, or for a

together the probability distributions of statistics themselves are not accessible to the appropriate mathematical operations series at our disposal. independent of time origin, or for a

discrete time process are independent of index shift, is called strictly stationary or strictsense stationary. Processes for which the mean and the autocorrelation and autocovariance function

only are independent

stationary of order two. p -

pk -

constant

of time shift are called

wide-sense

stationary

In these cases the mean becomes independent of index:

,

or

weakly

and the autocorrelation and autocovariance become functions of one index only:

R_l = R_kl ,  C_l = C_kl .

An important property of stationary processes is the Hermiticity of the autocorrelation function:

R_l = R_{-l}* .
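Since only one realisation is available, R_l must in practice be estimated from the measured series itself. A common biased sample estimator (a standard choice assumed here, not taken from the text) also inherits the Hermiticity property exactly:

```python
import numpy as np

def sample_autocorrelation(x, lag):
    """Biased sample estimate of R_l = E[x_{k+l} x_k*] from one realisation."""
    K = len(x)
    if lag >= 0:
        return np.sum(x[lag:] * np.conj(x[:K - lag])) / K
    l = -lag
    return np.sum(x[:K - l] * np.conj(x[l:])) / K

rng = np.random.default_rng(1)
x = rng.normal(size=4096) + 1j * rng.normal(size=4096)   # complex white noise
R0 = sample_autocorrelation(x, 0)                        # ~ E|x_k|^2 = 2
print(abs(R0 - 2.0) < 0.3)                               # True
# Hermiticity R_l = R_{-l}* holds for the estimate as well:
print(abs(sample_autocorrelation(x, 3)
          - np.conj(sample_autocorrelation(x, -3))) < 1e-12)  # True
```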

This relationship is put to use in the derivation of the Yule-Walker equations in Section 6.1.2, of the Burg algorithm in Section 6.1.4, where the forward and backward prediction coefficients are related by complex conjugation, and of the modified covariance method of Section 6.1.5, where the same assumption is made. That stationarity is not an inherent property of the NMR FID can be seen if we consider a time signal corresponding to a single damped sinusoid, the elements of which are given by

x_k = b exp[(α + i2πν)kΔt] .

The autocorrelation function is then

R_kl = E[x_{k+l} x_k*] = |b|² exp[αΔt(2l + k) + i2πνkΔt] = |b|² exp(αkΔt + i2πνkΔt) exp(2αlΔt) ,

which shows a dependence on the index l. The importance of this non-stationarity for linear prediction can be seen by considering again the case of a single damped sinusoid:

x_k = a_1^f x_{k-1} ,  a_1^f = exp[(α + i2πν)Δt] ,
x_k = a_1^b x_{k+1} ,  a_1^b = exp[-(α + i2πν)Δt] ,

so that it is easy to see that a_1^f ≠ (a_1^b)* unless α = 0. Methods such as the above-mentioned would thus be expected to lead to spectral distortion if applied to NMR because one of the assumptions is incorrect, and this is in fact what one sees. Methods which specifically assume the existence of damped sinusoids, such as the functional modelling algorithms, the covariance approach of Section 6.1.5 or the LPQRD technique of Tang and Norris, should thus be preferable.
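The inequality a_1^f ≠ (a_1^b)* for a damped line is easy to verify numerically; the parameter values below are arbitrary:

```python
import cmath

# One-step forward and backward prediction coefficients for a single
# damped sinusoid x_k = b*exp[(alpha + 2j*pi*nu)*k*dt].
alpha, nu, dt = -5.0, 100.0, 1e-3
a1f = cmath.exp((alpha + 2j * cmath.pi * nu) * dt)    # x_k = a1f * x_{k-1}
a1b = cmath.exp(-(alpha + 2j * cmath.pi * nu) * dt)   # x_k = a1b * x_{k+1}

# With damping (alpha != 0) the conjugate-symmetry assumption fails:
print(abs(a1f - a1b.conjugate()) > 1e-6)   # True
# With no damping (alpha = 0) it holds:
a1f0 = cmath.exp(2j * cmath.pi * nu * dt)
print(abs(a1f0 - cmath.exp(-2j * cmath.pi * nu * dt).conjugate()) < 1e-12)  # True
```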

APPENDIX B: THE Z TRANSFORM AND THE COMPLEX PLANE

Whilst the Fourier transform should be well-known to all NMR spectroscopists, the z transform is probably a novelty. The standard text is the book by Jury(173), who takes a very mathematical approach. The essentials of the z transform are, however, simple and are given in many books on signal processing, for instance Cadzow(174) or Roberts and Mullis(26), who define it as

X(z) = Σ_{k=-∞}^{∞} x_k z^(-k)    (154)


where {x_k}, k = -∞, ..., ∞, is a discrete time sequence whose elements may or may not be complex, z is a complex variable, and X(z) is clearly a complex function of z which is only defined for those values of z for which the power series eqn (154) converges. This idea of convergence is very important, for it leads to the definition of the singularities (or poles) of X(z) and their position in the complex plane. If we restrict our attention to causal sequences {x_k}, i.e. those sequences having x_k = 0 for k < 0, then it can be shown(174) that X(z) converges everywhere outside a circle of radius R in the complex plane, where R depends on the sequence {x_k}. For |z| > R, i.e. in the region

of convergence, the infinite series eqn (154) can nearly always be expressed(174) as a simple ratio of polynomials in z. To clarify the situation let us take the specific example of an exponential sequence {x_k} where x_k = exp(ikω) for k ≥ 0; thus

X(z) = Σ_{k=0}^{∞} exp(ikω) z^(-k) = Σ_{k=0}^{∞} (z^(-1) e^(iω))^k    (155)

The power series in eqn (155) converges only for |z| > |e^(iω)|, and in this case we may express X(z) as

X(z) = 1/(1 - z^(-1) e^(iω)) = z/(z - e^(iω))    (156)

which is a simple ratio of polynomials in z. This example also nicely illustrates several other features of the z transform. It can be seen from eqn (156) that the expression for X(z) becomes zero at z = 0, and becomes infinite for z = exp(iω). We say that X(z) has a zero at z = 0 and a pole at z = exp(iω), even though eqn (156) is no longer a valid expression for the z transform within the region of divergence. This is shown in Fig.50.
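The convergence of eqn (155) to the closed form of eqn (156) outside the unit circle can be checked numerically; the evaluation point is arbitrary:

```python
import cmath

# Partial sums of X(z) = sum_k exp(i*k*w) z^{-k} approach z/(z - e^{iw})
# for |z| > 1, illustrating eqns (155) and (156).
w = 0.7
z = 1.5 * cmath.exp(0.3j)            # a point outside the unit circle
series = sum(cmath.exp(1j * k * w) * z**(-k) for k in range(200))
closed = z / (z - cmath.exp(1j * w))
print(abs(series - closed) < 1e-10)  # True
```

Inside the unit circle the partial sums would diverge, which is exactly why eqn (156) is only a valid expression for the transform in the region of convergence.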

FIG.50. A section of the complex plane showing the regions of convergence and divergence separated by the boundary circle of radius R centred on the origin. The simple exponential function of the text has a zero at the origin and a pole on the boundary.

Since the z transform of a sequence {x_k} can be expressed as the ratio of an mth order polynomial to an nth order polynomial

X(z) = (b_0 z^m + b_1 z^(m-1) + ... + b_m) / (z^n + a_1 z^(n-1) + ... + a_n)

and because both polynomials may be written in factored form

X(z) = b_0 (z - z_1)(z - z_2) ... (z - z_m) / [(z - p_1)(z - p_2) ... (z - p_n)]    (157)

we see that, in general, X(z) has m zeros and n poles. Furthermore, these zeros and poles contain all the essential information about the function X(z) except for the multiplicative constant b_0. The positions of these poles and zeros are thus crucial in applications of the z transform. One other very important point to be made here is that at least one of the poles is located on the boundary separating the region of convergence from the region of divergence, as can be seen for the simple example of Fig.50. This pole will always be that pole of X(z) which is located furthest from the origin, meaning that each pole lies either in the region of divergence or on the boundary. As may be expected, no poles exist in the region of convergence. The usefulness of this property is that a knowledge of the positions of the poles tells us where the boundary is between convergence and divergence.

Let us now see how the Fourier transform fits in here. The z transform may be viewed as a very general representation of a sequence over the whole complex z plane, but if we restrict z to the unit circle, i.e. z = exp(iω) or |z| = 1, then the sole variable becomes ω and the z transform reduces to

X(exp(iω)) = Σ_{k=-∞}^{∞} x_k exp(-iωk) ,

which is the continuous Fourier transform of the sequence {x_k}.
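For a finite causal sequence this reduction is exactly the DFT, which a few lines of code confirm (numpy's sign convention matches the e^(-iωk) form above):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
N = len(x)
k = np.arange(N)
# Evaluate X(z) = sum_k x_k z^{-k} at z = exp(2*pi*i*n/N), n = 0..N-1:
ztf = np.array([np.sum(x * np.exp(-2j * np.pi * n * k / N)) for n in range(N)])
print(np.allclose(ztf, np.fft.fft(x)))   # True
```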

x _

circle

1

--.-___

w+

\

2 -----_____

.

Re z

FIG.51. The stability of a sequence is related to the positionof the poles of its z transformwith respectto the unit circle. Stable systems have their polls inside the unit circle.

It is interesting to consider the types of time series which may arise and to discuss their behaviour in terms of their z transforms evaluated on the unit circle. If the unit circle lies in the region of convergence, the time series has a bounded Fourier transform. If, on the other hand, the unit circle lies within the region of divergence, the Fourier transform is unbounded, which means that the elements of the time series do not converge with increasing time. This latter condition will occur if any of the poles of the z transform lie outside the unit circle, for we have seen that the furthermost pole from the origin lies on the boundary between convergence and divergence. This is shown in Fig.51 for simple exponential time series. If the z transform of the series has a pole outside the unit circle, this is equivalent to a series with exponentially increasing amplitude (case 1). In terms of


response theory this corresponds to the transfer function of an unstable filter. Case 2, where the pole lies on the unit circle, is also unstable; although the amplitude of the series does not diverge, neither does it converge. Case 3 is the system corresponding to the usual situation for NMR signals: the amplitude decays with time and the pole lies inside the unit circle.

In some of the literature a polynomial in z having its roots inside the unit circle is described as minimum phase, or if its roots lie outside the unit circle as maximum delay. This nomenclature is confusing in the present context and I have avoided it. The important thing is that if the spectrum of a time sequence is to be represented by a quotient of polynomials such as eqn (157), then the poles must lie within the unit circle if the spectrum is to make any sense. For remember that the methods which rely heavily on the z transform notation are those methods which deliberately extrapolate the time data beyond the measurement period: a spectrum with poles outside the unit circle would correspond to a divergent extrapolation which is physically unrealistic.

Another point is the relevance of the position of the poles with respect to the unit circle in the context of resolution. The closer these poles lie to the unit circle, the sharper will be the spectral lines if the Fourier transform is employed, i.e. if the z transform is evaluated on the unit circle. Alternatively, if the spectrum is calculated using a z transform other than on the unit circle, we have the possibility of decreasing the spectral linewidth by choosing a circle of evaluation inside the unit circle. This fact has been put to use by Tang and Norris in their LPZ method.
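The narrowing effect of evaluating the z transform on a circle of radius ρ < 1 can be illustrated as follows. This is only a sketch of the idea exploited by the LPZ method, not Tang and Norris's implementation, and the line parameters are arbitrary:

```python
import numpy as np

alpha, nu, K = -0.05, 0.25, 512
x = np.exp((alpha + 2j * np.pi * nu) * np.arange(K))   # one damped line

def spectrum_on_circle(x, rho, n_freq=2048):
    """Evaluate X(z) = sum_k x_k z^{-k} on the circle z = rho*exp(i*w)."""
    k = np.arange(len(x))
    w = 2 * np.pi * np.arange(n_freq) / n_freq
    basis = np.exp(-1j * np.outer(w, k)) * rho ** (-k.astype(float))
    return basis @ x

def width_bins(spec):
    m = np.abs(spec)
    return np.count_nonzero(m > 0.5 * m.max())   # crude FWHM in grid points

s_unit = spectrum_on_circle(x, 1.0)              # ordinary Fourier spectrum
s_inner = spectrum_on_circle(x, np.exp(alpha))   # circle at the pole radius
print(width_bins(s_inner) < width_bins(s_unit))  # True
```

Evaluating at the pole radius cancels the damping entirely, so the residual width is set only by the truncation of the data record.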

Postscript

The search for the maximum of entropy is akin to the search for the holy grail(175): many seek and few find. This, however, does not discourage the proposal of new algorithms, such as the recently suggested approach of Zhuang et al.(176) These authors follow the path of Gull and Daniell and look for a maximum of Q(λ) defined in Section 7.1.2.1. They show that the equations may be recast in the form of a Cauchy boundary-value problem(74) and that a solution may be obtained by defining an initial spectrum for λ = 0 and then by following the solution curve ∇²Q (dX/dλ) = ∇C, using the notation of Section 7.1.2. This leads to an iterative procedure in which λ is continually incremented and the spectrum can be updated from iteration to iteration (note that on the right-hand side of eqn (37) in reference (176) F_k should be replaced by f_k). Provided that the λ increments are small enough, a maximum entropy solution should be achieved. Zhuang et al. show that this algorithm is as efficient as that of Skilling and Bryan.

Gesmar and Led(177) have applied their functional modelling method to two dimensions, and their one-dimensional technique has been used to evaluate the differences in the 13C spectra of insulin from various sources.(178) Pisarenko's method (Section 6.1.7) has been adapted by Shinnar and Eleff(179) and applied both to synthetic data and to the 31P signals of a rat. Delsuc and Levy(180) have used spectral ME to recover J-coupling patterns in both one- and two-dimensional NMR, whilst Davies et al.(181) have compared spectral ME with the CLEAN program of Högbom(182,183) and conclude that the results of both approaches are similar but that CLEAN is faster.

The fundamental idea behind the application of Maximum Entropy to NMR is that the spectrum is related to a probability density whose entropy can be calculated. The simplest relationship is the one considered in Section 7.1.2, where the spectrum of Fig.5a is regarded as the probability density of the amplitudes of the isochromats of Fig.8c. This requires that the spectrum be positive and all the signals in-phase. Zero- and first-order phase effects can be corrected for, and the procedure of Laue et al. allows the treatment of negative signals (Section 7.1.7). But these ad hoc modifications are not applicable to more general situations where the FID to be measured corresponds to magnetisation vectors brought to different starting positions on the x'y'z sphere by a multipulse experiment. The associated spectra may have signals whose phases take on any value between 0 and 2π, and the artificial restriction to a small number of discrete phases is a severe constraint.

To circumvent this problem Daniell and Hore(184) have recently proposed maximising an entropy based on a novel probability density. In a classical sense this is equivalent to treating all orientations of the spins as equally likely, so that the unconstrained maximisation of the entropy leads to an isotropic distribution of magnetisation. Development of this idea produces an entropy which may be defined in terms of the modulus of a complex spectrum, thus overcoming the restriction to real positivity and the concomitant problem of signal phase. Daniell and Hore show that this classical approach has a quantum mechanical analogue in which the role of the probability density is taken by the diagonal elements of a density operator describing the system before excitation. For spins I = 1/2 this produces an entropy which differs from the traditional form given by eqn (87) by becoming quadratic in |X_j| for values of |X_j| which are small compared to the default A. This interesting new proposal is almost certain to rejuvenate the controversy associated with the application of Maximum Entropy.

Acknowledgements - It is a pleasure to acknowledge the support and advice of Prof. G. Binsch. The Bavarian Academy of Sciences is to be thanked for allowing me access to the machinery at the Leibniz Computing Centre, Munich. I am grateful to Dr. J.J. Led for sending me preprints of references 80, 177 and 178, and to Dr. J. Tang for a reprint. Furthermore, I am much obliged to Dr. R. de Beer for making the programs LPSVD and HSVD available. Finally, I am greatly indebted to Dr. G.J. Daniell for his valuable comments, and to Dr. P. Hore for communicating reference 184.

REFERENCES

1. S.SIBISI, Nature 301, 134 (1983).
2. R.P.FEYNMAN, 'Surely You're Joking, Mr. Feynman!', Unwin Paperbacks, Hemel Hempstead (1986).
3. J.C.LINDON and A.G.FERRIGE, Prog. Nucl. Magn. Reson. Spectrosc. 14, 27 (1980).
4. H.J.GRAY and A.ISAACS, A New Dictionary of Physics, Longman, London (1975).
5. S.HAYKIN, Communication Systems, 2nd Edn., Wiley, New York (1983).
6. A.PAPOULIS, Probability, Random Variables, and Stochastic Processes, 2nd Edn., McGraw-Hill, New York (1984).
7. E.A.ROBINSON, Proc. IEEE 70, 885 (1982).
8. B.R.FRIEDEN in Picture Processing and Digital Filtering, T.S.HUANG (Ed.), Topics in Applied Physics, Vol. 6, Springer, Berlin (1979).
9. E.FUKUSHIMA and S.B.W.ROEDER, Experimental Pulse NMR, Addison-Wesley, Reading, Mass. (1981).
10. D.I.HOULT, Prog. Nucl. Magn. Reson. Spectrosc. 12, 41 (1978).
11. C.-N.CHEN, D.I.HOULT and V.J.SANK, J. Magn. Reson. 54, 324 (1983).
12. C.J.TURNER and H.D.W.HILL, J. Magn. Reson. 66, 410 (1986).
13. R.R.ERNST and W.A.ANDERSON, Rev. Sci. Instrum. 37, 93 (1966).
14. D.SHAW, Fourier Transform NMR Spectroscopy, 2nd Edn., Elsevier, Amsterdam (1984).
15. R.R.ERNST, G.BODENHAUSEN and A.WOKAUN, Principles of Nuclear Magnetic Resonance in One and Two Dimensions, Oxford University Press, Oxford (1987).
16. N.AHMED and K.R.RAO, Orthogonal Transforms for Digital Signal Processing, Springer, Berlin (1975).
17. D.F.ELLIOTT and K.R.RAO, Fast Transforms, Academic Press, New York (1982).
18. R.N.BRACEWELL, The Fourier Transform and its Applications, 2nd Edn., McGraw-Hill, New York.
19. I.J.LOWE and R.E.NORBERG, Phys. Rev. 107, 46 (1957).
20. J.W.COOLEY and J.W.TUKEY, Math. Comput. 19, 297 (1965).
21. E.O.BRIGHAM, The Fast Fourier Transform, Prentice-Hall, Englewood Cliffs, NJ. (1974). (German translation: FFT, Schnelle Fourier-Transformation, Oldenbourg, München (1982).)


22. H.J.NUSSBAUMER, Fast Fourier Transform and Convolution Algorithms, 2nd Edn., Springer, Berlin (1982).
23. K.NIEDERDRENK, Die endliche Fourier- und Walsh-Transformation mit einer Einführung in die Bildverarbeitung, Vieweg, Braunschweig (1982).
24. G.E.RIVARD, IEEE Trans. ASSP-25, 250 (1977).
25. L.R.RABINER and B.GOLD, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ. (1975).
26. R.A.ROBERTS and C.T.MULLIS, Digital Signal Processing, Addison-Wesley, Reading, Mass. (1987).
27. A.BAX, Two-Dimensional Nuclear Magnetic Resonance in Liquids, Delft University Press, Delft (1982).
28. G.BODENHAUSEN, R.FREEMAN, R.NIEDERMEYER and D.L.TURNER, J. Magn. Reson. 26, 133 (1977).
29. J.KEELER and D.NEUHAUS, J. Magn. Reson. 63, 454 (1985).
30. K.NAGAYAMA, A.KUMAR, K.WÜTHRICH and R.R.ERNST, J. Magn. Reson. 40, 321 (1980).
31. A.E.DEROME, Modern NMR Techniques for Chemistry Research, Pergamon Press, Oxford (1987).
32. P.J.HORE, J. Magn. Reson. 62, 561 (1985).
33. J.C.J.BARNA, E.D.LAUE, M.R.MAYGER, J.SKILLING and S.J.P.WORRALL, J. Magn. Reson. 73, 69 (1987).
34. E.D.LAUE, K.O.B.POLLARD, J.SKILLING, J.STAUNTON and A.C.SUTKOWSKI, J. Magn. Reson. 72, 493 (1987).
35. D.ZIESSOW, On-line Rechner in der Chemie, de Gruyter, Berlin (1973).
36. G.U.YULE, Phil. Trans. Roy. Soc. London A226, 267 (1927).
37. J.A.CADZOW, Proc. IEEE 70, 907 (1982).
38. S.L.MARPLE, Digital Spectral Analysis, Prentice-Hall, Englewood Cliffs, NJ. (1987).
39. S.HAYKIN and S.KESLER in Nonlinear Methods of Spectral Analysis, S.HAYKIN (Ed.), Topics in Applied Physics, Vol. 34, 2nd Edn., Springer, Berlin (1983).
40. S.M.KAY and S.L.MARPLE, Proc. IEEE 69, 1380 (1981).
41. S.L.MARPLE, Proc. IEEE 70, 1238 (1982).
42. T.J.ULRYCH and M.OOE in Nonlinear Methods of Spectral Analysis, S.HAYKIN (Ed.), Topics in Applied Physics, Vol. 34, 2nd Edn., Springer, Berlin (1983).
43. J.TANG and J.R.NORRIS, Chem. Phys. Lett. 131, 252 (1986).
44. F.NI and H.A.SCHERAGA, J. Magn. Reson. 70, 506 (1986).
45. J.TANG and J.R.NORRIS, J. Magn. Reson. 69, 180 (1986).
46. G.WALKER, Proc. Roy. Soc. London A131, 518 (1931).
N.LEVINSON, J. Math. and Phys. 25, 261 (1947).
J.DURBIN, Rev. Inst. Int. Stat. 28, 233 (1960).
J.P.BURG, Proceedings of the 37th Meeting of the Society of Exploration Geophysicists, Oklahoma City (1967).
N.ANDERSON, Proc. IEEE 66, 1581 (1978).
L.J.FABER, Proc. IEEE 74, 1046 (1986).
P.F.FOUGERE, E.T.ZAWALICK and H.R.RADOSKI, Phys. Earth Planet. Interiors 12, 201 (1976).
W.Y.CHEN and G.R.STEGEN, J. Geophys. Res. 79, 3019 (1974).
D.N.SWINGLER, IEEE Trans. ASSP-28, 257 (1980).
D.N.SWINGLER, Proc. IEEE 67, 1368 (1979).
T.J.ULRYCH and R.W.CLAYTON, Phys. Earth Planet. Interiors 12, 188 (1976).
L.MARPLE, IEEE Trans. ASSP-28, 441 (1980).
C.L.LAWSON and R.J.HANSON, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ. (1974).
J.TANG and J.R.NORRIS, J. Chem. Phys. 84, 5210 (1986).
M.MORF, B.DICKINSON, T.KAILATH and A.VIEIRA, IEEE Trans. ASSP-25, 429 (1977).
S.L.MARPLE, IEEE Trans. ASSP-29, 62 (1981).
P.F.FOUGERE, J. Geophys. Res. 82, 1051 (1977).
C.H.CHEN, Nonlinear Maximum Entropy Spectral Analysis for Signal Recognition, Wiley, Chichester (1982).
H.AKAIKE, Ann. Inst. Stat. Math. 21, 243 (1969).
T.J.ULRYCH and T.N.BISHOP, Rev. Geophysics & Space Physics 13, 183 (1975).
H.AKAIKE, IEEE Trans. AC-19, 716 (1974).
E.PARZEN, IEEE Trans. AC-19, 723 (1974).
V.VITI, P.BARONE, L.GUIDONI and E.MASSARO, J. Magn. Reson. 67, 91 (1986).
T.E.LANDERS and R.T.LACOSS, IEEE Trans. GE-15, 26 (1977).
V.VITI, E.MASSARO, L.GUIDONI and P.BARONE, J. Magn. Reson. 70, 379 (1986).
R.G.CURRIE, Geophys. J. R. Astr. Soc. 38, 179 (1974).
S.M.KAY, IEEE Trans. ASSP-28, 292 (1980).
V.F.PISARENKO, Geophys. J. R. Astr. Soc. 33, 347 (1973).
V.C.KLEMA and A.J.LAUB, IEEE Trans. AC-25, 164 (1980).
G.A.KORN and T.M.KORN, Mathematical Handbook for Scientists and Engineers, 2nd Edn., McGraw-Hill, New York (1968).

H.BARKHIJJJSEN,R.DEBEER, W.M.M.J.B~VEE and D.VANORMONDT, J. Magn. Reson. 61, 465 (1985). P.A.BUSINGER and G.H.GOLUB, C-. ACM 12, 564 (1969). G.H.GOLUB and C.RF.IN%Xi, Numer. Math. 14, 403 (1970). B.S.GARB~%V, J.M.BoYLE, J.J.DONGARRA and C.B.MOLER, Matrix EigensrJem Routines EISPACK Guide Extension, Lecture Notes in Computer Science, Vol. 51, Springer, Berlin (1977). J.J.DONGARRA, C.B.MOLER, J.R.BuNcH and G.W.STEWART, LZNPACK User’s Guide, SIAM, Pbiiadelbbii (1979). H.GEsI& and‘J.J.Lku, J. Mkgn. Reson. 76, 183 (1988). .I.C.HCCH,.I. Magn. Reson. 64, 436 (1985). J.C.Hocn,]. Ma&r. Reson. 66, 580 (1986). D.N.SWINGLER, J. Geophys. Res. 84, 679 (1979). J.S.Lu.i and N.A.MALIK, IEEE Tnu~. m-29, 1215 (1981). M.KA~EH arid G.A.LIPF%RT. IEEE TIWW. ASSP-31, 438 (1983). J.TANC and J.R.NoRRls, J. ‘Magn. Reson. 78, 23 (i988): ’ R.KUMARESAN, L.L.SCHARP and A.K.SHAW, ZEEE Trans. ABSP-$4, 637 (1986). M.A.JENKINS and J.F.TRAIJB, Commun. ACM 15, 97 (1972). P.BARONE, L.GUIDOM, E.MASSAROand V.VITI, J. Magn. Reson. 73, 23 (1987). D.W.Tuprs and R.KUMARESAN, Prop. IEEE 68, 419 (1980). R.KUMARESAN and D.W.TUPTS, Prue. IEEE 68, 1354 (1980). D.W.TvPrs and R.KUMARESAN, proc. IEEE 70, 975 (1982). D.W.ms and R.KUMARESAN, IEEE Trans. ASSP-30, 671 (1982). R.KUMARESANand D.W.Twrs; IEEE Trans. ASSP-Jo; 833 (1982). H.BARKH~JIJSEN,R.DEBEER and D.VAN~RMONDT, .I. Magn. Rcson. 67, 371 (1986). M.A.DEL&, F:NI and G.C.LE~Y, J. Magn. Resin. 73;548 (1987). B.PORAT and BFRIEDLANDER, IEEE Trans. ASSP-34, 1336 (1986). H.BARKHIJTJSEN,R.DEBEER and D.VANGRMONDT, J. Magn. Reson. 73, 553 (1987). H.BARK~SEN, R.DEBEER and D.VANGRMO~, J. Magn. Reson. 64, 343 (1985). Y.BEPPIJ and LNINOMIYA, Comp. Phys. Commun. 23, 123 (1981). J.TANG, C.P.LIN, M.K.BOWMAN and J.R.NoRRIs, J. Sign. Reson. 62, 167 (1985). K.~TEIGLITZ and B.DICKINSON, IEEE Trans. ASSP-30, 984 (1982). B.SV~JGARD, Bm 7, 240 (1967). A.VANDEBOS, in Handbook of Mcasurem em Scknce,P.H.SYDENHAM (Ed.), Wiley, London (1982). B.NOORBEHESHT,H.LEE end D.R.ENZMANN, J. Magn. 
Reson. 73, 423 (1987). A.E.SCHUSSHEIM and D.COWBURN, J. Magn. Reson. 71, 371 (1987). J.H.MCCLELLAN, Proc. IEEE 70, 1029 (1982). D.E.DUDGEON and R.M.M ERSEREAU, Multidimensional Digital Signal Processing, Prentice-Hail, Englewoed Ciiffs, NJ. (1984). C.L.NIKIAS end M.R.RAGHWEER, IEEE ASSP Spectrum Estimation Workshop ZI, 213 (1983). R.KUMARESAN and D.W.Tus’rs, Rot. IEEE 69, 1515 (1981). Y.MANASSEN. G.NAVON and C.T.W.MOONEN, .I. Magn. Reson. 72, 551 (1987). E.T.JAYNES in proceedings of the First Syr&osium~ on Engirxering Ap&&ons of Random Function Theory and Probability, J.L.BOCDANOW and F.KOZIN (Eds.), Wiley, New York (1963). ETJAYNES, IEEE Trans. E-4, 227 (1968). E.T.JAYNES in The Maximum Entropy Form&m, R.D.LEVINE and M.TRIBUS (Eda.), MIT Press, Cambridge, Mass. (1979). C.E.SHANNON, Beu System Tech. J. 27, 379 (1948). C.E.SHANNON and W.WEAVER, The Mathematical Theory of Communication, University of Iilinois Press, Urbana (1962). E.T.JAYNJ%, Phys. RCV. 106, 620 (1957). A.I.KIUNCHIN, MathematicalFoundationsof Information Theory, Dover, New York (1957). R.V.L.HARTLEY. BefJ System Tech. .I: 7. 535 (1928). L.BRILLOUIN, S&rice &d Infbrmatio~ Thkry, &I E&L, Academic Press, New York (1962). K.G.DENBIGH and JSDENBIGH, Entrupy in relation to incomfiete knowlodge, Cambridge University Press, C&nbridgc (1985). M.TRIBUS in The Maximum Entropy Formalism, R.D.LEVINE and M.TRIBUS (Eds.), MIT Press, Cambridge, Mass. (1979). E.T.JAYNEs, Phys. Rw. 108, 171 (1957). B.R.FRIEDEN, J. Opt. Sot. Am. 62, 511 (1972). E.T.JAYNEs, Pmt. IEEE 70, 939 (1982). S.J.WERNECKE and L.R.d’AuDAlUO, IEEE Trans. C-26, 351 (1977). S.F.G~LL and G.J.DANIELL, Nature 272, 686 (1978). D.M.COLLINS, Nature 298, 49 (1982). RFLETCHER and C.M.REEVES, Computer J. 7, 149 (1964). J.%ILLING and R.K.BRYAN, Mon. Not. R. Astr. Sot. 211, 111 (1984). M.LAGALJ_Y and WFRANZ, Vorksungen iiber Vektorrechnung, 6th Fdn., Akademisehe Verkgsgeseiisebaft, Leipzig (1959).


131. W.B. Davenport and W.L. Root, An Introduction to the Theory of Random Signals and Noise, McGraw-Hill, New York (1958).
132. J.P. Burg, NATO Adv. Study Inst. on Signal Processing, Enschede, Netherlands (1968).
133. D.E. Smylie, G.K.C. Clarke and T.J. Ulrych, in Methods in Computational Physics, B. Alder, S. Fernbach and M. Rotenberg (Series Eds.), Vol. 13, Geophysics, B.A. Bolt (Vol. Ed.), Academic Press, New York (1973).
134. S.M. Kay, Modern Spectral Estimation, Prentice-Hall, Englewood Cliffs (1988).
135. N.R. Goodman, Ann. Math. Statist. 34, 152 (1963).
136. G. Szegő, Math. Z. 6, 167 (1920).
137. J.A. Edward and M.M. Fitelson, IEEE Trans. Inf. Theory IT-19, 232 (1973).
138. D. van Ormondt and K. Nederveen, Chem. Phys. Lett. 82, 443 (1981).
139. J.-P. Schott and J.H. McClellan, IEEE Trans. ASSP-32, 410 (1984).
140. H. Jeffreys, Theory of Probability, 3rd Edn., Oxford University Press, Oxford (1961).
141. G.J. Daniell and S.F. Gull, IEE Proc. 127E, 170 (1980).
142. J.D. O'Sullivan and M.M. Komesaroff, in Indirect Imaging, J.A. Roberts (Ed.), Cambridge University Press, Cambridge (1984).
143. A.K. Livesey and J. Skilling, Acta Cryst. A41, 113 (1985).
144. B.R. Frieden, Proc. IEEE 73, 1764 (1985).
145. B.R. Musicus, Proc. IEEE 74, 762 (1986).
146. J.E. Shore and R.W. Johnson, IEEE Trans. IT-26, 26 (1980).
147. R.W. Johnson and J.E. Shore, IEEE Trans. IT-29, 942 (1983).
148. Y. Tikochinsky, N.Z. Tishby and R.D. Levine, Phys. Rev. Lett. 52, 1357 (1984).
149. J. Skilling, Nature 309, 748 (1984).
150. S.F. Gull and J. Skilling, in Indirect Imaging, J.A. Roberts (Ed.), Cambridge University Press, Cambridge (1984).
151. E.D. Laue, J. Skilling, J. Staunton, S. Sibisi and R.G. Brereton, J. Magn. Reson. 62, 437 (1985).
152. K.M. Wright and P.S. Belton, Mol. Phys. 58, 485 (1986).
153. K. Blum, Density Matrix Theory and Applications, Plenum Press, New York (1981).
154. R.C. Tolman, The Principles of Statistical Mechanics, Oxford University Press, Oxford (1954).
155. S. Sibisi, J. Skilling, R.G. Brereton, E.D. Laue and J. Staunton, Nature 311, 446 (1984).
156. F. Ni, G.C. Levy and H.A. Scheraga, J. Magn. Reson. 66, 385 (1986).
157. E.D. Laue, J. Skilling and J. Staunton, J. Magn. Reson. 63, 418 (1985).
158. J.F. Martin, J. Magn. Reson. 65, 291 (1985).
159. D.L. Phillips, J. Assoc. Comput. Mach. 9, 84 (1962).
160. A. Rosenfeld and A.C. Kak, Digital Picture Processing, 2nd Edn., Vol. 1, Academic Press, New York (1982).
161. P.S. Belton and K.M. Wright, J. Magn. Reson. 66, 564 (1986).
162. B.R. Hunt, IEEE Trans. AC-17, 703 (1972).
163. P.J. Hore and G.J. Daniell, J. Magn. Reson. 69, 386 (1986).
164. E.D. Laue, M.R. Mayger, J. Skilling and J. Staunton, J. Magn. Reson. 68, 14 (1986).
165. D. Marion and K. Wüthrich, Biochem. Biophys. Res. Commun. 113, 967 (1983).
166. D.J. States, R.A. Haberkorn and D.J. Ruben, J. Magn. Reson. 48, 286 (1982).
167. A.F. Mehlkopf, D. Korbee, T.A. Tiggelman and R. Freeman, J. Magn. Reson. 58, 315 (1984).
168. J.C.J. Barna and E.D. Laue, J. Magn. Reson. 75, 384 (1987).
169. J.S. Lim and N.A. Malik, IEEE Trans. ASSP-29, 401 (1981).
170. J.W. Woods, IEEE Trans. IT-22, 552 (1976).
171. R.K. Bryan and J. Skilling, Mon. Not. R. Astr. Soc. 191, 69 (1980).
172. W.A. Gardner, Introduction to Random Processes with Applications to Signals and Systems, Macmillan, New York (1986).
173. E.I. Jury, Theory and Application of the z-Transform Method, Wiley, New York (1964).
174. J.A. Cadzow, Discrete-Time Systems, Prentice-Hall, Englewood Cliffs, N.J. (1973).
175. T. Malory, Le Morte d'Arthur, Penguin Books, Harmondsworth (1969).
176. X. Zhuang, E. Østevold and R.M. Haralick, IEEE Trans. ASSP-35, 208 (1987).
177. H. Gesmar and J.J. Led, J. Magn. Reson. 76, 575 (1988).
178. J.J. Led, H. Gesmar, K.R. Hejnaes and F.B. Hansen, J. Am. Chem. Soc. 110, 4165 (1988).
179. M. Shinnar and S.M. Eleff, J. Magn. Reson. 76, 200 (1988).
180. M.A. Delsuc and G.C. Levy, J. Magn. Reson. 76, 306 (1988).
181. S.J. Davies, C. Bauer, P.J. Hore and R. Freeman, J. Magn. Reson. 76, 476 (1988).
182. J.A. Högbom, Astron. Astrophys. Suppl. 15, 417 (1974).
183. A.J. Shaka, J. Keeler and R. Freeman, J. Magn. Reson. 56, 294 (1984).
184. G.J. Daniell and P.J. Hore, private communication.
185. D.G. Childers (Ed.), Modern Spectrum Analysis, IEEE Press, New York (1979); distributed by Wiley, New York.
186. S.B. Kesler (Ed.), Modern Spectrum Analysis, II, IEEE Press, New York (1986).

The last two books are collections of reprints. That edited by Childers includes references 48, 59, 64, 65, 66, 68, 125, 132, 137 and 170; that from Kesler contains references 36, 40, 56, 72, 83, 92, 108 and 169.