Progress in NMR Spectroscopy, Vol. 20, pp. 515-626, 1988. Printed in Great Britain. Copyright © 1988 Pergamon Press plc.
LINEAR PREDICTION AND MAXIMUM ENTROPY METHODS IN NMR SPECTROSCOPY

DAVID S. STEPHENSON

Institute of Organic Chemistry, University of Munich, Karlstrasse 23, 8000 Munich 2, Germany

(Received 28 July 1988)
CONTENTS

1. Introduction
   1.1. Overview
   1.2. Basic Concepts
2. NMR Experiment
3. Fourier Transformation
   3.1. Fourier Series
   3.2. The Fourier Transform
   3.3. The Two Domains
   3.4. Discrete Fourier Transformation
4. Attributes of Fourier Transformation
   4.1. Advantageous Properties of the DFT
   4.2. Disadvantageous Properties of the DFT
5. The Problem Restated
6. Modelling
   6.1. Autoregressive Methods and Linear Prediction
      6.1.1. The Spectrum of a Predictive Time Series
      6.1.2. Yule-Walker Equations
      6.1.3. The Levinson Recursion
      6.1.4. Burg's Modification
      6.1.5. Least-Squares Methods
      6.1.6. Prediction Order Selection
      6.1.7. Eigenanalysis Methods
      6.1.8. Summary of AR and LP Methods
   6.2. Functional Models
      6.2.1. Damped Exponential Methods
      6.2.2. Kumaresan and Tufts's Approach
      6.2.3. Avoiding Polynomial Rooting
      6.2.4. Summary of Functional Modelling
   6.3. Extension to Two Dimensions
7. Spectral and Temporal Discriminating Functions
   7.1. Maximum Entropy
      7.1.1. Probability Density of One Random Variable
      7.1.2. Frieden's Entropy
         7.1.2.1. Gull and Daniell's Implementation
         7.1.2.2. Skilling and Bryan's Implementation
      7.1.3. Probability Density of Several Random Variables
      7.1.4. Autocorrelation and Entropy Rate of a Stochastic Process
      7.1.5. Burg's Entropy
      7.1.6. Comments on Maximum Entropy
      7.1.7. Summary of Maximum Entropy Methods
   7.2. Maximum Smoothness
   7.3. Extension to Two Dimensions
8. Demonstrations
   8.1. One Dimensional
   8.2. Two Dimensional
Appendix A. Some Statistical Concepts relating to Time Series
Appendix B. The z Transform and the Complex Plane
Postscript
Acknowledgements
References
1. INTRODUCTION

1.1. Overview

This review has been written to provide the general NMR spectroscopist with an opportunity to understand several alternatives to direct Fourier Transformation (FT) which have recently found application in NMR.
These new methods all seem to do the impossible: they promise high resolution from very short records and excellent sensitivity from noisy signals. Indeed it would seem they fulfil a spectroscopist's wildest dreams. One disadvantage is, in some cases, a not insignificant increase in computation. However, if previous advances in computer power can be extrapolated into the future, this penalty should become negligible. A second drawback is a possible distrust by the analyst in the validity of the results. This is partly due to their novelty, but also because some methods which have been published have been shown to produce spurious spectral lines and intensity distributions which vary greatly from those expected. I hope to be able to clarify this situation.

The review was planned as an excursion into Maximum Entropy (ME), which has interested me ever since I read Sibisi's original paper.(1) But we have come a long way since then, and spectral estimation methods of a completely different kind also find their place here. This has two unfortunate consequences: firstly the review becomes longer, and secondly, once the floodgates are opened, a profusion of techniques culled from other disciplines pours in. It is necessary to set a limit, and I have chosen to treat only those methods which have appeared in the NMR literature, pointing out modifications which may be available.

The methods seem to me to fall into two classes: those which work conceptually in the frequency domain and seek a spectrum with particular general characteristics (such as a maximum of the configurational entropy or a minimum of the overall second derivative), and those which work conceptually in the time domain and seek the spectrum of a time series which has the same statistical properties as the measured time series. The division is not clear, particularly as regards the major contender in the field at present, namely Maximum Entropy, which unfortunately is represented, by name at least, in both classes, a fact which shall be explained later.

In many places I have sacrificed mathematical rigour to achieve a coherent story-line.
Indeed, I have gone even further. The bits I have found difficult to understand have been reinterpreted (particularly contour integration; surprisingly I'm not alone in not understanding contour integration(2)). The bits I haven't understood have been left out: I make no apology for this. I am not a mathematician and neither, probably, are you. But should you wish to fight your own way through the jungle of equations you will be admirably served by the reference list. I have chosen a tutorial approach for two reasons. In the first place the methods are completely novel in NMR and thus need a more comprehensive explanation than mere references to the original literature, although several excellent reviews do exist. Secondly the development of these methods has not yet reached such a stage that their limits can be recognised, and in this respect I feel that a basic instruction in the fundamentals
provides a good foundation for future developments. The whole paper can be considered as an up-to-date appendix to the excellent treatment by Lindon and Ferrige.(3)

Throughout the article I have nearly always employed small italic letters for functions of time x(t), and for sequences of values obtained from such functions by discretisation the notation {x_k} is used. The corresponding functions of frequency, derived by transformation, are denoted by capital letters X(ν), and the sequences formed by sampling at discrete frequencies as {X_j}. To be more specific: measured time data including noise are represented as {d_k}, which are composed of the data we are interested in {x_k} plus the unavoidable noise {n_k}. The autoregressive methods of Section 6.1 use another type of "noise" {e_k} which can be regarded either as a white noise sequence which "drives" an autoregressive process or as a vector whose elements are a measure of the quality of a linear prediction (the prediction error).

1.2. Basic Concepts
The principal purpose of doing an NMR experiment is to obtain information which will be of use in answering questions about the structure and dynamics of the substances under investigation. Because the primary data produced during a pulse experiment only rarely cast light on the relevant questions, it is usual to transform these data into a spectral form, since this is more easily assimilated. The concept of a spectrum as a plot of intensity against wavelength, frequency or energy(4) is well known, but a more substantial definition is necessary right at the beginning, since our chief concern is the second stage of the NMR experiment, namely the spectral analysis of the primary data after the pulsing is complete and before the interpretation is begun.

We can immediately distinguish between two complementary ways of considering the problem in front of us. Although the dividing line is vague, this division is related to the above-mentioned partition into two classes. The one takes as its outset the signal which comes from the NMR probe and treats this as a random signal; indeed it is such, being a single-valued function of time which conveys information and about which there is some degree of uncertainty before it occurs.(4) The uncertainty arises from noise in the system and from unwanted interference. The amplitude of this signal at a particular time is regarded as a random variable(5,6) (that is to say the amplitude may take any of a set of values and each value has associated with it a specific probability) and the whole signal is regarded as a stochastic process.(5,6) More precisely: any one measured signal is a member of an ensemble comprised of functions of time, and it is the ensemble which is the stochastic process. In dealing with such processes it is of particular interest to know at which frequencies the power occurs, and it is in this sense that the term spectrum is usually defined, namely as a function whose domain is frequency and whose range is the expectation value of the power of a stochastic process at a given frequency. Judging by the literature relating to the modelling methods of Section 6, this view seems to come naturally to those with an engineering background, who regard the data as being of primary importance and the spectrum merely a useful way of parameterising these data.

To those of us, however, brought up with the chemical applications of NMR, particularly those who remember CW instruments, the emphasis lies with the spectrum, which is the frequency distribution of phased intensity. The data here play a subordinate role, and we assume that there is one spectrum which can be regarded as a concrete entity, and that this spectrum sends its information to us in the form of an FID which is corrupted by noise having specific well-known characteristics. The aim of the analysis is to obtain this single spectrum, and the task is made difficult firstly because of the noise and secondly because not enough data may be available, i.e. the FID may be too short. The job before us is to choose from a multitude of suitable spectra just one which fulfils our purpose.

A bridgehead between these two attitudes may be the review by Robinson,(7) which provides valuable insights into many areas of physics and which is in a volume of the Proceedings of the IEEE devoted solely to spectral estimation. I must admit that both viewpoints are related by a transformation in consideration space: the first will certainly appeal to those well versed in classical signal processing. The second, however, avoids this necessity and is closely related to the field of image enhancement and picture processing.(8)
To emphasise this dichotomy, to demonstrate that data may be presented in several spectral forms none of which has an absolute right to be called "correct", and to provide a foretaste of what is to come, I present the ¹H spectra of orthodichlorobenzene at 100 MHz in Fig. 1. Figures 1a and 1d illustrate conventional Fourier transformation and display noise, low resolution and truncation artefacts. The spectra produced by two "modern" methods (Figs 1b, 1c, 1e and 1f) do not show these defects. You may by now have the uncomfortable feeling that this is going to be terribly complicated (you're right), so let us return to something more familiar.
2. NMR EXPERIMENT
The immediate goal of an NMR experiment is the spectrum corresponding to the spin system being studied. Of course the general structure of the spectrum is dependent on the type of experiment performed, but to exemplify the arguments presented in this article it suffices to deal initially with the simplest of all pulse sequences.

To understand the relationship between a spin system and its spectrum, consider the one-pulse experiment illustrated in Fig. 2. The equilibrium magnetisation M₀, initially along the z-axis (Fig. 2a), is brought parallel to the y′-axis of the rotating frame(9) by an r.f.
pulse (Fig. 2b). The individual spin isochromats(9) then start to fan out in this frame at speeds given by the differences between their characteristic Larmor frequencies and the frequency of the frame (Fig. 2c).
FIG. 2. Stages of evolution of the equilibrium magnetisation in a simple pulse experiment. In (a) the system is at equilibrium and is perturbed by a 90° pulse in (b), whilst (c) represents behaviour in the x′y′ plane after a short time interval.
Without loss of generality the response of the spin system to the pulse can be described in terms of these J isochromats, each with its own particular amplitude X_j and offset frequency ν_j Hz, and this is the spectral description in the frequency domain pictured in Fig. 3a, where the ordinate represents amplitude and the abscissa frequency. The amplitudes of the respective isochromats, portrayed as vertical bars in Fig. 3a, may be brought together to form the spectral vector {X_j}, j = 0, …, J−1.
FIG. 3. (a) The spectrum and (b) the FID corresponding to the experiment in Fig. 2.
This representation is what we seek, but not what we measure. To see this consider Fig. 2c again. In precessing about the z-axis the spin isochromats may give rise to voltages induced in coils whose axes are parallel to the x and y directions of the fixed laboratory frame. The amplitudes of these voltages oscillate at the Larmor frequencies mentioned above, i.e. in the radiofrequency range, and are reduced to audiofrequency signals by combining, in a Phase Sensitive Detector, with the frequency of rotation of the frame.(9,10) If one receiver coil is employed, a single analogue signal may be obtained as a function of time, and this signal (or FID) contains information about the intensities and offset frequencies of the spin isochromats. If two receiver coils orthogonal to each other were to be used (this conceptually simple method is almost(11) always replaced by the constructionally simpler method of employing one receiver coil and phase sensitive detecting with two frequencies shifted 90° with respect to each other), the two recordable analogue signals can be regarded as the real and imaginary parts of a complex signal. This quadrature mode of detection delivers also the sign of the offset frequency,(9) in addition to the intensity and
frequency information. (There are several different practical ways of realising quadrature detection;(12) the one chosen here is didactically the simplest.) After detection this continuous time signal x(t) is uniformly sampled at a rate of 1/Δt Hz to give rise to the discrete time series {x_k}, k = 0, …, K−1, which is stored as a vector whose elements may be real or complex depending on the mode of detection. Just such a real vector is displayed in Fig. 3b, where the ordinate again represents amplitude but the abscissa now stands for time.

To see how the spectrum and FID are linked together, let us consider how the induced voltage is produced. Each isochromat generates a sinusoidally varying signal given by y_j(t) = X_j exp(i2πν_j t), where quadrature detection is explicitly assumed, and thus the continuous FID, which is the sum over all isochromats, is

    V(t) = Σ_{j=0}^{J−1} X_j exp(i2πν_j t).

This signal is sampled at times t_k = kΔt, thus the elements of the measured FID vector are

    x_k = Σ_{j=0}^{J−1} X_j exp(i2πν_j kΔt).    (1)
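To make eqn (1) concrete, it can be transcribed almost literally into code. The two isochromats, their amplitudes and the sampling rate below are invented purely for illustration:

```python
# Sketch of eqn (1): synthesising a noise-free FID from J isochromats.
import cmath

def fid(amps, freqs, K, dt):
    """Return the K-point FID x_k = sum_j X_j exp(i 2 pi nu_j k dt)."""
    return [sum(X * cmath.exp(2j * cmath.pi * nu * k * dt)
                for X, nu in zip(amps, freqs))
            for k in range(K)]

# Two isochromats at 10 Hz and 25 Hz, sampled at 100 Hz for 64 points.
x = fid([1.0, 0.5], [10.0, 25.0], K=64, dt=0.01)
print(x[0])   # at t = 0 the FID is just the sum of the amplitudes: (1.5+0j)
```

Adding the noise vector {n_k} of Section 1.1 to such a synthetic FID gives the measured data {d_k}.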
If, furthermore, the frequencies of the isochromats are given by ν_j = jΔν, i.e. the spectrum represents a uniform sampling in frequency, eqn (1) is converted to

    x_k = Σ_{j=0}^{J−1} X_j exp(i2πjkΔtΔν).    (2)
This last equation shows that if the J elements of the spectral vector {X_j} are known, and if the sampling rates in time and frequency are also known, then the FID vector {x_k} is fully determined. To put it another way, if we know the spectrum we need no mental gymnastics to calculate the FID. Equation (2) acts a bit like a one-way street, and if our aim is to calculate the spectrum from the measured FID then several detours are necessary before we arrive at our destination. Our progress will be impeded by noise which falsifies the values x_k, and may be made more difficult still if the number K of observed data is small.

The standard approach to this spectral analysis problem is to use direct Fourier Transformation, a technique which was introduced into NMR in 1966,(13) and whose application is adequately described in references 3, 14 and 15. Because the equations of Fourier transformation are essential to an understanding of all spectral analysis techniques, the next section is devoted to their explanation.

3. FOURIER TRANSFORMATION

3.1. Fourier Series
It is well known that a complex periodic signal d(t), where I have explicitly used the notation "d" to emphasise that it is a measured signal including noise, may be represented as a series expansion using the set of orthogonal functions exp(i2πjt/T), j = 0, ±1, ±2, …, where T is the period in seconds.(16) Thus

    d(t) = Σ_{j=−∞}^{+∞} D_j exp(i2πjt/T)    (3)
with complex coefficients D_j is the Fourier series representation of the signal d(t). Since the functions exp(i2πjt/T) are orthogonal, i.e.

    T^{−1} ∫_{−T/2}^{+T/2} exp(−i2πjt/T) exp(i2πkt/T) dt = δ_{jk},    (4)

the coefficients are easily determined to be

    D_j = T^{−1} ∫_{−T/2}^{+T/2} d(t) exp(−i2πjt/T) dt.    (5)

FIG. 4. Discrete spectrum of a periodic function.

These coefficients D_j embody the distribution of intensity of the original signal amongst the basic frequency 1/T and its harmonics 2/T, 3/T, …, i.e. the set of complex numbers {D_j} constitutes a spectral representation of d(t). Plotting D_j against j (Fig. 4) shows that a periodic function has a discrete spectrum. Equally well, and this will be used later, a discrete function has a periodic spectrum.

3.2. The Fourier Transform
Whilst Fourier series are applicable only to the spectral analysis of periodic signals, we are primarily concerned with aperiodic signals. One way to deal with such functions is to consider them periodic with a period T which tends to infinity. Such a treatment leads naturally to the derivation of the Fourier transform.(16,17) Starting from eqn (5) and multiplying both sides by T we get

    T D_j = ∫_{−T/2}^{+T/2} d(t) exp(−i2πjt/T) dt.    (6)

Now the frequency corresponding to the basis function exp(−i2πjt/T) is j/T, and as T becomes arbitrarily large so the spacing between the frequencies j/T and (j+1)/T becomes arbitrarily small and the frequency approaches a continuous variable ν. Similarly the fundamental frequency 1/T becomes a differential frequency dν. Equation (6) thus becomes

    lim_{T→∞} T D_j = ∫_{−T/2}^{+T/2} d(t) exp(−i2πνt) dt,    (7)

and the right-hand side of eqn (7) is defined as the Fourier transform of d(t), denoted by D(ν), i.e.

    D(ν) = ∫_{−∞}^{+∞} d(t) exp(−i2πνt) dt.    (8)
Consider now the effect of letting T tend to infinity on eqn (3), which can be rewritten as

    d(t) = Σ_{j=−∞}^{+∞} T D_j exp(i2πjt/T) (1/T).    (9)

As T tends to infinity, the summation in eqn (9) becomes an integral, and T D_j and 1/T may be replaced by D(ν) and dν respectively, yielding the inverse transformation

    d(t) = ∫_{−∞}^{+∞} D(ν) exp(i2πνt) dν    (10)
which allows the original signal d(t) to be recovered from its spectrum D(ν). Since eqns (8) and (10) are rather unwieldy, it is convenient to employ the concept of a transformation operator F, where D(ν) = F₋[d(t)], with the subscript indicating the minus-i transform(18) of eqn (8), and d(t) = F₊[D(ν)], with the subscript now indicating the positive sign of the exponential in eqn (10).

In comparison to the periodic signal, the spectrum D(ν) of an aperiodic function is continuous.

3.3. The Two Domains
It was realised quite early on that the spectrum obtained in continuous wave mode and the response to a pulse were related to one another by a Fourier transform,(19) eqn (8), where D and d are continuous functions of frequency and time respectively. D and d form what is known as a transform pair, and although D and d may have different functional forms, each has the same information content.

There are manifold reasons for going to the trouble of applying eqn (8) to a function d(t). For instance the mathematical operations to which d(t) is to be subjected may be less formidable in the sister domain; a notable example of this is convolution, which reduces to mere multiplication in the transform domain. But the application with which we are concerned here is spectral analysis, the purpose being to extract from the signal d(t) information concerning the natural resonance frequencies of the spin system.

In practice the integration in eqn (8) is not carried out in an analogue manner on a continuous signal d(t). Instead, as described in Section 2, the FID is sampled and stored in the form of a vector {d_k}, and this finite sequence of values is the starting point for the spectral analysis. The practical application of this method was demonstrated first by Ernst and Anderson,(13) but only after a fast method of carrying out the transformation of a discrete time sequence had become available.(20)

3.4. Discrete Fourier Transformation
The discrete Fourier transform (DFT) {D_j} of a sequence {d_k} of K terms is defined by

    D_j = Σ_{k=0}^{K−1} d_k exp(−i2πjk/K).    (11)

Usually the sequence {d_k} represents K consecutive samples d(kΔt) of a continuous signal d(t), and the sequence {D_j} represents K consecutive samples D(jΔν) in the frequency domain. Equation (11) can be seen as a discrete analogue of the Fourier integral (8), and its derivation is superbly described by Brigham.(21)
An important attribute of the DFT depends on the cyclic properties of the basis functions exp(−i2πjk/K). Let j in eqn (11) lie outside the range 0 to K−1, i.e. j = mK + n, where m ≠ 0 and 0 ≤ n < K; then

    exp(−i2πjk/K) = exp(−i2πmKk/K) exp(−i2πnk/K)
                  = [exp(−i2π)]^{mk} exp(−i2πnk/K)
                  = exp(−i2πnk/K).    (12)
Substituting eqn (12) into eqn (11) we see that D_j = D_n. Thus there are only K independent values in the sequence {D_j}. This means that the spectrum of a discrete time series is periodic. The consequences of this will be dealt with in the next section.

Another property of the DFT depends on the orthogonality of the basis functions:

    K^{−1} Σ_{m=0}^{K−1} exp(−i2πjm/K) exp(i2πkm/K) = δ_{jk}.    (13)

Equations (11) and (13) together lead to the definition of an inverse transform

    d_k = K^{−1} Σ_{j=0}^{K−1} D_j exp(i2πjk/K).    (14)
If eqn (14) is compared with eqn (2) it can be seen that they have the same form; apart from a numerical factor, it only needs that we set Δν = 1/KΔt and they are identical. This demonstrates that the spectrum represented by the vector {D_j} may be obtained from the FID {d_k} by discrete Fourier transformation, and when we do that the frequency sampling rate is the inverse of the product KΔt. Another point to be noted here can be seen if we again apply the orthogonality relations, eqn (12), but this time to eqn (14). Then it becomes apparent that only K independent data d_k are defined by the transform, i.e. the data are periodic!

Although the two transforms defined by eqns (8) and (11) appear similar, there are some significant differences in interpretation. Of practical interest is the relationship between index j and frequency ν. Whereas ν is a continuous variable ranging from −∞ to +∞, j is discrete and only K values of D_j are defined by eqn (11). Since {D_j} is thus periodic we are free to choose the range of the K values, and whilst the values 0 ≤ j < K are convenient, the set −K/2 < j ≤ K/2 is equally valid and has the advantage that the index corresponds directly to a frequency ν between −ν_N and +ν_N, where ν_N is the Nyquist frequency. If the range 0 ≤ j < K is chosen, it should be understood that the values of D_j for j > K/2 correspond to continuous transformation with ν < 0; this is treated in greater detail by Brigham.
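This index-to-frequency bookkeeping can be captured in a small helper; the sampling rate used here is invented for the example:

```python
# Mapping the DFT index j (0 <= j < K) to a signed frequency in Hz:
# indices above K/2 correspond to negative frequencies.
def index_to_freq(j, K, rate):
    """rate is the sampling rate 1/dt in Hz; spacing is rate/K Hz."""
    if j > K // 2:
        j -= K            # wrap into the range -K/2 < j <= K/2
    return j * rate / K

K, rate = 8, 100          # 100 Hz sampling -> Nyquist frequency at 50 Hz
print(index_to_freq(1, K, rate))   # 12.5
print(index_to_freq(7, K, rate))   # -12.5 (alias of index -1)
```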
is no longer carried out by blind application of eqn (11);
this would require far too many computational operations. Instead a legion of so-called “fast” transforms (FFTs) is available which mightily reduce the effort involved and which are all based on a very simple idea. Equation (11) can be expressed in vector notation as
D - Wd
(15)
where D - { Dj}, W is a KxK matrix having elements W ‘k = [exp( - i2nlK))ik and d - {dk}. The trick behind the fast transforms is to express \nf as a product of matrices, each having
a large number
of zero elements.
This drastically
reduces
the number
of
non-trivial multiplications, and may be done if K fulfils certain criteria; for instance K = 2ⁿ or K = 4ⁿ, where n is an integer, leads to a radix-2 or radix-4 Cooley-Tukey(20) algorithm. This interesting technical subject is fully explained by Brigham and by Elliott and Rao.(22)

4. ATTRIBUTES OF FOURIER TRANSFORMATION
The foregoing treatment has described the DFT as applied classically to the conversion of an FID to a spectrum, and in the following I set out some of the consequences of the use of this method. These can be divided into advantageous and disadvantageous aspects.

4.1. Advantageous Properties of the DFT
One positive aspect of the DFT as implemented using a fast algorithm is its speed. A radix-2 computation of a K-point DFT requires 2K log₂K − 7K + 12 non-trivial real multiplications and 3K log₂K − 3K + 4 real additions, if complex multiplications are done with 4 real multiplications and 2 real additions.(22) A radix-4 computation is faster still, needing only (3K/2) log₂K − 5K + 8 multiplications and about (11K/4) log₂K real additions. Direct computation using eqn (11) needs 4(K−1)² non-trivial real multiplications and about the same number of real additions.

A further advantage of the DFT is its easy modification to allow the treatment of multidimensional data. The discussion so far has concerned one-dimensional (1D) data and therefore 1D spectra. By this is meant that the data vector {d_k} and the spectral vector {D_j} are both functions of one index only. But for over a decade the technique of two-dimensional (2D) NMR has been known, and its importance seems to grow from year to year.
Without going into the details of the methods, we can define them as experiments which acquire a signal as a function of two times t₁ and t₂; thus the data to be stored {d_kl} can be regarded as a function of two indices, and the spectral presentation {D_ij} of the information is also a function of two indices. Just as the 1D NMR spectrum is related to the 1D FID by eqn (8), so is the 2D spectrum given by a similar expression(15)

    D(ν₁,ν₂) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} d(t₁,t₂) exp(−i2πν₁t₁) exp(−i2πν₂t₂) dt₁ dt₂,

and this can be rewritten as

    D(ν₁,ν₂) = ∫_{−∞}^{+∞} [ ∫_{−∞}^{+∞} d(t₁,t₂) exp(−i2πν₂t₂) dt₂ ] exp(−i2πν₁t₁) dt₁,    (16)

so that the transformation can be completed in two steps: firstly an integration over t₂ and then over t₁.
The discrete analogue of eqn (16) is
    D_ij = Σ_{l=0}^{L−1} [ Σ_{k=0}^{K−1} d_lk exp(−i2πjk/K) ] exp(−i2πil/L),
and a 2D DFT can be regarded as two sequential sets of 1D DFTs: the first set operates on the columns of {d_kl} to produce an intermediate matrix, and the second set operates on the rows of this to yield the {D_ij}. This separation allows the implementation of fast algorithms based on standard 1D transforms.(21) Although other schemes claiming greater computational efficiency are possible,(23,24) they have not yet found application in NMR.
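The two-step evaluation just described — an inner set of 1D transforms over one index followed by an outer set over the other — can be sketched as follows, using the slow direct DFT for clarity and an invented 2×2 data matrix as the test case:

```python
# The 2D DFT as two sequential sets of 1D DFTs: transform the rows
# (k -> j, i.e. over t2), then the resulting columns (l -> i, over t1).
import cmath

def dft(d):
    K = len(d)
    return [sum(d[k] * cmath.exp(-2j * cmath.pi * j * k / K) for k in range(K))
            for j in range(K)]

def dft2(d):
    rows = [dft(row) for row in d]               # 1D DFTs over the second index
    out_cols = [dft(list(col)) for col in zip(*rows)]  # then over the first
    return [list(row) for row in zip(*out_cols)]

d = [[1, 0], [0, 0]]      # a 2x2 "data matrix": a spike at the origin
D = dft2(d)
print(D)                  # every element is (1+0j): a flat spectrum
```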
4.2. Disadvantageous Properties of the DFT
One disadvantage associated with using an FFT of any form is that K is restricted to a limited set of values. Thus the radix-2 algorithm can be implemented only when K is a power of 2, and in the radix-4 case K must be a power of 4. In general the fast algorithms rely on K being highly composite, and the simplest implementations are found when K is the power of an integer.(22) Although this restriction applies only if the DFT is determined using a fast algorithm, the sole advantage of the 1D DFT is its speed when the full factorizability of the transform matrix W is taken into account. Thus we may safely assume that this limitation is always with us, for to calculate the DFT using eqn (11) in a straightforward manner is not feasible. Since in many cases the measured sequence {d_k} may not contain such a special number of elements, we are forced to do one of two things in order to apply the DFT: either we delete some of the elements, thus rounding K down to a nice number, or we can add elements, thus rounding K up. Both alternatives, forced upon us by the DFT, are bad. In the former case measured data are ignored, and in the latter unmeasured data are included.

Another feature of the restraints on the length K of the sequence {d_k} can be seen by considering eqns (11) and (12), which demonstrate that the spectral vector {D_j} contains only K independent elements. Furthermore, comparison of eqn (14) with eqn (2) yields the frequency sampling rate or spectral point resolution as Δν = (KΔt)⁻¹ Hz. If we want a higher resolution in the frequency domain, which is equivalent to having more points, then we must interpolate. The easiest way to do this is to "zero-fill" in the time domain,(14,15) i.e. extend the data record to a larger number than K samples by appending zeros before transformation. Whilst this technique increases the point resolution in the spectrum, it fails to make the lines any sharper, because the frequency resolution is inversely proportional to the time duration of the signal. It can, however, clear up uncertainty as to how many lines are present in the spectrum, as demonstrated in Fig. 5.
FIG. 5. Improvement in spectral appearance achieved by zero-filling. Transformation using a 16-point FT (a) leaves it uncertain how many lines are present; by zero-filling to 32 points (b) or to 64 points (c) the indeterminacy is cleared up. The real parts of the spectra are shown.
In general zero-filling is applied to improve the appearance of NMR spectra at very little extra computational cost, although care must be taken when the FID has not decayed to zero before the end of the measurement period. This topic shall be considered next.

A major handicap of the DFT is encountered in the treatment of short signals. Normally in NMR the FID decays exponentially with a time constant T₂, so that if the sequence {d_k} is acquired over a sufficiently long time the information-bearing part of the signal will have vanished. Should this not be so, then the spectrum obtained by DFT will be distorted. To see this more clearly we must go back to the continuous FT given by eqn (8). The integral in eqn (8) runs from −∞ to +∞, but our measurement of the signal
which starts at t = 0 must necessarily cease at a finite time t = t₀, so that the measured signal is the measurable signal multiplied by a rectangle function. Take as an example the complex exponential whose real part is illustrated in Fig. 6a. The corresponding transform using eqn (8) is the delta function of Fig. 6b. If the signal d(t) of Fig. 6a is multiplied by the rectangle function r(t) of Fig. 6c, then

    F₋[d(t) r(t)] = F₋[d(t)] ⊗ F₋[r(t)] = D(ν) ⊗ R(ν),

where the notation D(ν) ⊗ R(ν) is defined by

    D(ν) ⊗ R(ν) = ∫_{−∞}^{+∞} D(f) R(ν − f) df.
D (~1 A h
b
r(t)
R (~1
A
0
h
tr
l- t
d
C
d’(t)
D’(v)
A
A
z- t e
I
A ”
u
f
FIG.6. Time domain truncation and its effects on the continuous Fourier transform. A time-unlimit+ complex exponential (a) yields a delta function b) on transformation. A rectang!e (c) transforms to a “sine 6. function (d). Multiplying (a) and (c) gives (e) w tch transforms to Q the convolutton of (b) and (d).
Thus

    D′(ν) = F₋[d′(t)] = D(ν) ⊗ R(ν).

The transform R(ν) of the rectangle function, whose real part is shown in Fig. 6d, can be seen to have significant sidelobes, and therefore its convolution with the delta function of Fig. 6b leads to a continuous spectrum D′(ν) (Fig. 6f) which also displays sidelobes.

Before returning to the DFT it is instructive to investigate this distortion a little more closely. Although the measured signal runs only from t = 0 to t = t₀, the integral in eqn (8) runs relentlessly from −∞ to +∞, and the sidelobes in Fig. 6f can be seen as corrections to the original frequency of Fig. 6b necessary to accommodate the discontinuity at t₀. From this standpoint it can be easily understood why the normal NMR FID, which decays to zero within the measurement time t₀ and thus shows no such discontinuity in the time domain, is not distorted on transformation.
FIG. 7. (a) Fourier transformation of a measured data vector containing an integer number of periods of a frequency component leads to exact specification of this component. (b) Should this not be the case, "leakage" results and the neighbouring components become non-zero.
The translation of this effect to the DFT is relatively simple. Equation (11) forms the basis by which a discrete signal can be analysed into its Fourier components exp(-i2πk/K) and their harmonics. If the K measured values of the signal cover an integral number of periods, then the harmonic analysis using eqn (11) leads to one Dj being non-zero and the rest, corresponding to the zero crossing points of the continuous Fourier transform, being zero (Fig. 7a). However, if this is not the case there is no single Dj which describes the signal exactly, and the intensity is shared or "leaked" into neighbouring Dj's (Fig. 7b). Zero-filling won't help here; indeed it makes matters worse, since the spectrum approaches the continuous transform (Fig. 6f). The only remedy within the boundaries set by the DFT is to modify the original FID in such a way that it gradually goes to zero at t = t0. This operation of apodisation(14,15) or windowing,(16,17) which is necessary to reduce the sidelobes, always broadens the signals, sometimes significantly, and thus lowers the resolution of the spectrum. It was to get around this intrinsic problem of short signals that the "modern" methods of spectral estimation were introduced, and it is to be expected that they will have their maximum impact in this area. This is particularly true in 2D spectroscopy, where the length of an experiment is directly linked to the number of points acquired in t1.

The spectral phase presents yet another problem. In the one-dimensional case the discrete transform of {dk} is the vector {Dj} whose elements are, in general, complex with non-zero imaginary components. The only exception is if the time data is Hermitian about t = 0, i.e. unless dk = d*-k, or unless dk = d*K-k, where K is the vector length used in the DFT. This can easily be seen from eqn (14) by setting Dj real. Since this latter situation is hardly ever encountered in practice, the normal result of applying the 1D DFT to an FID is to obtain a complex spectrum whose real and imaginary parts do not correspond to the absorption and dispersion components of the spin system.
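The on-grid/off-grid behaviour of Fig. 7 can be verified in a few lines. The sketch below is my own illustration (the signal and vector sizes are arbitrary assumptions): a component covering an integral number of periods gives a single non-zero Dj, while a half-bin offset leaks intensity into every point.

```python
import numpy as np

K, dt = 64, 0.01
t = np.arange(K) * dt
on_grid = np.exp(2j * np.pi * (5 / (K * dt)) * t)     # 5 whole periods
off_grid = np.exp(2j * np.pi * (5.5 / (K * dt)) * t)  # 5.5 periods

D_on = np.abs(np.fft.fft(on_grid))
D_off = np.abs(np.fft.fft(off_grid))

print(np.sum(D_on > 1e-9))   # 1:  only D_5 is non-zero
print(np.sum(D_off > 1e-9))  # 64: intensity leaked into every D_j
```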
This hurdle can be overcome in two ways. Firstly, the real and imaginary parts may be combined to give a power spectrum or its square root; this is very easy, but leads to a spectrum which is always positive and in which either the intensities are distorted (power spectrum) or the line shapes are distorted (magnitude spectrum). Secondly, the real and imaginary components may be combined using two phase parameters φ0 and φ1 to produce a corrected spectrum whose real part is pure absorption and whose imaginary part pure dispersion. For 1D spectra this presents no insurmountable barrier: either the two phase parameters may be supplied by the spectroscopist using an interactive display, or an automatic phasing routine can be used. But for the 2D spectrum a novelty is introduced, namely that it may be impossible to select phase parameters such that the lines are everywhere in pure absorption. This arises whenever the signal is phase modulated in t1,(27) and the result is the so-called "phase-twist" lineshape(28) caused by the superposition of double absorption and double dispersion lines. In coherence transfer experiments one way to provide quadrature detection in ν1 is to add the results of two experiments which differ in the phases of some of their pulses,(29) thus selecting either the p- or n-type peaks.(30) This addition certainly allows discrimination between positive and negative frequencies in ν1, but brings about the abovementioned phase-twist, which has traditionally been counteracted by displaying the spectrum in magnitude mode (square root of the power spectrum). This has two disadvantages. Firstly, the widespreading dispersion components reduce the spectral resolution, although clever application of weighting functions can ameliorate this.(31) Secondly, the information inherent in antiphase multiplets is inexorably lost in the magnitude mode.(32) Indeed, if the point resolution is low even the correct multiplicity is lost. This can be overcome if the results of the two experiments are not added but kept separate(29) and are subjected to a "hypercomplex" Fourier transformation,(25) which yields a set of four spectral components (real-real, real-imaginary, imaginary-real, imaginary-imaginary). The final phasing of the 2D spectrum requires the selection of the correct linear combination of these four components, which is no easy task(27) and is usually achieved by phasing slices through the spectrum.(29) Even when this is accomplished and the phase-twist is removed, the lines are not all in the convenient double absorption mode. Specifically, in a COSY experiment for example, one may choose the cross peaks to be in double absorption, in which case the diagonal peaks are in double dispersion form, or one can make the opposite choice. The spectrum still has peaks with dispersion characteristics, with all their disadvantages. At least one of the "modern" methods can be used to avoid the hypercomplex transform and produce pure absorption lineshapes everywhere with automatically correct phases, as we shall see in Section 7.

A further prerequisite of using eqn (11) to determine the spectrum {Dj} is that all the elements of {dk} must have been measured and are available. The vector {dk} corresponds to a uniform sampling of the FID at time intervals Δt. Should any of the elements not be measured then eqn (11) cannot be applied; we cannot just set these elements to zero. This is usually no problem in 1D NMR: the FID is there anyway and we might just as well measure all the dk.
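The two-parameter correction mentioned above can be sketched as follows. This is a minimal illustration of mine, not the paper's procedure: the sign and normalisation conventions for φ0 and φ1 vary between implementations, and the test signal and its phase error are arbitrary assumptions.

```python
import numpy as np

def phase_correct(spec, phi0, phi1):
    """Apply zero- and first-order phase correction across the spectrum."""
    j = np.arange(len(spec))
    return spec * np.exp(-1j * (phi0 + phi1 * j / len(spec)))

# Build a spectrum with a known, artificial phase error and remove it:
t = np.arange(256) * 0.001
fid = np.exp(-t / 0.05) * np.exp(2j * np.pi * 40 * t)
ref = np.fft.fft(fid)
distorted = ref * np.exp(1j * (0.2 + 1.3 * np.arange(256) / 256))
print(np.allclose(phase_correct(distorted, 0.2, 1.3), ref))  # True
```

In practice φ0 and φ1 are of course not known in advance; they are found interactively or by an automatic routine, as the text says.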
But for 2D NMR the number of measurements in t1 should be kept as low as possible to reduce the length of the experiment. Application of the DFT means that the resolution in ν1 is limited by the restriction that the samples be measured at uniform time intervals. If we remove this restriction, then the resolution is limited by the maximum value of t1 used, and one could conceive an experiment where the successive values of t1 follow another rule than mere contiguous integer multiples of Δt.(33)
Even in 1D NMR the inclusion of all the elements of an FID might be detrimental to the spectral form. For instance, if the first few points are corrupted by pulse breakthrough or acoustic ringing, then the spectrum usually shows severe "baseline roll". Setting these initial defective points to zero rarely alleviates the problem, and ignoring them by starting the measurement of the FID later introduces extreme phase problems. These difficulties can be avoided by using better conceived spectral analysis methods.

Finally, the data are usually corrupted by noise. Thus the data {dk} which are measured consist of a part {xk} which comes from the spin system, the signal, plus a part {nk} which comes from the electronic components, the noise. (This noise need not be additive; it could be multiplicative noise or phase noise, but this introduces unwarranted complications.) The distinction between signal and noise lies physically in the desirability of the signal and the undesirability of the noise, and mathematically in the fact that their autocorrelation functions(35) are different. Because the Fourier transform is a linear operation, so that F[x + n] = F[x] + F[n], the noise necessarily appears in the frequency domain after transformation, and any method designed to suppress the noise would distort the signal.
5. THE PROBLEM RESTATED

We have seen that many obstacles are encountered when a spectrum {Dj} is obtained by discrete Fourier transformation of the data vector {dk}. These obstacles cannot be surmounted without sacrificing something valuable, be it resolution or sensitivity. One way out of this dilemma is to attempt to counteract the two main evils, namely insufficient data and noisy data, by accepting their existence and trying to determine the underlying structure of the data by a predictive technique whose main aim is to extend the length of the "data" before transformation, but which should also be able to emphasise the important spectral components and suppress the noise. This approach is explained in the next section and relates to the statistical methods mentioned in the introduction which have the logical direction "time-series → spectrum".
An alternative approach, whose logical direction is exactly the opposite, is made possible if we free ourselves of the notion of having to work deductively which is imposed on us by applying eqn (11), for this equation says nothing else than that the spectrum can be deduced from the data: leaving no room for doubt about whether the measured dk are noisy or not, leaving no room for doubt about values of dk that might have been measured, and leaving no room whatsoever to include valid information in the structure of Dj. Equation (11) is too rigid: it makes no distinction between signal and noise, and it treats the artificial boundaries imposed on the data by the limitations of time as an inherent component of the data themselves. What is needed is a certain sloppiness, a form of give-and-take between spectrum and data. This liberalization cannot go too far, however, otherwise chaos would be the sole outcome of any analysis. Thus a strong principle is needed to hold the spectrum in check even when the data constraints are relaxed. Several alternatives are conceivable and are discussed in Section 7.
6. MODELLING
This section is based on a very unpretentious idea which is perhaps best illustrated by a simple example. A signal represented by a sine wave x(t) = sin(ωt) is measured at K intervals of time to give the data vector {xk}, k = 0, ..., K - 1 (Fig. 8). We have seen that if the data vector is not very long then {Xj}, the Fourier transform of {xk}, exhibits artefacts
(Fig. 6f) caused by the truncation of {xk}. These artefacts could be reduced if the data vector could be extended in a sensible way. A reasonable question is thus: 'is it possible to estimate or predict the value of xK given all the previous values of {xk}?'. The answer for a single sine wave is obviously 'yes', and a good "guesstimate" would be the point marked "•" in Fig. 8. Obviously, if this works for the Kth point we could extend the procedure to the (K + 1)th point and so on, until the new data vector has sufficient elements to allow Fourier transformation to a spectrum devoid of artefacts. Indeed, if we use this method then the Fourier transformation step is unnecessary, since the extension of the data series has used information inherent in the original K elements, and the extended points are redundant. We can just as well use the information obtained from the original elements to form a spectrum.
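For the noiseless sine wave of Fig. 8 the prediction idea can be made concrete: sin(ωΔt k) obeys the exact two-term recurrence x_k = 2cos(ωΔt) x_{k-1} - x_{k-2}, so a single coefficient suffices to extend the record indefinitely. The sketch below is my own illustration with assumed parameter values.

```python
import numpy as np

dt, w = 0.05, 2 * np.pi * 1.3              # assumed sampling step, frequency
x = list(np.sin(w * dt * np.arange(20)))   # 20 "measured" points
c = 2 * np.cos(w * dt)                     # the only parameter needed

for k in range(20, 200):                   # predict the next 180 points
    x.append(c * x[k - 1] - x[k - 2])

print(np.allclose(x, np.sin(w * dt * np.arange(200))))  # True
```

With noise present the coefficient would of course have to be estimated rather than written down, which is exactly the problem the following sections address.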
FIG. 8. A sine wave measured at the points marked "·". Is it possible to predict the point marked "•"?
The salient point to be made here is that we may have enough information, even in a very short time series, to allow good estimation of the underlying frequencies. The problem is to extract and use this information. This leads us to a discussion of how to represent a time series so that its information content may be parameterised. The concept behind this approach is that a particular time series may be considered as a representative of a class of functions, the model, together with a set of parameters which have to be estimated. For instance, the model in Fig. 8 is obviously one sine wave, and the information contained in the data vector {xk} may be concentrated into the single parameter ω. If we are to treat a time series correctly the model must be chosen with care, for otherwise the number of parameters becomes large and some of the parameters may become spurious, particularly if we have to contend with noise. There seem to be two main paths we can follow: either we could treat each sequential datum as being linked in some way to its neighbours, which leads to the concept of autoregressive modelling, or the underlying structure of the data could be introduced at the very start, which leads to the less flexible but better tailored approach of functional modelling.

6.1. Autoregressive Methods and Linear Prediction
Since the time series we are dealing with are deterministic, it seems reasonable to suppose that any one measured point xk may be expressed as a function of the preceding points:

x_k = f(x_{k-1}, x_{k-2}, \ldots) ,   (16)

where the notation x_k is used to emphasise that I am considering, for the moment, time
series with no additive observation noise. The simplest function which can be chosen is linear in the x_{k-m}; thus we could predict x_k by the relationship
x_k = -\sum_{m=1}^{M} a_m x_{k-m} + e_k ,   (17)
where e_k (sometimes called the innovation) is the kth element of a zero-mean white noise vector {e_k}, and the minus sign is chosen for later convenience. This method of treating a time series may be called Linear Prediction (LP) for obvious reasons, and the idea of representing a time series by eqn (17) goes back to Yule,(36) who reasoned that the periodicities hidden in the time series are disturbed by noise, and that if their number M is known it suffices to determine the coefficients a_m; the spectrum may then be obtained as shown in the next section. The method used to solve for the coefficients is regression analysis, and because the series is regressed onto its own past, the procedure is alternatively known as an Autoregressive (AR) model. It forms one part of a more general treatment in which the time series is given by

x_k = -\sum_{m=1}^{M} a_m x_{k-m} + \sum_{n=0}^{N} b_n e_{k-n} ,   (18)
where the original AR model of eqn (17) has been extended to include a second summation over the noise vector. The second summation can be regarded as an averaging of the individual elements of {e_k} and is thus called a Moving Average (MA) model. The whole model of eqn (18) is thus known by the acronym ARMA. I shall not, however, consider MA models or the more general ARMA treatment. Firstly, they have not yet been used in NMR; secondly, the MA approach is not too well suited to the determination of spectra consisting of sharp lines,(37) as shall become evident in the next section; and thirdly, the algorithms for AR parameter estimation are considerably more efficient than those for MA or ARMA models.

Let us go back to the AR model and consider eqn (17) more closely. The system described by this equation can be portrayed by the diagram in Fig. 9, where the kth element of the series is given as the sum of the innovation e_k and the preceding M elements of the series multiplied by the appropriate weights -a_m. It is obvious that we are dealing with a system involving feedback, including feedback of the previous innovations, and, as pointed out by Yule(36) and implicit in his derivation, this is not the same as treating the series as consisting of a set of natural frequencies perturbed by additive white noise.
FIG. 9. The feedback scheme of an AR process. The kth element of a time series is made up of the sum of the innovation e_k and the previous M elements of the series x_{k-m} multiplied by the appropriate coefficients -a_m.
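The feedback loop of Fig. 9 is straightforward to simulate. In the sketch below (my own illustration; the pole radius and angle are assumed values) the recursion is driven by a single unit impulse rather than white noise, so the output is the deterministic impulse response, whose transform is the all-pole spectrum 1/A(z) discussed in the next section:

```python
import numpy as np

def ar_filter(a, e):
    """x_k = -sum_{m=1}^{M} a[m-1]*x_{k-m} + e_k  (the loop of Fig. 9)."""
    x = np.zeros(len(e))
    for k in range(len(e)):
        x[k] = e[k] - sum(a[m - 1] * x[k - m]
                          for m in range(1, len(a) + 1) if k - m >= 0)
    return x

# A complex-conjugate pole pair at r*exp(+/- i*theta) corresponds to
# a_1 = -2 r cos(theta), a_2 = r**2:
r, theta = 0.98, 0.6
e = np.zeros(2000)
e[0] = 1.0                                  # unit impulse as the drive
x = ar_filter([-2 * r * np.cos(theta), r ** 2], e)
peak = np.argmax(np.abs(np.fft.rfft(x)))
print(abs(peak / 2000 - theta / (2 * np.pi)) < 0.005)  # True: peak near theta
```

Replacing the impulse by a white noise sequence turns this into a realisation of the AR process proper.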
This is an important point to bear in mind, for the innovation e_k in the AR model is part of the model and cannot be interpreted as additive noise. The aim, though, of spectral
analysis is to extract the deterministic part of the signal, and whilst such a signal may be well represented by a model having relatively few parameters a_m and a large random component {e_k}, an alternative with a greater number of parameters and a smaller noise component is obviously to be preferred. It is in this sense that we employ Linear Prediction: a good predictor is one which predicts x_k as exactly as possible, which means that e_k should be a minimum. This is indeed the basis of several treatments (for instance Burg's method), but this again takes no account of the actual presence of observation noise. The derivation of the AR equations also rests heavily on this assumption that no observation noise is present, so that the presence of such noise will reduce the applicability of these equations. We shall see in Section 6.1.8 that the correct treatment of observation noise leads to an ARMA model which, as pointed out before, greatly increases the computational effort. Be this as it may, there are two problems to be solved in using the AR method. On the one hand the coefficients a_m have to be found, and then they must be used to calculate a spectrum. Since there are many distinct ways of determining the coefficients but only a few basic formulae for their conversion into a spectrum, I shall treat the latter problem first.

6.1.1. The Spectrum of a Predictive Time Series. The discrete measurement of the elements of the time signal {x_k} enforces a periodicity on the continuous spectrum X(ν) and allows a series expansion of the spectrum in orthogonal functions exp(-i2πkνΔt) similar to that of eqn (3). Thus

X(\nu) = \sum_{k=-\infty}^{+\infty} x_k \exp(-i 2\pi k \nu \Delta t) ,   (19)
where 1/Δt is the frequency period in Hz. This relationship assumes that the measured values of the time series are available for the whole range -∞ ≤ k ≤ +∞, a condition which is obviously not fulfilled; but if the elements of {x_k} are related by eqn (17) and if the coefficients are known, then it is a simple matter to predict the unmeasured values x_k and to use these instead. This method of data extension could certainly be used, but it is equivalent(42) to a more elegant method which allows one to analyse its drawbacks with greater rigour. It is expedient at this stage to introduce the compact z transform notation (see Appendix B); by setting z = exp(i2πνΔt) and substituting eqn (17) into eqn (19) we get
X(z) = -\sum_{k=-\infty}^{+\infty} \sum_{m=1}^{M} a_m x_{k-m} z^{-k} + \sum_{k=-\infty}^{+\infty} e_k z^{-k}
     = -\sum_{m=1}^{M} a_m z^{-m} X(z) + E(z) ,

and thus, by rearranging and setting a_0 = 1,

X(z) = E(z) \Big/ \sum_{m=0}^{M} a_m z^{-m} ,   (20)

so that the spectrum of the time series is the spectrum of the noise E(z) divided by a polynomial in z which must be evaluated on the unit circle, since z = exp(i2πνΔt). Now eqn (20) has two important features: firstly, the noise appears exclusively in the numerator,
so that at first glance it seems that we have accomplished
a perfect separation of signal and noise. Indeed, spectra estimated by such methods do show an almost noiseless baseline. This is because the noise spectrum in eqn (20) corresponds to the innovation {e_k}, which is assumed white; thus the spectrum E(z) may be replaced by a constant. This should be borne in mind if the intensities of different spectra are to be compared; we shall return to this topic later when we consider the merits of these methods. The second feature of eqn (20) is the polynomial in z which appears in the denominator. This polynomial has at most M roots scattered across the complex z plane, and these correspond to the singularities or "poles" of X(z), where this function becomes infinite. This aspect of the AR spectra leads to the alternative name of an "all-pole" model, and also accounts for the success of this approach in the representation of sharp-lined spectra. The poles are important from another point of view: namely, the spectrum X(z) is determined on the unit circle in the z plane, so that for stability (see Appendix B) the poles must all lie within the unit circle. It can be shown that an MA process has a spectrum defined by an expression similar to eqn (20) but having the polynomial in the numerator, and thus the MA approach leads to an "all-zero" model. Such a model is, as pointed out previously, not well adapted to the requirements of high-resolution NMR, where the spectra contain sharp peaks, although their effect can be taken care of by drastically increasing the model order M. Further discussion of these points and the ARMA models in general can be found in the reviews by Cadzow,(37) Kay and Marple,(26) and Ulrych and Ooe,(27) as well as the recent book by Marple.(38)
relating to spectral analysis the emphasis is
always on the estimation of the power spectrum, which is easily obtained from eqn (20) as
so that the function of interest is lX(v)[*
02 P(v)
-
lJv~12
-
,
M
a, II
exp( - i2nmvAt)
(21)
2
m=O
where σ² is the noise variance, or expectation value of the power of the noise. The standard derivation of eqn (21) can be found in any of the above works, which deal exclusively with power spectra. The NMR spectroscopist, however, chooses the power spectrum only under protest: on the one hand important information is lost by ignoring the phase relationships of the various signals, whilst on the other hand the intensity distribution is distorted. It is thus of interest to see if a time series representable by eqn (17) may be converted to a phased spectrum. In general one could simply apply eqn (20), were it not for the fact that in its derivation the implicit assumption has been made that the time series extends from -∞ to +∞. We, however, are concerned solely with causal signals for which x_k = 0, k < 0. These problems are not so serious for the determination of power spectra, since an equivalent representation of P(ν) is the Fourier transform of the autocorrelation function of the data, so that truncation of these data to negative time merely introduces end effects.
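Equation (21) can be evaluated directly once the coefficients are known. The sketch below is my own illustration with assumed pole positions, not data from the text; it places a single pole pair just inside the unit circle and confirms that the all-pole power spectrum peaks at the corresponding frequency.

```python
import numpy as np

def ar_power_spectrum(a, sigma2, dt, freqs):
    """P(nu) = sigma^2 / |sum_m a_m exp(-i 2 pi m nu dt)|^2   (eqn 21)."""
    m = np.arange(len(a))
    A = np.array([np.sum(a * np.exp(-2j * np.pi * m * f * dt)) for f in freqs])
    return sigma2 / np.abs(A) ** 2

dt, f0, r = 0.001, 125.0, 0.95                      # assumed values
th = 2 * np.pi * f0 * dt
a = np.array([1.0, -2 * r * np.cos(th), r ** 2])    # pole pair near f0
freqs = np.linspace(0.0, 500.0, 1001)
P = ar_power_spectrum(a, 1.0, dt, freqs)
print(abs(freqs[np.argmax(P)] - f0) < 1.0)  # True: sharp line at ~125 Hz
```

Moving r closer to 1 sharpens the line, which is the all-pole behaviour the text describes.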
The circumvention of these difficulties for phased spectra has, however, been elegantly described by Tang and Norris,(42) and further elaborated by Ni and Scheraga,(43) whose treatments I shall now paraphrase. If the time series is causal the spectrum is given by

X(z) = \sum_{k=0}^{+\infty} x_k z^{-k} ,   (22)
where again z = exp(i2πνΔt). In order to allow for the separate treatment of the initial and subsequent elements of {x_k} we can split the summation in eqn (22) into two parts:

X(z) = \sum_{k=0}^{N-1} x_k z^{-k} + \sum_{k=N}^{+\infty} x_k z^{-k} .   (23)
Thus the first part consists of a sum over elements which have been measured, and the second part consists of a sum over elements which may be related to previous elements using eqn (17). This sets a restriction on the value of N, and it is easy to see that K ≥ N ≥ M. Tang and Norris set N = M, whereas Ni and Scheraga suggest setting N = K. Substituting eqn (17) into eqn (23) gives

X(z) = \sum_{k=0}^{N-1} x_k z^{-k} - \sum_{k=N}^{+\infty} \sum_{m=1}^{M} a_m x_{k-m} z^{-k} + \sum_{k=N}^{+\infty} e_k z^{-k}

     = \sum_{k=0}^{N-1} x_k z^{-k} - \sum_{m=1}^{M} a_m z^{-m} \left[ X(z) - \sum_{k=0}^{N-1-m} x_k z^{-k} \right] + E'(z) ,

where E'(z) is the phase-shifted spectrum of the noise. Setting a_0 = 1 as before and rearranging then produces

X(z) \sum_{m=0}^{M} a_m z^{-m} = \sum_{m=0}^{M} a_m z^{-m} \sum_{k=0}^{N-1-m} x_k z^{-k} + E'(z) .   (24)

It is at this point that the two treatments diverge. If we follow Tang and Norris and set N = M, the order of summation in eqn (24) may be so changed as to yield

X(z) \sum_{m=0}^{M} a_m z^{-m} = \sum_{m=0}^{M-1} z^{-m} \sum_{n=0}^{m} a_n x_{m-n} + E'(z) ,   (25)

i.e.

X(z) = \left[ C(z) + E'(z) \right] / A(z) ,   (26)

where

A(z) = \sum_{m=0}^{M} a_m z^{-m} ,  C(z) = \sum_{m=0}^{M-1} c_m z^{-m} ,  and  c_m = \sum_{n=0}^{m} a_n x_{m-n} ,
with z = exp(i2πνΔt). Comparing eqn (26) with eqn (20) we see that the causality has introduced the corrective term C(z) into the numerator and has phase shifted the noise. Equation (26) is equivalent to the LPZ spectral function of Tang and Norris(42) except for the introduction of the noise E'(z). Once the prediction coefficients {a_m} have been determined, eqn (26) may successfully be employed to derive a phased spectrum, with the added advantage that if the z transform is performed with z = exp(kΔt + i2πνΔt), where k is a positive constant, the spectral resolution is improved.
If N is chosen to be greater than M, the summation order of eqn (25) cannot be employed, and Ni and Scheraga show that eqn (24) may be converted to

X(z) \sum_{m=0}^{M} a_m z^{-m} = \sum_{k=0}^{N-1} x_k z^{-k} \sum_{m=0}^{M'} a_m z^{-m} + E'(z) ,

where M' is the lesser of M and N - 1 - k, and normally N = K to employ all the measured data. Thus the spectrum is given by

X(z) = \sum_{k=0}^{K-1} x_k z^{-k} \sum_{m=0}^{M'} a_m z^{-m} \Big/ A(z) + E'(z) / A(z) ,   (27)

where M' = min(M, K - 1 - k). These two ways of approaching the summation in eqn (24) are illustrated in Fig. 10. Equation (24) represents summation over the rows in Fig. 10; Tang and Norris' method uses summation over the diagonals, and Ni and Scheraga suggest summing over the columns.
FIG. 10. The terms which have to be summed in eqn (24). Summation over the rows represents the inner summation of eqn (24); by summing over the diagonals Tang and Norris were able to obtain eqn (25); summation over the columns, eqn (27), was suggested by Ni and Scheraga.
Whilst eqn (26) is suitable for implementation using fast transforms if the spectrum is to be evaluated at a set of equidistant frequencies and if the effects of the noise E(z) are ignored, eqn (27) is computationally more expensive, even ignoring noise. Ni and Scheraga(43) have also used the linear prediction eqn (17) to extrapolate the FID beyond the measured K elements before Fourier transformation. This is also a valid approach, and probably that which is intuitively obvious. However, for noisy data the extra elements x_k which are introduced for k ≥ K are derived from the noisiest part of the FID, and a spectrum derived from straightforward extrapolation shows a higher noise level than that produced by the LPZ function. This would also be the case if eqn (27) were to be implemented. If the prediction given by eqn (17) is good and e_k can be neglected these are valid results; but since the methods to be described only minimise e_k and do not reduce it to zero, the results of applying eqns (26) or (27) without the noise should be treated with caution: the intensities tend not to be reliable at low signal-to-noise ratios. Finally, I wish to emphasise once again that the noise we have been discussing is not observation noise but prediction noise, related to the accuracy with which x_k can be calculated based on the previous M values of x. The interaction of observation noise with
prediction noise is difficult to evaluate. This small but subtle point is the principal logical difference between AR and LP. In AR the existence of observation noise is neglected and the white noise sequence e_k is treated as part of the model. This noise, being white, has a constant spectrum which should be included in the equations. In LP the vector {e_k} has elements which represent prediction errors. These may not necessarily be members of a white noise sequence, but in any case the intention of LP is to apply methods to predict the time sequence {x_k} as exactly as possible whilst averaging out the observation noise. This works well at high signal-to-noise levels, but the assumptions are no longer valid when significant noise is present.
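The extrapolation route just described can be sketched in a few lines. For a noiseless single line the exact prediction coefficient is the pole z0 = exp((-1/T2 + i2πν0)Δt), since x_k = z0 x_{k-1}; the parameter values below are assumptions for illustration only.

```python
import numpy as np

def lp_extend(x, a, n_extra):
    """Extend the data using x_k = -sum_{m=1}^{M} a_m x_{k-m}  (eqn 17, e_k = 0)."""
    x = list(x)
    for _ in range(n_extra):
        x.append(-sum(a[m - 1] * x[-m] for m in range(1, len(a) + 1)))
    return np.array(x)

dt, nu0, T2 = 0.001, 100.0, 0.05
z0 = np.exp((-1 / T2 + 2j * np.pi * nu0) * dt)
short = z0 ** np.arange(64)                  # 64 "measured" points
full = lp_extend(short, [-z0], 512 - 64)     # a_1 = -z0 extends exactly
print(np.allclose(full, z0 ** np.arange(512)))  # True
```

With noisy data the extrapolated points inherit the late-FID noise, which is exactly why the text prefers the LPZ route.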
FIG. 11. Demonstration of the effectiveness of the LPZ formulation of Tang and Norris in calculating a spectrum containing lines with arbitrary phases from linear prediction coefficients. (a) 1024 point FFT of a noiseless synthetic signal. (b) 1024 point FFT of the same signal but with added noise. (c) 1024 point LPZ spectrum of the same noisy data as in (b), obtained from 200 prediction coefficients. Note the improved sensitivity and also the correct reproduction of the phases. (Reproduced from Ref. 42, with permission.)

Figure 11 shows that the LPZ treatment of the prediction coefficients does actually result in a correctly phased spectrum.
6.1.2. Yule-Walker Equations. Let us start the discussion of the determination of {a_m} by assuming once again that no observation noise is present. The most obvious way to calculate these coefficients, a method introduced by Walker,(45) leads to normal equations known as the Yule-Walker equations. This approach is based on the autocorrelation function of the data (see Appendix A), so we have to start by estimating the autocorrelation coefficients from the data {x_k}. If we assume that the time series is stationary and zero-mean, the autocorrelation coefficients are given by

R_n = R^*_{-n} = \frac{1}{A} \sum_{k=0}^{K-n-1} x_{k+n} x_k^* ,  n \ge 0 ,   (28)

where the factor A is chosen to be either K or K - n. The first alternative is obviously a biased estimate, since the number of terms in the summation is K - n, but it is nearly always used because the matrix of autocorrelation coefficients R is thus ensured to be positive semidefinite. This is a very important feature and a property of value in spectral estimation, since the z transform A(z) of the prediction coefficients {a_m} then has all its zeros inside the unit circle.(39) Because A(z) appears in the denominator of eqns (20), (26) and (27), this means that the poles of X(z) also lie within the unit circle and we have a stable system where X(ν) can be evaluated in the region of convergence of X(z) (see Appendix B). Using biased estimates, however, does tend to reduce the spectral resolution for short data records.(38,39) As the length K of the data sequence becomes large, though, the difference between the two alternatives becomes small, so the choice is critical only for short sequences. In spite of this bias the coefficients obtained from eqn (28) are some sort of estimate of the expectation
values E[x_k x^*_{k-n}], which can also be obtained by multiplying eqn (17) by x^*_{k-n} and taking the expectation:

E[x_k x^*_{k-n}] = -E\left[ \sum_{m=1}^{M} a_m x_{k-m} x^*_{k-n} \right] + E[e_k x^*_{k-n}] .

Since the data elements x_{k-n} are uncorrelated with the innovation e_k for n > 0, we get
R_n = -\sum_{m=1}^{M} a_m R_{n-m} ,   (29)
and, for n = 0,

R_0 = -\sum_{m=1}^{M} a_m R_{-m} + \sigma^2 ,   (30)
where \sigma^2 = E[e_k e_k^*] is the variance of the white noise sequence {e_k}. Combining eqns (29) and (30) and setting a_0 = 1 allows us to write

\sum_{m=0}^{M} a_m R_{n-m} = \begin{cases} \sigma^2 & n = 0 \\ 0 & n > 0 \end{cases}   (31)

which are known as the Yule-Walker equations.

6.1.3. The Levinson Recursion. Given the estimates of the autocorrelation coefficients, we now have to solve eqns (31) for {a_m^{(M)}} and \sigma_M^2, where the M as subscript or superscript denotes that the autocorrelation matrix used is of order M. Equations (31), which can be expressed in matrix form as

\begin{pmatrix} R_0 & R_{-1} & \cdots & R_{-M} \\ R_1 & R_0 & \cdots & R_{1-M} \\ \vdots & & & \vdots \\ R_M & R_{M-1} & \cdots & R_0 \end{pmatrix} \begin{pmatrix} 1 \\ a_1^{(M)} \\ \vdots \\ a_M^{(M)} \end{pmatrix} = \begin{pmatrix} \sigma_M^2 \\ 0 \\ \vdots \\ 0 \end{pmatrix} ,   (32)

could be solved by standard matrix methods, but a much more efficient technique, taking into account the Hermitian Toeplitz structure of the autocorrelation matrix R, was described first by Levinson(46) and improved by Durbin.(47)
Details are given in various texts (e.g. references 26 and 38), but since the derivation is simple and the algorithm forms the basis of a good many spectral estimation techniques, I repeat it here. Let us apply the simple trick of taking the complex conjugate of eqn (31). Noting that \sigma_M^2 is real and R^*_{-n} = R_n, we obtain, after rearranging the indices,

\sum_{m=0}^{M} a_{M-m}^{(M)*} R_{n-m} = \begin{cases} \sigma_M^2 & n = M \\ 0 & n < M \end{cases}   (33)
which may also be expressed in matrix form as

\begin{pmatrix} R_0 & R_{-1} & \cdots & R_{-M} \\ R_1 & R_0 & \cdots & R_{1-M} \\ \vdots & & & \vdots \\ R_M & R_{M-1} & \cdots & R_0 \end{pmatrix} \begin{pmatrix} a_M^{(M)*} \\ a_{M-1}^{(M)*} \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ \sigma_M^2 \end{pmatrix} .
These two sets of M + 1 equations, (32) and (33), may be combined, without loss of generality, to produce one set of M + 2 equations

\begin{pmatrix} R_0 & R_{-1} & \cdots & R_{-M-1} \\ R_1 & R_0 & \cdots & R_{-M} \\ \vdots & & & \vdots \\ R_{M+1} & R_M & \cdots & R_0 \end{pmatrix} \left[ \begin{pmatrix} 1 \\ a_1^{(M)} \\ \vdots \\ a_M^{(M)} \\ 0 \end{pmatrix} + q_{M+1} \begin{pmatrix} 0 \\ a_M^{(M)*} \\ \vdots \\ a_1^{(M)*} \\ 1 \end{pmatrix} \right] = \begin{pmatrix} \sigma_M^2 + q_{M+1} \Delta_{M+1}^* \\ 0 \\ \vdots \\ 0 \\ \Delta_{M+1} + q_{M+1} \sigma_M^2 \end{pmatrix} ,   (34)
where q_{M+1} and \Delta_{M+1} are unknowns which have to be determined. Comparing eqn (34) with eqn (32) expanded to M + 2 equations, which have the M + 1 new coefficients a_m^{(M+1)}:

\begin{pmatrix} R_0 & R_{-1} & \cdots & R_{-M-1} \\ R_1 & R_0 & \cdots & R_{-M} \\ \vdots & & & \vdots \\ R_{M+1} & R_M & \cdots & R_0 \end{pmatrix} \begin{pmatrix} 1 \\ a_1^{(M+1)} \\ \vdots \\ a_{M+1}^{(M+1)} \end{pmatrix} = \begin{pmatrix} \sigma_{M+1}^2 \\ 0 \\ \vdots \\ 0 \end{pmatrix} ,

we see immediately that q_{M+1} = a_{M+1}^{(M+1)} and \Delta_{M+1} + q_{M+1} \sigma_M^2 = 0, so that

a_{M+1}^{(M+1)} = -\Delta_{M+1} / \sigma_M^2 .   (35)
Obviously Δ_{M+1} is given by the summation

\Delta_{M+1} = \sum_{m=0}^{M} a_m^{(M)} R_{M+1-m} ,        (36)

where a_0 = 1, and thus combining eqns (35) and (36) we can express the new coefficient a_{M+1}^{(M+1)} in terms of the old values, i.e.

a_{M+1}^{(M+1)} = -\left[ \sum_{m=0}^{M} a_m^{(M)} R_{M+1-m} \right] \Big/ \sigma_M^2 .        (37)
The other coefficients of order M + 1 can be similarly obtained since

a_m^{(M+1)} = a_m^{(M)} + \rho_{M+1}\, a_{M+1-m}^{(M)*} ,        (38a)

yielding

a_m^{(M+1)} = a_m^{(M)} + a_{M+1}^{(M+1)}\, a_{M+1-m}^{(M)*} .        (38b)

Finally the new noise variance is given by

\sigma_{M+1}^2 = \sigma_M^2 \left( 1 - |a_{M+1}^{(M+1)}|^2 \right) .        (39)
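The recursion of eqns (37) to (39) is compact enough to sketch in a few lines. The following is a minimal NumPy illustration; the function name, interface and the convention R_{-n} = R_n^* are assumptions of this sketch, not part of the original text:

```python
import numpy as np

def levinson(R, M):
    """Levinson-Durbin recursion following eqns (37)-(39).

    R : autocorrelation estimates R_0 .. R_M.
    Returns the order-M coefficients a_0..a_M (a_0 = 1), the noise
    variance sigma_M^2, and the models at all lower orders as well.
    """
    a = np.array([1.0 + 0j])            # order-0 predictor
    sigma2 = R[0].real                  # initialisation sigma_0^2 = R_0
    lower_orders = [(a.copy(), sigma2)]
    for m in range(1, M + 1):
        # eqns (36)/(37): Delta_m and the new reflection coefficient
        delta = np.dot(a, R[m:0:-1])    # sum_j a_j^(m-1) R_{m-j}
        rho = -delta / sigma2
        # eqn (38): update all coefficients; the last entry becomes rho
        a = np.concatenate([a, [0.0]])
        a = a + rho * np.conj(a[::-1])
        # eqn (39): shrink the noise variance
        sigma2 *= 1.0 - abs(rho) ** 2
        lower_orders.append((a.copy(), sigma2))
    return a, sigma2, lower_orders
```

Because each pass stores the model of its order, one run yields the AR parameters at every order up to M, which is exactly the property exploited by the order-selection criteria of Section 6.1.6.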
These equations thus provide a method for recursively determining the coefficients a_m. Starting with M = 0, eqn (31) provides the initialisation σ_0² = R_0; thus a_1^{(1)} can be found from eqn (37) and σ_1² from eqn (39), and so on up to the desired order M. It should be noted that this recurrence algorithm provides not only the AR parameters at order M, but also those at all lower orders. This feature is valuable because in general the correct order is not known beforehand and some criterion, usually based on the noise variance, must be employed in selecting M. Obviously an algorithm which can be used to generate successively higher order models is useful in this respect. The Levinson algorithm is in any case a powerful method of solving linear equations, such as eqns (31), involving Hermitian Toeplitz matrices.

6.1.4. Burg's Modification. The beautifully conceived recursion exemplified by eqns (37) to (39) has a serious defect: it solves eqns (31) with estimates of the autocorrelation coefficients R_i. These of course are not known; they can only be estimated from the data, for instance using eqn (28). Forgetting for the moment the problem of whether we choose A = K or A = K - n, we can see that if K is reasonably large eqn (28) may work quite well for small values of n, where the summation has sufficient terms to average out observation noise. But the approximation gets progressively worse as n increases, and for small K the estimates will presumably be considerably in error. Using the data to produce poor estimates of the autocorrelation coefficients is a bad way of utilising the information at our disposal, and Burg(18) has proposed an interesting method which obviates the need for prior estimates of the values of R and which operates solely on the data. To understand
this modification to the original algorithm one has to accept the key role played by ρ_M = a_M^{(M)}, often called the reflection coefficient, in the recursion. This parameter is determined in eqn (37) from the old prediction coefficients, the autocorrelation coefficients and the noise variance, and is, indeed, the only parameter in the whole recursion which depends directly on the autocorrelation function, the other prediction coefficients being derived using eqn (38). An equally valid, and perhaps better, way which avoids the need for the autocorrelation estimates would be to choose ρ_M such that the noise variance is minimised at each stage of the iteration. This seems a logical thing to do if our aim is to build a good predictor, since we know that a measure of its quality is σ_M², and it would furthermore operate directly on the available data {x_k}. Thus the aim of the Burg modification is to choose ρ_M such that
\partial \sigma_M^2 / \partial \rho_M = 0 ,        (40)

and the next step suggested by Burg was a new way of determining σ_M². Equation (17)
allows us to write the prediction error as

e_k = \sum_{m=0}^{M} a_m x_{k-m} ,

with a_0 = 1, so that an estimate of the prediction noise variance or error power is given by summing over the available |e_k|² to produce

\sigma_{f,M}^2 = (K - M)^{-1} \sum_{k=M}^{K-1} \left| \sum_{m=0}^{M} a_m x_{k-m} \right|^2 .        (41)
Because the elements x_k are forward predicted from the M previous elements (denoted by the "f" subscript) the outer summation must be unsymmetrical with respect to the data sequence, to avoid running the inner summation off the available data elements. In order to symmetrise this Burg suggested using a backward predictor on the same data to obtain another estimate of the error power σ_{b,M}², and to minimise the average of both with respect to ρ_M. To see what the backward predictor is, substitute eqn (38) into eqn (17), modified to include the "f" subscript and the prediction order superscript, thus

e_{f,k}^{(M)} = e_{f,k}^{(M-1)} + \rho_M \, e_{b,k-1}^{(M-1)} ,        (42)

where

e_{b,k}^{(M)} = x_{k-M} + \sum_{m=1}^{M} a_m^{(M)*} x_{k-M+m} .        (43)
In eqn (43) the error in element xk_M is related to the M following elements in the time series. Thus it can be regarded as the error associated with a backward predictor. The backward prediction coefficients are thus seen to be the complex conjugates of the forward coefficients. This is a consequence of the assumption of stationarity and no longer applies if the time series is non-stationary (see Appendix A). Equation (42) expresses the forward prediction error of order M in terms of the forward and backward predictors of order M - 1.
Equally well the backward prediction error of order M is described by

e_{b,k}^{(M)} = e_{b,k-1}^{(M-1)} + \rho_M^* \, e_{f,k}^{(M-1)} .        (44)

Now the forward prediction error power is given by eqn (41), which can be reformulated using eqn (42) as

\sigma_{f,M}^2 = (K - M)^{-1} \sum_{k=M}^{K-1} |e_{f,k}^{(M)}|^2 ,

and we can write an analogous equation for the backward prediction error power as

\sigma_{b,M}^2 = (K - M)^{-1} \sum_{k=M}^{K-1} |e_{b,k}^{(M)}|^2 .
The total power is given as the average of forward and backward powers

\sigma_M^2 = \tfrac{1}{2}\left( \sigma_{f,M}^2 + \sigma_{b,M}^2 \right) .        (45)

Applying eqn (40) to the definition of the prediction error power given by eqn (45), and employing eqns (42) and (44), yields

\rho_M = -2 \sum_{k=M}^{K-1} e_{f,k}^{(M-1)} e_{b,k-1}^{(M-1)*} \Big/ \sum_{k=M}^{K-1} \left( |e_{f,k}^{(M-1)}|^2 + |e_{b,k-1}^{(M-1)}|^2 \right) ,        (46)

in which the new reflection coefficient is given in terms of the old errors. This is a
completely different way of obtaining the reflection coefficient to that previously described in eqns (35) or (37), but the remaining coefficients at any order are still calculated using the Levinson recursion of eqn (38) and thus retain its advantage of computational efficiency. The reflection coefficient ρ_M in eqn (46) is not dependent on the choice of K or K - M as denominator in eqn (45). Thus the effects of bias, introduced into the estimates of the autocorrelation function eqn (28) to ensure stability and which lead to a degradation in resolution, are not evident in the Burg procedure, which generally yields a very high resolution. The stability of the Burg algorithm rests on the reflection coefficient's being less than unity, which causes the zeros of the z transform of the prediction coefficients {a_m} to lie within the unit circle. As with straightforward solution of the Yule-Walker equations, the recursive technique employed provides the prediction coefficients at all lower orders. The iteration is started for M = 0 by setting the error power

\sigma_0^2 = K^{-1} \sum_{k=0}^{K-1} x_k x_k^* ,
and thus eqn (46) may be used to determine ρ_1, which can then be substituted into eqn (38) to give the prediction coefficients of order 1. These may then be used to find the prediction error of order 1, and eqn (46) is employed again to find ρ_2, and so on. The whole iteration is rather neat and works exceedingly fast. It can be seen that the method builds up its own implicit estimates of the autocorrelation function without assuming anything about the availability or non-availability of data outside that which has been measured; the sole criterion used is that the prediction error at each stage should be minimised. (A parenthetical note: Anderson derived a recursive relationship for the denominator of eqn (46) which has been quoted in reference 40. Faber has demonstrated that eqn (2.74) of reference 40 is incorrect and has given a derivation of Anderson's recursion. In Marple's book(38) the recursion (eqn (8.16) of reference 38) is given correctly.)
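The whole Burg recursion fits comfortably into a few lines of code. The sketch below (NumPy; the function name and interface are my own) determines each reflection coefficient from eqn (46), updates the coefficients with eqn (38) and the error sequences with eqns (42) and (44):

```python
import numpy as np

def burg(x, M):
    """Burg's algorithm: reflection coefficients straight from the data."""
    x = np.asarray(x, dtype=complex)
    ef = x.copy()                       # forward errors, order 0: e_f,k = x_k
    eb = x.copy()                       # backward errors, order 0
    a = np.array([1.0 + 0j])
    sigma2 = np.mean(np.abs(x) ** 2)    # sigma_0^2 = K^-1 sum |x_k|^2
    for m in range(1, M + 1):
        f, b = ef[1:], eb[:-1]          # align e_f,k with e_b,k-1
        # eqn (46): new reflection coefficient from the old errors
        rho = -2.0 * np.vdot(b, f) / (np.vdot(f, f).real + np.vdot(b, b).real)
        # eqn (38): Levinson update of the prediction coefficients
        a = np.concatenate([a, [0.0]])
        a = a + rho * np.conj(a[::-1])
        # eqns (42) and (44): error sequences at the new order
        ef, eb = f + rho * b, b + np.conj(rho) * f
        sigma2 *= 1.0 - abs(rho) ** 2
    return a, sigma2
```

For a single undamped complex exponential x_k = exp(iωk), the first reflection coefficient already annihilates the prediction error and the order-1 model is exact, which illustrates why the method resolves sharp lines from short records.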
6.1.5. Least-Squares Methods. Although Burg's procedure seems to be an enormous improvement over straightforward solution of the Yule-Walker equations, several problems have been noted by various authors. One difficulty is the appearance of two lines where only one should be,(51) another problem appears to be a frequency shifting depending on the phase of the signal.(52,53) Both effects can be reduced by introducing more flexibility into the
algorithm designed to determine the coefficients a_m. The frequency shifting has been ascribed to errors in the autocorrelations caused by using short data records, so that the situation could be improved by windowing to reduce end effects. Another alternative, which avoids this data-destructive procedure, is to eliminate the Levinson recursion from the Burg algorithm and to solve for all the a_m^{(M)} at each stage of the calculation. This concept was first introduced by Ulrych and Clayton, and fast algorithms to implement it have been published by Marple.(38,56) Let us start by considering eqn (17) again, which can be rewritten in matrix form as

e = X a ,        (47)

where e is an error vector, X a rectangular data matrix and a the vector of sought coefficients, with a_0 = 1. A measure of the error power is the sum of the squares of the elements of e, so that we are seeking a solution of eqn (47) which reduces e^H e, where e^H is the Hermitian transpose of the vector e; this can be recast as the normal equations

X^H X \, a = s ,        (48)

where s = [\sigma_M^2, 0, 0, \ldots, 0]^T, with σ_M² being the minimum error power.
What still need to be defined are the dimensions of the vectors and matrices in eqns (47) and (48). If the full range of available data is used eqn (47) can be written

\begin{pmatrix} e_0 \\ \vdots \\ e_M \\ \vdots \\ e_{K-1+M} \end{pmatrix} = \begin{pmatrix} x_0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ x_M & x_{M-1} & \cdots & x_0 \\ \vdots & & & \vdots \\ 0 & \cdots & 0 & x_{K-1} \end{pmatrix} \begin{pmatrix} 1 \\ a_1 \\ \vdots \\ a_M \end{pmatrix}        (49)

and we see that the matrix X has zeros in the upper right and lower left corners. This means that the prediction is running off the available data, a situation which Burg tried to avoid. Furthermore, the product X^H X is just the biased autocorrelation matrix multiplied by K, so this method (known as the autocorrelation method of linear prediction) leads simply to the Yule-Walker equations, which can be solved by the Levinson recursion, but which yield a poorly resolved spectral estimate. If the dimension of X is reduced so that the prediction remains on the data, eqn (49) is transformed to
\begin{pmatrix} e_M \\ \vdots \\ e_{K-1} \end{pmatrix} = \begin{pmatrix} x_M & x_{M-1} & \cdots & x_0 \\ \vdots & & & \vdots \\ x_{K-1} & x_{K-2} & \cdots & x_{K-1-M} \end{pmatrix} \begin{pmatrix} 1 \\ a_1 \\ \vdots \\ a_M \end{pmatrix}        (50)
and this least-squares approach is known as the covariance method of linear prediction. In this case the product X^H X is no longer Toeplitz and the Levinson recursion cannot be employed.
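Since eqn (50) is just an overdetermined linear system, the covariance method reduces to a single least-squares solve. A minimal sketch follows (NumPy; the generic solver here stands in for the fast Toeplitz-structured algorithms discussed below, and the function name is mine):

```python
import numpy as np

def covariance_lp(x, M):
    """Covariance method of linear prediction: least-squares solution of
    eqn (50), with every predicted point staying on the measured data."""
    x = np.asarray(x, dtype=complex)
    K = len(x)
    # row for x_k (k = M .. K-1) holds the M preceding samples x_{k-1}..x_{k-M}
    X = np.array([[x[k - m] for m in range(1, M + 1)] for k in range(M, K)])
    coeffs, *_ = np.linalg.lstsq(X, -x[M:], rcond=None)
    a = np.concatenate([[1.0 + 0j], coeffs])
    resid = x[M:] + X @ coeffs          # prediction errors e_M .. e_{K-1}
    sigma2 = np.mean(np.abs(resid) ** 2)
    return a, sigma2
```

Unlike the Yule-Walker and Burg routes, nothing here assumes stationarity, which is why a damped exponential, the typical NMR line shape, is recovered exactly.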
The solution of this least-squares linear prediction problem could be found using standard techniques, which has been accomplished in NMR, but a fast algorithm similar to that of Levinson can be developed using the Toeplitz structure of the factor matrices.(38,59,60) In this case the assumption is no longer made that the time signal is stationary, and the forward prediction coefficients {a_m} are not related to the backward coefficients, which are also obtained from this fast algorithm by solving an equation similar to eqn (50). As with the Burg modification, the recursive program for solving eqn (50) may provide prediction coefficients for all lower orders. If the time series is non-stationary it is advantageous to keep forward and backward prediction separate to avoid distortions, but for autoregressive spectral analysis of stationary time series a better estimate can be obtained by combining these to produce one set of coefficients as in the Burg algorithm. This is called the modified covariance method, which again results in a set of normal equations
in which the forward coefficients a_m, related to the backward coefficients by complex conjugation, are found by minimising the forward and backward prediction error jointly. Again the matrix R^{(M)} is not the autocorrelation matrix and does not have Toeplitz form (although these properties appear as K becomes large), so the neat Levinson recursion is
once more not appropriate. However, as Marple(56) has pointed out and successfully employed, R has a special structure composed of Toeplitz matrices. In common with all the previously considered algorithms, the modified covariance recursion provides prediction coefficients at all lower orders.

The details of both covariance and modified covariance procedures are much too involved to relate here, but are fully explained in Marple's recent book.(38) The algorithms have an efficiency close to that of Burg and solve the problems of line splitting and frequency shifting to a great extent. These advantages must be bought at a price, however, and the deletion of the Levinson recursion, which ensures stability when the reflection coefficient is less than unity, means that some of the zeros of A(z) may occur outside the unit circle. This has been noted by Marple.(56) A further point concerns the efficiency of these recursive algorithms, for they are really nothing more than solutions to the least-squares problem eqns (47) and (48). The special form of the R^{(M)} matrix means that a total of KM + 6M^2 operations is needed to find the vector {a_m}; the standard approaches, not taking into account the structure of R^{(M)}, require considerably more effort, of the order of M^3.

The line splitting phenomenon, which is associated primarily with short data records, has been attributed by Fougere to the non-complete minimisation of the prediction error power at any one order. He devised a solution based on retaining the Levinson recursion eqn (38), which links the prediction coefficients of any order together; at the same time the minimum of σ_M² is sought by varying ρ_1, ρ_2, ..., ρ_M independently. The drawback to this approach is that no recursive algorithm exists at present, so the implementation is expensive. Details of this method are given in the original article and also by Chen, who lists a computer program of the algorithm.

6.1.6. Prediction Order Selection. One question which has not yet been posed is related to the value of M, the order of the prediction. The problem of order selection is common to all the autoregressive and linear prediction methods discussed here. The expected information content of the spectrum is obviously a function of M, because the polynomial in eqn (20) has M roots, so the spectrum will have M spectral features.

Obviously the right choice of M is crucial to the correct implementation of the AR or LP method. If M is too small then the resolution will be poor, and one might expect that by increasing M the resolution will improve up to some value of M and then remain constant. In general this is what one observes; however, as M gets larger and larger the algorithm will attempt to fit the observation noise and spurious detail will become apparent
in the spectrum. This is a major problem which has not been satisfactorily solved, and although there are criteria for choosing an optimal M the AR/LP methods do suffer from the significant drawback that an incorrect choice may lead, via the inherent non-linearity, to noise in the data appearing as interpretable signals in the spectrum.

Apart from the obvious way of varying M until the spectrum looks "nice", which although not objective has certain merits, the usual criteria are based on the behaviour of the prediction error variance. To select M such that σ_M² is a minimum will not work, because all these methods have a prediction error variance which is a monotonically decreasing function of M. So some other function is required. Akaike has introduced the final prediction error (FPE) criterion, based on a combination of the error variance σ_M² and the statistical deviation of the coefficients a_m^{(M)}. Since this latter quantity increases with increasing M whereas σ_M² decreases, the FPE, defined as

\mathrm{FPE}(M) = \sigma_M^2 \, \frac{K + (M + 1)}{K - (M + 1)} ,

should be a minimum at some value of M. Further details are given by Ulrych and Bishop.
Another criterion, suggested by Akaike(65) and called the Akaike information criterion (AIC), is defined by

\mathrm{AIC}(M) = K \log \sigma_M^2 + 2M ,

where the term 2M can be regarded as a penalty for increasing the model order. Ulrych and Ooe(41) provide a justification for this measure which, as with the FPE, should be minimised to select M. Yet another criterion was introduced by Parzen and is known as the criterion autoregressive transfer (CAT) function,

\mathrm{CAT}(M) = \frac{1}{K} \sum_{m=1}^{M} \frac{K - m}{K \sigma_m^2} \; - \; \frac{K - M}{K \sigma_M^2} ,
and which again should be minimised to produce the correct order M. In general none of these criteria works well: all tend to underestimate the order, a feature which has also been noted in the analysis of NMR signals. This underestimation is particularly serious for low signal-to-noise ratios,(68) and Ulrych and Ooe(41) suggest the empirical method of selecting M between K/3 and K/2. Viti et al. show, however, that acceptable spectra may be obtained without choosing such extreme model orders. Using too high a value for M leads, as could be expected, to spurious peaks in the spectrum. This has been shown by Currie(70) to occur when M > K/3, which seems to contradict the experience of Ulrych and Ooe. The conclusion must be that the choice of the correct order is very dependent on the type of data and the signal-to-noise ratio.
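As an illustration of how such criteria are applied in practice, the sketch below scores a set of error variances σ_M² with the FPE and AIC formulae given above; the variances are invented for the example and simply level off beyond order 3:

```python
import numpy as np

def fpe(sigma2, M, K):
    """Akaike's final prediction error criterion."""
    return sigma2 * (K + (M + 1)) / (K - (M + 1))

def aic(sigma2, M, K):
    """Akaike information criterion; the 2M term penalises model order."""
    return K * np.log(sigma2) + 2 * M

K = 256
# hypothetical prediction error variances that level off beyond order 3
sigma2 = {1: 2.0, 2: 0.8, 3: 0.30, 4: 0.299, 5: 0.298}
best_aic = min(sigma2, key=lambda M: aic(sigma2[M], M, K))
best_fpe = min(sigma2, key=lambda M: fpe(sigma2[M], M, K))
```

Both criteria settle on the order at which σ_M² stops decreasing appreciably; on real data, as noted above, they tend to err on the low side.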
6.1.7. Eigenanalysis Methods. So far we have been concerned with signals containing no observation noise; our aim has been to build a good predictor. It is possible to accommodate the noise by increasing the number M of coefficients a_m, but one really should acknowledge that predicting the noise is like predicting the weather! Furthermore, by increasing M we are introducing spurious detail into the spectrum, and this is one reason for the failure of AR/LP methods at low signal-to-noise ratios.
It should be apparent, though, that there is a significant difference between noise and signal. The noise contributes to the diagonal elements only of the autocorrelation matrix: this is the definition of white noise. Thus it should be possible to separate signal from noise on the basis of their differing statistics. The way to do this is to decide which eigenvalues of the autocorrelation matrix R, or of the equivalent data matrix products X^H X, belong to the signal and which belong to the noise. After we have made the distinction the
spectrum is constructed from the signal part of the matrix and the noise part is discarded.

Let us go back to eqn (32). I said it could be solved by standard matrix methods, and this is not only true but opens up completely new aspects. Although one could approach the problem by trying to remove the observation noise variance from the diagonal elements of the autocorrelation matrix,(71) the obvious problem here is that we do not know exactly how much to remove. Is it possible to use the matrix itself to provide the information somehow? The autocorrelation matrix of order M can be written as

R^{(M)} = \sigma^2 I^{(M)} + \left[ R^{(M)} - \sigma^2 I^{(M)} \right] = \sigma^2 I^{(M)} + R'^{(M)} ,        (51)
where I^{(M)} is the unit matrix of order M and σ² is the variance of the observation noise. If the eigenvalues of R^{(M)} are

\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_M ,

which are all greater than zero because R^{(M)} is positive definite, then the eigenvalues of R'^{(M)} are

\lambda_0 - \sigma^2 \geq \lambda_1 - \sigma^2 \geq \cdots \geq \lambda_M - \sigma^2 ,

and these must all be positive or zero for R'^{(M)} to remain positive semidefinite, so that σ² must lie in the range 0 ≤ σ² ≤ λ_M, the minimum eigenvalue of R^{(M)}. What we could do is determine the eigenvalues of R^{(M)}, subtract σ² to yield R'^{(M)}, and use this in the Yule-Walker equations to determine better AR parameters. As σ² is chosen closer and closer to the minimum eigenvalue, the spectrum gets sharper until, as σ² → λ_M, we have a line spectrum (R'^{(M)} is singular and the poles of A(z) lie on the unit circle). This method is equivalent to a spectral estimation technique introduced by Pisarenko for analysing signals consisting of sinusoids in white noise(26,38,40,41) (I once ordered this in an expensive restaurant and was violently sick afterwards). As noted previously, better AR/LP parameters can be obtained by using the data directly instead of going through the autocorrelation estimates, so it is of interest to know if the same eigenanalysis technique can be applied to such methods.
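The idea can be demonstrated numerically. In the sketch below (NumPy; the signal, noise level and autocorrelation convention are all invented for the illustration), the smallest eigenvalue of a low-order autocorrelation matrix estimated from one sinusoid in white noise comes out close to the true noise variance σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 4000, 4
sigma2 = 0.25                          # true observation-noise variance
k = np.arange(K)
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal(K)
                               + 1j * rng.standard_normal(K))
d = np.exp(0.5j * k) + noise           # one unit sinusoid in white noise
# biased autocorrelation estimates R_0 .. R_{M-1}
R = np.array([np.vdot(d[:K - n], d[n:]) / K for n in range(M)])
# Hermitian Toeplitz matrix R^(M), element (i,j) = R_{i-j}
Rmat = np.array([[R[i - j] if i >= j else np.conj(R[j - i])
                  for j in range(M)] for i in range(M)])
lam = np.linalg.eigvalsh(Rmat)         # eigenvalues in ascending order
noise_estimate = lam[0]                # approximately sigma^2
```

One eigenvalue (here roughly M + σ²) carries the sinusoid, while the remaining ones cluster at the noise floor; that separation is exactly what the Pisarenko-type methods exploit.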
The least squares linear prediction algorithms are all based on data matrix products of the form XHX which are similar to the autocorrelation matrix, so the intuitive way to proceed would be to choose M large enough for the prediction to work well, including the noise, and then to determine the eigenvalues of XHX and finally solve a new system of equations with a modified data matrix product where the minimum eigenvalue has been subtracted. This would be one way of approaching the problem, but it is not necessary to reevaluate a modified data matrix product since the determination of the eigenvalues and eigenvectors of XHX has already provided us with the solution. To see this return to eqn (17) yet again and assuming that the prediction order is chosen so large that even the noise is accounted for we can neglect ek and write this as
d_k = -\sum_{m=1}^{M} a_m d_{k-m} ,        (52)

where d_k is used rather than x_k to show that observation noise is present. Equation (52) can be expressed in matrix form as

d = -D a ,        (53)

where

D = \begin{pmatrix} d_{M-1} & d_{M-2} & \cdots & d_0 \\ d_M & d_{M-1} & \cdots & d_1 \\ \vdots & & & \vdots \\ d_{K-2} & d_{K-3} & \cdots & d_{K-1-M} \end{pmatrix}, \qquad d = \begin{pmatrix} d_M \\ d_{M+1} \\ \vdots \\ d_{K-1} \end{pmatrix}, \qquad a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_M \end{pmatrix}.

The formal solution to eqn (53) is

a = -D^{-1} d ,        (54)
but since D is not a square matrix we have recourse to the pseudoinverse, which needs a bit of explaining.

The pseudoinverse of a rectangular matrix is based on its decomposition into singular values,(57,79) which can be regarded as an extension of the eigenvalue concept for square matrices. In general for an m×n complex matrix D of rank k, where k is necessarily less than or equal to the smaller of m and n, there exist positive real numbers λ_0 ≥ λ_1 ≥ ⋯ ≥ λ_{k-1} > 0 (the singular values of D), an m×m unitary matrix U = [u_0, u_1, ..., u_{m-1}] and an n×n unitary matrix V = [v_0, v_1, ..., v_{n-1}] such that

D = U \Lambda V^H = \sum_{i=0}^{k-1} \lambda_i u_i v_i^H ,        (55)

where the m×n matrix Λ has the diagonal elements λ_i. Since D^H = V Λ^H U^H, and U^H U = I_m and V^H V = I_n, where I_m and I_n are unit matrices of dimension m×m and n×n respectively, then

D^H D = V (\Lambda^H \Lambda) V^H \qquad \text{and} \qquad D D^H = U (\Lambda \Lambda^H) U^H ,

which demonstrates that the product matrices D^H D and D D^H have the same non-zero real eigenvalues λ_i², but that the eigenvectors of D^H D are v_i and those of D D^H are u_i.
The pseudoinverse D^# of D is defined using this decomposition as

D^{\#} = V \Lambda^{\#} U^H = \sum_{i=0}^{k-1} \lambda_i^{-1} v_i u_i^H ,

where the n×m matrix Λ^# has diagonal elements λ_i^{-1}. Using this definition eqn (54) can be written properly as

a = -D^{\#} d .

The n×n product matrix D^# D has k non-zero eigenvalues, so D^# D is of rank k.
Now if the product D^H D can be compared with the autocorrelation matrix R of eqn (51), then the minimum eigenvalue, which is related to the minimum singular value of D, could be subtracted from the others and the calculation of the AR/LP parameters should be improved. But we can go a step further. The order M has been chosen so that eqn (52) is nearly exact, and presumably a lot of noise has been fitted. The singular values of D correspond to the lines in the spectrum, so that for high order a lot of lines appear, many corresponding to the fitted noise. The decomposition of eqn (55) allows the separation into a signal subspace spanned by the k' principal eigenvectors u_i, v_i corresponding to the k' largest singular values, and a noise subspace spanned by the remaining eigenvectors corresponding to the weakest singular values. Based on this separation we can construct reduced rank approximations to D and D^#
as

\bar{D} = \sum_{i=0}^{k'-1} \lambda_i u_i v_i^H , \qquad \bar{D}^{\#} = \sum_{i=0}^{k'-1} \lambda_i^{-1} v_i u_i^H .

This separation
is clear if an AR process of order k' without observation noise is analysed with an M order predictor where M > k', for then only k' singular values are non-zero. As the noise level increases, however, the distinction becomes unclear and problems are encountered in selecting the correct signal order k'. We shall come back to this theme in Section 6.2, where we consider functional modelling.

I have introduced singular value decomposition (SVD) in this section because it is an ideal way to treat the pseudoinverse and to understand the concept of rank reduction, but the actual determination of the singular values as such is only one possible approach. The terminology of SVD has come into the NMR literature via the work of Barkhuijsen et al.,(75) which shall be explained in Section 6.2. However, although these authors used the idea of SVD, they approached the eigenanalysis problem by diagonalising the square data matrix product D^H D and thus avoided determining the singular values directly, as would have been possible using extant subroutines. The reason for this choice is the superior efficiency to be expected,(57) since the elements of the data matrix product may be calculated very cheaply, and the subsequent diagonalisation requires fewer operations than does the straightforward SVD. Regardless of the practical implementation, however, the basic principles remain the same: the rectangular data matrix or its square product is decomposed to yield the singular values or their squares, and a selection is made based on the relative sizes of these numbers. Which algorithm is chosen is merely a question of computational efficiency and stability.

The one direct application of AR, or rather LP, eigenanalysis in NMR is the LPZ method of Tang and Norris, who approach the least-squares problem eqn (53) by using a QR algorithm to reduce D to a product of an orthogonal matrix Q and an upper triangular
matrix R, followed by truncation to reduce the rank in the calculation of a. This QRD method is implemented using routines from the LINPACK library. QRD requires mn² - n³/3 operations, where m = max(M, K-M) and n = min(M, K-M), and since this is considerably larger than the KM + 6M² needed for the covariance method of Section 6.1.5, a heavy computational penalty must be paid for such noise compensation methods, although the QRD approach is probably more stable for large M. It must be admitted, though, that one of the real benefits of these methods is their tackling of the noise problem. Within the limitations of the AR model this can be successfully accomplished only by using matrix rank reduction schemes. It remains to be seen whether fast algorithms can be devised for this purpose.
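A short numerical sketch makes the subspace separation of eqn (55) concrete (NumPy; the two-component signal, the noise level and the dimensions are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
k = np.arange(64)
# hypothetical FID: two damped complex exponentials plus a little noise
x = ((0.97 * np.exp(0.6j)) ** k + 0.5 * (0.95 * np.exp(-1.4j)) ** k
     + 0.0005 * (rng.standard_normal(64) + 1j * rng.standard_normal(64)))
M = 8
# data matrix D of eqn (53): the row predicting x_k holds x_{k-1} .. x_{k-M}
D = np.array([[x[k0 - m] for m in range(1, M + 1)] for k0 in range(M, len(x))])
U, s, Vh = np.linalg.svd(D, full_matrices=False)   # s is sorted descending
# two singular values dominate, so the signal subspace has k' = 2
kprime = 2
D_bar = (U[:, :kprime] * s[:kprime]) @ Vh[:kprime]  # reduced-rank D-bar
```

The gap between the second and third singular values marks the boundary between signal and noise subspaces; the rank-2 matrix reproduces D almost exactly, and everything discarded with the weak singular values is fitted noise.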
6.1.8. Summary of AR and LP Methods. The AR methods which I have considered may be concisely summarised starting from eqn (17), which represents an element of the time series {x_k} as a linear function of M preceding elements plus an error. This can be called forward linear prediction. A similar equation could be set up for backward prediction, in which an element is linearly related to the following elements of the series. The second-order statistics of a stationary time series are represented by a Hermitian autocorrelation matrix R; furthermore the prediction coefficients {a_m} of eqn (17) and the matrix R are related by the Yule-Walker equations (31) to the variance of the prediction error. These equations may be solved by the Levinson algorithm, which has the advantage of computational efficiency and which enforces stability by ensuring that the zeros of the z transform of {a_m} all lie within the unit circle.

Considerable improvement in spectral estimation is obtained by employing Burg's modification, which retains the Levinson recursion linking the individual prediction coefficients at any one order and which determines the reflection coefficient ρ_M at order M such that the combined forward and backward prediction error is minimised. Burg's algorithm operates directly on the data and makes no assumptions about elements of the time series outside the range 0 to K - 1. This avoids the resolution-destructive windowing of the autocorrelation matrix and has given rise to the alternative (and somewhat confusing) description of this approach as a Maximum Entropy method (see Section 7.1). Stability is ensured because Burg's method retains the Levinson recursion.

One step further toward perfection is represented by the least-squares methods, where the normal equations (48) are solved directly to yield the prediction coefficients. These algorithms utilise the special structure of the data matrix to achieve an efficiency close to that of Burg's method, but do not ensure the stability associated with the Levinson recursion. They do, however, have the advantage that a non-stationary time series can be analysed by linear prediction: an important point in NMR (see Appendix A).

For all these algorithms the most important parameter is the prediction order M, whose selection may be based on the several criteria presented in Section 6.1.6.

A further point concerns the treatment of observation noise. Starting from noisy data {d_k} the spectrum D(z) is represented, using the linearity of the Fourier transform and applying eqn (20), as

D(z) = E(z) \Big/ \sum_{m=0}^{M} a_m z^{-m} \; + \; N(z) ,        (56)
where N(z) is the spectrum of the observation noise and z = exp(i2πνΔt). Equation (56) can equally well be written

D(z) = \left[ E(z) + N(z) \sum_{m=0}^{M} a_m z^{-m} \right] \Big/ \sum_{m=0}^{M} a_m z^{-m} ,

and because E(z) and N(z) represent the spectra of white noise processes this reduces to

D(z) = \sum_{m=0}^{M} b_m z^{-m} \Big/ \sum_{m=0}^{M} a_m z^{-m} ,        (57)
showing that the correct treatment of noisy data leads to a spectrum given by eqn (57) and characterised by both poles, corresponding to the roots of the denominator, and zeros, corresponding to the roots of the numerator. The number of poles equals the number of zeros; thus {d_k} is an ARMA process. This being so, it should be no surprise if the analysis of low signal-to-noise FIDs using the pure AR approach leads to imperfect results.

In general one could summarise by saying that the best method for damped sinusoids in white noise is represented by the covariance least-squares algorithm. Both the straightforward solution of the Yule-Walker equations and the Burg algorithm suffer from the assumption of stationarity. Whilst the Burg method yields, in general, a higher resolution than Yule-Walker, it succumbs to instabilities in the data, which lead to spurious line splitting and unreliable intensity ratios.

Another way of treating the observation noise is based on the dissimilar statistics of white noise and signal proper, the aim being to remove the noise bias by isolating its contribution to the diagonal terms of the autocorrelation matrix,(71) as shown in Section 6.1.7. This may be generalised to divide the autocorrelation matrix or data matrix into two vector subspaces, one being a signal subspace and the other a noise subspace which is discarded. Whilst the general ARMA treatment has not yet been applied to NMR, noise compensation, which forms a mainstay of several methods to be considered in Section 6.2, has been employed in the LPZ/QRD
advantages
analysis, i.e.
of AR
method of Tang and Norris.c5*) methods
are common
to all the modern
the defects of the Fourier transform may be avoided.
methods
of spectral
This largely means that
very short data records may be treated to give well resolved spectra: an important advantage over FT. The main disadvantages seem to me to be, on the one hand, the introduction of the prediction order, a parameter whose correct choice is crucial to the whole operation but which seems to be hard to pin down, and secondly the non-linearity of the processes, which, whilst leading to high resolution, also causes severe intensity distortions for spectra determined from low signal-to-noise data. This distortion, which, as shown by Hoch, is particularly severe for spectra containing signals of grossly different linewidths, is related to the statistical assumption of stationarity (see Appendix A), which is not applicable to damped sinusoids. It is only in the Linear Prediction sense that this approach is a "natural" one, because the typical FID, represented as a sum of damped sinusoids, is a possible solution of the difference equation (17), and it is then valid only if the forward and backward coefficients are not related by complex conjugation.
point is the observation
that, just as with the DFT,
the
assumption
behind all the AR methods is the existence of a data vector (c$) consisting of measurements taken at equal intervals of time At. If the time intervals are not rationally related then the DFT cannot be applied, neither can any of the methods considered in this section. If the intervals are equal but several measurements are missing then, as we have
DAVID S. STEPHENSON
550
seen for truncation, the DFT leads to gross distortion since we must assume that the missing elements of (dk) are zero. For AR methods using estimates of the autocorrelation matrix it should be possible to overcome
this difficulty, although the loss of several elements of (dk} will mean that the
estimates of R will be degraded leading to poor quality spectra. However, as we have seen, better spectra may be obtained by employing the data directly using Burg’s algorithm or a least-squares approach,
again it might be possible to extend such methods to disregard
missing data, but to the best of my knowledge this has not been attempted, although in the related technique of temporal Maximum Entropy (see Section 7.1) an iterative algorithm exists,@) regular
which, in principle, operates on non-uniformly measurement
two-dimensional
interval
experiments
is of little significance
sampled data.
The restriction to a
for one-dimensional
data,
but for
the ability to obtain well resolved spectra from non-regular
data would be advantageous. I
FIG. 12. 31P power spectra of human colon adenocarcinoma cells obtained from a 2048 point FID; the chemical shift scale is referred to 85% H3PO4. (a) DFT of original data. (b) DFT of original data after exponential multiplication: the noise is reduced but the lines are broadened. (c) DFT of the first 512 points of the FID after exponential multiplication. (d) Burg spectrum of the first 512 points of the FID. (e) Burg spectrum after applying a smoothing to the FID. (Upper traces in (d) and (e) are vertical expansions, and the roman numerals relate to species present in the sample.) (Reproduced from Ref. 67 with permission.)
Viti et al.(67) have used both the straightforward solution of the Yule-Walker equations and the Burg modification in the analysis of the 31P NMR signals of human cells. The coefficients {a_m} calculated using either algorithm are employed in eqn (21) to reproduce the power spectra. The main point of the work is the comparison of the two approaches with the DFT, together with a discussion of the selection of the correct prediction order M.
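Since the choice of M recurs throughout this discussion, a minimal numerical sketch of an FPE-based order search may be useful. This is my own illustration in Python, not the procedure of Viti et al.: the LP fit is a plain least-squares (covariance) solve, and the criterion is one common form of Akaike's final prediction error, residual power scaled by (K+M+1)/(K-M-1).

```python
import numpy as np

def fpe(x, order):
    """One common form of Akaike's final prediction error for a
    least-squares (covariance) forward LP fit of the given order."""
    x = np.asarray(x, dtype=complex)
    rows = np.array([x[k - order:k][::-1] for k in range(order, len(x))])
    rhs = x[order:]
    c, *_ = np.linalg.lstsq(rows, rhs, rcond=None)
    resid = np.mean(np.abs(rhs - rows @ c) ** 2)   # residual error power
    K = len(x)
    return resid * (K + order + 1) / (K - order - 1)
```

The order giving the smallest FPE over a range of trial orders is the suggested choice; as the text notes, in practice a somewhat larger order often gives better resolved spectra.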
As would be expected, the resolution of the DFT method is degraded as the number of points in a real FID with fairly good signal-to-noise ratio is reduced from 2048 to 512, as it is when a windowing function is applied to the original 2048 points to reduce the noise.
The Burg algorithm, however, applied to 2048 points reduces the noise considerably without at the same time causing a loss of resolution, and several humps in the Burg spectrum may be attributed to species known to be present in the sample, signals which are not obvious in the DFT spectrum. This is shown in Fig. 12, which is rather difficult to interpret since some signals are folded, but which does illustrate the power of this method. A comparison of Burg spectra for different data lengths K ranging from 4096 to 512 shows that resolution is lost as K decreases, but in all cases the resolution is higher than for the DFT.
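Burg's recursion itself (Section 6.1.4) is compact enough to sketch. The following Python fragment is a schematic implementation for a complex series, with the power spectrum evaluated as in eqn (21); the function and variable names are my own, and none of the numerical refinements discussed in the literature are attempted.

```python
import numpy as np

def burg(x, order):
    """Burg's recursion: AR coefficients {a_m} (a_0 = 1) for a complex series."""
    x = np.asarray(x, dtype=complex)
    f, b = x[1:].copy(), x[:-1].copy()   # forward / backward prediction errors
    a = np.array([1.0 + 0j])
    for _ in range(order):
        # reflection coefficient minimising the summed forward and backward error power
        k = -2.0 * np.vdot(b, f) / (np.vdot(f, f).real + np.vdot(b, b).real)
        a = np.append(a, 0.0) + k * np.conj(np.append(a, 0.0))[::-1]   # Levinson update
        f, b = (f + k * b)[1:], (b + np.conj(k) * f)[:-1]              # error update
    return a

def ar_spectrum(a, nfreq=2048):
    """AR power spectrum 1/|A(exp(i 2 pi f))|^2 on a normalised frequency grid."""
    f = np.linspace(-0.5, 0.5, nfreq, endpoint=False)
    A = np.exp(-2j * np.pi * np.outer(f, np.arange(len(a)))) @ a
    return f, 1.0 / np.abs(A) ** 2
```

For a noiseless complex exponential the order-1 reflection coefficient is exact and the single root of the prediction polynomial falls on the signal pole; for damped, noisy data the biases discussed in the text appear.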
The straight solution of the Yule-Walker equations is shown to lead to poorer spectra both for this first sample and for a sample having a much lower signal-to-noise ratio (note that the captions of Figures 5 and 6 of reference 67 should be exchanged; furthermore their reference to Hacneau and Quikk should presumably read "E. J. Hannan and B. G. Quinn, J. R. Statist. Soc. B41, 190 (1979)", even though these authors do not quote a BIC criterion).
This conclusion is in agreement with the rest of the literature dealing with AR spectral estimation. Viti et al.(67) also discuss the choice of the prediction order M. They compare several criteria (see Section 6.1.6) but conclude that the FPE is as good as any. In spite of this, the spectral resolution may be improved by increasing the order beyond that calculated using the FPE. The general conclusion is that Burg's method may be applied to produce good quality power spectra from low signal-to-noise ratio data, provided the emphasis is on the determination of the existence of peaks and not so much on their shape or intensity.

In a further paper(@') these same authors again consider 31P NMR signals and use Burg's method to calculate estimates of the power spectra. As in their first paper, only real data are treated. The main topic covered here is again the correct determination of the prediction order: the illustrations all show dramatically that better resolved spectra may be obtained by selecting a value of M greater than that suggested by the FPE, sometimes considerably greater. However, choosing an order close to that suggested by the FPE yields spectra having correct intensity ratios for high signal-to-noise FIDs. For signals containing a large amount of noise the situation is not so simple and the FPE is a poor test of the order. To offset the poor performance of the FPE, Viti et al. suggest filtering the FID before analysis to reduce the high frequency noise component, and this procedure seems to result in a better choice for M when applying the FPE. The conclusion of this paper seems to be that Burg's method may be applied to both noisy signals and short data records to produce spectra superior to those obtained by DFT. The crucial choice of the prediction order should be made with the help of the FPE, and this should be supported by filtering out the high frequency noise from the FID. These conclusions are illustrated in Fig. 13.

Ni and Scheraga(4') also have used Burg's method and demonstrate that well resolved phased spectra may be obtained free of truncation artefacts. The phased spectra were obtained by extrapolating the FID using forward linear prediction, eqn (17), and employing the relationship (27). Burg's method was modified to derive the reflection coefficient in eqn (46) using a weighted least-squares fit,(@4) proposed to counteract the defects of Burg's approach and which has been shown to be almost as good as the covariance least-squares algorithm. The authors of this modification, Kaveh and Lippert,(@4) attribute the line splittings and frequency shifting associated with Burg's method to its inapplicability to non-stationary signals. Significantly, Ni and Scheraga(4') also claim that AR modelling is inadequate for exponentially decaying FIDs in practical NMR spectroscopy.
In a series of papers Tang and Norris have presented not only methods for determining the coefficients {a_m} but also the means of their conversion into a phased spectrum with improved resolution. In their first paper(5s) the suggestion is made that the coefficients be found using the eigenanalysis methods of Section 6.1.7. The prediction order M is first selected, then the (K-M)×M data matrix is decomposed using a QRD(57,79) algorithm, and finally the strongest eigenvalues are retained and used to determine the M coefficients {a_m}. This method, which Tang and Norris call LPZ, is elaborated in a second paper, and they go on to suggest improving the resolution by evaluating the spectrum on a circle inside the unit circle. This, of course, results in sharper lines but does not increase the noise level or introduce truncation artefacts, as would be the case if the original FID were to be multiplied by an increasing exponential. Tang and Norris conclude that the LPZ method simultaneously enhances sensitivity and resolution.
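The geometric idea behind the inside-the-circle evaluation can be demonstrated with a toy AR(1) model; this sketch (my own, not Tang and Norris's LPZ formulation) simply evaluates the spectral function on a contour of radius r, which for r < 1 passes closer to the signal poles and so sharpens the lines.

```python
import numpy as np

def lp_spectrum_on_circle(a, r=1.0, nfreq=4096):
    """Evaluate 1/|A(z)|^2 on the circle z = r exp(i 2 pi f); choosing r < 1
    moves the evaluation contour towards the signal poles inside the unit circle."""
    f = np.linspace(-0.5, 0.5, nfreq, endpoint=False)
    z = r * np.exp(2j * np.pi * f)
    A = (z[:, None] ** (-np.arange(len(a)))) @ a   # A(z) = sum_m a_m z^{-m}
    return f, 1.0 / np.abs(A) ** 2
```

For a pole at radius 0.95, evaluating at r = 0.96 instead of r = 1 gives a much higher and narrower peak at the same frequency.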
Eigenvalue methods require large amounts of computer time and large amounts of storage, and Tang and Norris(4') have also employed the straightforward solution of the Yule-Walker equations to determine the prediction coefficients before calculating the spectrum using eqn (26). The one significant defect of the method is the assumption of stationarity, which leads to intensity distortions. To overcome this, Tang and Norris suggest applying the method iteratively. Example spectra of this method are shown in Fig. 11. In a recent paper(s5) these same authors extend the concept of the LPZ spectral function to allow for different summation ranges in the numerator and denominator of eqn (26). The advantages or disadvantages of this approximation of the spectrum X(z) as a quotient of two polynomials, which is known as a Padé rational function,(74) have not yet been
demonstrated.

6.2. Functional Models

The autoregressive method is a very general approach to use in order to fit a time series which we expect to consist of sinusoids. A better treatment would surely take this information into account and use it to good advantage. We are consequently brought to consider a set of models in which the expected functional form of the data is included in the analysis, and by choosing this approach we might expect to reduce the problems associated in the AR method with selection of the prediction order.

6.2.1. Damped Exponential Methods. The obvious model to choose for NMR data is a sum of complex exponentials, because this takes into account both the frequencies and widths of the individual spectral lines. Thus the kth element of an ideal noiseless FID would be given as

    x_k = Σ_{m=1}^{M} b_m z_m^k        (58)

where z_m = exp((u_m + i2πν_m)Δt), b_m is a complex amplitude, u_m is a damping factor, ν_m is a frequency and the sampling interval is Δt. The idea of modelling experimental data as a sum of exponentials goes back to Baron de Prony (see references 38, 40) in 1795, although several changes have been made since then. The original idea was to fit K real exponentials to K data points. The modern version seeks a least-squares solution with M < K exponentials and falls into three stages: firstly the prediction coefficients are determined, secondly these define a polynomial whose roots must be found, and thirdly these roots are used to determine the complex amplitudes.
To begin with, it can be shown that eqn (58) leads to a difference equation for the data similar to eqn (17). To see this we need to set up a polynomial in z having roots z_m, and then multiply this out:

    Q(z) = Π_{m=1}^{M} (z - z_m) = Σ_{m=0}^{M} a_m z^{M-m}        (59)

where a_0 = 1. Using eqn (58) we can write

    x_{k-m} = Σ_{n=1}^{M} b_n z_n^{k-m}
and multiplying this by a_m and summing over m gives

    Σ_{m=0}^{M} a_m x_{k-m} = Σ_{m=0}^{M} a_m Σ_{n=1}^{M} b_n z_n^{k-m}
                            = Σ_{n=1}^{M} b_n z_n^{k-M} Σ_{m=0}^{M} a_m z_n^{M-m}        (60)

We can see that the last summation in eqn (60) is the polynomial eqn (59) evaluated at its nth root and is thus zero, showing that

    x_k = - Σ_{m=1}^{M} a_m x_{k-m}
The next step is to consider the effect of noise, where the measured data are given by d_k = x_k + n_k, so that

    d_k = - Σ_{m=1}^{M} a_m x_{k-m} + n_k
        = - Σ_{m=1}^{M} a_m d_{k-m} + Σ_{m=0}^{M} a_m n_{k-m}        (61)
Equation (61) represents an important intermediate stage in the development of the method. If we compare it with eqn (18) for a general ARMA process we see that it forms a special case where the orders of the AR and MA branches are equal and, furthermore, the coefficients of both branches are identical. As pointed out earlier, this general treatment has found no application in NMR yet, and I shall not consider it any more except to note that one way to solve eqn (61) would be to vary the parameters a_m iteratively('*) to minimise the noise variance Σ|n_k|², although other methods have been mooted. To make things easier we can simplify eqn (61) somewhat by collecting the last summation into just one noise value

    e_k = Σ_{m=0}^{M} a_m n_{k-m}

so that

    d_k = - Σ_{m=1}^{M} a_m d_{k-m} + e_k        (62)
This is known as the extended Prony technique, and the coefficients a_m are chosen to minimise Σ|e_k|². A comparison of eqn (62) with eqn (17) shows that this is mathematically equivalent to an AR estimation approach, and the methods of Section 6.1 may be employed to determine the a_m. However, an important difference between eqns (62) and (17) should be noted. The innovation e_k of eqn (17) was taken to be a sample of white noise, and this feature, with its consequences for the autocorrelation function of e_k, was taken into account
in deriving the Yule-Walker equations. On the other hand, e_k in eqn (62) is no longer a sample of white noise and the previous derivation breaks down. Furthermore, eqn (62) is a forward prediction relationship, so that methods such as Burg's algorithm, which relates forward and backward prediction coefficients by complex conjugation as a result of the stationarity assumption, are not ideally suited to solve for {a_m}: the covariance method is the better choice. Leaving
aside for the moment the problem of determining {a_m}, the next step is to calculate the roots of the polynomial in eqn (59). This may be accomplished using standard techniques(87) and a program is listed in reference 38. This rooting procedure yields the frequencies ν_m and damping factors u_m, but a further step is needed to obtain the complex amplitudes b_m,
which contain the intensity and phase of each line. To recapitulate: the first step determines the LP parameters {a_m}, and the second step employs eqn (59) to find the roots {z_m} of the polynomial. The third and final step, which we come to now, uses these {z_m} to evaluate the {b_m} of eqn (58). Since we are dealing with a least-squares technique, eqn (58) is not exact and must be replaced by

    d_k = Σ_{m=1}^{M} b_m z_m^k + n_k

expressible in matrix form as

    d = Z b + n        (63)

where

    Z = ( 1          1          ...  1
          z_1        z_2        ...  z_M
          ...
          z_1^{K-1}  z_2^{K-1}  ...  z_M^{K-1} ),   b = (b_1, ..., b_M)^T,   n = (n_0, ..., n_{K-1})^T

A least-squares solution to eqn (63) which minimises the squared error n^H n is given by

    b = [Z^H Z]^{-1} Z^H d        (64)
which can be solved by a variety of techniques, for example Cholesky decomposition. Alternatively, SVD could be used to solve eqn (63), as it was to solve eqn (53), and this approach also allows a noise reduction step akin to that proposed in Section 6.1.7.

6.2.2. Kumaresan and Tufts's Approach. One can see that the above method will do a good job of determining the spectrum if the FID is noiseless, although it will be beset by the usual AR problem of order determination. If M is too low the spectrum will be distorted. If noise is introduced the problems increase, and unfortunately no satisfactory treatment is available for this situation. If M exponentials are fitted using M order parameters in the presence of noise, the estimates of the signal zeros will be biased, i.e. shifted from their true positions in the complex plane, the degree of bias being determined by the signal-to-noise ratio. To avoid bias we can increase the order of the prediction, but then extra noise zeros will be introduced. It was this problem which was tackled by Kumaresan and Tufts using a combination of the modified Prony method and the eigenanalysis methods of Section 6.1.7.
These authors have published several papers(89-93) dealing with the solution of the above problem. They start by treating the measured signal, represented by eqn (58), as being noiseless and then consider the addition of noise in an ad hoc fashion. In order to discriminate the exponentials from the noise, a large prediction order is chosen to allow the noise to be fitted, and a combination of forward and backward prediction is used to determine the roots {z_m}. Furthermore, SVD is employed as described in Section 6.1.7. As we have seen in Section 6.2.1,
forward prediction is expressed by the relationship

    Σ_{m=0}^{M} a_{f,m} x_{k-m} = 0

for which the characteristic polynomial

    Φ_f(z) = Σ_{m=0}^{M} a_{f,m} z^{M-m}        (65)

exists. Since we are dealing with decaying exponentials, the roots of Φ_f(z) all lie within the unit circle (see Appendix B). A backward predictor for these data can also be set up,

    Σ_{m=0}^{M} a_{b,m} x_{k-M+m} = 0

and for this a polynomial

    Φ_b(z) = Σ_{m=0}^{M} (a_{b,m})* z^{M-m}        (66)

can be formulated which, for forward decaying exponentials, has roots outside the unit circle.
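This root geometry can be reproduced numerically for the exact, noiseless case. In the sketch below (my own construction, with illustrative names) the forward and backward coefficients are found by least squares, and the backward polynomial is formed from the conjugated coefficients as in eqn (66), so that its roots fall at the conjugate reciprocal positions 1/z_m*: the same angle as z_m, but reciprocal radius.

```python
import numpy as np

def prediction_roots(x, order):
    """Least-squares forward and backward LP; returns the roots of the
    forward polynomial (eqn (65)) and of the conjugated backward
    polynomial (eqn (66))."""
    x = np.asarray(x, dtype=complex)
    K = len(x)
    # forward: x_k = -sum_m af_m x_{k-m}
    rows_f = np.array([x[k - order:k][::-1] for k in range(order, K)])
    af, *_ = np.linalg.lstsq(rows_f, -x[order:], rcond=None)
    # backward: x_k = -sum_m ab_m x_{k+m}
    rows_b = np.array([x[k + 1:k + order + 1] for k in range(K - order)])
    ab, *_ = np.linalg.lstsq(rows_b, -x[:K - order], rcond=None)
    zf = np.roots(np.concatenate(([1.0 + 0j], af)))
    zb = np.roots(np.concatenate(([1.0 + 0j], np.conj(ab))))
    return zf, zb
```

For noiseless damped exponentials the forward roots sit exactly on the signal poles inside the unit circle, and the conjugated backward roots sit outside it on the same radii.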
If the prediction order is chosen to be large, then many of these roots correspond to noise and the spectrum will show spurious lines, but a selection can be made on the behaviour of the roots of Φ_f(z) and Φ_b(z). The noise roots of both polynomials will lie predominantly within the unit circle, since the statistics of a stationary random process do not change on time reversal, whereas the exponential signal roots will occur in reciprocal positions along a common radius for Φ_f(z) and Φ_b(z). This is shown in Fig. 14 for the case of two damped exponentials, where the length of the data vector K has been chosen as 32. Figures 14a and 14b depict the positions of the roots of Φ_f(z) and Φ_b(z) respectively for a noiseless signal analysed with order 2. When noise is introduced, the estimates of the frequency (angular coordinate of the root) and damping factor (radial coordinate of the root) are disturbed, the degree of disturbance depending on the noise. To illustrate this, Figs 14c and 14d show the positions of the roots of Φ_f(z) and Φ_b(z) for many different realisations of white noise, again analysing the signal with M = 2. We see that the estimates are not very reliable: there are both frequency shifts and poor estimates of the damping. The noise can be accommodated by increasing the prediction order, in the case of Figs 14e and 14f to M = 8. There are many more extraneous zeros scattered within the unit circle (and some just outside it!) but the signal zeros are clustered where they belong. Figure 14f shows clearly that a good (but not perfect) discrimination is achieved. The backward prediction of the exponentials puts the signal roots outside the unit circle, and nearly always leaves the noise roots inside. To improve
the discrimination even more, the idea of SVD can be introduced into the determination of the forward and backward prediction coefficients, based on the eigenanalysis methods of Section 6.1.7. Taking eqn (17) for the forward prediction and eqn (43) for the backward, we can express these in matrix form as

    x_f = - X_f a_f + e_f   and   x_b = - X_b a_b + e_b

where

    X_f = ( x_{M-1}  x_{M-2}  ...  x_0
            x_M      x_{M-1}  ...  x_1
            ...
            x_{K-2}  x_{K-3}  ...  x_{K-M-1} )

    X_b = ( x_1      x_2      ...  x_M
            x_2      x_3      ...  x_{M+1}
            ...
            x_{K-M}  ...           x_{K-1} )

and the other vectors are defined by

    x_f = (x_M, ..., x_{K-1})^T,   x_b = (x_0, ..., x_{K-M-1})^T,
    a_f = (a_{f,1}, ..., a_{f,M})^T,   a_b = (a_{b,1}, ..., a_{b,M})^T
The data matrices can be decomposed using eqn (55) as

    X_f = Σ_{i=0}^{M-1} λ_i^f u_i^f (v_i^f)^H ,   X_b = Σ_{i=0}^{M-1} λ_i^b u_i^b (v_i^b)^H

the λ's being the singular values. If the signal consists of M′ exponentials in additive noise, then the M′ eigenvectors corresponding to the largest eigenvalues will provide a reduced rank approximation to X which corresponds to the exponentials alone, i.e.

    X′_f = Σ_{i=0}^{M′-1} λ_i^f u_i^f (v_i^f)^H ,   X′_b = Σ_{i=0}^{M′-1} λ_i^b u_i^b (v_i^b)^H        (67)

The least-squares solution for a will then be given by

    a_f = - (X′_f)# x_f ,   a_b = - (X′_b)# x_b

where

    (X′_f)# = Σ_{i=0}^{M′-1} (λ_i^f)^{-1} v_i^f (u_i^f)^H ,   (X′_b)# = Σ_{i=0}^{M′-1} (λ_i^b)^{-1} v_i^b (u_i^b)^H        (68)
Applying this technique to our previous example of two exponentials in noise, analysed with a prediction order M = 8 but selecting only the two largest eigenvalues, i.e. M′ = 2, yields the results shown in Figs 14g and 14h for forward and backward prediction respectively. Figure 14h shows dramatically a clear distinction between signal zeros clustered close to the true positions outside the unit circle and extraneous noise zeros which now all lie within the unit circle. A further improvement
advocated by Kumaresan and Tufts is the correction of the M′ singular values retained in eqns (67) and (68) by subtraction of the average of the noise singular values. This might be expected to reduce the residual bias in the damping factors of the signal zeros which can be seen in Fig. 14h, or more clearly in Fig. 15a, which again portrays the analysis of two exponentials in additive noise but this time with M = 24 and M′ = 2. The dots corresponding to the exponential with the larger damping tend to cluster on the inside of the true location, and by subtracting the average of the six noise singular values from the other two (note that the (K-M)×M data matrix has maximum rank K-M = 8, so that it has only eight singular values) Fig. 15b is obtained. The zeros now have a bias in the other direction, i.e. they now tend to lie on the outside of the true location. This overcorrection comes about because the signal power is proportional to the eigenvalues of the data matrix product X^H X, as shown in Section 6.1.7, which are the squares of the singular values, so that subtracting the mean of the squares of the noise singular values from the squares of the signal singular values(75gg*) results in the improved correction seen in Fig. 15c.
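The reduced-rank estimate of eqns (67) and (68) amounts to a truncated-SVD pseudoinverse. The following Python sketch (function and variable names are mine) applies it to the backward prediction equations and keeps only the roots outside the unit circle; the singular-value correction just described is omitted for brevity.

```python
import numpy as np

def kt_backward_roots(x, order, nsig):
    """Kumaresan-Tufts style reduced-rank least squares for the backward LP
    coefficients (cf. eqns (67)-(68)), then roots of the conjugated backward
    polynomial (cf. eqn (66)); signal roots should lie outside the unit circle."""
    x = np.asarray(x, dtype=complex)
    K = len(x)
    Xb = np.array([x[k + 1:k + order + 1] for k in range(K - order)])   # backward data matrix
    U, s, Vh = np.linalg.svd(Xb, full_matrices=False)
    # rank-nsig pseudoinverse: the discarded singular values span the noise subspace
    pinv = (Vh[:nsig].conj().T / s[:nsig]) @ U[:, :nsig].conj().T
    ab = -pinv @ x[:K - order]
    roots = np.roots(np.concatenate(([1.0 + 0j], np.conj(ab))))
    return roots[np.abs(roots) > 1.0]   # retain the signal roots only
```

With a large order the minimum-norm solution places the extraneous zeros inside the unit circle, so the simple magnitude test separates signal from noise; the frequencies are then the angles of the retained roots.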
FIG. 15. Locations of the zeros of Φ_b(z) for a signal with a data length of 32 consisting of two exponentials in additive noise. The true positions are indicated by crosses and the dots show the results of analysing the signal for 50 realisations of white noise. (a) SVD method with M = 24 and M′ = 2. (b) SVD with M = 24 and M′ = 2, but correction of the two main singular values by subtraction of the mean of the other 6. (c) SVD with M = 24 and M′ = 2, but correction of the two main singular values by subtraction from their squares of the mean of the squares of the other 6.
The distinction between signal roots and noise roots based on their relative positions with respect to the unit circle becomes difficult as the signal-to-noise ratio decreases. Even with backward prediction and the rank reduction represented by eqns (67) and (68), some signal roots may move inside the unit circle and some noise roots outside. Recently Delsuc et al.(95) have applied a method suggested by Porat and Friedlander(96) to overcome this problem. This approach uses both forward and backward prediction to give two sets of roots. Reflection of the backward roots inside the unit circle should make the signal roots superimposable, whereas the noise roots for backward prediction will now lie predominantly outside the unit circle. The distances between roots taken pairwise, one from each set, yield a criterion for distinguishing between signal and noise, signal roots being those whose separation is a minimum. This technique has the disadvantage that the number of lines present in the spectrum should be known or be estimated.
6.2.3. Avoiding Polynomial Rooting. The extended Prony method is an elegant way of analysing NMR data; in particular the treatment of noise using eigenvalue analysis, which was introduced into magnetic resonance by Barkhuijsen et al.,(75) is worth noting. The approach described in the preceding subsections suffers from one main computational problem: the central step, namely the polynomial rooting, is slow. Remember that Prony's method consists of three main stages, the first being determination of the prediction coefficients, which may be accomplished using an efficient LP algorithm if no eigenanalysis is desired, or by an SVD attack on the data matrix or diagonalisation of the data matrix product if an eigenanalysis is suggested. The resultant prediction coefficients are then used in eqns (65) or (66) to determine the roots {z_m}, which may then be used in the third stage, represented by eqn (64), to produce the complex amplitudes b. Recently Barkhuijsen et al.(97) have proposed an alternative which obviates the polynomial rooting step and which I shall now describe. We start
from the basic description of the signal as a sum of complex exponentials, eqn (58), and note that this can be written in matrix form as

    x = Z b
where

    Z = ( 1          1          ...  1
          z_1        z_2        ...  z_M
          ...
          z_1^{K-1}  z_2^{K-1}  ...  z_M^{K-1} ),   x = (x_0, ..., x_{K-1})^T,   b = (b_1, ..., b_M)^T

Note the similarity to eqn (63). Next we build a data matrix X, defined as

    X = ( x_0        x_1   ...  x_{M-1}
          x_1        x_2   ...  x_M
          ...
          x_{K-M-1}  ...        x_{K-2} )

and see that

    X = ( 1            1            ...  1
          z_1          z_2          ...  z_M
          ...
          z_1^{K-M-1}  z_2^{K-M-1}  ...  z_M^{K-M-1} ) B

or

    X = Z B        (69)

where Z has the special form of a Vandermonde matrix, i.e. the elements of Z are given by

    Z_{km} = z_m^{k-1}
and it is the base parameters {z_m} in which we are interested. We are faced with two problems if we wish to find {z_m}: the data matrix must be decomposed into a product such as eqn (69), and then {z_m} must be extracted from the matrix Z. We have seen that SVD will decompose a matrix such as X into a product of three matrices, eqn (55), i.e.

    X = U Λ V^H

and we can, without loss of generality, include the diagonal matrix Λ in V to give

    X = U V′^H

Comparing this with eqn (69) we see that the form is similar, but there is no justification for setting U = Z. Instead, eqn (69) must be generalised by including an arbitrary square matrix Q with dimension M×M, to give

    X = Z Q Q^{-1} B

so we can write

    U = Z Q
A simple relationship allows us to extract {z_m}. First we define the matrix Z_b as that matrix with column dimension K-M-1 formed from Z by removing the bottom row, and we define Z_t as that matrix formed from Z by removing the top row, i.e.

    Z_b = ( 1            ...  1
            z_1          ...  z_M
            ...
            z_1^{K-M-2}  ...  z_M^{K-M-2} ),   Z_t = ( z_1          ...  z_M
                                                       ...
                                                       z_1^{K-M-1}  ...  z_M^{K-M-1} )

We can see that Z_b and Z_t are related by

    Z_t = Z_b Λ_z

where Λ_z is the diagonal matrix

    Λ_z = diag(z_1, z_2, ..., z_M)

Then it is easy to see that

    U_b = Z_b Q   and   U_t = Z_t Q

where U_b and U_t are formed from U by removing the bottom and top rows respectively, so that

    U_t = U_b Q^{-1} Λ_z Q        (70)

and our task reduces to finding Q^{-1} Λ_z Q and then diagonalising it to determine {z_m}. The solution to eqn (70) is the well known least-squares solution given by

    Q^{-1} Λ_z Q = (U_b^H U_b)^{-1} U_b^H U_t        (71)
The matrices on the right hand side of eqn (71) are all known from the SVD treatment of the data matrix X, and it suffices to carry out the multiplications indicated in eqn (71) and then diagonalise the resultant M×M matrix, whose eigenvalues are the sought parameters {z_m}.
As Barkhuijsen et al.(97) point out, the inversion of the matrix product (U_b^H U_b) in eqn (71) may be simplified by noting that U is unitary, thus (U^H U) = E, the unit matrix, and since U_b is formed from U by removing the bottom row, which we can denote as the row vector u_b^H, then

    U_b^H U_b = U^H U - u_b u_b^H = E - u_b u_b^H

so that, using the matrix inversion lemma, which for matrices A(n×n), B(n×m), C(m×m) and D(m×n) states that

    (A + B C D)^{-1} = A^{-1} - A^{-1} B (D A^{-1} B + C^{-1})^{-1} D A^{-1}

we get

    (U_b^H U_b)^{-1} = E + u_b u_b^H / (1 - u_b^H u_b)

so that

    Q^{-1} Λ_z Q = U_b^H U_t + u_b u_b^H U_b^H U_t / (1 - u_b^H u_b)

The eigenvalues of this are the sought poles. Barkhuijsen et al.,(97) who have named this method HSVD because the data matrix X has Hankel structure, then go on to determine the amplitudes b_m by the normal least-squares method represented by eqn (64).
6.2.4. Summary of Functional Modelling. We started out by considering the FID as composed of a sum of complex exponentials, eqn (58). The solution of this problem leads to an approach known as the extended Prony method, which consists of three basic steps. Firstly the linear prediction parameters must be determined, secondly the roots of the polynomial (59) must be found, and thirdly these roots are used in a final least-squares procedure to evaluate the complex amplitudes b_m in eqn (58).

The first step can be accomplished in a variety of ways. The most obvious is to use one of the methods described in Section 6.1 to solve eqn (62) for {a_m}. This has been applied by Barone et al.,(**) who used the Burg method to evaluate the prediction coefficients. The advantages of this approach are the availability of well-defined criteria (Section 6.1.5) to determine the optimal order M. These authors use the FPE to select a prediction order small (M ≪ K) in relation to the data length, and thus the noise should be eliminated in the first stage. However, the assumption of stationarity inherent in the Burg method severely curtails the efficiency of this approach, and a superior way might be to apply the covariance least-squares algorithm of Section 6.1.5. The prediction coefficients may also be obtained by solving eqn (48) directly by Cholesky decomposition. This mode of attack has been chosen by Gesmar and Led,(@O) who solve for the backward prediction coefficients which, as shown in Section 6.2.2, allow the discrimination of the signal and noise roots. In this case the order of the prediction M is chosen to be large to accommodate the noise, which is then eliminated at the end of the second stage. This again is not an optimum approach, for an expensive computational procedure has been employed. The eigenvalue methods also allow the selection of a very large prediction order (M ≈ 3K/4) and noise elimination in both the first stage, using rank reduction (Section 6.1.7), and the second stage, by employing backward prediction. This combination is the method
Linear prediction and maximum propounded
by Kumaresan
formed the data matrix eigenvector
matrix.
and Tufts,
product XXH
entropy
563
and this was used by Barkhuijsen and diigonalised
et zJ.(‘~) who
it to yield the eigenvalues and an
This step was followed by selection of the most significant eigenvalues
(i.e. rank reduction of XXH) before completion of the calculation of (a,}. Whereas Barkhuijsen et al. (‘*) who named their method LPSVD employed tridiagonalisation of the product XXH followed by bisection(“) to perform the eigenanalysis, Tang et aI.(1oo) used a QR procedure, c5’) claiming it to be more efficient. direct
SVD
calculation
of the data
matrix
of the product
possible if M=3W4,
is employed.
As
2CXH followed by diagonalisation
particularly
This is the case only if
Barkhuijsen
as the elements of the XXH
et a1.(‘*) have
is the most matrix
shown,
efficient way
may be calculated
recursively.(s8~80~g8) The second stage of the analysis involves finding the roots of the polynomial (59).
This
is accomplished by Barkhuijsen et d.(75) and by Tang et aL(‘O”) using a routine by Steiglitz and Dickinson.(“‘) Gesmar and Led have used another from Svejgard(‘02) to solve for the roots, whilst Barone et al.(**) recast this as an eigenvalue problem. The first and second stages may be elegantly combined to avoid the time consuming rooting as shown by Barkhuijsen et al., cg7)which method is described in Section 6.2.3. The final stage of the extended Prony method these in a least-squares calculation eqn (64) to This is done by Barone et al.(**) using a SVD, the routine of Beppu and Ninomiya egg) for real
involves taking the roots {z,} and using determine the complex amplitudes (b,}. whereas Barkhuijsen
et al.
data or take a standard from the IBM Scientific Subroutine Package for complex data.
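The three stages just summarised can be strung together in a few lines. The sketch below is my own toy pipeline, not any of the cited implementations: a covariance least-squares solve for the coefficients, NumPy's companion-matrix rooting in place of the cited rooting routines, and a Vandermonde least-squares fit for the amplitudes.

```python
import numpy as np

def prony(x, order):
    """Extended Prony in three stages: LP fit, polynomial rooting, amplitude fit."""
    x = np.asarray(x, dtype=complex)
    # stage 1: least-squares (covariance) forward LP coefficients, cf. eqn (62)
    rows = np.array([x[k - order:k][::-1] for k in range(order, len(x))])
    a, *_ = np.linalg.lstsq(rows, -x[order:], rcond=None)
    # stage 2: roots of the characteristic polynomial, cf. eqn (59)
    z = np.roots(np.concatenate(([1.0 + 0j], a)))
    # stage 3: amplitudes from the Vandermonde system, cf. eqn (64)
    Zmat = z[None, :] ** np.arange(len(x))[:, None]
    b, *_ = np.linalg.lstsq(Zmat, x, rcond=None)
    return z, b
```

For a noiseless two-line FID with the order set to the true number of exponentials, all three stages are exact; with noise, the order-selection and subspace techniques described above become necessary.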
FIG. 16. 31P spectra of a mouse. (a) Real FFT of 768 point FID after optimal filtering and setting the initial three points to zero. (b) LPSVD spectrum obtained from points 4 to 128 of the original FID. (Reproduced from Ref. 75 with permission.)

Barkhuijsen et al. have applied their LPSVD method both to synthetic time data(75) consisting of two exponentials in noise as well as to 31P NMR data.(75,94) The synthetic data are used to show that for high signal-to-noise ratios two closely spaced Lorentzian lines with different linewidths and intensities can be resolved, the spectrum being presented in tabular form as a list of parameters {b_m} and {z_m}. Noise rejection here is made by selecting only a very few significant singular values, which in this case is easy because of the high signal-to-noise ratio. For the 31P investigation Barkhuijsen et al. studied a mouse! Because of experimental difficulties the total number of data points was limited and the first few points of the FID unreliable. This sort of problem results in severe artefacts if the DFT
method is employed. Using only K = 125 points and setting M = 88, the 37 singular values (the number of non-zero singular values equals min(M, K - M)) could be
divided into significant and non-significant by scrutiny, yielding a reduced rank of M′ = 9. The resulting spectrum looks nicer than the DFT of the original data; in particular the rolling baseline has been removed (Fig. 16). The ability of LPSVD to extract the correct frequencies, phases and damping factors, even when the initial part of the time signal is missing, is demonstrated by these authors for an electron spin echo signal. Furthermore, they
show that
by manipulating
the damping
factor
an
improved
resolution
may
be
obtained. One difficulty becomes apparent in this study, namely found by LPSVD
the question of whether a signal
is “really there” or whether it appears by some happy chance of noise.
A method for solving this problem is to take many realisations of the noise and determine estimates of the standard deviations of the signals found. The roots whose amplitudes lie below the standard deviation could then be rejected. This unrealistic approach to estimating the variance of a parameter can be avoided by using the Cramer - Rao inequality which sets a lower bound on the variance,@*Jos) and whose application is described both by Kumaresan and TuRS(~~) and by Barkhuijsen .et al. (‘*I These latter authors compare this statistical technique with the aforementioned usage of many noise realisations and conclude that the Cramer - Rao theory is justifmble and provides a means of determining whether peaks determined
by LPSVD are significant; those whose amplitudes are less than the calculated Cramér-Rao lower bound are rejected.
Tang et al.(100) have used their LPQRD method in the spectral analysis of synthetic data consisting of noisy damped sinusoids. They demonstrate the very good resolution characteristics of Prony's method for high signal-to-noise ratios, and particularly that small signals may accurately be determined in the presence of large signals. Delsuc et al.(95) have treated data with low signal-to-noise ratios using the Porat and Friedlander modification described at the end of Section 6.2.2, and have compared these results with those obtained using DFT, Maximum Entropy (Section 7.1) and LPSVD. They show that all the modern methods are capable of producing better spectra than the DFT (Fig. 17), but that Maximum Entropy, whilst determining the line positions well, produces many extraneous signals at low signal-to-noise ratios. In contrast, LPSVD gives a clean spectrum but fails to find all the signals, since the high noise level has moved some of the signal roots inside the unit circle.
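The root-selection step underlying this behaviour can be sketched in a few lines. The snippet below is illustrative only: it fits forward linear-prediction coefficients to an invented, noiseless pair of damped complex exponentials and classifies the roots of the prediction polynomial by their distance from the unit circle. All parameter values are hypothetical, and the SVD-truncated backward-prediction machinery of real LPSVD (where decaying signal roots are reflected outside the unit circle and noise roots stay inside) is omitted.

```python
import numpy as np

K, M = 64, 2                                   # data length, prediction order (invented)
n = np.arange(K)
z_true = np.array([0.95 * np.exp(2j * np.pi * 0.10),
                   0.90 * np.exp(2j * np.pi * 0.22)])
x = 1.0 * z_true[0]**n + 0.5 * z_true[1]**n    # noiseless sum of damped exponentials

# forward prediction: x[k] = a_1 x[k-1] + ... + a_M x[k-M]
A = np.array([[x[k - m] for m in range(1, M + 1)] for k in range(M, K)])
rhs = x[M:]
a, *_ = np.linalg.lstsq(A, rhs, rcond=None)

# characteristic polynomial z^M - a_1 z^{M-1} - ... - a_M; its roots are the z_n
roots = np.roots(np.concatenate(([1.0], -a)))
signal_roots = roots[np.abs(roots) < 1.0]      # decaying components (forward LP)
```

With noiseless data the recovered roots match z_true essentially exactly; with noise, extra roots appear and the unit-circle test (plus amplitude thresholds, as discussed below) does the sorting.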
As a compromise, the modified LPSVD method of Delsuc et al. produces lines near their correct positions, at the cost of having to be told how many lines are present in the spectrum.
FIG. 18. 13C spectra of porcine insulin. (a) FFT of an 8192 point FID. (b) The same FID treated by Prony's method; 2400 backward prediction coefficients were calculated, of which 124 signal roots lay outside the unit circle. Intensities smaller than three times the corresponding standard deviations were ignored, as were signals whose phases are inconsistent with the rest of the spectrum. (c) FFT of the residual between the original FID and the Prony-reconstructed FID, showing only noise. Starred signals have intensity only slightly higher than the corresponding standard deviations. (Reproduced with permission.)
Gesmar and Led have applied Prony's method to synthetic data consisting of damped sinusoids in noise, to 1H NMR data of a heptapeptide and to the 13C FID of insulin. The LP treatment of the high signal-to-noise ratio synthetic data yields a higher resolution and,
of course, a cleaner spectrum than the DFT. These authors show that the quality of the spectrum improves as the number of prediction coefficients is increased, although this only applies in high signal-to-noise situations. The LP calculation using 2400 prediction coefficients on 8192 data points of a high signal-to-noise 1H FID of a heptapeptide gives a result almost identical to the DFT. More interesting is the application to a low signal-to-noise 13C FID of insulin. Again 2400 coefficients were used on 8192 data points, and 124 signal roots were found to lie outside the unit circle.
To improve the rejection of extraneous signals, only those with intensity greater than three times the corresponding standard deviations were selected, and signals having phases not compatible with neighbouring signals were also eliminated.
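These two rejection criteria can be sketched as below. The component amplitudes, standard deviations and phases are invented, and the phase test (comparison against an amplitude-weighted mean phase with an arbitrary 0.5 rad tolerance) is only one plausible reading of "phase compatibility":

```python
import numpy as np

amps   = np.array([12.0, 0.4, 5.0, 2.9, 0.2])      # |b_n| of five fitted components
sigmas = np.array([ 0.3, 0.3, 0.3, 1.2, 0.3])      # estimated std. dev. of each amplitude
phases = np.array([ 0.05, 1.9, -0.02, 0.04, 2.8])  # phases in radians

strong = amps > 3.0 * sigmas                       # three-sigma intensity criterion

# phase consistency: compare each phase with the amplitude-weighted mean phase
mean_phase = np.angle(np.sum(amps * np.exp(1j * phases)))
coherent = np.abs(np.angle(np.exp(1j * (phases - mean_phase)))) < 0.5

keep = strong & coherent                           # components retained in the spectrum
```

Here only the two dominant, correctly phased components survive; the weak and out-of-phase ones are discarded.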
The results shown by Gesmar and Led (Fig. 18) demonstrate the superiority of their LP approach over DFT. The LP spectrum has good resolution, and particularly noticeable is the distinguishability of signals which in the DFT spectrum lie beneath the noise. The only drawback seems to be the enormous computing time needed. Barone et al. have used the straightforward Prony method, with Burg's procedure for the prediction coefficients, to analyse 31P signals with good and moderately good signal-to-noise ratios.
For high signal-to-noise ratios the agreement between DFT and Prony is excellent. For moderately good signal-to-noise ratios the Prony method yields a better looking spectrum than the DFT, whilst at the same time returning all the relevant features. A final test of the Prony method was made by Barone et al. on a single line spectrum at various signal-to-noise ratios. As this ratio decreases from 24:1 down to 2:1 the line position remains unaltered, but the line intensity and particularly the linewidth are degraded.

LPSVD, HSVD and their siblings have the advantage that the spectrum may be presented in tabular form. This enables the extraction of Lorentzian lines having vastly different linewidths and intensities, as shown by Barkhuijsen et al., and is ideally suited to allow further processing of the spectrum by a program designed to find correlations in lists of peak positions. However, this method does have the drawback that the assumption of Lorentzian lines together with a non-Lorentzian experimental lineshape leads to numerous components having differing phases and intensities for each spectral feature. The main thrust of Prony's method is the determination firstly of the frequencies and damping factors, followed by their usage to calculate the amplitudes {b_n}. If the frequencies and damping factors are known beforehand, then the least-squares eqn (64) may be used to find b from the data d, since the matrix Z is also known.
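A minimal sketch of this last point, with an invented two-component signal: assuming the z_n (and hence the matrix Z, with Z[k, n] = z_n^k) are known exactly, the amplitude estimate of eqn (64) reduces to an ordinary complex least-squares solve.

```python
import numpy as np

K = 128
n = np.arange(K)
z_known = np.array([0.97 * np.exp(2j * np.pi * 0.08),   # hypothetical known poles
                    0.93 * np.exp(2j * np.pi * 0.31)])
b_true = np.array([2.0 + 0.0j, 0.3 + 0.3j])             # hypothetical amplitudes

Z = z_known[np.newaxis, :] ** n[:, np.newaxis]          # Z[k, m] = z_m ** k
rng = np.random.default_rng(0)
noise = 0.01 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
d = Z @ b_true + noise                                  # simulated FID

b_hat, *_ = np.linalg.lstsq(Z, d, rcond=None)           # least-squares amplitudes
```

Because the problem is linear once Z is fixed, the amplitude estimates remain reliable even at noise levels where a full nonlinear Prony fit would struggle, which is the point exploited in the clinical application below.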
This has been put to use by Noorbehesht et al.(104), who have analysed the 31P spectra of a rabbit. The frequencies and linewidths of lines corresponding to known species were available, and the least-squares fit allowed reliable estimates of the amplitudes at low signal-to-noise ratios.
Of course, this is a partial spectral analysis only, since significant assumptions must be made about the position and shape of the spectral lines. However, within these limitations, the method does seem to be of value in clinical applications.

6.3. Extension to Two Dimensions
Because the 2D Fourier transform is separable (Section 4.1), the spectral analysis may proceed in two distinct stages and there is no need to employ the same method in both. Specifically, since the t2 dimension usually has sufficient points for good resolution, a 1D DFT could be applied here and a high resolution modelling method used afterwards in the t1 dimension. This leads to the concept of a "hybrid 2D spectral estimator" for which the theory of the preceding sections is adequate. An alternative approach would be to employ a high resolution method initially in t1 and
to use the results to extrapolate the time data in t1 to such an extent that normal 2D Fourier transformation could be applied. This extrapolation is most probably a better alternative to the usual zero-filling.

Hoch has applied the first method, i.e. 1D DFT in t2 followed by high resolution estimation in t1, to COSY data of tryptophan in D2O. He chose Burg's algorithm, yielding a power spectrum, to process 256 complex points in the t1 domain obtained after Fourier transforming the t2 domain with a data length of 1024; the prediction order was set at 100. This hybrid spectrum was compared with 2D Fourier transformation using 256 or 512 points in the t1 domain, and all spectra were symmetrised. The hybrid spectrum shows superior resolution, superior noise rejection, and is devoid of the sidelobe artefacts which are present in the 256×1024 2D DFT spectrum. In particular, the broad tail associated with the residual water signal is successfully suppressed by this method.

Schussheim and Cowburn have employed LPSVD(75) as the high resolution estimator in the t1 domain, having first used DFT in t2. This application of the extended Prony method yields lists of signal parameters
{b_n} and {z_n} for each value of t2, and these lists can be scanned for correlations with the purpose of selecting only the "true" signals and avoiding artefacts. These authors demonstrate the improved resolution and elimination of truncation artefacts available with LPSVD. Furthermore, the phase problems associated with t1 phase modulation may be eliminated.
It would also be possible to employ a high resolution estimator in both dimensions, and this has been done by Barkhuijsen et al. for 2D electron spin echo data. However, this still makes use of the separability of the 2D Fourier transform, and is a far cry from a full 2D LP analysis, which has not yet been attempted in NMR. Details of 2D spectral estimation in general are given in the article by McClellan(106) and in the standard text of Dudgeon and Mersereau, as well as by Marple, who shows that 2D Yule-Walker equations may be set up and solved using a fast algorithm. Furthermore, the covariance and modified covariance least-squares methods of Section 6.1.5 may also be extended to two dimensions.(108) Functional modelling in the sense of Prony's method has also been suggested(109) for 2D spectral estimation. Manassen
et al. have proposed a method similar to that of Noorbehesht et al.(104) for dealing with 2D spectra. The amplitudes b are to be found by solving the least-squares eqn (64) in the t1 dimension, after having determined the frequencies and damping factors in the t2 dimension by Fourier transformation and spectral fitting. Cross sections perpendicular to peaks of interest in the one-dimensional spectrum may then be calculated from a reduced data set to yield spectra similar in quality to those obtained by 2D FT.
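The hybrid scheme discussed above (ordinary DFT along the well-sampled t2 axis, a high-resolution method along the short t1 axis) can be sketched with a textbook Burg recursion standing in for the t1 estimator. Everything below is invented for illustration: the array sizes, the model order, and the single undamped test resonance at (f1, f2) = (0.20, 0.25).

```python
import numpy as np

def arburg(x, order):
    """Burg's recursion: AR coefficients a_1..a_p of x[n] + sum_m a_m x[n-m] = e[n]."""
    f = np.asarray(x, dtype=complex).copy()          # forward prediction errors
    b = f.copy()                                     # backward prediction errors
    a = np.zeros(order, dtype=complex)
    for m in range(order):
        fm = f[m + 1:].copy()
        bm = b[m:-1].copy()
        k = -2.0 * np.sum(fm * bm.conj()) / (np.sum(np.abs(fm)**2)
                                             + np.sum(np.abs(bm)**2))
        f[m + 1:] = fm + k * bm                      # update error sequences
        b[m + 1:] = bm + np.conj(k) * fm
        a[:m] = a[:m] + k * a[:m][::-1].conj()       # Levinson-style coefficient update
        a[m] = k
    return a

def ar_power_spectrum(a, nfft):
    """AR power spectrum up to a constant: 1 / |1 + sum_m a_m e^{-i 2 pi f m}|^2."""
    denom = np.fft.fft(np.concatenate(([1.0 + 0j], a)), nfft)
    return 1.0 / np.abs(denom)**2

# invented 2D FID: 32 points in t1 (short), 256 in t2 (well sampled)
rng = np.random.default_rng(1)
t1 = np.arange(32)[:, None]
t2 = np.arange(256)[None, :]
fid = (np.exp(2j * np.pi * (0.20 * t1 + 0.25 * t2))
       + 0.02 * (rng.standard_normal((32, 256))
                 + 1j * rng.standard_normal((32, 256))))

s_t2 = np.fft.fft(fid, axis=1)                       # stage 1: DFT along t2
spec = np.array([ar_power_spectrum(arburg(s_t2[:, k], 6), 128)
                 for k in range(256)]).T             # stage 2: Burg along t1

i1, i2 = np.unravel_index(np.argmax(spec), spec.shape)
# peak expected near f1 = 0.20 (bin ~26 of 128) in the f2 = 0.25 column (bin 64)
```

Only 32 t1 points are used, yet the Burg stage places the f1 peak sharply, which is the motivation for the hybrid estimator when t1 sampling is expensive.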
7. SPECTRAL AND TEMPORAL DISCRIMINATING FUNCTIONS
With the methods of the previous section we sought to parameterise the time signal in order to convert the large number of individual data into a handy amount which contains all the relevant information. This compressed form could then be used to determine the spectrum. In this section I consider methods which use a basically different approach to seek a unique spectrum {X_j}, j = 0, ..., J − 1, which is the frequency domain representation of the data {d_k}, k = 0, ..., K − 1. The fundamental idea is to select that spectrum which best fits the data. This apparently simple idea is fraught with dangers, for our problem now lies with the definition of "best". One difficulty is emphasised by my choice of "X" for the spectrum, which should be noiseless, and of "d" for the data, which are noise corrupted.
The ideal spectrum should be noiseless even if the data are noisy. Furthermore the resolution of the spectrum, represented ultimately by the number of spectral points J, should not be limited by the length of the data K, so that truncation artefacts of the type demonstrated in Section 4.2 should not be evident. This means that J is not necessarily equal to K.

Let us start by considering the type of spectrum we want.
Normally the DFT produces a complex spectrum whose real and imaginary parts are combined by phase correction to give a real spectrum. If we take the real spectrum {X_j} as our goal, then the corresponding time domain signal is

x_k = Σ_{j=0}^{J−1} X_j exp(i2πjk/J),    (72)

where the index j is summed from 0 to J − 1 because there are J spectral points, and where the factor J^{−1} (or K^{−1} in eqn (14)) has been dropped for convenience. Since {X_j} is real, whereas {x_k} is in general complex, we have the relationship

x_k = (x_{J−k})*,    (73)
pointed out in Section 4.2. Thus there are only J/2 + 1 independent values of x_k in the time domain: the two real values x_0 and x_{J/2}, together with the J/2 − 1 complex values x_k for k = 1, ..., J/2 − 1. How do we compare the {x_k}, k = 0, ..., J/2, with the measured data {d_k}, k = 0, ..., K − 1?

(i) The first case to consider is where the data are noiseless and where K ≥ J/2 + 1. Obviously the problem is either overdetermined or exactly determined and only the values of d_k from k = 0 to J/2 need be used, and since these data are noiseless the best procedure is to set x_k equal to d_k for k = 0, ..., J/2.
If this is done then we can see that

x_k = d_k = Σ_{j=0}^{J−1} X_j exp(i2πjk/J),    k = 0, ..., J/2,

and thus by using the orthogonality relations eqn (13) and the Hermiticity eqn (73) we arrive at

X_j = J^{−1} { d_0 + (−1)^j d_{J/2} + 2 Re[ Σ_{k=1}^{J/2−1} d_k exp(−i2πjk/J) ] },    j = 0, ..., J − 1.    (74)
This little exercise in formula fiddling has shown that the real spectral elements X_j are fully determined by the data and, furthermore, that we can obtain J real spectral points from J/2 complex data points by zero filling to a data length of J and using a J-point DFT. This should be no surprise, since it is the standard everyday tool of NMR spectroscopists; J real pieces of data in the time domain determine J real pieces of data in the frequency domain.

(ii) Things get a bit more interesting if we consider a case where J/2 + 1 > K, for now the vector {x_k}, k = 0, ..., J/2, is longer than the vector {d_k}, k = 0, ..., K − 1. Certainly we can compare the two vectors up to the element d_{K−1}, but there are no measured elements d_k for k > K − 1, so eqn (74) cannot be applied. The spectrum is underdetermined and an infinity of spectra will all equally well fit the data x_k = d_k for k = 0, ..., K − 1. These differ only in their values of x_k for k ≥ K. Let us look at the situation where the spectrum
contains one extra point, i.e. J/2 = K. In a very general manner we can think of the compatible spectra as corresponding to the points on a line in "frequency space", shown in Fig. 19a. Obviously we must select just one of the spectra as representative, and the question is: how?
The solution is to apply a discriminating function D(X_j) and to select that spectrum which corresponds to a maximum or minimum of D along the line of compatible spectra. The function must be chosen with care, but many possibilities are available. This situation may be generalised to cases where J/2 = K + n, n > 0, and the above considerations remain valid.
(iii) Another case of importance is when J/2 + 1 = K but the data are no longer noiseless. If a spectrum is sought which fits the data exactly, then the noise in the time domain will be carried over into the frequency domain. A better approach is to seek a spectrum which fits the data within the expectation value of the noise. This can be expressed easily in terms of a fit value

C = Σ_{k=0}^{K−1} |x_k − d_k|²    (75)

and C should be Kσ², where σ is the standard deviation of the noise. This problem is also underdetermined, and the range of compatible spectra is illustrated in Fig. 19b as a closed curve in frequency space.
The hatched area within the curve corresponds to spectra which fit the data better than would be expected taking the noise into account, i.e. the noise is, for these spectra, carried over from the time domain to the frequency domain, one of these spectra being an exact fit.
But from those spectra which lie on the curve one must again be selected, and the procedure is the same: to choose that spectrum with a maximum or minimum of some discriminating function D(X_j) along the curve.

(iv) A final case obtains when both J/2 + 1 > K and the data are noisy. This is illustrated in Fig. 19c, where the line AB has now expanded to a "sausage". The filling
corresponds to spectra having C < Kσ², and the skin to the set of spectra from which we have to make a choice by application of the discriminating function. In the above three cases illustrating underdetermination of the spectrum, the discriminating function was applied in the frequency domain. But the selection procedure can just as well be carried out in the time domain, since {X_j} and {x_k} are related by the Fourier transform. Applying such a discriminating function D(x_k) leads us to a situation analogous to the above, and the pictures in Fig. 19 are equally valid. In general the method of approach is clear: we firstly introduce sloppiness, or to be more precise, we accept that our experiment is imperfect, and then, to ameliorate the consequences of having too many solutions, we choose just one by seeking the extremum of a discriminating function in the frequency or time domain.

7.1. Maximum Entropy
I start the discussion of discriminating functions by considering that function which has been most widely published. In order to do justice to this concept of entropy it is necessary to apply some thought to the meanings of probability and uncertainty, a rather nebulous area where philosophy and mathematics merge. As a handy guide through the morass I recommend Papoulis' monograph, which contains an excellent treatment of entropy. The various works of Jaynes, particularly references 111, 112 and 113, are also of inestimable value. There are two forms of the Maximum Entropy (ME) method currently in use. The one which I regard as "true" ME, and which uses the entropy of the spectrum as the discriminating function, will be dealt with first. The second, which uses the entropy rate of the time series, leads to equations identical to those of the AR/LP method in 1D spectral analysis, and I include it merely to clear up the confusion on this point.

7.1.1. Probability Density of One Random Variable. In this section and in Section 7.1.2 I shall deal with the development of an entropy discriminating function in the frequency domain. Let us consider a random variable x which can assume any one of a set of numbers as a result of the outcome of an experiment. The standard introductory example is the throwing of a six-faced die, for which only six outcomes are possible, and for each outcome we can assign to x the number of dots showing on the uppermost face. If we are gambling with such a die it is vital for us to estimate which face will come up next. We need to know the probabilities associated with each of the six possible outcomes of one throw, for we would then surely bet on the outcome with the highest probability.
FIG. 20. Probability densities for the random variable x, which can take on the values 1-6, the number of dots showing on a die. (a) The probability that x = 5 is almost one; the associated entropy is almost zero (logarithms to base e). (b) The probability that x = 5 is almost zero. (c) All probabilities are equal; the entropy reaches its maximum value of log 6.
The probabilities could be estimated by throwing the die N times and seeing how many times N_i a particular face i appeared. The probability p_i is then approximated by p_i = N_i/N. This is a rather laborious procedure, and is inappropriate in many cases. For instance, if the random variable is no longer the number of dots on a die but is the frequency ν_i in the experiment of Section 2, and if X_i is interpreted as the probability of occurrence of this frequency, that is we regard Fig. 5a as a probability density, then the corresponding experiment is a discrete CW measurement of the individual X_i. Let us, however, remain with our six-faced die for the moment. Our problem in determining the probabilities lies in the uncertainty about the occurrence of a future event, for instance the turning up of five dots on throwing the die once. If p_5 = 0.999 and the other values of p_i are almost zero (Σp_i = 1), then we are almost certain that the five will appear; the probability density for this situation is shown in Fig. 20a. If p_5 = 0.001 (Fig. 20b) we are almost certain that this face will not appear. If, however, p_5 = 1/6 then we don't know either way and our uncertainty about the outcome of the experiment is large. Indeed if all the probabilities are equal to 1/6 then our uncertainty
takes its greatest value.
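The frequency estimate p_i ≈ N_i/N is trivial to simulate; the sketch below throws a fair die 60000 times with numpy (the throw count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
throws = rng.integers(1, 7, size=60_000)      # simulated fair-die throws, faces 1..6
counts = np.bincount(throws, minlength=7)[1:] # N_i for each face i = 1..6
p_est = counts / throws.size                  # p_i ~ N_i / N, each close to 1/6
```

The estimates scatter about 1/6 with standard deviation of order sqrt(p(1 − p)/N), which is the "laborious" part: halving the scatter needs four times as many throws.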
It is possible, though, to quantify the uncertainty represented by a probability density for one random variable, and this leads to the concept of entropy. Equally well, a probability density could be derived from a knowledge of our uncertainty as to the outcome of an experiment. The classical way to obtain such a probability density is to employ the Principle of Insufficient Reason, attributed to Bernoulli (1713), which says more or less that if we have no reason to prefer one outcome to another, then the same probability must be assigned to each. When applied to our die experiment this means that p_1 = p_2 = ... = p_6 = 1/6, i.e. we assume that all outcomes are equally likely. The corresponding probability density is shown in Fig. 20c. This approach is intuitively appealing and is confirmed, at least for dice, by frequency analysis. This principle, however, does not allow us to determine a probability density whilst including relevant information in the form of constraints. To do this we need a functional measure of the uncertainty inherent in a probability density. Shannon(114,115) provided three conditions which any such measure S should reasonably be expected to fulfil, and he showed that the only S which satisfied all three conditions was given by

S = −F Σ_{i=1}^{N} p_i log p_i,    (76)
where F is an arbitrary factor. Shannon's conditions are not unique, but any set of reasonable conditions(117) will lead to a logarithmic form for S, which had first been propounded by Hartley(118) as a measure of information.
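Eqn (76) is easy to evaluate; the sketch below (with F = 1, natural logarithms, and the usual convention that 0·log 0 = 0) computes the entropy of two of the die densities of Fig. 20:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of eqn (76) with F = 1, natural logarithms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # 0 log 0 terms contribute nothing
    return -np.sum(p * np.log(p))

# near-certain outcome (Fig. 20a): almost all weight on face 5
almost_certain = [0.0002, 0.0002, 0.0002, 0.0002, 0.999, 0.0002]
# complete uncertainty (Fig. 20c): the uniform density
uniform = [1/6] * 6

S_certain = entropy(almost_certain)  # close to zero
S_uniform = entropy(uniform)         # log 6, the maximum possible for six outcomes
```

The near-certain density gives an entropy close to zero, while the uniform density attains the maximum log 6, exactly the ordering argued in the text.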
The relationship between uncertainty and information is superficially simple. Before doing the experiment, for instance throwing a six-faced die, the uncertainty in the result is given by the logarithmic measure S. After the experiment is complete and the one outcome is obtained, the gain in information is given by S. This is particularly easy to appreciate for the probability density of Fig. 20a: here the result is almost certain, so S is nearly zero, showing both a low uncertainty before the experiment and a low gain in information upon completion. This topic, the relation between uncertainty and information, which has given rise to considerable discussion, is treated in greater depth by Brillouin,(119) Jaynes and also by Denbigh and Denbigh.(120) Irrespective, however, of whether S is regarded as a measure of uncertainty or a measure of information, it is beyond doubt that any probability density has associated with it a statistic S given by eqn (76).
Shannon named this measure of uncertainty entropy, reputedly at the instigation of von Neumann,(120,121) and because of the similarity of eqn (76) to the classical entropy of statistical mechanics. Now that we have a measure of uncertainty S, let us see how it can be exploited. The fundamentals were laid down by Jaynes(116,122) in 1957 and have come to be known as Jaynes' Principle. The idea inherent in this principle is relatively simple: namely that the least biased estimate of the probabilities p_i is given by that probability density having the greatest S subject to any constraints on the p_i which are appropriate. This leads to an estimate which is maximally non-committal with regard to missing information, and since the function which is maximised is S, it is also known as the Maximum Entropy (ME) Principle.
If we apply this principle to the example of our six-faced die, including only the constraint that the sum of the probabilities is unity, the maximum entropy is attained when all the p_i are equal to 1/6, which is what we achieved by using the Principle of Insufficient Reason. In this case nothing has been gained; conceptually the Principle of Maximum Entropy is equivalent to the classical method, but operationally the analysis is drastically simplified, since the constrained maximisation of a well defined function lends itself to elegant computer implementation. I shall delay until later a justification of the use of ME, and in the next section sketch an easily appreciable rationalisation of the Maximum Entropy Principle which was used by Frieden to derive a method for improving the resolution of blurred optical images.

7.1.2. Frieden's Entropy. In 1972, Frieden(123) introduced the idea of restoring optical objects from blurred and noisy images by employing the spatial entropy of the object as the discriminating function. In the following I shall reproduce his arguments translated into the language of NMR.
The object, or in our case the spectrum {X_j}, is considered as an array of intensities with sum ΣX_j, j = 0, ..., J − 1, associated with the frequencies ν_j = jΔν, where Δν is the desired resolution in the final spectrum. This corresponds to the description of the NMR experiment given in Section 2. The next step taken by Frieden was to assume that the total intensity in the spectrum (object) could be divided into a very large number of intensity units ΔX, such that each element X_j was represented as X_j = n_jΔX where n_j is an integer. Furthermore, the total intensity was assumed to be constant, i.e. Σn_j = N ≫ 1. Frieden posed the question as to how the total intensity is distributed amongst the X_j, and his answer was couched in the formalism of statistical mechanics. If we denote p_j as the probability of any one intensity unit ΔX being located at frequency ν_j, then for a very large number N we get p_j = n_j/N. Assuming no correlation between the elements X_j, the number of ways W_X in which a particular spectrum {X_j} can occur is(124)

W_X(n_0, n_1, ..., n_{J−1}) = N!/(n_0! n_1! ⋯ n_{J−1}!).    (77)
Taking the natural logarithm of this and employing Stirling's approximation leads to

N^{−1} log W_X = −Σ_{j=0}^{J−1} (n_j/N) log(n_j/N) = −Σ_{j=0}^{J−1} p_j log p_j.    (78)
This is Shannon's form of the entropy of a probability density introduced in the last section, and we can see that the entropy is related to the degeneracy of a distribution. Thus distributions having a higher entropy have a higher multiplicity and can be realised in more ways, and this is a justification for expecting them to be more likely to be observed,(124) i.e.
more probable. Frieden(123) converted eqn (78) to

log W_X = −u Σ_{j=0}^{J−1} X_j log X_j + β    (79)

where the constants u and β are given by u = 1/ΔX and β = N log(NΔX), and then he applied the Principle of Maximum Entropy by seeking that solution which maximised the summation in eqn (79). But he also went one step further and took into account the noise e_k present in the measured data d_k = x_k + e_k, by applying the ME method to determine the most probable noise vector {e_k}. This can be done using the foregoing treatment, but allowance must be made for the non-positivity of this noise, since the logarithmic formulae used so far apply only to positive quantities.
By adding a bias B to all noise values it can be ensured that all E_k = e_k + B are greater than zero. If, furthermore, a noise unit ΔE is introduced, then the noise E_k at each measured data point can be expressed as E_k = n_kΔE, and the number of ways of realising a particular noise distribution is

W_E = N!/(n_0! n_1! ⋯ n_{K−1}!)

where N = Σn_k.
This leads, in an analogous fashion to the above, to

log W_E = −u′ Σ_{k=0}^{K−1} E_k log E_k + β′    (u′, β′ constant)
The number of ways of realising a particular spectrum which fits the data with a particular noise distribution is thus given by W_X W_E, and the best solution to the problem is to maximise this subject to the constraints

d_k = x_k + e_k,    k = 0, ..., K − 1,    and    ΣX_j = Y.    (80)
The standard way to achieve this is to use the method of Lagrange multipliers, by setting the sum

−Σ_{j=0}^{J−1} X_j log X_j − ρ Σ_{k=0}^{K−1} E_k log E_k − Σ_{k=0}^{K−1} λ_k(x_k + E_k − B − d_k) − μ(Σ_j X_j − Y)

to a maximum, where

x_k = Σ_{j=0}^{J−1} X_j exp(i2πjk/J)    (81)

and ρ, μ and {λ_k} represent the undetermined multipliers. The spectral solution to this problem is found by differentiation with respect to X_j, thus

X_j = exp[ −1 − μ − Σ_{k=0}^{K−1} λ_k exp(i2πjk/J) ]    (82)

and the undetermined multipliers are obtained by substituting eqn (82) into eqn (80).
Frieden(123) solved the resulting non-linear equations using a Newton-Raphson method to determine the unknowns λ_k and μ, and found convergence within 40 iterations. But the problem becomes rapidly intractable as K increases. A final point to note is that this formulation can be applied to one-, two-, or indeed n-dimensional data without any restrictions.

7.1.2.1. Gull and Daniell's Implementation. The exact fitting of the data using eqn (82) with K + 1 Lagrange multipliers can, however, be avoided if the assumption is made that the errors e_k are zero-mean random variables with variance σ_k²; then the constraints eqn (80) may be replaced by the single constraint

Σ_{k=0}^{K−1} |x_k − d_k|²/σ_k² = K    (83)

and the problem reduces(125,126) to finding the maximum of

Q(λ) = −Σ_{j=0}^{J−1} X_j log X_j − λ Σ_{k=0}^{K−1} |x_k − d_k|²/σ_k²    (84)

with only one Lagrange multiplier λ. If the differential of Q(λ) with respect to X_j is taken we obtain

∂Q/∂X_j ≡ F_j = −log X_j − 1 − 2λ Re[ Σ_{k=0}^{K−1} σ_k^{−2} (x_k − d_k) ∂x_k*/∂X_j ]    (85)

which, for a maximum, must be zero. Gull and Daniell(126) now applied a rather neat iterative technique to solve this problem: since x_k is given by eqn (72), eqn (85) with F_j = 0 simplifies to

X_j = exp{ −1 − 2λ Re[ Σ_{k=0}^{K−1} σ_k^{−2} (x_k − d_k) exp(−i2πjk/J) ] }    (86)
where the multiplier λ must be positive. For any given d_k, σ_k and λ this equation determines the vector {X_j}, which is found iteratively by starting with a uniform spectrum and determining the x_k using eqn (81). These x_k are then inserted into eqn (86) (the summation can be done using the FFT by setting x_k − d_k = 0 for k ≥ K) to obtain a new vector {X_j}. Unfortunately eqn (86) provides the solution as the exponent of the discrepancy between x_k and d_k, and as such is numerically unstable. Solutions can be obtained only by averaging successive iterates, a fact noted not only by Gull and Daniell but also by Collins.(127) However, the self-consistent approach is not the only way to determine {X_j}. Since we are seeking a solution in which F_j is zero, this is equivalent to finding a root of eqn (85), and we can apply Newton's method. This involves taking the differential with respect to X_j, and by employing eqn (72) we get

∂F_j/∂X_j = −1/X_j − 2λ Σ_{k=0}^{K−1} σ_k^{−2}
which shows that for a positive λ eqn (85) will have one unique root.
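This root-finding formulation can be sketched numerically. The code below is an illustrative toy, not Gull and Daniell's implementation: the problem sizes, the noise level, the test spectrum and the fixed multiplier λ are all invented, and in practice λ would be raised until the criterion eqn (83) is met. It searches for the root of eqn (85) by damped Newton steps using the diagonal slope derived above, with the spectrum kept positive at each step.

```python
import numpy as np

J, K, sigma, lam = 16, 8, 1.0, 0.05      # sizes, noise level, multiplier (all invented)
rng = np.random.default_rng(2)

X_true = np.full(J, 0.3)
X_true[3] = 1.5                          # a positive toy "spectrum"
x_true = J * np.fft.ifft(X_true)         # x_k of eqn (72)
d = x_true[:K] + sigma * (rng.standard_normal(K)
                          + 1j * rng.standard_normal(K))

def newton_quantities(X):
    """F_j of eqn (85) and the diagonal slope dF_j/dX_j."""
    x = J * np.fft.ifft(X)               # eqn (72)
    r = np.zeros(J, dtype=complex)
    r[:K] = (x[:K] - d) / sigma**2       # residuals, zero-padded so an FFT does the sum
    G = np.fft.fft(r)                    # sum_k r_k exp(-i 2 pi j k / J)
    F = -np.log(X) - 1.0 - 2.0 * lam * G.real
    dF = -1.0 / X - 2.0 * lam * K / sigma**2
    return F, dF

X = np.full(J, np.exp(-1.0))             # uniform starting spectrum
for _ in range(400):
    F, dF = newton_quantities(X)
    X = np.clip(X - 0.5 * F / dF, 1e-12, None)   # damped Newton step, X_j kept > 0

F_final, _ = newton_quantities(X)        # residual of eqn (85) at the solution
```

With this mild λ the iteration settles quickly on the unique positive root; larger λ (a tighter fit) strengthens the off-diagonal coupling that the diagonal approximation neglects, which is where the damping and regula falsi safeguards mentioned below become necessary.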
If the differentials ∂F_j/∂X_i for i ≠ j are neglected, then at any particular value of λ the iteration relates the new element X_j^n to the old element X_j^o by the equation

X_j^n = X_j^o − F_j/(∂F_j/∂X_j).

Usually the differentials ∂F_j/∂X_i are not negligible, and a suitable combination of damped Newton iterations and regula falsi is needed. Termination may be based on the size of
the corrections X_j^n − X_j^o. The procedure is similar to the self-consistent method in that a solution is found initially for a small value of λ, and λ is then increased until the criterion eqn (83) is satisfied. This method is stable and fairly fast, but stability and speed do not always go hand-in-hand.

7.1.2.2. Skilling and Bryan's Implementation. An alternative to the foregoing would be to maximise Q by a direct search. This method has been applied to a similar problem by Wernecke and d'Addario,(128) who used the method of conjugate gradients. However, any search algorithm will be confronted by immense problems in dealing with NMR
data
of a function in a vector space of dimension J where J
may be several million. Even the conjugate gradient method cannot help us here for this requires the storage (not the inversion) of a JxJ Hessian matrix WQ. The solution is to restrict our search to a subspace, and to choose the subspace with care. This idea, introduced by Skilling and Bryan,02’) to counteract the problems associated with the self-consistent method is very successful in practice. Since the aim is to maximise the entropy and at the same time to bring the fit given by eqn (75) to a selected value these authors reasoned that the best search directions would be the gradients of these measures. Therefore in their general algorithm they chose a set of base vectors starting with OS and VC, the gradients of the entropy and the fit respectively, and expanded the set to include products of the form WS.VS,
WS.VC,
WC.VS
etc.
to introduce greater flexibility.
S and
C are defined by J-1
S =
-
z
Xj (log(Xj /A) - 1)
(87)
j-0
and
k-0
where the equation
for S has been modifiedu2’) to include the constant
A which can be
regarded as the default level for X. obtained by maximising S without the constraint operation. This can be seen by dl*Berentiating the modified entropy eqn (87) to give
asiaxj-
log(A)
-
log(Xj)
(88)
showing that X_j = A for a maximum of S. This approach allows a further improvement, for the original idea had been to maximise Q, eqn (84), for a particular value of λ and then to change λ and remaximise, so that the fit C could be brought into agreement with the expectation value of the noise in eqn (83). However, by separating eqn (84) into its components it is possible to find a maximum of S subject to C = C_aim, where C_aim is a target value of the fit at each iteration. But before I describe this approach let me go back to cover what Skilling and Bryan(129) call 'the single most important key to the development of a robust algorithm', that is the idea of scaling the spectral space.

The two terms in eqn (84) tend to operate against one another, and particular difficulty is encountered with the fit C. The spectrum is necessarily positive because of the logarithmic form of S, whereas a good fit can always be found by setting elements of the spectrum to values less than zero. This can easily be seen from the form of the Fourier transform of the data, where each positive peak is flanked by sidelobes which are alternately negative and positive. A change in the spectrum along the gradient of the fit ∇C thus normally tends to make small values of X_j negative, and this must be prevented by setting such values to a very small positive value (essentially zero). But such action slows down the progress of the search tremendously, and a better approach would be to replace this ad hoc limitation by a method which allows large values of X_j to change more than small values. Just this effect may be achieved by locally scaling the space in such a way that the elements of the gradient ∇C_j are changed to (∇C_j)' = X_j·∇C_j. This is equivalent to transforming to a new space with a metric(74,130) given by g_ij = δ_ij/X_j. Once this has been done the new gradients (∇S)' and (∇C)', together with their transformed partners (∇∇C)'·(∇S)', (∇∇C)'·(∇C)' etc. and combinations thereof, can be used as the basis vectors e_μ of a subspace in which the maximum of Q is sought. The problem is simplified if the new subspace is made Cartesian by orthonormalising the basis vectors e_μ such that e_μ·e_ν = δ_μν.
In this subspace the functions S and C at a point q may be expanded in Taylor series about the origin to give

   S(q) = S(0) + S_μ q_μ + ½ S_μν q_μ q_ν + ...              (90)

   C(q) = C(0) + C_μ q_μ + ½ C_μν q_μ q_ν + ...              (91)

where the indices μ,ν run over the dimensionality of the subspace (i.e. the number of search directions) and I have written them as subscripts in spite of the convention for co- and contra-gredient components.(74,130) S_μ and C_μ are merely the derivatives of S and C respectively along e_μ, i.e. e_μ·∇S and e_μ·∇C, and S_μν and C_μν are the second derivatives with respect to the basis vectors, i.e. S_μν = e_μ·∇∇S·e_ν and C_μν = e_μ·∇∇C·e_ν. But in the scaled space the underlying metric g_ij = δ_ij/X_j is just minus the second derivative of the entropy, -∇∇S, so that eqn (90) reduces to

   S(q) = S(0) + S_μ q_μ - ½ q_μ q_μ .                       (92)
By diagonalising C_μν, eqn (91) can also be simplified to give

   C(q) = C(0) + C_μ q_μ + ½ γ_μ q_μ q_μ                     (93)

where the γ_μ are the eigenvalues of C_μν, and the rotation in the orthonormal subspace leaves eqn (92) unaltered. We are now set to look for that point q which has maximum entropy S and for which C = C_target, the desired final value of the fit, which would normally be chosen in accordance with eqn (83) to equal σ²K. This point might, however, not be contained in the subspace, and the best we can do is take a step towards the minimum of C in the subspace, which is found using eqn (93) to have a fit value given by

   C_min = C(0) - ½ γ_μ⁻¹ C_μ C_μ .
The strategy invoked by Skilling and Bryan is to aim for a value of the fit C_aim between C_min and C_target, a reasonable value being

   C_aim = max(2C_min/3 + C(0)/3, C_target) .

What is sought is the maximum of S subject to C = C_aim. By redefining Q as Q = αS - C, the maximisation of Q in the subspace results in finding that point q given by

   q_μ = (αS_μ - C_μ)/(γ_μ + α)
where α is chosen to fit C = C_aim by a simple iterative procedure involving eqns (92) and (93) only. For further details of the control procedure of this elegant algorithm the original article(129) should be consulted.

The main feature of the algorithm is a continual forward and backward transformation between time and frequency. Any one iterative step starts with a real spectrum {X_j}, for which the entropy may be calculated using eqn (87) and the entropy gradient using eqn (89). This is then followed by transforming {X_j} to the time domain using eqn (72). The complex vector {x_k} is then employed in eqn (88) to determine the fit, and the gradient of the fit may be obtained by substituting eqn (72) into eqn (88) and differentiating with respect to X_j to give

   ∂C/∂X_j = 2 Re[ Σ_{k=0}^{K-1} (x_k - d_k) exp(-i2πkj/J) ]     (94)
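A small sketch of this fit-and-gradient step (my own, using NumPy and assuming the forward transform of eqn (72) has the form x_k = Σ_j X_j exp(i2πkj/J)): the analytic gradient of eqn (94) is checked against a finite difference.

```python
import numpy as np

J, K = 64, 16                      # spectrum length, number of data points
rng = np.random.default_rng(0)
X = rng.random(J)                  # a real, positive trial spectrum
d = rng.normal(size=K) + 1j*rng.normal(size=K)   # mock complex time-domain data

j = np.arange(J)
k = np.arange(K)[:, None]
F = np.exp(1j*2*np.pi*k*j/J)       # assumed eqn (72): x_k = sum_j X_j exp(i2 pi kj/J)

def fit_and_gradient(X):
    x = F @ X
    C = np.sum(np.abs(x - d)**2)                 # the fit, eqn (88)
    grad = 2*np.real(F.conj().T @ (x - d))       # eqn (94)
    return C, grad

C0, g = fit_and_gradient(X)
eps = 1e-6
Xp = X.copy(); Xp[3] += eps                      # perturb one spectral element
C1, _ = fit_and_gradient(Xp)
assert abs((C1 - C0)/eps - g[3]) < 1e-2*(1 + abs(g[3]))
```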
Both transforms (72) and (94) are J-point DFTs, but since in general J is much greater than K, economies can be made by eliminating unnecessary calculations. Further transforms are necessary to determine the second derivative of the fit, and Skilling and Bryan(129) cite six transforms per iteration for a program version using three search directions. The initial spectrum can be a uniform spectrum or a positivised FT of the data {d_k}; it seems not to matter, for the algorithm takes all in its stride and produces a satisfactory result after 10 to 20 iterations. The termination criterion is based on the angle between ∇S and ∇C. When these vectors are parallel and C = C_target the problem has been solved. Normally the algorithm works to bring C close to C_target and then iterates to maximise the entropy keeping C constant. The degree of non-parallelism can be measured by the quantity(129)

   TEST = ½ | ∇S/|∇S| - ∇C/|∇C| |²

which should be less than 0.1 for a good solution.

7.1.3. Probability Density and Entropy of Several Random Variables. In Sections 7.1.1 and 7.1.2 we were concerned with a single random variable x which could take on several values as a result of the outcome of one experiment. The probability with which these values appear,
or are to be expected, has been given as the probability density function p(x). Let us now extend our view to include several random variables x₁, x₂, ... and try to understand their collective behaviour. Our ultimate goal is the correct treatment of a time series, which may be considered as just such a set of random variables. My purpose here and in Section 7.1.4 is to provide the link between the entropy rate of a time series and its power spectrum. In this section I shall derive an expression for the entropy of several random variables. The path we take may become difficult to follow, but we can ease our journey by making several reasonable assumptions which act as milestones along the way. The most important of these is to suppose that each random variable has a normal distribution. For
one such real Gaussian random variable the probability density is defined by(8,131)

   p(x) = (2π)^(-1/2) σ⁻¹ exp{ -(x-μ)²/2σ² }                 (95)

The probability density is a function of two parameters: the mean μ, which is defined as the expectation value of x, μ = E[x], and the variance σ², which is defined as the expectation value of the square of x-μ, σ² = E[(x-μ)²]. If we extend this to two independent Gaussian random variables x₁ and x₂, then their joint probability density p(x₁,x₂) is simply the product p(x₁)p(x₂). But if the two variables are not independent, that is if the outcome of the first experiment is correlated with the outcome of the second, then this correlation must be taken into account and their joint probability density is given by(8,74,131)
   p(x₁,x₂) = (2πσ₁σ₂(1-r²)^(1/2))⁻¹ exp{ -(1/2Δ)[ (x₁-μ₁)²/σ₁² - 2r(x₁-μ₁)(x₂-μ₂)/σ₁σ₂ + (x₂-μ₂)²/σ₂² ] }     (96)

where Δ = 1 - r² and r is the correlation coefficient. The parameters σ₁, σ₂ and r which determine the probability density may conveniently be brought together in the form of a covariance matrix

   C = [ C₁₁  C₁₂ ] = [ σ₁²    rσ₁σ₂ ]
       [ C₂₁  C₂₂ ]   [ rσ₁σ₂  σ₂²   ]

and eqn (96) rewritten in the form

   p(x₁,x₂) = (2π)⁻¹ (det C)^(-1/2) exp{ -½ (x-μ)ᵀ C⁻¹ (x-μ) }

where the boldface x and μ denote the column vectors (x₁,x₂)ᵀ and (μ₁,μ₂)ᵀ. With this matrix-vector notation it is simple to generalise to K real Gaussian random variables, which yields the joint probability density(8,74,131)

   p(x) = p(x₁,x₂,...,x_K) = (2π)^(-K/2) (det C_x)^(-1/2) exp{ -½ (x-μ)ᵀ C_x⁻¹ (x-μ) }     (97)
In Section 7.1.1 we saw that every probability density has an entropy given by eqn (76) for the discrete case. Extending this to continuous variables we obtain(8)

   S = - ∫ p(x) log(p(x)) dx                                 (98)

and so we obtain the entropy of one Gaussian random variable by substituting the probability density eqn (95) into eqn (98), to give

   S = ½ log(2πeσ²) .
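This closed form is easy to confirm numerically (a check of mine with NumPy, using the natural logarithms employed throughout): evaluating the integral of eqn (98) on a fine grid for a Gaussian reproduces ½ log(2πeσ²).

```python
import numpy as np

sigma, mu = 2.0, 0.5
x = np.linspace(mu - 12*sigma, mu + 12*sigma, 200001)
p = np.exp(-(x - mu)**2/(2*sigma**2))/(np.sqrt(2*np.pi)*sigma)   # eqn (95)
S_numeric = -np.sum(p*np.log(p))*(x[1] - x[0])                   # eqn (98), Riemann sum
S_closed = 0.5*np.log(2*np.pi*np.e*sigma**2)
assert abs(S_numeric - S_closed) < 1e-5
```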
To determine the entropy of the multivariate probability density of eqn (97), the treatment is simplified by first transforming the general random variables x to zero-mean uncorrelated random variables y by means of the relation

   y = A(x - μ)

where A is an orthogonal matrix. The covariance matrix Λ_x of the variables y, which is diagonal with elements Λ_k, is related to the covariance matrix of the variables x by(131)

   Λ_x = A C_x Aᵀ ,

and the joint probability density becomes

   p(x) = (2π)^(-K/2) (det C_x)^(-1/2) exp( -½ yᵀ Λ_x⁻¹ y )
        = (2π)^(-K/2) (det C_x)^(-1/2) Π_k exp( -y_k²/2Λ_k ) .     (99)
Equation (99) can then be substituted into eqn (98), giving

   S = - ∫ F Π_k exp(-y_k²/2Λ_k) [ log(F) - Σ_k y_k²/2Λ_k ] dy₀...dy_{K-1}

where F = (2π)^(-K/2) (det C_x)^(-1/2). Thus

   S = - F log(F) Π_k ∫ exp(-y_k²/2Λ_k) dy_k
       + F Σ_k ∫ (y_k²/2Λ_k) exp(-y_k²/2Λ_k) dy_k Π_{l≠k} ∫ exp(-y_l²/2Λ_l) dy_l

     = - F log(F) Π_k (2πΛ_k)^(1/2) + F (1/2) Σ_k Π_l (2πΛ_l)^(1/2)

     = - log(F) + K/2

     = ½ log(det C_x) + (K/2) log(2πe) .                     (100)
The important thing in eqn (100) is the dependency of the entropy on C_x. In the present context the other term can be considered as merely the reference level for measuring the entropy(39) and may be disregarded. This leads to an expression for the entropy of a real multivariate Gaussian probability density given by

   S = ½ log(det C_x) .                                      (101)
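The decomposition used to reach eqn (100) can be checked numerically (my own sketch with NumPy): after orthogonal diagonalisation, the K-variate entropy is just the sum of K one-dimensional Gaussian entropies ½ log(2πeΛ_k), because det C_x is the product of the eigenvalues Λ_k.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 5
B = rng.normal(size=(K, K))
C = B @ B.T + K*np.eye(K)              # a positive-definite covariance matrix C_x
lam = np.linalg.eigvalsh(C)            # eigenvalues Lambda_k (A C_x A^T is diagonal)
S_full = 0.5*np.log(np.linalg.det(C)) + 0.5*K*np.log(2*np.pi*np.e)   # eqn (100)
S_sum = np.sum(0.5*np.log(2*np.pi*np.e*lam))   # sum of one-variable entropies
assert abs(S_full - S_sum) < 1e-9
```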
7.1.4. Autocorrelation and Entropy Rate of a Stochastic Process. An important application of the foregoing is in the treatment of time series, which consist of a set of observations made sequentially in time.(39) For instance the discrete time series

   {x_k} = {x₀, x₁, x₂, ..., x_{K-1}}

may represent a set of observations made at equidistant time intervals 0, Δt, 2Δt, ..., (K-1)Δt. Such a measured time series is always one particular realisation of a stochastic process(8) (Appendix A) whose individual elements may be treated as random variables. In this section I shall show how the entropy of such a process is related to its power spectrum or spectral density. Several assumptions are necessary; the first is to regard the random variables which comprise the series as being normally distributed. Thus the elements of the series, should they be real, have the multivariate Gaussian distribution of the preceding section, and the entropy of the series is given by eqn (101). This treatment of real Gaussian variables corresponds to that proposed originally by Burg,(39,132) who dealt with real series only. The extension to complex series, which should interest us more, is treated incompletely by Smylie et al.(133) Only recently do other authors(38,134) seem to have realised that complex series are slightly different.
In order to extend real series to complex ones and to employ the formalism introduced by Burg, further assumptions must be made. The aim is to obtain the entropy of a complex series in a form similar to eqn (101). The foundations have been laid by Goodman,(135) who considered multivariate complex Gaussian distributions, and are expounded both by Marple(8) and by Kay.(131) The problem is to deal with the covariance matrix of a complex series, which can be compared to that of a real series only under certain conditions. A complex Gaussian random variable x may in general be considered as the sum

   x = u + iv ,

where both u and v are real Gaussian random variables. Thus K complex Gaussian random variables consist of 2K real Gaussian random variables, and the KxK complex covariance matrix Cᶜ of x could be represented by the 2Kx2K real covariance matrix C' of u, v taken jointly. That is, the K-element complex vector {x₀,x₁,x₂,...,x_{K-1}} could be represented by the 2K-element real vector {u₀,v₀,u₁,v₁,...,u_{K-1},v_{K-1}} and the covariance matrix of x treated appropriately. Thus the complex element Cᶜ_jk would be given by

   Cᶜ_jk = E[x_j x_k*] ,

corresponding to a 2x2 block of C'. But the elements of Cᶜ must behave as complex numbers under multiplication, and complex numbers are isomorphic with 2x2 matrices only if these are antisymmetric.(135) Furthermore, the diagonal elements of Cᶜ are real since Cᶜ is Hermitian, so that the 2x2 blocks along the diagonal of C' must themselves be diagonal. This theoretical discussion may be summarised by saying that for a K-variate complex Gaussian distribution it is necessary that the real and imaginary components of each complex random variable x are independent and have identical variances,(18) which seems not too unreasonable.
When this is so it can be shown(135) that

   det(C'_x) = |det(Cᶜ_x)|² ,

and eqn (101) may be extended to yield

   S = log(det Cᶜ_x)                                         (102)

for the entropy of a K-variate complex Gaussian distribution. This value is, of course, dependent on our choice of K, and it would be desirable to have a measure of the uncertainty of a process which is independent of the number of samples. If we are dealing with processes of infinite duration then the value of S in eqn (102) diverges as K tends to infinity; however, the quotient S/K tends to a limit as K→∞ and can be accepted as the average uncertainty per sample.(8) This limit is called the entropy rate of the process, s, thus

   s = lim_{K→∞} (1/K) S(x₀,x₁,...,x_{K-1}) .

The next step in the argument is to assume that the process we are dealing with is zero-mean. This implies that the covariance matrix is identical with the autocorrelation matrix (Appendix A), and the entropy rate may then be expressed as

   s = lim_{K→∞} (1/K) log(det R_x) .                        (103)
It is now necessary to make the further assumption that the time series is stationary. On the one hand this allows a simplification of R_x, since the autocorrelation coefficients are no longer dependent on the time origin and are functions of one index only. Thus the R_x matrix takes on the particularly simple Toeplitz form

   R_x = [ R₀       R₁      ...  R_{K-1} ]
         [ R₋₁      R₀      ...  R_{K-2} ]
         [  :        :       :    :      ]
         [ R_{1-K}  R_{2-K} ...  R₀      ]

where the elements along each diagonal are identical and R₋ᵢ = Rᵢ*. Secondly, the power spectrum or power spectral density P(ν), in which we are ultimately interested, is then related to the autocorrelation by the discrete-time Wiener-Khinchin theorem

   P(ν) = Δt Σ_{k=-∞}^{∞} R_k exp(-i2πνkΔt)                  (104)

with the inverse transform

   R_k = ∫_{-ν_N}^{ν_N} P(ν) exp(i2πνkΔt) dν                 (105)
where ν_N is the Nyquist frequency, ν_N = 1/2Δt.

The final step in this analysis is to employ an expression for the determinant of a Toeplitz matrix such as R_x, so that eqn (103) may be developed. Luckily we can use a result shown by Szegő,(136) who demonstrated that

   lim_{K→∞} (1/K) log(det R_x) = (1/2ν_N) ∫_{-ν_N}^{ν_N} log P(ν) dν

where P and R are related by eqn (105). Thus using eqn (103) we see that the entropy rate of a zero-mean complex stationary Gaussian process of infinite duration is given by

   s = (1/2ν_N) ∫_{-ν_N}^{ν_N} log P(ν) dν .                 (106)
Having come so far it is worthwhile to ask why. In Sections 7.1.1 and 7.1.2 I showed that a valid approach to the determination of the spectrum of a time series was to maximise the configurational entropy of the spectrum, given by the familiar -∫p log(p) formula. This was an example of a spectral discriminating function. But eqn (106) shows that the entropy rate of a time series is given by a ∫log P formula for its power spectrum. This has given rise to considerable discussion, and the layman may be forgiven for having received the impression that the general idea of maximum entropy is equivalent to maximum confusion. Equation (106) is not the entropy of the spectrum treated as a probability density. It is undeniably the entropy of a time series given the assumptions made, and as such is an example of a temporal discriminating function, for the best power spectrum is certainly that which leads to a maximum of s. We could, of course, use an iteration procedure to maximise s in the frequency domain, but this is unnecessary since, as Burg first showed,(39) the solution exists in closed form for one-dimensional data.
7.1.5. Burg's Entropy. The problem Burg tried to solve was to estimate the power spectrum using only the limited number of autocorrelation coefficients available for a short time series. Equation (104) requires an infinite number of such coefficients. If only a few are known, what do we do about the remaining coefficients which are needed to define the power spectrum? His idea was to predict these unknown coefficients whilst making certain that no extra information was added. This could be done by ensuring that the uncertainty of the time series remained a maximum subject to the constraint that the measured autocorrelation coefficients were fitted.

Since the uncertainty in the original time series is given by eqn (106) and P(ν) is given by eqn (104), it suffices to substitute eqn (104) in eqn (106) and differentiate with respect to the unknown autocorrelations. Thus

   ∂s/∂R_k = 0 ,   |k| > K ,

i.e.

   ∫_{-ν_N}^{ν_N} (1/P(ν)) exp(-i2πνkΔt) dν = 0 ,   |k| > K .

This means none other than that the discrete Fourier transform of 1/P(ν) can be described with at the most 2K+1 coefficients, i.e.

   1/P(ν) = Σ_{k=-K}^{K} θ_k exp(i2πνkΔt)                    (107)

with θ_k = θ₋k* to guarantee reality of 1/P(ν). This eqn (107) can be inverted to yield

   P(ν) = 1 / Σ_{k=-K}^{K} θ_k exp(i2πνkΔt)                  (108)

and using eqn (105) we obtain

   R_m = ∫_{-ν_N}^{ν_N} exp(i2πνmΔt) / Σ_{k=-K}^{K} θ_k exp(i2πνkΔt) dν .     (109)
These last two equations form the basis of the temporal Maximum Entropy method. The spectrum is sought in the form of eqn (108), and the coefficients θ_k must be chosen so that eqn (109) is fulfilled. The problem is to find the values of θ_k, and the solutions given in the literature(39,133,137) use contour integration. I do not find this approach particularly illuminating and therefore provide here a slightly different derivation. Note that P(ν) is a power spectrum and can thus be expressed as the product of a phased spectrum X(ν) and its complex conjugate, P(ν) = X(ν)X*(ν). Using this relation in eqn (108) allows us to write

   1/P(ν) = G(ν)G*(ν)                                        (110)
where

   G(ν)  = 1/X(ν)  = Σ_{k=0}^{K} g_k exp(-i2πνkΔt)
   G*(ν) = 1/X*(ν) = Σ_{k=0}^{K} g_k* exp(i2πνkΔt)           (111)

and

   θ_m = Σ_{k=0}^{K-m} g_k g*_{k+m} ,   m ≥ 0 ,              (112)

the Hermiticity of {θ_k} having been used to set g_k = 0 for k < 0 in eqns (111).
The equations (111) and (112) alone do not completely define the coefficients g_k, but this is possible in terms of the phased spectrum X(ν) if we expand X(ν) as a Fourier series:

   X(ν)  = Σ_{l=0}^{∞} h_l exp(-i2πνlΔt)
   X*(ν) = Σ_{l=0}^{∞} h_l* exp(i2πνlΔt) ,                   (113)

the reality of P(ν) having again been used to constrain the index l non-negative. From eqns (111) and (113) we can see that

   h₀g₀ = 1 = h₀*g₀* .                                       (114)
Inserting eqn (110) into eqn (109) gives

   R_m = ∫_{-ν_N}^{ν_N} exp(i2πνmΔt) / (G(ν)G*(ν)) dν

and this can be used to form the sum

   Σ_{k=0}^{K} g_k R_{m-k} = ∫_{-ν_N}^{ν_N} [ Σ_{k=0}^{K} g_k exp(-i2πνkΔt) ] exp(i2πνmΔt) / (G(ν)G*(ν)) dν

                           = ∫_{-ν_N}^{ν_N} exp(i2πνmΔt) / G*(ν) dν

                           = ∫_{-ν_N}^{ν_N} X*(ν) exp(i2πνmΔt) dν .     (115)

Substituting eqn (113) into eqn (115) and employing the relation (114) produces

   Σ_{k=0}^{K} g_k R_{m-k} = { 1/(Δt g₀*) ,   m = 0
                             { 0 ,            m > 0 .        (116)
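A small sketch (my own, with Δt = 1 and NumPy) of how eqns (116) determine the spectrum: solving the linear system for the g_k from a known autocorrelation R_m = ρ^|m| and forming P(ν) via eqns (110)-(111) and (108) recovers the true power spectrum (1-ρ²)/|1 - ρ exp(-i2πν)|² of that process.

```python
import numpy as np

rho, K = 0.6, 8
m = np.arange(K + 1)
R = rho**np.abs(np.subtract.outer(m, m))       # autocorrelations R_{m-k}, Delta t = 1
b = np.linalg.solve(R, np.eye(K + 1)[:, 0])    # solve sum_k b_k R_{m-k} = delta_{m0}
g = b/np.sqrt(b[0])                            # rescale so the m = 0 value is 1/g_0, eqn (116)
nu = np.linspace(-0.5, 0.5, 1001)
G = np.exp(-2j*np.pi*np.outer(nu, m)) @ g      # G(nu), eqn (111)
P = 1.0/np.abs(G)**2                           # eqns (110) and (108)
P_true = (1 - rho**2)/np.abs(1 - rho*np.exp(-2j*np.pi*nu))**2
assert np.allclose(P, P_true)
```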
These equations (116) are the result of applying the Maximum Entropy principle plus the aforementioned constraints to a time series {x_k}. But we see that eqns (116) have exactly the same form as eqns (31), so that Burg's ME method is none other than an autoregressive predictive technique in disguise. The distinction must, however, be made between the ME aspect of Burg's method and the novel means he employed to calculate the AR coefficients, which was presented in Section 6.1.4. This latter is a valid approach to determining these coefficients, but we have seen, both in Section 6.1.4 and here, that one strong assumption made in both cases is that the time series be stationary. This restriction really makes the method inapplicable to NMR (see Appendix A), although a noiseless FID in which all the signals decay at the same rate could be made stationary by multiplication with a suitable increasing exponential.(138)

A further point to be considered when comparing the two forms of ME is the treatment of noise. Maximum Entropy applied in the frequency domain takes the presence of observation noise directly into account using the fit function eqn (75) and should thus give reasonable spectra even for noisy data. Maximum Entropy applied in the time domain fails to do this, although more sophisticated treatments(139) may overcome the problem. As we have seen in Section 6.1.8, the correct treatment of noise leads not to an AR model but to an ARMA model.

Finally, however, the ME approach to Burg's method provided Jaynes(140) with an elegant explanation for the line splitting problem. By pointing out that the Burg solution eqn (108) is an approximation to the situation with an infinite time series but the exact solution to a case where the time series is circular, he showed that two lines are to be expected in the spectrum instead of one when the signal contains a sinusoidal component which does not go through an integer number of cycles during the measurement period.

7.1.6. Comments on Maximum Entropy. It is perhaps warranted to ask why the Maximum Entropy Principle is justified. There seem to be two ways to counter such a question. One is based on the derivation presented in Section 7.1.2, the other introduces new aspects of consistency.
To begin with let us consider the simple presentation of Section 7.1.2 more closely. We are seeking a spectrum which is consistent with a given FID. It makes sense in this context to choose that spectrum which is most probable. This raises the question of how to define the probability of a spectrum and forces us to consider some elementary probability theory. Let us start by denoting propositions by letters, A, B, C etc.; then the notation P(A) represents the probability that A is true, the notation AB denotes that A and B are true, and P(AB) signifies the probability that A and B are true. Furthermore, the notation P(A|B) represents the probability of A given B. Using this notation it is easy to write(140)

   P(AB|C) = P(A|BC) P(B|C)                                  (117)

as well as the symmetrically equivalent

   P(AB|C) = P(B|AC) P(A|C) .                                (118)

Combining eqn (117) and eqn (118) yields

   P(A|BC) = P(A|C) P(B|AC) / P(B|C) .                       (119)

This simple-looking equation (119) is one form of Bayes' theorem;(8) it relates the posterior probability P(A|BC) to the prior probability P(A|C).
Thus the probability that A is true given previous information C and new evidence B is related by eqn (119) to the probability that A is true given only the previous information. It represents an inductive(140) or learning process, in that the probability that a hypothesis A is true is reinforced by evidence B which fits the hypothesis, a measure of the fit being given by P(B|AC). Let us forget, for the moment, the influence of C, which represents general background knowledge, and treat A as the proposition that a given spectrum is the unique spectrum we are seeking, and treat B as the measured FID. Then eqn (119) could be rephrased as(141)

   P(Spectrum|FID) ∝ P(Spectrum) P(FID|Spectrum) .           (120)

The probability that the spectrum is "true" is thus the product of two probabilities, the prior probability of the spectrum and the probability that the FID agrees with the spectrum. This latter probability is fairly easy to determine if we assume that the measured data are corrupted solely by additive Gaussian noise, for in this case the variables x_k - d_k are normally distributed and the measure of the fit is given by

   χ² = Σ_k |x_k - d_k|²/σ_k² .                              (121)

In eqn (121) σ_k² is the variance of the noise and χ² has a chi-squared probability density.(8) Thus the second probability in eqn (120) is proportional to exp(-χ²/2). The main problem lies with the determination of the prior probability P(Spectrum), and the solution is to set this proportional to the degeneracy of the spectrum given by eqn (77). Thus, from eqn (78), the prior probability is proportional to exp(NS), where N is the total number of intensity units ΔX in the spectrum. The posterior probability is therefore proportional to exp(NS - χ²/2), so that a solution is sought which maximises this quantity.

This approach, simple and appealing though it is, has been subject to criticism(142,143) which is based on the choice of the size of the intensity unit ΔX, or alternatively on the value of N. Gull and Daniell(126,141) varied N until the χ² of eqn (121) equalled the number of data points, which seems reasonable. But Livesey and Skilling(145) complain that N is thus chosen a posteriori, which is not consistent with exp(NS) being a prior probability. The statistical arguments linking entropy with probability have also been called into question by Frieden,(144) although his arguments are refuted by Musicus.

The difficulties caused by equating maximum entropy with maximum degeneracy and maximum probability may be overcome by seeking a discriminating function based not on statistical arguments or on information theory but on the reasonable grounds of consistency. This approach was chosen by Shore and Johnson(146,147) and by Tikochinsky et al.(148) The rather formal mathematical treatment of these authors has been made more digestible by Skilling,(149) Livesey and Skilling,(145) Gull and Skilling(150) and by Laue et al.(151)
and Johnson(‘ti) should
fullil.
independence
introduce
These
are
four
axioms,
uniqueness,
and subset independence.
which
invariance That
any
consistent
to coordinate
discriminating transformation,
the function we choose should be
unique is clear, also that it should not depend on arbitrary transformations
is obvious.
The
other two conditions are sufficient to fur the form of S. Rather than go through the mathematics of Shore and Johnson I can do no better than to paraphrase the approach of Livesey and Skilling. U4’) These authors demonstrate that the requirement of subset independence is sufficient to impose the restriction (8/8pj)(8S/8pk)
-
0
where pj is the proportion
for all j # k , Xj /1X.
This comes about by considering
(122) the spectrum being
DAVID S. STEPHENSON
586
broken into disjoint subsets, for each of which independent constraint information is available fixing the proportions in that subset. No information correlating the subsets is assumed available, so that the function S must have a maximum over the proportions of one subset independent of the proportions in the others. This means that the gradient ∂S/∂p_k of S at p_k must be independent of p_j, thus leading to eqn (122). One consequence of eqn (122) is that S is separable, thus

   S = Σ_j s_j(p_j) ,

and if no information exists favouring one element over another then the functional forms s_j(p_j) must all be identical, so that

   S = Σ_j s(p_j) .                                          (123)

FIG. 21. Schematic illustration of the two systems considered in the text.

To clarify the concept of system independence we can regard the spectrum as consisting of two overlapping systems, 1 and 2, which Livesey and Skilling illustrate as the two-dimensional matrix in Fig. 21 (this in no way implies that the spectrum itself is two-dimensional; it is merely a useful construct). The two systems in this illustration correspond to the rows and the columns.
The proportions in the first system are represented by u_i, those of the second system by v_j and those of the whole system by p_ij. If the systems are independent in the sense that linear equality constraints (including the constraint on the total intensity Σu = Σv = 1) of the form

   Σ_i u_i a_ik = A_k   and   Σ_j v_j b_jl = B_l

apply, the elementary proportions p_ij must be given by

   p_ij = u_i v_j .

In the present context the A_k and B_l could represent measured data, and the coefficients a_ik and b_jl could be the terms exp(-i2πik/I) and exp(-i2πjl/J) in the discrete Fourier transform. We have seen in Section 7.1.2 that the constrained maximum of S is found by using Lagrange multipliers, and by applying this technique to both systems independently we seek
multipliers, and by applying this technique to both systems independently we seek
Linear prediction and maximum
entropy
587
the maxima of the sums
[Z ‘$ ajk -
91 -
1
s
1
'lk
9,
i
s(vj)
;
421 [i
*k ]
(124)
and -
+
i
1
vj bjl -
(124a)
Bl ]
j
These maxima are found by differentiating to give
S’(Uj)
1
+
(a& - Ak> -
ulk
0
(125)
k s ’ (vi>
1
+
u2,
(bjl -
B,)
-
0
.
(125a)
The same method may be applied to the whole system, for which the constraints

   Σ_ij u_i v_j a_ik = A_k   and   Σ_ij u_i v_j b_jl = B_l

are effective. By forming a sum similar to eqn (124) and differentiating we obtain

   s'(u_i v_j) + Σ_k μ_1k a_ik + Σ_l μ_2l b_jl = 0 .         (126)

Combining eqns (125) and (126) yields

   s'(u_i v_j) - s'(u_i) - s'(v_j) = Σ_k (λ_1k - μ_1k) a_ik + Σ_l (λ_2l - μ_2l) b_jl ;

differentiating this first with respect to v_j and then with respect to u_i gives

   s'''(u_i v_j) u_i v_j + s''(u_i v_j) = 0 ,

or in terms of the elementary proportions

   s'''(p) p + s''(p) = 0 .                                  (127)
Equation (127) may be integrated to give the general solution

   s(p) = A p log(p) + B p + C                               (128)

where A, B and C are constants. Substituting eqn (128) into eqn (123) yields

   S = A Σ_j p_j log(p_j) + B + C                            (129)

since Σp = 1. The constants in eqn (129) are irrelevant, so we see that, on the basis of reasonable axioms and simple arguments, Shore and Johnson demonstrate that the only discriminating function which leads to consistent results must have the form

   S = - Σ_j p_j log(p_j)                                    (130)

where the minus sign is chosen to give S a maximum. This has, furthermore, been done without invoking the concepts of maximum degeneracy or maximum probability. An amusing example of the consequences of not employing eqn (130) has been given by Gull and Skilling.(150)
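Equation (127) is easy to verify for the form finally adopted in eqn (130) (a check of mine with analytic derivatives): for s(p) = -p log(p), s''(p) = -1/p and s'''(p) = 1/p², so s'''(p)p + s''(p) vanishes identically.

```python
import numpy as np

p = np.linspace(0.05, 1.0, 20)
s2 = -1.0/p          # s''(p) for s(p) = -p log(p)
s3 = 1.0/p**2        # s'''(p)
residual = s3*p + s2                 # eqn (127): s'''(p) p + s''(p)
assert np.allclose(residual, 0.0)    # satisfied for every p > 0
```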
The foregoing derivation has rested on the assumption that the sole information about the proportions came from the constraints. Should this not be the case, and should prior knowledge in the form of a distribution m_j be available, then this can be included to give a revised form of the entropy

   S = - Σ_j p_j log(p_j/m_j) .                              (131)

This expression eqn (131) is a measure of the cross-entropy(146) between two distributions p_j and m_j (actually the cross-entropy is the negative of S in eqn (131) and has to be minimised) and its derivation is described by Shore and Johnson. If no constraints are in operation, maximising the S of eqn (131) will cause the distribution p_j to equal the distribution m_j, which can thus be regarded as a default level(150) and which is related to the constant A of eqn (87).

One of the reasons for the success of the spectral maximum entropy method is that the logarithmic form enforces the populations p_j to be positive. This is of great value in image processing, where the object is necessarily positive and where positive-constraining methods play a great role.(8) But for NMR spectroscopy it can be a nuisance because some peaks may be negative. This difficulty may be overcome by a method to be described in the next
Section, but Wright and Belton(152) have argued that the fundamental problem lies with the correct definition of entropy for reconstruction of a phased spectrum.

The underlying assumption of the maximum entropy method previously described is the validity of the direct correspondence between the isochromats X_i and the spectral amplitudes X_j illustrated in Section 2, but this assumes that all the signals are in phase. Faced with this problem, Wright and Belton(152) sought to derive an alternative form of the spectral entropy which better measured the information content of a complex signal obtained in coherent spectroscopy. They reasoned that the entropy should be invariant under certain transformations of the signal. Thus, for instance, neither the entropy of the signal nor that of the spectrum should change if either of these data sets is multiplied by an arbitrary complex constant. Principally, though, the entropy of the spectrum should equal that of the FID, since both entropies measure the information available, and mere Fourier transformation, which is a reversible process, neither creates nor destroys information. An analogy exists between quantum mechanics, where a state vector |ψ(t)⟩ may be expanded as a series of complete orthonormal basis vectors

   |ψ(t)⟩ = Σ_j c_j |ξ_j(t)⟩                                 (132)

with |ξ_j(t)⟩ being the jth basis vector and c_j a complex coefficient, and spectral analysis, where the time signal may similarly be expanded as

   x(t) = Σ_j X_j exp(i2πjt/JΔt) .                           (133)
Wright and Belton introduce the density matrix ρ, whose elements ρ_ij are the ensemble averages ⟨c_i c_j*⟩, and from which an ensemble entropy may be derived

S = -k Tr(ρ log(ρ)) ,

where k is Boltzmann's constant. The correspondence between eqns (132) and (133) allows us to define an "intensity matrix" P, with elements P_ij = ⟨X_i X_j*⟩, which has the entropy

S = -Tr(P log(P)) .
Wright and Belton go on to show how, as a consequence of the Fourier transform relationship eqn (72), the matrix P in the frequency domain may be transformed to the matrix p in the time domain by

p = W P W^{-1} ,

where W is a unitary matrix with elements W_jk = exp(i2πjk/J)/√J and where the elements of p are given by p_kl = ⟨x_k x_l*⟩. Because W is unitary, transformation between frequency and time leaves the entropy unchanged,

S = -Tr(P log(P)) = -Tr(p log(p)) ,   (134)
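The invariance in eqn (134) is easy to verify numerically. The following sketch (with an invented Hermitian test matrix; not part of the original treatment) checks that -Tr(P log(P)) is unchanged by the unitary transformation p = W P W^{-1}:

```python
import numpy as np

# Numerical check of eqn (134): the entropy -Tr(P log P) is invariant
# under the unitary DFT similarity transform p = W P W^-1.
# The matrix P below is random test data, not a real spectrum.
J = 8
W = np.exp(2j * np.pi * np.outer(np.arange(J), np.arange(J)) / J) / np.sqrt(J)

# Build a random Hermitian positive-definite "intensity matrix" P, trace 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(J, J)) + 1j * rng.normal(size=(J, J))
P = A @ A.conj().T
P /= np.trace(P).real

def entropy(M):
    """-Tr(M log M) evaluated through the eigenvalues of a Hermitian M."""
    ev = np.linalg.eigvalsh(M)
    ev = ev[ev > 1e-12]          # treat 0 log 0 as 0
    return float(-np.sum(ev * np.log(ev)))

p = W @ P @ W.conj().T           # W is unitary, so W^-1 = W^H
print(entropy(P), entropy(p))    # the two values agree
```

Because the eigenvalues of P are unchanged by any unitary similarity transform, the same holds for any function of them, of which the entropy is one instance.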
thus fulfilling the main condition of invariance.
The last stage in this argument is to derive the elements of P or p from the spectrum or measured data. These elements are defined as ensemble averages over many elementary states, and the degree of correlation between these is not known. Wright and Belton set this correlation to zero by invoking the Principle of Maximum Entropy and thus conclude that the density matrix is diagonal, with elements given by P_jj = |X_j|². Thus the entropy formula derived by this approach is

S = - Σ_{j=0}^{J-1} |X_j|² log|X_j|² ,   (135)
where the spectrum is scaled so that Σ|X_j|² = 1. The realization of this novel form of the entropy using the method of Skilling and Bryan (Section 7.1.2.2) would not be so easy, because the total search space now includes the real and imaginary components of {X_j}. Since the real and imaginary parts of each individual element are no longer independent, one major feature of the algorithm, the diagonality of ∇∇S, is absent. This might preclude a practicable implementation. However, it should be possible to extend the method of Gull and Daniell (Section 7.1.2.1) to include an entropy in the form of eqn (135). Because of the relationship (134) the entropy can be maximised in either the time or the frequency domain, whichever is practically more efficient. However, eqn (135) relates to the power spectrum of the data, so that if entropy can be regarded as a measure of smoothness then it is the power spectrum which will be smooth, not necessarily the real and imaginary components. This is probably a serious defect, since one aim of the method is to produce smooth phaseable spectra; this could only be achieved by adding extra smoothing constraints (see Section 7.2).
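As an illustration of eqn (135) (a sketch only, with an invented FID; the normalisation Σ|X_j|² = 1 follows the text), a spectrum whose power is concentrated in a few lines has a much lower entropy than one filled with noise:

```python
import numpy as np

# Sketch of the Wright-Belton entropy, eqn (135):
# S = -sum |X_j|^2 log |X_j|^2 with the spectrum scaled so sum |X_j|^2 = 1.
def spectral_entropy(x):
    X = np.fft.fft(x)
    p = np.abs(X) ** 2
    p /= p.sum()                      # scale so that sum |X_j|^2 = 1
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

t = np.arange(256)
fid_line = np.exp(-t / 40.0) * np.exp(2j * np.pi * 0.1 * t)   # one sharp line
rng = np.random.default_rng(1)
fid_noise = rng.normal(size=256) + 1j * rng.normal(size=256)  # pure noise

# The single line concentrates the power and has far lower entropy.
print(spectral_entropy(fid_line), spectral_entropy(fid_noise))
```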
7.1.7. Summary of Maximum Entropy Methods. I shall restrict this summary to the treatment of one-dimensional spectral entropy. Two-dimensional extensions are considered in Section 7.3, and one-dimensional temporal entropy has been shown to be equivalent to the AR/LP methods discussed in Section 6.1.
The first application of Maximum Entropy (ME) to NMR is undoubtedly that of Sibisi. The treatment is strongly based on that of Gull and Daniell,(126) and although it deals with a two-dimensional representation the data are one-dimensional in the usual NMR sense of being measured as a function of one time variable only. As well as the normal frequency dimension in the spectrum, Sibisi introduced the linewidth (or equivalently the decay rate) as the second dimension. The ability of the method to produce well-resolved one-dimensional spectra devoid of truncation artefacts is demonstrated, together with the possibility of counteracting the effects of zero- and first-order phase distortion. This latter is accomplished by including both phase parameters φ₀ and φ₁ into eqn (72)
relating the spectrum to its transform, to give

x_k = exp(i2πφ₀) Σ_{j=0}^{J-1} X_j exp(i2π(k+φ₁)j′/J) ,   (136)

where j′ + J/2 = (j + J/2) modulo J (the use of the modulus is necessary if the data are complex and the index j runs from 0 to J-1, because spectral elements X_j with j ≥ J/2 correspond to negative frequencies, see Section 3.4 and Brigham). The entropy is then maximised over the two phase parameters φ₀ and φ₁ as well as over the spectral amplitudes X_j.
FIG.22. ¹³C spectra of 2-vinylpyridine in CDCl₃ (δ 160-120 ppm). (a) FT spectrum of a noisy FID (× indicates a spectrometer artefact). (b) FT spectrum of the same FID after multiplication with a negative exponential. (c) ME spectrum with automatic phasing. (d) FT spectrum of a FID obtained with a larger pulse width and having higher signal-to-noise. Because of the weighting in eqn (137) the ME method actively discriminates against the noise and is much more successful at producing an interpretable spectrum at low signal-to-noise ratios; compare (b) and (c). The low-intensity peak for C2 is hardly larger than some of the noise spikes in (a), but it is clearly distinguishable in the ME spectrum (c). (reproduced from Ref.151, with permission)
In 1985 Laue et al.(151) applied the ME method to very noisy ¹³C data and moderately noisy ¹H data and showed that the resulting spectra were much freer of noise and had better resolution. The main theme is to show that ME can be used to improve resolution and sensitivity simultaneously, something which cannot be accomplished by windowing or filtering in the time domain. Entropy is introduced as a discriminating function in this paper, and cogent reasons based on the consistency arguments of the preceding Section are given for its being chosen as the best. Furthermore, an outline is presented of the algorithm, that of Skilling and Bryan, which also finds the two principal phase parameters automatically and which converges within 100 Fourier transformations. Whereas Sibisi had introduced the decay rate into the algorithm in order to produce a
two-dimensional reconstruction, Laue et al.(151,155) use the decay rate to discriminate against the noise and to improve the resolution by modifying eqn (136) to give

x_k = exp(-wk) exp(i2πφ₀) Σ_{j=0}^{J-1} X_j exp(i2π(k+φ₁)j′/J) .   (137)

Choosing w positive is equivalent to assuming the existence of Lorentzian lines, and the transformed vector {x_k} would have elements whose amplitudes decay exponentially.
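Equations (136) and (137) can be written down almost verbatim. The sketch below (the function name and parameter values are my own; it is an illustration, not the authors' code) maps a trial spectrum {X_j} to the trial FID {x_k}, with j′ defined as in eqn (136); with φ₀ = φ₁ = w = 0 it reduces to an unnormalised inverse DFT:

```python
import numpy as np

def spectrum_to_fid(X, phi0=0.0, phi1=0.0, w=0.0):
    """x_k of eqn (137): zero/first-order phases phi0, phi1 and decay weight w."""
    J = len(X)
    j = np.arange(J)
    jp = (j + J // 2) % J - J // 2        # j': indices j >= J/2 map to negative frequencies
    k = np.arange(J)[:, None]
    kernel = np.exp(2j * np.pi * (k + phi1) * jp[None, :] / J)
    return np.exp(-w * np.arange(J)) * np.exp(2j * np.pi * phi0) * (kernel @ X)

# one spectral line; with w > 0 the trial FID decays exponentially
X = np.zeros(64, dtype=complex)
X[5] = 1.0
x = spectrum_to_fid(X, w=0.02)
print(np.abs(x[0]), np.abs(x[32]))
```

Note that for integer k the shift to j′ changes nothing, but with a non-integer first-order phase φ₁ it matters which frequencies are treated as negative, which is precisely why the modulus is introduced in eqn (136).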
However, the noise in the data {d_k} has a constant variance and will thus not be well fitted by a decaying signal. The result is to extract the "true" spectrum even in the presence of considerable noise. Comparison of the ¹³C spectra obtained by FT and by ME shows the improvement in clarity and suppression of artefacts with ME when dealing with noise-corrupted non-truncated data (Fig.22). A similar approach of using a non-zero w in eqn (137) has been employed by Ni et al.,(156) who first multiply the FID by a decreasing exponential to reduce the noise and then apply ME with a positive w to regain the resolution. Ni et al. state that only in this way may one obtain noise suppression and high resolution simultaneously.
FIG.23. ¹³C spectra of dioxan in benzene(d₆) obtained with an INEPT sequence (δ 71-64 ppm). (a) FT spectrum of the FID. (b) ME spectrum of the same FID. The inserts show expansions of the low-field multiplet. (reproduced from Ref.157, with permission)
One of the defects of the spectral entropy procedure is that the Σ p log(p) form of the entropy constrains p to be positive. Laue et al.(157) have overcome this difficulty by considering the sought spectrum to consist of the difference of two positive spectra. One of these, {X_j⁺}, corresponds to the normal positive signals, the other, {X_j⁻}, to the negative
signals. Since these two spectra are positive it is easy to apply the entropy formalism to both, yielding

S = - Σ_{j=0}^{J-1} X_j⁺ log X_j⁺ - Σ_{j=0}^{J-1} X_j⁻ log X_j⁻ ,

and the real spectrum may be obtained as

X_j = X_j⁺ - X_j⁻ ,   (138)

whence X_j may be incorporated in eqn (137) to yield the time series {x_k}. Laue et al.(157) demonstrate the use of this two-channel program on an INEPT
spectrum with antiphase peaks and show the simultaneous improvement of resolution and suppression of noise obtainable with ME (Fig.23).
Martin(158) has used the method of Gull and Daniell,(126) described in Section 7.1.2.1, to obtain ME reconstructions of magnitude spectra. He chose a somewhat different approach to that of Laue et al. by seeking to fit a "correlation" function derived from the experimental FID. The data {d_k} were first transformed to the frequency domain using a normal DFT to give a normal spectrum {D_j}, from which the magnitude spectrum {|D_j|} was obtained and subsequently transformed back to the time domain to form the "correlation" function {d_k′} (this "correlation" function is not the same as the autocorrelation function of the data). The self-consistent algorithm of Gull and Daniell(126) was then employed to find a magnitude spectrum {X_j′} which fitted the {d_k′} within the expected variance of the noise. Martin showed that the method did not yield a higher signal-to-noise ratio than did matched filtering. But it should be borne in mind that this result cannot be directly translated to the method of Laue et al.(151) because of the completely different treatment of the data, and because Martin employed no weighting in eqn (137).
The spectral ME method can also be applied to a truncated FID, as shown by Laue et al.,(34) who applied it to ¹⁷O data in which acoustic ringing and pulse breakthrough had corrupted the initial data points in the time domain. Conventional FT of such data results in considerable baseline roll. This may be eliminated either by increasing the delay between the end of the pulse and the start of the acquisition, a technique which introduces severe phase distortions, or by setting the initial data points to zero, which introduces artefacts of the "sinc" function type. An ME spectrum obtained from such truncated data is, however, devoid of such artefacts,(34) although to achieve a clean baseline Laue et al. found it necessary to weight the initial points in the FID rather than to employ a sharp cutoff and set them to zero (Fig.24).
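The two-channel entropy of eqn (138) can be sketched in a few lines. The decomposition below (positive and negative parts plus a small positivity floor) is my own convenient illustration, not the authors' exact parameterisation:

```python
import numpy as np

# Sketch of the two-channel entropy used for spectra with negative peaks:
# represent X as the difference of two positive channels, X = X+ - X-
# (eqn (138)), and sum the entropies of both channels.
def two_channel_entropy(X, floor=1e-6):
    Xp = np.clip(X, 0.0, None) + floor    # X+ carries the positive signals
    Xm = np.clip(-X, 0.0, None) + floor   # X- carries the negative signals
    return float(-np.sum(Xp * np.log(Xp)) - np.sum(Xm * np.log(Xm)))

X = np.array([0.0, 0.8, -0.5, 0.1, -0.05])   # invented spectrum with antiphase peaks
S = two_channel_entropy(X)
print(S)
```

Since the floors cancel in the difference, X⁺ - X⁻ reproduces the signed spectrum exactly, while each channel separately remains positive and so admits the p log p entropy.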
FIG.24. ¹⁷O spectra of aspyrone. The FID was corrupted by acoustic ringing. (a) FT spectrum of the FID. (b) FT spectrum of the same FID with the first 18 points set to zero. (c) ME spectrum with the first 18 points ignored. (reproduced from Ref.34, with permission)
Laue et al. go on to consider the quantitative aspects of this method by applying it to a two-line spectrum and comparing the areas of the lines under various conditions. They show that the reliability of the areas decreased as the number of ignored points at the start of the FID increased. Furthermore, the reliability is also dependent on the decay rate w in eqn (137). If the spectrum is reconstructed with w set to zero the areas are least reliable; for maximum reliability the decay rate should be chosen close to the experimental T₂*.
The spectral ME method also allows a spectrum to be estimated even when points are missing in the middle of the FID. As I pointed out in Section 4.2 this benefit is of little value for 1D NMR. However, the demonstration of this technique by Barna et al.(35) was made using 1D data, so its discussion belongs here. The FID is normally acquired at regular time intervals 0, Δt, 2Δt, etc. It is necessary to do this if the spectrum is to be obtained by DFT, for using a non-regular time basis precludes the DFT, and missing out some of the points leads to spectral distortion. But using ME there is no need for all the points to be present, although the data must still be taken at multiples of a basic interval Δt. The ME method will produce a spectrum best fitting the available data. The object of this approach is to achieve good resolution and high sensitivity with the minimum number of points. The obtainable spectral resolution is dependent on the length of time that the FID is sampled. For the FT method the resolution is simply the inverse of this time, whereas for the methods considered in this review higher resolution is obtainable both in principle and in practice. However, even here the information which leads to high resolution is contained in the data points at large values of kΔt. Thus it makes sense to acquire such points. On the other hand, the bandwidth of the spectrum for complex data is the inverse of the sampling interval Δt, so that if the spectrum is to contain no folded lines this interval is fixed by the spectral width expected for the sample under study. A compromise must be reached in which some of the points are acquired close together with the sampling interval Δt to provide the correct spectral width, and some of the points must be acquired at high values of kΔt for good resolution. Because the signal decays exponentially, Barna et al. suggested sampling the FID on an exponential basis, with the sampling density being higher at the start of the FID, where the signal-to-noise ratio is greatest, in order to increase sensitivity. An example of the results obtained with this method is shown in Fig.25.
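One simple way to realise such a sampling schedule (a sketch only; the exact schedule of Barna et al. is not reproduced here) is a geometric progression of sample indices, bumped past duplicates so that n distinct multiples of Δt are retained:

```python
import numpy as np

def exponential_schedule(N, n):
    """Choose n of the N possible sample times, densest at the start of the FID."""
    raw = np.floor(N ** (np.arange(n) / (n - 1))).astype(int) - 1
    idx, last = [], -1
    for r in raw:
        r = max(int(r), last + 1)   # skip past indices already taken
        idx.append(r)
        last = r
    return np.array(idx)

# e.g. 128 points chosen from 512 possible multiples of dt
idx = exponential_schedule(512, 128)
print(idx[:6], idx[-3:])
```

The early part of the FID is sampled at every multiple of Δt (preserving the spectral width), while the sparse late samples supply the long-time information that carries the resolution.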
7.2. Maximum Smoothness
One of the prominent aspects of ME spectra is the flatness of the baseline and apparent lack of noise. This can be rationalised in terms of the information concept of entropy. A spectrum in which numerous features such as wiggles and noise spikes occur is obviously one which has more information than a spectrum devoid of such artefacts. Because the maximum entropy method seeks that spectrum consistent with the data but having the least information these features will be absent in a ME spectrum. So, in general, maximum entropy spectra are smooth. If smoothness can be equated with lack of information then it makes sense to seek a spectrum in which smoothness itself is the discriminating function. This approach was first advocated
by Phillips(159) and is described by Frieden(8) and by Rosenfeld and Kak.(160)
Belton and Wright(161) have applied the method to NMR, based on a treatment by Hunt.(162) Let us consider first a spectrum {g(k)} obtained by Fourier transformation of a FID. This spectrum, which may have broad lines because of short T₂ times and may be contaminated by noise, can be considered as the convolution of an ideal spectrum {f(j)} with sharp lines and a lineshape or point spread function {h(j)}, plus additive noise {e(j)}, thus

g(k) = Σ_{j=0}^{a-1} h(k-j) f(j) + e(k) ,   k = 0, ..., a+b-2 ,   (139)

where the sequence {f(j)} has a elements and {h(j)} has b elements. Equation (139) can also be written in matrix form as

g = Hf + e ,   (140)

where the elements of H are given by

H(k,j) = h(k-j) for 0 ≤ k-j ≤ b-1, and 0 otherwise,

for k = 0, ..., a+b-2 and j = 0, ..., a-1.
In the absence of noise the least-squares solution of eqn (140) would be given by

f = (H^H H)^{-1} H^H g ,

but noise leads to ill-conditioning(8) and can be taken into account by finding a solution vector f such that

e^H e = (g - Hf)^H (g - Hf) = (a+b-1)σ² ,   (141)

where σ² is the variance of the measured noise, subject to the constraint that the smoothness of the spectrum {f(j)} be a maximum. The smoothness is defined in terms of the sum of squares of the second derivatives of the sought spectrum, i.e.

smoothness = - Σ_j |f(j+1) - 2f(j) + f(j-1)|² .   (142)

To make life easier we zero-fill the arrays {g}, {f}, {h} and {e} up to P elements, where P ≥ a + 2b - 2, to give new arrays {g_p}, {f_p}, {h_p} and {e_p} such that

f_p(j) = f(j) for 0 ≤ j ≤ a-1, and 0 for a ≤ j ≤ P-1,

and similarly for {g_p}, {h_p} and {e_p}. Then eqn (140) can be expressed as

g_p = H_p f_p + e_p ,
where the matrix H_p has the circulant form

H_p = | h_p(0)     h_p(P-1)   h_p(P-2)   ...   h_p(1) |
      | h_p(1)     h_p(0)     h_p(P-1)   ...   h_p(2) |
      |    .           .                          .   |
      | h_p(P-1)   h_p(P-2)   h_p(P-3)   ...   h_p(0) |

and, furthermore, eqn (142) becomes

smoothness = -(C_p f_p)^T (C_p f_p) ,   (143)

where the matrix C_p is the cyclic second-difference operator

C_p = | -2    1                  1 |
      |  1   -2    1               |
      |       1   -2    .          |
      |            .    .     1    |
      |  1              1    -2    |

with 1s in the corners.
The purpose of choosing P ≥ a + 2b - 2 is to introduce sufficient zeros into the vectors such that "wrap-around" problems are eliminated when the matrices H_p and C_p are written as above. The advantage of this formulation is that both matrices are circulant, and the eigenvalues of such a matrix are obtained by the Fourier transform of its first row. For instance the eigenvalues Λ_H(k) of H_p are given by

Λ_H(k) = Σ_{j=0}^{P-1} H_p(0,j) exp(i2πkj/P) ,   (144)
and the elements of the eigenvectors w_k of both matrices are given by

w_k(j) = exp(i2πkj/P) ,

so that both matrices H_p and C_p can be expressed as

H_p = W Λ_H W^{-1}  and  C_p = W Λ_C W^{-1} ,   (145)

where Λ_H and Λ_C are diagonal matrices with elements

Λ_H(k,k) = Λ_H(k)  and  Λ_C(k,k) = Λ_C(k) ,   (146)

and W is a matrix with elements

W(j,k) = exp(i2πjk/P) .   (147)
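The circulant property invoked here is easily checked numerically. A small sketch (with P = 8; C_p is the cyclic second-difference operator of eqn (143)):

```python
import numpy as np

# Check of eqns (144)-(147): the eigenvalues of a circulant matrix are
# the Fourier transform of its first row.
P = 8
first_row = np.zeros(P)
first_row[0], first_row[1], first_row[-1] = -2.0, 1.0, 1.0  # cyclic second difference

# C[k, j] = first_row[(j - k) mod P] builds the full circulant matrix.
j, k = np.meshgrid(np.arange(P), np.arange(P))
C = first_row[(j - k) % P]

# Eigenvalues from the Fourier transform of the first row, eqn (144).
lam = np.array([np.sum(first_row * np.exp(2j * np.pi * kk * np.arange(P) / P))
                for kk in range(P)])

ev = np.sort(np.linalg.eigvals(C).real)
print(np.sort(lam.real), ev)    # the two sets coincide
```

For this particular matrix the eigenvalues come out as -2 + 2cos(2πk/P), so C_p indeed penalises high-frequency (rough) components most strongly.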
The solution to this problem is to find the minimum of eqn (143) subject to the constraint eqn (141) by means of Lagrange multipliers, i.e. we must minimise

γ(C_p f_p)^T (C_p f_p) + (g_p - H_p f_p)^T (g_p - H_p f_p) .

Differentiating this with respect to f_p and setting the result equal to zero yields

f_p = (H_p^T H_p + γ C_p^T C_p)^{-1} H_p^T g_p .   (148)
Then substituting eqns (145) and (147) into eqn (148), and recalling from eqns (144) and (146) that the eigenvalues are the Fourier transforms of the first rows, produces

f_p = W (Λ_H^H Λ_H + γ Λ_C^H Λ_C)^{-1} Λ_H^H W^{-1} g_p ,   (149)

so that the Fourier transforms {F_p} of the sought spectrum are given in terms of the Fourier transforms {G_p} of the data by

F_p(k) = Λ_H*(k) G_p(k) / (|Λ_H(k)|² + γ|Λ_C(k)|²) .   (150)
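In the Fourier domain the whole restoration collapses to a pointwise division, and a sketch of eqns (148)-(150) takes only a few lines (the test spectrum, lineshape and value of γ are all invented for the purpose):

```python
import numpy as np

# FFT-domain sketch of the constrained solution, eqn (150):
# F(k) = conj(LH(k)) G(k) / (|LH(k)|^2 + gamma |LC(k)|^2), where LH and LC
# are the circulant eigenvalues (DFTs of the first rows) of the blurring
# matrix H_p and the second-difference matrix C_p.
P = 128
x = np.arange(P)

f_true = np.zeros(P)
f_true[[20, 60, 100]] = [1.0, 0.6, 0.3]                   # "ideal" sharp spectrum
h = np.exp(-0.3 * np.minimum(x, P - x))                   # broad symmetric lineshape
g = np.fft.ifft(np.fft.fft(h) * np.fft.fft(f_true)).real  # cyclic convolution, eqn (140)

c = np.zeros(P)
c[0], c[1], c[-1] = -2.0, 1.0, 1.0                        # cyclic second difference
LH, LC, G = np.fft.fft(h), np.fft.fft(c), np.fft.fft(g)

gamma = 1e-3                                              # Lagrange multiplier, cf. eqn (148)
F = np.conj(LH) * G / (np.abs(LH) ** 2 + gamma * np.abs(LC) ** 2)
f_est = np.fft.ifft(F).real
print(np.max(np.abs(f_est - f_true)))
```

With γ = 0 (and no noise) the division inverts the convolution exactly; a positive γ damps the high-frequency components where |Λ_H(k)| is small, which is precisely where noise would otherwise be amplified.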
Equation (150) can rapidly be solved at any particular value of γ, and this parameter can then be varied using Newton's method until the constraint eqn (141) is satisfied. Belton and Wright(161) have demonstrated that this method allows simultaneous enhancement of resolution and suppression of noise and is superior both to matched exponential filtering and to Gaussian resolution enhancement. Its applicability to truncated FIDs has, however, not been demonstrated.
7.3. Extension to Two Dimensions
The extension of the spectral ME method to two dimensions is simplicity itself. The expression for the entropy of a distribution, eqn (76), is independent of the organisation of the probabilities p_j, whether one-, two- or n-dimensional. The only major algorithmic change needed is in the formulae for the transformation between the two domains, time and frequency. Thus for two-dimensional NMR eqn (137) becomes
x_kl = exp(-w(k+l)) exp(i2πφ₀) Σ_{i=0}^{I-1} Σ_{j=0}^{J-1} X_ij exp(i2π(k+φ₁)i′/I) exp(i2π(l+φ₁)j′/J) .   (151)
The first explicit use of ME in two dimensions is that of Hore,(32) who used it to tackle two problems in 2D NMR. The first problem is the existence of "sinc" wiggles caused by truncation of the FID. In general there is no great difficulty in acquiring sufficient points in the t₂ dimension so that the FID has decayed to zero within the measurement period. The difficulty lies with t₁, because the duration of the experiment is proportional to the number of points acquired in t₁. Since spectrometer time is expensive this number must be kept as small as possible, meaning that either resolution must be sacrificed or the t₁ FID be truncated, or both. The conventional approach is to be satisfied with a low resolution and to counteract the "sinc" wiggles by weighting the FID with a Gaussian or sinebell function.(15,27,31) The second problem is the "phase-twist" lineshape in certain 2D experiments. This can be avoided by forming the magnitude spectrum, but even here the dispersion components of the twisted lines interfere with the interpretation of complex spectra and, furthermore, the positive-negative intensity information is lost. Hore(32) applied the method of Gull and Daniell(126) to two-dimensional J-spectroscopy and demonstrated its ability to produce pure absorption spectra (Fig.26). This apparently mysterious process comes about because of the direction in which ME tackles the problem.
FIG.26. 2D ¹H J spectra of m-bromonitrobenzene. (a) 2D FT spectrum of a data array consisting of 32(t₁)×64(t₂) complex points which was zero-filled to 128(t₁)×256(t₂) complex points before transformation. (b) 2D ME spectrum with 128(t₁)×256(t₂) real points. (reproduced from Ref.32, with permission)
We can consider a measured signal, phase modulated in t₁, as consisting of contributions from components of the form

[exp(-t₁/T₂) exp(i2πν₁t₁)] [exp(-t₂/T₂) exp(i2πν₂t₂)] ,   (152)

where for simplicity the line with frequency coordinates ν₁ and ν₂ has equal linewidths 1/πT₂ in both frequency dimensions. Complex Fourier transformation of this signal yields

[A₁(ν₁) + iD₁(ν₁)] [A₂(ν₂) + iD₂(ν₂)] ,

the real part of which, A₁A₂ - D₁D₂, is the mixture of absorption and dispersion known as the "phase-twist" lineshape,(15,27) where A and D denote absorption and dispersion respectively. There is no way to obtain pure absorption lines from expression (152) using a complex FT, but the ME method starts by assuming that the real spectrum consists of double absorption lines A₁(ν₁)A₂(ν₂) and sets the imaginary parts to zero before transforming first with respect to f₂ to give

A₁(ν₁) [exp(-t₂/T₂) exp(i2πν₂t₂)]
and then with respect to f₁ to give expression (152).
Setting the imaginary parts of the spectrum to zero causes the Fourier transform in the time domain to be Hermitian, so that only the first quadrant of the back-transformed FID may be compared with the measured data: the "phase-twist" having disappeared into the other quadrants! This method can be applied to any 2D experiment where the signal is phase modulated in t₁ and will produce spectra having absorption lines only. By combining it with the two-channel method of Laue et al.(157) positive and negative lines, such as found in COSY spectra, may be taken care of. Thus phase-sensitive COSY spectra may be produced from data obtained with echo or anti-echo selection(15) introduced to give quadrature detection in f₁.
Hore and Daniell have applied the method of Skilling and Bryan to rotating-frame zeugmatography data. Rather than using FT to determine the spectrum directly from the data, they start from an assumed distribution of absorptive intensity in the two-dimensional spectrum and use the Bloch equations to construct a theoretical time domain signal which could be compared with the measured data. The spectrum was then chosen as that with the maximum entropy consistent with these data. Hore and Daniell show that the ME method counteracts off-resonance effects which lead to frequency shifting and phase twists and also eliminates "sinc" wiggles due to severe truncation.
Laue et al.(164) have also considered the application of ME to 2D NMR. But unlike Hore(32) they took phase-sensitive data as their starting point, i.e. they used the method of Marion and Wüthrich(165) to obtain the FID. In their first test they chose a double-quantum-filtered COSY spectrum, where the signals are always in positive or negative double absorption mode. The ME method was shown to produce spectra which have a higher resolution and which are, to a large extent, devoid of truncation artefacts. The conclusion is that if the ME procedure is substituted for conventional FT then measuring time can be reduced by severely truncating the FID in the t₁ dimension, whilst at the same time retaining good resolution.
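The Hermitian symmetry that makes the quadrant comparison possible can be checked in a few lines (an illustrative sketch, not Hore's implementation):

```python
import numpy as np

# If a 2D spectrum is forced to be purely real (absorption only), its
# inverse Fourier transform is Hermitian, s(-k,-l) = s(k,l)*, so only one
# quadrant of the trial FID carries independent information.
rng = np.random.default_rng(2)
spec = rng.normal(size=(8, 8))        # trial real spectrum (made-up values)

s = np.fft.ifft2(spec)                # back-transformed trial "FID"
# index map m -> (-m) mod N in both dimensions:
s_flipped = np.roll(np.roll(s[::-1, ::-1], 1, axis=0), 1, axis=1)
print(np.max(np.abs(s_flipped - np.conj(s))))   # ~ 0
```

Thus the three remaining quadrants of the trial FID are fixed by the first, which is why only that quadrant need be fitted to the measured data.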
FIG.27. ¹H 2D COSY spectra of the AX spin system of 4-toluene sulphonyl chloride in dimethyl sulphoxide (d₆). (a) Phase-sensitive FT spectrum with dispersive diagonal peaks and absorptive cross peaks. (b) Absorptive ME absorption subspectrum from the same data as in (a). (c) Absorptive ME dispersion subspectrum from the same data as in (a). (reproduced from Ref.164, with permission)
Laue et al.(164) also considered phase-sensitive 2D spectra containing double dispersion peaks as well as double absorption. For instance a straightforward phase-sensitive COSY spectrum can be adjusted such that the cross peaks which are of interest are in double absorption mode when the diagonal peaks are in double dispersion form. In the same way in which a 1D spectrum having positive and negative peaks can be considered as the difference of two positive spectra X_j⁺ and X_j⁻, eqn (138), so a 2D spectrum with positive and negative absorption and dispersion lines can be considered as the combination of four positive spectra (X_j⁺, X_j⁻, X_j⁺ⁱ, X_j⁻ⁱ) whose joint entropy is given by

S = - Σ_{j=0}^{J-1} X_j⁺ log X_j⁺ - Σ_{j=0}^{J-1} X_j⁻ log X_j⁻ - Σ_{j=0}^{J-1} X_j⁺ⁱ log X_j⁺ⁱ - Σ_{j=0}^{J-1} X_j⁻ⁱ log X_j⁻ⁱ ,

and which, either before or after back transformation to the time domain, must be suitably combined according to whether the mode of acquisition is that of Marion and Wüthrich(165) or that of States et al.(166)
This approach allows the separation of the phase-sensitive spectrum into two subspectra: (X_j⁺ - X_j⁻), the absorption spectrum, and (X_j⁺ⁱ - X_j⁻ⁱ), the dispersion spectrum, which now has lines in pure absorption mode, the information about its dispersivity having been retained in the back transformation. The effects of this are shown in Fig.27. The ability to separate the cross peaks from the diagonal peaks, which are normally dispersive in phase-sensitive COSY and whose long tails interfere with the interpretation, promises to be a useful adjunct to the less sensitive method of double-quantum filtration. Laue et al.(164) point out that this method will also be of use in cases where multiplets can have mixed lineshapes.
Finally, both Laue et al.(164) and Hore(32) have noted that "t₁ noise"(15,167) is suppressed by the ME method of spectral analysis. Laue et al. attribute this effect to the greater intensity of absorption peaks in comparison to phase-twisted lines: the t₁ noise consequently appears smaller in relation to the peaks. Laue et al. suggest further that, since the t₁ noise consists of low frequency information in the t₁ dimension, it may be suppressed by omitting the first few points in this dimension: a treatment which is disastrous if FT is employed but which, as we have seen in Section 7.1.7, is possible if ME is used. An example of this technique is shown in Fig.28.
FIG.28. ¹H 2D J spectra of the AX spin system of 4-nitrophenol in dimethyl sulphoxide (d₆). (a) FT spectrum. (b) ME spectrum. (c) ME spectrum with resolution enhancement (positive w in eqn (151)). (d) ME spectrum with resolution enhancement and with the first 4 points in t₁ ignored. In (b) ME improves the signal to t₁ noise ratio by producing absorptive peaks. This ratio can be further increased by resolution enhancement (c). Finally, omitting the first few points in t₁ removes the low frequency components which contribute to t₁ noise (d). (reproduced from Ref.164, with permission)
FIG.29. ¹H 2D NOESY spectra of BPTI; only a small region is shown. For all four spectra the t₂ to f₂ transform was performed using conventional FT. (a) The t₁ to f₁ transform was performed with FT; the first 128 points in t₁ were multiplied by a shifted sinebell function and zero-filled to 1024 points before transformation. (b) The same 128 points as in (a) were analysed with ME. (c) 128 points exponentially sampled out of the first 256 points, analysed with ME. (d) 512 points in t₁, sinebell filtered and zero-filled, analysed with FT. Note that spectrum (b) is intrinsically better than spectrum (a): ME produces far better resolution than does FT because the sinebell filtering is obviated. Spectrum (c) is better yet: if only 128 points are to be analysed it is wiser to sample them exponentially than regularly. Spectrum (c) should be compared with spectrum (a): in a practical situation the measuring times for both data sets are equal, but spectrum (c) is of far greater interpretive value. Barna and Laue compare spectrum (c) to spectrum (d), which would require four times as much measuring time. (reproduced from Ref.168, with permission)
Barna and Laue(168) have extended their earlier work(35) on exponential sampling to two dimensions. They start by showing that for noisy 1D data an optimum trade-off between sensitivity and resolution for a given number of measured data points is realised by sampling these points on an exponential basis and analysing them with ME. As a 2D example they employ the data from a NOESY experiment on BPTI (bovine pancreatic trypsin inhibitor), which is a small protein. Barna and Laue were limited by computing facilities and were thus forced to use a hybrid algorithm consisting of Fourier transformation in the t₂ dimension followed by ME analysis in the t₁ dimension at selected values of f₂. The spectra, shown in Fig.29, indeed demonstrate that a superior resolution is obtained for a given number of points in t₁ using ME, but these results for 2D are less convincing than those for 1D. This lies partly with the difficulty of presenting 2D spectra as contour plots. Barna and Laue note that this exponential sampling cannot be carried too far and that exponential sampling of 128 points out of 512 produces no better results than sampling 128 points out of 256. This seems to be in agreement with their previous work(35) on 1D spectra.
To round off this Section I include a short comment on the temporal Maximum Entropy approach to 2D spectral estimation. We saw in Section 7.1.5 that in one dimension temporal ME is equivalent to autoregressive modelling. This is not the case in two dimensions. In the same way that the entropy rate of a one-dimensional process is given by eqn (106), so the entropy rate of a 2D process can be defined as

H = ∫_{-ν_N1}^{ν_N1} ∫_{-ν_N2}^{ν_N2} log P(ν₁,ν₂) dν₁ dν₂ ,

and it is possible to derive an expression for the power spectrum similar to eqn (108), i.e.

P(ν₁,ν₂) = 1 / [ Σ_{k=-K}^{K} Σ_{l=-L}^{L} θ_kl exp(i2π(kν₁+lν₂)Δt) ] .   (153)

But whereas for 1D the polynomial in the denominator of eqn (108) can be factored as in eqn (110), this is no longer generally the case for the two-dimensional polynomial of eqn (153).(134) Iterative algorithms designed to find the temporal ME solution have been proposed, but as these methods have not been applied to NMR I shall forego the pleasure of discussing them.
8. DEMONSTRATIONS
In order to provide a comparison of the several methods introduced in Sections 6 and 7 I sought to concentrate on the two main problems which might be alleviated by these modern techniques, namely truncation of the FID and its possible noise corruption. To this end I conceived for the one-dimensional demonstrations a synthetic complex time signal whose spectrum consists of sixteen Lorentzian lines arranged as the eight doublets shown in Fig.30a (real part only). The intensities of the doublets vary linearly with frequency, as do the splittings. The largest splitting on the left corresponds to a frequency separation of 0.016 νN and the smallest "splitting" on the right to 0.002 νN, where νN is the Nyquist frequency, whilst the width at half-height of all the lines has been chosen as 0.003 νN. This basic signal was corrupted by Gaussian noise and was also truncated to test the signal discrimination and resolution capabilities of the several methods. The two-dimensional demonstrations, which deal only with the spectral maximum entropy method, are based on the phase-sensitive COSY spectrum of an AMX spin system to be described later.

All calculations were performed on a Control Data Corporation (CDC) Cyber 180-995 computer with a main memory of 128 Megabyte, working either under the NOS/VE system with a word length of 64 bits, which, as with all virtual environment operating systems, offers an ease of programming at the expense of high overheads but which is representative of a class of work stations now being introduced into NMR, albeit probably somewhat faster, or under the NOS operating system with a 60-bit word and a limited amount of addressable memory. Computing times under each system using non-vectorised optimised FORTRAN code were similar: typically a 2048 to 2048 point complex Fourier transform with bit inversion took 0.05 sec., a 4096 to 4096 transform took 0.12 sec. and an 8192 to 8192 transform took 0.25 sec., using the code in Niederdrenk's book.(23)
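As a concrete illustration, the synthetic test signal described above might be generated along the following lines (a Python/NumPy sketch; the centre positions of the doublets and the 10:3 intensity range are my own assumptions, while the splittings, linewidth and doublet count are taken from the text):

```python
import numpy as np

def synthetic_fid(K=2048, noise=0.0, rng=None):
    """Complex FID whose spectrum is eight doublets (16 Lorentzian lines).

    Splittings run linearly from 0.016 to 0.002 of the Nyquist frequency,
    intensities also vary linearly, and every line has a half-height width
    of 0.003 nu_N.  Centre frequencies and the intensity range are
    illustrative assumptions, not values from the original article."""
    rng = rng or np.random.default_rng(0)
    t = np.arange(K)                      # time in units of dt (nu_N = 0.5/dt)
    nu_N = 0.5                            # Nyquist frequency in units of 1/dt
    centres = np.linspace(-0.35, 0.35, 8) * nu_N      # assumed doublet positions
    splittings = np.linspace(0.016, 0.002, 8) * nu_N
    intensities = np.linspace(10.0, 3.0, 8)           # assumed linear intensities
    decay = np.pi * 0.003 * nu_N          # Lorentzian half-height width 0.003 nu_N
    fid = np.zeros(K, dtype=complex)
    for c, s, a in zip(centres, splittings, intensities):
        for f in (c - s / 2, c + s / 2):
            fid += a * np.exp((2j * np.pi * f - decay) * t)
    if noise:
        fid += noise * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
    return fid
```

Truncation to K = 64 points and addition of Gaussian noise then reproduce the test cases used throughout this section.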
8.1. One-Dimensional
Of the simple AR or LP techniques of Section 6.1, I have chosen to compare the Yule-Walker, Burg, Modified Covariance and Covariance methods. These are all easy to implement; indeed source code is given by Marple and needed only trivial modification before compilation. The prediction coefficients determined by these methods may be used in eqn (21) to calculate a power spectrum, or alternatively could be employed in the LPZ formula of Tang and Norris(42), eqn (26). It is this latter approach which I have taken here: Tang and Norris have shown that a reliably phased spectrum may be constructed, and this is of more interest to the majority of practitioners than is a power spectrum.
The methods of Section 6.2 are represented by the programs LPSVD and HSVD, packages available from the authors; both are excellently documented and I have implemented the complex versions of both programs. I was imprudent enough to write my own spectral maximum entropy program based on the prescription given by Skilling and Bryan. This program calculates the first derivatives of the entropy and the fit analytically, and the second derivative of the fit in the space defined by two search directions numerically. Six spectrum files are used and one FID file. Six Fourier transforms per iteration are needed if the phase is not included in the iteration. I also programmed both the self-consistent version and the Newton iteration version of Gull and Daniell's implementation, but these proved in practice to be neither so robust nor so fast as Skilling and Bryan's algorithm.

As I pointed out in Section 7.1.7 the phase may be introduced into the problem via eqn (136). The entropy is not a function of the phase (in the sense that both ∂S/∂φ0 and ∂S/∂φ1 are zero) but the fit C is, and it has first and second derivatives with respect to φ0 and a first derivative with respect to φ1 which are easily determined. Inclusion of the phase into the problem most easily involves one extra Fourier transformation per iteration.
Let us start with the almost noiseless truncated FID whose FT spectrum is shown in Fig.30b. If we compare the AR/LP methods then we see that Yule-Walker and the covariance method provide stable spectra without any computational tricks, examples being given in Figs.31 and 32. Increasing the filter length of the YW method improves the resolution (Figs.31a to 31d), and the splittings can be emphasised by evaluating the transforms of the various arrays in the LPZ function on a circle within the unit circle by a suitable choice of k in the expression z = exp(kΔt + i2πνΔt), as shown by Tang and Norris.(42) This resolution enhancement has been applied in Figs.31e to 31h and the lines
are seen to be sharper, although care must be taken to avoid negative-going sidelobes as in Fig.31h, where too much enhancement has been employed. The spectra in Fig.31 were obtained by using the YW procedure iteratively, as suggested by Tang and Norris(42), to overcome the problem of non-stationarity; if this is not done the spectra do not look as good, and this seems to be a general necessity for the YW method, in which the assumption of stationarity has to be counteracted. It is simple enough to implement: the filter coefficients of the first iteration are used to form a predicted time series which is subtracted from the original time series. The residuals are then subjected to a new analysis, whose spectrum is added to that from the previous iteration and whose predicted time series is subtracted from the old residuals to form new residuals, which are then treated in the same manner.
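A minimal sketch of this iterative procedure follows (Python/NumPy; the simple Toeplitz solver and the use of a plain FFT of the predicted series in place of the LPZ formula of eqn (26) are simplifications of my own):

```python
import numpy as np

def yule_walker(x, M):
    """AR(M) prediction coefficients from the biased autocorrelation."""
    K = len(x)
    r = np.array([np.vdot(x[:K - m], x[m:]) / K for m in range(M + 1)])
    R = np.array([[r[i - j] if i >= j else np.conj(r[j - i])
                   for j in range(M)] for i in range(M)])
    return np.linalg.solve(R, r[1:M + 1])   # x[n] ~ sum_m a[m] x[n-1-m]

def predict(x, a):
    """One-step forward predictions from the filter a (zeros where undefined)."""
    M, K = len(a), len(x)
    p = np.zeros(K, dtype=complex)
    for n in range(M, K):
        p[n] = np.dot(a, x[n - 1::-1][:M])
    return p

def iterative_yw(x, M, n_iter=3, J=1024):
    """Tang-Norris style iteration: analyse, subtract the prediction,
    re-analyse the residuals, and accumulate the spectra."""
    spec = np.zeros(J, dtype=complex)
    resid = x.astype(complex)
    for _ in range(n_iter):
        a = yule_walker(resid, M)
        p = predict(resid, a)
        spec += np.fft.fft(p, J)
        resid = resid - p
    return spec
```

Each pass whitens the residuals a little further, which is why the iteration counteracts the stationarity assumption built into the Yule-Walker equations.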
FIG.32. Spectra produced by the covariance method from the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) Covariance method with a prediction filter length of M = 16. (b) Covariance method with M = 24. (c) Covariance method with M = 32.

In Fig.32 the results of the covariance method for different values of the filter length M are presented. This method is obviously superior to YW. The spectra for Fig.32 were all evaluated on the unit circle using the LPZ formula and we see that, although in general the spectra are good, anomalies sometimes occur, as in the second line from the right of Fig.32b, where one root obviously lies too close to the unit circle. All these "high-resolution" spectral estimators suffer from such occasional problems. Burg's method can sometimes produce weird spectra; examples are shown in Fig.33 for a filter length M = 32, where the FPE had a discontinuity. The value of M was not critical. Figure 33a is the power spectrum obtained using eqn (21) and it is clearly very distorted. Figure 33b shows the spectrum calculated from the LPZ formula. The spectrum obtained by extrapolating the data and then Fourier transforming was much the same. The lines in Fig.33b are all very sharp and indeed all the splittings are resolved (and some extra splittings as well!). The lines can be broadened by evaluating the spectrum outside the unit circle, and this has been done in Fig.33c: the lines become broader and the intensities are more reliable in the phased spectrum. I shall call this method post-broadening because it is applied after the calculation of the filter coefficients. The defects of Burg's method could be attributed to the non-stationarity of the FID, and this can be counteracted by multiplication with an increasing exponential; this can be called pre-narrowing since it is applied before
the calculation of the coefficients. This has been done in Fig.33d, the spectrum being calculated with the LPZ formula outside the unit circle, i.e. with post-broadening.
Figure 33e shows the spectrum with a filter length of M = 40 employing pre-narrowing and post-broadening. It seems that, regardless of whether one applies pre-narrowing or not, it is imperative that post-broadening is used to avoid distorted spectra such as that in Fig.33b. Even then the spectra are not very impressive.
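In sketch form, post-broadening (or post-narrowing) amounts to evaluating the AR denominator polynomial on a circle of radius ρ instead of the unit circle, and pre-narrowing to weighting the FID before the coefficients are fitted. A Python/NumPy illustration (the parameterisation by a radius ρ rather than the exponent k of the text is my own simplification):

```python
import numpy as np

def ar_spectrum(a, J=1024, rho=1.0):
    """Power spectrum of an AR model evaluated on a circle of radius rho.

    rho < 1 (inside the unit circle) sharpens the lines (narrowing),
    rho > 1 broadens them (post-broadening); rho = 1 is the usual estimate."""
    M = len(a)
    # denominator A(z) = 1 - sum_m a[m] z^-(m+1) with z = rho * exp(i*theta)
    coeffs = np.concatenate(([1.0], -a * rho ** -(np.arange(1, M + 1))))
    A = np.fft.fft(coeffs, J)
    return 1.0 / np.abs(A) ** 2

def pre_narrow(x, k):
    """Multiply the FID by an increasing exponential exp(+k*n) before the
    coefficients are calculated, to counteract decay (pre-narrowing)."""
    return x * np.exp(k * np.arange(len(x)))
```

Because the poles of A(z) sit inside the unit circle, moving the evaluation contour outward increases the distance to every pole and so broadens every line at once, which is why post-broadening stabilises the phased LPZ spectra.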
If the modified covariance method is applied directly to the FID it produces the nonsensical spectrum of Fig.34a when the LPZ is employed. By post-broadening the lines in the LPZ formula the spectrum is improved, but spurious splittings remain (Fig.34b). Pre-narrowing alone does not help (Fig.34c), but by pre-narrowing and then post-broadening (Fig.34d) a perfect spectrum is obtained!
Both Burg’s and the modified covariance method treat the FID as if it were stationary, so it might be thought that the iterative application of these methods as for YW would improve the spectra. I have found, however, that this is not the case for truncated data. For all of these methods the calculation times for the prediction coefficients was about 0.005 sec., so tbat the main cost in using such a program lies still with the Fourier transformation, thus applying the Yule - Walker method iteratively takes about 0.2 sec. to produce Fig.Slh which should be compared with 0.05 sec. required to calculate Fig.SOb.
FIG.34. Spectra produced by the modified covariance method from the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) Modified covariance method with M = 32. (b) Same as (a) but with post-broadening. (c) Same as (a) but the FID was pre-narrowed before calculation of the filter coefficients. (d) Pre-narrowing and post-broadening.
The programs LPSVD and HSVD should produce identical results, but small differences, presumably due to rounding error, are noticeable. Figure 35 shows results for both programs for M = 44, which is roughly an optimum; the number of singular values retained in the decomposition was set at 16. Variation of M within the range 32 … Both programs notice that the second line on the right should also be a doublet and divide it into a correctly phased intense singlet and a weak component with incorrect phase and damping factor; this is visible in Fig.35a for the LPSVD program. Such artefacts could, of course, be suppressed before plotting the spectrum. The output of both programs can also be used to produce enhanced resolution spectra by suitably altering the spectral linewidth. Program times were about 0.3 sec. for LPSVD and 0.16 sec. for HSVD; additionally the FID must be reconstructed and Fourier transformed.
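The core of the HSVD idea, fitting damped complex exponentials through the SVD of a Hankel data matrix and the shift-invariance of its signal subspace, can be sketched as follows (Python/NumPy; this is a bare-bones state-space variant, not the authors' programs, and the amplitude and phase estimation step is omitted):

```python
import numpy as np

def hsvd_poles(x, M, n_sig):
    """Signal poles z_i = exp(-d_i + 2j*pi*f_i) of a sum of damped
    exponentials, via the SVD of a Hankel data matrix (HSVD-like sketch).

    M is the number of columns of the Hankel matrix (the 'filter length');
    n_sig left singular vectors are retained as the signal subspace."""
    K = len(x)
    L = K - M
    H = np.array([x[i:i + M] for i in range(L + 1)])   # Hankel, (L+1) x M
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    Us = U[:, :n_sig]
    # shift invariance of the signal subspace: Us[:-1] @ F ~ Us[1:]
    F = np.linalg.lstsq(Us[:-1], Us[1:], rcond=None)[0]
    return np.linalg.eigvals(F)          # poles; frequency = angle / (2*pi)
```

Retaining only n_sig singular values is what discards the noise subspace; the discussion of how many values to retain, below, is precisely the choice of n_sig here.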
FIG.35. Prony spectra of the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) LPSVD spectrum with filter length M = 44 and 16 singular values retained. (b) HSVD spectrum with M = 44 and 16 singular values retained. Small differences are noticeable between the two spectra, presumably due to rounding error.
The ME method is the slowest of all for these truncated data because the run time is proportional to the spectrum size J, not the data length K. The results for these non-noisy data are also not too good: in no case does the algorithm of Skilling and Bryan, efficient though it is, achieve the target fit value C = Kσ² given by eqn (75), where σ² is estimated from the data, within a reasonable number of iterations. The program knows this and the value of TEST (Section 7.1.2) remains large. Furthermore, the resolution is not as high as that obtained with the other methods. Figure 36a shows the results of my one-channel program after 10 iterations. The intensities are good and the baseline is fairly clean, but the splittings are not evident. After 25 iterations (Fig.36b) the first line on the left has split, but progress is slow at about 0.16 sec. per iteration for a spectral length of 2048. This value is smaller than would be expected for 7 Fourier transformations because on the one hand neither bit inversion nor normalisation is needed in the FFT and, furthermore, with only 64 data points a good many "butterflies"(21) can be eliminated. Noiseless data form a hard test for ME programs,
and better conceived codes employing more than just two search directions will certainly perform faster. Exponential multiplication of the type suggested by Laue et al.(151,155) (see eqn (137)) may be used to emphasise the splittings, but this slows down the progress towards a solution considerably, and even then the spectra do not have the resolution obtainable with linear prediction. The target value of the fit may also be set very much higher than the estimated value, whereupon the program rapidly achieves a maximum entropy solution (small value of TEST) but the changes to the spectrum are minimal. Because the algorithm is completely stable it is only a matter of computer time before the correct termination criteria are fulfilled; thus after several hundred iterations the estimated target value of the fit has been obtained and TEST becomes less than 0.1. But these electronic exertions are not worth the effort, for the spectrum changes only unappreciably from that shown in Fig.36.
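For orientation, the self-consistent Gull-Daniell iteration mentioned earlier (the slower, less robust cousin of the Skilling-Bryan algorithm) can be written in a few lines. Everything here, the flat default level A, the fixed Lagrange multiplier λ and the damping, is an illustrative assumption rather than a reconstruction of either program:

```python
import numpy as np

def maxent_gull_daniell(d, J, sigma, A=1e-3, lam=0.1, n_iter=100):
    """Self-consistent Gull-Daniell iteration for a positive ME spectrum.

    d: measured (possibly truncated) complex FID of length K.
    The mock FID is the first K points of the inverse FFT of the trial
    spectrum f; stationarity of S - lam*C for an entropy with default
    level A gives f_j = A * exp(-lam * dC/df_j), iterated with damping."""
    K = len(d)
    f = np.full(J, A)                         # start from the default level
    for _ in range(n_iter):
        g = np.fft.ifft(f)[:K]                # mock FID from trial spectrum
        r = np.zeros(J, dtype=complex)
        r[:K] = g - d                         # residuals, zero-padded
        grad = 2.0 * np.real(np.fft.fft(r)) / (J * sigma ** 2)   # dC/df
        f = 0.5 * f + 0.5 * A * np.exp(-lam * grad)   # damped fixed point
    return f
```

The fixed Lagrange multiplier is exactly what makes this variant fragile: the Skilling-Bryan scheme instead adjusts the step in a small subspace of search directions at every iteration.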
FIG.36. ME spectra of the almost noiseless truncated FID (K = 64) whose FT spectrum is shown in Fig.30b. (a) ME spectrum after 10 iterations. (b) ME spectrum after 25 iterations.
The value of the default intensity A in eqn (87) also had little effect on the spectrum: the smaller the better, but there was only a minimal difference between the spectrum for A = 10⁻⁶ and that with A = 10⁻² (both relative to a maximum spectral intensity of 1.0). However, the approach to a solution is impaired if A is chosen to be too small. The reasons for the poor resolution are not hard to find and have been described by Bryan and Skilling.(171) The ME program finds the spectrum with the highest entropy consistent with the fit given by eqn (75) having a certain value. This constrains the second moment of the residuals but not their first moment, and thus the smoothing effect of the entropy will tend to take intensity away from the peaks and use this to fill in the valleys without increasing C. Bryan and Skilling(171) suggested employing a different constraint by fitting the ordered residuals to a Gaussian distribution, thus ensuring that the residuals
remained white. They showed that indeed this "E statistic" produced a better resolution than the standard χ² statistic of eqn (75) for the deconvolution of blurred photographs, but my experience suggests that it is of little value for truncated NMR data. For almost noiseless truncated data the conclusions must by now be clear: only two algorithms are worth considering. These are the covariance method applied to the data without any tricks (Fig.32c) and the modified covariance method with pre-narrowing and post-broadening applied iteratively (Fig.34d). Both approaches yield spectra which are as good as that of Fig.30a, and neither is considerably more expensive than FT. Next, the silver medal winners so to speak, are the programs LPSVD and HSVD, and the bronze medal is taken by Yule-Walker. I do not think that the Burg spectra (Fig.33) are of great value in this context, neither does the ME method produce the required resolution.
Of course, although the problem of truncated data seems to have been solved by the covariance and modified covariance methods, the real tests start with noisy data, as shown in Fig.30c. When significant noise is present Burg's method improves its performance (Fig.37a), and here the iterative procedure of Tang and Norris(42) is advantageous; even so the intensities leave much to be desired. The YW method generally has a lower resolution for a given filter length M (Fig.37b) but a higher stability, in that the filter length can be pushed higher without artefacts appearing. The modified covariance method is also similar (Fig.37c), as is the covariance method (Fig.37d). Whereas the Burg, YW and covariance spectra were all obtained without post-broadening, the modified covariance method needed this treatment to produce a sensible-looking spectrum. Pre-narrowing cannot be applied to these data, for to multiply the full FID by an increasing exponential would emphasise the noise and not convert the FID to a stationary time series. Neither is it required for the spectra in Fig.37, which all look better than the FT spectrum of Fig.30c because the baseline noise has been removed.
The rightmost splitting, which is not visible in the FT spectrum, is resolved by Burg's method (insert to Fig.37a), but if one knows that Burg's method is subject to spurious splittings not much weight can be put on this information. The splitting does not show up with the other three methods even with resolution enhancement. The calculation time depends strongly on the filter length M: for M = 192 the fastest method, YW, required 0.29 sec. to find the coefficients, whereas the modified covariance was the slowest at 0.67 sec. The modified covariance and covariance methods can, however, be used non-iteratively so that they tend to be faster overall. HSVD also has no difficulty in analysing such data, although as may be expected the computing times are enormous: 200 sec. for a filter length of 1536 yielding a XX^H matrix of dimension 512×512, and 1600 sec. for a filter length of 1024 where the matrix to be diagonalised has dimension 1024×1024. In each case 32 singular values were retained and in neither case was the rightmost splitting resolved.
The ME method is a surprise considering its poor performance on truncated non-noisy data. The program now has no difficulty in reaching the estimated fit value within 5 iterations (time about 3 sec. for transforming 2048 complex points to 4096 real points) and after 10 iterations the TEST value is well below 0.1. The spectra in Fig.38 are true ME spectra. In Fig.38a no exponential correction was employed and we see that the baseline noise is still present because the program is not actively distinguishing between signal and noise. As more and more exponential multiplication is used (Figs.38b to 38d) the baseline noise disappears, the lines become sharper and the splitting on the right is resolved. The resolution in Fig.38d is not as good as that of Fig.37a for Burg's method, but I would tend to give it more credence. The problems increase for truncated noisy data. Let us consider YW first. Figure 39a shows a YW spectrum obtained with a filter length of M = 32 after 3 iterations. It can be
compared with Fig.31c. Strangely, the resolution appears to be better for these noisy data: the first and second lines on the left have noticeable splittings, as does the fourth line. But the third line is unsplit. Post-narrowing the spectrum leads to Fig.39b: the splittings become more pronounced, but the third line still remains unsplit. This irregularity in the resolution is caused by the non-linear manner in which the algorithm tackles the noise, i.e. additive noise in the data no longer appears as additive noise in the spectrum, as it does with FT, but leads to irregular intensities and splittings. The value of M was not very critical in this case, and pre-narrowing the data to enforce stationarity had also very little effect. The computation times, of course, remain the same as for the almost noiseless truncated FID.
FIG.39. Spectra produced by the Yule-Walker method from the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) Yule-Walker spectrum with prediction filter length M = 32 after 3 iterations. (b) Same as (a) but with post-narrowing in the LPZ formula.
The covariance method, which performed so well for almost noiseless truncated data, gave rise to the spectra in Fig.40 for various prediction filter lengths and with neither pre- nor post-treatment. If post-narrowing is applied the lines in Fig.40b become sharper, but the splitting of the second peak on the left is never resolved. Post-broadening can also be applied to improve Fig.40c, and the spectrum then looks similar to that of Fig.40b. It would appear that the best the covariance method can do is shown in Fig.40b. Again the computation times were the same as for the almost noiseless truncated FID.
FIG.40. Spectra produced by the covariance method from the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) Covariance spectrum with filter length M = 16. (b) Covariance spectrum with M = 24. (c) Covariance spectrum with M = 32.
Burg’s method is true to its name of a “high-resolution” estimator. Even for these noisy data the splittings are well resolved. Although the minimum of the FPE lay at
truncated
about M - 28, the spectra are improved by choosing a longer filter. The spectra in Fig.41 were obtained with a filter length of M - 40, although this value was not critical. In Fig.4la the spectrum was calculated with post-broadening, that is the transforms were calculated outside the unit circle, this seems always to be necessary with Burg’s procedure to avoid instability. Figure 41b was calculated by initial pre-narrowing followed by final
Linear prediction and maximum should improve
the spectrum
entropy
611
because the FID then comes closer to
post-broadening,
this
being stationary,
but obviously the influence of noise plays a hidden role here.
FIG.41. Spectra produced by Burg's method from the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) Burg spectrum with a filter length of M = 40 and post-broadening. (b) Pre-narrowing followed by post-broadening.
The modified covariance method gives similar results and, as in the case of almost noiseless truncated data, pre-narrowing is essential.
The programs based on Prony’s method, LPSVD
and HSVD,
do not perform as well as
might be expected and both are very dependent on the particular noise realisation if used to extract
the splittings.
Spectra from LPSVD
for a filter length of M - 44 and 16 retained
singular values are shown in Figs.42a to 42~ for three different realisations of the additive noise. Comparable spectra from HSVD are shown in Figs.42e to 42g. Obviously HSVD is more stable with respect to noise, and splittings can be resolved as in Fig.42g. If the number of singular values which are to be retained is reduced to 8 then the programs cannot resolve the splittings and the spectra of Fig.42d are now largely independent of the noise realisation.
and 42h are obtained,
these spectra
4
FIG.42. Prony spectra of the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) LPSVD spectrum with filter length M = 44 and 16 singular values retained. (b) Same conditions as (a) but a different noise realisation. (c) Same as (a) and (b) but yet another noise realisation. (d) LPSVD spectrum with M = 44 and 8 retained singular values. (e) HSVD spectrum with filter length M = 44 and 16 singular values retained. (f) Same as (e) but a different noise realisation. (g) Same as (e) and (f) but yet another noise realisation. (h) HSVD spectrum with M = 44 and 8 retained singular values.
FIG.43. ME spectra of the noisy truncated FID (K = 64) whose FT spectrum is shown in Fig.30d. (a) ME spectrum after 10 iterations. (b) ME spectrum after 25 iterations.
The results of the ME method applied to the noisy truncated FID are almost identical to those in Fig.36 for the almost noiseless truncated FID. The splittings are not well resolved (Fig.43) but the baseline is clean and the intensities are good. With considerable noise present in the data Skilling and Bryan's algorithm rapidly finds a solution at the estimated fit value, within 10 iterations. After 25 iterations (Fig.43b) the leftmost line has a very small splitting. The spectra in this case are not strongly dependent on whether or not exponential multiplication is employed.
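Exponential multiplication of the FID, which becomes the matched filter of Fig.44a when the decay is chosen equal to the natural linewidth, is simple to state in code (Python/NumPy sketch; the factor-of-two zero-fill is an assumption of mine):

```python
import numpy as np

def exp_mult_spectrum(fid, linewidth, J=None):
    """FT spectrum after multiplying the FID by a decreasing exponential.

    linewidth is the half-height width of the imposed Lorentzian broadening,
    given as a fraction of the Nyquist frequency nu_N = 0.5/dt; choosing it
    equal to the natural linewidth gives the classical matched filter."""
    K = len(fid)
    J = J or 2 * K
    weight = np.exp(-np.pi * linewidth * 0.5 * np.arange(K))
    return np.fft.fft(fid * weight, J)
```

The same routine with a negative linewidth argument would perform the pre-narrowing (increasing exponential) used elsewhere in this section.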
FIG.44. Spectra of the full very noisy FID (K = 2048) whose FT spectrum is shown in Fig.30e. (a) FT spectrum after multiplication of the FID with a decreasing exponential (matched filter). (b) Burg spectrum with a prediction filter length of M = 384. (c) Burg spectrum with a filter length of M = 128 after pre-broadening the FID and after 3 iterations. (d) Burg spectrum with M = 32, pre-broadening, 3 iterations and post-narrowing.

All the AR/LP methods are pretty poor when applied to a full very noisy FID. They should be compared with the effects of a matched filter (Fig.44a). None of these methods is capable of extracting the splittings with any greater reliability than FT after matched filtering and at the same time providing good estimates of the intensities. To pull out the splittings a large filter length is required, and this can lead to gross intensity distortions; furthermore, considerable noise remains in the baseline. This is shown in Fig.44b for Burg's method with a filter length of 384. The effects of noise can be reduced by multiplying the FID by a decreasing exponential, i.e. pre-broadening, and then applying an AR/LP method. In this case it is advisable to use a shorter filter, but the results are not
much better (see Fig.44c). By reducing the filter length drastically the baseline noise and the splittings can be made to disappear and the spectrum becomes cleaner (Fig.44d), but the information loss is great when compared with matched filtering (Fig.44a), and no amount of post-narrowing is able to restore this. If anything the Yule-Walker method or the covariance methods are superior, but only in the rather negative sense that their results are closer to that of Fig.44a. The intensities and splittings in all the methods are very strongly dependent both on the particular noise distribution and on the choice of broadening functions and filter lengths; I think my only general recommendation can be: don't bother. The failure of the AR/LP methods for very noisy data is to be expected: as I pointed out in Section 6.1.8 the model is wrong. The correct treatment of additive noise leads to an ARMA process, and neglect of the MA branch cannot be warranted. The other two approaches, Prony's method and spectral ME, both take noise into account, so one may
expect better performance. There is one great difference, however, between these two methods, and that is computing time. Employing a filter length of M = 1536 leads to a XX^H matrix of dimension 512×512, and the diagonalisation of this takes considerable time, of the order of 200 sec. on the Cyber. A shorter filter leads to a larger matrix and correspondingly greater calculation times. Spectral ME requires about 5 sec. Because the results we have already seen suggest that the HSVD program is superior to LPSVD, I tested only the former. In general the results improve for a larger filter, and this is consistent with the recommendation of Kumaresan and Tufts. Furthermore, the spectra look better if the number of singular values which are retained is set higher than the number of signals expected. This introduces extra noise in the form of signals with incorrect damping factors and phases, but these can be eliminated before plotting the spectrum.
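Plotting a Prony-type result after such pruning means rebuilding a clean FID from the retained amplitudes and poles and transforming it, which might be sketched as follows (Python/NumPy; the zero-fill factor is an illustrative assumption):

```python
import numpy as np

def prony_spectrum(amps, poles, K, J=None):
    """Spectrum synthesised from fitted exponentials, e.g. after spurious
    components have been discarded: rebuild a K-point FID from complex
    amplitudes and poles z_i and Fourier transform it."""
    J = J or 4 * K
    n = np.arange(K)
    fid = sum(a * p ** n for a, p in zip(np.atleast_1d(amps),
                                         np.atleast_1d(poles)))
    return np.fft.fft(fid, J)
```

Because the line list, not the raw data, is transformed, altering the damping factors before this step is exactly how the "enhanced resolution" spectra mentioned earlier are produced.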
In Fig.45 I present the results of the HSVD analysis for a filter length of M = 1536 and with 32, 64 and 128 singular values retained respectively. In Fig.45a I have left in the spurious signals, which can in general easily be identified, as shown by Gesmar and Led, on the basis of their standard errors and also on the frequency dependency of their phase. A spectrum in which the weak signals are eliminated is shown in Fig.45b; here the error
analysis of the HSVD program has been used to identify those lines whose amplitudes are smaller than their corresponding estimated standard deviations. In Fig.45c this technique has been employed and, in addition, those signals whose phase lay more than 30° away from the expected phase at that frequency have been suppressed; this results in the elimination of all spurious signals. Another item characterises the three spectra in Fig.45: Figure 45a was obtained by retaining 32 singular values in the calculation, Fig.45b with 64 singular values and Fig.45c with 128. Apart from the better resolution as the number of retained singular values increases, it is a worrying feature of all three spectra that lines are missing. The second line on the right is just not found by HSVD if only 32 or 64 singular values are retained, so it would appear that a good strategy is to retain a large number of such values. In Fig.45c it is there, but two other lines are absent. The reason for this lies in the way in which spurious signals are suppressed. The missing lines are shown in the insert to Fig.45c. It is visible that one is a close doublet; by investigating the tabular output of HSVD one can see that indeed both lines consist of two closely spaced signals with large
amplitudes but also with considerable errors; presumably when HSVD cannot quite distinguish whether two signals are present or only one it assumes two but gives both of them large errors. If an automatic sorting procedure is then used to weed out the weak signals, then lines which should be there might well be suppressed. I would strongly recommend that the original output be scrutinized and such selection be made by the
spectroscopist.

Figure 46 presents ME spectra of the very noisy full FID, all after 10 iterations, a calculation time of about 5 sec. In Fig.46a the spectrum was obtained with only minimal exponential multiplication (equivalent to a line narrowing of 0.001 νN) and we see that the program leaves a lot of baseline noise in the spectrum. For a greater degree of exponential multiplication (0.002 νN) the program starts discriminating against the noise (Fig.46b) and, furthermore, the splittings are intensified: the inserts show that the second line on the right is well split, and if one employs some imagination the first line on the right is also! Figure 46c shows the results after using exponential multiplication equivalent to the Lorentzian line-width of the signals; the spectrum is not so impressive as Fig.46b because the program had not reached a true maximum entropy solution after 10 iterations, the value of TEST remaining at about 0.2.
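The kind of selection advocated above for the HSVD output, keeping only lines whose amplitude exceeds its estimated standard deviation and whose phase lies within 30° of the expected value, might be automated as follows (Python/NumPy sketch; the argument names are illustrative and no such routine appears in the original programs):

```python
import numpy as np

def select_signals(amps, errs, phases, expected_phase, tol_deg=30.0):
    """Indices of fitted lines whose amplitude exceeds its estimated
    standard deviation and whose phase lies within tol_deg of the
    expected phase.  expected_phase may be a scalar or a per-line array
    (e.g. a linear function of frequency); all angles in degrees."""
    amps, errs = np.asarray(amps), np.asarray(errs)
    # wrap the phase difference into (-180, 180] before comparing
    dphi = (np.asarray(phases) - np.asarray(expected_phase)
            + 180.0) % 360.0 - 180.0
    keep = (amps > errs) & (np.abs(dphi) <= tol_deg)
    return np.flatnonzero(keep)
```

As argued in the text, such a routine should assist rather than replace the spectroscopist, since genuine close doublets can carry large fitted errors and would be discarded blindly.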
FIG.46. ME spectra of the very noisy full FID (K = 2048), all after 10 iterations, with increasing degrees of exponential multiplication as described in the text.
Linear prediction and maximum To
round
previously
off this section
I show the
to the very noisy truncated
FID.
entropy
615
results of applying the methods considered Figure 47a illustrates the YW method with
M - 32, 3 iterations and post-narrowing to improve the resolution. The spectrum is not too dependent on the filter length, but M should not be chosen too large. Pre-narrowing was of no value as might be. expected for a high noise level. In Fig.47b we see the results of the Burg method for M - 32 with post-broadening which is essential. Again M should not be chosen too large and it seems that iteration
helps here.
The modified covariance
method is illustrated in Fig.47c, as for Burg’s method post-broadening is necessary and iteration may be helpful. Pre-narrowing however did not help. In this case and for the covariance method (Fig.47d) a fairly short prediction filter improves the spectrum. Of these AR/LP methods only the Yule - Walker algorithm seems to give generally reliable results whose dependence on the parameters is low.
In comparison the HSVD spectrum of Fig.47e is considerably better, although if an attempt is made to resolve the splittings a considerable number of spurious signals is introduced. By using a large filter and by retaining a small number of singular values, reliable estimates of the intensities are obtained. Maximum entropy (Fig.47f) also does fairly well, although even with exponential multiplication the baseline noise is not removed. The intensities, though, are relatively good.

Several things become clear after considering these results. Firstly, the AR/LP methods are not reliable for full noisy FIDs. The best method here, considering the cost of
computer time, is spectral Maximum Entropy, illustrated in Figs.38 and 46. The ME method does provide considerable advantage over matched filtering as regards resolution for very noisy data. For truncated data the decision should be made on the basis of the signal-to-noise ratio. For severely truncated data with good signal-to-noise the covariance method, the modified covariance method or HSVD can all be used successfully. Problems might be encountered with the modified covariance method if the line-widths of the individual signals differ significantly, since this method requires pre-narrowing to ensure stationarity. As the signal-to-noise ratio decreases the decision should be made whether one wants high resolution or good intensities. A compromise solution is given by the Yule-Walker method. HSVD and ME give good intensities, whereas Burg's method yields higher resolution. For very low signal-to-noise ratios we can forget the splittings and concentrate on the intensities, and here again HSVD and, to a lesser extent, ME are successful. These conclusions are based solely on the foregoing results and can be taken only as very general recommendations.

8.2. Two-Dimensional
To demonstrate the application of spectral Maximum Entropy to two-dimensional NMR I chose the synthetic FID of an AMX spin system measured in a COSY experiment. Complex points were taken to provide quadrature detection in t2. To provide quadrature detection in t1 the results of two experiments differing in the phase of the second 90° pulse were combined to select the N peaks. This combination of two experiments introduces the phase-twist lineshape mentioned in Section 4.2. Furthermore, restricting the number of points in each dimension to 64 results in truncation artefacts when the FID is zero-filled to 256×256 and Fourier transformed. Both these effects can be seen in Figs.48a and 49a.
FIG.48. Two-dimensional COSY spectra of an AMX spin system. The contour plots of the real parts of the complex spectra are shown. Four positive and four negative contours are plotted at heights of 0.5, 0.25, 0.125 and 0.0625 times the highest peak; the negative contours are shown as dashed lines. (a) FT spectrum of a 64x64 complex FID zero-filled to 256x256 points before transformation. (b) ME spectrum of the same FID.

After 10 iterations of a two-channel ME program based on the algorithm of Skilling and Bryan, and employing exponential multiplication to discriminate against the noise, the spectra shown in Figs.48b and 49b were obtained. The phase-twist is eliminated, as are the truncation artefacts; the lines are sharp, and in the contour plot it is easy to identify positive and negative signals in the cross-multiplets. Since all the lines are now in pure absorption, the interference of dispersion-mode tails from the diagonal peaks is no longer a problem, as in normal phase-sensitive COSY, and the sensitivity is of course much higher than that obtainable with a double-quantum filter. The price to be paid for this achievement is a
calculation time of 9.2 sec. per iteration for this 256x256 point spectrum, which should be compared with a time of 0.81 sec. for a straightforward 256x256 FFT using a complex version of the 2D FFT program in Niederdrenk's book(23), including bit-inversion and normalisation.
FIG.49. Two-dimensional COSY spectra of an AMX spin system. The real parts of the complex spectra are shown. (a) FT spectrum of a 64x64 complex FID zero-filled to 256x256 points before transformation. (b) ME spectrum of the same FID.
APPENDIX A: SOME STATISTICAL CONCEPTS RELATING TO TIME SERIES
Without going too much into the details of the statistics of stochastic processes, which are extensively treated by Papoulis(6), Haykin, and by Gardner(172) as well as by Marple and by Kay(134), we can distinguish a few topics which are relevant to the proper treatment of time series. Firstly let us define a discrete time series as a set of observations made at equidistant time intervals Δt, thus

{xk} = {x0, x1, x2, ... , xK-1}

is a time series composed of measurements of the variable x(t) made at times t = 0, Δt, 2Δt, ... , (K-1)Δt. Each of these measurements is in some sense a number drawn from a stock of possible numbers, and our chances of obtaining "xk" at time kΔt are governed by the probability distribution of xk. This defines xk as a random variable, i.e. if we were to try to predict xk we should be hard put to say what value would be obtained, although if we were to measure one and the same "xk" many times the average of our measurements and their scatter would fit the probability distribution.
Since the time series consists of K such measurements we have a K-fold indeterminacy, each element of {xk} having its own probability distribution. The ensemble of K probability distributions is called a stochastic process, and our time series is just one possible realisation of this process. If the signal we are dealing with is completely known the time series is described as deterministic, in the sense that all the elements are predictable. Normally, however, we have noise to contend with and the process is non-deterministic. In this case, though, certain averages may be defined which lead to a reduced set of parameters describing the process: its statistics. Of these the most important are the first and second order statistics, i.e. the mean, the autocorrelation and the autocovariance. Although higher order statistics can also be defined they are of no interest to us here.
The mean of the stochastic process which gives rise to a particular time series is defined as

μk = E[xk]

where E is the expectation operator, the autocorrelation function of the process is

Rkl = E[xk+l xl*]

where again E stands for the expectation operator, and the autocovariance function is

Ckl = E[(xk+l − μk+l)(xl − μl)*] .

These latter two statistics are functions which link together the probability distributions of the individual elements of {xk}. Unfortunately the statistics themselves are not accessible to us: all we can do is estimate them by carrying out the appropriate mathematical operations on the available elements of the one measured time series at our disposal.

A stochastic process all of whose statistics are independent of time origin, or for a discrete time process are independent of index shift, is called strictly stationary or strict-sense stationary. Processes for which only the mean and the autocorrelation and autocovariance functions are independent of time shift are called wide-sense stationary or weakly stationary of order two. In these cases the mean becomes independent of index:

μ = μk = constant ,
and the autocorrelation and autocovariance become functions of one index only:

Rl = Rkl ,    Cl = Ckl .

An important property of stationary processes is the Hermiticity of the autocorrelation function:

Rl = R−l* .
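Both this Hermiticity and the fact that in practice the statistics can only be estimated from the single available realisation can be illustrated with sample estimators. The sketch below uses the standard biased sample estimator; the function name and the complex white-noise test sequence are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4096
# one realisation of a zero-mean complex white-noise process
x = rng.normal(size=K) + 1j * rng.normal(size=K)

def sample_autocorr(x, l):
    """Biased sample estimate of R_l = E[x_{k+l} x_k*] from one realisation."""
    K = len(x)
    if l >= 0:
        return np.sum(x[l:] * np.conj(x[:K - l])) / K
    return np.sum(x[:K + l] * np.conj(x[-l:])) / K

mean_est = x.mean()              # estimate of the (index-independent) mean
R0 = sample_autocorr(x, 0)       # estimate of R_0, the mean power E|x_k|^2 = 2 here
R3 = sample_autocorr(x, 3)       # Hermiticity: sample_autocorr(x, -3) = conj(R3)
```

For this stationary process the sample estimator obeys R−l = Rl* exactly, mirroring the Hermiticity of the true autocorrelation function.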
This relationship is put to use in the derivation of the Yule-Walker equations in Section 6.1.2, of the Burg algorithm in Section 6.1.4 where the forward and backward prediction coefficients are related by complex conjugation, and of the modified covariance method of Section 6.1.5 where the same assumption is made. That stationarity is not an inherent property of the NMR FID can be seen if we consider a time signal corresponding to a single damped sinusoid, the elements of which are given by

xk = b exp[(α + i2πν)kΔt] .

The autocorrelation function is then

Rkl = E[xk+l xl*] = |b|^2 exp[αΔt(2l + k) + i2πνkΔt] = |b|^2 exp(αkΔt + i2πνkΔt) exp(2αlΔt)

which shows a dependence on the index l. The importance of this non-stationarity for linear prediction can be seen by considering again the case of a single damped sinusoid:

xk = a1f xk-1 ,    a1f = exp[(α + i2πν)Δt] ,
xk = a1b xk+1 ,    a1b = exp[−(α + i2πν)Δt] ,

so that it is easy to see that a1f ≠ (a1b)* unless α = 0. Methods such as the above-mentioned would thus be expected to lead to spectral distortion if applied to NMR because one of the assumptions is incorrect, and this is in fact what one sees. Methods which specifically assume the existence of damped sinusoids, such as the functional modelling algorithms, the covariance approach of Section 6.1.5 or the LPQRD technique of Tang and Norris, should thus be preferable.
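The inequality a1f ≠ (a1b)* is easy to verify numerically. In the sketch below the values of α, ν and Δt are arbitrary illustrative choices.

```python
import numpy as np

# Forward and backward one-step prediction coefficients of a damped
# sinusoid x_k = exp[(alpha + i*2*pi*nu)*k*dt].
alpha, nu, dt = -0.05, 0.1, 1.0
k = np.arange(8)
x = np.exp((alpha + 2j * np.pi * nu) * k * dt)

a1f = x[1] / x[0]          # forward:  x_k = a1f * x_{k-1}
a1b = x[0] / x[1]          # backward: x_k = a1b * x_{k+1}

# For an undamped sinusoid (alpha = 0) the two coefficients ARE related
# by complex conjugation, as the stationary theory assumes:
x0 = np.exp(2j * np.pi * nu * k * dt)
b1f, b1b = x0[1] / x0[0], x0[0] / x0[1]
```

The product a1f·a1b is always unity, but only in the undamped case does conjugating the backward coefficient recover the forward one.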
APPENDIX B: THE Z TRANSFORM AND THE COMPLEX PLANE
Whilst the Fourier transform should be well known to all NMR spectroscopists, the z transform is probably a novelty. The standard text is the book by Jury(173), who takes a very mathematical approach. The essentials of the z transform are, however, simple and are given in many books on signal processing, for instance Cadzow(174) or Roberts and Mullis(26), who define it as

X(z) = Σ_{k=-∞}^{∞} xk z^{-k}    (154)

where {xk}, k = −∞, ... , ∞, is a discrete time sequence whose elements may or may not be complex, z is a complex variable and X(z) is clearly a complex function of z which is only defined for those values of z for which the power series eqn (154) converges. This idea of convergence is very important, for it leads to the definition of the singularities (or poles) of X(z) and their position in the complex plane.

If we restrict our attention to causal sequences {xk}, i.e. those sequences having xk = 0 for k < 0, then it can be shown(174) that X(z) converges everywhere outside a circle of radius R in the complex plane, where R depends on the sequence {xk}. For |z| > R, i.e. in the region of convergence, the infinite series eqn (154) can nearly always be expressed(174) as a simple ratio of polynomials in z. To clarify the situation let us take the specific example of an exponential sequence {xk} where xk = exp(ikω) for k ≥ 0, thus

X(z) = Σ_{k=0}^{∞} exp(ikω) z^{-k} = Σ_{k=0}^{∞} (z^{-1} e^{iω})^k .    (155)

The power series in eqn (155) converges only for |z| > |e^{iω}|, and in this case we may express X(z) as

X(z) = 1/(1 − z^{-1} e^{iω}) = z/(z − e^{iω}) ,    (156)

which is a simple ratio of polynomials in z. This example also nicely illustrates several other features of the z transform. It can be seen from eqn (156) that the expression for X(z) becomes zero at z = 0, and becomes infinite for z = exp(iω). We say that X(z) has a zero at z = 0 and a pole at z = exp(iω), even though eqn (156) is no longer a valid expression for the z transform within the region of divergence. This is shown in Fig.50.
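The convergence behaviour of eqns (155) and (156) can be checked numerically. In this sketch ω and the evaluation point are arbitrary choices, and a finite partial sum stands in for the infinite series: outside the unit circle the partial sums match the closed form, while exactly at the pole they grow without bound.

```python
import numpy as np

omega = 0.7
pole = np.exp(1j * omega)          # pole of eqn (156) on the unit circle

def X_partial(z, terms=200):
    """Partial sum of the power series X(z) = sum_k exp(i*k*omega) * z^{-k}."""
    k = np.arange(terms)
    return np.sum(np.exp(1j * k * omega) * z ** (-k))

def X_closed(z):
    """Closed form z/(z - e^{i*omega}) of eqn (156)."""
    return z / (z - pole)

z_out = 1.5 * np.exp(0.3j)         # |z| > 1: inside the region of convergence
# At z = pole every term of the series equals 1, so the partial sum of
# n terms is exactly n: the series diverges on the boundary.
```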
FIG.50. A section of the complex plane showing the regions of convergence and divergence separated by the boundary circle of radius R centred on the origin. The simple exponential function of the text has a zero at the origin and a pole on the boundary.

Since the z transform of a sequence {xk} can be expressed as the ratio of an mth order polynomial to an nth order polynomial

X(z) = (b0 z^m + b1 z^{m-1} + ... + bm) / (z^n + a1 z^{n-1} + ... + an) ,
and because both polynomials may be written in factored form

X(z) = b0 (z − z1)(z − z2) ... (z − zm) / [(z − p1)(z − p2) ... (z − pn)] ,    (157)

we see that, in general, X(z) has m zeros and n poles. Furthermore these zeros and poles contain all the essential information about the function X(z) except for the multiplicative constant b0. The positions of these poles and zeros are thus crucial in applications of the z transform. One other very important point to be made here is that at least one of the poles is located on the boundary separating the region of convergence from the region of divergence, which can be seen for the simple example of Fig.50. This pole will always be that pole of X(z) which is located furthest from the origin, meaning that each pole lies either in the region of divergence or on the boundary. As may be expected, no poles exist in the region of convergence. The usefulness of this property is that a knowledge of the positions of the poles tells us where the boundary is between convergence and divergence.

Let us now see how the Fourier transform fits in here. The z transform may be viewed as a very general representation of a sequence over the whole complex z plane, but if we restrict z to the unit circle, i.e. z = exp(iω) or |z| = 1, then the sole variable becomes ω and the z transform reduces to

X(exp(iω)) = Σ_{k=-∞}^{∞} xk exp(−iωk) ,

which is the continuous Fourier transform of the sequence {xk}.
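For a finite causal sequence this reduction can be verified directly: evaluating the z transform on the unit circle at the N frequencies ω_j = 2πj/N reproduces the discrete Fourier transform of the sequence. (A sketch; the random test sequence is an arbitrary choice.)

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.normal(size=N) + 1j * rng.normal(size=N)   # finite causal sequence

def ztransform(x, z):
    """X(z) = sum_k x_k z^{-k} for a finite causal sequence."""
    k = np.arange(len(x))
    return np.sum(x * z ** (-k))

# Evaluate on the unit circle at the DFT frequencies omega_j = 2*pi*j/N:
omegas = 2 * np.pi * np.arange(N) / N
on_circle = np.array([ztransform(x, np.exp(1j * w)) for w in omegas])
# on_circle coincides with np.fft.fft(x), whose convention is
# X[j] = sum_k x_k exp(-2*pi*i*j*k/N).
```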
FIG.51. The stability of a sequence is related to the position of the poles of its z transform with respect to the unit circle. Stable systems have their poles inside the unit circle.
It is interesting to consider the types of time series which may arise and to discuss their behaviour in terms of their z transforms evaluated on the unit circle. If the unit circle lies in the region of convergence the time series has a bounded Fourier transform. If, on the other hand, the unit circle lies within the region of divergence the Fourier transform is unbounded, which means that the elements of the time series do not converge with increasing time. This latter condition will occur if any of the poles of the z transform lie outside the unit circle, for we have seen that the furthermost pole from the origin lies on the boundary between convergence and divergence. This is shown in Fig.51 for simple exponential time series. If the z transform of the series has a pole outside the unit circle this is equivalent to a series with exponentially increasing amplitude (case 1). In terms of
response theory this corresponds to the transfer function of an unstable filter. Case 2, where the pole lies on the unit circle, is also unstable; although the amplitude of the series does not diverge, neither does it converge. Case 3 is the system corresponding to the usual situation for NMR signals: the amplitude decays with time and the pole lies inside the unit circle. In some of the literature a polynomial in z having its roots inside the unit circle is described as minimum phase(7), or if its roots lie outside the unit circle as minimum delay(133). This nomenclature is confusing in the present context and I have avoided it.

The important thing is that if the spectrum of a time sequence is to be represented by a quotient of polynomials such as eqn (157) then the poles must lie within the unit circle if the spectrum is to make any sense. For remember that the methods which rely heavily on the z transform notation are those methods which deliberately extrapolate the time data beyond the measurement period: a spectrum with poles outside the unit circle would correspond to a divergent extrapolation which is physically unrealistic. Another point is the relevance of the position of the poles with respect to the unit circle in the context of resolution. The closer these poles lie to the unit circle, the sharper will be the spectral lines if the Fourier transform is employed, i.e. if the z transform is evaluated on the unit circle. Alternatively, if the spectrum is calculated using a z transform other than on the unit circle, we have the possibility of decreasing the spectral linewidth by choosing a circle of evaluation inside the unit circle. This fact has been put to use by Tang and Norris in their LPZ method.
Postscript - The search for the maximum of entropy is akin to the search for the holy grail(175): many seek and few find. This, however, does not discourage the proposal of new
algorithms, such as the recently suggested approach of Zhuang et al.(176) These authors follow the path of Gull and Daniell and look for a maximum of Q(λ) defined in Section 7.1.2.1. They show that the equations may be recast in the form of a Cauchy boundary-value problem(74) and that a solution may be obtained by defining an initial spectrum for λ = 0 and then by following the solution curve ∇²Q(dX/dλ) = ∇C, using the notation of Section 7.1.2. This leads to an iterative procedure in which λ is continually incremented and the spectrum can be updated from iteration to iteration (note that on the right-hand side of eqn (37) in reference (176) Fk should be replaced by fk). Provided that the λ increments are small enough a maximum entropy solution should be achieved. Zhuang et al. show that this algorithm is as efficient as that of Skilling and Bryan.

Gesmar and Led(177) have applied their functional modelling method to two dimensions, and their one-dimensional technique has been used to evaluate the differences in the 13C spectra of insulin from various sources.(178) Pisarenko's method (Section 6.1.7) has been adapted by Shinnar and Eleff(179) and applied both to synthetic data and to the 31P signals of a rat. Delsuc and Levy(180) have used spectral ME to recover J-coupling patterns in both one- and two-dimensional NMR, whilst Davies et al.(181) have compared spectral ME with the CLEAN program of Högbom(182,183) and conclude that the results of both approaches are similar but that CLEAN is faster.

The fundamental idea behind the application of Maximum Entropy to NMR is that the spectrum is related to a probability density whose entropy can be calculated. The simplest relationship is the one considered in Section 7.1.2, where the spectrum of Fig.5a is regarded as the probability density of the amplitudes of the isochromats of Fig.5c. This requires that the spectrum be positive and all the signals in-phase. Zero- and first-order phase effects
can be corrected for, and the procedure of Laue et al. allows the treatment of negative signals (Section 7.1.7). But these ad hoc modifications are not applicable to more general situations where the FID to be measured corresponds to magnetisation vectors brought to different starting positions on the x'y'z sphere by a multipulse experiment. The associated spectra may have signals whose phases take on any value between 0 and 2π, and the artificial restriction to a small number of discrete phases is a severe constraint. To circumvent this problem Daniell and Hore(184) have recently proposed maximising an entropy based on a novel probability density. In a classical sense this is equivalent to treating all orientations of the spins as equally likely, so that the unconstrained maximisation of the entropy leads to an isotropic distribution of magnetisation. Development of this idea produces an entropy which may be defined in terms of the modulus of a complex spectrum, thus overcoming the restriction to real positivity and the concomitant problem of signal phase. Daniell and Hore show that this classical approach has a quantum mechanical analogue in which the role of the probability density is taken by the diagonal elements of a density operator describing the system before excitation. For spins I = 1/2 this produces an entropy which differs from the traditional form given by eqn (87) by becoming quadratic in |Xj| for values of |Xj| which are small compared to the default A. This interesting new proposal is almost certain to rejuvenate the controversy associated with the application of Maximum Entropy.
Acknowledgements - It is a pleasure to acknowledge the support and advice of Prof. G. Binsch. The Bavarian Academy of Sciences is to be thanked for allowing me access to the machinery at the Leibniz Computing Centre, Munich. I am grateful to Dr. J.J. Led for sending me preprints of references 80, 177 and 178, and to Dr. J. Tang for a reprint. Furthermore, I am much obliged to Dr. R. de Beer for making the programs LPSVD and HSVD available. Finally, I am greatly indebted to Dr. G.J. Daniell for his valuable comments, and to Dr. P. Hore for communicating reference 184.
REFERENCES

1. S.SIBISI, Nature 301, 134 (1983).
2. R.P.FEYNMAN, 'Surely You're Joking, Mr. Feynman!', Unwin Paperbacks, Hemel Hempstead (1986).
3. J.C.LINDON and A.G.FERRIGE, Prog. Nucl. Magn. Reson. Spectrosc. 14, 27 (1980).
4. H.J.GRAY and A.ISAACS, A New Dictionary of Physics, Longman, London (1975).
5. S.HAYKIN, Communication Systems, 2nd Edn., Wiley, New York (1983).
6. A.PAPOULIS, Probability, Random Variables, and Stochastic Processes, 2nd Edn., McGraw-Hill, New York (1984).
7. E.A.ROBINSON, Proc. IEEE 70, 885 (1982).
8. B.R.FRIEDEN in Picture Processing and Digital Filtering, T.S.HUANG (Ed.), Topics in Applied Physics, Vol. 6, Springer, Berlin (1979).
9. E.FUKUSHIMA and S.B.W.ROEDER, Experimental Pulse NMR, Addison-Wesley, Reading, Mass. (1981).
10. D.I.HOULT, Prog. Nucl. Magn. Reson. Spectrosc. 12, 41 (1978).
11. C.-N.CHEN, D.I.HOULT and V.J.SANK, J. Magn. Reson. 54, 324 (1983).
12. C.J.TURNER and H.D.W.HILL, J. Magn. Reson. 66, 410 (1986).
13. R.R.ERNST and W.A.ANDERSON, Rev. Sci. Instrum. 37, 93 (1966).
14. D.SHAW, Fourier Transform NMR Spectroscopy, 2nd Edn., Elsevier, Amsterdam (1984).
15. R.R.ERNST, G.BODENHAUSEN and A.WOKAUN, Principles of Nuclear Magnetic Resonance in One and Two Dimensions, Oxford University Press, Oxford (1987).
16. N.AHMED and K.R.RAO, Orthogonal Transforms for Digital Signal Processing, Springer, Berlin (1975).
17. D.F.ELLIOTT and K.R.RAO, Fast Transforms, Academic Press, New York (1982).
18. R.N.BRACEWELL, The Fourier Transform and its Applications, 2nd Edn., McGraw-Hill, New York.
19. I.J.LOWE and R.E.NORBERG, Phys. Rev. 107, 46 (1957).
20. J.W.COOLEY and J.W.TUKEY, Math. Comput. 19, 297 (1965).
21. E.O.BRIGHAM, The Fast Fourier Transform, Prentice-Hall, Englewood Cliffs, NJ. (1974). (German translation: FFT, Schnelle Fourier-Transformation, Oldenbourg, München (1982)).
22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. ::: 53. 54. 55. 56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74.
H J.NUSSBAUMER, Fast Fourier Transfbrm and Convolution Algorithms, 2nd Edn., Springer, Berlin (1982). K.NIEDERDRENK, Die endiiche Fourierund Walsh - Tmsformation mit einer EinUntng in die Bildverarbeitung, Viewcg, Braunsehweig (1982). GERIVARD, IEEE Trans. ASSP-25, 250 (1977). LRRABINER and B.COLD, Theory and Application of Digital Signal Processing, Prentice-Ha& Englewood Cliffs, NJ. (1975). RAROBERTS and C.T.M~LLIS, DighaZ Signal Processing, Addison-Wesley, Reading, Maar. (1987). A.BAx, Two-Dimensional Nuclear Magnetic Resonance in Liquids, Delft University Press, Delft (1982). G.BODENHAUSEN, RFREEMAN, R.NIEIXRMEYER and D.L.T~RNER, J. Magn. Reson. 26, 133 (1977). J.KI?.ELERand D.NEUHAUS, J. Magn. Reson. 63, 454 (1985). K.NAGAYAMA, A.KUMAR, K.WUTIUUCH and R.R.ERNsT,J. Magn. Reson. 40, 321 (1980). A.E.DERoME, Modern NMR Techniques for Chemistry Research, Pergamon Press, Oxford (1987). P.J.HoRE,J. Magn. Reson. 62, 561 (1985). J.C.J.BARNA, E.D.LAIJE, M.R.MAYGER, JSKILLING and S.J.P.WORRALL,J. Magn. Reson. 73, 69 (1987). E.D.LA~E, K.O.B.POLLARD, JSIULLING, JSTAUNTON and A.C.SUTKOW~KI, J. Magn. Reson. 72. 493 (1987). D.ZI&SOW; On-iine Rechner in der Chemie, de Gruyter, Berlin (1973). G.U.YULE, Phil. Trans. Roy. Sot. London A226. 267 (1927). \ l.A.C~~zow. Proc. IEEE 7d. 907 (1982). ~.L.MARPLE; Digital Spectral ‘AnaZy& P&ice-Hall, Englewood Cliffs, NJ. (1987). SHAYKIN and S.KESLER in Nonlinear Methods of Spectrai Analysis, S.HAYKIN (Ed.), Topics in Applied Physics, Vol. 34, 2nd Edn., Springer, Berlin (1983). S.M.KAY and S.L.MARPLE, Proc. IEEE 69, 1380 (1981). ’ ’ S.L.MARPLE, Proc. IEEE 70, 1238 (1982): T.J.ULRYCH and M.00~ in Nonlinear Methods of Spectral Analysis, S.HA~KIN (Ed.), Topics in Applied Physics, Vol. 34, 2nd Edn., Springer, Berlin (1989). J.TANG and J.R.NORRIS, Chem. Phys. Lett. 131, 252 (1986). F.NI and H.A.SCHERAGA,J. Magn. Reson. 70, 506 (1986). J.TANG and J.R.NoRRIs,J. 1Magn. Reson. 69, 180 (1986). G.WALKER, Proc. Roy. Sot. London AlSl, 518 (1931). 
N.LEVINSON, J. Math. and Phys. 25, 261 (1947). J.DURBIN Rev. Instrum. Znr. Stat. 28, 233 (1960). J.P.BuRG, Proceedings of the 37th Meeting of the Society of Exploration Geophysisists, Oklahoma City (1967). N.ANDERSON, Prop. IEEE 66, 1581 (1978). L.J.FABER, Proc. IEEE 74, 1046 (1986). P.F.FOUGERE. E.T.ZAWALICK and H.R.RADosKI. Phvs. Earth Planet. Interiors 12, 201 (1976). W.Y.CHEN &d G.R.STEGEN, J. Geophys. Res. 79, 3019 (1974). D.N.SWINGLER, IEEE Trans. m-28, 257 (1980). . D.N.SWINGLER. Prop. IEEE 67. 1368 (1979). . T.J.ULRYCH aAd R.W.CLAYT~N, Phys. E&h Pianet. Interiors 12, 188 (1976). L.MARPLE, IEEE Trans. ASSP- 28, 441 (1980). C.L.LAWSON and R.J.HANsoN, Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ. (1974). J.TANG and J.R.NoRRIs, J. Chem. Phys. 84, 5210 (1986). M.MoRF, B.DICKINSON, T.KAILATW and A.VIEIRA, IEEE Trans. ASSP-25, 429 (1977). S.L.MARPLE, ZEEE Trans. ASSP-29, 62 (1981). P.F.FOUGERE, J. Geophys. Res. 82, 1051 (1977). C.H.CHEN, Nonlinear Maximum Entropy Spectral AnaZysis for Signai Recognition, Wiley, Chichester (1982). H.AKAIKE. Ann. Inst. Stat. Math. 21. 243 (1969). T.J.ULRY&H and T.N.BIsHoP, Rev. G&physi~~ & Space Physics 13, 183 (1975). H.AKAIKE, IEEE Trans. AC-19, 716 (1974). E.PARZEN, ZEJ%ETrans. AC - 19, 723 (1974j. V.Vrrr. P.BARONE. L.GUIDONI and E.MASSARO. I. Maen. Reson. 67. 91 (1986). ’ ’ ’ T.E.LA&DERs and ~.T.LAcoss, IEEE Trans. G<- 15, f6 (1977). V.VITI, E.MA%ARo, L.GUIDONI and P.BARONE, .I. Maan. Reson. 70, 379 (1986). . I R.G.C&RIE, GeophyS. J. R. Astr. Sot. 38, iv9 (1974). S.M.KAY, IEEE Trans. B-28, 292 (1980). V.F.PISARENKO, Ceophys. J. R. Astr. Sot. 33, 347 (1973). V.C.KLEMMA and AJLAuB, IEEE Trans. AC - 25, 164 (1980). G.A.KoRN and T.M.KoRN, Mathematical Handbook for scientists and Engineers, 2nd Edn., McGraw-Hill, New York (1968).
75. 76. 77. 78.
79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92.
93. 94.
95. 96. 97.
98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126. 127. 128. 129. 130.
H.BARKHIJJJSEN,R.DEBEER, W.M.M.J.B~VEE and D.VANORMONDT, J. Magn. Reson. 61, 465 (1985). P.A.BUSINGER and G.H.GOLUB, C-. ACM 12, 564 (1969). G.H.GOLUB and C.RF.IN%Xi, Numer. Math. 14, 403 (1970). B.S.GARB~%V, J.M.BoYLE, J.J.DONGARRA and C.B.MOLER, Matrix EigensrJem Routines EISPACK Guide Extension, Lecture Notes in Computer Science, Vol. 51, Springer, Berlin (1977). J.J.DONGARRA, C.B.MOLER, J.R.BuNcH and G.W.STEWART, LZNPACK User’s Guide, SIAM, Pbiiadelbbii (1979). H.GEsI& and‘J.J.Lku, J. Mkgn. Reson. 76, 183 (1988). .I.C.HCCH,.I. Magn. Reson. 64, 436 (1985). J.C.Hocn,]. Ma&r. Reson. 66, 580 (1986). D.N.SWINGLER, J. Geophys. Res. 84, 679 (1979). J.S.Lu.i and N.A.MALIK, IEEE Tnu~. m-29, 1215 (1981). M.KA~EH arid G.A.LIPF%RT. IEEE TIWW. ASSP-31, 438 (1983). J.TANC and J.R.NoRRls, J. ‘Magn. Reson. 78, 23 (i988): ’ R.KUMARESAN, L.L.SCHARP and A.K.SHAW, ZEEE Trans. ABSP-$4, 637 (1986). M.A.JENKINS and J.F.TRAIJB, Commun. ACM 15, 97 (1972). P.BARONE, L.GUIDOM, E.MASSAROand V.VITI, J. Magn. Reson. 73, 23 (1987). D.W.Tuprs and R.KUMARESAN, Prop. IEEE 68, 419 (1980). R.KUMARESAN and D.W.TUPTS, Prue. IEEE 68, 1354 (1980). D.W.TvPrs and R.KUMARESAN, proc. IEEE 70, 975 (1982). D.W.ms and R.KUMARESAN, IEEE Trans. ASSP-30, 671 (1982). R.KUMARESANand D.W.Twrs; IEEE Trans. ASSP-Jo; 833 (1982). H.BARKH~JIJSEN,R.DEBEER and D.VAN~RMONDT, .I. Magn. Rcson. 67, 371 (1986). M.A.DEL&, F:NI and G.C.LE~Y, J. Magn. Resin. 73;548 (1987). B.PORAT and BFRIEDLANDER, IEEE Trans. ASSP-34, 1336 (1986). H.BARKHIJTJSEN,R.DEBEER and D.VANGRMONDT, J. Magn. Reson. 73, 553 (1987). H.BARK~SEN, R.DEBEER and D.VANGRMO~, J. Magn. Reson. 64, 343 (1985). Y.BEPPIJ and LNINOMIYA, Comp. Phys. Commun. 23, 123 (1981). J.TANG, C.P.LIN, M.K.BOWMAN and J.R.NoRRIs, J. Sign. Reson. 62, 167 (1985). K.~TEIGLITZ and B.DICKINSON, IEEE Trans. ASSP-30, 984 (1982). B.SV~JGARD, Bm 7, 240 (1967). A.VANDEBOS, in Handbook of Mcasurem em Scknce,P.H.SYDENHAM (Ed.), Wiley, London (1982). B.NOORBEHESHT,H.LEE end D.R.ENZMANN, J. Magn. 
Reson. 73, 423 (1987). A.E.SCHUSSHEIM and D.COWBURN, J. Magn. Reson. 71, 371 (1987). J.H.MCCLELLAN, Proc. IEEE 70, 1029 (1982). D.E.DUDGEON and R.M.M ERSEREAU, Multidimensional Digital Signal Processing, Prentice-Hail, Englewoed Ciiffs, NJ. (1984). C.L.NIKIAS end M.R.RAGHWEER, IEEE ASSP Spectrum Estimation Workshop ZI, 213 (1983). R.KUMARESAN and D.W.Tus’rs, Rot. IEEE 69, 1515 (1981). Y.MANASSEN. G.NAVON and C.T.W.MOONEN, .I. Magn. Reson. 72, 551 (1987). E.T.JAYNES in proceedings of the First Syr&osium~ on Engirxering Ap&&ons of Random Function Theory and Probability, J.L.BOCDANOW and F.KOZIN (Eds.), Wiley, New York (1963). ETJAYNES, IEEE Trans. E-4, 227 (1968). E.T.JAYNES in The Maximum Entropy Form&m, R.D.LEVINE and M.TRIBUS (Eda.), MIT Press, Cambridge, Mass. (1979). C.E.SHANNON, Beu System Tech. J. 27, 379 (1948). C.E.SHANNON and W.WEAVER, The Mathematical Theory of Communication, University of Iilinois Press, Urbana (1962). E.T.JAYNJ%, Phys. RCV. 106, 620 (1957). A.I.KIUNCHIN, MathematicalFoundationsof Information Theory, Dover, New York (1957). R.V.L.HARTLEY. BefJ System Tech. .I: 7. 535 (1928). L.BRILLOUIN, S&rice &d Infbrmatio~ Thkry, &I E&L, Academic Press, New York (1962). K.G.DENBIGH and JSDENBIGH, Entrupy in relation to incomfiete knowlodge, Cambridge University Press, C&nbridgc (1985). M.TRIBUS in The Maximum Entropy Formalism, R.D.LEVINE and M.TRIBUS (Eds.), MIT Press, Cambridge, Mass. (1979). E.T.JAYNEs, Phys. Rw. 108, 171 (1957). B.R.FRIEDEN, J. Opt. Sot. Am. 62, 511 (1972). E.T.JAYNEs, Pmt. IEEE 70, 939 (1982). S.J.WERNECKE and L.R.d’AuDAlUO, IEEE Trans. C-26, 351 (1977). S.F.G~LL and G.J.DANIELL, Nature 272, 686 (1978). D.M.COLLINS, Nature 298, 49 (1982). RFLETCHER and C.M.REEVES, Computer J. 7, 149 (1964). J.%ILLING and R.K.BRYAN, Mon. Not. R. Astr. Sot. 211, 111 (1984). M.LAGALJ_Y and WFRANZ, Vorksungen iiber Vektorrechnung, 6th Fdn., Akademisehe Verkgsgeseiisebaft, Leipzig (1959).
131. W.B.DAVENPORTand W.L.ROOT. An Introduction to the 271eor~ of Random Shafa and Noise. MeGraw - Hill, New York (1958). 132. J.P.BuRG, NATO Adv. Study Inst. on Signal Proc., Ensebede,NetberIsnds(1968). 133. D.E.SMYLIE, G.K.C.CLARKE and T..I.ULRYCH in Methods in Com~u,utationd Phvsics. B.ALDER. S&RNBA& and M.ROTENBERG@e&e Ede.), Vol. 13, Geoph&s, B.A.‘Bo~T, (Vol. Ed.); AcademicPress, New York (1973). 134. S.M.KAY, Modern SpectralEstimation, PrentiecHaB, EnglewoodCliffs (1988). 135. N.R.GOODMAN, Ann. Math. Statist. 34, 152 (1963). 136. G.SZE&, Math. 2. 6, 167 (1920). 137. J.A.EDWARD and M.M.PITELsoN, IEEE Trans. Inf Theory lT- 19, 232 (1973). 138. D.VAN~RMONDTand K.NEDERVEEN,Chem. Phys. Lm. 82, 443 (1981). and J.H.MCLzLLAN, IERR Trans. ABSP-32, 410 (1984). 139. J - P.Scxi~ Theory of Probability, 3rd Edn., Oxford UniversityPress, Oxford (1961). 140. H.JEFPREYS, 141. G.J.DANIELLand S.P.G~LL, IRR Proc. 127E, 170 (1980). 142. J.D.O’SLILLI~AN and M.M.KOME~AROFF in Indirect Imaging, J.A.ROBF.RTS(Ed.), Cambridge UniversityPress, Cambridge(1984). 143. A.K.LIVESEY and J.SKu.r.rNG,Acta. Opt. All, 113 (1985). 144. B.R.FRIEDEN,Proc. IEEE 73, 1764 (1985). 145. B.R.Mu%cus, Proc. IEEE 74, 762 (1986). 26 (1980). 146. J.E.SHOREand R.W.JOHNSON, IEEE Trans. lT-26, 942 (1983). 147. R.W.JOHNSONand J.E.SHORE, IEEE Trans. lT-29, 148. Y.TIKOCHIN~KY,N.Z.TISHBY and R.D.LE~IN& Phys. Rev. I_&. 52,1357 (1984). 149. JSIULLING, Nature 309, 748 (1984). 150. S.F.GLILLand J.%LLING in Indirect Imaging, J.A.ROBERTS (Ed.), Cambridge University Press, Cambridge(1984). 151. E.D.LAuE, JSKILLING,JSTAUNTON, S.&BISI and R.G.BRERETON,J. M&p. Reson. 62, 437 (1985). 152. K.M.WRIGHT and P.S.BELTON,Mol. Phys. 58, 485 (1986). 153. KBLUM, DensityMatrix Theoryand Apphcarions,Plenum Press, New York (1981). 154. R.C.TOLMAN, The Principles of St.&tkaI Mechanks, Oxford UniversityPress, Oxford (1954). 155. S.SIBISI,J.!%ILLING, R.G.BRERETON,E.D.LAuE and J.STAUNTON,Narore 311, 446 (1984). 156. 
F.NI, G.C.LE~~ and H.A.SCH~RAGA,J. Magn. Reson. 66, 385 (1986). 157. E.D.LAuE, J.SKILLINGand J.STAUNTON,J. Magn. Reson. 63, 418 (1985). 158. J.F.MARTIN, J. Magn. Reson. 65, 291 (1985). 159. D.L.PIULLIPS,J. Assoc. Comput. Mach. 9, 84 (1962). and A.C.KAK. Diairal PieroreProcessimr,2nd Edn., Vol. 1, AeademiePress. New 160. A.ROSENPELD York (1982). 161. P.S.BELTONand K.M.WRIGHT, J. Magn. Reson. 66, 564 (1986). 162. B.R.HuNT, IEEE Trans. AC- 17. 703 (19721. ‘Reson. 69, 386 (1986). 163. P.J.HoRE &d G.J.DANIELL,J. h&n. 164. E.D.L,+uz, M.R.MAYGER, J.SKILLINGand J.STAUNTON,J. Magn. Reson. 68, 14 (1986). 165. D.MARION and K.W~~THRICH,Biodhem. Biophys. Res. Commun. 113, 967 (1983). 166. D.J.STATES,R.A.HABERKORNand D.J.RuBEN,J. Magn. Reson. 46, 286 (1982). Reson. 58, 315 167. A.P.MEHLKOPP, D.KORBEE, T.A.RGGLEMAN and RFREEMAN, J. Magn. (19841. 168. J.C.J.B~NA and E.D.LAu~, J. Mbgn. Reson. 75, 384 (1987). 401 (1981). 169. J.S.LIM and N.A.MALIK, IEEE Trans. ASSP-29, 170. J.W.WOODS, IEEE Trans. IT-22, 552 (1976). 171. R.K.BRYAN and JSKILLING, Mon. Not. R. Astr. Sot. 191, 69 (1980). to Random Processes with Applications to Signals and Systems, 172. W.A.GARDNER, Introduction MxmiBen. New York (1986). 173. EIJuRY, Theory and A&i&on of the z Transbrm Method, Wiley, New York (1964). 174. J.A.CADZOW, Discrete - Time Systems, Prentice-HsU,EnglewoodCiiis, NJ. (1973). 175. T.MALORY, Le Morre &Arthur, PenguinBooks, Harmondsworth (1969). 176. X.ZHIJANG, E.~STEVOLDand R.M.HAR=IcK, IEEE Trans. ASSP-35, 208 (1987). 177. H.GESMARand J.J.LED, J. Magn. Reson. 76, 575 (1988). 178. J.J.Lzu, H.GESMAR, K.R.H~NRs and F.B.I-IAN~EN, J. Am. Chem. Sot. 110, 4165 (1988). 179. M.SI-IINNARand S.M.BLEFF,J. Magn. Reson. 76, 200 (1988). 180. M.A.DELSUC and G.C.LEVY, J. Magn. Reson. 76, 306 (1988). 181. SJ.DAVIES, C.BAUER, P.J.HoRE and RFREEMAN,J. Magn. Reson. 76, 476 (1988). 182. J.A.H&BoM, ASWOII. Astrophys. S~ppl. 15, 417 (1974). 183. A.J.SHAKA,J.KEELERand RFREEMAN,J. Magn. Reson. 56, 294 (1984). 184. 
G.J.DANIELLand P.J.HoRE, privatecommunication. 185. D.G.CHILDERS(Ed.), Modern Spectrum Analysis, IEEE Press, New York (1979), disrributedby Wiley, New York. 186. S.B.KESLER(Ed.), Modern SpecrromAnalysis, II, IEEE Press, New York (1986). That edited by CHiLDERSincludesreferences(48,59,64, The last two books are collections of reprints. 65,66,68,125,132,137 and 170). That from KESLERcontainsreferences(36,4O,56,72,83,92,108 and 169).