ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 117
Resolution Reconsidered--Conventional Approaches and an Alternative

A. VAN DEN BOS and A. J. DEN DEKKER

Department of Physics, Delft University of Technology, 2600 GA Delft, The Netherlands
I. Introduction
II. Classical Resolution Criteria
   A. Introduction
   B. Rayleigh and Rayleigh-like Resolution Criteria
   C. Proposals to Improve the Rayleigh Resolution
   D. Conclusions
III. Other Resolution Criteria
   A. Introduction
   B. Extrapolation and Superresolution
      1. Introduction
      2. Extrapolation
      3. Inverse Filtering
   C. Approaches Based on Information Theory
   D. Approaches Based on Signal-to-Noise Ratio
   E. Approaches Based on Decision Theory
   F. Approaches Based on Asymptotic Estimation
   G. Conclusions
IV. Modeling and Parameter Estimation
   A. Introduction
   B. Parametric Statistical Models of Observations
   C. Parameter Estimation
      1. Dependence of the Probability Density Function of the Observations on the Parameters
      2. Limits to Precision: The Cramér-Rao Lower Bound
      3. Maximum Likelihood Estimation
      4. Limits to Resolution of Parameters: A Numerical Example
   D. Conclusions
V. Elements of Singularity Theory
   A. Introduction
   B. Definitions and Notions
   C. Functions around Stationary Points
      1. The Morse Lemma and the Splitting Lemma
      2. Simple Examples of Singular Representations
      3. Bifurcation Sets
   D. Functions near Singularities
      1. Derivation of the Reduction Algorithm
      2. Useful Polynomial Substitutions
   E. Conclusions
VI. Singularity of Likelihood Functions
   A. Introduction
   B. Likelihood Functions for Two-Component Models
   C. Stationary Points of the Likelihood Functions
   D. Examples of Stationary Points
   E. The Hessian Matrix of the Log-Likelihood Function
      1. Change of Coordinates
      2. The Hessian Matrix at the One-Component Stationary Point
      3. Summary, Discussion, and Conclusions
   F. The Degenerate Part of the Likelihood Function
      1. The Quadratic Term
      2. The Cubic Term
      3. The Quartic Term
   G. Summary and Conclusions
VII. Singularity and Resolution
   A. Introduction
   B. Coinciding Scalar Locations
      1. Conditions for Resolution
      2. Numerical Computation of the Resolution Criterion
      3. Conditions for the Use of the Criterion
      4. Application to Rayleigh's sinc-Square Model
      5. Resolution as a Property of Observations
      6. Resolution from Error Disturbed Observations
      7. Probability of Resolution and Critical Errors
   C. Coinciding Two-Dimensional Locations
      1. Conditions for Resolution
      2. Application to the Airy Model
   D. Nonstandard Resolution: Partial Coherence
   E. A Survey of Related Literature
   F. Summary and Conclusions
VIII. Summary and Conclusions
References
I. INTRODUCTION

The primary purpose of this article is to propose and explain an alternative definition of two-component resolution which may also be used as a resolution criterion. Two-component resolution is the ability to distinguish two overlapping component functions of the same family in a set of observations. For incoherent optics and electron optics, the components are often point spread functions. Thus two-component resolution is the same as what in the literature is called two-point resolution. The reason for the alternative definition is that the existing approaches, however useful, are often specialized or stem from an era when neither adequate photon or electron detectors nor fast computers and software existed.
So that the differences with the proposed approach can be seen, the existing approaches will first be reviewed briefly. Next, statistical and mathematical introductory material needed by the new approach will be presented. This material consists of two distinct parts. The first part is a description of the statistical model of two-component observations along with the most important statistical methods for estimation of the parameters of the components. The second part consists of elements of nonlinear mathematics. This introductory material is used later in the derivation of the remarkable and unusual properties of the estimates of the component parameters. The proposed resolution definition and the corresponding resolution criterion will be shown to be a direct consequence of these properties.

The detailed outline of the sections in this article is as follows. Section II is a description of classical resolution criteria. These are modified versions of Rayleigh's well-known resolution criterion. The main characteristic of classical resolution criteria is that they are simply measures of the width of the main lobe of the point spread function concerned. This is a consequence of the fact that these criteria are intended to describe the resolving capabilities of a human observer, and in this limited sense they are operational. Also, these definitions are based on the exact functional form of the point spread function, that is, on limitations of the optical system concerned, not on imperfections of the optical observations.

Section III starts with a brief description of digital image processing methods intended to improve resolution. The resolution achieved by these methods exceeds the classical limits but is empirically found to be limited by noise. Therefore, these methods show that the limit to resolution is not the width of the point spread function or, equivalently, diffraction, but imperfections of the observations. In Section III.C, literature is addressed relating information theory to resolution. The described results, which are not intended to assess practical limits to two-component resolution, also emphasize the influence of the signal-to-noise ratio. Then, in Section III.D, more or less empirical methods are discussed directly linking resolution with the signal-to-noise ratio. The signal-to-noise ratio is equally important in theories, briefly reviewed in Section III.E, relating resolution to decision theory. The methods discussed in Sections III.C through III.E share an emphasis on the influence of noise on resolution. Methods described in Section III.F, which make use of parametric statistical models of the observations combined with statistical parameter estimation techniques, do the same but are much more general. However, the examples of such methods found in the literature require three conditions to be met: the parametric component function must be correct, the probability density of the observations must be known, and the number of observations must be infinite since asymptotic expressions for the variances are used.
Sections IV through VII introduce and explain the alternative resolution definition and the corresponding criterion proposed in this article. These will also be based on parametric models of the observations and parameter estimation methods. However, the conditions required by the parametric methods described in Section III.F will not be imposed, simply because they are not needed. Therefore, the alternative resolution definition and criterion are much more general.

First, in Section IV, the parametric statistical model of the two-component observations will be carefully described. A description will also be given of the most important method for estimation of the parameters of such observation models: the maximum likelihood method. The maximum likelihood estimate is the solution for the parameters that maximizes the so-called likelihood function, a function of the parameters directly derived from the probability density function of the observations. The asymptotic properties of the maximum likelihood method will also be discussed. Section IV is concluded by a numerical example. In this example, least squares estimates of the locations of a pair of strongly overlapping Gaussian peaks are computed from simulated, noise-corrupted observations. The experiment is repeated many times to assess the distribution of the estimates. The results show that the properties of parameter estimates from numbers of observations such as are usual in practice may be fundamentally different from those computed from asymptotic expressions. In particular, the asymptotic distributions do not predict the frequent occurrence of exactly coinciding estimates of the locations, resulting in a one-component solution from two-component observations. Therefore, these solutions disclose the existence of limits to resolution not revealed by asymptotic considerations.

In Sections V through VII, it will be seen that this remarkable behavior of the location estimates is a consequence of the way the solutions for these parameters depend on the observations. Section V presents elements of singularity theory, a branch of mathematics dealing with structural change of parametric functions under the influence of changes of the parameters, where structure is the pattern of the stationary points of the function. In Section VI, the singularity theory discussed in Section V will be used to show that under the influence of modeling errors or statistical errors in the observations the structure of the likelihood function may change. In Section VII, the results of Section VI are used to explain the remarkable coincidence of estimates of the locations of the components under the influence of the observations and to derive the main result of this article: a criterion distinguishing the observations for which the location estimates are distinct from those for which the estimates coincide. If two components are defined as resolved if the estimates of their locations are distinct, the derived criterion is a natural resolution criterion. Moreover, by use of this criterion, it is easy to find out whether a set of observations corresponds to resolved components or not. In Section VII.C the criterion is extended to include resolution in two dimensions
and in Section VII.D it is applied to partially coherent observations. A survey of relevant literature concludes Section VII. Section VIII is devoted to conclusions.
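Although the numerical example itself is presented in Section IV, the experiment described above is easy to emulate. The following sketch is our illustration, not the authors' code; the peak width, separation, noise level, and the coincidence threshold are all illustrative assumptions. It repeatedly fits a two-Gaussian model to simulated noisy observations and counts how often the two location estimates coincide.

```python
# Minimal sketch of the two-Gaussian resolution experiment outlined above.
# All numbers (width, separation, noise level, trial count) are illustrative
# assumptions, not values taken from the article.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 101)                    # measurement points
sigma, half_sep, noise = 1.0, 0.2, 0.02        # strongly overlapping peaks

def model(p, x):
    b1, b2, a = p                              # two locations, common amplitude
    return a * (np.exp(-0.5 * ((x - b1) / sigma) ** 2)
                + np.exp(-0.5 * ((x - b2) / sigma) ** 2))

true = np.array([-half_sep, half_sep, 1.0])
coinciding, trials = 0, 200
for _ in range(trials):
    y = model(true, x) + noise * rng.standard_normal(x.size)
    fit = least_squares(lambda p: model(p, x) - y, x0=true)
    b1, b2, _ = fit.x
    if abs(b1 - b2) < 1e-3:                    # threshold for calling the
        coinciding += 1                        # estimates coincident
print(f"coinciding location estimates in {coinciding}/{trials} trials")
```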
II. CLASSICAL RESOLUTION CRITERIA
A. Introduction

Resolution is widely used as a performance measure of imaging systems in optics and electron optics. It was Lord Rayleigh who made the concept of resolution popular by introducing his well-known criterion for two-point resolution. The criterion defines the possibility of perceiving separately two point sources in an image formed by a diffraction-limited imaging system or objective lens. The model of an object in the form of two points originated from astronomical problems, in which many objects are effectively point sources. In this section, the Rayleigh resolution criterion and criteria related to it are briefly discussed. Furthermore, attempts to improve the Rayleigh resolution are described. Finally, it will be pointed out that, in spite of their usefulness in the quality assessment of imaging systems, the so-called classical resolution criteria, with which this section is concerned, do not provide absolute limits to resolution.
B. Rayleigh and Rayleigh-like Resolution Criteria

The Rayleigh criterion states that two point sources of equal brightness are just resolvable when the maximum of the intensity pattern produced by the first point source falls on the first zero of the intensity pattern produced by the second point source (Rayleigh, 1902). Consequently, the Rayleigh resolution limit is given by the distance between the central maximum and the first zero of the diffraction-limited intensity point spread function of the imaging system concerned. The Rayleigh resolution criterion can be generalized to include point spread functions that have no zero in the neighborhood of their central maximum by taking the resolution limit as the distance for which the ratio of the value at the central dip in the composite intensity distribution to that at the maxima on either side is equal to 0.81. This corresponds to the original Rayleigh limit for a rectangular aperture. Rayleigh's choice of resolution limit is based on presumed resolving capabilities of the human visual system when it is used to detect differences in intensity at various points of the composite intensity distribution. Well aware of the arbitrariness of the proposed criterion, Rayleigh himself evaluated it as follows: "This rule is convenient on account of its simplicity and it is sufficiently accurate in view of the necessary
uncertainty as to what exactly is meant by resolution" (Rayleigh, 1902; p. 85). Since Rayleigh's days, several other resolution criteria have been proposed which are similar to Rayleigh's. Notable examples of such so-called classical criteria for two-point resolution are those of Buxton (1937), Houston (1927), Schuster (1924), and Sparrow (1916). Schuster's criterion states that the two point sources are just resolved if no portion of the main lobe (central band) of the intensity distribution of one overlaps the main lobe of the other. This criterion provides a resolution limit that is twice that of Rayleigh's. According to Houston's criterion the two point sources are just resolved if the distance between the two maxima in the composite intensity distribution produced by the two point sources equals the full width at half maximum (FWHM) of the intensity distribution of either point source. The criterion proposed by Buxton is quite similar to that of Houston. Buxton, however, did not consider the intensity distributions, but their square roots, the amplitude distributions. According to Buxton's criterion, at the limit of resolution the amplitude distributions of the two point sources should intersect at the points of inflection. It was Sparrow who pointed out that the Rayleigh criterion as originally proposed was not, as many researchers assumed, intended as a measure of the actual limit of resolution; it was instead intended as a figure of merit for different instruments. According to Sparrow, the resolution limit for two point sources of equal brightness is given by the undulation condition, that is, by the condition that the central minimum (dip) in the composite intensity distribution just disappears. Then both central maxima and the minimum between them coincide. In this case, a central dip in the composite intensity distribution could never be detected, not even if visual inspection were replaced by a hypothetical perfect intensity measurement instrument. The reason is simply that there is no such dip anymore. The Sparrow limit is, therefore, often referred to as the natural or physical limit to resolution. A clear description of the previously mentioned and several other classical resolution criteria is given by Ramsay and coauthors (1941). It is worth mentioning that although in their original context classical resolution criteria such as Rayleigh's and Sparrow's tacitly assume incoherent illumination, they have later been generalized to include partially as well as fully coherent illumination (Born and Wolf, 1980; Lummer and Reiche, 1910; Luneberg, 1944; McKechnie, 1972).
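The dip-to-maximum ratio of 0.81 quoted above is easily verified numerically for the rectangular-aperture (sinc-square) point spread function. The short sketch below is ours, not from the article; it evaluates the composite intensity of two equal sources at the Rayleigh separation and recovers the ratio 8/π² ≈ 0.81.

```python
# Verify the Rayleigh dip ratio for a rectangular aperture (sinc-square PSF).
# Units: the first zero of the PSF lies at x = 1, so the Rayleigh separation is 1.
import numpy as np

def sinc2(x):
    return np.sinc(x) ** 2                        # np.sinc(x) = sin(pi x)/(pi x)

x = np.linspace(-2, 2, 4001)
composite = sinc2(x - 0.5) + sinc2(x + 0.5)       # two sources, Rayleigh-separated
dip = composite[np.argmin(np.abs(x))]             # central value at x = 0
peak = composite.max()                            # maxima near the source positions
print(dip / peak)                                 # ~0.81, i.e., 8 / pi**2
```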
C. Proposals to Improve the Rayleigh Resolution

All classical resolution criteria express resolution in terms of the width of the main lobe of the point spread function of the diffraction-limited imaging
system. The narrower the point spread function, the better the resolution. Even recently, it was pointed out once more that a good image resolution measure should be proportional to the width of the point spread function of the imaging system (Wang and Li, 1999). The shape of the point spread function depends on the shape of the system's aperture and the spatial transmittance in the plane containing this aperture. The function describing the spatial transmittance is called the pupil function (Castleman, 1979). A uniform transmittance over the aperture corresponds to a pupil function that is equal to one at points in the aperture and equal to zero elsewhere. This gives rise to conventional point spread functions such as Airy functions and sinc-square functions for circular and rectangular apertures, respectively (Castleman, 1979). However, it is possible to modify the transmittance over the aperture by implementing a suitable filter. Such a modification of the uniform amplitude transmission over the aperture (or pupil) is known as apodization.

In the literature, many attempts to decrease the width of the central band of the system's point spread function by means of apodization have been described (Barakat, 1962; Barakat and Levin, 1963; Jacquinot and Roizen-Dossier, 1964; Osterberg, 1950; Osterberg and Wilkins, 1949; Osterberg and Wissler, 1949; Wolf, 1951). Wilkins (1950) has shown that, in principle, apodization can reduce the width of the main lobe of the point spread function indefinitely, so that, theoretically, unlimited resolution in the Rayleigh sense can be attained. There is, however, no practical interest in carrying the apodization process to extremes since a narrowing of the main lobe will generally have the side effect of a considerable rise of the level of the side lobes of the modified point spread function (McKechnie, 1972). Furthermore, as shown by Luneberg (1944), the point spread function with the highest central maximum corresponds to uniform transmittance, so that any transmittance other than uniform gives rise to a lower central maximum of the point spread function. This may be undesirable. For example, it can be shown that obstructing the central part of the aperture narrows the main lobe of the system's point spread function and thus increases the resolution in the Rayleigh sense. However, this effect is accompanied not only by an increasing loss of light in the image as more of the aperture is obstructed but also by a considerable rise in the level of the side lobes of the point spread function since a larger amount of light is diffracted from its proper geometrical position. In spectral analysis, such a leak of power from the main lobe to the side lobes is called leakage. Obviously, leakage makes the interpretation of the images more difficult.

Because of these practical limits, later work on apodization focused on the problems of finding, for a specified Rayleigh limit, the pupil function (and associated point spread function) having the maximum central irradiance (Kibe and Wilkins, 1983, 1984; Wilkins, 1950) and the pupil function corresponding to a point spread function having as much of its energy as possible concentrated in a circle of a specified radius (Clements and Wilkins, 1974).
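The trade-off described above, where a central obstruction narrows the main lobe at the price of higher side lobes, can be made concrete with a small computation. In the sketch below (our illustration; the obstruction ratio 0.6 is an arbitrary assumption) the standard annular-aperture diffraction formula is used to compare a clear and a centrally obstructed circular aperture.

```python
# Compare the PSF of a clear circular aperture (the Airy pattern) with that of
# the same aperture whose central part is obstructed. The obstruction ratio
# eps = 0.6 (inner radius / outer radius) is an arbitrary illustrative choice.
import numpy as np
from scipy.special import j1

def airy_amp(v):
    v = np.where(v == 0.0, 1e-12, v)              # 2*J1(v)/v -> 1 as v -> 0
    return 2.0 * j1(v) / v

def annular_psf(v, eps):
    """Normalized intensity PSF of an aperture with obstruction ratio eps."""
    if eps == 0.0:
        amp = airy_amp(v)
    else:
        amp = (airy_amp(v) - eps**2 * airy_amp(eps * v)) / (1.0 - eps**2)
    return amp ** 2

v = np.linspace(0.0, 12.0, 2401)                  # reduced radial coordinate
clear = annular_psf(v, 0.0)
obstructed = annular_psf(v, 0.6)

def first_min(I):
    # index of the first local minimum (the first dark ring)
    return np.argmax((I[1:-1] < I[:-2]) & (I[1:-1] < I[2:])) + 1

i_c, i_o = first_min(clear), first_min(obstructed)
print("first dark ring: clear v=%.2f, obstructed v=%.2f" % (v[i_c], v[i_o]))
print("first side lobe: clear %.4f, obstructed %.4f"
      % (clear[i_c:].max(), obstructed[i_o:].max()))
```

Running this shows the obstructed aperture's first dark ring moving inward (a narrower main lobe) while its first side lobe rises well above the Airy value of about 0.017, which is the leakage effect described in the text.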
D. Conclusions
It may be concluded from the foregoing subsections that classical resolution criteria in fact do not concern detected images but calculated images (Ronchi, 1961), that is, noise-free images exactly describable by a known parameterized two-component mathematical model. The shape of the component function, that is, the point spread function, is assumed to be exactly known. However, the classical resolution criteria disregard the possibility of using this a priori knowledge about the point spread function to extract analytic results from observations by means of deconvolution or by model fitting using parameter estimation methods. Obviously, in the absence of noise, numerically fitting the known two-component model to the images with respect to the component locations and amplitudes would result in a perfect fit. The resulting solutions for these locations and amplitudes would be exact, and despite diffraction no limit to resolution would exist, no matter how closely spaced the two point sources.

In reality, however, calculated images do not occur. Instead, one has to deal with detected images. In this case, the shape of the point spread function will never be known exactly. There will always be irregularities or aberrations that are unknown or incorrectly described by the mathematical model adopted. This means that systematic errors are introduced. Furthermore, detected images will never be noise free, so that nonsystematic errors will also be present. It is these errors, both systematic and nonsystematic, that ultimately limit the resolution (den Dekker and van den Bos, 1997).

Although classical resolution criteria are still widely used as a measure of the relative merit of different imaging systems, it should be recognized, and is in fact recognized by many, that such criteria do not provide absolute limits to resolution. If there is an ultimate limit to resolution, it must be a consequence of the fact that, as a result of systematic errors (modeling errors) and nonsystematic errors (statistical fluctuations of the observations), detected images are never exactly described by the model adopted.
III. OTHER RESOLUTION CRITERIA
A. Introduction
As mentioned in Section II, the classical Rayleigh resolution criterion is based on presumed limitations to the resolving capabilities of the human visual system. Since Rayleigh's days, visual inspection has been supplemented with intensity measuring instrumentation and, above all, digital computing facilities. These new facilities brought about a reconsideration of the notion of resolution and as a result many resolution criteria different from the classical
Rayleigh-like criteria have been proposed in the literature. In this section, a selection of these more modern criteria will be reviewed. Furthermore, a number of methods to enhance resolution will be discussed. The aim is to show that these resolution criteria and resolution enhancing methods are based both on the object and on the noise in the observations. It will be shown that it is only in the absence of any a priori knowledge about the object that resolution is limited by diffraction. In other cases, resolution is not limited by diffraction but by the noise in the observations.
B. Extrapolation and Superresolution

1. Introduction

Generally, lenses and other imaging systems may be treated as two-dimensional shift-invariant linear systems. For coherent illumination, the imaging system is linear in complex amplitude whereas for incoherent illumination the system is linear in intensity, which is the square of the modulus of the amplitude. The characteristics of a shift-invariant linear system are defined by its point spread function or, equivalently, by its transfer function, which is the Fourier transform of the point spread function. Transfer functions are particularly useful in light and electron microscopy and in camera applications whereas point spread functions are more often used in the quality assessment of telescopes or spectroscopic instruments. The coherent point spread function is merely the Fourier transform of the system's pupil function. The transfer function of a coherent system, which is also known as the amplitude transfer function or coherent transfer function, therefore has the same shape as the system's pupil function. The incoherent point spread function is the squared modulus of the coherent point spread function, which is equal to the power spectrum of the pupil function. The incoherent transfer function, which in optics is known as the optical transfer function, is the autocorrelation function of the pupil function. This implies that both coherent and incoherent transfer functions are strictly bandlimited; that is, spectral components with frequencies beyond the cutoff frequency are not transferred by the system and therefore do not appear in the images. Furthermore, it has been shown that partially coherent imaging systems are also strictly bandlimited (Born and Wolf, 1980).

The cutoff frequency of the imaging system is often used as a measure of resolution. This resolution measure, which is also known as the diffraction limit, is directly related to the Rayleigh-like classical resolution limits, which are a measure of the width of the main lobe of the point spread function involved. This can be seen as follows. Since the transfer function of the imaging system is the Fourier transform of the system's point spread function, a point
spread function with a narrow main lobe corresponds to a transfer function with a high cutoff frequency. In fact, the product of the Rayleigh resolution limit and the cutoff frequency of the system involved is equal to a constant that is close to or even equal to one (Castleman, 1979). Transfer functions in electron microscopy are bandlimited as well. Resolution measures used in electron microscopy, such as the fringe resolution, also known as line resolution, and the information-limit resolution of the microscope, are likewise based on this bandwidth limitation (O'Keefe, 1992; Spence, 1988).
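The relations just described, pupil to coherent point spread function to incoherent transfer function, can be traced numerically in a few lines. The following sketch is ours; the one-dimensional grid and pupil width are arbitrary assumptions. It shows that the incoherent transfer function, being the autocorrelation of the pupil, cuts off at twice the coherent cutoff frequency and is strictly zero beyond it.

```python
# Trace pupil -> coherent PSF -> incoherent PSF -> OTF for a 1-D system.
# The grid size and pupil width are arbitrary illustrative choices.
import numpy as np

n = 1024
freq = np.fft.fftfreq(n)                          # frequency axis (cycles/sample)
pupil = (np.abs(freq) <= 0.1).astype(float)       # uniform pupil, coherent cutoff 0.1

h_coh = np.fft.ifft(pupil)                        # coherent (amplitude) PSF
h_inc = np.abs(h_coh) ** 2                        # incoherent (intensity) PSF
otf = np.abs(np.fft.fft(h_inc))
otf /= otf.max()                                  # normalized optical transfer function

print("coherent cutoff  :", freq[np.nonzero(pupil)[0]].max())
print("incoherent cutoff:", freq[np.nonzero(otf > 1e-9)[0]].max())   # ~= 0.2
```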
2. Extrapolation

Due to the bandwidth limitation, spatial frequencies above the cutoff frequency seem to be irrevocably lost. Under certain circumstances, however, superresolution, that is, resolution beyond the diffraction limit, turns out to be possible. The key to superresolution is a priori knowledge. The incorporation of prior knowledge may make it possible to reconstruct the object spectrum beyond the diffraction limit from that within the diffraction limit. Such prior knowledge is usually available. For example, a very realistic assumption for a real object is that it is spatially bounded; that is, it is nonzero only in a region of finite extent. This single condition can be shown to be sufficient to guarantee that the object spectrum is analytic. A well-known property of an analytic function is that if it is known on a certain interval, it is known everywhere, since there cannot exist two analytic functions that agree over a given interval and yet disagree outside the interval (Castleman, 1979). Consequently, given the values of an analytic function over a specified interval, the analytic function can always be reconstructed in its entirety. This process of reconstruction, by means of certain mathematical operations, is called analytic continuation. If the noise is ignored, inverse filtering of the image spectrum by the known transfer function of the imaging system would result in an exact reconstruction of the object spectrum within the passband of the imaging system, that is, up to the diffraction limit. Analytic continuation would then make it possible to extrapolate the object spectrum beyond the diffraction limit (Barnes, 1966; Harris, 1964a). This would result in an exact reconstruction of the total object spectrum and thus, after computation of the inverse Fourier transform, the exact object. Consequently, the finite extent of the object is a sufficient condition to guarantee that the resolution is not limited by diffraction. In practice, however, the images are disturbed by noise. Furthermore, the transfer function of the imaging system will rarely be exactly known. It is these errors, both nonsystematic and systematic, that limit the performance of superresolution techniques and limit the resolution in the process. In fact, at one time the feasibility of superresolution was seriously doubted by many. Although analytic continuation shows that it is, in principle, possible to
retrieve object properties beyond the diffraction limit if the available a priori knowledge about the object is used, it does not yield a practical way to achieve superresolution. This is because analytic continuation requires computation of derivatives, but the presence of noise in real images makes an accurate computation of derivatives hopeless (Frieden, 1967). Considerations such as this even led to the assertion that superresolution is no more than a "myth" (Andrews and Hunt, 1977). It is now generally known that superresolution is not a myth. Its practical feasibility, even in the presence of noise, has clearly been demonstrated by several superresolution algorithms (Biraud, 1969; Dempster et al., 1977; Frieden, 1972, 1975; Frieden and Burke, 1972; Gerchberg, 1974; Holmes, 1988; Lucy, 1974; Richardson, 1972; Schell, 1965; Sementilli et al., 1993; Shepp and Vardi, 1982; Walsh and Nielsen-Delaney, 1994). All superresolution algorithms are based on the same principles as the concept of analytic continuation, namely the following (B. R. Hunt, 1995):

• The spatial frequencies that are captured by image formation below the diffraction limit contain information needed to reconstruct spatial frequencies beyond the diffraction limit.
• Using additional knowledge about the object makes it possible to reconstruct object spatial frequencies beyond the diffraction limit from the image spatial frequencies below that limit.

Both empirically and theoretically, it has been shown that there are certain necessary conditions to be satisfied by a superresolution algorithm for it to be effective (B. R. Hunt, 1994). First, the algorithm must contain nonlinear operations, since a purely linear operation can only modify the amplitudes of the Fourier harmonics that are already present in the image but cannot create object spatial frequencies beyond the diffraction limit. It should be noted, however, that although any trivial nonlinear operation, such as squaring the image gray levels, will cause the generation of spatial frequencies beyond the diffraction limit, such an operation will not necessarily construct spatial frequencies that bear any relation to the spatial frequencies of the true object. Therefore, a second condition is that the algorithm should explicitly utilize a mathematical description of the image formation process that relates object and image via the point spread function of the imaging system. This condition is necessary to prevent misconstruing an arbitrary nonlinear operation as superresolving (B. R. Hunt, 1994). A third condition is that, for Nyquist sampled images, aliasing distortion caused by the reconstruction of frequency components beyond the bandwidth be avoided by reconstructing the object on a finer grid than the image. The algorithm should therefore contain some suitable form of interpolation. Obviously, such an interpolation is not needed if the image is sufficiently oversampled. A fourth condition follows directly from the principles on which superresolution is based; that is, the algorithm should
incorporate a priori knowledge about the object. Examples of such additional knowledge used by superresolution algorithms are as follows:

• Finite extent of the object. Finite extent implies that the object possesses an analytic Fourier transform. However, as indicated previously, superresolution algorithms based on this one piece of a priori knowledge (Barnes, 1966; Frieden, 1967; Harris, 1964a; Slepian and Pollak, 1961) have been found to be highly sensitive to noise in the image (Pask, 1976; Rushforth and Harris, 1968).

• Positivity of the object. Positivity has been found to be a very important constraint. Remarkable results have been gained by algorithms that make use of the fact that the object is known to be positive and that therefore force the reconstructed object to be positive as well (Biraud, 1969; Frieden, 1972; Frieden and Burke, 1972; Gerchberg, 1974; Janson et al., 1970; Schell, 1965; Walsh and Nielsen-Delaney, 1994). The way the positivity constraint is realized differs from algorithm to algorithm. In the Biraud algorithm, for example, positivity is ensured in a rather ad hoc manner, namely through representing the object as the square of an unknown function (Biraud, 1969). In other methods, the positivity constraint follows logically from some prior meaningful principle, such as maximum entropy (Frieden, 1972; Frieden and Burke, 1972). The maximum entropy solution can be regarded as the most uniform image consistent with the observed data and the noise. It is the positivity constraint inherent in the entropy, though, which makes superresolution possible (Burch et al., 1983).

• Upper or lower bounds on object intensity. Obviously, the positivity of the object can be forced in the algorithm as a lower bound. A second bound on the object may be its maximum intensity, which is usually more difficult to fix precisely. However, by looking at an image, the observer can always guess at some upper bound to the unknown object. The importance of incorporating this extra knowledge has been established empirically by Janson and coauthors (1970).

• Object statistics. In this case, not only the image, but also the object is modeled as a stochastic variable, which is, by definition, characterized by a probability density function. An illustrative example of a superresolution algorithm incorporating object statistics is the maximum a posteriori (MAP) estimator, which we will discuss briefly now.

Suppose that the linear image formation is represented in terms of a matrix-vector formulation (Banham and Katsaggelos, 1997)

$$g = Hf + n \tag{1}$$
where g is the image vector, f is the object vector, and n is the noise vector. The matrix H represents the point spread function, which is supposed to
be known. The MAP estimate of the object is found by maximizing the probability density function of the object f conditioned on the image g:

$$\arg\max_{f} \; p(f \mid g) \tag{2}$$

which, according to Bayes' rule, is equivalent to

$$\arg\max_{f} \; \frac{p(g \mid f)\, p(f)}{p(g)} \tag{3}$$
where p(f) is the prior probability density function of the object, whereas p(g | f) is the probability density function of the image conditioned on the object. The image probability density, p(g), is not a function of f and is therefore irrelevant to the maximization process. Obviously, the virtue of a Bayes' MAP estimator is that it allows one to incorporate a priori knowledge about the object into the solution via the prior probability density function, p(f). Notice, for example, that assuming Poisson statistics for the object automatically imposes a positivity constraint. One should, however, always be aware of the fact that if incorrect a priori information is incorporated, this will give rise to biased results according to the principle "garbage in, garbage out." In the literature, different choices of mathematical models for the probability density functions p(f) and p(g | f) have been described (Frieden, 1980). For example, in the derivation of the Poisson MAP estimator of B. R. Hunt and Sementilli (1992), both the object and the image probability density functions, p(f) and p(g | f), were chosen to be Poisson densities. Geman and Geman (1984) developed a MAP superresolution algorithm for a Poisson image probability density function and a Markov random field governing the object statistics. In the special case of normal distributions p(f) and p(g | f), it can easily be shown that the MAP estimator reduces to the linear minimum mean-square-error (LMMSE) estimator (Sezan and Tekalp, 1990). The LMMSE estimator is equivalent to the well-known Wiener filter, which is based on a priori second-order statistical information about the object and the noise. It should be noted, however, that the Wiener filter will not provide superresolution because it involves only linear operations.
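To make Eq. (3) concrete, consider the Gaussian special case mentioned above. The sketch below is ours, with illustrative sizes and variances: for zero-mean Gaussian noise and a zero-mean Gaussian prior, maximizing p(g | f) p(f) amounts to minimizing ||g − Hf||²/s² + ||f||²/t², a regularized least squares problem whose closed-form solution is a discrete analog of the Wiener/LMMSE estimator for a white object prior.

```python
# MAP estimation under Gaussian assumptions reduces to regularized least
# squares: with g = H f + n, n ~ N(0, s2 I), and prior f ~ N(0, t2 I),
# maximizing p(g|f) p(f) is equivalent to minimizing
#   ||g - H f||^2 / s2 + ||f||^2 / t2,
# whose minimizer is f_map = (H^T H + (s2/t2) I)^-1 H^T g.
import numpy as np

rng = np.random.default_rng(1)
n = 64
# Circulant blur matrix H built from a simple 3-point kernel (illustrative).
kernel = np.array([0.25, 0.5, 0.25])
H = np.zeros((n, n))
for i in range(n):
    for k, c in zip((-1, 0, 1), kernel):
        H[i, (i + k) % n] = c

f_true = (rng.random(n) < 0.05).astype(float)     # sparse positive object
s2, t2 = 1e-4, 1.0                                # noise and prior variances
g = H @ f_true + np.sqrt(s2) * rng.standard_normal(n)

f_map = np.linalg.solve(H.T @ H + (s2 / t2) * np.eye(n), H.T @ g)
print("rms error:", np.sqrt(np.mean((f_map - f_true) ** 2)))
```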
3. Inverse Filtering

At this stage, it is worthwhile to consider the case in which the prior probability p(f) in Expr. (3) is assumed to be uniform. This reflects the absence of any a priori knowledge about the object. In this case, the MAP estimator Expr. (3) becomes a maximum likelihood estimator. The maximum likelihood estimate
of the object is then

$$\arg\max_{f} \; p(g \mid f) \tag{4}$$

where the quantity p(g | f) is usually referred to as the likelihood function. It can be shown that for the special case of independent and identically normally distributed noise, the maximum likelihood estimator becomes a least squares estimator (B. R. Hunt and Andrews, 1973):

$$\arg\min_{f} \; \|g - Hf\|^2 \tag{5}$$

which, if the matrix product $H^T H$ is nonsingular, produces the least squares solution

$$f_{LS} = (H^T H)^{-1} H^T g \tag{6}$$

where the superscript T denotes transposition. This is recognized as a generalized inverse filter. Without loss of generality, from now on we will assume that the dimension of the image vector g is equal to that of the object vector f. The matrix H then becomes a square matrix. For a nonsingular square matrix H, Eq. (6) reduces to

$$f_{LS} = H^{-1} g \tag{7}$$
Inverse filtering can be implemented by discrete Fourier transform. It then consists of dividing the Fourier transform of the image by the known transfer function of the imaging system and next computing the inverse Fourier transform of the result. Alternatively, inverse filtering can be performed purely by convolution in the spatial domain. Unfortunately, inverse filtering has been found to be extremely sensitive to noise. Depending on the nature of H and the noise, the inverse filtering solution may possess an excessively high variance, even though it is unbiased under normality assumptions (B. R. Hunt, 1973). This may be understood as follows. It follows from Eqs. (1) and (7) that the object estimate obtained by inverse filtering deviates from the actual value of the object f by the additional error term $H^{-1}n$, where n represents the additive noise. Since the matrix H, representing the point spread function, is mainly filled with zeros and small elements near the diagonal, which causes $H^{-1}$ to have very large elements, the error term $H^{-1}n$ manifests itself as amplified noise (Frieden, 1975). Notice that if the point spread function is such that H is singular, computation of the least squares solution by Eq. (7) becomes impossible. A unique least squares solution then no longer exists. Unfortunately, it can be shown that matrices H associated with bandlimited systems, which are the systems of interest in this article, will be singular. This becomes clear if we consider the counterpart of the inverse operator $H^{-1}$ in the spectral frequency
domain, which is given by the inverse of the transfer function. For bandlimited systems, the inverse of the transfer function is undefined for frequencies beyond the cutoff frequency. This means that for bandlimited systems unmodified inverse filtering cannot be carried out. Coping with this problem often involves carrying out inverse filtering up to a so-called processing frequency, which is chosen within the strictly limited bandwidth of the system transfer function. The result of this modified inverse filtering is that the image is effectively filtered by a product of the inverse system transfer function and a transfer function uniform up to the processing frequency. If the noise is disregarded, the final result of modified inverse filtering is the object convolved with a sinc function, which is the point spread function corresponding to the uniform transfer. A simple calculation shows that the first zero crossing of this point spread function is located at half the reciprocal of the processing frequency. This location may be considered to be the resolution in the reconstructed object. For the purposes of this article, it is important to mention that thus a twofold improvement of the Rayleigh resolution in the reconstructed object over that in the measured image may be achieved. However, even in the absence of noise, modified inverse filtering will not provide us with the exact inverse solution, which is the exact object. The method is strictly bandlimited: it does not extrapolate outside the processing frequency and will therefore not provide superresolution. This is not surprising since inverse filtering is a linear method and it does not incorporate any a priori information about the object, which, as mentioned previously, is a necessary condition to achieve superresolution. For non-normally distributed noise, the least squares estimate described by Eq. (6) is generally not the maximum likelihood estimate since it does not maximize the pertinent likelihood function. Let us consider, for example, Poisson distributed image observations. Then it can be easily shown that the maximum likelihood estimator Expr. (4) becomes a nonlinear estimator. In this case, a closed form maximum likelihood estimator does not exist. Hence, the maximum likelihood solution, which is the location of the maximum of the likelihood function, can be found only by an iterative, numerical optimization process. For this purpose, several nonlinear recursive algorithms have been developed, such as the Richardson-Lucy algorithm, which has been derived independently by Richardson (1972) and Lucy (1974) and has been rediscovered in a different context by Dempster and coauthors (1977) and Shepp and Vardi (1982). It should be noted that unconstrained maximum likelihood estimation described by Expr. (4) does not assume any a priori knowledge of the object to be reconstructed. It can be shown, however, that the aforementioned iterative nonlinear algorithms developed to maximize the likelihood function for Poisson distributed observations share the properties that the estimates at each stage satisfy the positivity constraint for the object and that the total
energy is preserved. The algorithms, therefore, implicitly impose constraints on the solution of the object that represent additional a priori knowledge. Notice that the imposing of such extra constraints may, of course, diminish the likelihood of the final solution; effectively, a smaller likelihood may be accepted for the benefit of a solution that is in agreement with the a priori knowledge. Nevertheless, the algorithms satisfy the necessary conditions to achieve superresolution as described previously, and their ability to achieve superresolving image restoration has been reported (Holmes, 1988; Meinel, 1986; Snyder et al., 1993).

It is outside the scope of this article to describe the methods of superresolution in detail. For a review, we refer the reader to Frieden (1975), B. R. Hunt (1994), and Meinel (1986). What we should keep in mind is that, obviously, the performance of any superresolving algorithm will always be limited by noise. This is clearly demonstrated by Lucy (1992a, 1992b), who empirically investigated the limits imposed by photon statistics on the degree of superresolution achievable with image restoration techniques. For this purpose, he applied the Richardson-Lucy algorithm for maximum likelihood image restoration to synthetic, Poisson distributed images of a pair of identical point sources and empirically investigated the degree to which Rayleigh's resolution limit for diffraction-limited images can be surpassed.

Finally, it should be noted that often the available a priori information about the object consists of a parametric model. Then, only the relatively small number of unknown parameters characterizing the object needs to be computed from the available observations. The image reconstruction problem thus becomes a parameter estimation problem. Parametric restoration methods can, in principle, achieve far higher resolution than nonparametric methods, as will become clear in the remainder of this article.
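The Richardson-Lucy iteration referred to above is compact enough to sketch in full. In the lines below (our illustration; the Gaussian point spread function, object, and iteration count are arbitrary assumptions) the estimate is multiplied at each step by a back-projected ratio of measured to predicted image, which keeps it positive and, as the final print statement confirms, preserves the total energy of the image.

```python
# Minimal Richardson-Lucy sketch for a 1-D, shift-invariant, Poisson model.
# PSF width, object, and iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n = 128
x = np.arange(n)
psf = np.exp(-0.5 * ((x - n // 2) / 3.0) ** 2)
psf /= psf.sum()

def blur(v):
    # circular convolution with the (centered) PSF via the FFT
    return np.real(np.fft.ifft(np.fft.fft(v) * np.fft.fft(np.fft.ifftshift(psf))))

obj = np.zeros(n)
obj[60], obj[68] = 800.0, 800.0                # two close point sources
img = rng.poisson(blur(obj)).astype(float)     # Poisson-distributed image

est = np.full(n, img.mean())                   # flat, positive starting estimate
for _ in range(200):
    ratio = img / np.maximum(blur(est), 1e-12)
    est *= blur(ratio)                         # symmetric PSF: correlation = convolution
print("total energy image/estimate:", img.sum(), est.sum())
```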
C. Approaches Based on Information Theory
In the literature, authors have also discussed resolution in the framework of information theory, relating it to the number of degrees of freedom in the image. Most of this work is based on the concept of channel capacity as developed by Shannon (1949). By application of the sampling theorem, Shannon showed that the number of points M needed to completely specify a one-dimensional signal of duration T and bandwidth $B_T$ is given by

$$M = 2TB_T + 1 \tag{8}$$

It should be mentioned that in Shannon's original papers the unity term is mentioned, but not included, since $2TB_T$ is usually much greater than one (Shannon, 1949). Furthermore, if it is assumed that the detected signal has an
average energy s and additive noise energy n, then the number of levels that can be distinguished reasonably well is given by

$$\left[\frac{s + n}{n}\right]^{1/2} \tag{9}$$

so that the total number of possible distinct signals is given by

$$m = (1 + \mathrm{SNR})^{M/2} \tag{10}$$

with $\mathrm{SNR} = s/n$ representing the signal-to-noise ratio. The information capacity (in bits) of the system is then defined as (Shannon, 1949)

$$N = \log_2 m = \tfrac{1}{2}(1 + 2TB_T)\log_2(1 + \mathrm{SNR}) \tag{11}$$
Several researchers applied this concept to optics to estimate the resolution limit of an imaging system for a given SNR (Fried, 1979; Toraldo di Francia, 1955). Cox and Sheppard (1986) derived an expression for the information capacity of an optical system, in which the signal is not limited simply to a single temporal dimension but possesses three spatial dimensions and two independent states of polarization as well. Their work extended not only the work of Fellgett and Linfoot (1955) by including temporal properties of the optical system, but also that of Lukosz (1966, 1967) by including the signal-to-noise ratio. The main result is that the information capacity of an imaging system, that is, the number of degrees of freedom that the system can transmit, is invariant and is given by
$$N_F = (1 + 2L_x B_x)(1 + 2L_y B_y)(1 + 2L_z B_z)(1 + 2TB_T)\log_2(1 + \mathrm{SNR}) \tag{12}$$

where $L_x$ and $L_y$ are the lengths of the sides of the rectangular image area; $L_z$ is the depth of the field of view of the system; and $B_x$, $B_y$, and $B_z$ are the spatial bandwidths of the system in the x, y, and z directions, respectively. T is the observation time, $B_T$ is the temporal bandwidth of the system, and SNR is the signal-to-noise ratio. The agreement of Eq. (12) with Eq. (11) is clear, if we take into account the incorporation of a factor 2 that is due to the two possible states of polarization. According to this invariance theorem, the parameters in Eq. (12) may be varied as long as $N_F$ remains constant. A priori knowledge can be used to determine if the actual information received is less than theoretically possible. For example, if it is known that an object is independent of time, the temporal bandwidth of the system, which seems useless at first sight, can be used to transmit encoded additional high spatial frequency information. This information can then be decoded again at the receiver, that is, the detector, after which a superresolution image can be formed. In this way, the spatial bandwidths of the system can be extended beyond the diffraction limit by a proportional reduction of another constituent parameter
in Eq. (12), namely the temporal bandwidth, $B_T$. Furthermore, the fact that the SNR is included in Eq. (12) makes it possible to analyze the practical limits of superresolution methods. This may be seen as follows. Analytic continuation may be regarded as a trade-off between the bandwidth parameters and the SNR parameter in Eq. (12): in order to increase the spatial bandwidth parameters, the SNR parameter in Eq. (12) must be decreased proportionally. It is then possible to derive the maximum resolution improvement attainable by means of analytic continuation given a minimum acceptable image SNR (Cox and Sheppard, 1986).

On the basis of Shannon's concept of information capacity, Kosarev (1990) derived an absolute limit for resolution enhancement in comparison with the diffraction limit. He showed that optimum superresolution is, in principle, determined by noise and may be computed via Shannon's result, Eq. (10). Kosarev found that the resolution limit $\delta$ for a signal in the form of equidistant lines, all having equal amplitudes, can be described as

$$\delta = \frac{2A}{C \log_2(1 + \mathrm{SNR})} \tag{13}$$

where A is the width of the point spread function, and the constant C is equal to WA, with W the spatial bandwidth of the signal. The ratio $A/\delta$ is called the superresolution coefficient. Note that Kosarev considered the case of a one-dimensional and time-independent signal, but his results may be extended to include multidimensional as well as time-dependent imaging systems. It is worth noting that, as recognized by Kosarev, the resolution limit could be exceeded if the unknown signal, which is the object, could be parameterized. Parametric methods can, in principle, attain a better resolution than the Shannon limit. Such methods will be addressed in detail in the remainder of this article.

In conclusion, in agreement with what was pointed out in the previous subsections, the work on information theory also shows that, ultimately, resolution is not limited by diffraction but by the unavoidable presence of noise.
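To give Eq. (13) a quantitative feel, the lines below tabulate the superresolution coefficient A/δ for a few signal-to-noise ratios, taking C = WA = 1 as for a diffraction-limited signal. Note that both the reconstruction of Eq. (13) from the damaged original and the choice C = 1 are our assumptions. Even an SNR of 10⁴ yields a coefficient below 7, which illustrates why noise, not diffraction, sets the practical limit.

```python
# Superresolution coefficient A/delta from Eq. (13), assuming C = W*A = 1
# (a diffraction-limited signal). SNR values are illustrative.
import math

C = 1.0
for snr in (10, 100, 1_000, 10_000):
    coeff = 0.5 * C * math.log2(1 + snr)
    print(f"SNR = {snr:6d}  ->  A/delta = {coeff:.2f}")
```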
D. Approaches Based on Signal-to-Noise Ratio

In this subsection, other attempts to determine the resolution limit of an imaging system as a function of the signal-to-noise ratio (SNR) will be discussed. A vast amount of literature on this subject is available. For example, Idell and Webster (1992) define a resolution limit based on an expression for the SNR in the frequency domain. They consider a coherent imaging process and view this as a procedure for estimating the image's Fourier spectrum. Using a continuous detection model to describe the operation of photo-count noise-limited image recording and known statistical properties of laser speckle patterns, they compute the SNR of the Fourier spectrum of a detected, coherent
image, defined as

$$\mathrm{SNR}_D(f) = \frac{|E[D(f)]|}{[\mathrm{Var}\{D(f)\}]^{1/2}} \tag{14}$$
where D(f) is the estimate of the Fourier transform of the detected, coherent image. This SNR expression can now be used to quantify the effective spatial frequency resolution limit achievable with a given coherent imaging system. For this purpose, one should first establish a minimum frequency domain SNR for which D(f) may still be considered useful. The resolution limit is then defined as the highest spatial frequency value for which the system achieves this SNR value. The magnitude of spectra of real object scenes tends to drop off at higher frequencies. This means that for many object scenes, in practice, effective cutoff frequencies can be defined which provide a practical measure of frequency domain resolution (Idell and Webster, 1992).

An alternative SNR-based definition of the resolution limit has been proposed by Falconi (1967), who studied the two-point resolution of an imaging system in the presence of noise. For this purpose, he considered an object consisting of two closely located point sources of known separation imaged by an imaging system with a known point spread function. Furthermore, he assumed that the noise consists of fluctuations in the number of detected photons only. In order to find the limit, $\Delta\theta_m$, up to which the angular separation of the two sources can be measured, Falconi assumed the separation to be slightly decreased from $\Delta\theta$ to $\Delta\theta_a$, where the angle $\theta$ is a distance measured in the focal plane and divided by the focal length of the objective lens. An infinitesimal detector, of width $d\theta$, at the focal plane will then detect a photon flux change, or signal change, $(I - I_a)\,d\theta$, with $I$ and $I_a$ the photon intensities in the composite image of the two point sources at separations $\Delta\theta$ and $\Delta\theta_a$, respectively. The standard deviation of the noise seen by the same detector is just $(I\,d\theta)^{1/2}$. Falconi then defined the measurement limit, $\Delta\theta_m$, as the minimum angular change in separation of the two sources to give an overall SNR equal to one, where the overall SNR is found from
$$\mathrm{SNR}^2 = \int_{-\infty}^{\infty} \frac{(I - I_a)^2}{I} \, d\theta \tag{15}$$
According to Falconi, the measurement limit thus obtained also defines the standard deviation of the error of measurement of the separation of the two sources after the detection of a given number of photons. The resolution limit is then defined as the separation of both sources for which the standard deviation of the measurement error, which is the measurement limit, is equal to the actual separation. This resolution limit depends on the total dose of photons detected, the shape and the width of the system's point spread function, and the intensity ratio of both sources.
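Eq. (15) is straightforward to evaluate numerically. The sketch below is ours; the Gaussian point spread functions, the separations, and the photon dose are illustrative assumptions, and Eq. (15) itself is reconstructed from a damaged original. It computes the overall SNR produced by a small decrease in the separation of two blurred point sources.

```python
# Numerical evaluation of Falconi's overall SNR, Eq. (15), for two
# Gaussian-blurred point sources. PSF width, separations, and photon
# dose are illustrative assumptions.
import numpy as np

theta = np.linspace(-6.0, 6.0, 4001)
d_theta = theta[1] - theta[0]
sigma = 1.0                                     # PSF width
N = 1e6                                         # total detected photons

def composite(sep):
    """Photon intensity (photons per unit angle) for separation sep."""
    p = (np.exp(-0.5 * ((theta - sep / 2) / sigma) ** 2)
         + np.exp(-0.5 * ((theta + sep / 2) / sigma) ** 2))
    return N * p / (p.sum() * d_theta)          # normalize to N photons total

I  = composite(0.50)                            # separation delta-theta
Ia = composite(0.45)                            # slightly decreased separation
snr2 = np.sum((I - Ia) ** 2 / I) * d_theta      # Eq. (15)
print("overall SNR:", np.sqrt(snr2))
```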
A similar approach is based on the finding that the precision with which the location of a single point source can be measured can be expressed as the ratio of some constant of the order of the Rayleigh limit, which is called the resolution scale, to the SNR (Fried, 1979, 1980). Analogous results have been found in radar theory, where a resolution scale is used which, as a fraction of the SNR, equals the precision with which a single target position can be measured (Dunn et al., 1970).

Fried studied the general problem of measuring simultaneously the midpoint location, separation, and relative intensities of a pair of point sources. Treating this problem as a signal processing problem and applying a matched filter, Fried (1979, 1980) calculated the root mean square (rms) precision with which these parameters can be estimated. In addition, Fried calculated the rms precision with which the location of a single point source can be measured and found, in agreement with earlier results, that this precision can be defined as the ratio of a resolution scale to the SNR. In examining quantitative results for measurement of the parameters of a pair of point sources, Fried found that there is no fundamental impediment to measuring these parameters, even when the separation of both point sources is significantly smaller than the Rayleigh limit, other than the absence of the SNR required to achieve the desired level of precision. The results of Fried also show that a significant increase of the SNR, by a factor of two or more, to counter the complicating effect of small separation is not needed until the separation approximates the resolution scale. From these results, Fried concluded that, apparently, the resolution scale defines not only the precision with which the position of a single point source can be measured but also the minimum separation of a pair of point sources that is required to avoid significant interference of the estimates of the parameters of both sources. This inspired Fried to suggest that the resolution scale, rather than the Rayleigh limit, ought to be considered basic to the resolution of an optical system.

The resolution criteria discussed in this subsection have in common that they relate resolution to the SNR. However, they differ from one another in the assumed amount of available a priori knowledge. Furthermore, being more or less heuristic, they provide little insight into what the ultimate and exact limits to resolution are.
E. Approaches Based on Decision Theory
Several authors have discussed the concept of resolution in the framework of classical decision theory (Cunningham and Laramore, 1976; Harris, 1964b; Helstrom, 1969; Nahrstedt and Schooley, 1979). In this framework, resolving two identical objects such as point sources is defined as deciding whether two
objects are present in the field of view or only one. A measure of resolution is then the probability of making this binary decision correctly. Generally, the procedure is as follows. It is assumed that the set of possible objects is known a priori and that it consists of two alternatives only. For two-point resolution, these alternatives are one point source located at a specified position and two point sources located at specified positions, respectively. Given the noise statistics, the likelihood that two point sources are present and the likelihood that one point source is present are computed from the image observations. The ratio of both likelihoods is then used as a decision function. Obviously, this ratio will be a stochastic variable with a certain probability distribution. This distribution will depend on the two specified object alternatives, the object that is actually present, and the noise statistics. From this distribution, it is possible to calculate the probability that the application of the decision function would result in a correct decision. Accordingly, Harris (1964b) considered the case in which there are actually two point objects with a known separation while the observations are disturbed by additive normally distributed noise. The probability that application of the decision function would result in a correct decision then corresponds to the probability of resolution. The results of Harris clearly indicate that "although diffraction increases the difficulty of rendering a correct binary decision, it does not prevent a correct binary decision from being made, no matter how closely spaced the two points may be" (Harris, 1964b, p. 610). In the absence of noise, the probability of making the binary decision correctly would be equal to one. It is therefore only the limited precision of the observations, and not diffraction, that ultimately limits the resolution. Notice, however, that the preceding analysis breaks down if the true object is not a member of the a priori specified set of possible objects, which may be a serious restriction in practice.
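The following sketch illustrates this decision procedure under assumed conditions: a Gaussian point spread function, additive independent normally distributed noise, and two fully specified object alternatives. For normally distributed noise the log-likelihood ratio reduces to a difference of residual sums of squares; all names and numerical values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 201)

def psf(x):
    return np.exp(-0.5 * x**2)

one_source = psf(x)                                 # alternative 1
two_sources = 0.5 * (psf(x - 0.4) + psf(x + 0.4))   # alternative 2
sigma = 0.05                                        # noise standard deviation

n_correct, n_trials = 0, 10_000
for _ in range(n_trials):
    image = two_sources + sigma * rng.normal(size=x.size)  # truth: two sources
    # For Gaussian noise the log-likelihood ratio is a difference of residual
    # sums of squares; decide "two sources" if it is positive.
    llr = (np.sum((image - one_source)**2)
           - np.sum((image - two_sources)**2)) / (2 * sigma**2)
    n_correct += llr > 0
print(f"estimated probability of a correct decision: {n_correct / n_trials:.3f}")
```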
F. Approaches Based on Asymptotic Estimation
Most of the resolution criteria and resolution enhancing methods discussed in the preceding subsections dealt with nonparameterized objects. However, the available a priori information about the object is often a parametric model. For instance, the notion of two-point resolution clearly implies a two-component model parametric in the locations, and possibly the amplitudes, of the components. An example of such a model is the sinc-square Rayleigh model. Resolving the components may then be regarded as a parameter estimation problem in which the most important parameters are the locations of the components. These and any other parameters of interest have to be estimated from a set of noisy observations. Usually, the number of parameters to be estimated is relatively small compared with the number of observations.
In Section IV, it will be shown how the parameters enter the probability density function of the statistical observations. From this parameterized probability density function, the Cramér-Rao lower bound may be computed. This is a lower bound on the variance with which the parameters can be estimated. A detailed description of the Cramér-Rao lower bound can also be found in Section IV. Furthermore, from the probability density function of the observations the maximum likelihood estimator of the parameters can be derived. An important property of the maximum likelihood estimator is that it achieves the Cramér-Rao lower bound asymptotically, that is, for an infinite number of observations. Therefore, it is asymptotically most precise (Stuart et al., 1999). In the literature, many attempts to express resolution in terms of precision computed by use of statistical parameter estimation theory are found (Bettens et al., 1999; Cathey et al., 1984; Farrell, 1966; Helstrom 1969, 1970; Orhaug, 1969). Some of these references consider single-source resolution, which is defined as the capability of an imaging system including digital computing facilities to determine the position of a point-source object (Cathey et al., 1984; Farrell, 1966; Helstrom, 1969, 1970; Orhaug, 1969). Other references treat differential (two-source) resolution, which is defined as the system's capability to determine the separation of two point sources (Bettens et al., 1999; Orhaug, 1969). For single-source and differential resolution, the parameters of interest are the location of the single point source and the separation and midpoint position or, equivalently, the locations of the two point sources, respectively. Farrell (1966) derived expressions for the Cramér-Rao lower bound for estimators of the location of a point source in terms of its intensity, the shape of the point spread function, and the noise characteristics. He found that the Cramér-Rao lower bounds decrease with increasing width of the imaging aperture and tend to zero for an increasing SNR. This was also found by Helstrom (1969), who determined the Cramér-Rao lower bounds for estimators of object parameters such as radiance and position for a uniformly radiating circular object, not necessarily a point source, and a circular aperture. Furthermore, Helstrom found that for a given total energy received from the object, the Cramér-Rao lower bounds increase with increasing radius of the object, that is, with decreasing degree of coherence of the light received at the imaging aperture. Orhaug (1969) and Cathey and coauthors (1984) used results obtained in a radar context (Kelly et al., 1960; Root, 1962; Swerling, 1964) to derive expressions for the variances of pertinent maximum likelihood estimators. In particular, Cathey and coauthors (1984) considered single-source resolution for one-dimensional coherent imaging in the presence of additive normally distributed noise. Taking the variance of the maximum likelihood estimator of the location of a point source as a measure of resolution, they showed that
resolution is improved as the curvature of the point spread function at its center is increased. This result corresponds to the guideline provided by the classical resolution criteria: the narrower the main lobe of the point spread function, the better the resolution. Furthermore, Cathey and coauthors found that, in agreement with earlier results, the variance is inversely proportional to the SNR. They also compared the result obtained when the aperture is a perfect low-pass spatial filter with that obtained when the aperture is a perfect bandpass filter with the same total bandwidth. It was found that enhanced resolution, that is, a lower variance, may be achieved by moving the passband away from the origin, since this results in a point spread function with a narrower main lobe. This is in essence an apodization procedure as described in Section II. Theoretically, the main lobe of the point spread function can thus be made arbitrarily narrow. However, as we carry the apodization process to extremes, eventually the probability of selecting a secondary lobe rather than the main lobe will no longer be negligible. Of course, this produces a large error in the location estimate. Then, the preceding analysis breaks down, as was pointed out by the authors themselves, who stated that their analysis is valid only for large SNRs and that it is restricted to the main lobe of the point spread function (Cathey et al., 1984). Applying statistical parameter estimation theory, Orhaug (1969) expressed both the single-source and differential source resolution of an imaging system in terms of the system and the noise parameters. He also discussed the relation between both kinds of resolution, which is found to be determined directly by the shape of the system's point spread function, and he found that for separations that are large compared with the width of the point spread function, the variance with which the separation can be estimated is twice the variance with which the position of a single point source can be estimated. At this stage, it should be emphasized that the statistical parameter estimation theory as applied in the references discussed in this subsection is valid only if the parametric image model used is correctly specified. This means that exact knowledge of the point spread function is required. Furthermore, computation of the Cramér-Rao lower bound and application of maximum likelihood estimation both require the probability density function of the observations to be known. In addition, the results are asymptotic, since maximum likelihood estimators generally attain the Cramér-Rao lower bound only asymptotically. (See, for instance, Bettens et al. (1999).) The three assumptions (correct model, large number of observations, and known probability density function of the observations) are often not realistic in practice. Therefore, the results of statistical parameter estimation theory should be treated with caution. On the other hand, the assumptions just listed are not made in the alternative parameter estimation based approach to the concept of two-point resolution proposed in this article and described in detail in Section VII.
G. Conclusions
Several approaches to the concept of resolution different from the classical Rayleigh-like approach have been discussed in this section. It has been pointed out that, in principle, superresolution, that is, resolution beyond the diffraction limit, is possible. A condition necessary to achieve superresolution is the availability of a priori knowledge about the object. This a priori knowledge may be knowledge about finite object extension, positivity of the object, upper or lower bounds on the object intensity, object statistics, or a parametric model of the object. Image restoration methods that incorporate a priori knowledge may, in principle, attain superresolution but their performance will always be limited by noise. These limits have been analyzed in various ways, by using, for instance, information theory, decision theory, or statistical parameter estimation theory. This has resulted in a variety of alternative resolution criteria. These criteria have in common that they define the attainable resolution limit in terms of the signal-to-noise ratio but they differ from one another in the assumed amount and nature of available a priori knowledge. The subsequent sections of this article will focus on a parameter estimation approach to resolution. This approach is concerned with two-point resolution and all it requires is that a parametric model for the observations and a log-likelihood function or, equivalently, a criterion of goodness of fit are chosen. Since this approach requires neither known error distributions, nor asymptotic conditions, nor a correct model, it is essentially different from, and a practical alternative to, the approaches found in the literature and described in this section and Section II.
IV. MODELING AND PARAMETER ESTIMATION
A. Introduction
In this section, parametric statistical models of observations will be introduced. Specifically, these will be used in this article to model statistical-error-corrupted observations made on two-component models such as the Rayleigh sinc-square two-point resolution model. For the purposes of this article, the most important parameters of such models are the locations of the components. It will be shown how these and other parameters enter the probability density function of the statistical observations. This parameterized probability density function is used for two purposes. First, from it, the Cramér-Rao lower bound may be computed. This is a lower bound on the variance with which the parameters can be estimated. Second, from the probability density function the maximum likelihood estimator of the parameters is derived. This estimator actually achieves the Cramér-Rao lower bound asymptotically, that is, for an
infinite number of observations. Therefore, it is asymptotically most precise. For this and other reasons, the maximum likelihood estimator is very important and is often used in practice. In this article, emphasis will be on the analysis of maximum likelihood estimators of the parameters of two-component models. However, it will be seen later that the results of this analysis also apply to estimators that are not always maximum likelihood, such as the widely used least squares estimator. The section is concluded by a numerical example. Its purpose is to show that for finite numbers of observations the properties of maximum likelihood and comparable estimators may essentially differ from the asymptotic properties and that this may have important consequences for the ability of these estimators to resolve the location parameters and, hence, to distinguish the components.
B. Parametric Statistical Models of Observations
Any applied physicist will readily admit that his or her observations "contain errors." For the purposes of this article, these errors must be specified. This specification is the subject of this subsection. Generally, sets of observations made under the same conditions nevertheless differ from experiment to experiment. The usual way to describe this behavior is to model the observations as stochastic variables. The reason is that there is no viable alternative and that it has been found to work. Stochastic variables are defined by probability density functions. In this article, the observations are either counting results or continuous measurements. If the observations are counting results, they are nonnegative integers and the probability density function defines the probability of occurrence of each of these integer outcomes. If the observations are continuous, the probability density function describes the probability of occurrence of an observation on a particular interval. Useful books about distributions and their properties are those of Chatfield (1995), Mood et al. (1987), Papoulis (1965), and Stuart and Ord (1994). Suppose that a set of observations w_n, n = 1, …, N, is available. Then the N × 1 vector w defined as

$$w = (w_1 \;\cdots\; w_N)^T \tag{16}$$

represents a point in the Euclidean N space having w_1, …, w_N as coordinates. This will be called the space of the observations throughout. The expectations of the observations are defined by their probability density function. The vector of expectations

$$E[w] = (E[w_1] \;\cdots\; E[w_N])^T \tag{17}$$
is also a point in the space of observations and the observations are distributed about this point.
These ingredients are sufficient for a definition of nonsystematic, or statistical, errors in the observations. These are the differences of each observation and its expectation. Therefore, the expectation of the nonsystematic errors is equal to zero. In this article, the expectations of the observations in every measurement point are function values of a function of one or more independent variables which is parametric in the quantities to be measured. This is illustrated in the following example.

Example IV.1 (Observations of Poisson Distributed Biexponential Decay) Suppose that observations w_n, n = 1, …, N, are made on a biexponential decay process at the measurement points x_1, …, x_N. Here and in what follows, these measurement points are assumed to be exactly known. This is realistic in the many applications where the measurement points are time instants or locations. Furthermore, in this example, it is assumed that the observations are statistically independent and have a Poisson distribution. This implies that the probability that the observation w_n is equal to ω_n is equal to

$$\exp(-\lambda_n)\,\frac{\lambda_n^{\omega_n}}{\omega_n!} \tag{18}$$

where the parameter λ_n is equal to the expectation E[w_n]. Since the w_n are assumed to be independent, the probability p(ω) of a set of observations ω = (ω_1 … ω_N)^T is the product of all probabilities described by Eq. (18):

$$p(\omega) = \prod_n \exp(-\lambda_n)\,\frac{\lambda_n^{\omega_n}}{\omega_n!} \tag{19}$$

with n = 1, …, N. Since the parameter λ_n in this expression is equal to the expectation E[w_n], it is, by definition, equal to the value of the biexponential function in the measurement point x_n:

$$E[w_n] = \lambda_n = \alpha_1\exp(-\beta_1 x_n) + \alpha_2\exp(-\beta_2 x_n) \tag{20}$$

where the amplitudes α_1 and α_2 and the decay constants β_1 and β_2 are the unknown parameters of the function which have to be estimated from the observations. The quantities

$$w_n - \lambda_n \tag{21}$$

are the nonsystematic errors in the observations. Their expectations are equal to E[w_n] − λ_n and are, therefore, equal to zero. Example IV.1 shows that the expectation of the observations is an accurate description of what experimenters usually call the model underlying the observations. Also, substitution of λ_n as described by Eq. (20) in Eq. (19) shows how the probability density of w_n depends on the parameters α_1, α_2, β_1, and β_2.
This dependence will be extensively used later for the design of estimators of these parameters from a set of observations. Of course, such an estimation procedure is correct only if the expectation model is an accurate description of the true expectations.

Example IV.2 (Wrong Model of the Expectations of the Observations) Suppose that in Example IV.1 the expectations of the observations are described by

$$\lambda_n = \gamma_1 + \gamma_2 x_n + \alpha_1\exp(-\beta_1 x_n) + \alpha_2\exp(-\beta_2 x_n) \tag{22}$$

with γ_1, γ_2 > 0. In practice, a function like γ_1 + γ_2 x usually represents a background function. It constitutes a systematic contribution to the expectations and thus to the observations. Then, if this function is not included in the expectation model of the observations and λ_n described by Eq. (20) is substituted in the probability density function instead of λ_n described by Eq. (22), the functional dependence of this probability density function on α_1, α_2, β_1, and β_2 is wrong. In what follows, this will be called a systematic error, or modeling error. It will be shown later that the modeling error has consequences for the estimates of these parameters and, therefore, for distinguishing estimates of closely located parameters β_1 and β_2.
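As an illustration, the following minimal sketch simulates one set of observations of Example IV.1; the parameter values and measurement points are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha1, alpha2 = 800.0, 400.0    # amplitudes (assumed values)
beta1, beta2 = 1.0, 0.2          # decay constants (assumed values)
x = np.linspace(0.0, 10.0, 50)   # exactly known measurement points

# Expectations of the observations, Eq. (20).
lam = alpha1 * np.exp(-beta1 * x) + alpha2 * np.exp(-beta2 * x)

# One set of statistically independent Poisson distributed observations.
w = rng.poisson(lam)

# Nonsystematic errors, Eq. (21); their expectation is zero.
errors = w - lam
print("mean of the nonsystematic errors:", errors.mean())
```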
C. Parameter Estimation
In the previous subsection, examples were presented of introducing the parameters to be measured into the probability density function of the observations. In this subsection, these results will first be somewhat generalized. Next, it will be shown how the probability densities thus parameterized may be used to define the Fisher score concept and to compute the Cramér-Rao lower bound, which is a lower bound on the variance of any unbiased estimator. Then, it is discussed how, from the parameterized probability density functions, the maximum likelihood estimator of the parameters may be derived. This subsection is concluded by a numerical example demonstrating the limits to resolution of maximum likelihood estimates of the parameters of two-component models like those used in optics.

1. Dependence of the Probability Density Function of the Observations on the Parameters
In this subsection, examples will be presented illustrating the dependence of the probabilities or probability density functions of the observations on the parameters. To simplify the analysis, in what follows we will use the logarithm
of the probabilities or probability density functions of the observations instead of the probabilities or probability density functions themselves.
Example IV.3 (Poisson Distributed Observations) The logarithm of the probability of the Poisson distributed observations described by Eq. (19) becomes

$$\sum_n \left(-\lambda_n + \omega_n \ln\lambda_n - \ln \omega_n!\right) \tag{23}$$

Suppose that the expectation of the nth observation is described by

$$E[w_n] = f_n(\theta) = f(x_n; \theta) \tag{24}$$

where x_n is the measurement point which is supposed to be known and which may be vector valued. The vector θ = (θ_1 … θ_K)^T is the vector of parameters of the function. Then Expr. (23) may be written as

$$q(\theta) = \sum_n \left(-f_n(\theta) + \omega_n \ln f_n(\theta) - \ln \omega_n!\right) \tag{25}$$

where the last term does not depend on θ.
Example IV.4 (Normally Distributed Observations) The probability density function of normally distributed observations is defined by

$$p(\omega) = \frac{1}{(2\pi)^{N/2}(\det W)^{1/2}}\, \exp\left[-\tfrac{1}{2}(\omega - E[w])^T W^{-1}(\omega - E[w])\right] \tag{26}$$

where ω = (ω_1 … ω_N)^T, W is the N × N covariance matrix of the observations with as its (n_1, n_2)-th element cov(w_{n_1}, w_{n_2}), and where det W and W^{−1} are the determinant and the inverse of W, respectively. The N × 1 vector E[w] of expectations is defined by Eq. (17). The logarithm of p(ω) is equal to

$$q(\theta) = -\frac{N}{2}\ln 2\pi - \frac{1}{2}\ln\det W - \frac{1}{2}\,(\omega - f(\theta))^T W^{-1}(\omega - f(\theta)) \tag{27}$$

where f(θ) = (f_1(θ) … f_N(θ))^T. Notice that the first two terms of Eq. (27) are independent of the parameters θ. If the w_n are uncorrelated,

$$W = \mathrm{diag}\left(\sigma^2_{w_1}, \ldots, \sigma^2_{w_N}\right) \tag{28}$$

and

$$q(\theta) = -\frac{N}{2}\ln 2\pi - \sum_n \ln\sigma_{w_n} - \frac{1}{2}\sum_n \frac{[\omega_n - f_n(\theta)]^2}{\sigma^2_{w_n}} \tag{29}$$
where σ²_{w_n} is the variance of w_n. If, in addition, σ²_{w_n} = σ²_w for all n,

$$q(\theta) = -\frac{N}{2}\ln 2\pi - N\ln\sigma_w - \frac{1}{2\sigma_w^2}\sum_n \left[\omega_n - f_n(\theta)\right]^2 \tag{30}$$
Example IV.5 (Binomially Distributed Observations) Next, consider independent and binomially distributed observations. Then the probability that an observation w_n is equal to ω_n is defined as

$$\binom{M}{\omega_n}\, p_n^{\omega_n} (1 - p_n)^{M - \omega_n} \tag{31}$$

where

$$\binom{M}{\omega_n} \tag{32}$$

is the binomial coefficient; M is the total number of trials, which is assumed to be equal in all points; ω_n is the number of successes; and p_n is the probability of success. The expectation of a binomially distributed w_n is equal to Mp_n and, therefore, p_n = f_n(θ)/M. Then the logarithm of the probability of w is equal to

$$q(\theta) = \sum_n \ln\binom{M}{\omega_n} - NM\ln M + \sum_n \left[\omega_n \ln f_n(\theta) + (M - \omega_n)\ln\left(M - f_n(\theta)\right)\right] \tag{33}$$

Notice that only the last sum depends on the parameters θ. These three examples show that it may be relatively easy to establish the functional relationship of the probability or the probability density function of the observations to the parameters. This relationship will be used for two purposes. First, in Section IV.C.2, it will be used to compute a lower bound on the variance of estimators of the parameters, the Cramér-Rao lower bound. Then, in Section IV.C.3, it will be used for the construction of the most important type of estimator, the maximum likelihood estimator.
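For reference, the three log-probability functions of Examples IV.3 to IV.5 may be coded directly as functions of the parameters. The sketch below assumes a user-supplied expectation model f(x, theta) and uses the identity ln ω! = gammaln(ω + 1); it is an illustrative transcription, not library code.

```python
import numpy as np
from scipy.special import gammaln

def q_poisson(theta, w, x, f):
    """Log-probability of independent Poisson observations, Eq. (25)."""
    lam = f(x, theta)
    return np.sum(-lam + w * np.log(lam) - gammaln(w + 1.0))

def q_normal_iid(theta, w, x, f, sigma):
    """Log-density of independent, identically normal observations, Eq. (30)."""
    N = w.size
    return (-0.5 * N * np.log(2.0 * np.pi) - N * np.log(sigma)
            - 0.5 * np.sum((w - f(x, theta))**2) / sigma**2)

def q_binomial(theta, w, x, f, M):
    """Log-probability of independent binomial observations, Eq. (33)."""
    fn = f(x, theta)
    log_binom = gammaln(M + 1.0) - gammaln(w + 1.0) - gammaln(M - w + 1.0)
    return (np.sum(log_binom) - w.size * M * np.log(M)
            + np.sum(w * np.log(fn) + (M - w) * np.log(M - fn)))
```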
2. Limits to Precision: The Cramér-Rao Lower Bound

In this subsection, the Cramér-Rao lower bound will be introduced. This is a lower bound on the variance of any unbiased estimator of a parameter. An estimator is said to be unbiased if its expectation is equal to the true value of the parameter. Stated differently, an unbiased estimator has no systematic error.
First, the Fisher score vector (Jennrich, 1995) is introduced. This is defined as the K × 1 vector

$$s(\theta) = \frac{\partial q(\theta)}{\partial \theta} \tag{34}$$

The expectation of the Fisher score vector is equal to the null vector. This follows from

$$\int \cdots \int p(\omega)\, d\omega = 1 \tag{35}$$

and hence

$$\int \cdots \int \frac{\partial p(\omega)}{\partial \theta}\, d\omega = o \tag{36}$$

and this is equivalent to

$$E\!\left[\frac{\partial q(\theta)}{\partial \theta}\right] = E[s(\theta)] = o \tag{37}$$
where o is the K × 1 null vector. Notice that this result applies to any allowable parameterization of the probability density function since it is a direct consequence of the volume of any probability density function being equal to one. Since the expectation of the Fisher score is equal to zero, its K × K covariance matrix is described by

$$F = E\left[s(\theta)\,s^T(\theta)\right] = E\left[\frac{\partial \ln p(w)}{\partial \theta}\, \frac{\partial \ln p(w)}{\partial \theta^T}\right] \tag{38}$$
This covariance matrix is called the Fisher information matrix. It can be shown (Dhrymes, 1970; Goodwin and Payne, 1977) that under general conditions the covariance matrix cov(t) of any unbiased estimator t of θ satisfies

$$\mathrm{cov}(t) \geq F^{-1} \tag{39}$$
This inequality expresses that the difference of cov(t) and F^{−1} is positive semidefinite. Since the diagonal elements of cov(t) represent the variances of t_1, …, t_K and since the diagonal elements of a positive semidefinite matrix are nonnegative, these variances are larger than or equal to the corresponding diagonal elements of F^{−1}. In this sense, F^{−1} represents a lower bound to the variances of all unbiased t. The matrix F^{−1} is called the Cramér-Rao lower bound on the variance of t. The Cramér-Rao lower bound can be extended to include unbiased estimators of vectors of functions of the parameters instead of the parameters proper.
Let

$$\rho(\theta) = (\rho_1(\theta) \;\cdots\; \rho_L(\theta))^T \tag{40}$$

be such a vector and let r = (r_1 … r_L)^T be an unbiased estimator of ρ(θ). Then it can be shown that

$$\mathrm{cov}(r) \geq \frac{\partial \rho}{\partial \theta^T}\, F^{-1}\, \frac{\partial \rho^T}{\partial \theta} \tag{41}$$
This expression will be used in Section IV.C.4.
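As a numerical illustration, the sketch below evaluates the Cramér-Rao lower bound for the biexponential model of Example IV.1 under the additional assumption of independent, identically normally distributed errors, for which the Fisher matrix takes the well-known form F = JᵀJ/σ², with J the Jacobian of the expectation model. All numerical values are assumptions.

```python
import numpy as np

def jacobian(theta, x):
    """Jacobian of the biexponential expectation model, Eq. (20)."""
    a1, a2, b1, b2 = theta
    e1, e2 = np.exp(-b1 * x), np.exp(-b2 * x)
    # Columns: derivatives with respect to alpha1, alpha2, beta1, beta2.
    return np.column_stack([e1, e2, -a1 * x * e1, -a2 * x * e2])

x = np.linspace(0.0, 10.0, 50)
theta = np.array([800.0, 400.0, 1.0, 0.2])   # assumed true parameters
sigma = 5.0                                  # assumed noise standard deviation

J = jacobian(theta, x)
F = J.T @ J / sigma**2            # Fisher information matrix, cf. Eq. (38)
crlb = np.linalg.inv(F)           # Cramér-Rao lower bound, cf. Eq. (39)
print("lower bounds on the standard deviations:", np.sqrt(np.diag(crlb)))
```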
3. Maximum Likelihood Estimation

The maximum likelihood method for estimation of the parameters assumes the probability or probability density function of the observations to be known. It also assumes the dependence of the probability density function on the unknown parameters of the observations to be known. Examples of how this dependence may be established have been presented in Section IV.C.1. The maximum likelihood procedure consists of the following three steps:

1. The available observations w = (w_1 … w_N)^T are substituted for the corresponding independent variables ω = (ω_1 … ω_N)^T in the probability or the probability density function of the observations. Since the observations are numbers, the resulting expression depends only on the elements of θ = (θ_1 … θ_K)^T.
2. The elements of θ = (θ_1 … θ_K)^T, which are the hypothetical true parameters, are considered to be variables. To express this, we replace them by t = (t_1 … t_K)^T. The resulting function q(t) is called the log-likelihood function of the parameters t for the observations w. The pertinent log-likelihood functions for the three distributions discussed earlier in this section are described by the Eqs. (25), (27), and (33) with ω_n replaced by w_n and θ by t, respectively.
3. The maximum likelihood estimates t̂ of the parameters θ are computed. These are defined as the values of the elements of t that maximize q(t), or

$$\hat{t} = \arg\max_t\, q(t) \tag{42}$$
An important consideration is that maximum likelihood estimators are usually relatively easily found. All that needs to be done is to establish the dependence of the supposed probability or probability density function of the observations on the parameters. Examples of this process have been presented in Section IV.C.1. In these examples, the dependence on the parameters is established via the expectation of the observations.
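The three steps may be sketched for the Poisson biexponential model of Example IV.1 as follows, using a general-purpose numerical optimizer to maximize q(t) (by minimizing −q); the data and starting values are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
theta_true = np.array([800.0, 400.0, 1.0, 0.2])   # assumed true parameters

def model(t, x):
    return t[0] * np.exp(-t[2] * x) + t[1] * np.exp(-t[3] * x)

w = rng.poisson(model(theta_true, x))   # step 1: substitute the observations

def neg_q(t):
    # Step 2: the log-likelihood q(t), Eq. (25), with the constant term
    # (which does not depend on t) dropped; negated for minimization.
    lam = model(t, x)
    if np.any(lam <= 0):
        return np.inf
    return -np.sum(-lam + w * np.log(lam))

t0 = np.array([700.0, 500.0, 0.8, 0.3])           # assumed starting values
result = minimize(neg_q, t0, method="Nelder-Mead")  # step 3: maximize q(t)
print("maximum likelihood estimates:", result.x)
```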
The theory of maximum likelihood estimation can be found in Norden (1972, 1973), Stuart et al. (1999), Stuart and Ord (1994), and Zacks (1971). The content of these references may quickly convince any reader that a rigorous description of the statistical properties of maximum likelihood estimators is far outside the scope of this article. For the purposes of this article, their most important properties may generally be described as follows:
• Consistency. Generally, an estimator is said to be consistent if the probability that an estimate deviates more than a specified amount from the true value of the parameter can be made arbitrarily small by increasing the number of observations used.
• Asymptotic normality. If the number of observations increases, the probability density function of a maximum likelihood estimator tends to a normal distribution.
• Asymptotic efficiency. The asymptotic covariance matrix of a maximum likelihood estimator is equal to the Cramér-Rao lower bound. In this sense, the maximum likelihood estimator is most precise.

Also notice that unless the likelihood function q(t) is quadratic in the elements of t, the actual computation of the maximum likelihood estimate is a nonlinear optimization problem that cannot be solved in closed form. Finally, Eq. (30) shows that the likelihood function for independent and identically normally distributed observations is described by
$$q(t) = -\frac{N}{2}\ln 2\pi - N\ln\sigma_w - \frac{1}{2\sigma_w^2}\sum_n \left[w_n - f_n(t)\right]^2 \tag{43}$$
From this expression, it follows that under these conditions the value of t minimizing the sum in the last term is identical to the maximum likelihood estimate. To simplify the terminology, in what follows we will also use the term likelihood function for the nonlinear least squares criterion

$$\sum_n \left[w_n - f_n(t)\right]^2 \tag{44}$$
on the understanding that it has to be minimized instead of maximized. Minimizing Eq. (44) with respect to t and using the value of t at the minimum as an estimate of θ is called nonlinear least squares estimation. For an overview of this method and numerical methods to implement it, see van den Bos (1982). The nonlinear least squares method is often used in practice, irrespective of or in the absence of knowledge of the probability density function of the observations. If it is applied to observations other than those that are independent and identically normally distributed, it is generally not a maximum likelihood estimator. Therefore, in what follows, so that estimators that are not necessarily maximum likelihood can be included, the probability or probability density
function of the observations will not necessarily correspond to the likelihood function used. Consequently, the theory developed will apply to all probability density functions of the observations on the one hand and likelihood functions and criteria of goodness of fit, such as the least squares criterion, on the other. The three properties of maximum likelihood estimators listed previously are asymptotic. For example, if the observations are described by
$$w_n = f_n(\theta) + v_n \tag{45}$$
with the v_n independent and identically normally distributed errors with an expectation equal to zero and variance σ²_w, the least squares estimates of the elements of θ may be expected to be normally distributed about θ with covariance matrix

$$\sigma_w^2 \left(\frac{\partial f^T}{\partial \theta}\,\frac{\partial f}{\partial \theta^T}\right)^{-1} \tag{46}$$
with f = (f_1(θ) … f_N(θ))^T if the number of observations is sufficiently large. If this condition is not met, the properties of the solutions may be essentially different from the asymptotic properties. This is illustrated by a numerical example in the next subsection.
4. Limits to Resolution of Parameters: A Numerical Example

Suppose that observations w = (w_1 … w_N)^T have been made with bi-Gaussian expectations described by

$$f_n(\theta) = \alpha_1 \exp\left\{-\tfrac{1}{2}(x_n - \beta_1)^2\right\} + \alpha_2 \exp\left\{-\tfrac{1}{2}(x_n - \beta_2)^2\right\} \tag{47}$$

where f_n(θ) = f(x_n; θ) and the unknown parameters θ are the locations (β_1 β_2)^T. For simplicity, the parameters α_1 and α_2 are supposed to be known and to be positive. Furthermore, assume that the component functions of f(x; θ) strongly overlap. Since the half-width of the components of f(x; θ) is approximately equal to one, this means |β_1 − β_2| ≪ 1. Let the observations be described by Eq. (45) with the v_n independent and identically normally distributed errors with a standard deviation σ_w and an expectation equal to zero. Next define α′_1 = α_1/(α_1 + α_2) and α′_2 = α_2/(α_1 + α_2). Then, the functions of the parameters ρ_1 = α′_1β_1 + α′_2β_2 and ρ_2 = β_1 − β_2 are a measure of the overall location of f(x; θ) and the difference of the locations of its components, respectively. By Eq. (41), the Cramér-Rao lower bound on the variance of unbiased estimators of ρ = (ρ_1 ρ_2)^T is described by

$$\frac{\partial \rho}{\partial \theta^T}\, F^{-1}\, \frac{\partial \rho^T}{\partial \theta} \tag{48}$$
where F^{−1} is described by Eq. (46) and

$$\frac{\partial \rho}{\partial \theta^T} = \begin{pmatrix} \alpha'_1 & \alpha'_2 \\ 1 & -1 \end{pmatrix} \tag{49}$$

FIGURE 1. Measurement points and expectations of the observations in the numerical example of Section IV.C.4. The expectations are the sum of two overlapping Gaussian functions.
Suppose that the values of the parameters are, respectively, α_1 = 0.4, α_2 = 0.6, and β_1 = −β_2 = −0.05. Furthermore, assume that there are 21 measurement points x_n = (n − 11) × 0.4, n = 1, …, 21. Figure 1 shows the two-component function and the location of the measurement points. The standard deviation σ_w of the observations is taken equal to 0.004. If the asymptotic covariance matrix described by Eq. (48) is used as covariance matrix, it yields a standard deviation 0.003 in the estimates of the location ρ_1 = 0.4β_1 + 0.6β_2, a standard deviation 0.09 in the estimates of the distance ρ_2 = β_1 − β_2, and a low correlation coefficient of the estimates of −0.01. If the corresponding asymptotic normal distribution is taken as distribution of the estimates, the estimates of ρ_2 are marginally normally distributed with an expectation −0.1 and a standard deviation 0.09. These asymptotic considerations predict a probability of an estimate on the interval (−0.2, 0) of about .74. Furthermore, it is easily shown that the probability of a positive estimate, which causes the components to change places, is about .13. Since this implies a probability of .87 that the components do not change places, the conclusion from these asymptotic considerations is that with a probability of .87 an estimate of the distance is obtained that is perhaps not very precise but may be called meaningful.
FIGURE 2. Histogram of estimates of the difference of the locations (horizontal) and the overall location of the components (diagonal) of the overlapping Gaussian functions in the numerical example of Section IV.C.4.

To check these results, we next carry out the following simulation experiment (den Dekker, 1992). A number of 100,000 sets of 21 normally distributed observations such as described are generated and for each set the values of the estimates of β_1 and β_2 minimizing the least squares criterion are computed. Under the described conditions, this is a maximum likelihood estimate. The results, computed by use of a numerical minimization procedure, are collected into a histogram shown in Figure 2. This histogram may be explained as follows. The horizontal coordinate is b_1 − b_2, which is the difference of the estimates of the locations. Therefore, at the origin of this coordinate b_1 = b_2. This corresponds to coinciding components or, equivalently, to one single component at a location b = b_1 = b_2 and with an amplitude α = α_1 + α_2. To the left of the origin, b_1 < b_2. Therefore, this is where the exact difference β_1 − β_2 is found. The diagonal coordinate is equal to α′_1b_1 + α′_2b_2, which is an estimate of the overall location of the two-component model. The horizontal plane is divided into classes as follows. The interval (0, 0.02) of the (α′_1b_1 + α′_2b_2)-axis and the interval (−0.28, 0.28) of the (b_1 − b_2)-axis are divided into 20 and 17 equal subintervals, respectively.
Thus the 100,000 solutions are divided into 340 different classes with the exception of 11 solutions which were found outside the intervals mentioned. The vertical coordinate represents the frequency, that is, the number of times a solution in a particular class occurs. The resulting histogram is trimodal. The peak in the middle represents solutions b_1 − b_2 distributed about zero. Together, the classes having b_1 − b_2 = 0 as their midpoint contain a number of 29,782 solutions. Of these, 29,323 solutions exactly coincide, that is, b_1 = b_2. On the other hand, the additional lower peaks to the left and to the right of b_1 − b_2 = 0, by definition, correspond to solutions with distinct values for b_1 and b_2. The solutions represented by the left-hand peak are more or less distributed about the true values:
$$(b_1 - b_2,\; \alpha'_1 b_1 + \alpha'_2 b_2) = (\beta_1 - \beta_2,\; \alpha'_1\beta_1 + \alpha'_2\beta_2) \tag{50}$$
The location of the component with amplitude α_1 relative to that with amplitude α_2 agrees with the relative location of the component with amplitude α_1 in the exact model. In the solutions represented by the right-hand peak of the histogram, the components have changed places. The number of solutions in the right-hand peak is almost equal to that in the left-hand peak. The conclusion drawn from these simulation results is that there are substantial and essential differences between the asymptotic predictions and the actual simulation results. The asymptotic prediction is a unimodal normal probability density function around the true values of the parameters. This probability density function is such that with a probability of .13 the sign of the difference of the locations will be opposite that with exact observations. Then the components change places. The simulations, on the other hand, produce a trimodal probability density function. The probability of a solution in the neighborhood of the true values of the locations is about .35. This probability is more or less equal to that of a similar solution but with the locations reversed. The probability of exactly coinciding solutions is about .3. Thus the solution from two-component observations is a one-component model. This means that with a probability of .3 it is not possible to resolve the locations of the components. The resolution limit thus encountered will be the subject of the sections to follow.
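The following sketch outlines a scaled-down version of this simulation experiment, with far fewer trials and with starting values chosen as the true locations; the starting values and the numerical tolerance for detecting coinciding solutions are assumptions, since the original minimization procedure is not specified in the text.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(3)
a1, a2 = 0.4, 0.6                       # known amplitudes
beta = np.array([-0.05, 0.05])          # true locations beta1, beta2
x = (np.arange(1, 22) - 11) * 0.4       # the 21 measurement points
sigma_w = 0.004

def model(b):
    return (a1 * np.exp(-0.5 * (x - b[0])**2)
            + a2 * np.exp(-0.5 * (x - b[1])**2))   # Eq. (47)

f_exact = model(beta)
n_coincide, n_trials = 0, 2000
for _ in range(n_trials):
    w = f_exact + sigma_w * rng.normal(size=x.size)
    fit = least_squares(lambda b: model(b) - w, x0=beta)
    b1, b2 = fit.x
    # Count (numerically) coinciding, one-component solutions.
    n_coincide += abs(b1 - b2) < 1e-3
print(f"fraction of coinciding solutions: {n_coincide / n_trials:.2f}")
```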
D. Conclusions

In this section, maximum likelihood estimators for measurement of parameters of statistical observations have been introduced with emphasis on estimation from noise-corrupted observations made on two-component models such as the biexponential and bi-Gaussian function. By definition, these estimators maximize the likelihood function of the observations. The favorable properties
of maximum likelihood estimators are asymptotic, that is, for an infinite number of observations. A numerical example has been presented showing that for finite numbers of observations these properties may essentially differ and obstruct the resolving of the location parameters of the two-component models. It will be shown later in this article that this behavior is caused by a change of the structure of the likelihood function under the influence of the fluctuations in the observations. For an explanation of this structural change, elements of a branch of mathematics called singularity theory are needed. These will be presented in Section V.
V. ELEMENTS OF SINGULARITY THEORY
A. Introduction

Section IV was concluded by an example illustrating how fluctuations in the observations may obstruct the resolving of the components of a two-component model. In Sections VI and VII, singularity theory will be used to explain this phenomenon. The relevant results and terminology from this theory will be presented in this section. The combination of singularity theory and its applications is also called catastrophe theory (Arnol'd, 1992). For the purposes of this article, useful texts on catastrophe theory are those of Arnol'd (1992), Gilmore (1981), Poston and Stewart (1978), and Saunders (1980). In Thompson (1982), fascinating examples of the use of singularity theory in theoretical and applied science are described. Generally, singularity theory deals with structural changes of parametric functions of one or more independent variables as a result of changes of the parameters. In this context, the word structural relates to the structure of the function, that is, the number and nature of its stationary points. These are the points where the gradient of the function vanishes. The nature of stationary points may differ: they may be absolute or relative maxima or minima, or saddle points. Saddle points are minima in one or more coordinates and maxima in the others. A simple example of a saddle point is a mountain pass. Thus structural change of a function may be defined as a change of the number and nature of the stationary points. Central in singularity theory is the insight that structural change of a function may occur only if, as a result of a change of the parameters of the function, one of the stationary points becomes degenerate. Whether a stationary point is degenerate depends on the Hessian matrix in that stationary point. This is the matrix of the second-order partial derivatives of the function with respect to its independent variables. At a maximum, all eigenvalues of the Hessian matrix
are negative, at a minimum they are all positive, while at a saddle point they are partly negative, partly positive. If at a stationary point one or more eigenvalues vanish, the Hessian matrix is called singular and the stationary point is said to be degenerate. It is clear that the nature of a stationary point remains the same as long as changes of the values of the parameters do not cause one or more eigenvalues to vanish. If this applies to all stationary points, the structure of the function remains the same and the function is called structurally stable. On the other hand, if as a result of changing parameters, an eigenvalue vanishes, further change of the parameters may change the structure of the function. Therefore, for parameter values corresponding to vanishing eigenvalues, the function is said to be structurally unstable. In the Euclidean space of the parameters, the parameter values for which a particular eigenvalue is equal to zero may be collected into sets. These sets are called bifurcation sets. An important consideration in what follows is that one may freely move in parameter space without altering the structure of the function as long as no bifurcation set is crossed. In the applications studied in this article, at the bifurcation set two stationary points merge to form a single degenerate stationary point and, subsequently, vanish. If the bifurcation set is crossed in the opposite direction, two new stationary points appear. In Sections VI and VII, it will be shown that this merging and subsequent vanishing of stationary points also explains the vanishing of distinct solutions for the model parameters of the numerical example of Section IV.C.4. The outline of this section is as follows. In Section V.B, relevant notions and definitions are introduced. Then, in Section V.C, the representation of a function for parameter values close to a bifurcation set is discussed. Finally, in Section V.D, an algorithm is described for the systematic derivation of such a representation. Conclusions are drawn in Section V.E.
B. Definitions and Notions

In this section, f(x; λ) is a real scalar function of the elements x_p of the P × 1 vector of independent variables

$$x = (x_1 \;\cdots\; x_P)^T \tag{51}$$

The function is parametric in the elements λ_q of the Q × 1 vector of parameters

$$\lambda = (\lambda_1 \;\cdots\; \lambda_Q)^T \tag{52}$$

In these expressions, the superscript T denotes transposition. It will be assumed throughout that f(x; λ) is a smooth function of the x_p; that is, all its partial derivatives exist and are continuous. For the validity of some results to follow, this assumption is too severe but it simplifies the mathematical formalism.
Moreover, the likelihood functions and criteria of goodness of fit that the theory will be applied to in Sections VI and VII are smooth functions. The P × 1 vector of first-order partial derivatives of f(x; λ) defined as

$$\frac{\partial f}{\partial x} = \left(\frac{\partial f}{\partial x_1} \;\cdots\; \frac{\partial f}{\partial x_P}\right)^T \tag{53}$$

is called the gradient of f(x; λ) with respect to the vector x. A point x is called stationary if in that point the gradient is equal to the null vector. The Hessian matrix is defined as

$$\frac{\partial^2 f}{\partial x\, \partial x^T} = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_P} \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_P} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial^2 f}{\partial x_P \partial x_1} & \dfrac{\partial^2 f}{\partial x_P \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_P^2} \end{pmatrix} \tag{54}$$
If, at a stationary point, the eigenvalues of the Hessian matrix are nonzero, that stationary point is called nondegenerate or Morse. At a Morse maximum all eigenvalues are strictly negative. At a Morse minimum all eigenvalues are strictly positive. A Morse stationary point is called a Morse R-saddle point if R of the eigenvalues of the Hessian matrix concerned are strictly negative and the remaining ones are strictly positive. The representation of functions in and around stationary points, degenerate and nondegenerate, is the subject of the next subsection.
C. Functions around Stationary Points

1. The Morse Lemma and the Splitting Lemma

It can be shown (Poston and Stewart, 1978) that in the neighborhood of a Morse R-saddle point a smooth function f(x; λ) may be represented in local coordinates ξ = (ξ_1 … ξ_P)^T as

$$-\xi_1^2 - \cdots - \xi_R^2 + \xi_{R+1}^2 + \cdots + \xi_P^2 \tag{55}$$

where the saddle point has been chosen as the origin and the function has been translated so as to make it vanish at the origin. The result Expr. (55) is called the Morse lemma (for parametric families of functions). Notice that the Morse lemma implies that the form Expr. (55) is valid for all values of λ concerned. If all stationary points of the function f(x; λ) are Morse, each of them is locally represented by an expression of the form Expr. (55) and the structure of f(x; λ)
does not change if the elements of λ are changed. Under these conditions, the function is called structurally stable. Next consider a function f(x; λ) that has, for a particular value λ̃ of the parameters λ, a degenerate stationary point x̃. That is, one or more, say S, eigenvalues of the Hessian matrix at the stationary point are equal to zero. Then the splitting lemma (for parametric families of functions) states that in the neighborhood of (x̃, λ̃) the function f(x; λ) may be represented as

$$d_1\xi_1^2 + d_2\xi_2^2 + \cdots + d_{P-S}\,\xi_{P-S}^2 + g(\xi_{P-S+1}, \ldots, \xi_P; \lambda) \tag{56}$$

where the d_p, p = 1, …, P − S, are either equal to 1 or to −1 (Poston and Stewart, 1978). In this expression, the ξ_p are local coordinates with the stationary point as origin, S is the number of eigenvalues equal to zero, and g(ξ_{P−S+1}, …, ξ_P; λ) is a function of the coordinates ξ_{P−S+1}, …, ξ_P and depending on λ. The quadratic terms in Expr. (56) are called the nondegenerate or Morse part of the representation of the function f(x; λ) in the neighborhood of (x̃, λ̃), and g(ξ_{P−S+1}, …, ξ_P; λ) is the degenerate part. It is clear that ξ_1, ξ_2, …, ξ_{P−S} are the coordinate directions in which nothing happens if λ is varied. They are, therefore, called the inessential variables. If through variation of λ the structure of the function f(x; λ) changes, it is the degenerate part g(ξ_{P−S+1}, …, ξ_P; λ) that changes. This is why ξ_{P−S+1}, …, ξ_P are called essential variables.
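The classification of stationary points by the eigenvalues of the Hessian matrix can be automated symbolically. The sketch below does so for an illustrative two-variable function resembling the fold function discussed next; the function and parameter value are assumptions chosen for demonstration.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
lam = sp.Rational(-1)                    # an assumed fixed parameter value
f = x1**2 + lam * x2 + x2**3 / 3         # illustrative test function

grad = [sp.diff(f, v) for v in (x1, x2)]
H = sp.hessian(f, (x1, x2))

for point in sp.solve(grad, [x1, x2], dict=True):
    eigs = list(H.subs(point).eigenvals())
    if any(sp.simplify(e) == 0 for e in eigs):
        kind = "degenerate"
    elif all(e > 0 for e in eigs):
        kind = "Morse minimum"
    elif all(e < 0 for e in eigs):
        kind = "Morse maximum"
    else:
        kind = f"Morse {sum(1 for e in eigs if e < 0)}-saddle"
    print(point, eigs, kind)
```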
2. Simple Examples of Singular Representations

In this subsection, two simple examples of representations of functions in the neighborhood of singular stationary points will be introduced.
Example V.1 (The Fold Catastrophe) The first example is the function

$$f(x; \lambda) = x_1^2 + x_2^2 + \cdots + x_{P-1}^2 + \lambda x_P + \tfrac{1}{3}x_P^3 \tag{57}$$

The gradient of this function is described by

$$(2x_1 \;\; 2x_2 \;\cdots\; 2x_{P-1} \;\; \lambda + x_P^2)^T \tag{58}$$

Its Hessian matrix is the diagonal matrix

$$2\,\mathrm{diag}(1 \;\; 1 \;\cdots\; 1 \;\; x_P) \tag{59}$$

Equations (58) and (59) show that the function has two stationary points:

$$(0 \;\; 0 \;\cdots\; 0 \;\; \pm(-\lambda)^{1/2}) \tag{60}$$

for λ < 0 and that the Hessian matrices of the function at these stationary points are equal to

$$2\,\mathrm{diag}(1 \;\; 1 \;\cdots\; 1 \;\; \pm(-\lambda)^{1/2}) \tag{61}$$
These stationary points merge to form one degenerate stationary point if λ vanishes. For λ > 0 the function has no stationary points at all. It is clear that

$$g(x_P; \lambda) = \lambda x_P + \tfrac{1}{3}x_P^3 \tag{62}$$

is the degenerate part of the function. Figure 3 shows g(x_P; λ) for λ = −1, λ = 0, and λ = 1, respectively. The function has one minimum and one maximum if λ is negative. For increasing λ, the minimum and the maximum get closer and closer and merge to form one degenerate stationary point for λ equal to zero. If λ becomes positive, the function has no stationary points anymore. Therefore, the bifurcation set, that is, the set of all points in the Euclidean space of the parameters where the function g(x_P; λ) changes its structure, is simply

$$\lambda = 0 \tag{63}$$

FIGURE 3. The fold catastrophe if its parameter is (a) negative, (b) equal to zero, and (c) positive, respectively.

In catastrophe theory, the function described by Eq. (62) is called the (canonical) fold catastrophe. Applying these results to the function f(x; λ) defined by Eq. (57) shows that this function changes from a function with a one-minimum, one 1-saddle-point structure into a function without any stationary point, or the reverse, if the sign of λ changes.
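A quick numerical check of this structural change: the stationary points of the degenerate part g(x_P; λ) are the real roots of its derivative λ + x_P², counted here for λ negative, zero, and positive.

```python
import numpy as np

for lam in (-1.0, 0.0, 1.0):
    roots = np.roots([1.0, 0.0, lam])             # roots of x**2 + lam
    real = roots[np.abs(roots.imag) < 1e-12].real
    print(f"lambda = {lam:+.0f}: stationary points at {np.round(np.unique(real), 6)}")
```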
Example V.2 (The Cusp Catastrophe) The second example of a function representation in the neighborhood of a singular stationary point is

$$f(x; \lambda) = x_1^2 + x_2^2 + \cdots + x_{P-1}^2 + g(x_P; \lambda) \tag{64}$$
with

$$g(x_P; \lambda) = \lambda_1 x_P + \tfrac{1}{2}\lambda_2 x_P^2 + \tfrac{1}{4}x_P^4 \tag{65}$$
where λ = (λ_1 λ_2)^T is the vector of parameters. The gradient of f(x; λ) is described by

$$(2x_1 \;\; 2x_2 \;\cdots\; 2x_{P-1} \;\; \lambda_1 + \lambda_2 x_P + x_P^3)^T \tag{66}$$

and its Hessian matrix is equal to

$$\mathrm{diag}(2 \;\; 2 \;\cdots\; 2 \;\; \lambda_2 + 3x_P^2) \tag{67}$$

From Expr. (66), it follows that the number of stationary points is equal to the number of real roots of the cubic polynomial

$$\lambda_1 + \lambda_2 x_P + x_P^3 \tag{68}$$

Define the discriminant D of this polynomial as

$$D = 4\lambda_2^3 + 27\lambda_1^2 \tag{69}$$

Then the polynomial has one real root if D > 0, three real roots of which at least two are equal if D = 0, and three different real roots if D < 0, respectively. This is a standard result from algebra (Selby, 1971). Moreover, it is easily shown that as a result of the absence of a quadratic term in Expr. (68), three equal real roots can occur only if both λ_1 and λ_2 are equal to zero. In that case, Expr. (68) reduces to x_P^3 and, consequently, the three roots are equal to zero. Since the coefficient of the quartic term x_P^4 in Eq. (65) is positive, the single real root occurring for D > 0 represents a minimum of g(x_P; λ). Furthermore, the three different real roots occurring for D < 0 correspond to two minima with a maximum between. Finally, if for D = 0 two of the three real roots coincide, the coinciding roots correspond to a degenerate stationary point resulting from the merging of one of the minima with the maximum of g(x_P; λ) while the remaining root corresponds to a minimum. If for D = 0 the three real roots coincide, the resulting stationary point is a degenerate minimum. From these considerations, it is clear that the function g(x_P; λ) changes from a two-minima, one-maximum structure into a single-minimum structure if the sign of the discriminant changes from negative to positive. Therefore, the bifurcation set is defined by

$$4\lambda_2^3 + 27\lambda_1^2 = 0 \tag{70}$$
It is shown in Figure 4. In the literature, this bifurcation set is called cusp and the function g(x_P; λ) described by Eq. (65) is called the (canonical) cusp catastrophe. The same function but with the coefficient of the quartic term −1/4 instead of 1/4 is called the dual (canonical) cusp catastrophe.
FIGURE 4. The bifurcation set for the cusp catastrophe (thick line), and the catastrophe itself for three different parameter combinations (insets).
Three subplots in Figure 4 show g(x_P; λ) for different (λ_1, λ_2)-combinations. Inside the cusp, g(x_P; λ) has a two-minima, one-maximum structure. On the cusp, one of the minima and the maximum have just merged to form a degenerate stationary point. Outside the cusp, the function has the one-minimum structure. Applying these results to the function f(x; λ) defined by Eqs. (64) and (65) shows that the structure of this function changes from a two-minima, one-saddle-point structure into a one-minimum structure, or the reverse, if in parameter space the bifurcation set is crossed. The fold and cusp catastrophes are analyzed in detail in Arnol'd (1992), Poston and Stewart (1978), and Saunders (1980).
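The sketch below locates parameter points relative to the cusp by the sign of the discriminant of Eq. (69) and cross-checks the predicted number of stationary points against the real roots of the polynomial of Eq. (68); the parameter combinations are illustrative assumptions.

```python
import numpy as np

def structure(lam1, lam2):
    D = 4.0 * lam2**3 + 27.0 * lam1**2            # discriminant, Eq. (69)
    roots = np.roots([1.0, 0.0, lam2, lam1])      # x**3 + lam2*x + lam1, Eq. (68)
    n_real = int(np.sum(np.abs(roots.imag) < 1e-9))
    return D, n_real

for lam1, lam2 in [(0.1, 1.0), (0.0, -1.0), (0.3849, -1.0)]:
    D, n_real = structure(lam1, lam2)
    if np.isclose(D, 0.0, atol=1e-3):
        label = "on the cusp"
    elif D < 0:
        label = "inside the cusp"
    else:
        label = "outside the cusp"
    print(f"lambda = ({lam1}, {lam2}): D = {D:+.4f}, "
          f"{n_real} real stationary points ({label})")
```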
3. Bifurcation Sets

The fold catastrophe and the cusp catastrophe have been chosen as examples in this section since it can be shown that the singularities exhibited by these functions are generic. That is, the occurrence of possible alternatives has measure zero (Whitney, 1955). In general, more complex singularities may always be split into folds and cusps. Besides, in the applications of singularity theory to parameter estimation and resolution to be described later, the singularities
of likelihood functions occurring will be folds and cusps. In particular, the bifurcation sets of these catastrophes play a dominant part since they are the subsets of the Euclidean space of the parameters where the structure, that is, the number and nature of the stationary points, changes. The maximum of the likelihood function represents the solution of the estimation problem. Being a stationary point, this solution will be seen to be affected by structural change. Structural change occurs only if at a stationary point the Hessian matrix of the function concerned becomes singular, that is, if one or more of the eigenvalues of this matrix vanish. Therefore, bifurcation sets are found by studying these eigenvalues. This study is considerably simplified by the observation that in the applications discussed in this article, no more than one eigenvalue will vanish at the same time. The subsequent analysis will be limited to this particular case. Readers interested in singularity in more than one coordinate are referred to Gilmore (1981), Poston and Stewart (1978), and Saunders (1980). The splitting lemma described by Eq. (56) shows that, under the assumption made, the function in the neighborhood of a degenerate stationary point (x̃, λ̃) = (0, λ̃) may be represented as

$$d_1\xi_1^2 + d_2\xi_2^2 + \cdots + d_{P-1}\,\xi_{P-1}^2 + g(\xi_P; \lambda) \tag{71}$$

where the d_p, p = 1, …, P − 1, are either equal to 1 or to −1. In any case, the Taylor expansion of the degenerate part g(ξ_P; λ) has no linear term since the origin ξ = 0 is a stationary point. Also, it has no quadratic term since this stationary point is degenerate. The question now arises as to whether the third-order and, possibly, higher order derivatives of g(ξ_P; λ) are equal to zero as well. The answer to this question is important since it determines the number of terms to be included in the representation g(ξ_P; λ) in the neighborhood of (x̃, λ̃) = (0, λ̃). An algorithm for the computation of this representation will be presented in the next subsection.
D. Functions near Singularities

1. Derivation of the Reduction Algorithm

In this subsection, an algorithm will be presented for the computation of the representation of arbitrary smooth functions in the neighborhood of a singular stationary point. It is the reduction algorithm described in Poston and Stewart (1978) but slightly modified and specialized to corank 1, that is, the case that the Hessian matrix in the stationary point has one eigenvalue equal to zero. For an alternative approach, see G. W. Hunt (1981). Since the functions f(x; λ) are supposed to be smooth, they may be represented by their Taylor expansion. Without loss of generality, suppose that x̃, the stationary point concerned, is the origin and that this stationary point
is degenerate for λ = λ̃. Also suppose that f(x; λ) is translated such that the constant term of the Taylor expansion is equal to zero. Then the Taylor expansion may be written as follows:

$$\tfrac{1}{2}x^T H x + \text{cubic and higher degree terms in the elements of } x \tag{72}$$
Ax
(73)
where A is a nonsingular P x P matrix such that
"~xlT H x
_ y r Gy
(74)
G-5 1A-rHA-1
(75)
where the P x P matrix
is diagonal: G - diag(61...
~P-1 ~P2)
(76)
Since, by assumption, H has corank 1, Expr. (72) may be written ~ly 2 --1--...-+- 6p_ly2_l -1--6p2y 2 -+- p(y)
(77)
where the 6p are either strictly positive or strictly negative for all values of )~. The coefficient e p2, on the other hand, may as a function of 3. become positive or negative and is equal to zero if X is equal to ~.. The function p(y) represents all cubic and higher degree terms in the elements of y. Furthermore, all elements of the vector x present in the cubic and higher degree terms in Expr. (72) have been replaced by elements of the vector y by using x = A-ly
(78)
This completes the first step of the algorithm. Step 2 The purpose of this step is removing all cubic terms with the exception of that in y3. This is done as follows. First, it is observed that the sum of all cubic terms ofp(y) containing Yl may be described as
261Yl fl (Yl, 99 9 YP)
(79)
286
A. VAN DEN BOS AND A. J. DEN DEKKER
while the sum of the remaining terms containing Y2 is equal to 262 Yefe (y2 . . . . , YP)
(80)
28p-lYe-l f P-l(YP-1, Yp)
(81)
and so on, up to
Therefore, the general form of these sums is
28pyp f p(yp . . . . , yp)
(82)
with p = 1. . . . . P - 1. Notice that the description Expr. (82) of the cubic terms implies a division by 8p of the cubic terms originally present. This is allowed since the 8p never vanish as opposed to ep2. The fp(.) are polynomials consisting of quadratic terms only. Subsequently combining this sum with 8pyp2 and completing the square yields 8p[y2 + 2ypfp(yp . . . . . yp)] --- 8p[yp "+- f p ( y p , . . . ,
yp)]2 - 6p f p2 (yp . . . . . yp)
(83)
Next define
Zp = yp + fp(yp . . . . . Yp)
(84)
with p = 1 , . . . , P - 1 and substitute this in Eq. (83). The result is r
e _ 6pf2p(yp , . . . , yp)
(85)
Substituting these results in Expr. (77) yields ~l Z 2 -4;-...--~ ~ p_ l Z 2 _ l -Jr- e p2 y 2 + e p3 y3p
- E
6pf2(yp . . . . . YP) + q(Yl, 99 9 YP)
(86)
p
where p = 1. . . . . P - l, and q(y~,..., yp) is the sum of all quartic and higher degree terms present in p(yl . . . . . yp). The term epay 3
(87)
is unaffected by the substitution. Two remarks are in order. First, the coordinates z l, 99 9 Zp_ 1 appearing in the quadratic part of Expr. (86) are coordinates that are curvilinear in the original coordinates Yl . . . . . yp. Second, Expr. (86) does not contain cubic terms in Z l , . . - , ZP-1 anymore since thefp(.) are quadratic and, therefore, their squares are quartic. Thus, the goal of Step 2, the removal of all cubic terms with the exception of that in y3, has been achieved.
RESOLUTION
287
RECONSIDERED
Finally, notice that the treatment of the coordinates Yl YP-1 is essentially different from that of the coordinate yp. The reason is that the representation Expr. (86) has to be valid for all values of)~ including ~ for which the coefficient E~p2vanishes, and, therefore, the quadratic term in ye is absent. Then the first term in yp present is cubic or of a higher degree. . . . . .
Step 3
A difficulty with the result Expr. (86) is that the quadratic terms are a function of the new coordinates z~ . . . . , Zp-1 but the quartic and higher degree terms are still a function of the old coordinates yl,. 9 YP. The purpose of Step 3 is to find a polynomial expression in z~ . . . . , Zp only. Therefore, expressions are needed for Yl . . . . . yp as a function of Zl . . . . . Zp. Substituting these expressions in Expr. (86) then yields an expression in Zl,. 9 Zp only. Simplest is the expression for yp. It is (88)
yp = Zp
since this coordinate is not changed. The remaining expressions for y l , . . . , yP-1 may be derived from Eq. (84) as follows. First, write, for each of the yp separately, a polynomial expression in Zl . . . . . Zp containing all possible terms
from all linear ones up to and including all those of a specified maximum degree and leave for the moment the coefficients of the polynomial unspecified. What the maximum degree required is will be addressed in Section V.D.2, where examples of substitutions for the yp will be described. Then these polynomials are substituted for the yp in each of the expressions for Zl . . . . . Zp-a described by Eq. (84). The values of the polynomial coefficients then follow from the identity of the left-hand member and the fight-hand member. All that is left to be done is to substitute the resulting polynomials yp(z) for the corresponding yp in Eq. (86). The result is a polynomial expression in the coordinates Zl,. 99 Zp only described by (~lZ 2 -+" 9 ""-JI- (~p-lZp_ 1 2 + ee2Z 2 + ep3z3e + r ( z )
(89)
where z = (zl... zp) r and r(z) -
- Z
3 P f 2 ( y p (z)' " " " ' y p ( z ) ) + q ( y l ( z ) . . . . , y p ( z ) )
(90)
p Step 4
The purpose of Step 4 is to remove the quartic terms with the exception of that in z 4 from Expr. (89). The approach is very similar to that followed in Step 2 to remove the cubic terms. Here, for the sum of all quartic terms of r(z) containing zl the following form is chosen: 2~lZlgl(Zl
.....
Zp)
(91)
288
A. VAN DEN BOS AND A. J. DEN DEKKER
and so on, with as general expression 26pZpgp(Zp ....
(92)
, Zp)
where p - 1 . . . . . P - 1. Combining this sum with the terms 3pZ2p and completing the square removes all quartic terms if as new coordinates (93)
Up -- Zp + gp(Zp . . . . . Zp)
are chosen. Substituting the polynomial expressions Zp(U), p - 1 . . . . . then yields (~1u2 -Jr-''" "31"-(~p-lU2_I "31-ep2 u 2 "k- 6:p3 u 3 -Jr- 8p4Up4 -JC S(U)
P - 1,
(94)
where s(u) is the sum of all quintic and higher order terms in the elements of
u - ( u l . . . u p ) r. Step 5 and Possible Further Steps In Step 5, all quintic terms with the exception of that in USp are removed, and in further steps, all higher degree terms if needed.
This concludes the derivation of the Reduction Algorithm. The result will be of the form ~lWl2 -~-""" "-~ ~P-1 t/-)2p_l + E p 2 W 2p -~- 8p3 w 3 -~- E p 4 w 4 -t- " " " -+- E p K W pK -~- t ( w )
(95) where K is arbitrary and t ( w ) is the sum of all (K + 1)-th and higher degree terms in the elements of to - (Wl . . . Wp) r. The first P - 1 terms of Expr. (95), representing the Morse part of the function, will remain the same, independent of the value of K. These terms will neither change sign nor vanish as a result of changes in the parameters )~. It is also seen that terms in Wl . . . . . wp_l of any degree higher than two will be removed as K increases. The quadratic term in Wp, on the other hand, may as a result of changes in the parameters change sign and vanishes for )~ equal to ~.. If the degree K is increased, the number of terms of the polynomial in Wp and the number of the corresponding coefficients increase. Suppose that for the parameter vector ~. the coefficients ee3 . . . . . ep, K-1 also vanish but that e p r is the first coefficient that does not vanish for any allowable parameter vector ~.. Then it can be shown that a polynomial coordinate transformation of we may be derived removing all terms in W p having a degree higher than K (Gilmore, 1981). The result is that the function around the degenerate stationary point (x, ~.) - (0, ~.) is represented by 611)2 "~-""" "['- ~P-11)2_1 + F-,P2U2 + 8p3V3p + e, P4V 4 + ' ' "
+ epKVI~
(96)
RESOLUTION RECONSIDERED
289
where Vl . . . . . Ve are the coordinates after removal of the terms t(w) by the described polynomial transformation of Wl, . . . , We-1 and that of We. It is emphasized that in the local coordinates Vl , . . . , re, the representation is exact. No approximations have been made. Therefore, for us to find the number and nature ofthe stationary points of f ( x ; )~) in the neighborhood of (x,)~) = (0, ~), it is sufficient to investigate the number and nature of the stationary points of the polynomial (97)
E p21)2 -~- 8P31)3 "~- F'P4 U4 "at - . . . "at- E PK uK
2. Useful Polynomial Substitutions As an example, in this subsection polynomial substitutions will be derived that are needed in Step 3 of the Reduction Algorithm described in Section V.D. 1. The results are used in Section VI of this article. Suppose that the Taylor expansion of the function to be represented is described by
+ + •
+
+ Y ooY +
+ Y olY Y + Yl oYlY
2
+ Ylo YIY + Yo oY + Yo IY Y3 + YoI Y Y
+ 833Y3 -q-q(Yl, YZ, Y3)
(98)
where q(Yl, Y2, Y3) is the sum of all quartic and higher degree terms and the coefficient e32 may vanish as a result of changes of the parameters X as opposed to the coefficients 61 and 32. Here, and in what follows, the coefficient of yk1yl2Y3 . m in the Taylor expansion will be denoted as )~kin. Expression (98) may be rewritten as follows: 61 y2 + 261Yl • 1 ()/300Yl2 + )/210YlY2 + )/201YlY3 + )/120Y2 + )/111Y2Y3 -Jr-)/102y2)/ 1
~1 + ~2Y2 -q- 2~2Y2 X g ()/030Y2 -~ )/021Y2Y3 nt- )/012y2)/~2
+e32Y 2 + e33y33 k-q(Yl, Y2, Y3)
(99)
Then, in the notation of Section V.D. 1, fl -- fl (Yl, Y2, Y3) __ )/p y2 p p I 2 t t 2 3001 -Jr-)/210YlY2 -+- )/ZOlYlY3 + )/120Y2 + )/lllYZY3 -k- )/102Y3 (100) and ~ , y2 , , _2 f2 -- f2(Y2, Y3) -- 0302 + Yo21Y2Y3 nt- )/012Y3 t
1
(101)
where a single prime denotes a division by 261 (e.g., Y3oo = ~)/3oo/61)-A double
290
A. VAN DEN BOS AND A. J. DEN DEKKER
prime denotes a division by 2t~2. Therefore, Eq. (99) may be written ~lZ 2 4- ~2Z 2 4- 832Z 2 4- 833Z~ -- ~ l f ? -- ~ 2 #
4-
q(Yl, Y2, Y3)
(102)
with zl = Yl 4- fl
(103)
Y2 4- f2
(104)
Z2 =
and Z3 = Y3
(105)
In Expr. (102), fl, f2, and q(yl, y2, Y3) are still functions of yl, Y2, and Y3 instead of functions of Zl, z2, and z3. Therefore, Eqs. (100), (101), and (103)(105) are used to derive polynomial expressions for yl, Y2, and Y3 in terms of z 1, z2, and z3 to be substituted in Expr. (102). These expressions are obtained as follows. Suppose that it is decided that quadratic expressions in z l, z2, and z3 are sufficient. Then substitution of these quadratic expressions for Yl, Y2, and Y3 in fl, f2, and q(yl, y2, Y3) of Expr. (102) produces a function representation in zl, z2, and z3 only. Inspection of f2 and f2 after this substitution shows that the degree of all terms containing Zl or z2 and resulting from the quadratic terms of the polynomial substitutions for y~ and Y2 is five or higher. After substitution, the same is true for all terms of q(yl, y2, Y3) containing yl or Y2. Therefore, these quadratic terms do not affect the quartic terms of the function representation. The conclusion is that for a quartic function representation, it is sufficient to substitute Zl for Yl and z2 for Y2, respectively. In addition, Eqs. (100) and (101) show that substituting z3 for Y3 and subsequent squaring off1 and f2 produces in Expr. (102) two quartic terms in z3, respectively described by __t~l
t2 4 Y102Z3
(106)
it2_4
(107)
and ,,
--~2Y012z3 Then 834, the coefficient of z4 in the function representation, becomes t2
.it2
834 - - J/004 - - ~1 Y102 - - 62Y012
(108)
where Yoo4is the coefficient of y43 in the original Taylor expansion. Since the remaining quartic terms containing z~ or z2 may be removed in the next step, the quartic function representation becomes
,~lZ2 4- a2z 2 4- s32z~ 4- s33z~ 4- s34z 4
(109)
with 1332- " J/002, 8 3 3 = Y003, and 8 3 4 defined by Eq. (108). Notice that Exprs. (108) and (109) are easily generalized to any number of inessential variables. Specifically, let there be a number of M 4- 3 variables Zl, . . . , zM+3 of which
RESOLUTION RECONSIDERED
291
zM+3 is essential. Furthermore, let ap/4! be the coefficient of z4+3 before the removal of the inessential cubic terms. Denote the coefficients of ZmZM+3 ,2 m = 1 , . . . , M + 2, by ~m and the nonvanishing eigenvalues of H by 8m, m = 1. . . . , M + 2. Then, by Expr. (108), the coefficient Of ZM+3after removal of the inessential cubic terms is described by 1 [ ~ _ 6~ordiag(6~-I 4!
"'"
t~M12)~]
(110)
where ~ois the (M + 2) • 1 vector of the ~Om.Expression (110) will be used in Section VI. The procedure may, of course, be continued to produce representations of a degree higher than four. For example, the quintic function representation may be shown to be (111)
~lZ 2 -{-" 62Z 2 -+- ea2Z~ nt- e33Z~ "+" ea4Z 4 + 635Z~
with ~35
--
)'005
--
~1
t t )'102()'103
__
)'t
t
tt
102)'201 -- ) ' 0 1 2 ) ' ; 1 1 ) -
It
~
tt
tt
It
~2)'0121')'013 -- )'012)/021 )
(112) However, for the purposes of this article, the quartic representation described by Eq. (109) suffices.
E. Conclusions
In this section, a relatively simple polynomial representation has been derived for functions in the neighborhood of singular stationary points. Here, the concept neighborhood refers both to the independent variables and to the parameters of the function. The parameter values have to be in the neighborhood of a bifurcation set representing all parameter combinations for which the stationary point concerned is singular. The parameter neighborhood in which the representation is valid consists of all parameter combinations between the bifurcation set concerned and the next one encountered in parameter space. The function representations analyzed have been restricted to functions that may become degenerate in only one of the independent variables at a time. This is sufficient for the purposes of this article. It has been shown that under these conditions the representation may always be put in the form described by Eq. (96): ~1U21 ..~_ . . . _~ ~p_lU2_l _+_Ep21) 2 .qt_F_,p3V3p ..+-F_,p4v4 .-~- . . . .-]- F,pK vK
(113)
In this expression, the sign of each of the coefficients 6 . . . . . ~ P - 1 remains the same for all parameter values considered. Therefore, the structure of the
292
A. VAN DEN BOS AND A. J. DEN DEKKER
function does not change in the directions of the corresponding independent variables Vl, . . . , re-1. This is the reason that these variables are called inessential The coefficient e~,2, on the other hand, may become equal to zero as may P3 up to and including ee,/c_ 1, but eetc is the first coefficient that cannot vanish under the influence of the parameters. The variable ve represents the direction in which structural change may occur and, therefore, ve is called an essential variable. It has been shown that the eek are equal to or are relatively simple algebraic expressions in the coefficients of the original Taylor expansion of the function about the stationary point concerned. The analysis of the structure of the function using the representation Expr. (113) is much simpler than that of the function from which Expr. (113) is derived, for the following reasons: 9 It requires an analysis in only one variable, the essential variable vp. The inessential variables v~. . . . . vp_~ may be left out of consideration. 9 The expression in the essential variable re, the degenerate part of the function, is the polynomial E p21)2p -~- ~ p31) 3 "~- E p41)4p -'~- " " " -3I'- E p K I) K
(114)
All that is required is a study of the nature and number of the stationary points of this polynomial. Combining the results of this study with the signs of ~1 . . . . . ~e-1 determines the structure of the function in the neighborhood of the stationary point concerned unambiguously. 9 The degree K of the polynomial (114) is such that the el, x is the first coefficient that cannot vanish. In many practical applications, this degree is low, which further simplifies the analysis of the structural change. In Section VI, the elements of singularity theory described in this section will be used to explain the occurrence of singularities in a general class of likelihood functions of parameters of two-component observations.
VI. SINGULARITY OF LIKELIHOOD FUNCTIONS
A. Introduction
In this section, it is shown that likelihood functions for two-component models may have singular stationary points. These singularities are caused by the observations. Thus the observations play the part of the function parameters in Section V. For different sets of observations, the structure of the likelihood function may be different. It will also be shown that these singularities may cause the values of the quantities to be measured to coincide exactly at the maximum of the likelihood function. For example, if error-disturbed observations are made on two seriously overlapping spectral peaks, the maximum
RESOLUTION RECONSIDERED
293
TABLE 1 CORRESPONDINGNOTIONSIN SINGULARITYTHEORY AND ESTIMATION Singularity theory
Estimation
Function Independent variables Parameters of the function Absolute maximum
Likelihood function Parameters of the model (Errors in) observations Solution for parameters
likelihood estimates of the locations of the peaks may be distinct for the one set of observations and exactly coinciding for the other. In the latter case, both quantities can, of course, no longer be resolved from the available observations. To avoid confusion, at this point, we introduce the terminology to be used in this section and compare it with that of Section V. In Section V, parametric functions of a number of independent variables were studied. In this section, the function is the likelihood function which is a function of the model parameters. The observations act as the parameters of the likelihood function. By definition, the location of the absolute maximum of the likelihood function is the solution for the model parameters. This corresponds to the absolute maximum of the function in Section V. Table 1 summarizes these corresponding notions. The outline of this section is as follows. In Section VI.B, two-component models are introduced and their parameterization, that is, dependence on the quantifies to be measured, is discussed. Next, it is explained how these parametric models enter the probability density function of the observations and thus define the likelihood function of the parameters. In Section VI.C, the results of Section VI.B are used to show that for all distributions of the observations and all types of component functions, the corresponding likelihood functions may be expected to have similar structures. Numerical examples in Section VI.D confirm this. In Sections VI.E and VI.F, a remarkably simple polynomial representation is derived describing and explaining the two different structures likelihood functions for two-component models may have. Conclusions are drawn in Section VI.G.
B. Likelihood Functions for Two-Component Models
Suppose that the elements ff)n of the vector w = ( w l . . . WN) T
(115)
are the available observations and that their expectation is described by the
294
A. VAN DEN BOS AND A. J. DEN DEKKER
two-component model
E[wn] -- f(xn; O) -- g(x,; F) + C~lh(X,;/~1) + ct2h(x,; t32)
(116)
In this expression, h(x;/~k), k - 1, 2, are the component functions with nonlinear parameters/3k and amplitudes ak. For example, in an exponential decay model h(x;/~k) = exp(--/3kX) while for Gaussian spectral peaks h(x; ~k)-e x p { - ~l(x -/3k)2}. The values xn of the independent variable x are the measurementpoints and are supposed to be known. For example, they may be time instants or spatial coordinates. The background function g(x; V) may be used to model contributions to the two-component model such as a trend g(x; V) -y lx or the sum of a trend and an offset g ( x ; F ) = ylX-[-Y2. The function g(x; F) may also model additional components akh(x; ~k), k = 3,... The M x 1 vector F is the vector of unknown parameters of g(x; F). The vector 0 is defined as
O = ( F T 0~1 0/2 /~1 f12) T
(117)
The model f ( x , ; O) described by Eq. (116) is called the generalized twocomponent model. The addition generalized refers to the inclusion of the function g(x; y). For reasons to become clear later, f ( x , ; 0) is first reparameterized as follows:
f(xn;O)--- g(Xn;F)-k-ot[~,lh(Xn;fll)-k-~.zh(xn;fl2)] with ~1 = ~. and is defined as
~,2 - -
(118)
1 - )~. Then the new vector of unknown parameters 0
O=(Y T ~
~. /~1 /~2) T
(119)
These are the parameters to be estimated from the observations. Notice that the parameter ~. distributes the amplitude ct over the components. Next suppose that the joint probability density function of the observations wn with expectation f(x~; 0) may be described as
p(w; f (O))
(120)
f (O) -- ( f l ( 0 ) . . . fN(O)) r
(121)
where
with f . (0) -- f (x.; 0). Then the corresponding log-likelihood function of the parameters t given the observations w is equal to
q(w; f(t)) = ln[p(w; f(t))]
(122)
where the elements of the parameter vector
t ' - ( c r a s bl b2) ~r
(123)
RESOLUTION RECONSIDERED
295
correspond to those of Eq. (119) while the vector f ( t ) is defined as f (t) -- ( f l (t) . . . f N(t)) T
(124)
and fn(t) -- f (Xn; t) -- g(x,; C) + a[g.lh(Xn; bl) + g.2h(xn; b2)]
(125)
with ~1 - ~ and g 2 - 1 - ~. Simplification of the analysis involves replacing maximum likelihood estimation of 0 through maximizing q(w; f ( t ) ) with respect to the parameter vector t described by Eq. (123) by the following hypothetical procedure. For a fixed value of ~, the likelihood function is maximized with respect to the newly defined parameter vector t = (c T a
bl
b2) T
(126)
obtained by leaving the parameter g out of the parameter vector Eq. (123). This is repeated for all allowable values of ~, and the optimal values ~ and ~ are subsequently selected. The parameterization Eq. (126) will be used in what follows. The maximum likelihood estimation of the parameters of the generalized two-component model may be summarized as follows. First, the generalized two-component model Eq. (118) is substituted for the expectations of the observations in the logarithm of the probability density function of the observations. Then the observations are substituted for the corresponding independent variables. Next, the exact parameters 0 are replaced by the variable parameters t described by Eq. (123). Finally, the log-likelihood function thus obtained is maximized with respect to the parameters t described by Eq. (126) for all allowable values of ~, and the optimal ~ and 7 are selected. The latter procedure has two advantages. First, as a result of keeping ~ fixed in every step, the number of parameters decreases by one, which substantially simplifies the analysis. Second, by using s the ratio of the amplitudes of the components may be limited to values agreeing with the available a priori knowledge. Since it will be assumed that Otl and or2 are known to have the same sign, in what follows only values of s will be chosen on subintervals of (0, 1). See also the remark in Section III.B.3 about incorporating a priori knowledge in estimation.
C. Stationary Points o f the Likelihood Functions
In this subsection, first the gradient of log-likelihood functions for the special case of two-component models is presented. Equating each of the elements of the gradient to zero yields a set of nonlinear equations called the normal equations. Among the solutions of the normal equations is the maximum of
296
A. VAN DEN BOS AND A. J. DEN DEKKER
the likelihood function which represents the maximum likelihood solution for the parameters. Furthermore, the solutions of the normal equations define the structure of the likelihood function, which is defined as the pattern of its stationary points. Such solutions are, therefore, essential to the study of structural changes of the likelihood functions to be carried out later. The gradient of the log-likelihood function Eq. (122) with respect to the parameters described by Eq. (126) has three types of elements:
0q
0q 0In
--- Z~,
OCm
Of, Ocm
n
----- ~
O__qq- ~n 0q 0fn 0a ~ 0a = Z
0q 0..
m--
Of', O C m
1 ... '
0q ~-~(s
M
(127)
'
+ s
(128)
n
and
Oq _ ~ Obk --
Oq Of,, Ofn Obk --" ~
Oq --(1) b -~ag.kn', ( k )
k-
1, 2
(129)
n
where q denotes the log-likelihood function q(w; f(t)), and f', is defined as f',(t) described by Eq. (125), while g', and h',(bk) are defined as g(x',; c) and h(x',; bk), respectively, which are both present in Eq. (125). Furthermore, h(,P)(bk) is defined as the pth-order derivative of h(x',; bk) with respect to bk. Then the normal equations are defined as
~
Oq Og', = 0 Ofn OCm
Oq y ~ -~n(s
m=l
M
(130)
' ... , +s
-- 0
(131)
1 2
(132)
n
and
~
Oq h~)(bk)= 0 ~n "
k-
Notice that Eqs. (130)-(132) constitute M + 3 nonlinear equations that can, in principle, be solved for the M + 3 unknown parameters Cl . . . . , CM, a, bl, and bz. However, since the equations are nonlinear, more than one solution is possible, as will now be demonstrated. Suppose that, instead of the parameters of the two-component model, the parameters
t = (c r a b) r
(133)
RESOLUTION RECONSIDERED
297
of the one-component model of the same nonlinearly parametric family (134)
f n ( t ) -- f (Xn; t) = g(Xn; C) -4- ah(xn; b)
would be estimated from the same observations w. Then, the normal equations of the likelihood function q for this model become of,,~
OcmOgn = 0
m - 1, . . . , M
(135)
n
~n~fn
hn(b) - 0
(136)
Oq h(1)(b) _ 0
(137)
and
CM
Supposethat (~1 ... CM h ~,)r is asolutionfortheparameters ( C 1 . . . a b) r in Eqs. (135)-(137). Then, if ~ = (~1... OM & b b) r is substituted for t (Cl ... CM a bl b2) r in the gradient equations Eqs. (127)-(129) for the twocomponent model Eq. (125), this gradient becomes equal to the null vector. That is, the parameter values ~ generated by the solution of the normal equations for the one-component model satisfy the normal equations for the two-component model. Therefore, t ' = (C'I...C'M
a
b
b) T
(138)
is a stationary point of the log-likelihood function for the two-component model. This stationary point will be called a one-component stationary point since it has been generated by a stationary point of the log-likelihood function for a one-component model. Conclusion VI.1 The first conclusion of this section is that log-likelihood functions for generalized two-component models always have a onecomponent stationary point. This is true for any component function, background function, and log-likelihood function.
Furthermore, let, for the moment, s = s = ~ 2 - - 0.5. Then, for symmetry reasons, (b2, bl) is the absolute maximum of the log-likelihood function if (bl, b2) is. This means that in this special case there are two equivalent absolute maxima located symmetrically with respect to the line bl = b2. If, subsequently, s is somewhat changed, a slight asymmetry in the locations of the maxima occurs and one maximum becomes relative but, for continuity reasons, does not disappear. This leads to the second conclusion.
298
A. VAN DEN BOS AND A. J. DEN DEKKER
Conclusion VL2 The second conclusion of this section is that the log-likelihood function for two-component models has two maxima.
D. Examples of Stationary Points In this subsection, examples are presented illustrating the structure of the loglikelihood function for simple two-component models.
Example VI.I
(A Biexponential Model and Normally Distributed Observations) In this example, it is assumed that the observations w = (Wl . . . wN) r have a normal probability density function. In Section IV.C. 1, it has been shown that then the log-likelihood function is described by N 1 q = --~-ln2zr - ~ lndet W -
1
-~(w- f ( t ) ) T w - I ( w - f(t))
(139)
where W is the covariance matrix of the observations and f(O) is the N x 1 vector of the expectations of the w~. In this example, it will be supposed that these expectations are biexponential and contain a linear background function fn(O) = 1/1 -+- 1/2Xn +
fflexp(--/~lXn)-~-ctEexp(--/32Xn)
(140)
or, in the parameterization of Eq. (118),
fn(O) = 1/1 + 1/2xn d- ct[~.lexp(-fllXn) + ~.2exp(-fl2Xn)]
(141)
The function g(x; 1/) introduced in Section VI.B is equal to 1/ix + 1/2 and h(x;/3k) = exp(--/3kX), k = 1, 2. Then, in agreement with Section VI.B, the model substituted for the expectations in the likelihood function is described by
fn(t)
-- Cl ~1_ CEXn +
a[g.lexp(-blXn) d- g.Eexp(-b2xn)]
(142)
with s - s and s - 1 - s From Eqs. (130)-(132), it then follows that the normal equations for the two-component model are described by y ~ Wmndnxm -- 0 m >_n
~-~ Wm.a. = 0 m>_n
(143) W m n d n [ e l e x p ( - b l X m ) q- e 2 e x p ( - b 2 x m ) ] - 0 m>_n Z t~ m>_n
= 0
k = 1, 2
RESOLUTION RECONSIDERED
299
with dn --
113n - - Cl - - C 2 X n - -
a[elexp(-blXn) -1- ~.2exp(-b2xn)]
(144)
while the W~n are the elements of W -1. The normal equations for the onecomponent model are described by
~
w2,.d~xm - 0
m>_n E tomndn m>_n
--0
(145)
E
wmndnexp(-bxm) -- 0
m>n E
wmndnxmexp(-bxm)
-- 0
m>n
with d n --- 11)n -
ClX n -
c2 -
a
exp(-bxn)
(146)
Notice that Eqs. (145) and (146) are directly obtained by substituting bl b2 = b in Eqs. (143) and (144). In the numerical example to be described now, the observations are supposed 2 Then W -- a~21, with I the to be uncorrelated and to have equal variance a~. identity matrix of order N. As shown in Section IV.C. 1, under these conditions the log-likelihood function described by Eq. (139) simplifies to N ln2rr - Nlnaw 2
1 2a 2
~'~[113 n
fn(t)] 2
(147)
Consider as a numerical example the following model:
f,(O) = 0.005 + 0.01Xn + 0.7 exp(--x,) + 0.3 exp(--0.8x,)
(148)
Then 0 = (?'1 Y2 ot fla f12)T = (0.005 0.01 1 1 0.8) T, ~.1 = 0.7 and, therefore, ~.2 = 0.3. For the x,, the values 0.4 x n, n = 1. . . . . 10, are chosen. Figure 5 shows f,(O) in these points. Next, in the log-likelihood function, the observations are replaced by their expectations. Generally, the structure of likelihood functions and criteria of goodness of fit as functions of the parameters with the observations replaced by their expectations provides the experimenter with a reference structure to which the structure for all other realizations of the observations can be compared. Notice that in this example with these "observations," all
dn = ll)n - - fn(t) = E [ w n ] - f n ( t ) = f n ( 0 ) - fn(t)
(149)
300
A. VAN DEN BOS AND A. J. DEN DEKKER
O.01x
0.5
% • FIGURE 5. Measurement points and expectations of the observations in Example VI. 1. The expectations are the sum of a biexponential function and a linear background.
for s -- )~ are equal to zero if t is equal to 0. Then the normal equations given in Eq. (143) are always satisfied while the log-likelihood function Eq. (147) is seen to be absolutely maximal. The log-likelihood function Eq. (147) is a function of the elements of t - (Cl c2 a bl bE)r. To visualize this functionin spite ofits high dimensionality, we adopt the following procedure. The first three equations of Eq. (143) are solved for the linear parameters Cl, c2, and a. Then the solution for these linear parameters is a function of the nonlinear parameters bl and be. This solution is subsequently substituted in the log-likelihood function Eq. (147), which makes it a function of b l and b2 only. Contours of this log-likelihood function are shown in Figure 6 as a function of b l - b2 and ~ , l b l -~-~,2b2, that is, for s = )~. These transformed coordinates have been chosen to make the contours easier to interpret. The contours show an absolute maximum at bl - bE = 0.2 and ~.lbl -k- ~.2b2 - - 0.94 corresponding to b l - - 1 and b2 - - 0 . 8 being the exact values/31 and/~2. There is an additional, relative, maximum at ~.lbl + ~.2b2 - - 0.941 and bl - b2 = --0.191 corresponding to bl - 0.884 and b2 - 1.075. More or less between both maxima of Figure 6, there is a saddle point at ~.lbl -k- ~.2b2 - 0.94 and bl - b2 - 0 corresponding to bl - b2 = 0.944. This is the one-component stationary point predicted in the previous subsection. For obvious reasons, in what follows the structure of the likelihood function thus found will be called the two-maxima, one saddle point structure.
RESOLUTION RECONSIDERED
301
0.96
..Q + 0.94
0
0.5
bl-b 2
FI~trRE 6. Contours of the normal log-likelihood function of Example VI. 1.
Example VL2
(A Bi-Gaussian Model and Independent Poisson Distributed Observations) In this example, it is assumed that the observations w = (Wl . . . W~v)r are independent and Poisson distributed. In Section IV.C.1, it has been shown that then their log-likelihood function is described by
E
--fn(t) -F Wn In fn(t) -- In Wn!
(150)
n
where, as before, t is the vector of parameters with respect to which the likelihood function has to be maximized. Furthermore, fn(O) is the expectation of the nth observation w,. In this example, it will be assumed that this expectation is described by fn(0) -" o~[)~lhn(fll) + )~2hn(fl2)]
(151)
where
hn (ilk) -- exp { - l ( X n
ilk) 2 }
(152)
with k - 1, 2. The components of this bi-Gaussian function are described by ot)~kexp{- ~1(Xn -/3k)2}, k -- 1, 2, located at 131 and/32 and having peak values 0 ~ . 1 - - 0g)~and ot~.2 = or(1 - )~), respectively. Therefore, the model fn(t) to be substituted in the likelihood function is
fn(t) -- a[s
+ e2hn(b2)]
(153)
302
A. VAN DEN BOS AND A. J. DEN DEKKER
with s - - s a n d s - - 1 - s Then the normal equations for the two-component model are described by Z
(154)
dn(t)[elhn(bl) -k- e2hn(b2)] - 0 n
and Z
k -- 1, 2
d~(t)(x. - bk)h.(bk) -- 0
(155)
n
with d.(t) =
Wn
(156)
1
f.(t)
Similarly, the normal equations for the one-component model are Zd.(tlh.(b)
(157)
-0
n
and dn(t)(x, - b)hn(b)
--
(158)
0
n
with d,(t) -
Wn
1
(159)
f,(t)
and f . ( t ) - a h . ( b ) = a exp { - 5l(x,, - b)2 }
(160)
Again, it is seen that if b is the solution of the normal equations for the onecomponent model, the point (b, b) is a solution of the normal equations for the two-component model. Consider as a numerical example the following model: f~(0) = 0.8 exp { - 5l(xn - 0.05) 2} + 0.2exp {_15 (x. + 0"05)2 }
(161)
Then 0 = (ct /31 /32)r = (1 0.05 - 0 . 0 5 ) r and ~. = ~.1 = 0.8 and, therefore, ~ . 2 - ( 1 - ~ . ) - 0.2. The measurement points are taken as x , - - 1 . 8 + (n - 1) • 0.4, n = 1 . . . . ,10. Figure 7 shows fn (0). Next the observations w, in the likelihood function Eq. (150) are replaced by their expectations f,(O) described by Eqs. (151) and (152). It is observed that then for e = ~. the deviations dn
ll)n =
fn(t)
1=
E[wn] fn(t)
1=
fn(O) fn(t)
1
(162)
RESOLUTION RECONSIDERED
303
1.2 0.8 exp{-O.5 (x-O.05) 2 } + 0.2 exp{-O.5 (x+O.05) 2 }
0.8
0.4
,,,,. 13 -2
-1
0 x
1
2
FIGURE 7. Measurement points and expectations of the observations in Example VI.2. The expectations are the sum of two overlapping Gaussian functions.
are equal to zero if t is equal to 0. Then the normal equations Eqs. (154) and (155) are always satisfied. The Poisson log-likelihood function Eq. (150) is a function of the three elements of t = (a bl b2)T. To eliminate a, we solve this parameter from the normal equation Eq. (154) where it is present in the dn(t): a --
~a n113 n
(163)
E, elh,(bl) + s
1 where hn(bk) -- exp{-~(Xn - bk)2}. If this expression for a is substituted in Eq. (153) and, subsequently, the resulting expression is substituted forfn(t) in the likelihood function Eq. (150), the result is --
Z nl
Lonl
Z n2
In {s
+s
+Z
ll)n
ln{s
+s
n
(164) where terms depending on the w. only have been omitted. Contours of the likelihood function Eq. (164) as a function of bl-b2 and ~-1bl "+" ~.2b2, that is, for s = )~, are shown in Figure 8. There is a maximum at (0.1, 0.03) corresponding to the exact parameter values bl - fll -- 0.05, b2 = /~2 = -0.05, and c~ = 1. There is an additional maximum at ( - 0 . 1 , 0.03) corresponding to bl - 0.01 and b2 = 0.11. Between these maxima, there is a saddle point at (0, 0.03) corresponding to bl - b2 = 0.03 which is the one-component stationary point.
304
A. VAN DEN BOS AND A. J. DEN DEKKER 0.035
eq J~
r
+ ,c- 0.03 T-
~
o
o.2
b 1- b 2
FIGURE 8. Contours of the Poisson log-likelihood function of Example VI.2.
Conclusion VL3 Example VI. 1 of this section concerns a normal likelihood function. The model of the expectations of the observations in this example is the sum of a biexponential function and a linear background function. The likelihood function of Example VI.2, on the other hand, is Poisson while the model of the expectations consists of the sum of two overlapping Gaussian peaks. In spite of these differences, the log-likelihood functions in both examples have the same two-maxima, one saddle point structure. This confirms the conclusion of the preceding subsection that this particular structure is characteristic of all log-likelihood functions for which the expectation of the observations consists of a generalized two-component function.
E. The Hessian Matrbc of the Log-Likelihood Function In this subsection, an expression is derived for the Hessian matrix of the log-likelihood function of the parameters of the generalized two-component model. This is a first step in the assessment of the behavior of the stationary points under the influence of the observations, which, as has been explained in Section V, requires Taylor expansion of the log-likelihood function with the model parameters as variables. In the preceding two subsections, the onecomponent stationary point of the likelihood function for two-component models has been introduced. This point, located between both two-component stationary points, will be chosen as the origin of the Taylor expansion. This choice will be seen to simplify the analysis substantially.
RESOLUTION RECONSIDERED
305
For the computation of the elements of the Hessian matrix of the loglikelihood function q with respect to the parameters t -- (c r a b~ b2)T of the generalized two-component model, use is made of the expressions Eqs. (127)(129) for the gradient with respect to these parameters. Straightforward differentiation yields five different types of second-order derivatives:
OZq
=~ ~ OZq Og,,~(c) Og,,2(c) + Oq 02gn (165) Z__, OctalOCm2 OCm'OCm2 ~nl Z..,,,: Of,,~Of,,20Cml OCm2 n 02q -- ~ Z ~ 02q Og"~(c)[s OCmOa nl n2 Of n~Of n2 OCm 02q = Z Z 002
02q [g'lhn1(b') + s
+s s
(166)
q- s
nl n2 Of nl Ofn;~ (167)
02q = ae, ~n~ Z ~ Ozq Ogn~(c)h(~(b') OCmObk
(168)
n2 OfniOfn2 0Cm
02q OZq = agk Z Z Ofn, Ofn2h(nl~(bk~)[elhnz(bl)+ s Oa Ob~ rtl ~12 Oq h~nlt(bk)
(169)
and
02q 02q (1) bkl h (1)(bk2 0bk~0bk: = agk~s ~n l ~n2 OfnlOfn2hnl ( ) ne ) Oq (2)(b~) + 6kl'k2as Z ~n h"
(170)
II
where
6kl,k: is the Kronecker symbol defined as 8kt,k2 =
1 0
if kl --- k2 otherwise
(171)
1. Change of Coordinates For reasons explained later, the coordinates t = (Cl...cM a bl b2)r are transformed into t t = ( c 1 . . . C M a b~ b~)T with b~ = g.lbl + s and
306
A. VAN DEN BOS AND A. J. DEN DEKKER
b~ -- bl - b2"
re: tt=
C~
O b
1
-(o o) I
= diag(l
L)t
(172)
\b where L=(~I 1
s
(173)
and I is the identity matrix of order M + 1. Then, if H denotes the (M + 3) • (M + 3) Hessian matrix defined by Eqs. (165)-(170), the Hessian matrix H t with respect to the new coordinates t t is described by H t -- diag(l
L - r ) H diag(I
L -1)
(174)
where the inverse L -1 of L is defined as
(l1 where use has been made of the fact that s + s = 1. Next suppose that the original matrix H is partitioned as follows:
(HllH12~ H = \Hrl 2 H22,]
(176)
where the dimensions of H11, H12, and H22 are (M + 1) x (M + 1), (M + 1) x 2, and 2 x 2, respectively. Then it follows from Eqs. (174) and (176) that
\Htl r
H~2J-"
(HI2L-I)T L-TH22L-1
(177)
The three different partitions H~I, H~2, and H2t2 of H t are now successively computed: 9 Equation (177) shows that the upper diagonal partition, H~I, is, of course, unchanged:
U~l
--
Hll
(178)
RESOLUTION RECONSIDERED
307
9 The upper off-diagonal partition, Hit2, is seen to be equal to l
hll --I-h12
s
- s
) (179)
HlI2_ H12L-1 -hm+l,1 --I-hM+l,2 ~2hM+l,1 -- g.lhM+l,2/12 9 The lower diagonal matrix, HJ2, is equal to
Ht22 -- L-r H22L-1 _ {
hll -t- 2h12 -+- h22
~ 2 h l l -~- (~2 - s
~2hll -+- (/~2 -/~l)h12 - ~1h22
s
- ~1h22~ !
- 2~1~2h12 -~- ~2h22 /22 (180)
where use has been made of the fact that H22 is symmetric. This completes the description of the Hessian matrix H t after the coordinate change Eq. (172) and in terms of the elements of the original matrix H.
2. The Hessian Matrix at the One-Component Stationary Point The next step is the evaluation of Ht at the one-component stationary point -- (Cl . - . CM a b ~))T described by Eq. (138). Recall that, by definition, the elements of (~1...OM ~ b) r satisfy the normal equations Eqs. (135)-(137) of the likelihood function for the estimation of the parameters of the onecomponent model. Again the three partitions Hltl, Hlt2, and HEr2 of H t are addressed consecutively: 9 Consider first the elements of the partition Hit1 described by Eqs. 165167. Since bl - b2 - b and e I + ~2 -- 1, these elements are easily seen to be equal to the corresponding elements of the Hessian matrix of the log-likelihood function for the parameters of the one-component model, evaluated at the stationary point (~1 ... OM ~ b) r. 9 Next the elements of the partition Hit2 defined by Eq. (179) are addressed. These elements are functions of the elements of the matrix H that are defined by Eqs. (168) and (169). Simple computations then show that the first M elements of the first column of Hit2 are equal to the second-order partial derivatives of the log-likelihood function for the one-component model with respect to Cm and b. Therefore, they are described by ~ " OcmOb = a Z Znl
n2
02q ~ h(n~(b) Ofn18fn2 8 C m
(181)
308
A. VAN DEN BOS AND A. J. DEN DEKKER Similarly, the (M + 1)-th element of the first column is the second-order partial derivative with respect to a and b. It is, therefore, equal to
02q OaOb
:
OZq h(1)(b)h,,~(b) + ~n Oq h,,(1)b) (
nl
n2
oS.,OSn..1
(182)
02q h(l)(blh,,2(b) =~l Znl Zn2 OfnlOfn2 nl where the last step follows from Eq. (137). Equally simple computations show that the elements of the second column of Hit2 are all equal to zero. 9 Finally, computation of the elements of H2t2 using Eqs. (170) and (180) shows that the upper diagonal element of this matrix is the second-order derivative of the log-likelihood function for the one-component model with respect to the parameter b:
02q O2q (~) ,, h(~) ~ ~ 8q h(2) Ob2 -- gt En, Zn~ Ofn:O-fn2hn, (b) n2 ( ) + ~ - - n (b) Ofn
(183)
Furthermore, the off-diagonal elements are equal to zero while the lower diagonal element is equal to
p -- hs
- g.) ~ Oqh(2)(b] n Ofn-n " "
(184)
Hence, in summary, the particularly simple outcome for the Hessian matrix of the log-likelihood function for two-component models evaluated at the one-component stationary point is
Ht=(oG ~
;)=diag(G
p)
(185)
where G is the (M + 2) x (M + 2) Hessian matrix of the log-likelihood function for the one-component model and o is the (M + 2 ) x 1 null vector. Before the results Eqs. (184) and (185) are discussed, two examples illustrating the computation of the quantity p are presented. These examples correspond to those of Section VI.D.
309
RESOLUTION RECONSIDERED
Example VI.3 (The Quantity p for a B iexponential Model Fitted to Normally Distributed Observations) For the normal log-likelihood function and the biexponential model with background, described by Eqs. (147) and (142), respectively, the quantity Oq/Ofn becomes Oq .....
1 =
(186)
2 (Ul)n - - L )
Ofn
t7 w
with
fn = fn(t) -- a[el exp(-blXn) +
e 2 exp(-b2xn)]
+ cl + C2Xn
(187)
Therefore, p
as
- s O"2
Z ( W n - fn)x2exp(-bxn)
(188)
n
at ~ = (c1 c2 a b b) r. Notice that (C1 C2 t~ b) T is the least squares solution for the parameters when the one-component model C 1 -~- C2X n -~- a
exp(-bxn)
(189)
is fitted to the observations. Therefore, the procedure for computation of the quantity p from given observations w is this" 1. Compute the least squares solution (Cl C2 t~ b) T for the parameters (Cl c2 a b) T of the one-component model Eq. (189). 2. Substitute the solution in the expression Eq. (188) for p.
Example VL4 (The Quantity p for a Bi-Gaussian Model and Independent Poisson Distributed Observations) In this example, the quantity p is computed for the Poisson log-likelihood function and the bi-Gaussian expectation model described by Eq. (150) and Eqs. (151) and (152), respectively. From Eq. (150), it follows that
Oq
Wn
=
Ofn
In this example, h n ( b ) 1]hn(b). Therefore,
--
exp{-~l ( x
1
(190)
fn n
--
b) 2} and hence h(~2)(b) - -
[(x
n
-
b) 2
p--&s n
= ag(1-- g) ~ (-~n -1)(xn - b)2hn(b)
(191)
310
A. VAN DEN BOS AND A. J. DEN DEKKER
at (a b) r = (gt b) r where, in the last step, use has been made of Eq. (157), which is the normal equation for the one-component model.
3. Summary, Discussion, and Conclusions In Sections VI.E. 1 and VI.E.2, the expression Eq. (185), Ht=(oGr
;)=diag(G
p)
(192)
has been derived for the (M + 3) x (M + 3) Hessian matrix of the log-likelihood function q(t) at the one-component stationary point (~1 ... ~M gt b b) T. In Eq. (192), the matrix G is the (M + 2) x (M + 2) Hessian matrix of the likelihood function for the one-component model with the same observations and evaluated at the stationary point ( ~ . . . ~M g~ b) T. The vector o is the (M + 1) x 1 null vector. The quantity p is the scalar defined by Eq. (184),
Oq h(2) ,, (b)
p = ,~e(] - s
~
(193)
n
evaluated at (C1 . . . CM a b b) r. The scalar quantity s is discussed in Section VI.B. It is a parameter of the two-component model
f , ( t ) = g,(c) + a[e~h,(bl) + e2h,(b2)]
(194)
with g,(c) and h,(bk) the background function and the component function, respectively, both in the point x,. It is important to realize that under these conditions H t is fully defined by the elements of (~ ... ~M ~ b) r, that is, the stationary point of the log-likelihood function for the one-component model
f~(t) -- g,(c) + ahn(b)
(195)
From these considerations, the usefulness of transforming the coordinates bl and b2 into b~ = e~bl + e2b2 and b2t = bl - b2 is now apparent: this transformation partly diagonalizes the Hessian matrix at the one-component stationary point and produces the diagonal element p. Furthermore, up to now, no particular assumptions have been made with respect to the nature of the stationary point (Cl . . . CM Cl b) T of the log-likelihood function for the onecomponent model. However, if it is assumed that it is the maximum likelihood solution for the parameters (c~... CM a b) r of the one-component model, (el ... ~M ~ ~)r is, by definition, a maximum and, therefore, the Hessian matrix G in this point is negative definite. Then the nature of the one-component
RESOLUTION RECONSIDERED
311
stationary point ( C l . - . CM a b b) T of the log-likelihood function of the twocomponent model parameters is determined by the sign of p. If p is positive, this point is a saddle point. This saddle point is clearly seen in the reference structures of Figures 6 and 8. On the other hand, if p is negative, it is a maximum. Finally, if p is equal to zero, the point is degenerate. Therefore, in any case, the coordinates C1 - . . , r a, and b~ are inessential variables. On the other hand, if p might vanish, b~ would be essential. (For an explanation of the terms essential and inessential, see Section V.C.1.) For the moment, assume that p may vanish. It will be shown later that this may happen under the influence of fluctuations of the observations. Then, for a study of the structures of the log-likelihood function that may occur, the degenerate part of this function is needed, that is, the Taylor polynomial in b~ up to and including the first term with a nonvanishing coefficient such as described in Section V.D. 1. The derivation of this polynomial will be the subject of the next subsection.
E The Degenerate Part o f the Likelihood Function 1. The Quadratic Term
Equation (192) describes the Hessian matrix of the log-likelihood function for the two-component model in the one-component stationary point (~1... On4 h b b) T. Then the quadratic terms of the Taylor expansion of the loglikelihood function about this point are described by 1 2~[( C _
~)T
a -- a
b~ - b]G[(c - ~)r
a - a
1
t2
b~ - b] T + ~.pb 2
(196) As in Section VI.E.3, it is now assumed that (~1... OM a ~,)r is the maximum likelihood solution for the parameters of the one-component model. Then (~1 ... O~t a b) r is a maximum of the likelihood function for the onecomponent model. Since the matrix G is the Hessian matrix at this maximum, it is negative definite. This implies that G can always be diagonalized by a suitable nonsingular transformation of ((c - ~)r a - a b~ - b) r into an M + 2 vector f so that the quadratic terms Eq. (196) become 1 2~
~2
t2
1 p ff22
~mtm + ~
(197)
m=l
with b~ = b~ where the a,n, rn = 1 , . . . , M + 2, are strictly negative but p may be positive, zero, or negative. Then the splitting lemma presented in
312
A. VAN DEN BOS AND A. J. DEN DEKKER
Section V.C.1 shows that the first M + 2 terms of Expr. (197) are the nondegenerate part of the representation of the log-likelihood function in the neigh! l borhood of the one-component stationary point. The coordinates t 1. . . . . tu+ 2 are inessential. The term pb'22 is the first term of the degenerate part. This part has to be investigated by computing the higher degree terms up to and including the first term having a coefficient that cannot vanish. This computation will be carried out in the subsequent subsections. This subsection will be concluded by two examples illustrating the influence of errors in the observations on the sign of the coefficient p. E x a m p l e V L 5 (The Sign of the Quantity p in the Presence of a Modeling Error) In this example, the quantity p is computed for the log-likelihood function described by Eq. (147):
N
ln2zr - N l n a w
1 2a 2 2 [ W n
- fn(t)] 2
(198)
n
when, as in Example VI. 1, the expectations of the observations are described by fn(O) = )'lXn + )'2 + C~[~.exp(--fllx,) + (1 -- )~) exp(--fl2x,)]
(199)
but, different from Example VI. 1, the model fitted to the observations is taken as
fn(t) -- a[g. exp(-blXn) + (1 - g.)exp(-b2xn)]
(200)
That is, the background, present in the observations, is not present in the model. The following numerical values are chosen for the parameters: c~ = 1, )~ 0.7, fll = 1, and t2 - 0.8. The measurement points are x~ -- 0.4 x n, n 1 . . . . ,10. The standard deviation aw may be any value but is taken equal to one. Next p is computed for )'1 - - 0 and --0.003 < )'2 < 0. As described in Example VI.3, this is done by fitting the model a exp(-bxn)
(201)
to the Wn and substituting the solution ~, b in the expression for p" p - ~s
- s Z [ W n - ~l e x p ( - b x , ) ] x 2 e x p ( - b x , )
(202)
n
with e -- 0.7. The results are shown in Figure 9. The figure shows that in the absence of background, p is positive. It decreases for decreasing background until for )' -~ - 0 . 0 0 2 the quantity p becomes negative. Thus, this example shows the influence of a relatively minor modeling error upon the sign of the quantity p.
RESOLUTION RECONSIDERED
313
4,x 10 -4
czO
-
I
40
-3 BACKGROUND
x 10 .3
FI6ul~ 9. The quantity p as a function of the background modeling error in Example VI.5.
V I . 6 (The Sign of the Coefficient p in the Presence of Statistical Errors) The purpose of this example is to illustrate the influence of statistical fluctuations in the observations on the sign of the coefficient p. For that purpose, sets of independent, Poisson distributed observations Wn, n = 1, . . . , N , are generated having expectations
Example
f(Xn; 0) ~-- f n ( 0 ) ~- ot[~.lhn(fll) + ).2hn(fl2)]
(203)
where
hn(flk) --
exp{ -1~(Xn
- ilk) 2 }
(204)
with k = 1, 2. The reference structure of the likelihood function of this example is discussed in Example VI.2. The expression for the coefficient p is derived in Example VI.4 and is described by Eq. (191). As in Example VI.2, suppose that/31 = -/31 = 0.05 and )~ = 0.8, while the measurement points are Xn = 1.8 + (n - 1) • 0.4, n = 1 . . . . ,10. The shape of f ( x ; O) and the location of the measurement points are shown in Figure 7. The observations are generated for ot = 625, 2500, 10,000, 40,000, and 160,000, respectively. In Figure 7, f(xs; 0) and f(x6, 0) are approximately equal to one. The relative standard deviation in these points is 4, 2, 1, 0.5, and 0.25% for the chosen values of c~, respectively. The generation of a set of observations corresponding to each of these values of c~ has been repeated 100 times. For each of the resulting -
314
A. VAN DEN BOS AND A. J. DEN DEKKER 60
50 >
~4o c t~ t~ c::
30
.,9. Q.
20
1%
4
1'2
+ (z
1'6 x 104
FIGURE 10. Percentage of negative values of the quantity p as a function of the expected number of counts in Example VI.6.
500 sets of observations the coefficient p has been computed. Figure 10 shows the percentage of the negative values of p for each of the values of ct. This percentage decreases for increasing ct, that is, for decreasing relative standard deviation of the observations. The results of Example VI.6 may be commented upon as follows. The definition by Eq. (184) shows that p is a linear combination of the quantities Oq/Of,, n = 1 . . . . . N. Here, f, is the one-component model f , ( t ) = ah,(b) at t = ~ - (~ b) r and, in the notation of Eq. (122), q is the logprobability density function q(w; f(?)) = log{p(to; f(?))}, that is, for hypothetical observations with expectations f,(?). Since these expectations are parameters of the probability density function p(w), the quantity Oq/Of, is equal to the Fisher score of these parameters. (The Fisher score concept was introduced in Section IV.C.2.) If the observations made and substituted in Eq. (184) had a probability density function p(w; f(?)) with expectation f,(?), the expectation of Oq/Of, would vanish since this is a property of the Fisher score (see Section IV.C.2). However, the true expectation of the observations w, is the two-component model f.(O) -- ct[Xh.(/31) + (1 - )~)h.(/~2)]
(205)
The conclusion is that 8q ~Of,, tends to a stochastic variable with an expectation
RESOLUTION RECONSIDERED
315
equal to zero as fn (t) tends to fn (0). Generally, this will happen as/~1 and/~2 are more closely located. This is illustrated in Examples VI.3 and VI.4 where
Oq = __1(Wn - f~)
a~2
Of.
(206)
and
Oq
af.
=
Wn
1
in
(207)
respectively, with f , -- f,(~). Since ~ is the solution for the parameters of the one-component model, it differs little for different realizations of the observations and may be considered to be constant. Then the expectations of the fight-hand members are approximately equal to 1 a~ [f~(0) - f~]
(208)
and
f.(0) in
1
(209)
respectively. On the other hand, if/31 and/~2 are more widely apart, the approximate expectations Exprs. (208) and (209) are increasingly different from zero. The coefficients of the Oq/Ofn in Eq. (184) are proportional to h(n2)(b), n -1. . . . , N. Again, since ~, is an estimate of a parameter of the one-component model, it differs little for different realizations of the observations and may be considered to be constant. Then the expectation of p is a linear combination of the expectations of the Oq/Of~ and tends to zero if/31 and/~2 approach each other. The variance of the Oq/Ofn is seen to be dominated by the variance of Wn and, hence, the variance of p is dominated by the variances of the elements of w - (Wa ... Wu) r. Therefore, sample values of p may be expected to be increasingly frequently negative, for increasing variance of w, a fact demonstrated in Example VI.6.
Conclusion VL4
It is concluded that both systematic errors (wrong model) and nonsystematic errors (statistical fluctuations) may change the sign of the coefficient p as compared with the sign in the absence of the errors. In the absence of errors, the one-component stationary point is a saddle point. After a change of sign of p as a result of the errors, the point becomes a maximum of the likelihood function of Examples VI.5 and VI.6. If this maximum would be
316
A. VAN DEN BOS AND A. J. DEN DEKKER
absolute, this would have important consequences since then the onecomponent stationary point would, in both examples, represent the solution.
2. The Cubic Term In Step 2 of the Reduction Algorithm, derived in Section V.D.1, all cubic terms of the Taylor expansion of the function investigated are removed with the exception of the cubic term in the essential variable, which, in the notation of the preceding subsection, is the term in b~3. It was found that the coefficient of this term was not affected by the removal procedure. Therefore, for us to compute the coefficient of the cubic term of the degenerate part of the representation of the likelihood function around the one-component stationary point, it is sufficient to compute the coefficient of b~3 in the coordinates t' -(t~ tM+ 2 ' b'2)r introduced in Section VI.E 1. However, this coefficient must t3 be the same as that of b 2 in the coordinates t t -- (Cl ... CM a b~ b~) r since t' has been obtained from t t by a linear transformation such that b~ = b~. Finally, t t is a linearly transformed version of the vector t = (Cl... CM a bl b2) T introduced in Section VI.E. 1 and described by .
.
.
t t __ diag(l
L)t
(210)
where I is the identity matrix of order M + 1 and
L=(s
_s
(211)
It is clear that this transformation affects b~ and b2 only and that
bl -- b~ + e2b~
(212)
b2 = b~ - g.lb~
(213)
and
Generally, the cubic terms in bl and b2 of the Taylor expansion of q about t = are in operator notation described by
(b~ - b)5-~ + (b2 - b)
(214)
q
By Eqs. (212) and (213) this is equal to 3~1 ( b ~ - k , )
~1+~20-
+ b~
~2~ 1 --el~ 2
q
(215)
RESOLUTION RECONSIDERED
317
Straightforward calculus shows that the genetic expression for third-order partial derivatives of q with respect to bl and b2 is
03q 03q Ofnl Ofn20fn 3 Obk Obe Obm = Enl Zn2 Zn3 Ofnl Ofn2 0fn3 0bk Obe Obm OZq
+ V~ E nl n2
+
Of n l Ofn2
O2fnl
t
Obk abm Obe
02L10L2 0 b k 0 b~. a bm
2fn, Ofn Obe Obm Obk
Oq 03f n + Zn Ofn Obk Obe Obm
(216)
where k, s and m are equal to either 1 or 2. In this expression, fn is the generalized two-component model described by Eq. (125). Combining Eqs. (215) and (216) and some straightforward algebra then yield for the coefficient of b~3 O"
3~
(217)
with
Oq h~n3)(b)
(218)
n
This completes the computation of the cubic term of the degenerate part of the representation of the likelihood function q in the neighborhood of the one-component stationary point. Notice that in the important case that the amplitudes are equal, s = s -- 0.5 and, hence, cr is exactly equal to zero.
Example VI.7 (The Coefficient cr for the Least Squares Criterion and a Biexponential Model) For the normal log-likelihood function described by Eq. (147), 0q ~fn
1 =
2 [Uon --
%
fn(t)]
(219)
Then for this log-likelihood function and the biexponential model with background described by Eq. (142), the quantity cr becomes o" - - --
aglgZ(g2
2 O'w
-- gl)
E[Wn n
-- C.1 -- CZXn -- a
exp(-bxn)]X 3 exp(-bxn) (220)
with the least squares solution (h b) ~ substituted for the parameters (a b) ~.
318
A. VAN DEN BOS AND A. J. DEN DEKKER
E x a m p l e VI.8 (The Coefficient tr for a Bi-Gaussian Model and Independent Poisson Distributed Observations) In this example, the quantity tr is computed for the Poisson log-likelihood function and the bi-Gaussian model described by Eq. (150) and Eqs. (152) and (153), respectively. From Eq. (150), it follows that Oq
af.
=
Wn
f.
1
(221)
1 In this example, hn(b) - e x p { - ~(x. - b) 2} and hence h~3)(b) - [(x. - b) 2 - 3] (Xn -- b)h,,(b). Therefore,
O" - - a e l e 2 ( e
=
t~eleZ(e2
2 -
el)
-
el)
~n(wn)
-~n -- 1 [(Xn -- b) 2
~
( -w,, _ l)(x, ~n
-
3](Xn
_ b)3h,,(b )
--
b)hn(b) (222)
at (a b) r - (~ b) r where, in the last step, use has been made of the normal equation for the one-component model Eq. (158). The definition of cr by Eq. (218) is similar to the definition of p by Eq. (184) since both are linear combinations of the Fisher scores Oq/afn. Therefore, the properties of tr are comparable to those of p as well. The quantity tr also has an expectation tending to zero as fll tends to 32. Its variance is dominated by and increases with that of the elements of w. As a consequence, sample values of both signs may occur with a frequency dependent on the distance of fll and 32 and the statistical properties of the observations wn. Expression (108) of Section V.D.2 shows that for computing the coefficient of b[4 in the degenerate part after removal of nonessential cubic terms, the coefficient of b~4 and the coefficients of tmb ' 2,2, m = 1, . . . , M + 2, in the Taylor expansion with respect to the elements of t ' - - (t~ . . . tM+2 2 , b,)r before the removal are needed. Since the elements t_'_, m = 1 . . . . . M + 2, are liner combinations of the elements of (Cl . . . cM a b~l) r, the coefficients of the t2 terms in 2 tmb 2 a r e l i n e ~ combinations of the coefficients of the terms Clb~ . . . . . cMb 2 , ab E , and htht: of the Taylor expansion with respect to "l ~ (Cl . . . cM a b~ b~) r. Therefore, for us to get an impression of the behavior of the coefficients of the terms in 9 tm , b 2r2~it is sufficient to study the behavior of the coefficients of c 1b~2, 9 9., cMb 2t , ab 2t , and ~-1'-2 h* h t 29 This is the reason why the latter coefficients are now discussed. Since their computation is similar to that of cr described earlier in this subsection, onl~r the results are given. These are as follows. The coefficients of cmb~ 2 , abt2 , and 9hit, -1"2t2 are described by, 9
t
RESOLUTIONRECONSIDERED
319
respectively, 1
e,e a
02q
nl,n2
!e~e2a2 Z
nl,n2
h~2~(b)Ogn2(~')
Ofn;-Ofn2 82q
(223)
OCm
1 h(Z)(b)h,, (b) + ~p 2
Of,,, Ofn2 "'nl
(224)
and 1
~1'~2~2
Z
""2/ "q
Ofnl
(2) "
(1) "
1
Ofn2hnl(b)h,,2(b) + 70.
(225)
nl,n2
Study of these coefficients reveals that they are relatively simple expressions of the one-component solution (Or a b) r. Furthermore, Eqs. (224) and (225) both contain a second term which is by definition much smaller than the first term and can in practice be left out. This is illustrated in the following examples.
Example Vl.9 (Coefficients of Cubic Terms for the Normal Log-Likelihood Function and an Exponential Model) For the Log-Likelihood Function of Eq. (147) and the exponential model with background described by Eq. (148), the coefficients Eqs. (223)-(225) become t~le220.2~
xZexp(-bxn)
gtg.lg.2 ~ x3exp(_bXn ) (226) 2o.2
and
for m = 1, 2, respectively,
(-2bxn)
(227)
x3exp(_2bXn)
(228)
ae1~22o .2 ~nx z e x p and t~ele220. 2 n~
These coefficients are stable since the one-component solution (~1 ... Og a b) r is stable in the sense that it does not differ very much from the one-component solution for exact observations.
Example VI.IO (Coefficient of Cubic Terms for the Poisson Log-Likelihood Function and a Bi-Gaussian Model) For the log-likelihood function Eq. (150) and the bi-Gaussian model described by Eqs. (151) and (152), the coefficient of the terms in cmb~2 is, of course, equal to zero since there is no background function. The expression for the coefficients of the terms in ab~ and b~b~ are
320
A. VAN DEN BOS AND A. J. DEN DEKKER
as follows:
el~-2 Z UOn{(Xn-1 ) e a
1}
(229)
n
and
s163 y ~ W , { ( X n - 1)2 - l } ( X n - 1) a
(230)
n
1 if the terms ~1 p and ~a are left out. These are again stable coefficients.
The results of this subsection may be summarized as follows. Expressions have been derived for the coefficients of the cubic terms of the Taylor repret t sentation in the coordinates t' -- (t~ ... tu+ 2 tM+ 3) T . These are the coordinates after full diagonalization of the Hessian matrix and t~t+3 is the essential variable b~ which, in its turn, is equal to bl - b2. The coordinates t' are a linear transformation of the coordinates t t - ( t ~ . . . tiM+3)T -- (Cl ... cu a ~.lbl 4~.262 bl - b2) T introduced earlier to partly diagonalize the Hessian matrix in order to show that bl - b2 is essential. The coefficient of b~3 is the quantity a / 3 ! described by Eq. (218). This equation shows that a is a linear combination of the quantities Oq/Ofn discussed earlier. Therefore, cr may have a different sign for only slightly differing sets of observations. This behavior differs essentially from that of the coefficients of the cubic terms in tmb ' 2,2, m = 1 . . . . . M + 2, to be used in the next subsection. These coefficients are linear combinations of the coefficients of the terms in ttmb'22 since t' is a linear transformation of t t. It has been shown that the latter coefficients are stable in the sense that they do not change very much and keep the same sign if the observations change. Therefore, the coefficients of t'mb;2, m = 1, . . . , M 4- 2, are stable in the same sense, a fact that will be employed in the next subsection, which deals with the quartic terms. Notice that if in Eqs. (224) and (225) the terms p and cr are left out, Eqs. (223)-(225) can be compactly expressed as
~_~ lgleza
~1 ,•2
where ~ - (~'1...
OZq h(Z)(b)Of~2(t) afn, Ofn2 n, Otto
~'M ~
b
m - 1, . . . , M + 2
(231)
b)T.
3. The Quartic Term In this subsection, the quartic term of the degenerate part of the representation of the log-likelihood function in the neighborhood of the one-component stationary point is computed. For this purpose, use is made of the Reduction Algorithm described in Section V.D, of Eq. (110) in particular. This equation
RESOLUTION RECONSIDERED
321
shows that the coefficient r/4! of the quartic term of the degenerate part after the removal of the inessential cubic terms is equal to the difference of the corresponding coefficient before the removal and a linear combination of the squares of the coefficients of a selection of the cubic terms. In the notation of the preceding subsection, this is the difference of the coefficient of b~4 and a linear combination of the squares of the coefficients of the cubic terms in t'mb'22, m = 1 . . . . , M + 2. Equations (223)-(225) describe these coefficients. Therefore, all that is left to be done is to compute the coefficient of b~4 before the removal of the pertinent cubic terms. This computation is highly similar to that of the coefficient ~r described in Section VI.E2 and will, therefore, be left out. The result is 1 4~ 1/t --
4!
nl,n20fnlOfn2
nl
Ofn n2
(232) The first term of this expression is a stable term not changing very much under the influence of small changes in the observations. The second term, on the other hand, is a linear combination of Oq/Ofn, n = 1 . . . . , N , and, as before, may be left out. This produces 1 1 &z~2p2 4~ r ~ 2 1"2 Z
nl,n2
02q (2) ^ (2) ^ OfnlOfn'--------~zhnl (b)hn2(b)
(233)
It is observed that Expr. (232) is the quantity ~ / 4 ! used in Eq. (110) in Section V.D.2. The quantifies ~Pm,m = 1 . . . . . M + 2, appearing in the same expression, are the coefficients of tmb2, ' ,2 m - - 1 . . . . . M + 2, discussed in the preceding subsection. Then, by Eq. (110), the coefficient of b~4 after removal of the inessential terms becomes r/4! with r - [~ -6q9 r diag(611... 6Ma2)qg]
(234)
It is straightforward to show that this is equal to r = [~ - 6~p'r G-1 ~t]
(235)
where the elements of qg' = (r . - . r r are defined by Exprs. (223)-(225), respectively, and G is the Hessian matrix of the log-likelihood function for the one-component model. In other words, the coefficient of b;4 after removal of the nonessential cubic terms is equal to the difference of two terms. The first term is equal to the quantity 7t/4! described by Expr. (232). Expression (233) shows that it is approximately a quadratic form in second-order derivatives of
322
A. VAN DEN BOS AND A. J. DEN DEKKER
the one-component function with respect to its location parameter and evaluated at the maximum of the likelihood function for the one-component model. The second term is merely a correction of the coefficient ~p/4! for the introduction of the curvilinear coordinates removing the nonessential cubic terms. The term is seen to be a quadratic form in the coefficients of the cubic terms described by Eqs. (223)-(225). All these quantities are stable in the sense that they do not differ very much from their values for hypothetical exact observations. Typically, r/4! is relatively insensitive to errors in the sense that its standard deviation is much smaller than the absolute value of its expectation and, therefore, it will be supposed to always have the same sign. In this respect, the coefficient r/4! of b~4 = (bl - b 2 ) 4 is essentially different from the coefficients p/2! and a/3! which, being linear combinations of the derivatives Oq/0fn, n = 1. . . . . N, easily change sign if statistical fluctuations in the observations or systematic errors are present. Example VI.11 will present a convincing illustration of this behavior. Since the coefficient r/4! is the first coefficient of the Taylor expansion of the degenerate part of the representation of the log-likelihood function about the one-component stationary point that is not supposed to vanish under the influence of errors, the degenerate part is supposed to be fully represented by 1 m 2 1 m3 1 m4 T.,P + T.,a + ~.,r
(236)
where, for simplicity, A has been substituted for b~ = bl - b 2 . If, in this expression, the variable A is transformed by the Tschirnhaus transformation (Poston and Stewart, 1978), A ' - - A + ~ (~.lo. )
(237)
the resulting polynomial in A' will be different from Eq. (236) in two respects: the cubic term will vanish and a linear one will appear. This new polynomial is easily recognized as the dual canonical cusp catastrophe discussed earlier in Example V.2 in Section V.C.2. In this article, however, Expr. (236) is preferred to the canonical cusp since in the former representation the origin is the onecomponent stationary point. This makes the analysis simpler.
Example VLI1 (Statistics of Quadratic, Cubic, and Quartic Terms) The purpose of this example is to illustrate the difference in statistical behavior of the coefficients p/2! and a/3!, on the one hand, and the coefficient r/4!, on the other. Suppose that the observations available are the Poisson distributed observations with bi-Gaussian expectations described in Examples VI.2 and VI.6. Furthermore, suppose that the coefficient c~ is equal to 2500. This means that the peak value of the observed bi-Gaussian corresponds on the average to approximately 2500 counts with a standard deviation of 50 counts. For this value of or, the observations are generated 100 times and each time the maximum
RESOLUTION RECONSIDERED
323
likelihood estimates ~ and b of the parameters ot and/~ of the one-component model ot exp {- ~1 (x, -/3)2 }
(238)
are computed. From these values of ~ and b the coefficients p/2!, or/3!, and r/4! are computed by use of Eqs. (184), (218), and (235), respectively. The histograms of the results for each of these coefficients are shown in Figure 11. They show that the coefficients p/2! and or/3! are approximately distributed about zero with a standard deviation of about 10 and 3, respectively. The coefficient r/4!, on the other hand, is distributed around - 145 with a standard deviation of approximately 3.
G. Summary and Conclusions In this section, the generalized two-component model has been introduced. It is described by
f (x; O) - g(x; y) + ct[~.lh(X; ~1)
(239)
~'- ~,2h(x; ~2)]
(a)
(b) 20
20
o>,
t-
g r
glO
II
0
0
i1 7.5
~/3!
p/2!
(c) 20
o~ ell)
|
tT"
-~)52
-145
-138
"d4! FIGURE 11. Histograms of the coefficients p/2! (a), cr/3! (b), and r/4! (c) in Example VI. 11. The quantities p and cr are distributed about zero but r is always negative.
324
A. VAN DEN BOS AND A. J. DEN DEKKER
In this expression, the component function h(x; ~k) is a function of x and is nonlinearly parametric in/3k. For example, h(x; ~k) may be the model of a spectral peak as a function of frequency or a point spread function as a function of a spatial coordinate located at/~k or it may be exponential with decay constant/Sk. The amplitude of the component h(x; ~k) is the fraction )~k of the parameter c~. The background function g(x; y) is parametric in the elements of the vector ?, = (Yl... yM)r. Its purpose is to model contributions often occurring in practice such as trends or further components h(x;/5i), k = 3 . . . . . The elements of the (M + 3) x 1 parameter vector 0 are or, 151,152, and the elements of y: O"-(Y T
~
~1
]~2) T
(240)
Throughout, it has been assumed that observations W l, . . . , W N are made at the exactly known measurement points xl . . . . . xN, respectively. It has also been assumed that the observations w, fluctuate about the corresponding model values f (Xn;0) if the experiment is repeated, that is,
E[wn] = f (x,,; O)
(241)
and that the parameters 0 enter the probability density function of the observations via these expectations. The log-likelihood function q(t) corresponding to this probability density function has also been introduced, where the elements of the (M + 3) x 1 parameter vector t correspond to those of the vector of hypothetical true parameters 0. In addition, the structure, that is, the pattern of stationary points, of the log-likelihood function has been analyzed. This analysis has shown that in addition to the absolute maximum representing the maximum likelihood solution, the log-likelihood function has a further, usually relative, maximum. The analysis has also revealed the presence of a stationary point representing the solution for the parameters of a one-component model of the same nonlinearly parametric family from the same observations. This stationary point is called the one-component stationary point. It has also been shown that under the influence of systematic errors or of statistical errors in the observations the likelihood function may become degenerate in or in the neighborhood of the one-component stationary point. The variable in which the log-likelihood function becomes degenerate is called the essential variable. Since degeneracy may change the structure of the loglikelihood function, it may change the solution. To investigate this, we have computed the degenerate part of the likelihood function in the form of its Taylor expansion in the essential variable up to and including the first term that cannot vanish under the influence of the observations. This degenerate part
RESOLUTION RECONSIDERED
325
will be employed in Section VII so that we can investigate structural changes of the likelihood function and their consequences for the resolution of the components.
VII. SINGULARITY AND RESOLUTION
A. Introduction In this section, in Section VII.B.1, the polynomial representation of the loglikelihood function derived in Section VI is analyzed. The most important conclusion from this analysis is that there exist two types of sets of observations: sets of observations producing distinct solutions for the location parameters and sets of observations producing coinciding solutions. From the latter observations, the parameters, of course, cannot be resolved. A criterion is derived showing to which type a given set of observations belongs. Then, in Section VII.B.2, the numerical computation of the criterion is explained and, in Section VII.B.3, the conditions for validity of the criterion are discussed. It is concluded that the criterion is remarkably general and simple. In a numerical example in Section VII.B.4, the criterion is applied to Poisson distributed observations made on a pair of overlapping sinc-square functions, which is the model originally used by Rayleigh to define his resolution limit. The results of Section VII.B.4 demonstrate that with statistical observations, resolving occurs with a certain probability only: the probability of resolution. Various aspects of this concept are investigated. In particular, in Sections VII.B.6 and VII.B. 7, it is emphasized that the probability of resolution may also be seriously influenced by systematic errors. The purpose of Section VII.C is to show that the description of resolution presented in Section VII.B may easily be extended to components in two dimensions. In a numerical example, this is demonstrated by applying these results to the most important representative of the twodimensional component functions in optics: the Airy function. Section VII.D is devoted to a discussion of the resolution criterion in the presence of coherence. Related literature is briefly surveyed in Section VII.E. Conclusions are drawn in Section VII.E
B. Coinciding Scalar Locations 1. Conditions for Resolution In Section VI.E 1, it has been shown that the nondegenerate part of the loglikelihood function in the neighborhood of the one-component stationary point
326
A. VAN DEN BOS AND A. J. DEN DEKKER
is described by
1 ~28mt~ 2!
(242)
m--1
The coefficients 3m in this expression are the eigenvalues of the Hessian matrix at the maximum of the log-likelihood function of the parameters of the onecomponent model. Since this maximum is supposed to be nonsingular, these coefficients are supposed to be strictly negative. The degenerate part of the likelihood function around the same point has been derived in Sections VI.E 1 through VI.E3 and is described by Expr. (236): ~ p A2 -k- ~o "A3 + ~1 r
A 4
(243)
where A -- bl - b2 with bl and b2 the locations of the components. The quantities p, or, and r in Eq. (243) are defined in Section VI by Eqs. (184), (218), and (235), respectively. In the same section, convincing reasons have been given to assume that as opposed to the sign of r, the sign of p and cr depends on the particular set of observations. Together, Eqs. (242) and (243) describe the possible structures of the log-likelihood function in the neighborhood of the one-component stationary point. In any case, these structures must include the two-maxima, one saddle point structure. A simple analysis then shows that r must be negative. The stationary points of the degenerate part Expr. (243) satisfy 1 A2 + g 1r p A + ~tr
A3
= 0
(244)
The second-order derivative of Expr. (243) at the stationary point A = 0 is equal to p. Therefore, if p is positive, this point is a minimum, if it is negative it is a maximum, and the origin is a degenerate stationary point if p is equal to zero. The remaining stationary points are the roots of 1 A + g 1r A p + ~tr
2
(245)
Therefore, they are located at 3tr' 4- 3 (tr,2
8
(246)
with p' = - p / r and or' = - t r / r where the minus signs ensure that p' and tr' have the same sign as p and tr, respectively. Since the roots must be real, there are three stationary points if the discriminant satisfies O"t2 -~- ~8 p t > 0
Otherwise, the origin is the only stationary point. Because p' =
(247)
-O/r, tr' =
RESOLUTION RECONSIDERED
327
- t r / r , and r is negative,
L
p > 0 [
(248)
is an accurate approximation to the condition Ineq. (247) if Irl ~ Icrl which, in practice, is typically true. Equation (218) shows that tr and, therefore, or' is equal to zero if s is equal to 0.5. Then Ineqs. (247) and (248) are identical. The simple condition Ineq. (248) is the key result of this article. If it is true, the origin is a minimum of the degenerate part and there are two additional stationary points. Since r is negative, these must be maxima. Since the (~min the nondegenerate part of the log-likelihood function, described by Expr. (242), are all strictly negative, the origin is a saddle point of the log-likelihood function and both other stationary points are maxima with A # 0. Of these, the absolute maximum represents the maximum likelihood solution. If Ineq. (248) is not true, the origin is a maximum and, since it is the only stationary point, this maximum is absolute and represents the maximum likelihood solution. In the special case that p is equal to zero, this maximum is degenerate. These considerations imply that the log-likelihood function has a maximum with A = bl - b2 ~ 0 if Ineq. (248) is true. Then the solutions for/~1 and/~2 are clearly distinct. On the other hand, if Ineq. (248) is not true, the origin, which is the one-component stationary point, is the absolute maximum. Then the solution is located at the origin, that is, A = bl - b2 - 0 and, therefore, the solutions for/~1 and/52 coincide exactly.
Conclusion VII.1 If p > 0 the parameters t51 and/~2 can be resolved from the available observations. Otherwise, they cannot. Thus the sign of p is a criterion for the resolvability of the parameters.
2. Numerical Computation of the Resolution Criterion To see how the criterion p is computed numerically for a given set of observations, we first repeat its definition by Eq. (184): p - he(1 - s ~
)n~Oq n h(Z)(b
(249)
In this expression, q is the log-likelihood function, fn = fn(t) is the model fitted as a function of its parameters t - (Cl ... CM a bl b2) T, and hn (b) is the component function. Expression (249) is evaluated at the one-component stationary point ? - ( c 1 . . . CM a b b) T where (C1 . . . CM a b) is the maximum likelihood estimate of the parameters of the one-component model
gn(C) + ahn(b)
(250)
328
A. VAN DEN BOS AND A. J. DEN DEKKER
from the same observations. These considerations show that for a given set of observations, p may be computed as follows. First, the maximum likelihood solution (~1... ~M ~ b) r for the parameters (9/1... YM ~ /3)r of the onecomponent model is computed. Then this solution, substituted in the expression Expr. (249), produces a numerical value for p. 3. Conditions f o r the Use o f the Criterion
In this subsection, the conditions for the use of p as the criterion for resolvability of the parameters 131 and 132 are investigated. First, consider the approximation of Ineq. (247) by Ineq. (248). It is seen that both inequalities are identical if or' = 0. Equation (218) reveals that this is so if s - 0.5. This means that the heights or amplitudes of the components are equal. Therefore, in this important case, the sign of p decides exactly if the parameters of the two-component model may be resolved from the available observations. Furthermore, Eqs. (184) and (218) show that both p and cr are linear combinations of the quantities 8 q / 8 f n , n = 1, . . . , N , which are stochastic variables with an expectation close to zero if/~ 1and/~2 get close. This has been explained in Sections VI.E 1 and VI.E2. They are, therefore, typically absolutely small as compared with r, which, as Eqs. (234) and (235) and Eqs. (223)-(225) show, is dominated by definite quadratic forms in the derivatives of the component function in the point b. Finally, the dependence of p, or, and r on the value of s is addressed. The pertinent expressions reveal that these coefficients are equal to zero if s is equal to zero or, equivalently, is equal to one. This justifies an analysis of these coefficients for s approaching zero. Simple considerations then show that if s ~ 0, p and tr become proportional to s The expressions Eqs. (232), (234), and (223)-(225) show that under the same condition the coefficient r becomes approximately 0q h(4~(b) 4!
(251)
n
since the other terms contributing to r are, under the conditions concerned, proportional to s It is observed that Expr. (251) is a linear combination of the derivatives aq/0fn, n = 1 . . . . . N . Therefore, it could, like p and tr, change sign and become positive. This would violate the assumption that r is strictly negative. Note, however, that this can occur only if s << g, that is, if s << 1. This is illustrated by the following example. (Validity of the Coefficient p as Discriminant under Extreme Conditions) In Example VI.11, histograms of p / 2 ! , or/3!, and r/4! are
E x a m p l e VII.1
RESOLUTION RECONSIDERED
329
computed for a bi-Gaussian two-component model with Poisson distributed observations. The value of X is 0.2 and s is chosen accordingly. In this example, the simulations concerned are repeated but now s = 0.02 instead of 0.2. Under the experimental conditions chosen, this implies that in all measurement points the weakest component is equal to the standard deviation of the fluctuations in the observations or is smaller. The results of this simulation are as follows. In all of the 100 experiments, the value of r was negative. Moreover, the sign of the discriminant Eq. (247) and that of p were the same in 99 experiments. In one experiment, producing comparatively small values for p and for the discriminant, the signs were different. Another 100 experiments produced similar results.
Conclusion VIL2 The preceding considerations justify the use of the simple discriminant
V-- VnOq h~)(b)
(252)
n
to decide on the resolvability of the components from the observations. Under exceptional conditions, such as those in Example VII.I, the validity of D as discriminant can always be checked by computing the coefficients a and r in addition to the coefficient p and subsequently comparing the result of Expr. (247) with that of Expr. (248) and checking the sign of r. The resolution discriminant D has the following remarkably general and simple properties" 9 It is valid for all component functions h(x; ilK) and all background functions g(x; y). Therefore, it is applicable not only to point spread functions and spectroscopic peaks but to exponential functions with unknown decay and sinusoids of unknown frequency as well. 9 It is valid for all log-likelihood functions q. This means that the loglikelihood function used need not correspond to the probability density function of the observations. 9 For a particular component function and background function, it depends only on the observations because it is computed from the observations and the estimates ~ and b of the parameters of the one-component model and the estimates ~ of the parameters of the background function. However, these estimates have been computed from the same observations. 9 It does not depend on the parameter s
330
A. VAN DEN BOS AND A. J. DEN DEKKER
4. Application to Rayleigh's sinc-Square Model In this subsection, the proposed resolution discriminant is applied to observations with an expectation described by a pair of overlapping sinc-square functions. This function is important for the purposes of this article since it is the function used by Rayleigh to define optical two-point resolution (Rayleigh, 1879).
Example 1111.2 (Probability of Resolving the sinc-Square Components of the Rayleigh Model) In this example, the probability of resolving the components of the Rayleigh sinc-square model is studied as a function of the standard deviation of the observations and the distance of the components. It is assumed that the expectations of the observations Wl . . . . , wN are described by ct[~. sinc2(2zr(x, - 31)] + (1 - L) sinc2(2zr(Xn - 32)]]
(253)
where sinc (2zrx) = sin(27rx)/(2rrx). The parameter ~. is chosen equal to 0.5. Therefore, the heights of the peaks are equal. Three different choices of/~1 and/32 are considered: 31 = -/32 = 0, 31 = - 3 2 = 0.0125, and 31 = - 3 2 = 0.025. Then, the distances A 3 of the peaks are 0, 0.025, and 0.05, respectively. The measurement points have been taken as xn = - 1 . 4 5 , - 1 . 2 5 . . . . . 1.55. Figure 12 shows the function and the location of the measurement points for /x3 = 0.05. The observations w l, . . . , w N are assumed to be independent and Poisson distributed. For each of the A3 and a number of different values of ct, 300 sets of observations are generated thus distributed. For each of these sets, the values ~ and b of the parameters a and b of the one-component model a sinc2[2zr(xn - b)]
(254)
are computed that maximize the Poisson likelihood function concerned. Subsequently, these values are substituted in the expression Eq. (249) for p and the number of negative results is counted. This is the number of times the components cannot be resolved from the available observations since they coincide. These numbers are shown as frequencies in Figure 13, giving an impression of the probability of unresolved peaks. This is done as a function of the parameter ct which controls the expectation of the number of Poisson counts in any point. The values of c~ chosen were 225,450, 900, 1800, and 3600. Since the standard deviation of a Poisson stochastic variable with expectation ct is equal to ct 1/2, the relative standard deviations corresponding to these values are equal to 6.6, 4.7, 3.3, 2.4, and 1.6%, respectively. Hence, since the expectations described by Eq. (253) are all smaller than or equal to c~, these relative standard deviations are a lower bound on the relative standard deviations in any point.
RESOLUTION RECONSIDERED
331
2OO L. CT
(
6
t-"
,m oO
tO cO 0,.
E ?
100
t
0
0
0 -2
, jo--...aq~
i
-1
o
1
0 X
FIGURE 12. Measurement points and expectations of the observations in Example VII.2. The expectations are the sum of two overlapping sinc-square functions.
[]
50
t/) E) 0 t.-
0 ,"m 0 e-0
o 25
~, AI3=O.O 0 AI3=0.05 [] AI3=0.025
O 02
. . . . . . . . . . 1 0 .3 . . .
O
. . . . . .
1(
number of counts
FIGURE 13. Percentage of coinciding solutions for the locations of the sinc-square functions of Example VII.2 as a function of the expected number of counts for three different distances of the sinc-square functions.
332
A. VAN DEN BOS AND A. J. DEN DEKKER
These considerations show that the results of Figure 13 correspond to observations that are, from left to fight, increasingly precise. From the results shown in Figure 13, the following conclusions may be drawn: 9 First, for A/~ = 0, the model defined by Eq. (253) is a one-component model. Therefore, the stars in Figure 13 represent the frequency of occurrence of a one-component solution for one-component observations. Figure 13 shows that this frequency is approximately 50% for all values of c~. This also implies that the probability that a two-component model will be obtained from the one-component observations is about 50%. This phenomenon will be addressed later in Example VII.8. 9 Furthermore, for A/~ = 0.025, the frequency of nonresolution decreases from about 50% to 25% for an increasing number of counts. That is, for increasingly precise observations, the probability of resolution increases. 9 Finally, for A/3 ---0.05, that is, for components more widely separated than before, the frequency of unresolved components decreases from about 30% to 0% as the number of counts increases. Comparison to the results for A/~ = 0.025 shows that, from equally precise observations, the components are more likely to be resolved if they are more widely apart in the model of the expectations. The frequency of 0% reflects the fact that the components are nearly always resolved if the observations are sufficiently precise and the components of the expectations are sufficiently widely apart. Example VII.2 clearly shows the fundamental differences between the classical Rayleigh resolution criterion and the resolution criterion proposed in this article. From the point of view presented here, the Rayleigh resolution criterion is based on expectations of observations, not on observations themselves. It is a distance measured in terms of widths of the peaks. In the criterion presented here, the distance of the peaks in the expectation model only influences the probability of resolution. As the components of the model of the expectations get closer, a high probability of resolution requires increasingly precise observations. With hypothetical ideal observations, being equal to their expectations as in Rayleigh's considerations, the criterion presented here would, correctly, always classify the peaks as resolvable no matter how small their distance. 5. Resolution as a Property of Observations
The simple resolution discriminant D described by Eq. (252) is purely operational: from the observations the parameters ~ and b of the one-component model are estimated, the resulting estimates are substituted in the expression for D, and from the sign of D it is decided whether the parameters of the
RESOLUTION RECONSIDERED
333
two-component model can be resolved. Thus resolvability is connected to the sign of the scalar D. In this subsection, however, the resolvability is directly connected to the observations. This will be done by use of bifurcation sets, which were discussed earlier in Section V.C.3. In singularity theory, bifurcation sets are the subsets of the Euclidean space of the function parameters where structural changes of the function occur. In the measurement problems discussed in this article, the observations play the role that the parameters play in singularity theory. Therefore, here the bifurcation sets are the subsets of the Euclidean space of the observations where the changes of structure of the likelihood function occur. The purpose of this subsection is to show how these bifurcation sets can be found and how they relatively simply separate observations from which the parameters can be resolved from those from which they cannot. To belong to the bifurcation set, a set of observations w = (Wl... Wu) ~ must satisfy the equation D = 0, or Oq h~2)(b) _ 0
(255)
n'
Furthermore, since this equation has to be evaluated at the one-component stationary point ~ = (0 r a b b) T, Oq Ofn = 0 ~na f ~ at
(256)
or
Oq Ogn = 0 n
m = 1,..., M
(257)
Ofn OCm
~~f
hn(b) = 0
(258)
Oq h(~)(h ~ _ 0 af~ "
(259)
and
It is observed that Eqs. (255) and (257)-(259) constitute M + 3 equations in the N + M + 2 variables w~ . . . . , WN, ~1 . . . . . ~M, gt, b. This means that one equation in the observations w~, . . . . WANresults after hypothetical elimination of the M + 2 variables ~ . . . . , ~M, ~, b. This single equation represents the bifurcation set in the Euclidean N-space of the observations. Generally, the number of equations specifying an object is called codimension (Poston and Stewart, 1978; Saunders, 1980). The dimension of the object is the difference
334
A. VAN DEN BOS AND A. J. DEN DEKKER
of the dimension of the space in which it is located and the codimension. Therefore, the bifurcation set has codimension 1 and dimension N - 1. Objects of codimension 1 have the interesting property that they divide the space in which they are located into two parts. For example, a point divides a line into two parts and so does a plane to the Euclidean 3-space.
Conclusion VII.3 The bifurcation set divides the Euclidean space of the observations into two parts. The points w = (Wl ... wN) r in the one part represent sets of observations from which the parameters can be resolved. The points of the other part represent sets of observations from which the parameters cannot be resolved because the corresponding solutions exactly coincide. The bifurcation set is the border of both types of observations. The bifurcation set therefore represents the limit to resolution in terms of the observations.
Example VII.3 (Visualizing the Bifurcation Set) If there are N observations, the dimension of the bifurcation set is N - 1 and, therefore, it cannot usually be visualized. Instead, in this example, the intersections of the bifurcation set with a number of the two-dimensional coordinate planes in the space of the observations are computed. The origin is taken as the expected values of the observations. This means that in the computation of the intersections, N - 2 of the N observations are chosen equal to their expectations. This constitutes N - 2 equations. The bifurcation set itself is defined by one equation. Therefore, the intersection has codimension N - 1 since it is defined by N - 1 equations. Then, in the N-space of the observations the intersection with the two-dimensional coordinate plane has a dimension 1" it is a curve in the plane. In this example, these curves are computed numerically for the Poisson likelihood function and observations with overlapping bi-Gaussian expectations without the background function. This is done as follows. Suppose that the intersection with the (wn I , W,e)-Plane has to be computed. Then the points of the intersection have to satisfy the equations f,(?)
1 h~(b) = 0
(260)
f~(~)
1 h~l~(~,)= 0
(261)
f,(?)wn
1) h~2)(b) = 0
(262)
and ~(
The first two equations correspond to Eqs. (258) and (259), respectively, and are easily recognized as the normal equations of the likelihood function with
RESOLUTION RECONSIDERED
335
respect to the parameters a and b of the one-component model. The last equation puts the discriminant equal to zero. It is observed that the three equations Eqs. (260)-(262) are linear in the observations and, therefore, in//)nl and//-)n2. This will be used in the computation of the intersection of the bifurcation set with the (w,,, w,:)-plane as follows. The functionfn (t) is equal to ahn (b) and, therefore, fn(t) is equal to cthn(1)). Substituting this in Eq. (260) yields a =
(263)
E n Wn ~_,n h n ( l9)
Substituting this result in Eqs. (261) and (262) and substituting f,(O) for all f,(t) with n different from nl and n2 then yields two equations linear in wnl and Wn~ and nonlinear in b. Solving these ecluations in closed form for w,, and w, 2 for a number of selected values of b then yields the corresponding points of the intersection of the bifurcation set with the (Wnl, Wn2)-plane. This procedure has been carded out for observations with expectations fn(O)
or[0.7 exp{ - gl (x n __ /~1)2 } + 0.3 exp{
=
-
1 n -~(x
/~2) 2 }]
(264)
with x, = - 2 . 5 + 0 . 5 ( n - 1 ) + 0.025, with n = 1 . . . . ,11, and/51 = 0 and /52 = 0.15. Figure 14 shows the function and the location of the measurement
I -
5
7
z < co co <
-~o.5
1
0 -3
11
3
FIGURE 14. Measurement points and expectations of the observations in Example VII.3. The expectations are the sum of two overlapping Gaussian functions.
336
A. VAN D E N BOS A N D A. J. D E N D E K K E R
0.03 0.02
$
(b)
r
0.01
0.01
-0.01 -0.01
0.01
0.03
-0.01 -0.05
o
(0 6 5
~
x 10"a
0.01
(c)
,r
0.05
(d)
O3
$ 0
-0.01
-5 -0.02
0
-0.02 -'5
0) 9
-5
0) 3
5
xlO 3
FIGURE 15. Intersections of the bifurcation set with the coordinate planes in Example VII.3.
points. For simplicity, in this example the parameter ct has been taken equal to one. The equations defining the bifurcation set then show that results for other ot are proportional. Figure 15 shows four different intersections of the bifurcation set with coordinate planes. The quantifies o9, in this figure are defined as ~o, = to, - f~ (0)
(265)
which is the difference of the observation Wn and its expectation. In this sense, o9, represents the error in the observation. As a result, sets of observations on the same side of the bifurcation set as the origin correspond to the two-maxima, one saddle structure of the likelihood function and, hence, to distinct solutions for the parameters. Points on the other side of the bifurcation set represent sets of observations for which the solutions for the parameters coincide. Figures 15a and 15b show the difference in sensitivity of the resolvability of the parameters to different to,. In particular, the error O)4 has strikingly little influence on the resolvability. This may be explained b y t h e fact that at x4 the second-order
RESOLUTION RECONSIDERED
337
derivative h~n2)(b) appearing in Eq. (262) is very small. On the other hand, it is observed in Figure 15c that an error 0)11 equal to -0.004 suffices to make resolution impossible. Sufficiently negative values for w3 and 0)9 have the same effect, as shown in Figure 15d. The intersections shown in Figure 15 are strikingly linear. This is characteristic of bifurcation sets for one-dimensional two-component models in general. Illustrative examples dealing with the normal likelihood function and exponential models are presented in van den Bos (1981). This linearity simplifies the computation of the probability of resolution for a particular distribution of the observations substantially. This is the probability that a point w = (Wl ... WN)T is on the side of the bifurcation set corresponding to distinct solutions for the nonlinear parameters. The linearity also simplifies the computation of the critical errors. The vector of critical errors is defined as the vector from the origin to the nearest point of the bifurcation set. Probability of resolution and critical errors are addressed in Section VII.B.7. First, in the next subsection, the behavior of the solutions for the parameters as a function of the observations is described.
6. Resolution from Error Disturbed Observations This subsection addresses the solutions for the parameters as a function of the observations. In particular, the solutions will be studied if in the Euclidean space of the observations the point representing the set of observations approaches the bifurcation set and eventually crosses it. As before, this is illustrated in a numerical example.
Example Vll.4 (Poisson Maximum Likelihood Solutions for the Location of Component Functions of a Bi-Gaussian) In this example, the component functions, the measurement points, and the likelihood function are those of Example VII.3, but, different from Example VII.3, there is a background present that is equal to zero in all points except in one point where in a number of simulation experiments it is subsequently given a number of positive and negative values. In this example, this will be the sixth point where the observation will be equal to f6(0) + 0)6 where f6(0) is the exact value of the two-component function with 0 = (/~1 /~2) T and 0)6 is the background contribution. However, in the model fitted, the background function is absent. Figure 16 shows the results. First let 0)6 be equal to zero. Then the solutions b l and b2 for the locations are the exact values fll and f12, respectively. Next 0)6 is decreased. Then, as Figure 16a shows, the distance of bl and b2 increases. If, on the other hand, 0)6 is increased, bl and b2 get closer and closer until, for o96 approximately equal to 0.020, they coincide. Figure 16a also shows the solution b for the location of the one-component model. Although this solution changes relatively little,
338
A. VAN DEN BOS AND A. J. DEN D E K K E R i
i
i
(a)
0.15
A
0.05
L-
b
A
A
bl=b2=b Y
b1
"0-~5()05
o. os
o.ols
o.o2s
0.035
0)6
(b)
0.02
~
o. os
I
o.ols
I
o.o2s
o.o3s
0~6 FIrtrRE 16. (a) The solutions for the locations of the Gaussian components in Example VII.4 as a function of the error in the sixth measurement point. (b) The corresponding values of the resolution discriminant.
the value of the discriminant D computed from it is seen, in Figure 16b, to change sign for about the same value of 0 ) 6 for which bl and b2 coincide, as could be expected. Figure 16 shows that for 0)6 - - 0 . 0 0 5 , the log-likelihood function must have the two-maxima, one saddle point structure. On the other hand, if 0)6 - - 0.025 the structure must be a one-maximum structure. Figure 17 shows the corresponding contours of the log-likelihood function. In Figure 17a, there are two maxima and a saddle point between. The saddle point is the point of intersection on the middle contour. It is the one-component stationary point and is, therefore, located on the (b~ - b2)-axis. In Figure 17b, the one-component stationary point is a maximum and is the only stationary point. Example VII.4 is somewhat artificial: all observations are assumed to be exactly equal to the bi-Gaussian function values with the exception of one observation which contains a background contribution. However, it is illustrative in three different respects. First, the example shows that disregarding a seemingly insignificant background contribution of approximately 2% in only one of eleven observations may obstruct the resolving of two peaks at a distance of 0.15 half-width. Second, the example shows the importance of the sign of
RESOLUTION RECONSIDERED
0.05
(a)
339
(b)
0"05I
.o~ -I-
.o~
0.04
-0.2
iioo,' 0
0.2
-0.2
bl-b 2
0
0.2
bl-b 2
FIGURE 17. Contours of the log-likelihood function in Example VII.4 for the error in the sixth measurement point equal to (a) 0.005 and (b) 0.025, respectively.
the background contribution. A background contribution 0 ) 6 = 0.02 impedes resolution while 0)6 --" - - 0 . 0 2 does not. Third and most important, the example shows how detrimental systematic errors may be to resolvability. By definition, a systematic error is a modeling error. This means that the model fitted and used for the computation of the bifurcation set on the one hand and the model of the expectations of the observations on the other are members of different nonlinearly parametric families of functions. Then the origin, which represents the expectations of the observations, can no longer represent exact observations of the parametric family of the model fitted. Therefore, it necessarily represents error corrupted observations made on such a model and may as such, in the space of observations, be located anywhere relative to the bifurcation set. This means that the origin may even be located in the region beyond the bifurcation set where the solutions for the location parameters coincide. Since, by definition, the observations are distributed around the origin, this may drastically change the fraction of them corresponding to distinct solutions and, therefore, the probability of resolution. This is one of the topics addressed in the next subsection.
7. Probability of Resolution and Critical Errors The subjects of this subsection are probability of resolution and critical errors. As mentioned previously, the vector of critical errors is defined as the vector from the origin of the space of the observations to the nearest point of the bifurcation set. In this definition, the bifurcation set is the one corresponding to the chosen likelihood function and model fitted. In general, the critical errors are the smallest change of the exact observations causing structural change of the
340
A. VAN DEN BOS AND A. J. DEN DEKKER
likelihood function employed. This definition may be shown to imply that their computation is a nonlinear minimization problem under equality constraints. This problem has been solved in van den Bos and Swarte (1993) for the ordinary least squares criterion, which, as has been explained earlier in this article, is equivalent to the log-likelihood function for independent and identically normally distributed errors. The result is remarkably simple: in this special case and in the absence of systematic errors, a very accurate approximation of the critical errors is the additive inverse of the difference of the exact observations and the one-component model best fitting to them. Unfortunately, such a simple solution has not been found for other distributions of the observations. However, bifurcation sets as functions of the observations are strikingly linear if the component function is one-dimensional, in the sense of a function of one variable only, as were all functions studied up to now. Linear functions of codimension 1 are called hyperplanes. For bifurcation hyperplanes, the critical errors are relatively easily computed from the intersection points with the coordinate axes, as will be demonstrated now. Let 6~1. . . . . &N be the points of intersection of the bifurcation hyperplane with the N coordinate axes of the space of observations, with the expectations of the observations as origin. Then, it is not difficult to show that the bifurcation hyperplane is described by v~og. = 1
(266)
n
where Vn - 1/(On. Simple geometrical arguments show that the coordinates o9'. of the point of this bifurcation set nearest the origin are described by t ~ I)n O)n [ivl12
(267)
where Ilvll is the Euclidean norm of the vector v = (Vl... vN) r defined as (
Ilvll =
2) 1/2 ~ Vn n
(268)
The distance of the point o9' from the origin is 1 Ilvll
(269)
The w', are the critical errors since of all increments causing structural change, they have the smallest Euclidean norm. This norm is given by Eq. (268). Errors smaller in this sense cannot cause structural change. (Critical Errors for Resolving the Components of a BiGaussian from Poisson Distributed Observations) Let the expectations of the Example VII.5
RESOLUTION RECONSIDERED
341
observations, the measurement points, and the likelihood function be those of Example VII.3. Then simple computations such as those of Example VII.3 show that the coordinates of the points of intersection of the bifurcation set with the coordinate axes are given by d~ - (-0.0036
-0.0060
0.0260
-0.0140
0.4000
-0.1680
-0.0148
0.0272
-0.0064
0.0192
-0.0036) T
(270)
if the bi-Gaussian expectations of the observations are chosen as the origin. Then it follows from Eq. (267) that the critical errors are equal to c0'= (-0.0013
-0.0008 0.0002
-0.0003
0.000
-0.000
-0.0003
0.0002
-0.0007
0.0002
-0.0013) T
(271)
The norm of this vector, that is, the distance from the origin to the bifurcation set, is equal to 0.0021. Notice that this is smaller, and in most cases much smaller, than any of the intercepts described by Eq. (270). Next suppose that there is a background present in the observations in the form of the linear trend to(0.0
0.0005... 0.005)
(272)
where tc is a constant and that this contribution is not included in the model fitted and used for the computation of the bifurcation set. First, let tc be equal to -0.25. Then it is not difficult to show that the distance from the origin, which is now the point representing the expectations of the observations including the trend, to the bifurcation set is equal to 0.0010. The conclusion is that the origin has moved toward the bifurcation set. On the other hand, had tc been equal to 0.25 instead o f - 0 . 2 5 , the distance would have been 0.0033. This distance exceeds the distance in the absence of the trend. Next consider the case that x = - 1 . Then the distance from the origin to the bifurcation set is 0.0025. This is farther from the bifurcation set than for i( = -0.25. However, substitution of the trend for this tc in Eq. (266) defining the bifurcation set shows that the origin is now located on the side of the bifurcation set where the solutions coincide. Example VII.5 demonstrates that systematic errors in the observations may influence the resolvability of components in two respects. First, systematic errors influence the distance from the origin to the bifurcation set. Second, the systematic errors determine on which side of the bifurcation set the origin is located. This would also have been true if the bifurcation set had been nonlinear. Since, by definition, the observations made are distributed about the point representing the expectations, the following conclusion may be drawn.
342
A. VAN DEN BOS AND A. J. DEN DEKKER
Conclusion VIL4 The probability of resolution of the components, that is, the probability that the point representing the set of observations is on the side of the bifurcation set corresponding to distinct solutions for the parameters, is determined by both the systematic errors and the statistical errors in the observations. The probability of resolution is simply the probability that a set of observations is on the side of the bifurcation set corresponding to distinct solutions for the parameters. Using this definition makes computing the probability of resolution relatively easy if the bifurcation set is linear in the observations. Let v be the vector of critical errors. Then a set of observations o) is on the same side of the linear bifurcation set as the origin if its projection on the critical error vector v is, that is, if o)T v < v T v
(273)
Notice that in Ineq. (273) the elements of v are constants while those of o) are stochastic variables representing the fluctuations in the observations. Therefore, the left-hand member of Ineq. (273) is a linear combination of stochastic variables having an expectation equal to zero. For simplicity, assume that the fluctuations o~ are independent with a standard deviation rr ~and that the central limit theorem applies to the linear combination. Then the left-hand member of Eq. (273) is a normally distributed variable with an expectation equal to zero and a variance 2 2 v~G
(274)
n
It follows that the probability of resolution is equal to the probability that a standard normal variable is smaller than 2 Un gt
/(E
VnO" n
(275)
Pl
Therefore, the procedure for computing the probability of resolution is as follows. First, compute the critical errors. Then compute the quantity described by Expr. (275). The probability of resolution is the value of the tabulated standard normal cumulative distribution function for this quantity. This procedure is illustrated in the following example, where both a trend and statistical fluctuations are present.
Example Vll.6
(Probability of Resolution of the Components of a BiGaussian) The purpose of this example is to illustrate the proposed procedure for computing the probability of resolution. The model of the expectations is
RESOLUTION RECONSIDERED
343
TABLE 2 PROBABILITY OF RESOLUTION (%) c~
x = 0.25
tc = 0
tc = - 0 . 2 5
tc = - 1
625 10,000
60 85
57 75
53 62
42 21
that of Example VII.5: fn(O)
= x(n-
1) x 0.0005 -+- Ot[0.7 e x p { - - ~ (1X n
-- i l l ) 2} + 0 . 3 exp { -- ~1( X n - - fl2)2}]
(276) n = 1. . . . . l l ; x n = - 2 . 5 + 0 . 3 ( n - 1 ) + 0 . 0 2 5 , w i t h n -- 1. . . . . l l ; a n d /~1 - 0 and t32 = 0.15. The model underlying the bifurcation set is fn(t) = a[s exp{--~l ( x
n __ b l ) 2
} q--(1 -- s
1
-- b2 )2 }]
(277)
with s = 0.7. In addition to the systematic errors represented by the trend, there are statistical fluctuations as a result of the Poisson statistics of the observations. In the cases considered, the parameter ot controlling the expected number of counts is equal to 625 and 10,000, respectively. The chosen values of the slope x of the trend are the same as those of Example VII.5. Table 2 shows the results. The first column of Table 2 shows the number of counts or. The other columns show the probability of resolution in percent. From left to fight in the table, the distance of the point representing the expectations of the observations to the bifurcation set decreases until, in the last column, this point has passed the bifurcation set. (See Example VII.5.) This explains why the probability of resolution decreases from left to fight. The standard deviations relative to the expectations for ~ -- 625 are four times larger than those for ot = 10,000. Therefore, in the latter case, the observations are more concentrated around their expectations than in the former. This explains why the probability of resolution for c~ = 625 is smaller than that for ot = 10,000 if the expectation of the observations is on the side of the bifurcation set corresponding to distinct solutions and why the opposite is true if the expectation is on the other side. The last topic addressed in this subsection is the probability of resolution as a function of the distance of the peaks. Again this is addressed by using an example.
Example VII. 7 (Number of Observations Needed for Resolving the Components of a Bi-Gaussian as a Function of the Distance of the Components) Again
344
A. VAN DEN BOS AND A. J. DEN DEKKER 10 7
i
i
10 6
10 s
10 4
103
102 0
o11
A~
o12
o13
FIGURE 18. Expected number of counts in Example VII.7 required for a probability of resolution of 75 %.
let the expectations of the observations, the measurement points, and the likelihood function be those of Example VII.3. Note that this implies that the only errors present are the statistical Poisson fluctuations. Then for various distances A/~ = ~ 2 - r] the intersections with the coordinate axes may be computed such as has been done in Example VII.3. Substituting these intersections in Eq. (267) then yields the critical errors. Finally, Eq. (275) may be used to compute the standard deviations and, hence, the number of observations corresponding to a specified probability of resolution. In this example, for a number of values of A/~ the number of observations is computed that is required for a probability of resolution of 75%. Figure 18 shows the results. The figure shows that in this particular case the required number of observations ct is inversely proportional to approximately the fourth power of the distance of the peaks A/~.
C. Coinciding Two-Dimensional Locations

1. Conditions for Resolution

Unlike the Rayleigh resolution criterion, the criterion of Section VII.B may be extended to vector-valued parameters. For the purposes of this article, the
most important example of a vector-valued parameter is the two-dimensional location parameter (β_x β_y)ᵀ, where β_x and β_y represent the location of the component with respect to the x axis and the y axis, respectively. Then the components are functions of two variables x and y and are said to be resolved if their location vectors differ. In optics, probably the best-known example of a two-dimensional component function is that describing the diffraction pattern of a circular aperture, the Airy diffraction pattern:

   α[2J₁(2πr)/(2πr)]²        (278)

with r² = (x − β_x)² + (y − β_y)², where J₁(·) is the Bessel function of the first kind of order one. The general expression for the generalized two-dimensional two-component model now becomes

   f_n(t) = f(x_n, y_n; t) = g(x_n, y_n; c) + a[ℓ₁h(x_n − b_x1, y_n − b_y1) + ℓ₂h(x_n − b_x2, y_n − b_y2)]        (279)

where (x_n, y_n), n = 1, ..., N, are the known measurement points; a is the amplitude; ℓ₁ = s and ℓ₂ = 1 − s with 0 < s < 1 as before; and (b_x1 b_y1)ᵀ and (b_x2 b_y2)ᵀ are the coordinates of the peaks. The background function g(x, y; c) is now, of course, also a function of the two variables x and y and is parametric in the elements of the M × 1 vector c. The parameter vector t is defined as

   t = (cᵀ  a  b_x1  b_x2  b_y1  b_y2)ᵀ        (280)
Furthermore, the one-component model corresponding to Eq. (279) is described by

   f_n(t) = g(x_n, y_n; c) + a h(x_n − b_x, y_n − b_y)        (281)

where the parameter vector t is now described by (cᵀ a b_x b_y)ᵀ. In the following analysis of the structure of the likelihood function q(t) for the model described by Eq. (279), the steps will be analogous to those for the scalar model used in Sections VI.C and VI.E and defined by Eq. (125). A special case, the analysis of the structure of the least squares criterion in two dimensions, is presented in van den Bos (1992). Suppose that

   (c̃ᵀ  ã  b̃_x  b̃_y)ᵀ        (282)

is the maximum likelihood solution for the parameters of the one-component model described by Eq. (281). Then the two-dimensional equivalent of the
one-dimensional one-component stationary point is

   t̃ = (c̃ᵀ  ã  b̃_x  b̃_x  b̃_y  b̃_y)ᵀ        (283)
The value of the likelihood function at this point is equal to the maximum value of the likelihood function for the one-component model. To establish the nature of the one-component stationary point, we compute the Hessian matrix with respect to the parameters t described by Eq. (280). Next the coordinates t are linearly transformed into a vector t′ by

   t′ = diag(I  L) t        (284)

where I is the identity matrix of order M + 1 and L is defined as

   L = ( ℓ₁   ℓ₂    0     0
          0     0    ℓ₁    ℓ₂
          1    −1     0     0
          0     0     1    −1 )        (285)

Hence

   t′ = (cᵀ  a  ℓ₁b_x1 + ℓ₂b_x2  ℓ₁b_y1 + ℓ₂b_y2  b_x1 − b_x2  b_y1 − b_y2)ᵀ        (286)
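As a quick numerical check of this change of coordinates (our illustration; the values of M, ℓ₁, and the parameter vector are arbitrary), the following Python lines build diag(I L) and verify that it maps t of Eq. (280) to the mean and difference coordinates of Eq. (286):

    import numpy as np

    M = 2                                      # number of background parameters
    l1, l2 = 0.7, 0.3                          # ell_1 and ell_2 = 1 - ell_1
    L = np.array([[l1, l2, 0, 0],
                  [0, 0, l1, l2],
                  [1, -1, 0, 0],
                  [0, 0, 1, -1]])
    T = np.block([[np.eye(M + 1), np.zeros((M + 1, 4))],
                  [np.zeros((4, M + 1)), L]])  # diag(I L), Eq. (284)

    t = np.array([0.1, 0.2, 5.0, 1.0, 2.0, 3.0, 4.0])  # (c1 c2 a bx1 bx2 by1 by2)
    print(T @ t)   # [0.1 0.2 5.  1.3 3.3 -1. -1. ], cf. Eq. (286)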
Finally, transforming the Hessian matrix H with respect to t into the Hessian matrix H′ with respect to t′ yields

   H′ = diag(I  L)⁻ᵀ H diag(I  L)⁻¹        (287)

Simple algebraic operations then show that

   H′ = ( G    O
          Oᵀ   R )        (288)

where G is the (M + 3) × (M + 3) Hessian matrix of the log-likelihood function for the one-component model, evaluated at its maximum defined by Eq. (282); O is the (M + 3) × 2 null matrix; and the 2 × 2 matrix R has as its elements

   r_pq = ã s(1 − s) Σ_n (∂q/∂f_n) ∂²h_n/(∂b_p ∂b_q),    p, q = 1, 2        (289)

where b₁ = b_x and b₂ = b_y and all derivatives are evaluated at t = t̃ = (c̃ᵀ ã b̃_x b̃_x b̃_y b̃_y)ᵀ. Since G is the Hessian matrix of the likelihood function for the one-component model and is evaluated at the maximum of this function, it is negative definite. Then the nature of the one-component stationary point
"t--(C T
a
bx
bx
by
by) T
(290)
is fully determined by the matrix R or, equivalently, by its eigenvalues. If these are both positive, this point is a maximum in M + 3 directions and a minimum in the two remaining ones. Then it is an M + 3 saddle. If the signs of the eigenvalues are different, the one-component stationary point is an M + 4 saddle. If both eigenvalues are strictly negative, the point is a maximum. Then t̃ represents the maximum likelihood estimate and, consequently, the solution vectors for the locations of both components coincide. Therefore, the two-dimensional component functions cannot be resolved from the observations available. The bifurcation set is composed of those sets of observations for which the largest eigenvalue just vanishes. Since the determinant of R is equal to the product of the eigenvalues and its trace to their sum, a set of observations w = (w₁ ... w_N)ᵀ belongs to the bifurcation set if

   det(R) = r₁₁r₂₂ − r₁₂² = 0   and   r₁₁ + r₂₂ < 0        (291)

Note that by Eq. (289), these conditions are independent of the value of s if 0 < s < 1. Then the bifurcation set is defined by the M + 3 normal equations of the log-likelihood function for the one-component model and the equation det(R) = 0. These are M + 4 equations in the N observations and the M + 3 elements of (c̃ᵀ ã b̃_x b̃_y)ᵀ. Hence, if, hypothetically, these elements are eliminated, a single equation in the observations results. The conclusion is that in the Euclidean space of the observations the bifurcation set has codimension 1. It therefore divides the space of the observations into two regions: the one region is the collection of all sets of observations from which the components can be resolved, and the other region is the collection of sets of observations from which this is not possible.
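In code, the test is direct. The following Python sketch (our illustration; the names and the isotropic Gaussian stand-in for the component function h are ours) evaluates the elements of R according to Eq. (289) for the Poisson log-likelihood, for which ∂q/∂f_n = w_n/f_n − 1, and declares the components resolvable when the largest eigenvalue of R is positive. In actual use, the model values f and the location at which the derivatives of h are taken must come from a maximum likelihood fit of the one-component model; the demonstration below substitutes the error-free model for brevity.

    import numpy as np

    def r_matrix(w, f, d2h, a_tilde, s):
        # Elements of R, Eq. (289), for a Poisson log-likelihood:
        # r_pq = a~ s(1 - s) sum_n (w_n / f_n - 1) d^2 h_n / (db_p db_q).
        dq_df = w / f - 1.0
        R = np.empty((2, 2))
        for p in range(2):
            for q in range(2):
                R[p, q] = a_tilde * s * (1.0 - s) * np.sum(dq_df * d2h[p, q])
        return R

    def resolvable(R):
        # Eq. (291) and the surrounding discussion: the location estimates
        # coincide only if R is negative definite.
        return np.linalg.eigvalsh(R).max() > 0.0

    # Demonstration with an isotropic Gaussian stand-in for h on a 9 x 9 grid.
    g = np.linspace(-2.0, 2.0, 9)
    x, y = [c.ravel() for c in np.meshgrid(g, g)]
    u, v = x - 0.0, y - 0.0                    # x_n - b_x and y_n - b_y
    h = np.exp(-0.5 * (u ** 2 + v ** 2))
    d2h = np.array([[(u ** 2 - 1.0) * h, u * v * h],
                    [u * v * h, (v ** 2 - 1.0) * h]])
    f = np.maximum(100.0 * h, 1e-12)           # stand-in for the fitted model
    w = np.random.default_rng(2).poisson(100.0 * h).astype(float)
    R = r_matrix(w, f, d2h, 100.0, 0.5)
    print(np.linalg.eigvalsh(R), resolvable(R))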
2. Application to the Airy Model

In this subsection, the resolution criterion for component functions of two variables is applied to the Airy component function, which is important for this article since it is the diffraction pattern for a circular aperture.
Example VII.8 (Probability of Resolving the Components of a Bi-Airian Model)  In this example, the probability of resolving the components of a pair of incoherent Airy patterns from error corrupted observations is illustrated. Both systematic and statistical errors are present to show their interaction. It is assumed that N observations w = (w₁ ... w_N)ᵀ are available and that the expectation of the nth observation w_n is described by

   d(x_n, y_n) + α[λh(x_n − β_x1, y_n − β_y1) + (1 − λ)h(x_n − β_x2, y_n − β_y2)]        (292)
In this expression,

   h(x, y) = [2J₁(2πr)/(2πr)]²        (293)
with r² = x² + y², is the function describing the two-dimensional Airy intensity pattern. See Born and Wolf (1980). Thus, the second term of Eq. (292) describes two Airy components located at the points (β_x1, β_y1) and (β_x2, β_y2) and having amplitudes αλ and α(1 − λ), respectively. The first term, d(x, y), is a function of x and y. It represents the systematic error, that is, the contribution to the expectations of the observations not included in the model fitted. The 49 measurement points (x_n, y_n) are described by (−0.75 + 0.25(s − 1), −0.75 + 0.25(t − 1)) with s, t = 1, ..., 7. To avoid atypical conditions such as symmetrical location of measurement points, in this example we take the locations of the components as (β_x1, β_y1) = (0.0475, 0.0575) and (β_x2, β_y2) = (0.0699, 0.1022). Then the Euclidean distance of these locations is 0.05. Since the width at half maximum of the Airy component function is 0.5, the distance of the components is one tenth of the width. The heights of the components are taken equal, that is, λ = 0.5. It has been assumed that the observations have a Poisson distribution. Two cases are considered: α = 625 and α = 10,000. The systematic errors d(x_n, y_n) are generated as follows. A one-component model
   a h(x_n − b_x, y_n − b_y)        (294)

with h(x, y) defined by Eq. (293) is fitted to exact two-component observations:

   f_n(θ) = α[λh(x_n − β_x1, y_n − β_y1) + (1 − λ)h(x_n − β_x2, y_n − β_y2)]        (295)

Suppose that the solution is (ã  b̃_x  b̃_y)ᵀ. Then the systematic errors are taken as

   d(x_n, y_n) = −κ[f_n(θ) − ã h(x_n − b̃_x, y_n − b̃_y)]        (296)
where κ is a gain factor. The values of κ are chosen as 0, 0.5, 1, and 1.5, respectively. Notice that κ = 1 means that the expectations of the observations

   d(x_n, y_n) + f_n(θ)        (297)

are exactly equal to the one-component model

   ã h(x_n − b̃_x, y_n − b̃_y)        (298)
Then Eqs. (260)-(262) show that the observations are in the bifurcation set. For each of the chosen values of κ, 100 sets of observations are simulated and the relative frequency of occurrence of distinct solutions for the location parameters is computed. Table 3 shows the results.

                         TABLE 3
              FREQUENCY OF RESOLUTION (%)

   α          κ = 0    κ = 0.5    κ = 1    κ = 1.5
   625          92        82         62        42
   10,000      100       100         51        12

The first column of Table 3 shows the value of α, the number of counts. The other columns show the measured frequency of occurrence of resolved components in percent. From left to right, κ = 0 corresponds to expected observations without systematic error, κ = 1 to expected observations in the bifurcation set, κ = 0.5 to expected observations between the origin for κ = 0 and the bifurcation set, and κ = 1.5 to expected observations beyond the bifurcation set. The relative standard deviation of the observations for α = 625 is four times as large as that for α = 10,000. As a result, for α = 625 and κ = 0 or κ = 0.5, respectively 8 and 18 sets of observations occur beyond the bifurcation set, while for α = 10,000 this does not occur at all for these values of κ. If the bifurcation set were a hyperplane, the probability of resolution for κ = 1 would be 50%, and so, approximately, would be the corresponding relative frequencies. The relative frequencies of 62 and 51% for α = 625 and α = 10,000, respectively, may be explained by the larger area over which the observations are distributed in the former case and the corresponding decrease of linearity of the bifurcation set. The relatively low percentage of resolved components for α = 10,000 and κ = 1.5 is caused by the fact that the point representing the expectations of a set of observations is now beyond the bifurcation set while the observations are distributed over a relatively small neighborhood of this point. Example VII.8 shows, again, that the probability of resolution is determined by the combination of statistical errors and systematic errors. Precision, that is, low standard deviation of the observations, is always helpful if the expectation of the observations is on the same side of the bifurcation set as the origin. On the other hand, if the systematic errors are such that the expectation of the observations is not on the side of the bifurcation set corresponding to distinct solutions, improving the precision of the observations to improve the resolution is pointless. Example VII.8 and the theory presented in Section VII.C.1 show that the concept of resolution adopted in this article is equally applicable to one-dimensional and to two-dimensional problems. This is a further
difference from the classical Rayleigh resolution. Application of the Rayleigh resolution criterion to the one-dimensional intersection of overlapping two-dimensional Airy patterns with a vertical plane through their locations (see Orloff, 1997, p. 322) disregards all points outside this plane.
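The construction used in Example VII.8 can be summarized in code. The Python sketch below is our illustration of it: scipy supplies the Bessel function J₁ for the Airy pattern of Eq. (293), the error-free two-component expectations follow Eq. (295), and the systematic error follows Eq. (296). The criterion used in the text to fit the one-component pattern of Eq. (294) is not stated; least squares is our assumption here, as are the function names and starting values.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import j1

    def airy(x, y):
        # Airy intensity pattern, Eq. (293): h = [2 J1(2 pi r) / (2 pi r)]^2.
        z = 2.0 * np.pi * np.hypot(x, y)
        out = np.ones_like(z)                  # limiting value 1 at r = 0
        nz = z > 1e-12
        out[nz] = (2.0 * j1(z[nz]) / z[nz]) ** 2
        return out

    # The 7 x 7 grid of measurement points of Example VII.8.
    s, t = np.meshgrid(np.arange(1, 8), np.arange(1, 8))
    xn = (-0.75 + 0.25 * (s - 1)).ravel()
    yn = (-0.75 + 0.25 * (t - 1)).ravel()

    # Component locations and relative heights of Example VII.8.
    bx1, by1, bx2, by2, lam = 0.0475, 0.0575, 0.0699, 0.1022, 0.5

    def two_component(alpha):
        # Error-free two-component expectations, Eq. (295).
        return alpha * (lam * airy(xn - bx1, yn - by1)
                        + (1.0 - lam) * airy(xn - bx2, yn - by2))

    def one_component_fit(f):
        # Fit a h(x_n - b_x, y_n - b_y), Eq. (294); least squares by choice.
        def sse(p):
            a, bx, by = p
            return np.sum((f - a * airy(xn - bx, yn - by)) ** 2)
        return minimize(sse, [f.max(), 0.05, 0.08], method="Nelder-Mead").x

    def expectations(alpha, kappa):
        # Eqs. (292), (296), and (297): expectations including the systematic
        # error generated from the fitted one-component pattern.
        f = two_component(alpha)
        a, bx, by = one_component_fit(f)
        d = -kappa * (f - a * airy(xn - bx, yn - by))
        return d + f

    # One simulated set of Poisson observations; kappa = 1 places the
    # expectations in the bifurcation set.
    mu = np.maximum(expectations(625.0, 1.0), 0.0)
    w = np.random.default_rng(3).poisson(mu)
    print(w.reshape(7, 7))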
D. Nonstandard Resolution: Partial Coherence
Up to now, the model components were assumed to be incoherent. If the images are coherent, the expression for the expectation of the two-point image becomes (Born and Wolf, 1980)

   α[λ²h²(x_n − β_x1, y_n − β_y1) + (1 − λ)²h²(x_n − β_x2, y_n − β_y2)
      + 2ρλ(1 − λ) h(x_n − β_x1, y_n − β_y1) h(x_n − β_x2, y_n − β_y2)]        (299)

where ρ is the real part of the complex degree of coherence and the remaining symbols have their usual meaning. Generally, this model differs from the standard two-component model used up to now as a result of the presence of the last term, unless ρ = 0 or ρ = ±1. If ρ = 0, the components are incoherent. This model is covered by the two-component model assumed up to now with h(·, ·) replaced by h²(·, ·). If ρ = 1, the sources are fully coherent and Eq. (299) may be written

   α[λh(x_n − β_x1, y_n − β_y1) + (1 − λ)h(x_n − β_x2, y_n − β_y2)]²        (300)
Then, if this model is substituted for the expectations of the observations in the log-likelihood function, the theory developed in the preceding subsections is fully applicable, since the likelihood function is now a function of the two-component model

   α[λh(x_n − β_x1, y_n − β_y1) + (1 − λ)h(x_n − β_x2, y_n − β_y2)]        (301)

The log-likelihood function is, therefore, a function of a standard two-component model, which is all that is required by the developed theory. For all other values of ρ, the theory described in the preceding subsection has to be modified to be applicable to the model described by Eq. (299). In van den Bos and den Dekker (1995, 1996), this is done for one-dimensional and two-dimensional location parameters for the uniformly weighted least squares criterion. In den Dekker (1997a), a detailed computation of the probability of resolution for least squares and scalar location parameters is presented. den Dekker (1997b) provides a detailed treatment of the probability of resolution in one and two dimensions for the least squares criterion. These references show that, analogous to the least squares criterion for the incoherent model, the least squares criterion for the coherent model has a
one-component stationary point. The one-component model is now

   a κ(ℓ, ρ) h(x_n − b_x, y_n − b_y)        (302)

with

   κ(ℓ, ρ) = ℓ² + (1 − ℓ)² + 2ρℓ(1 − ℓ)        (303)

Furthermore, the expression analogous to the coordinate transformation defined by Eq. (285) is

   L = ( ℓ̃₁   ℓ̃₂    0     0
          0     0    ℓ̃₁    ℓ̃₂
          1    −1     0     0
          0     0     1    −1 )        (304)

with ℓ̃₁ = ℓ₁(ℓ₁ + ρℓ₂)/κ(ℓ, ρ) and ℓ̃₂ = ℓ₂(ℓ₂ + ρℓ₁)/κ(ℓ, ρ), where, as before, ℓ₁ = ℓ and ℓ₂ = 1 − ℓ. The expressions analogous to the elements r_pq of the 2 × 2 matrix R described by Eq. (289) are

   r_pq = [ℓ₁ℓ₂/κ(ℓ, ρ)] [ℓ₁ℓ₂(1 − ρ²) χ_pq + (ℓ₂ + ρℓ₁)(ℓ₁ + ρℓ₂) ψ_pq]        (305)

where

   χ_pq = −4ã Σ_n d_n (∂h_n/∂b_p)(∂h_n/∂b_q)   and   ψ_pq = −4ã Σ_n d_n h_n ∂²h_n/(∂b_p ∂b_q)        (306)

with

   d_n = w_n − ã κ(ℓ, ρ) h(x_n − b_x, y_n − b_y)        (307)
In Eqs. (305)-(307), b₁ = b_x and b₂ = b_y, while h_n = h(x_n − b_x, y_n − b_y) and its derivatives with respect to b_x and b_y are all evaluated at the one-component stationary point (ã  b̃_x  b̃_x  b̃_y  b̃_y)ᵀ. The results described by Eqs. (305) and (306) are specialized to the least squares criterion. It is not difficult to show that the pertinent results for a general likelihood function q are obtained if in Eq. (306) χ_pq and ψ_pq are replaced by

   χ_pq = 2ã Σ_n (∂q/∂f_n)(∂h_n/∂b_p)(∂h_n/∂b_q)   and   ψ_pq = 2ã Σ_n (∂q/∂f_n) h_n ∂²h_n/(∂b_p ∂b_q)        (308)
Conclusion VII.5  First, a comparison of the elements r_pq of the matrix R in the incoherent case, defined by Eq. (289), with those in the coherent case shows that in the latter case the elements of R are dependent on ℓ. Therefore, in the coherent case the resolvability of the parameters (β_x1, β_y1) and (β_x2, β_y2) depends on ℓ as well. Second, the generalization of the concept of resolution introduced in this article to coherent images is relatively straightforward.
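The internal consistency of these expressions is easily verified symbolically. The following sympy fragment (our verification, not part of the original presentation) confirms that ℓ̃₁ + ℓ̃₂ = 1 with κ(ℓ, ρ) as in Eq. (303), that for ρ = 0 the ℓ dependence of r_pq in Eq. (305) reduces to a positive scalar factor, in agreement with the incoherent case, and that for ρ = 1 only the ψ_pq term survives:

    import sympy as sp

    ell = sp.symbols('ell', positive=True)
    rho, chi, psi = sp.symbols('rho chi psi', real=True)
    l1, l2 = ell, 1 - ell
    kappa = l1**2 + l2**2 + 2*rho*l1*l2                    # Eq. (303)
    lt1 = l1*(l1 + rho*l2)/kappa                           # ell-tilde-1
    lt2 = l2*(l2 + rho*l1)/kappa                           # ell-tilde-2
    print(sp.simplify(lt1 + lt2))                          # prints 1

    r = (l1*l2/kappa)*(l1*l2*(1 - rho**2)*chi
                       + (l2 + rho*l1)*(l1 + rho*l2)*psi)  # Eq. (305)
    print(sp.factor(r.subs(rho, 0)))    # positive multiple of chi + psi
    print(sp.simplify(r.subs(rho, 1)))  # reduces to ell*(1 - ell)*psi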
E. A Survey of Related Literature

In this subsection, a short survey is given of earlier work related to the results presented in this article. In an example illustrating a numerical method for the computation of the parameters of exponential decay models, Lanczos (1957) truncates his simulated triexponential observations to three decimals and concludes that, as a result of the truncation, his triexponential observations produce a biexponential solution. This conclusion is based on the fact that a set of linear equations that has to be solved in the process is ill-conditioned for three exponentials but is not for two. Although at first sight there is a certain similarity to the fundamental limits to resolution exposed in this article, Lanczos's problem is a numerical problem only. Applying the resolution criterion proposed in this article to his observations reveals that a triexponential least squares solution does exist. The first publication reporting exactly coinciding least squares solutions for the nonlinear parameters of a multicomponent model and explaining this phenomenon using singularity theory is that of van den Bos (1980). The theory concerned is discussed in more detail in van den Bos (1981). The simultaneous estimation of the component amplitudes is added in van den Bos (1983). These publications are all restricted to nonlinear least squares estimation of the parameters, which is equivalent to maximum likelihood estimation for independent and identically normally distributed observations. In van den Bos (1984), the results obtained for least squares are extended to include the more general class of criteria of goodness of fit or likelihood functions that are functions of the differences of the observations and the model only. A background function is also added to the component model. The first application to optical resolution is van den Bos (1987), addressing the resolution of two overlapping incoherent intensity peaks of equal height. In van den Bos (1988), the results obtained so far are extended to include the nondifferentiable likelihood functions corresponding to least-absolute-values and minimax model fitting. The coincidence of the least squares solutions for the locations of more than two components is the subject of van den Bos and van der Werff (1990). Coincidence of two-dimensional location parameters, needed for the definition of optical or electron optical resolution in two dimensions, is analyzed in van den Bos (1992). In addition to a number of clarifying numerical examples and corresponding visualizations, den Dekker (1992) addresses component coincidence in the presence of coherence and for a generalized class of likelihood functions. In Swarte (1992), the developed resolution theory is applied to spectroscopic peaks. Critical errors, defined as the errors that have smallest energy among all errors causing coincidence of solutions, are introduced in van den Bos (1993). In the same reference, the concept of probability of resolution is introduced. Critical
errors are also one of the main subjects of van den Bos and Swarte (1993). The coincidence of the solutions for one-dimensional and for two-dimensional component locations in the presence of partial coherence is studied in van den Bos and den Dekker (1995, 1996), respectively. In den Dekker (1997a), the probability of resolution in the presence of partial coherence is computed for least squares model fitting. Probability of resolution in one and two dimensions and in the presence of partial coherence is analyzed and discussed in detail in den Dekker (1997b). Bettens and coauthors (1999) compute the probability of resolution of a pair of Gaussian peaks when the observations are counting results. Finally, Sijbers and coauthors (1998) show that singularity theory may also be used to explain structural change of a likelihood function used in magnetic resonance imaging.
F. Summary and Conclusions

In this section, a Taylor polynomial representation of the likelihood function of the location parameters of the components of overlapping component functions has been used to derive a criterion for the resolvability of the components from error corrupted observations. This has been done for one-dimensional and for two-dimensional observations. The use of the resulting criteria has been demonstrated in numerical examples employing simulated observations made on a pair of sinc-square components such as those used by Rayleigh in his criterion and on a pair of overlapping Airy patterns, respectively. The resulting criteria are simple and general. They clearly show that resolution is limited by errors. These are systematic errors in the sense of modeling errors and nonsystematic errors, that is, statistical fluctuations of the observations. The criteria are also operational, since they divide all possible sets of observations into two categories: sets of observations from which the components can be resolved and sets of observations from which they cannot be resolved. To which of the two categories a particular set of observations belongs is easily discovered without the need to actually try to resolve the components. The use of the criterion has also been extended to include resolution in the presence of coherence.

VIII. SUMMARY AND CONCLUSIONS
In this article, conventional resolution definitions and criteria have been reviewed and an alternative definition and corresponding criterion have been proposed and explained. The proposed definition is this: Two overlapping component functions are said to be resolved from a given set of error disturbed observations if the estimates of their location parameters are distinct. The component functions are unresolved if the estimates of the location parameters exactly coincide.
Although exactly coinciding solutions for parameters may seem highly improbable, they regularly occur, and they do so more often as the component functions increasingly overlap. Singularity theory has been used to explain the occurrence of coinciding solutions and to show that they are a result of systematic errors or nonsystematic errors in the observations. The corresponding resolution criterion operates as follows. First, the parameters t = (cᵀ a b)ᵀ of the one-component model f(x; t) = ah(x; b) + g(x; c) are estimated from the available observations w = (w₁ ... w_N)ᵀ. In this expression, h(x; b) is the component model as a function of the variable x and located at b, and a is the amplitude of the component. The function g(x; c) is a background function with parameters c = (c₁ ... c_M)ᵀ. As estimates of t, the maximum likelihood estimates t̃ = (c̃ᵀ ã b̃)ᵀ are taken. These maximize the likelihood function chosen by the experimenter. This likelihood function may be different from the likelihood function corresponding to the actual probability density function of the observations w. Next the estimates t̃ and the observations w are substituted in the discriminant function

   D = Σ_n (∂q/∂f_n) h_n⁽²⁾(b̃)

where q = q(t) is the logarithm of the likelihood function, f_n = f(x_n; t), the expression h_n⁽²⁾(b̃) is the second-order derivative of the component function h(x_n; b) with respect to b at b = b̃, and x_n is the nth measurement point with n = 1, ..., N. Notice that D is a function of the observations only, since t̃ is fully defined by w. Also notice that D is specific to the component model, the background model, and the likelihood function chosen by the experimenter. It has been shown that if D > 0, the estimates of the location parameters are distinct, and that if D < 0, they coincide. This is why the sign of D has been proposed as the resolution criterion. In this article, the observations have been modeled as stochastic variables since, in our opinion, there is no more useful alternative. In this model, the expectations of the observations represent the actual two-component and background function underlying the observations. This function need not be the same function as that adopted by the experimenter and used in the computation of the discriminant D. In any case, if the observations are stochastic variables, then the discriminant D and its sign become stochastic variables as well. Consequently, resolution of the components occurs with a certain probability only. This has been called the probability of resolution. For a chosen likelihood function, component function, and background function, the probability of resolution is determined by the expectations of the observations and the way the observations are distributed about these expectations. To explain this
further, we have introduced the notion of the Euclidean space of observations. In this space, the nth coordinate axis represents the nth observation w_n, and a point in this space represents a set of observations (w₁ ... w_N)ᵀ. It has been shown that all sets of observations corresponding to vanishing D form a hypersurface called the bifurcation set, which divides the Euclidean space of observations into two regions. From the observations in the one region, the components can be resolved. From the observations in the other region, this is not possible. The bifurcation set is characteristic of the component function, the background function, and the likelihood function chosen. It has been shown that increasing the standard deviation of the observations may decrease the probability of resolution, since it increases the probability of occurrence of a set of observations on the side of the bifurcation set corresponding to coinciding solutions for the locations. The probability of occurrence of a set of observations in either region also depends on the position of the point representing the expectations of the observations relative to the bifurcation set. Systematic errors, that is, differences of the model describing the expectations and the model adopted by the experimenter, may move the bifurcation set toward this point and even past it. This causes the probability of resolution to decrease and shows how systematic errors may adversely influence resolution. These considerations clearly show that errors, both systematic and nonsystematic, are the only limit to resolution. The generality of the proposed definition and criterion has also been demonstrated. Different from conventional definitions, in the proposed definition the number of observations need not be such that the use of asymptotic statistical results is justified. Furthermore, the likelihood function used in the estimation of the location parameters and for the computation of the bifurcation set need not correspond to the probability density function of the observations. Also, as set out earlier in this section, the influence of both systematic and nonsystematic errors may be analyzed. Since all results derived depend on the location of the measurement points, the influence of these experimental parameters upon resolution may also be investigated. Furthermore, it has been shown how in this approach, different from conventional approaches, the resolution definition and criterion may be extended to include resolution in more dimensions. Extension to resolution in the presence of coherence has also been briefly outlined. Finally, a brief survey of relevant literature has been presented.
REFERENCES

Andrews, H. C., and Hunt, B. R. (1977). Digital Image Restoration. Englewood Cliffs, NJ: Prentice Hall.
Arnol'd, V. I. (1992). Catastrophe Theory. Berlin: Springer-Verlag.
Banham, M. R., and Katsaggelos, A. K. (1997). Digital image restoration. IEEE Signal Processing Magazine 14, 24-41.
Barakat, R. (1962). Application of apodization to increase two-point resolution by the Sparrow criterion. I. Coherent illumination. J. Opt. Soc. Am. 52, 276-283.
Barakat, R., and Levin, E. (1963). Application of apodization to increase two-point resolution by the Sparrow criterion. II. Incoherent illumination. J. Opt. Soc. Am. 53, 274-282.
Barnes, C. W. (1966). Object restoration in a diffraction-limited imaging system. J. Opt. Soc. Am. 56, 575-578.
Bettens, E., Van Dyck, D., den Dekker, A. J., Sijbers, J., and van den Bos, A. (1999). Model-based two-object resolution from observations having counting statistics. Ultramicroscopy 77, 37-48.
Biraud, Y. (1969). A new approach for increasing the resolving power by data processing. Astronomy Astrophys. 1, 124-127.
Born, M., and Wolf, E. (1980). Principles of Optics. New York: Pergamon.
Burch, S. F., Gull, S. F., and Skilling, J. (1983). Image restoration by a powerful maximum entropy method. Comput. Vision, Graphics, Image Processing 23, 113-128.
Buxton, A. (1937). Note on optical resolution. Lond. Edinb. Dublin Philos. Magazine J. Sci. 23, 440-442.
Castleman, K. R. (1979). Digital Image Processing. London: Prentice Hall.
Cathey, W. T., Frieden, B. R., Rhodes, W. T., and Rushforth, C. K. (1984). Image gathering and processing for enhanced resolution. J. Opt. Soc. Am. A 1, 241-250.
Chatfield, C. (1995). Statistics for Technology. London: Chapman & Hall.
Clements, A. M., and Wilkins, J. E., Jr. (1974). Apodization for maximum encircled-energy ratio and specified Rayleigh limit. J. Opt. Soc. Am. 64, 23-27.
Cox, I. J., and Sheppard, C. J. R. (1986). Information capacity and resolution in an optical system. J. Opt. Soc. Am. A 3, 1152-1158.
Cunningham, D. R., and Laramore, R. D. (1976). Detection in image dependent noise. IEEE Trans. Inform. Theory IT-22, 603-610.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B. Methodological 39, 1-38.
den Dekker, A. J. (1992). Resolutie: Illustraties, experimenten en theorie. Master's thesis, Department of Physics, Delft University of Technology, Delft, The Netherlands (in Dutch).
den Dekker, A. J. (1997a). Model-based optical resolution. IEEE Trans. Instrum. Meas. 46(4), 798-802.
den Dekker, A. J. (1997b). Model-based resolution. Ph.D. thesis, Delft University of Technology. Delft, The Netherlands: Delft Univ. Press.
den Dekker, A. J., and van den Bos, A. (1997). Resolution: A survey. J. Opt. Soc. Am. A 14, 547-557.
Dhrymes, P. J. (1970). Econometrics--Statistical Foundations and Applications. New York: Harper & Row.
Dunn, J. H., Howard, D. D., and Pendleton, K. B. (1970). Tracking Radar. New York: McGraw-Hill.
Falconi, O. (1967). Limits to which double lines, double stars, and disks can be resolved and measured. J. Opt. Soc. Am. 57, 987-993.
Farrell, E. J. (1966). Information content of photoelectric star images. J. Opt. Soc. Am. 56, 578-587.
Fellgett, P. B., and Linfoot, E. H. (1955). On the assessment of optical images. Philos. Trans. R. Soc. Lond. A. Math. Phys. Sci. 247, 369-407.
Fried, D. L. (1979). Resolution, signal-to-noise-ratio, and measurement precision. J. Opt. Soc. Am. 69, 399-406.
Fried, D. L. (1980). Resolution, signal-to-noise-ratio, and measurement precision: Addendum. J. Opt. Soc. Am. 70, 748-749.
Frieden, B. R. (1967). Band-unlimited reconstruction of optical objects and spectra. J. Opt. Soc. Am. 57, 1013-1019.
Frieden, B. R. (1972). Restoring with maximum likelihood and maximum entropy. J. Opt. Soc. Am. 62, 511-518.
Frieden, B. R. (1975). Image Enhancement and Restoration, Vol. 6. New York: Springer-Verlag.
Frieden, B. R. (1980). Statistical models for the image restoration problem. Comput. Graph. Image Processing 12, 40-59.
Frieden, B. R., and Burke, J. J. (1972). Restoring with maximum entropy. II: Superresolution of photographs of diffraction-blurred impulses. J. Opt. Soc. Am. 62, 1202-1210.
Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. PAMI-6(6), 721-741.
Gerchberg, R. W. (1974). Super-resolution through error energy reduction. Optica Acta 21(9), 709-720.
Gilmore, R. (1981). Catastrophe Theory for Scientists and Engineers. New York: Wiley.
Goodwin, G. C., and Payne, R. L. (1977). Dynamic System Identification--Experiment Design and Analysis. New York: Academic Press.
Harris, J. L. (1964a). Diffraction and resolving power. J. Opt. Soc. Am. 54, 931-936.
Harris, J. L. (1964b). Resolving power and decision theory. J. Opt. Soc. Am. 54, 606-611.
Helstrom, C. W. (1969). Detection and resolution of incoherent objects by a background-limited optical system. J. Opt. Soc. Am. 59, 164-175.
Helstrom, C. W. (1970). Resolvability of objects from the standpoint of statistical parameter estimation. J. Opt. Soc. Am. 60, 659-666.
Holmes, T. J. (1988). Maximum-likelihood image restoration adapted for noncoherent optical imaging. J. Opt. Soc. Am. A 5(5), 666-673.
Houston, W. V. (1927). A compound interferometer for fine structure work. Phys. Rev. 29, 478-484.
Hunt, B. R. (1973). The application of constrained least squares estimation to image restoration by digital computer. IEEE Trans. Comput. C-22(9), 805-812.
Hunt, B. R. (1994). Prospects for image restoration. Int. J. Mod. Phys. C5, 151-178.
Hunt, B. R. (1995). Super-resolution of images: Algorithms, principles, performance. Int. J. Imaging Syst. Technol. 6, 297-304.
Hunt, B. R., and Andrews, H. C. (1973). Comparison of different filter structures for restoration of images, in Proceedings of the HICCS Conference, University of Hawaii, Honolulu, HI.
Hunt, B. R., and Sementilli, P. (1992). Description of a Poisson imagery super-resolution algorithm, in Astronomical Data Analysis Software and Systems, Vol. 25, edited by I. Worrall et al. San Francisco: Astronomical Society of the Pacific, pp. 196-199.
Hunt, G. W. (1981). An algorithm for the nonlinear analysis of compound bifurcation. Philos. Trans. R. Soc. Lond. 300(A 1455), 443-471.
Idell, P. S., and Webster, A. (1992). Resolution limits for coherent optical imaging: Signal-to-noise analysis in the spatial-frequency domain. J. Opt. Soc. Am. A 9, 43-56.
Jacquinot, P., and Roizen-Dossier, B. (1964). Apodisation, in Progress in Optics, Vol. 3, edited by E. Wolf. Amsterdam: North-Holland.
Jansson, P. A., Hunt, R. H., and Plyler, E. K. (1970). Resolution enhancement of spectra. J. Opt. Soc. Am. 60, 596-599.
Jennrich, R. I. (1995). An Introduction to Computational Statistics--Regression Analysis. Englewood Cliffs, NJ: Prentice Hall.
Kelly, E. J., Reed, I. S., and Root, W. L. (1960). The detection of radar echoes in noise. II. J. Soc. Ind. Appl. Math. 8, 481-507.
Kibe, J. N., and Wilkins, J. E., Jr. (1983). Apodization for maximum central irradiance and specified large Rayleigh limit of resolution. J. Opt. Soc. Am. 73, 387-391.
Kibe, J. N., and Wilkins, J. E., Jr. (1984). Apodization for maximum central irradiance and specified large Rayleigh limit of resolution. II. J. Opt. Soc. Am. A 1, 337-343.
Kosarev, E. L. (1990). Shannon's superresolution limit for signal recovery. Inverse Probl. 6, 55-76.
Lanczos, C. (1957). Applied Analysis. London: Pitman.
Lucy, L. B. (1974). An iterative technique for the rectification of observed distributions. Astronomical J. 79(6), 745-754.
Lucy, L. B. (1992a). Resolution limits for deconvolved images. Astronomical J. 104(3), 1260-1265.
Lucy, L. B. (1992b). Statistical limits to superresolution. Astronomy Astrophys. 261, 706-710.
Lukosz, W. (1966). Optical systems with resolving powers exceeding the classical limit. J. Opt. Soc. Am. 56, 1463-1472.
Lukosz, W. (1967). Optical systems with resolving powers exceeding the classical limit. II. J. Opt. Soc. Am. 57, 932-941.
Lummer, O., and Reiche, F. (1910). Die Lehre von der Bildentstehung im Mikroskop von Ernst Abbe. Braunschweig, Germany: Vieweg.
Luneberg, R. K. (1944). Mathematical Theory of Optics. Providence, RI: Brown Univ. Press.
McKechnie, T. S. (1972). The effect of condenser obstruction on the two-point resolution of a microscope. Optica Acta 19, 729-737.
Meinel, E. S. (1986). Origins of linear and nonlinear recursive restoration algorithms. J. Opt. Soc. Am. A 3, 787-799.
Mood, A. M., Graybill, F. A., and Boes, D. C. (1987). Introduction to the Theory of Statistics, 3d ed. Auckland, New Zealand: McGraw-Hill.
Nahrstedt, D. A., and Schooley, L. C. (1979). Alternative approach in decision theory as applied to the resolution of two point images. J. Opt. Soc. Am. 69, 910-912.
Norden, R. H. (1972). A survey of maximum likelihood. Int. Stat. Rev. 40(3), 329-354.
Norden, R. H. (1973). A survey of maximum likelihood, part 2. Int. Stat. Rev. 41(1), 39-58.
O'Keefe, M. A. (1992). Resolution in high-resolution electron microscopy. Ultramicroscopy 47, 282-297.
Orhaug, T. (1969). On the resolution of imaging systems. Optica Acta 16, 75-84.
Orloff, J., ed. (1997). Handbook of Charged Particle Optics. Boca Raton, FL: CRC Press.
Osterberg, H. (1950). Microscope imagery and interpretations. J. Opt. Soc. Am. 40, 295-303.
Osterberg, H., and Wilkins, J. (1949). The resolving power of a coated objective. J. Opt. Soc. Am. 39, 553-557.
Osterberg, H., and Wissler, F. C. (1949). The resolution of two particles in a bright field by coated microscope objectives. J. Opt. Soc. Am. 39, 558-566.
Papoulis, A. (1965). Probability, Random Variables, and Stochastic Processes. New York: McGraw-Hill.
Pask, C. (1976). Simple optical theory of super-resolution. J. Opt. Soc. Am. Lett. 66, 68-70.
Poston, T., and Stewart, I. N. (1978). Catastrophe Theory and Its Applications. London: Pitman.
Ramsay, B. P., Cleveland, E. L., and Koppius, O. T. (1941). Criteria and the intensity-epoch slope. J. Opt. Soc. Am. 31, 26-33.
Rayleigh, Lord. (1879). Investigations in optics, with special reference to the spectroscope. Lond. Edinb. Dublin Philos. Magazine J. Sci. 8(49), 261-274, 403-411, 477-486.
Rayleigh, Lord. (1902). Scientific Papers by John William Strutt, Baron Rayleigh, Vol. 3 (1887-1892). Cambridge, UK: Cambridge Univ. Press, Wave Theory of Light chapter, pp. 47-189.
Richardson, W. H. (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55-59.
Ronchi, V. (1961). Resolving power of calculated and detected images. J. Opt. Soc. Am. 51, 458-460.
Root, W. L. (1962). Radar resolution of closely spaced targets. IRE Trans. Milit. Electronics MIL-6, 197-204.
Rushforth, C. K., and Harris, R. W. (1968). Restoration, resolution, and noise. J. Opt. Soc. Am. 58, 539-545.
Saunders, P. T. (1980). An Introduction to Catastrophe Theory. Cambridge, UK: Cambridge Univ. Press.
Schell, A. C. (1965). Enhancing the angular resolution of incoherent sources. The Radio and Electronic Engineer 29, 21-26.
Schuster, A. (1924). Theory of Optics. London: Arnold.
Selby, S. M. (1971). Standard Mathematical Tables. Cleveland: CRC Press.
Sementilli, P. J., Hunt, B. R., and Nadar, M. S. (1993). Analysis of the limit to superresolution in incoherent imaging. J. Opt. Soc. Am. A 10(11), 2265-2276.
Sezan, M. I., and Tekalp, A. M. (1990). Survey of recent developments in digital image restoration. Opt. Eng. 29(5), 393-404.
Shannon, C. E. (1949). Communication in the presence of noise. Proc. IRE 37, 10-21.
Shepp, L. A., and Vardi, Y. (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. Med. Imaging MI-1(2), 113-122.
Sijbers, J., den Dekker, A. J., Scheunders, P., and Van Dyck, D. (1998). Maximum likelihood estimation of Rician parameters. IEEE Trans. Med. Imaging 17(3), 357-361.
Slepian, D., and Pollak, H. O. (1961). Prolate spheroidal wave functions, Fourier analysis and uncertainty: I. Bell Syst. Tech. J. 40, 43-64.
Snyder, D. L., Hammoud, A. M., and White, R. L. (1993). Image recovery from data acquired with a charge-coupled-device camera. J. Opt. Soc. Am. A 10(5), 1014-1023.
Sparrow, C. M. (1916). On spectroscopic resolving power. Astrophys. J. 44, 76-86.
Spence, J. C. H. (1988). Experimental High-Resolution Electron Microscopy, 2d ed. New York: Oxford Univ. Press.
Stuart, A., and Ord, J. K. (1994). Kendall's Advanced Theory of Statistics, Vol. 1: Distribution Theory, 6th ed. London: Arnold.
Stuart, A., Ord, J. K., and Arnold, S. (1999). Kendall's Advanced Theory of Statistics, Vol. 2A: Classical Inference and the Linear Model, 6th ed. London: Arnold.
Swarte, J. H. (1992). Precision and resolution in spectroscopic model fitting. Ph.D. thesis, Delft University of Technology. Delft, The Netherlands: Delft Univ. Press.
Swerling, P. (1964). Parameter estimation accuracy formulas. IEEE Trans. Inform. Theory 10, 302-314.
Thompson, J. M. T. (1982). Instabilities and Catastrophes in Science and Engineering. Chichester: Wiley.
Toraldo di Francia, G. (1955). Resolving power and information. J. Opt. Soc. Am. 45, 497-501.
van den Bos, A. (1980). A class of small sample nonlinear least squares problems. Automatica 16, 487-490.
van den Bos, A. (1981). Degeneracy in nonlinear least squares. IEEE Proc. 128(Part D), 109-116.
van den Bos, A. (1982). Handbook of Measurement Science, Vol. 1: Theoretical Fundamentals. Chichester: Wiley, Chapter 8: Parameter Estimation, pp. 331-377.
van den Bos, A. (1983). Limits to resolution in nonlinear least squares model fitting. IEEE Trans. Automatic Control AC-28, 1118-1120.
van den Bos, A. (1984). Resolution of model fitting methods. Int. J. Syst. Sci. 15, 825-835.
van den Bos, A. (1987). Optical resolution: An analysis based on catastrophe theory. J. Opt. Soc. Am. A 4, 1402-1406.
van den Bos, A. (1988). Nonlinear least-absolute-values and minimax model fitting. Automatica 24, 803-808.
van den Bos, A. (1992). Ultimate resolution: A mathematical framework. Ultramicroscopy 47, 298-306.
van den Bos, A. (1993). Critical errors associated with parameter resolvability. Comput. Phys. Commun. 76, 184-190.
van den Bos, A., and den Dekker, A. J. (1995). Ultimate resolution in the presence of coherence. Ultramicroscopy 60, 345-348.
van den Bos, A., and den Dekker, A. J. (1996). Coherent model-based optical resolution. J. Opt. Soc. Am. A 13, 1667-1669.
van den Bos, A., and Swarte, J. H. (1993). Resolvability of the parameters of multiexponentials and other sum models. IEEE Trans. Signal Processing SP-41, 313-322.
van den Bos, A., and van der Werff, T. T. (1990). Degeneracy in nonlinear least squares--Coincidence of more than two parameters. IEEE Proc. 137(Part D), 273-280.
Walsh, D. O., and Nielsen-Delaney, E. A. (1994). Direct method for superresolution. J. Opt. Soc. Am. A 11, 572-579.
Wang, G., and Li, Y. (1999). Axiomatic approach for quantification of image resolution. IEEE Signal Processing Lett. 6(10), 257-258.
Whitney, H. (1955). On singularities of mappings of Euclidean spaces. I. Mappings of the plane into the plane. Ann. Math. 62(3), 374-410.
Wilkins, J. E. (1950). The resolving power of a coated objective. J. Opt. Soc. Am. 40, 222-224.
Wolf, E. (1951). The diffraction theory of aberrations. Rep. Prog. Phys. 14, 95-120.
Zacks, S. (1971). The Theory of Statistical Inference. New York: Wiley.