Journal of the Korean Statistical Society 39 (2010) 417–429
Multiscale density estimation with errors in variables✩
Laurent Cavalier a,∗, Marc Raimondo b
a Université Aix-Marseille 1, CMI, 39 rue Joliot-Curie, 13453 Marseille cedex 13, France
b School of Mathematics and Statistics, the University of Sydney, NSW 2006, Australia
Article history: Received 24 March 2009; Accepted 29 September 2009; Available online 12 October 2009.
In memory of Marc Raimondo.
AMS 2000 Subject Classification: primary 62G05; secondary 62G08.
Keywords: Adaptive estimation; Deconvolution; Density; Wavelets; WaveD
Abstract. We present an adaptive method for density estimation when the observations X = (X1, ..., Xn) are contaminated by additive errors Y = X + Z. The error distribution is not specified by the model but is estimated using repeated measurements (Yi)i=1,2,3 of X. In this setting, we propose a wavelet method for density estimation which adapts both to the degree of ill posedness of the problem (smoothness of the error distribution) and to the regularity of the target density. Our method is implemented in the Fourier domain via a square-root transformation of the empirical characteristic function and yields fast translation invariant non-linear wavelet approximations with data-driven choices of fine tuning parameters. When the variable X is observed without errors our method provides a natural implementation of direct density estimation in the Meyer wavelet basis. We illustrate the adaptiveness properties of our estimator with a range of finite sample examples drawn from populations with smooth and less smooth density functions. © 2009 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved.
1. Introduction

1.1. Contaminated data and repeated measurements

In this paper we study density estimation when the data are contaminated by additive errors. The error distribution is not supposed to be known, but we suppose that we have access to a small number of replicated observations (for simplicity we will take k = 3 replications). That is, we observe

Y^i = X + Z^i,   i = 1, 2, 3,   (1)

where Y^i = (Y^i_1, ..., Y^i_n), X = (X_1, ..., X_n) and Z^i = (Z^i_1, ..., Z^i_n). The X_j's (respectively Z^i_j's) are i.i.d. random variables with a Probability Density Function (PDF) f (respectively g). We wish to estimate f from the observations Y^i, i = 1, 2, 3. We suppose that the Z^i, i = 1, 2, 3, are independent of each other and independent of X. We assume that the error density g is symmetric. In this setting we exploit the information provided by the repeated measurements (1) of X to define a vector of error differences:

W = Y^3 − Y^2 = Z^3 − Z^2,   (2)

where W = (W_1, ..., W_n) are i.i.d. rv's independent of Z^1. Under some smoothness assumption on the error density g, it turns out that the information provided by W allows us to recover the target density f with the same accuracy as if the error distribution were fully specified.

✩ The authors acknowledge the support of the French ANR under grant ANR-07-BLAN-0234-01 ‘‘Oracle’’.
∗ Corresponding author.
E-mail addresses: [email protected] (L. Cavalier), [email protected] (M. Raimondo).
1226-3192/$ – see front matter © 2009 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.jkss.2009.09.001
1.2. Density estimation from contaminated data

Density estimation from contaminated data has received a lot of attention over the last decades. A semi-parametric approach has been considered by Horowitz and Markatou (1996). For known error distribution the optimal rates have been derived by Fan (1991). The case where the error distribution is unknown has been treated in Diggle and Hall (1993), Neumann (1997) and Li and Vuong (1998). More recently the model (1) with repeated measurements has been studied in Delaigle, Hall, and Meister (2007), see also Ma and Carroll (2006) and Neumann (2007). The model with replicated data is well suited to applications, and data of this type are available in various fields: Jaech (1985) gives an example where uranium concentration is measured for several fuel pellets; examples of measurement errors in surveys are given in Biemer, Groves, Lyberg, Mathiowetz, and Sudman (2004); Andersen, Bro, and Brockoff (2003) describe applications to nuclear magnetic resonance; Oman, Meir, and Halm (1999) discuss applications to physiotherapy. A good survey of medical examples may be found in Dunn (2004).

1.3. Related works on density estimation

Adaptive wavelet methods that specifically address the deconvolution of a density with known error distribution can be found in Fan and Koo (2002) and in Pensky and Vidakovic (1999). Fourier and kernel methods can, for example, be found in Butucea (2004) for smooth error distributions and in Comte, Rozenholc, and Taupin (2006) and Goldenshluger (2002) for super-smooth error distributions. For smooth convolution, our paper can be seen as an extension of Pensky and Vidakovic (1999) to the case where the error distribution is not known. Unlike existing wavelet density estimators, our method adapts both to the regularity of the density and to the smoothness of the error distribution.
Our approximation result (see Section 5) can be compared to the results of Fan and Koo (2002) and Neumann (1997) in the dense setting, where linear estimators are known to be optimal. However, we consider here a more general setting where non-linear wavelet-type estimators outperform linear procedures over a wide class of functions.

1.4. Related works on circular density estimation

In this paper, we consider the specific framework of circular density estimation as in Efromovich (1997). The observations belong to a known interval ([0, 1] in the sequel). Moreover, their probability densities are 1-periodic. This model is not as standard as the model with a density on R, but on the other hand it is often more tractable from a numerical point of view. This kind of model is considered in a whole chapter of Efromovich (1999) (Chapter 3). One practically important case corresponds to directional (or angular) data. In this model, the data take the form of angles. These kinds of data may be found in several domains of science. The book by Fisher (1993) gives a rather complete study of circular data. Some applications are: departure directions of ants or birds, seasonal wind directions, or energy demand over a period of 24 hours (see also Arnold & SenGupta, 2006). The model of circular deconvolution, or circular errors in variables, where the circular data are contaminated by some error, is considered in Section 3.5 of Efromovich (1999), where it is called ‘‘Data contaminated by measurement errors’’.

1.5. Related works in the white noise model

The seminal paper of Donoho (1995) has inspired many wavelet methods to recover f from indirect observations, among others Abramovich and Silverman (1998), Johnstone (1999), Walter and Shen (1999) and Wang (1997). Additional references may be found in Johnstone, Kerkyacharian, Picard, and Raimondo (2004) ([JKPR] in the sequel) and in Cavalier and Raimondo (2007).
The optimal theory for general inverse problems may be found in Cavalier, Golubev, Picard, and Tsybakov (2002) or Cavalier and Tsybakov (2002). For a semi-parametric setting see Butucea and Matias (2005). Recently, several statistical papers have considered adaptive estimation in the Gaussian white noise framework with a noisy kernel, see Cavalier and Hengartner (2005), Cavalier and Raimondo (2007), Efromovich and Koltchinskii (2001), Hoffmann and Reiss (2008), Marteau (2006) and Willer (2006).

1.6. A computational point of view

In the statistical literature on density deconvolution (cited previously) it is customary to consider a density function defined on the real line with its corresponding characteristic function (continuous Fourier transform) also defined on the real line. In this setting a generic density estimate is obtained by inversion of the empirical characteristic function together with some smoothing step. In practice this yields various discretization effects when approximating the characteristic function from finite sample data. On the other hand, signal and image processing deconvolution algorithms (based on the Fast Fourier Transform) are known to have sound numerical properties in regression models when the target function (or image) is rescaled onto the unit interval. It is with this philosophy in mind that we approach the density deconvolution problem. Our aim here is to present an adaptive multiscale algorithm whose implementation is well suited to density estimation.
Fig. 1. Target PDF f: standard normal (left); Claw density (1/2) N(0, 1) + Σ_{i=0}^{4} (1/10) N(i/2 − 1, (1/10)^2) (middle); error density g ∼ Γ_D(1, 0.2), double Gamma PDF with shape ν = 1 and scale 0.2 (right).
Our method builds on the WaveD paradigm of [JKPR] and takes full advantage of the fast algorithm of Donoho and Raimondo (2004) and of the waved software package of Raimondo and Stewart (2006). A general PDF is not periodic with period one, as imposed by the WaveD model and the use of the Fast Fourier Transform. However, our estimator can be applied to any population with a general PDF provided that the data (1) and (2) are re-scaled to ensure that their range is well within [0, 1]. This can be done by re-scaling the data by kσ̂, where σ̂ = σ̂_Y is the Y sample standard deviation and k is a positive constant, and shifting the data centre to 0.5. This need not be very precise and any value of k large enough may be used, see e.g. Kay (1998). For most common distributions the value k ≥ 7 is acceptable. This process is illustrated in the bottom plots of Fig. 2 using the raw data of the top plots of Fig. 2. This is also related to the notion of wrapping a PDF defined on the line around the circumference of a circle (see Efromovich, 1999). On the other hand, in the circular case, one directly uses the discrete coefficients in the Fourier domain. In the case of a PDF on the whole of R, the continuous Fourier transform appears and is not as easy to compute from a numerical point of view.

1.7. Simulated examples and Degree of Ill Posedness

For illustration purposes throughout this paper we have simulated data X from a population with density f where (a) f is the standard normal density (homogeneously smooth) and (b) f is the Claw density (non-homogeneous with spikes), see Fig. 1. For the error distribution we use the double Gamma PDF since it appears as a natural candidate to model smooth error distributions with zero mean. Note that for such an error distribution the Fourier transform has a polynomial decay |ω|^{−ν}, where ν is the so-called Degree of Ill Posedness (DIP).
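The re-scaling step described above can be sketched in a few lines. This is only a minimal illustration, not code from the waved package: the function name, the centring at the sample mean (the paper only says the centre is shifted to 0.5) and the choice k = 10 are our own.

```python
import math
import random

def rescale(data, k=10.0):
    """Re-scale observations so that their range lies well within [0, 1]:
    divide by k times the sample standard deviation and shift the centre
    to 0.5 (mean-centring is an illustrative choice)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in data) / (n - 1))
    return [(y - mean) / (k * sd) + 0.5 for y in data], sd

random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(4096)]
y_tilde, sd_hat = rescale(y, k=10.0)
```

With k = 10, data within ten standard deviations of the centre land inside the unit interval, which is comfortably conservative for most common distributions.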
To simulate data according to (1) and (2) we generated a sample X with PDF f and three samples Z^i, i = 1, 2, 3, with PDF g ∼ Γ_D(1, 0.2): double Gamma with shape parameter ν = 1 and scale 0.2. The corresponding raw data histograms of (1) and (2) are depicted in the top plots of Fig. 2. The re-scaled data used for implementation are depicted in the bottom plots of Fig. 2.

2. Preliminaries

2.1. Smooth convolution model

We consider a model of circular convolution as in Efromovich (1997). Assume that both PDFs f and g are periodic functions on T = [0, 1]. Moreover, the error density g is assumed to be symmetric. Denote by h the PDF of the Y_i's; from (1) we see that h(t) = g ∗ f(t), where ∗ denotes circular convolution. Let e_ℓ(t) = e^{2πiℓt}, ℓ ∈ Z, denote the Fourier basis and f_ℓ = ⟨f, e_ℓ⟩, g_ℓ = ⟨g, e_ℓ⟩, h_ℓ = ⟨h, e_ℓ⟩, with ⟨f, g⟩ = ∫_0^1 f ḡ. By properties of the Fourier transform,

h_ℓ = f_ℓ × g_ℓ,   ℓ ∈ Z.   (3)
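The factorisation (3) is easy to verify numerically on a discrete grid. The following sketch (the grid size and the two test functions are arbitrary, chosen only for illustration) checks that the discrete Fourier coefficients of a circular convolution factor as in (3):

```python
import cmath
import math

N = 64  # discretisation of the circle T = [0, 1)
# two arbitrary 1-periodic test functions sampled on the grid (illustrative)
f = [1.0 + 0.5 * math.cos(2 * math.pi * t / N) for t in range(N)]
g = [1.0 + 0.3 * math.cos(4 * math.pi * t / N) for t in range(N)]

def dft_coeff(x, ell):
    """Discrete analogue of the Fourier coefficient <x, e_ell> on T."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * ell * t / n) for t in range(n)) / n

def circ_conv(f, g):
    """Circular (periodic) convolution h = g * f, discretised with mesh 1/N."""
    n = len(f)
    return [sum(f[s] * g[(t - s) % n] for s in range(n)) / n for t in range(n)]

h = circ_conv(f, g)
for ell in range(-3, 4):
    assert abs(dft_coeff(h, ell) - dft_coeff(f, ell) * dft_coeff(g, ell)) < 1e-9
```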
Fig. 2. Top plots: histogram of Y^1 (left), histogram of W = Z^3 − Z^2 (right), n = 4096, with the X sample simulated from the Claw density (Fig. 1 middle) and the Z^i, i = 1, 2, 3, samples simulated from the double Gamma Γ_D(1, 0.2) distribution (see Fig. 1 right). Bottom plots: histograms of the re-scaled data Ỹ = Y^1/(10 σ̂) + 0.5 (left) and W̃ = W/(10 σ̂) (right), here with σ̂ = σ̂_Y = 0.89.
We consider ordinary smooth convolution when

|g_ℓ| ∼ C |ℓ|^{−ν},   (Cν)

where C is some positive constant and ν > 1/2. Since the error density g is symmetric, the g_ℓ are real-valued positive coefficients. Under assumption (Cν) we thus have g_ℓ ∼ C |ℓ|^{−ν}.

2.2. Empirical Fourier coefficients

To simplify the notation we write Y_i = Y^1_i, for i = 1, 2, ..., n. Define empirical Fourier coefficients from the (re-scaled) data (1) and (2), for ℓ = 0, ±1, ...,

y_ℓ = (1/n) Σ_{k=1}^{n} e^{2πiℓY_k},   (4)

w_ℓ = (1/n) Σ_{k=1}^{n} e^{2πiℓW_k} = (1/n) Σ_{k=1}^{n} e^{2πiℓZ^3_k} e^{−2πiℓZ^2_k}.   (5)
It is easy to check that E(y_ℓ) = h_ℓ and E(w_ℓ) = g_ℓ^2, where h_ℓ, g_ℓ are defined in (3). The standardised coefficients

r_ℓ = (1/√n) Σ_{k=1}^{n} (e^{2πiℓY_k} − h_ℓ),   (6)

s_ℓ = (1/√n) Σ_{k=1}^{n} (e^{2πiℓW_k} − g_ℓ^2),   (7)

yield a sequence space representation

y_ℓ = h_ℓ + n^{−1/2} r_ℓ,   ℓ ∈ Z,   (8)

w_ℓ = g_ℓ^2 + n^{−1/2} s_ℓ,   ℓ ∈ Z,   (9)
which is similar to that of Cavalier and Raimondo (2007) derived in the white noise model.

2.3. Adaptive stopping time via Fourier thresholding

The w_ℓ sequence (9) has mean g_ℓ^2. This renders the deconvolution process harder to perform in the density setting than in the white noise model, where the noisy eigenvalues are unbiased with mean g_ℓ. As in Delaigle et al. (2007) we introduce a modified sequence of observations by performing a square-root transformation in the Fourier domain,
v_ℓ := √w_ℓ = g_ℓ (1 + n^{−1/2} s_ℓ / g_ℓ^2)^{1/2}.   (10)
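A small simulation may clarify (4)-(10). Here the replicate errors are taken Gaussian purely to have a closed-form g_ℓ (the paper's experiments use double Gamma errors instead), and all names are illustrative:

```python
import cmath
import math
import random

def emp_fourier(data, ell):
    """Empirical Fourier coefficient (1/n) sum_k exp(2*pi*i*ell*X_k), cf. (4)-(5)."""
    return sum(cmath.exp(2j * math.pi * ell * x) for x in data) / len(data)

random.seed(2)
n, sigma = 20000, 0.05
# replicate errors Z^2 and Z^3 (Gaussian here only for illustration)
z2 = [random.gauss(0.0, sigma) for _ in range(n)]
z3 = [random.gauss(0.0, sigma) for _ in range(n)]
w = [a - b for a, b in zip(z3, z2)]  # error differences, cf. (2)

ell = 1
w_ell = emp_fourier(w, ell)   # unbiased for g_ell^2, cf. (9)
v_ell = cmath.sqrt(w_ell)     # square-root transform (10): v_ell approximates g_ell
# for Gaussian errors, g_ell = exp(-sigma^2 (2*pi*ell)^2 / 2)
g_ell = math.exp(-0.5 * (sigma * 2 * math.pi * ell) ** 2)
```

Up to stochastic error of order n^{−1/2}, the transformed coefficient v_ell recovers g_ell, exactly as the heuristic below (10) in the text suggests.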
Fig. 3. Illustration of the maximum frequency selection process (11) for DIP ν = 0.6 (left), ν = 1 (middle) and ν = 1.3 (right). Top plots depict the frequency selection process where the empirical characteristic function is based on the Z^1 sample; bottom plots, where the empirical characteristic function is based on the W sample, after the square-root transform (10). In the plots, solid lines depict ℓ → log |g_ℓ|, dotted lines correspond to empirical versions and increasing dashed lines to the Fourier domain threshold ℓ → log(ℓ^{1/2} n^{−1/2} log n). From top left clockwise the maximal frequency M and fine resolution level j_1 (17) are: (M, j_1) = (199, 6); (77, 5); (52, 4); (59, 4); (71, 5); (148, 6).
A precise control of the effect of the square-root transform on our estimation method is given in Section 5. We give here a heuristic argument to explain why this transformation does not affect the estimator accuracy. Since g is symmetric, the g_ℓ are positive. Moreover, assumption (Cν) helps to control the behaviour of g_ℓ. For low to medium frequencies ℓ ≤ M, g_ℓ is not too small and the approximation √(1 + x) ≈ 1 + x/2 for small x can be used in (10). Indeed, for large n we have v_ℓ = g_ℓ + small-error. A key step in our method is to find the integer M below which the small-error introduced by the square-root transform does not affect the estimation process. This is done in a data-driven fashion by setting a noise level below which empirical coefficients v_ℓ should not fall:

M = min{ ℓ ≥ 0 : v_ℓ ≤ √ℓ (log n)/√n }.   (11)
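The selection rule (11) is straightforward to implement. A noise-free sketch with an idealised decay g_ℓ = ℓ^{−ν} (the function and the constants are our own illustration, not the waved defaults):

```python
import math

def max_frequency(v, n):
    """Adaptive stopping time (11): the first frequency ell at which the
    empirical coefficient v_ell falls below sqrt(ell) * log(n) / sqrt(n)."""
    for ell in range(1, len(v)):
        if abs(v[ell]) <= math.sqrt(ell) * math.log(n) / math.sqrt(n):
            return ell
    return len(v) - 1

# noise-free sketch: idealised decay g_ell = ell^{-nu} with DIP nu = 1
n = 4096
v = [1.0] + [ell ** -1.0 for ell in range(1, 512)]
M = max_frequency(v, n)
j1_hat = int(math.log2(M)) - 1  # dyadic band containing M, cf. (17)
```

Larger ν makes the sequence v fall below the threshold earlier, so M and hence ĵ_1 decrease, which is the adaptivity to the DIP illustrated in Fig. 3.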
This process is illustrated in the bottom plots of Fig. 3 for various DIP ν. To illustrate the effect of the square-root transform we depict the selection process (11) based on the Z sample (top plots of Fig. 3) and on the W sample (bottom plots of Fig. 3). When using the Z sample the square-root transform is not needed since, in this case, we have ĝ_ℓ = g_ℓ + n^{−1/2} t_ℓ, where the t_ℓ are zero-mean standardised random variables, as in the white noise model of Cavalier and Raimondo (2007). Fig. 3 illustrates two phenomena. First, the frequency selection process (11) is not affected much by the square-root transform; in particular, in both cases (top or bottom plots) the stopping times M given at (11) fall into the same wavelet dyadic frequency band (17). Secondly, in both cases the empirical coefficients are very close to the true characteristic function for all frequencies ℓ ≤ M. On the other hand, there is an effect due to the square-root transform for frequencies ℓ > M (compare top and bottom plots). A key difference with the method of Delaigle et al. (2007) is that our method is adaptive with respect to the DIP parameter ν and does not require cross-validation tuning. We stress that the choice of the fine resolution parameter (17) and of the frequency selection process (11) is data driven. The only assumption is that the error distribution is smooth with an unknown DIP ν. In the super-smooth case the situation is different and beyond the scope of this paper; indeed, even the behaviour of wavelet estimators is not really known in that case.
2.4. The WaveD paradigm

The WaveD method [JKPR] is based on a wavelet expansion of f (notice that here and in the sequel κ denotes the multiple index (j, k)),

f = Σ_{κ∈Λ_n} β_κ Ψ_κ,   (12)

where the Ψ_κ are band-limited wavelets and β_κ = ⟨f, Ψ_κ⟩. Throughout this paper and in software, Φ, Ψ denote the (periodised) Meyer scaling and wavelet function, with the convention that Ψ_{−1} = Φ, see e.g. Mallat (1998). The WaveD paradigm stipulates that one can combine the singular value decomposition (3) with the wavelet expansion (12). Thanks to Plancherel's identity this is done by computing wavelet coefficients in the Fourier domain,

β_κ = Σ_{ℓ∈C_j} (h_ℓ / g_ℓ) Ψ̄_ℓ^κ,   (13)

where Ψ_ℓ^κ = ⟨Ψ_κ, e_ℓ⟩, {e_ℓ} denotes the Fourier basis, Ψ̄_ℓ^κ denotes the complex conjugate and, for the Meyer wavelet,

C_j = {ℓ : Ψ_ℓ^κ ≠ 0} ⊂ (2π/3) · ( [−2^{j+2}, −2^j] ∪ [2^j, 2^{j+2}] ).
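The Fourier-domain computation of wavelet coefficients via Plancherel can be illustrated with a toy example. We use the Haar wavelet as a convenient stand-in for the Meyer wavelet (Haar is not band-limited, but the test density below has only three non-zero Fourier coefficients, so a short frequency sum already recovers the inner product); all names are ours:

```python
import cmath
import math

N = 4096  # discretisation of the circle T = [0, 1)

def fourier_coeff(func, ell):
    """Riemann-sum approximation of <func, e_ell> = int_0^1 func(t) exp(-2 pi i ell t) dt."""
    return sum(func(t / N) * cmath.exp(-2j * math.pi * ell * t / N)
               for t in range(N)) / N

def haar(t):
    """Haar mother wavelet on [0, 1): a simple stand-in for the Meyer wavelet."""
    return 1.0 if t < 0.5 else -1.0

def f(t):
    """Toy periodic density 1 + 0.5 sin(2 pi t)."""
    return 1.0 + 0.5 * math.sin(2 * math.pi * t)

# Plancherel: <f, Psi> equals the Fourier-domain sum of f_ell * conj(Psi_ell);
# here f_ell vanishes for |ell| > 1, so a short sum suffices
beta_fourier = sum(fourier_coeff(f, ell) * fourier_coeff(haar, ell).conjugate()
                   for ell in range(-2, 3))
beta_direct = sum(f(t / N) * haar(t / N) for t in range(N)) / N
```

The two computations agree (their common value is close to 1/π for this test pair), which is the identity underlying (13) in the direct case g_ℓ ≡ 1.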
3. Adaptive multiscale density estimation

3.1. The method

We define our estimator from the (re-scaled) data (1) and (2), using (13) with the empirical Fourier coefficient sequences (4) and (10),

β̂_κ = Σ_{ℓ∈C_j} (y_ℓ / v_ℓ) Ψ̄_ℓ^κ.   (14)

The estimator is based on a non-linear wavelet approximation to the PDF f,

f̂_n = Σ_{κ∈Λ_n} β̂_κ I{|β̂_κ| ≥ λ_j} Ψ_κ,   (15)

where λ_j is a threshold and Λ_n = {(j, k), −1 ≤ j ≤ j_1, 0 ≤ k ≤ 2^j}. The threshold parameter λ_j is used to remove small noisy coefficients while the fine resolution level cut-off j_1 is used to remove high noise perturbations.

3.2. Tuning

The fine tuning parameters λ_j and j_1 are specified in a data-driven fashion.

Fine resolution level. The WaveD asymptotic theory [JKPR] prescribes

2^{j_1} = O( (n / log n)^{1/(1+2ν)} ).
(16)

Condition (16) is hardly of practical interest. To specify j_1 we first find the maximal Fourier frequency M where, according to (11),

M = min{ ℓ ≥ 0 : v_ℓ ≤ √ℓ (log n)/√n },

and then find the wavelet dyadic frequency band which contains M,

ĵ_1 = ⌊log_2(M)⌋ − 1,   (17)

where ⌊x⌋ is the largest integer below x. This process is illustrated in Fig. 3.

The level-dependent threshold λ_j has three input parameters,

λ_j = γ σ̂_j c_n.   (18)

• γ: a constant which depends on the tail of the noise distribution. The range √2 ≤ γ ≤ √6 gives good results in practice. In software, the default value is the conservative choice √6.
• σ̂_j: a level-dependent scaling factor,

σ̂_j = ( |C_j|^{−1} Σ_{ℓ∈C_j} v_ℓ^{−2} )^{1/2} = ( |C_j|^{−1} Σ_{ℓ∈C_j} w_ℓ^{−1} )^{1/2}.   (19)

• c_n: a sample size-dependent scaling factor,

c_n = ( (log n) / n )^{1/2}.
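Putting (18) and (19) together, a minimal sketch of the threshold computation and of the keep-or-kill rule in (15) follows. The toy frequency band and the decay profile are illustrative only:

```python
import math

def threshold(v_band, n, gamma=math.sqrt(6)):
    """Level-dependent threshold (18)-(19): lambda_j = gamma * sigma_j * c_n,
    with sigma_j the root mean of |v_ell|^{-2} over the band C_j and
    c_n = sqrt(log n / n)."""
    sigma_j = math.sqrt(sum(abs(v) ** -2 for v in v_band) / len(v_band))
    c_n = math.sqrt(math.log(n) / n)
    return gamma * sigma_j * c_n

def hard_threshold(beta_hat, lam):
    """Keep-or-kill rule used in the wavelet expansion (15)."""
    return [b if abs(b) >= lam else 0.0 for b in beta_hat]

n = 4096
v_band = [ell ** -1.0 for ell in range(8, 32)]  # toy band, DIP nu = 1
lam = threshold(v_band, n)
```

The faster the v_ℓ decay over the band (larger DIP ν), the larger σ̂_j and hence λ_j, so more coefficients are killed at noisier levels.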
4. Numerical properties

We illustrate the adaptive properties of the WaveD density estimator (15) with the Claw density (Marron & Wand, 1992). This density exhibits a typical non-homogeneous behaviour (see Fig. 1, middle plot). We also use the standard normal density to illustrate the performance of our estimator for a smooth density. For the error distribution, we use the double Gamma distribution with shape parameter ν, as customary for smooth deconvolution problems, see Fig. 1, right plot. We simulated data according to (1) and (2) for various combinations of sample sizes n = 1024, 2048, 4096 and DIP ν = 0.6, 1, 1.3.

4.1. Implementation

In the regression setting the 1D translation invariant algorithm of Donoho and Raimondo (2004) is available as an R package called waved, see Raimondo and Stewart (2006) and Raimondo and Stewart (2007). Extension to the 2D setting is discussed in Donoho and Raimondo (2005). For Matlab users, see the NEWaveD1.1 package from the authors' web pages. In the density setting the WaveD density estimator (15) can be computed by a simple modification of the waved or NEWaveD1.1 code. An examination of the WaveD algorithm (Donoho & Raimondo, 2004, p. 427) shows that it is sufficient to feed the top WaveD function with the empirical Fourier coefficients (4), (10) in order to compute the WaveD estimates (14) and (15). Future releases of the waved and NEWaveD1.1 packages (currently under development by the authors) will include an option for density estimation.

4.2. Density estimation from contaminated data

(i) Claw density. The WaveD estimates (after re-scaling) are depicted in Fig. 5, where they are compared to the original PDF for each of the three scenarios DIP ν = 0.6, ν = 1 and ν = 1.3. As seen in Fig. 3, the tuning of the fine resolution level (17) is adaptive with respect to the DIP ν: larger ν yields a smaller estimate of the fine resolution level j_1, and vice versa. This agrees with the optimal choice (16).
As a result the wavelet expansion (15) is protected against the high noise perturbations due to the Fourier inversion process. Next we note that the level-dependent threshold parameter (18) removes small noise effects and smoothes the WaveD estimates while preserving the five density modes of the Claw density, see Fig. 4. This illustrates the adaptive properties of WaveD with respect to the non-homogeneous behaviour of the density function. Finally, in Fig. 5, we compare the WaveD estimates (after re-scaling) to the original Claw density for the three cases ν = 0.6, ν = 1 and ν = 1.3, all with n = 4096. We see that for low and medium DIP, ν = 0.6 and ν = 1, WaveD is able to recover the five modes quite well, but for ν = 1.3 it becomes harder. This is consistent with the effect of ν on the rate (22).

(ii) Normal density. The WaveD estimates (after re-scaling) are depicted in Fig. 6, where they are compared to the original PDF for each of the three scenarios DIP ν = 0.6, ν = 1 and ν = 1.3. The results depicted in Fig. 6 show that for small ν (= 0.6) both the standard kernel estimate and the WaveD estimate are quite close to the normal density. Note that the standard normal is much smoother than the Claw density and that only low frequencies are used to estimate f in this case. As expected, we see that as ν increases the effect of the measurement errors on the estimation increases.

4.3. Direct density estimation

We conclude this section with some remarks concerning wavelet density estimation in the direct setting, i.e. when the data X are observed without errors. The asymptotic theory for wavelet density estimation is based on the law of large numbers, which states that

(1/n) Σ_{i=1}^{n} Ψ_{j,k}(X_i) →_{n→∞} E(Ψ_{j,k}(X_1)) = ∫ Ψ_{j,k}(u) f(u) du.   (20)

While this device can be used for cases where there exists an analytic formula that defines the wavelet Ψ(x), e.g. Haar with Ψ(x) = 1_{[0,1/2)}(x) − 1_{[1/2,1]}(x), x ∈ [0, 1], the implementation of a wavelet density estimate is, in general, more involved.
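For the Haar case just mentioned, the sample-mean device (20) is immediate to code. A small sketch (all names are ours; a uniform sample is used so that every true coefficient is zero):

```python
import math
import random

def haar(x):
    """Haar mother wavelet on [0, 1)."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def psi_jk(x, j, k):
    """Dilated and translated Haar wavelet 2^{j/2} Psi(2^j x - k)."""
    return (2.0 ** (j / 2.0)) * haar((2 ** j) * x - k)

def beta_hat(sample, j, k):
    """Direct empirical wavelet coefficient: the sample mean in (20)."""
    return sum(psi_jk(x, j, k) for x in sample) / len(sample)

random.seed(3)
x = [random.random() for _ in range(50000)]  # Uniform(0,1): all true betas are 0
b = beta_hat(x, 2, 1)
```

Here b fluctuates around the true value 0 with standard deviation of order n^{−1/2}, illustrating the law of large numbers in (20).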
Fig. 4. Illustrating the effect of the threshold (18) using the data of Fig. 2. Here the DIP ν = 1 and the fine resolution level estimate ĵ_1 = 5, as seen in the middle plot of Fig. 3. Top plots: estimated wavelet coefficients (14), unthresholded (left) and after thresholding (right). Bottom plots: corresponding (periodised) WaveD estimates (15). The de-periodised WaveD Claw density estimate appears in the middle plot of Fig. 5.
Fig. 5. Claw density WaveD estimate computed from Y, W data, n = 4096, as in Fig. 2, with DIP ν = 0.6 (left), ν = 1 (middle) and ν = 1.3 (right). Dashed line depicts WaveD estimates (15) after re-scaling fˆre-scaled = (10 ∗ σˆ )fˆ + 0.5, solid lines depict the original Claw density.
For example, a possible approach is to bin the data in order to use a wavelet regression algorithm. On the other hand we note that, if there are no measurement errors, then v_ℓ = 1, ℓ = 0, ±1, ..., and the WaveD formula (14) provides a natural implementation of a density estimate in the Meyer wavelet basis. Since the Meyer wavelet is very regular it is well suited to density estimation. As seen in Figs. 5 and 6, in the indirect case the WaveD estimate has good adaptiveness properties and can recover a smooth density such as the normal as well as a non-homogeneous function such as the Claw density. We conclude this section with a comparison with classical kernel density estimation (available from R). Depicted in
Fig. 6. N (0, 1) WaveD estimate computed from Y, W data as in Fig. 2, with n = 1024, DIP ν = 0.6 (left), ν = 1 (middle) and ν = 1.3 (right). Dashed lines depict WaveD estimates (15), solid lines depict the standard normal density.
Fig. 7. N(0, 1) density estimates, from left to right n = 1024, n = 2048 and n = 4096. Solid lines depict the standard normal density, dashed lines the WaveD estimate (15), dash-dot lines the kernel estimate.
Fig. 7 are both the kernel density estimate and the WaveD estimate of a standard normal density for various sample sizes. Visually, it is very hard to distinguish between the two estimates; the corresponding squared errors are very close, with a slight advantage to the kernel estimate for larger sample sizes. For the Claw density the WaveD estimator outperforms the kernel estimator for small and large sample sizes.

5. Mathematical properties

We give some smoothness conditions on the probability density f under which near-optimal rates are achievable by our method. Assume that the target density f belongs to some Besov space of periodic functions B^s_{p,r}(T). We recall that Besov spaces are characterised by the behaviour of the wavelet coefficients (as soon as the wavelet is periodic and has enough smoothness and vanishing moments).

Definition 1. For f ∈ L^q(T), q ≥ 1,

f = Σ_{j,k} β_{j,k} Ψ_{j,k} ∈ B^s_{p,r}(T) ⟺ Σ_{j≥0} 2^{j(s+1/2−1/p)r} ( Σ_{0≤k≤2^j} |β_{j,k}|^p )^{r/p} < ∞.   (21)
The index s indicates a certain degree of smoothness of the function.

Theorem 1. Suppose that we observe samples (1) and (2) under assumption (Cν), and that the PDF f belongs to B^s_{p,p}(T) with p ≥ 1, s ≥ 1/p. Then, for γ large enough, the estimator (15) fitted with coefficients (14), threshold (18) and maximum level (17) is such that, as n → ∞,

E‖f̂ − f‖_2^2 = O( (n^{−1} log n)^α ),   (22)
where

α = α_1 := 2s / (2(s + ν) + 1),   if s > (2ν + 1)(1/p − 1/2),   (23)

α = α_2 := 2(s − 1/p + 1/2) / (2(s + ν − 1/p) + 1),   if 1/p ≤ s < (2ν + 1)(1/p − 1/2),   (24)

and

E‖f̂ − f‖_2^2 = O( (n^{−1} log n)^{α_1} log n ),   if s = (2ν + 1)(1/p − 1/2).   (25)
The core part of the proof of Theorem 1 follows that of Cavalier and Raimondo (2007, pp. 2421–2422), derived in the white noise model. Here we establish the key lemmas in the density model (1) and (2). This requires additional care due to the square-root transform (10). Further, in the density setting, the stochastic terms (6), (10) do not follow an exact normal distribution as in the Gaussian white noise setting. For these reasons we derive the key technical lemmas associated with the sequence space representations (8) and (10). Denote

Mc = min{ ℓ ≥ 0 : g_ℓ^2 ≤ ℓ n^{−1} (log n)^{8/3} },   (26)

Md = min{ ℓ ≥ 0 : g_ℓ^2 ≤ ℓ n^{−1} (log n)^{4/3} },   (27)

and

M_1 = [2^{j_1}],   (28)

where j_1 is defined in (16). Define j_d and j_c with

Md = ⌊2^{j_d}⌋ and Mc = ⌊2^{j_c}⌋.   (29)

Under (Cν) we have

Md ≍ ( n^{−1/2} (log n)^{2/3} )^{−2/(2ν+1)},   (30)

Mc ≍ ( n^{−1/2} (log n)^{4/3} )^{−2/(2ν+1)}.   (31)

Let

Ω = c Md exp( −(log n)^{1+τ} ),   (32)
for some constants c, τ > 0. Let M be defined at (11); Lemma 2 proves that, with very large probability, Mc ≤ M ≤ Md. There is a constant c in the definition of Ω in (32); we will use Ω as a O(·) and simply take c to be the sum of all the constants that appear in the proofs. The term Ω is very small as n → ∞.

Lemma 1. Consider the Taylor expansions

1/v_ℓ = 1/g_ℓ − n^{−1/2} s_ℓ / (2 ξ_{ℓ1}^{3/2}),   (33)

1/w_ℓ = 1/g_ℓ^2 − n^{−1/2} s_ℓ / ξ_{ℓ2}^2,   (34)

with g_ℓ^2 ∧ |w_ℓ| ≤ |ξ_{ℓj}| ≤ g_ℓ^2 ∨ |w_ℓ|, j = 1, 2. On the event

B = ∩_{ℓ=1}^{Md} { n^{−1/2} |s_ℓ| ≤ g_ℓ^2 / 2 },

we have

1/|ξ_{ℓ1}|^{3/2} ≤ 3/g_ℓ^3   and   1/|ξ_{ℓ2}|^2 ≤ 8/g_ℓ^4.   (35)

Moreover,

P(B̄) ≤ Ω;   (36)
under B we have

g_ℓ^2 / 2 ≤ |w_ℓ| ≤ 3 g_ℓ^2 / 2,   for all ℓ = 1, ..., Md.   (37)
Proof of Lemma 1. The proof follows from the Taylor expansion of f(w_ℓ) = (w_ℓ)^{−1} around g_ℓ^2. We obtain (34) with a residual ξ_{ℓ2} which lies between g_ℓ^2 and w_ℓ, so that g_ℓ^2 ∧ |w_ℓ| ≤ |ξ_{ℓ2}| ≤ g_ℓ^2 ∨ |w_ℓ|. (33) is derived in a similar fashion, using instead the Taylor expansion of f(v_ℓ) = (v_ℓ)^{−1} = (w_ℓ)^{−1/2} around g_ℓ^2. Under B we have g_ℓ^2 ∧ |w_ℓ| ≥ g_ℓ^2 / 2, and so (35) and (37) follow. From the definition of Md in (27) and using assumption (Cν), we see that g_ℓ^2 > c ℓ n^{−1} (log n)^{4/3} for any ℓ = 1, ..., Md, where 0 < c < 1. Thus,

P(B̄) ≤ Σ_{ℓ=1}^{Md} P( n^{−1/2} |s_ℓ| > (c/2) ℓ n^{−1} (log n)^{4/3} ) ≤ Ω,

the last inequality being derived from Lemma 3 (below).

Lemma 2. Let M be defined at (11) and define the event

M = {Mc ≤ M ≤ Md}.   (38)

Then, as n → ∞,

P(M̄) ≤ Ω, and Mc ≤ Md ≤ M_1.

Proof of Lemma 2. We have P(M̄) = P({M > Md} ∪ {M < Mc}). Due to the similarities between the two events, we only consider the first one,

P(M > Md) ≤ P( |w_{Md}| > Md n^{−1} log^2 n ).

Under B, using (37) and (Cν), as n → ∞ we have

|w_{Md}| ≤ (3/2) g_{Md}^2 ≤ C |Md|^{−2ν},

which, by (30), cannot be smaller than Md n^{−1} log^2 n. Thus, P(M > Md) ≤ P(B̄) ≤ Ω.

Lemma 3. By use of Bernstein's inequality we obtain

P( n^{−1/2} |s_ℓ| > t ) ≤ 2 exp( − n t^2 / (2 + 2t/3) ).   (39)

Proof of Lemma 3. The r.v. e^{2πiℓW_k} are i.i.d., bounded by 1 and with variance Var(e^{2πiℓW_k}) ≤ 1; (39) follows from Bernstein's inequality (Petrov, 1975).

Lemma 4 (Variance Term with Known Eigenvalues g_ℓ). Let
β̃_κ := Σ_{ℓ∈C_j} e^{2πiℓY_k} (1/g_ℓ) Ψ̄_ℓ^κ.

Then, for all κ,

Var(β̃_κ) = O(τ_j^2).   (40)
Proof of Lemma 4. For simplicity we give the proof with C_j = [2^j, 2^{j+2}], noting that the same proof holds for C_j = [−2^{j+2}, −2^j]. In what follows c denotes a positive constant which does not depend on n or j. By independence of the Y_k's,

Var(β̃_κ) = Var( Σ_{ℓ∈C_j} (1/g_ℓ) e^{2πiℓY_k} Ψ̄_ℓ^κ ) ≤ E | Σ_{ℓ∈C_j} (1/g_ℓ) e^{2πiℓY_k} Ψ̄_ℓ^κ |^2;   (41)

expanding the square, using |Ψ_ℓ^κ| ≤ 2^{−j/2} and recalling the notation h_ℓ = E e^{2πiℓY_1},

E | Σ_{ℓ∈C_j} (1/g_ℓ) e^{2πiℓY_k} Ψ̄_ℓ^κ |^2 = 2^{−j} Σ_{ℓ∈C_j} 1/g_ℓ^2 + 2^{−j} Σ_{ℓ≠ℓ′∈C_j} (1/(g_ℓ g_{ℓ′})) E e^{2πi(ℓ−ℓ′)Y_1};

using (Cν) and |h_ℓ| ≤ 1,

E | Σ_{ℓ∈C_j} (1/g_ℓ) e^{2πiℓY_k} Ψ̄_ℓ^κ |^2 ≤ 2^{−j} ( Σ_{ℓ∈C_j} ℓ^{2ν} + Σ_{ℓ≠ℓ′∈C_j} ℓ^ν (ℓ′)^ν |h_{ℓ−ℓ′}| ) ≤ c 2^{2jν} + c′ 2^{2jν} 2^{−j} Σ_{ℓ≠ℓ′∈C_j} |h_{ℓ−ℓ′}|.   (42)

Recalling that h_ℓ = f_ℓ g_ℓ and using (Cν), |h_ℓ| ≤ |ℓ|^{−ν} |f_ℓ|, so that

Σ_{ℓ≠ℓ′∈C_j} |h_{ℓ−ℓ′}| ≤ Σ_{ℓ≠ℓ′∈C_j} (ℓ − ℓ′)^{−ν} |f_{ℓ−ℓ′}| ≤ Σ_{ℓ≥0} |ℓ|^{−ν} |f_ℓ| ≤ c,

since f ∈ B^s_{p,p}(T). Combining this with (41) and (42) we have

Var(β̃_κ) ≤ c 2^{2jν} = c τ_j^2.   (43)
Lemma 5. Let c_n = n^{−1/2} (log n)^{1/2} and τ_j = ( |C_j|^{−1} Σ_{ℓ∈C_j} |g_ℓ|^{−2} )^{1/2}. Then, for any γ > 0 and j ≤ j_d, as n → ∞,

P( |β̂_κ − β_κ| > γ τ_j c_n / 2 ) ≤ C n^{−γ^2/c}.   (44)
Proof of Lemma 5. We have
$$B_\kappa := \hat{\beta}_\kappa - E_v \hat{\beta}_\kappa = \sum_{\ell} \frac{(y_\ell - h_\ell)}{v_\ell}\, \bar{\Psi}_\ell^\kappa = \sum_{\ell} \frac{1}{n} \sum_{k=1}^{n} \frac{\big(e^{2\pi i \ell Y_k} - h_\ell\big)}{v_\ell}\, \bar{\Psi}_\ell^\kappa.$$
Inverting the order of summation,
$$B_\kappa = \frac{1}{n} \sum_{k=1}^{n} \sum_{\ell} \frac{\big(e^{2\pi i \ell Y_k} - h_\ell\big)}{v_\ell}\, \bar{\Psi}_\ell^\kappa = \frac{1}{n} \sum_{k=1}^{n} T_k. \tag{45}$$
Conditionally on $\hat{g} = (v_\ell)$, the $T_k$, $k = 1, 2, \ldots, n$, are independent random variables with zero mean and, on the event $\mathcal{B}$,
$$\frac{T_k}{2^{j/2} \tau_j} = O(1).$$
To prove the latter inequality, we use (45), $|\Psi_\ell^\kappa| \le 2^{-j/2}$ and Lemma 1: on the event $\mathcal{B}$, by the Cauchy–Schwarz inequality,
$$|T_k|^2 = |C_j|^2 \Big|\, |C_j|^{-1} \sum_{\ell} \frac{\big(e^{2\pi i \ell Y_k} - h_\ell\big)}{v_\ell}\, \bar{\Psi}_\ell^\kappa \Big|^2 \le |C_j| \sum_{\ell} \frac{(1 + |h_\ell|)^2}{|v_\ell|^2}\, |\Psi_\ell^\kappa|^2 \le \big(c\, 2^{j/2} \tau_j\big)^2.$$
Conditionally on $\hat{g}$ and on the event $\mathcal{B}$, using Lemma 1,
$$P\big(|\hat{\beta}_\kappa - \beta_\kappa| > \lambda\big) \le E\Big[P_{\hat{g}}\big(|B_\kappa| > c_n \tau_j (\gamma/2 + o(1))\big)\, I_{\mathcal{B}}\Big] + \Omega.$$
On the event $\mathcal{B}$, using Bernstein's inequality as in Lemmas 3 and 1,
$$P_{\hat{g}}\Big(|B_\kappa| > c_n \tau_j \frac{\gamma}{2}\Big) \le \exp\Big(-\frac{\gamma^2\, n \tau_j^2 c_n^2}{c\, \tau_j^2 + 2^{j/2} \tau_j^2 c_n}\Big) \le \exp\Big(-\frac{\gamma^2 \log n}{c + 2^{j/2} c_n}\Big).$$
Since $j \le j_d$ we have $2^{j/2} c_n = o(1)$. Thus, we obtain the lemma.

Proof of Theorem 1. Using Lemmas 1–5 and the proof of Theorem 1 in Cavalier and Raimondo (2007, pp. 2421–2422), we obtain the theorem.

Acknowledgment

We would like to thank the editor and the referee for their interesting comments and remarks.

References

Abramovich, F., & Silverman, B. (1998). Wavelet decomposition approaches to statistical inverse problems. Biometrika, 85(1), 115–129.
Andersen, C., Bro, R., & Brockoff, P. (2003). Quantifying and handling errors in instrumental measurements using measurement error theory. Journal of Chemometrics, 17, 621–629.
Arnold, B., & SenGupta, A. (2006). Recent advances in the analyses of directional data in ecological and environmental sciences. Environmental and Ecological Statistics, 13, 253–256.
Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., & Sudman, S. (2004). Measurement errors in surveys. John Wiley and Sons.
Butucea, C. (2004). Deconvolution of supersmooth densities with smooth noise. The Canadian Journal of Statistics, 32(2), 181–192.
Butucea, C., & Matias, C. (2005). Minimax estimation of the noise level and of the deconvolution density in a semiparametric convolution model. Bernoulli, 11(2), 309–340.
Cavalier, L., Golubev, G., Picard, D., & Tsybakov, A. (2002). Oracle inequalities for inverse problems. The Annals of Statistics, 30, 843–874.
Cavalier, L., & Hengartner, N. (2005). Adaptive estimation for inverse problems with noisy operators. Inverse Problems, 21, 1345–1361.
Cavalier, L., & Raimondo, M. (2007). Wavelet deconvolution with noisy eigenvalues. IEEE Transactions on Signal Processing, 55(6), 2414–2424.
Cavalier, L., & Tsybakov, A. (2002). Sharp adaptation for inverse problems with random noise. Probability Theory and Related Fields, 123(3), 323–354.
Comte, F., Rozenholc, Y., & Taupin, M.-L. (2006). Penalized contrast estimator for adaptive density deconvolution. The Canadian Journal of Statistics, 34, 431–452.
Delaigle, A., Hall, P., & Meister, A. (2007). On deconvolution with repeated measurements. The Annals of Statistics, 36, 665–685.
Diggle, P. J., & Hall, P. (1993). A Fourier approach to nonparametric deconvolution of a density estimate. Journal of the Royal Statistical Society, Series B, 55(2), 523–531.
Donoho, D. (1995). Nonlinear solution of linear inverse problems by wavelet–vaguelette decomposition. Applied and Computational Harmonic Analysis, 2, 101–126.
Donoho, D., & Raimondo, M. (2004). Translation invariant deconvolution in a periodic setting. The International Journal of Wavelets, Multiresolution and Information Processing, 14(1), 415–423.
Donoho, D., & Raimondo, M. (2005). A fast wavelet algorithm for image deblurring. Australian and New Zealand Industrial and Applied Mathematics Journal, 46, C29–C46. URL: http://anziamj.austms.org.au/V46/CTAC2004/Dono.
Dunn, G. (2004). Statistical evaluation of measurement errors: Design and analysis of reliability studies (2nd ed.). London: Arnold.
Efromovich, S. (1997). Density estimation for the case of supersmooth measurement error. Journal of the American Statistical Association, 92(438), 526–535.
Efromovich, S. (1999). Nonparametric curve estimation. Springer Series in Statistics. New York: Springer-Verlag.
Efromovich, S., & Koltchinskii, V. (2001). On inverse problems with unknown operators. IEEE Transactions on Information Theory, 47(7), 2876–2894.
Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. The Annals of Statistics, 19(3), 1257–1272.
Fan, J., & Koo, J. (2002). Wavelet deconvolution. IEEE Transactions on Information Theory, 48(3), 734–747.
Fisher, N. (1993). Statistical analysis of circular data. Cambridge: Cambridge University Press.
Goldenshluger, A. (2002). Density deconvolution in the circular structural model. Journal of Multivariate Analysis, 81(2), 360–375.
Hoffmann, M., & Reiss, M. (2008). Nonlinear estimation for linear inverse problems with error in the operator. The Annals of Statistics, 36, 310–336.
Horowitz, J. L., & Markatou, M. (1996). Semiparametric estimation of regression models for panel data. Review of Economic Studies, 63(1), 145–168.
Jaech, J. (1985). Statistical analysis of measurement errors. John Wiley and Sons.
Johnstone, I., Kerkyacharian, G., Picard, D., & Raimondo, M. (2004). Wavelet deconvolution in a periodic setting. Journal of the Royal Statistical Society, Series B, 66(3), 547–573, with discussion pp. 627–652.
Johnstone, I. M. (1999). Wavelet shrinkage for correlated data and inverse problems: Adaptivity results. Statistica Sinica, 9(1), 51–83.
Kay, S. (1998). Model-based probability density function estimation. IEEE Signal Processing Letters, 5(12), 318–320.
Li, T., & Vuong, Q. (1998). Nonparametric estimation of the measurement error model using multiple indicators. Journal of Multivariate Analysis, 65(2), 139–165.
Ma, Y., & Carroll, R. J. (2006). Locally efficient estimators for semiparametric models with measurement error. Journal of the American Statistical Association, 101(476), 1465–1474.
Mallat, S. (1998). A wavelet tour of signal processing (2nd ed.). San Diego, CA: Academic Press.
Marron, J. S., & Wand, M. P. (1992). Exact mean integrated squared error. The Annals of Statistics, 20(2), 712–736.
Marteau, C. (2006). Regularisation in inverse problems with unknown operators. Mathematical Methods of Statistics, 15, 415–443.
Neumann, M. (1997). On the effect of estimating the error density in nonparametric deconvolution. Journal of Nonparametric Statistics, 7, 307–330.
Neumann, M. H. (2007). Deconvolution from panel data with unknown error distribution. Journal of Multivariate Analysis, 98(10), 1955–1968.
Oman, S., Meir, N., & Halm, N. (1999). Comparing two measures of creatinine clearance: An application of errors-in-variables and bootstrap techniques. Journal of the Royal Statistical Society, Series C, 48(14), 39–52.
Pensky, M., & Vidakovic, B. (1999). Adaptive wavelet estimator for nonparametric density deconvolution. The Annals of Statistics, 27, 2033–2053.
Petrov, V. (1975). Sums of independent random variables. Springer-Verlag.
Raimondo, M., & Stewart, M. (2006). waved: Software to perform wavelet deconvolution. R contributed package. URL: http://CRAN.R-project.org/.
Raimondo, M., & Stewart, M. (2007). The WaveD transform in R. Journal of Statistical Software, 21(3), 1–17. URL: http://www.jstatsoft.org/.
Walter, G., & Shen, X. (1999). Deconvolution using the Meyer wavelet. Journal of Integral Equations and Applications, 11, 515–534.
Wang, Y. (1997). Minimax estimation via wavelets for indirect long-memory data. Journal of Statistical Planning and Inference, 64(1), 45–55.
Willer, T. (2006). Estimation non paramétrique et problèmes inverses [Nonparametric estimation and inverse problems]. Ph.D. thesis, University of Paris 7.