Shrinkage estimator in normal mean vector estimation based on conditional maximum likelihood estimators


Junyong Park
Department of Mathematics and Statistics, University of Maryland Baltimore County, MD 21250, United States


Article history: Received 4 October 2013; Received in revised form 1 June 2014; Accepted 1 June 2014; Available online 12 June 2014.

Keywords: Shrinkage; Sparsity; Empirical Bayes; Conditional maximum likelihood estimate; Mean vector; Stein's unbiased risk estimate

Abstract. Estimation of a normal mean vector has broad applications, such as small area estimation, estimation of nonparametric functions, and estimation of wavelet coefficients. In this paper, we propose a new shrinkage estimator based on the conditional maximum likelihood estimator combined with Stein's unbiased risk estimate (SURE) under normality. We present some theoretical results and provide numerical studies comparing the proposal with some existing methods. © 2014 Elsevier B.V. All rights reserved.

1. Introduction

Let z = (z_1, ..., z_n) be a vector of observations generated independently from z_i = θ_i + ε_i, i = 1, 2, ..., n, where ε_i ∼ N(0, σ²). Our interest is in obtaining an estimate of θ = (θ_1, ..., θ_n), say θ̂ = (θ̂_1, ..., θ̂_n), with small L² risk R(θ̂) = Σ_{i=1}^{n} E(θ̂_i − θ_i)². In particular, when the solution is sparse in that the number of nonzero θ_i's is small compared to n, a variety of shrinkage estimators have been developed; see, for example, Donoho and Johnstone (1994), Donoho and Johnstone (1995), and Johnstone and Silverman (2004). Sparsity occurs commonly in practice, for example in the estimation of wavelet coefficients, also known as wavelet denoising. Under sparsity, these works discussed different types of shrinkage, of which hard and soft thresholding are the most typical examples when only a small proportion of the θ_i are nonzero. Both have drawbacks in practice: hard thresholding is discontinuous, which causes large variability, while soft thresholding has large bias. Johnstone and Silverman (2004) proposed an empirical Bayes (EB) approach with prior distributions on θ that take the form of a mixture of a point mass at 0 and a nonzero part.

In this paper, we propose an adaptive and model-based shrinkage estimator whose form reflects normality. It is constructed from the conditional maximum likelihood estimate (CMLE), conditioning on observations greater or smaller than some cut-off value. The proposed CMLE is expected to overcome the drawbacks of hard and soft thresholding, namely the discontinuity of hard thresholding and the large bias of soft thresholding. We incorporate Stein's unbiased risk estimate (SURE) to determine the parameter in the proposed shrinkage estimator, as is done for the soft shrinkage estimator with SURE; however, it will be seen that the proposed CMLE with SURE performs better than soft shrinkage with SURE. The proposed CMLE is a fully data-dependent procedure, whereas the EB approach of Johnstone and Silverman (2004) requires the specification of a prior distribution. We also provide an optimality property of the proposed CMLE, as Donoho and Johnstone (1994) did for hard and soft thresholding, and present numerical studies showing that the proposed CMLE performs at least as well as, and often better than, the EB approach as well as hard and soft thresholding.


Fig. 1. The left panel in the first row shows θ̂^CMLE conditioning on Z_j > 1 or Z_j < −1, and the right panel shows θ̂ obtained by modifying θ̂^CMLE.

This paper is organized as follows. In Section 2, we introduce the proposed CMLE and present a property that reduces computation. Section 3 discusses the use of Stein's unbiased risk estimate to determine the shrinkage parameter. In Section 4, numerical studies compare the CMLE with some other methods. Concluding remarks are given in Section 5.

2. Conditional maximum likelihood estimate

In this section, we propose an estimator of θ based on the conditional MLE. The conditional densities of Z_j given Z_j > C and given Z_j < −C are

$$\frac{\frac{1}{\sigma}\,\phi\!\left(\frac{z_j - \theta_j}{\sigma}\right)}{1 - \Phi\!\left(\frac{C - \theta_j}{\sigma}\right)}\, I(z_j > C), \qquad \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{z_j - \theta_j}{\sigma}\right)}{\Phi\!\left(\frac{-C - \theta_j}{\sigma}\right)}\, I(z_j < -C). \qquad (1)$$

We consider the CMLE conditioning on either Z_j > C or Z_j < −C, namely θ̂_j^CMLE, defined as

$$\hat\theta_j^{CMLE} = \begin{cases} \operatorname*{argmax}_{\theta_j}\; \dfrac{\frac{1}{\sigma}\,\phi\big((z_j - \theta_j)/\sigma\big)}{1 - \Phi\big((C - \theta_j)/\sigma\big)} & \text{if } z_j > C, \\[2mm] 0 & \text{if } |z_j| \le C, \\[2mm] \operatorname*{argmax}_{\theta_j}\; \dfrac{\frac{1}{\sigma}\,\phi\big((z_j - \theta_j)/\sigma\big)}{\Phi\big((-C - \theta_j)/\sigma\big)} & \text{if } z_j < -C. \end{cases}$$

A direct numerical illustration of this definition is sketched below.
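The following minimal Python sketch (ours, not the paper's code; the bounded search interval is an illustrative choice) computes the unmodified CMLE by directly maximizing the conditional log-density in (1).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def cmle_raw(z, C, sigma=1.0):
    """Unmodified CMLE: maximize the conditional log-density of Eq. (1) over theta."""
    if abs(z) <= C:
        return 0.0
    if z > C:
        # log of (1/sigma) * phi((z - theta)/sigma) / (1 - Phi((C - theta)/sigma)), up to constants
        nll = lambda th: -(norm.logpdf((z - th) / sigma) - np.log(norm.sf((C - th) / sigma)))
    else:
        nll = lambda th: -(norm.logpdf((z - th) / sigma) - np.log(norm.cdf((-C - th) / sigma)))
    # Bounded search; the interval is a crude illustrative choice (for z very close to +-C
    # the maximizer can fall outside it).
    res = minimize_scalar(nll, bounds=(z - 50 * sigma, z + 50 * sigma), method="bounded")
    return res.x

# For example, cmle_raw(1.05, C=1.0) is negative even though z > 0,
# which motivates the modification discussed next.
```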

However, as shown in the left panels of Fig. 1 for C = 1, this θ̂^CMLE can take negative (positive) values even when Z_j > 0 (Z_j < 0). To avoid this and to obtain continuity, we modify θ̂^CMLE further and define our proposed estimator θ̂ as

$$\hat\theta_j \equiv \hat\theta(Z_j) = \begin{cases} \max(\hat\theta_j^{CMLE}, 0) & \text{if } Z_j > 0, \\ \min(\hat\theta_j^{CMLE}, 0) & \text{otherwise.} \end{cases}$$

As shown in the two right panels of Fig. 1, the estimator θ̂ obtained from θ̂^CMLE is smooth (in the sense of being weakly differentiable). The θ̂_j's give a sparse solution in that the estimate is 0 for relatively small observations Z_j. Furthermore, as mentioned, the estimator is continuous, so the idea of SURE can be applied to determine the parameter C; we describe the estimation of C in the next section. By symmetry, we only need to consider the case Z_j > C, since the other case follows analogously. Given an observation Z_j > C and σ, θ̂_j^CMLE is the solution of

$$Z_j = \theta_j + \sigma \cdot h\!\left(\frac{C - \theta_j}{\sigma}\right) \qquad (2)$$


where h(a) = φ(a)/(1 − Φ(a)), and φ and Φ denote the density and the cumulative distribution function of the standard normal distribution, respectively. Eq. (2) must be solved numerically; with its solution we obtain θ̂_j = max(θ̂_j^CMLE, 0). Similarly, for Z_j < −C, θ̂_j^CMLE is the solution of

$$Z_j = \theta_j - \sigma \cdot h\!\left(\frac{C + \theta_j}{\sigma}\right) \qquad (3)$$

and we obtain θ̂_j = min(θ̂_j^CMLE, 0). A numerical sketch of this computation is given below.
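Eq. (2) is the first-order condition of the maximization in the earlier sketch, and its right-hand side is increasing in θ_j, so the estimator can be computed by one-dimensional root finding. The following is a minimal sketch (ours; the bracketing argument and function names are our own choices, not the paper's code).

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def hazard(a):
    # h(a) = phi(a) / (1 - Phi(a)); norm.sf(a) is the survival function 1 - Phi(a)
    return norm.pdf(a) / norm.sf(a)

def theta_hat_single(z, C, sigma=1.0):
    """Proposed estimator theta_hat(z): solve Eq. (2)/(3) and truncate at 0."""
    if abs(z) <= C:
        return 0.0
    a = abs(z)
    # By symmetry it suffices to solve Eq. (2) for a > C:
    #   a = theta + sigma * h((C - theta)/sigma),  right-hand side increasing in theta.
    if a <= sigma * hazard(C / sigma):
        # The root is non-positive, so the truncated estimate is 0.  (In our reading,
        # sigma * h(C/sigma) is exactly the threshold lambda discussed in Section 3.)
        theta = 0.0
    else:
        g = lambda t: t + sigma * hazard((C - t) / sigma) - a
        theta = brentq(g, 0.0, a)   # g(0) < 0 and g(a) > 0, so the root is bracketed
    return np.sign(z) * theta

theta_hat = np.vectorize(theta_hat_single)   # elementwise over a vector of observations
```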

Remark. When σ² is unknown, one may estimate it as follows. Under sparsity, if z_i is small in absolute value, it is more likely to have been generated from N(0, σ²). More specifically, for (z_1, ..., z_n), define (z_(1), ..., z_(n)) such that |z_(1)| < |z_(2)| < ⋯ < |z_(n)|. Then, under sparsity, z_(1), ..., z_(m) for some m ≪ n are most likely to have been generated from N(0, σ²), so we may use the estimate

$$\hat\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} z_{(i)}^2 \quad \text{or} \quad \frac{1}{m}\sum_{i=1}^{m}\big(z_{(i)} - \bar z\big)^2, \qquad \text{where } \bar z = \frac{1}{m}\sum_{i=1}^{m} z_{(i)}.$$

3. Choosing the parameter C

In this section we consider the problem of choosing the regularization parameter C. For every C there are corresponding conditional maximum likelihood estimators θ̂_j ≡ θ̂(Z_j, C, σ), j = 1, ..., n, associated with a given data set Z_1, ..., Z_n and σ. We use Stein's unbiased risk estimate (SURE), which was considered for soft shrinkage in Donoho and Johnstone (1995). For notational convenience, we suppress C and σ and write θ̂(Z_j, C, σ) = θ̂(Z_j). We use the following Stein's identity: when Z ∼ N(θ, σ²),

$$E\big(\delta(Z)\,\theta\big) = E\big(U(Z)\big) \qquad (4)$$

where U(Z) ≡ Zδ(Z) − σ²δ′(Z) and δ′ is the derivative of δ. Without loss of generality, we assume σ = 1, since an estimate of σ can be used when it is unknown. The optimal C is chosen by

$$\operatorname*{argmin}_{C>0}\; \sum_{j=1}^{n} \big(\hat\theta(Z_j) - \theta_j\big)^2 \equiv \operatorname*{argmin}_{C>0}\left[\sum_{j=1}^{n} \big(\hat\theta(Z_j)^2 - 2\hat\theta(Z_j)\theta_j\big) + \sum_{j=1}^{n}\theta_j^2\right] = \operatorname*{argmin}_{C>0}\; \sum_{j=1}^{n} \big(\hat\theta(Z_j)^2 - 2\hat\theta(Z_j)\theta_j\big) \qquad (5)$$

$$\equiv \operatorname*{argmin}_{C>0}\; \Delta(C) \qquad (6)$$

where the last equality holds because Σ_j θ_j² does not depend on C. However, (5) cannot be computed directly since it involves the unknown parameters θ_j. Instead of θ̂(Z_j)θ_j, we use the unbiased estimator Z_j θ̂(Z_j) − θ̂′(Z_j) from Stein's identity (4), which leads to the approximation of (5)

$$ (5) \approx \operatorname*{argmin}_{C>0}\; \sum_{j=1}^{n} \big(\hat\theta(Z_j)^2 - 2 Z_j \hat\theta(Z_j) + 2\hat\theta'(Z_j)\big) \equiv \operatorname*{argmin}_{C>0}\; \hat\Delta(C) \qquad (7)$$

ˆ (C ) defined in (6) and (7) is illustrated in where θˆ ′ (Z ) is a derivative of θˆ (Z ). The relationship between ∆(C ) and ∆ supplementary material (see Appendix A). We select ˆ (C ) Cˆ = argmax ∆

(8)

C
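A minimal sketch of the selection (ours, reusing theta_hat from the Section 2 sketch): Δ̂(C) is evaluated on a grid of C values and θ̂′(Z_j) is approximated by central differences; an analytic derivative, or a different search over C, could of course be used instead. The grid's upper end, the universal threshold √(2 log n), is an illustrative choice.

```python
import numpy as np   # theta_hat is the vectorized estimator from the Section 2 sketch

def sure_criterion(z, C, eps=1e-4):
    """Delta_hat(C) of Eq. (7), with sigma = 1 as assumed in the paper."""
    t = theta_hat(z, C, 1.0)
    # numerical derivative of theta_hat with respect to z (the estimator is weakly differentiable)
    t_prime = (theta_hat(z + eps, C, 1.0) - theta_hat(z - eps, C, 1.0)) / (2.0 * eps)
    return np.sum(t ** 2 - 2.0 * z * t + 2.0 * t_prime)

def choose_C(z, grid=None):
    """C_hat = argmin_C Delta_hat(C) over a grid, as in Eq. (8)."""
    if grid is None:
        grid = np.linspace(0.05, np.sqrt(2.0 * np.log(len(z))), 60)
    values = [sure_criterion(z, C) for C in grid]
    return float(grid[int(np.argmin(values))])
```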

We now present a theoretical property of the proposed estimator, namely an analogue of the optimality shown for hard and soft thresholding by Donoho and Johnstone (1994). Before the main result, we need the following lemma; our main result is then stated as a theorem that follows directly from it. Define R_λ^C(θ) = E(θ̂ − θ)² for the CMLE θ̂. Here, we introduce λ, depending on C, such that θ̂(Z_j) = 0 for C < |Z_j| < λ; in other words, θ̂_j = θ̂(Z_j) I(|Z_j| > λ). Donoho and Johnstone (1994) show that the ideal or oracle risk is R_oracle = min(θ², 1), and that hard and soft thresholding attain nearly the oracle risk up to a factor of log n. Similarly, we show that the proposed CMLE achieves this nearly-oracle risk.

Lemma 1. For all θ and λ ≥ 1, the risk of the CMLE satisfies

$$R_\lambda^C(\theta) \le (\lambda^2 + 3)\left(\frac{2\phi(\lambda)}{\lambda} + \min(\theta^2, 1)\right).$$


Table 1. Average total squared error of estimation for various methods on a mixed signal of length 1000. Except for the CMLE results, all values are cited from Johnstone and Silverman (2004); see that paper for details on the choice of priors.

                     # nonzero = 5            # nonzero = 50              # nonzero = 500
Value of nonzero     3    4    5    7         3    4    5    7           3     4     5     7
CMLE                42   26   13    5       230  154   93   53         891   737   610   511
Exponential         36   32   17    8       214  156  101   73         857   873   783   658
Cauchy              37   36   18    8       271  176  103   77         922   898   829   743
Postmean            34   32   21   11       201  169  122   85         860   888   826   708
Exphard             51   43   22   11       273  189  130   91         998   998   983   817
Laplace a=1         36   32   19   15       213  166  142  135         994  1099  1126  1130
Laplace a=0.5       37   34   17   10       244  158  105   92         845   878   884   884
Laplace a=0.2       38   37   18    7       299  188   95   69        1061   730   665   656
Laplace a=0.1       38   37   18    6       339  227  102   60        1496   798   609   579
SURE                38   42   42   43       202  209  210  210         829   835   835   835
Adapt               42   63   73   76       417  620  210  210         829   835   835   835
FDR q=0.01          43   51   26    5       392  299  125   55        2568  1332   656   524
FDR q=0.1           40   35   19   13       280  175  113  102        1149   744   651   644
FDR q=0.4           58   58   53   52       298  265  256  254         919   866   860   860
BlockThresh         46   72   72   31       444  635  600  293        1918  1276  1065   983
NeighBlock          47   64   51   26       427  543  439  227        1870  1384  1148   972
NeighCoeff          55   51   38   32       375  343  219  156        1890  1410  1032   870
Universal soft      42   63   73   76       417  620  720  746        4156  6168  7157  7413
Universal hard      39   37   18    7       370  340  163   52        3672  3355  1578   505

Proof. See supplementary material (see Appendix A).


We have the following theorem, which is directly derived from Lemma 1.

Theorem 1. The CMLE has risk

$$R_\lambda^C(\theta) \le (\lambda^2 + 3)\left(R_{oracle} + \frac{2\phi(\lambda)}{\lambda}\right).$$

For λ = √(α log n),

$$R_\lambda^C(\theta) \le K \log n \left(R_{oracle} + O\!\left(\frac{1}{\sqrt{\log n}}\right)\right)$$

for some constant K > 0.
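To spell out how the second display follows from Lemma 1 (a brief derivation of ours, not written out in the paper): with λ = √(α log n),

$$\lambda^2 + 3 = \alpha \log n + 3 \le K \log n \quad (n \ge 2), \qquad \frac{2\phi(\lambda)}{\lambda} = \frac{2\, n^{-\alpha/2}}{\sqrt{2\pi\,\alpha \log n}} = O\!\left(\frac{1}{\sqrt{\log n}}\right),$$

so applying Lemma 1 with R_oracle = min(θ², 1) gives R_λ^C(θ) ≤ K log n (R_oracle + O(1/√(log n))).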

Remark. As pointed out by a reviewer, we add two comments on Theorem 1. (i) The result is for a deterministic sequence of λ rather than for λ (or C) chosen by SURE in (8); it does not seem trivial to obtain a bound on the risk for the C chosen by (8), since neither a closed form nor an asymptotic property of this C is known yet. (ii) Theorem 1 presents the result for a single θ. For multiple θ_1, ..., θ_n, we obtain

$$\frac{1}{n}\sum_{i=1}^{n} E(\hat\theta_i - \theta_i)^2 \le (\lambda^2 + 3)\left(\frac{2\phi(\lambda)}{\lambda} + \min(\bar\theta^2, 1)\right), \qquad \text{where } \bar\theta^2 = \frac{1}{n}\sum_{i=1}^{n}\theta_i^2.$$

4. Numerical studies

In this section, we provide numerical studies comparing the proposed CMLE with some other estimators. The first simulations take n = 1000 and various combinations of the number of nonzero θ_i's and the value of the nonzero θ_i's. Table 1 uses the simulation setting of Johnstone and Silverman (2004) and reports the average total squared error of estimation for various methods. As displayed in Table 1, the proposed CMLE obtains promising performance in most situations, whereas some methods, for example Universal hard and soft, FDR and BlockThresh, are unstable across situations.

In our second simulation, instead of a single value for the nonzero θ_i as in Table 1, we consider more flexible nonzero values. We use the EbayesThresh package in R to implement the EB approaches; see Johnstone and Silverman (2005) for details on the use of this software. We draw the nonzero θ_i from uniform distributions, for different proportions of nonzero components ε = 0.05, 0.1, 0.2 and 0.5 (that is, k = 50, 100, 200 and 500). Table 2 shows the results for the proposed CMLE and the empirical Bayes methods with the different types of priors proposed in Johnstone and Silverman (2004). When θ_i ∼ U(0, 4), the postmedian and postmean become better than the CMLE for large values of k; however, when θ_i ∼ U(4, 7), the CMLE achieves the best performance for all k except for empirical Bayes with Laplace a = 0.5 and a = 0.2 at k = 50. When θ_i ∼ U(7, 10), hard shrinkage obtains the best performance, with EB with Laplace a = 0.1 and the CMLE second best. Combining all of these ranges, θ_i ∼ U(0, 10), the CMLE has the best performance and the postmedian is second best. From Table 2, we see that the CMLE and the postmedian have overall competitive performance, while EB with the Laplace prior seems somewhat sensitive to the choice of prior. Additional simulation studies are provided in Section 3 of the supplementary material (see Appendix A). A sketch of the simulation setup is given after Table 2.


Table 2. Average total squared error when k of the n = 1000 means µ_i are nonzero.

µ_i ∼ U(0, 4):
                 k = 50     100      200       500
CMLE              167.6    286.0    482.1     836.9
Cauchy            191.3    359.3    777.0    2652.5
Laplace a=1       164.9    303.8    614.7    2267.0
Laplace a=0.5     176.6    331.6    701.3    2539.8
Laplace a=0.2     200.7    382.0    847.0    2712.9
Laplace a=0.1     213.2    415.6    945.3    2731.3
Postmean          251.0    301.6    413.2     718.1
Postmedian        193.3    263.9    410.2     820.2
Hard (Univ)       207.3    396.5    854.8    2040.4
Soft (SURE)       930.5    936.0    938.9     960.3
Soft (Univ)       251.1    461.5   1019.5    2453.2

µ_i ∼ U(4, 7):
                 k = 50     100      200       500
CMLE              102.3    179.6    304.9     624.5
Cauchy            101.8    202.8    439.2   15259.3
Laplace a=1       136.3    273.4    611.1   15255.2
Laplace a=0.5     100.1    190.2    385.3   15259.3
Laplace a=0.2      96.6    191.1    420.4   15259.3
Laplace a=0.1     106.8    224.7    544.2   15259.3
Postmean          254.3    309.9    421.7     753.5
Postmedian        182.4    240.8    359.3     710.5
Hard (Univ)       126.5    276.7    549.5    1369.2
Soft (SURE)       933.3    936.5    949.8     970.4
Soft (Univ)       717.5   1427.3   2852.7    7135.3

µ_i ∼ U(7, 10):
                 k = 50     100      200       500
CMLE               68.6    117.8    217.6     517.5
Cauchy             69.1    130.0    265.2   36872.1
Laplace a=1       126.5    258.2    611.4   36682.3
Laplace a=0.5      82.0    152.5    305.5   36866.1
Laplace a=0.2      63.1    115.6    220.2   36872.2
Laplace a=0.1      57.1    106.7    206.5   36872.2
Postmean          251.1    303.0    409.4     723.4
Postmedian        179.6    235.2    348.8     685.7
Hard (Univ)        53.7    102.6    202.7     503.6
Soft (SURE)       936.0    938.7    950.4     971.1
Soft (Univ)       742.1   1481.8   2962.4    7404.2

µ_i ∼ U(0, 10):
                 k = 50     100      200       500
CMLE              124.8    222.7    369.4     737.9
Cauchy            121.1    237.1    450.4    2309.6
Laplace a=1       140.0    260.7    525.8    2018.5
Laplace a=0.5     119.9    226.4    424.4    1934.3
Laplace a=0.2     123.1    244.1    469.5    2845.3
Laplace a=0.1     131.5    273.2    542.8    3820.8
Postmean          254.7    306.1    414.6     731.9
Postmedian        189.0    251.0    375.8     751.7
Hard (Univ)       142.4    302.2    554.7    1354.3
Soft (SURE)       933.4    937.4    943.0     968.8
Soft (Univ)       566.3   1032.4   2207.2    5308.5
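For illustration, the following sketch (ours; the paper's EB competitors were run with the EbayesThresh R package, which is not reproduced here) generates one configuration of the Table 2 setup and evaluates the total squared error of the CMLE, reusing theta_hat and choose_C from the earlier sketches.

```python
import numpy as np   # theta_hat and choose_C are from the Section 2 and 3 sketches

rng = np.random.default_rng(0)                 # seed chosen arbitrarily
n, k = 1000, 50
theta = np.zeros(n)
theta[:k] = rng.uniform(0.0, 4.0, size=k)      # the U(0, 4) panel with k = 50 nonzero means
z = theta + rng.standard_normal(n)             # sigma = 1

C_hat = choose_C(z)
est = theta_hat(z, C_hat, 1.0)
print("C_hat =", C_hat, "total squared error =", np.sum((est - theta) ** 2))
```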

Fig. 2. Left panel: d.f. denotes degrees of freedom; the solid line shows the squared error of the CMLE, the dashed line that of empirical Bayes with the Cauchy prior, and the dotted line that of hard shrinkage. Right panel: the solid line shows λ as a function of C; the dotted line is the diagonal.

As a reviewer commented, the proposed CMLE approach may not be robust when the distribution of ε_i is not exactly normal, because the CMLE fully uses normality: the form of the shrinkage reflects the normal model, and normality is used in choosing the regularization parameter C via SURE. To demonstrate this, we add the following simulation study. We take ε_i ∼ t(df)/√(df/(df − 2)), where t(df) denotes a t-distribution with df degrees of freedom, so that var(ε_i) = 1, and set θ_i ∼ Unif(0, 4) for 1 ≤ i ≤ 50 and θ_i = 0 for 51 ≤ i ≤ 1000. The left panel of Fig. 2 shows that, as the degrees of freedom increase, the squared error of the CMLE (solid line) decreases quickly, while hard shrinkage (dotted line) and empirical Bayes with the Cauchy prior (dashed line) improve only slowly. Although the proposed CMLE has advantages in performance and in avoiding the specification of priors, it appears more sensitive to departures from normality than the other methods; a sketch of this setup is given below. Another comment concerns the relationship between C and λ: the right panel of Fig. 2 shows that λ approaches C as C increases.
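A sketch of the robustness setup (ours; df = 5 is just one illustrative value, whereas Fig. 2 varies the degrees of freedom), with hard thresholding at the universal threshold included for comparison.

```python
import numpy as np   # theta_hat and choose_C are from the earlier sketches

rng = np.random.default_rng(1)
n, k, df = 1000, 50, 5
theta = np.zeros(n)
theta[:k] = rng.uniform(0.0, 4.0, size=k)
eps = rng.standard_t(df, size=n) / np.sqrt(df / (df - 2.0))   # standardized t errors, var = 1
z = theta + eps

err_cmle = np.sum((theta_hat(z, choose_C(z), 1.0) - theta) ** 2)
lam = np.sqrt(2.0 * np.log(n))                                 # universal threshold
err_hard = np.sum((z * (np.abs(z) > lam) - theta) ** 2)
print("CMLE:", err_cmle, "hard:", err_hard)
```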


5. Concluding remarks

In this paper, we proposed a model-based shrinkage estimator reflecting the normality of the observations. We provided an asymptotic optimality property analogous to the one shown for hard and soft shrinkage in Donoho and Johnstone (1994), and presented a variety of simulations comparing the proposed CMLE with empirical Bayes under the different priors proposed in Johnstone and Silverman (2004). We demonstrated the use of the conditional MLE in the estimation of a normal mean vector and provided numerical studies showing that the shrinkage based on the CMLE performs competitively with, or better than, some existing approaches, while the EB approaches can perform poorly in some situations. As a reviewer commented, SCAD (Fan and Li, 2001) might also be considered for this problem; however, it is challenging to develop an efficient and adaptive algorithm for determining the two parameters of SCAD when only one observation vector is available. We leave this as future work.

Acknowledgment

We are grateful to an anonymous reviewer for helpful comments and suggestions.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.spl.2014.06.005.

References

Donoho, D.L., Johnstone, I.M., 1994. Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
Donoho, D.L., Johnstone, I.M., 1995. Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90, 1200–1224.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348–1360.
Johnstone, I.M., Silverman, B.W., 2004. Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. Ann. Statist. 32 (4), 1594–1649.
Johnstone, I.M., Silverman, B.W., 2005. EbayesThresh: R programs for empirical Bayes thresholding. J. Stat. Softw. 12 (8).