Journal of Multivariate Analysis 145 (2016) 190–207
Contents lists available at ScienceDirect
Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva
Minimax estimation of a normal covariance matrix with the partial Iwasawa decomposition Hisayuki Tsukuma Faculty of Medicine, Toho University, 5-21-16 Omori-nishi, Ota-ku, Tokyo 143-8540, Japan
article
info
Article history: Received 8 April 2015 Available online 2 January 2016 AMS 2010 subject classifications: primary 62C20 62J07 secondary 62H12 Keywords: Autoregressive model Cholesky decomposition Inadmissibility Incomplete data Minimaxity LDL⊤ decomposition Moving average model Partial Iwasawa decomposition Shrinkage estimator Statistical decision theory
abstract This paper addresses the problem of estimating the normal covariance matrix relative to the Stein loss. The partial Iwasawa decomposition is used to reduce the original estimation problem to simultaneous estimation for variances and means of some normal distributions. The variances and the means are closely related to, respectively, the diagonal and the below-diagonal elements of a lower triangular matrix which is made from the Cholesky decomposition of the covariance matrix. Shrinkage type procedures are proposed for improvements not only on the diagonal elements but also on the below-diagonal elements corresponding to the James and Stein minimax estimator of the covariance matrix. © 2015 Elsevier Inc. All rights reserved.
1. Introduction In this paper, the problem of estimating a normal covariance matrix is discussed from a decision-theoretic standpoint. The estimation problem is precisely specified as follows: Let S ∼ Wp (n, Σ ),
(1.1)
which implies that the p × p random matrix S has the Wishart distribution with n degrees of freedom and mean nΣ . In a standard multivariate normal distribution model, Σ and S are interpreted as, respectively, the covariance matrix and the minimal sufficient statistic for Σ . Suppose that n ≥ p and Σ is an unknown positive definite matrix. Denote by tr A and det A, respectively, the trace and the determinant of a square matrix A and by B−1 the inverse of a nonsingular matrix B. Consider the problem of estimating Σ relative to the Stein loss
, Σ ) = tr Σ −1 Σ − ln det Σ −1 Σ − p, L(Σ
(1.2)
stands for an estimator of Σ . The accuracy of estimators Σ is compared by the risk function R(Σ , Σ ) = E [L(Σ , Σ )], where Σ where the expectation is taken with respect to (1.1).
E-mail address:
[email protected]. http://dx.doi.org/10.1016/j.jmva.2015.12.013 0047-259X/© 2015 Elsevier Inc. All rights reserved.
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
191
Many researchers have made substantial endeavors for decision-theoretic estimation of the covariance matrix Σ . The important key in the decision-theoretic estimation is invariance under scale transformation groups. Such invariance principle inevitably requires matrix decompositions of S. Indeed, James and Stein [5] considered a scale transformation group of lower triangular matrices and derived a minimax estimator built from the Cholesky decomposition of S. Let T be a lower triangular matrix with positive diagonal elements such that S = TT ⊤ , where the superscript ‘‘⊤’’ represents the transposition of a matrix. Then James and Stein’s minimax estimator of Σ is given by
JS = TDJS T ⊤ , Σ
(1.3) JS di
where DJS is the diagonal matrix with the ith diagonal element = (n + p − 2i + 1)−1 . JS are accomplished by using orthogonally invariant estimators which are Moreover effective improvements on Σ based on the eigenvalue decomposition of S. For details of orthogonally invariant estimators, see [19,3]. Other matrix decompositions were found in [11,21] for decision-theoretic estimation of Σ . In order to construct estimators of Σ , this paper considers the following well-known matrix decomposition: Partition S into four blocks as
S11 S21
S=
⊤ S21 S22
,
(1.4)
where S11 is a q × q square matrix with 1 ≤ q ≤ p − 1. Then S can be decomposed as
S=
Iq Y21
Oq×(p−q) Ip−q
S11
O(p−q)×q
Oq×(p−q) S22·1
Iq
O(p−q)×q
⊤ Y21 Ip−q
,
(1.5)
−1 −1 ⊤ where Iq is the identity matrix of order q, Oq×r is the q × r zero matrix, Y21 = S21 S11 and S22·1 = S22 − S21 S11 S21 . The decomposition (1.5) is called the partial Iwasawa decomposition in the literature (see [20]). Hereinafter the partial Iwasawa decomposition is abbreviated to the PID. The PID plays an important role in multivariate statistical theory. A distributional property for (1.5) is given, for instance, in [15, Theorem 3.3.5]. In estimation of Σ with the PID, Konno [6] handled a practical setting with incomplete data and Ma et al. [13] proposed a shrinkage type procedure for S21 of (1.4). It is also noted that Lin and Perlman [11] addressed covariance estimation via shrinkage type procedure for all the off-diagonal elements of S without using the PID. In this paper we employ the PID so as to apply shrinkage type procedures to both the diagonal elements and the belowdiagonal elements of T . In Section 2, we utilize the PID for showing that the original covariance estimation problem is decomposed into the problems of estimating some mean vectors and some variances. Section 3 deals with minimaxity for a class of estimators inspired by the decomposed problem. The key tool to the minimaxity is the IERD (integral expression of risk difference) method, which was proposed by Kubokawa [7–9]. Using the IERD method, we obtain sufficient conditions for the minimaxity of the class. It is also shown that the sufficient conditions yield Stein-like [17] and Brewster and Zidek-like [2] estimators which are shrinking the diagonal elements of T . Section 4 handles an alternative PID and the IERD method for further improvements by means of shrinking the belowdiagonal elements of T . In Section 5 we discuss applications to autoregressive and moving average models with repeated measurements and to an incomplete data model. Section 6 gives some results of simulation studies and numerically investigates the risks of shrinkage minimax estimators obtained in this paper. Section 7 points out remarks on directions for future work. We finally provide some detailed proofs of results of this paper in the Appendix.
2. Decomposition of covariance estimation problem Define the Cholesky decomposition of Σ as Σ = ΘΘ ⊤ , where Θ is a p × p lower triangular matrix with positive diagonal elements. Write Θ = (θi,j ), where θi,j is the (i, j)th element of Θ and, for i < j, θi,j = 0. For i = 1, . . . , p, denote by Θi the ith leading principal submatrix of Θ . Note that Θ1 = θ1,1 and Θp = Θ . Then Θi can be represented by the block matrix
Θi =
Θi−1
0i−1
θ⊤ i
θi,i
,
where θ i = (θi,1 , . . . , θi,i−1 )⊤ and 0i−1 is the (i − 1)-dimensional zero vector. 2 −1 Let σi2 = θi2,i for i = 1, . . . , p and βi = (βi,1 , . . . , βi,i−1 )⊤ = (Θi⊤ −1 ) θ i for i = 2, . . . , p. Let Σ1 = σ1 and denote, for i = 2, . . . , p,
Σi = Θi Θi⊤ . Consider the following type of the PID of Σi :
Σi =
Ii−1
0i−1
β⊤ i
1
Σi−1 0⊤ i−1
0i−1
σi2
Ii−1
βi
0⊤ i −1
1
,
192
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
which implies that Ii−1
−βi
0⊤ i −1
1
Σi
−1
=
Σi−−11
0i−1
0⊤ i−1
σi−2
Ii−1
0i−1
−β⊤ i
1
.
(2.1)
Denote by diag(a1 , . . . , ap ) a diagonal matrix, where a1 , . . . , ap are diagonal elements starting in the upper left corner. Repeatedly using (2.1), we can uniquely rewrite Σ −1 as
Σ −1 = B⊤ diag(σ1−2 , . . . , σp−2 )B,
(2.2)
where 1
0
−β2,1 .. .
1
.. . −βp,2
B=
−βp,1
.
..
. ···
1
For detailed derivation of (2.2), see Appendix A.1. From (2.2), Σ can be decomposed as B−1 diag(σ12 , . . . , σp2 )(B⊤ )−1 , where B−1 is a lower triangular matrix with unit diagonal elements. In the literature, such matrix decomposition of Σ is called the LDL⊤ decomposition. For uniqueness of the LDL⊤ decomposition, see [4, Section 4.1]. Let T = (ti,j ). For i = 1, . . . , p, the ith leading principal submatrix of T is defined as Ti . Denote
Ti =
Ti−1 ti⊤
0i−1 ti,i
,
2 −1 where ti = (ti,1 , . . . , ti,i−1 )⊤ , T1 = t1,1 and Tp = T . Let xi = (Ti⊤ −1 ) ti for i = 2, . . . , p and si = ti,i for i = 1, . . . , p. Then
the PID of Ti Ti⊤ is expressed as
⊤
0i−1 1
Ii − 1 x⊤ i
Ti Ti =
Ti−1 Ti⊤ −1 0⊤ i−1
0i−1 si
Ii − 1 0⊤ i−1
xi 1
.
Then we obtain the following lemma, whose proof is given in Appendix A.2. Lemma 2.1. The si ’s and the xi ’s are distributed as
s1 ∼ σ12 χn2 , si ∼ σi2 χn2−i+1
and
−1 xi |Ti−1 ∼ Ni−1 (βi , σi2 (Ti−1 Ti⊤ −1 ) )
(2.3)
for i = 2, . . . , p.
For every given i (≥ 2), xi does not depend on si .
of Σ in the following manner: Let the We construct an estimator Σ σi2 ’s and the βi ’s be estimators of the σi2 ’s and the 2 1 = i inductively as, for i = 2, . . . , p, βi ’s, respectively. Set Σ σ1 and define Σ
Ii−1
0i−1
i−1 Σ
0i−1
Ii − 1
βi
. ⊤ 0⊤ σi2 0⊤ 1 βi 1 i −1 i−1 p as an estimator of Σ and denote Σ p by Σ S . We think of Σ i = Σ
(2.4)
It is noted from (2.1) and (2.4) that
tr Σi Σi = tr −1
Σi−−11
0i−1
Ii−1
0i−1
i−1 Σ
0i−1
Ii−1
⊤
0⊤ σi2 0⊤ βi − β⊤ 1 i−1 i −1 i −1 Σi−1 ∗ Σi−1 0i−1 = tr i−1 ( ∗ ( β i − β i )⊤ Σ βi − βi ) + σi2 0⊤ σi−2 i −1 i−1 + σi−2 ( i−1 ( = tr Σi−−11 Σ β i − β i )⊤ Σ βi − βi ) + σi2 , 0⊤ i −1
σi−2
βi − βi
1
which implies that
S = tr Σp−−11 Σ p−1 + σp−2 ( p−1 ( tr Σ −1 Σ β p − β p )⊤ Σ βp − βp ) + σp2
p−2 + = tr Σp−−12 Σ
p p 1 σi2 i−1 ( + ( βi − βi )⊤ Σ βi − βi ) 2 2 σ σ i=p−1 i i=p−1 i
= ··· p p σi2 1 i−1 ( = + ( βi − βi )⊤ Σ βi − βi ). 2 2 σ σ i i i=1 i=2
(2.5)
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
193
Moreover using (2.2) yields that p σ2
S = det Σ −1 Σ
i
i =1
σi2
.
(2.6)
S as Combining (2.5) and (2.6), we can rewrite the Stein loss (1.2) of Σ S , Σ ) = L( Σ
p 2 σ
i 2 i
− ln
σ
i=1
p σi2 1 i−1 ( − 1 + ( βi − βi )⊤ Σ βi − βi ). 2 σi2 σ i i=2
(2.7)
Hence the problem of estimating the covariance matrix Σ relative to the Stein loss (1.2) can be characterized as the problem of simultaneously estimating the σi2 ’s and the βi ’s in the decomposed model (2.3) with respect to the decomposed loss (2.7). We here give the risk of James and Stein’s [5] minimax estimator (1.3) with the decomposed loss (2.7). Let us define JS
1JS = iJS = Ti Di Ti⊤ where σi2JS = dJSi si for i = 1, . . . , p and βi = xi for i = 2, . . . , p. Denote Σ σ12JS and, for i = 2, . . . , p, Σ JS JS JS p = Σ JS and, for i = 2, . . . , p, Di = diag(d1 , . . . , di ). It is noted that Dp = DJS , Σ JS JS i−1 0i−1 Ii−1 0i−1 Σ I β JS i − 1 i i = Σ . JS ( β )⊤ 1 0⊤ σ 2JS 0⊤ 1 i
i −1
i
i−1
JS as Using the decomposed loss (2.7), we rewrite the risk of Σ JS , Σ ) = E R(Σ
p i=1
p σi2JS 1 σi2JS JS ⊤ JS JS − ln − 1 + ( β − β ) Σ ( β − β ) . i i i i−1 σi2 σi2 σi2 i i=2
3. A class of estimators and sufficient conditions for minimaxity In this section we provide a class of estimators for the covariance matrix Σ . Kubokawa’s [9] IERD method is employed JS . to establish sufficient conditions that the class improves on James and Stein’s minimax estimator Σ For i = 2, . . . , p, let ⊤ wi = ∥ti ∥2 /si = x⊤ i Ti−1 Ti−1 xi /si .
(3.1)
Consider the following class of estimators: JS
(Φ ) = T Φ T ⊤ , Σ
Φ = diag(d1 , φ2 (w2 ), . . . , φp (wp )),
(3.2)
(DJS ) = Σ JS . where the φi (w)’s are absolutely continuous functions. It is notable that Σ JS JS Let Φ1 = d1 and Φi = diag(d1 , φ2 (w2 ), . . . , φi (wi )) for i = 2, . . . , p. Then we observe that Ti Φi Ti = ⊤
Ii−1
0i−1
x⊤ i
1
Ti−1 Φi−1 Ti⊤ −1
0i−1
0⊤ i −1
φi (wi )si
Ii − 1
xi
0⊤ i−1
1
(3.3)
(Φ ) can be derived from (2.4) with for i = 2, . . . , p, which implies that Σ σ12SH = dJS1 s1 , JS σi2SH = φi (wi )si and βi = xi
for i = 2, . . . , p.
(3.4)
Also the class (3.2) can be written as
( Φ ) = LΦ T 0 L⊤ , Σ where T0 = diag(t12,1 , . . . , tp2,p ) = diag(s1 , . . . , sp ) and L = T · diag(t1−,11 , . . . , tp−,p1 ). Note that S = LT0 L⊤ , which is the LDL⊤
decomposition of S. Tsukuma [21] considered the LDL⊤ decomposition of S in order to define a class of estimators different from the class (3.2). The IERD method yields the following theorem. Theorem 3.1. Suppose that p ≥ 2. Assume that for i = 2, . . . , p JS
(i) φi (w) is nondecreasing in w with limw→∞ φi (w) = di ,
194
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
(ii) φi (w) ≥ φ0,i (w), where
w
g0,i (w)
φ0,i (w) =
n + (p − i)g0,i (w)
,
g0,i (w) = w0 0
x(i−1)/2−1 (1 + x)−n/2 dx
x(i−1)/2−1 (1 + x)−(n+2)/2 dx
.
(3.5)
(Φ ) is a minimax estimator dominating Σ JS relative to the Stein loss (1.2). Then Σ The proof of Theorem 3.1 is provided in Appendix A.3. In estimation of a variance of normal distribution model, Stein [17] proposed a truncated estimator with information on the sample mean drawn from the model. Using Theorem 3.1, we can provide a Stein-like [17] truncated estimator improving JS . The g0,i (w)’s are bounded above as on Σ
w g0,i (w) = w 0
≤
x(i−1)/2−1 (1 + x)−n/2 dx
0
x(i−1)/2−1 (1 + x)−n/2 (1 + x)−1 dx
w 0
(1 +
x(i−1)/2−1 (1 + x)−n/2 dx
w)−1
w 0
x(i−1)/2−1 (1 + x)−n/2 dx
= 1 + w,
which implies that
φ0,i (w) ≤
1+w n + (p − i)(1 + w)
.
It is also noted that the g0,i (w)’s are nondecreasing in w . Hence we can see that
∞ g0,i (w) ≤ ∞0
x(i−1)/2−1 (1 + x)−n/2 dx
x(i−1)/2−1 (1
0
+
x)−(n+2)/2 dx
=
n n−i+1
,
so that
φ0,i (w) ≤
1 n + p − 2i + 1
= dJSi .
JS
Let Φ TR = diag(d1 , φ2TR (w2 ), . . . , φpTR (wp )), where
φ (w) = min TR i
JS di
,
1+w n + (p − i)(1 + w)
. JS
For each i, φiTR (w) is nondecreasing function which tends to di as w → ∞. Hence from Theorem 3.1, we obtain the following proposition.
(Φ TR ) is minimax estimator dominating Σ JS relative to the Stein loss (1.2). Proposition 3.1. Suppose that p ≥ 2. Then Σ The φ0,i (wi )’s given in (3.5) satisfy Assumptions (i) and (ii) of Theorem 3.1. The following proposition is an extension of the Brewster and Zidek [2] type estimator of a normal variance. JS (Φ0 ) is minimax estimator Proposition 3.2. Suppose that p ≥ 2. Let Φ0 = diag(d1 , φ0,2 (w2 ), . . . , φ0,p (wp )). Then Σ JS dominating Σ relative to the Stein loss (1.2).
4. Alternative decomposition and further dominance results
(Φ ) given in (3.2) were derived for dominance results in terms of the In the preceding section, some conditions on Σ (Φ ). One of methods for such improvement is to modify decomposed model (2.3). There is still room for improvement on Σ the xi ’s as estimators of the βi ’s. It is, however, hard to derive appropriate estimators superior to the xi ’s. Therefore we here (Φ ). consider an alternative decomposed model to improve on Σ 4.1. Alternative decomposition of covariance estimation problem Denote Θ(1) = Θ and Θ(p) = θp,p and define, for i = 1, . . . , p − 1,
Θ(i) =
θi,i θ (i)
0⊤ p−i
Θ(i+1)
,
(4.1)
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
195
where
θi+1,i = ... ,
θi+1,i+1 = ...
θ (i)
Θ(i+1)
θp,i
0
..
. ···
θp,i+1
θp,p
.
Here θ (i) is (p − i)-dimensional vector and Θ(i+1) is lower triangular matrix of order p − i. For i = 1, . . . , p − 1, let γ (i) = (γi+1,i , . . . , γp,i )⊤ = θi−,i 1 θ (i) and σi2 = θi2,i . For i = 1, . . . , p, let Σ(i) = Θ(i) Θ(⊤i) , where Σ(1) = Σ . Consider here the following form of the PID for Σ(i) : 2 1 0⊤ σi 0⊤ 1 γ⊤ p−i p−i (i) . (4.2) Σ(i) = γ (i) Ip−i 0p−i Ip−i 0p−i Σ(i+1) Repeatedly using (4.2), we can uniquely rewrite Σ as
Σ = Gdiag(σ12 , . . . , σp2 )G⊤ ,
(4.3)
where 1
0
γ2,1
G= ..
. γp,1
1
.. .
.
..
. ···
γp,2
1
The detailed derivation of (4.3) is given in Appendix A.1. The r.h.s. of (4.3) is the LDL⊤ decomposition of Σ . Comparing (4.3) with (2.2), we can see that G = B−1 . Let T(i) be defined in the same way as in (4.1) and denote the PID of S(i) = T(i) T(⊤i) by
S(i) =
1 x(i)
0⊤ p−i Ip−i
0⊤ p−i S(i+1)
si
0p−i
1
0p−i
x⊤ (i) Ip−i
,
(4.4)
1 −1 ⊤ where si = ti2,i and x(i) = ti− ,i t(i) = ti,i (ti+1,i , . . . , tp,i ) .
Lemma 4.1. For a given i, it follows that
sj ∼ σj2 χn2−j+1 and x(j) |sj ∼ Np−j (γ (j) , Σ(j+1) /sj ) S(i+1) = T(i+1) T(⊤i+1) ∼ Wp−i (n − i, Σ(i+1) ),
for j = 1, . . . , i,
(4.5)
where the (sj , x(j) )’s (j = 1, . . . , i) and S(i+1) are mutually independent. Furthermore, the decomposed model (4.5) implies that the columns of T are mutually independent. For proof of Lemma 4.1, see Appendix A.4. The same arguments as above are applied to the Stein loss (1.2). Let the σi2 ’s and the γ (i) ’s be certain estimators of the
(i) inductively (p) = σi2 ’s and the γ (i) ’s, respectively. Set Σ σp2 . For i = p − 1, . . . , 1, we define (p − i + 1) × (p − i + 1) matrix Σ
as
(i) = Σ
1
γ (i)
0⊤ p−i Ip−i
σi2
0⊤ p−i
(i+1) Σ
0p−i
1
0p−i
γ⊤ (i)
Ip−i
.
(4.6)
A = Σ (1) is regarded as an estimator of Σ . It is noted that, conversely, for any estimator Σ A , the LDL⊤ decomposition Then Σ A can uniquely be obtained from (4.6). of Σ Combining (4.2) and (4.6) yields that (i) = (i+1) , tr Σ(−i)1 Σ σi2 /σi2 + σi2 ( γ (i) − γ (i) )⊤ Σ(−i+11) ( γ (i) − γ (i) ) + tr Σ(−i+11) Σ which is used to obtain
A = tr Σ −1 Σ
p σ2 i =1
i 2 i
σ
+
p−1
σi2 ( γ (i) − γ (i) )⊤ Σ(−i+11) ( γ (i) − γ (i) )
i =1
A derived from (4.6) can alternatively be written as by the same way as for (2.5). Thus the Stein loss (1.2) of Σ A, Σ ) = L(Σ
p 2 σ i=1
σ
i 2 i
− ln
p−1 σi2 − 1 + σi2 ( γ (i) − γ (i) )⊤ Σ(−i+11) ( γ (i) − γ (i) ). 2 σi i=1
(4.7)
196
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207 Table 1 Exact values for (E1 , E2 ) of the James and Stein minimax estimator. n
5 10 20 50 100 200 400 1000
p 5
10
20
50
(3.1574, 1.6063) (0.7943, 0.8607) (0.3176, 0.4586) (0.1106, 0.1926) (0.0527, 0.0981) (0.0257, 0.0495) (0.0127, 0.0249) (0.0050, 0.0100)
(4.6361, 3.9334) (0.9807, 1.9829) (0.2740, 0.8446) (0.1187, 0.4349) (0.0547, 0.2211) (0.0262, 0.1115) (0.0102, 0.0448)
(7.0847, 8.7602) (0.9354, 3.4346) (0.3356, 1.7896) (0.1347, 0.9197) (0.0588, 0.4671) (0.0214, 0.1887)
(13.565, 23.531) (2.435, 10.993) (0.744, 5.717) (0.253, 2.947) (0.071, 1.205)
This suggests that the covariance estimation problem with the Stein loss (1.2) can be considered as the problem of simultaneously estimating the σi2 ’s and the γ (i) ’s under the decomposed loss (4.7) in the decomposed model (4.5).
JS by using the decomposed loss (4.7). The risk of Σ A is given by Let us here derive the risk of Σ A , Σ ) = E1 + E2 , R(Σ
(4.8)
where
E1 = E
σi2 − ln 2 − 1 , σ σi
p 2 σ i=1
i 2 i
E2 = E
p−1
σ ( γ (i) − γ (i) )
⊤
2 i
Σ(−i+11) ( γ (i)
− γ (i) ) .
i =1
JS has a constant risk, which is given by The James and Stein [5] estimator Σ JS , Σ ) = R(Σ
p
ln(n + p − 2i + 1) − E [ln s∗i ] ,
i =1
where s∗i ∼ χn2−i+1 . It is easy to show that E [ln s∗i ] = ln 2 + g ((n − i + 1)/2), where g (·) is the digamma function. Furthermore, JS as we can exactly express E2 of Σ
E2 = E
p−1
JS di si
(x(i) − γ (i) )
⊤
Σ(−i+11) (x(i)
− γ (i) ) =
i=1
p−1 i =1
p−i n + p − 2i + 1
because x(i) |si ∼ Np−i (γ (i) , Σ(i+1) /si ) for i = 1, . . . , p − 1.
JS with typical values of n and p. For a fixed p, the percentage of E2 in risk Table 1 provides exact values for E1 and E2 of Σ increases as n increases. When both n and p are large, the percentage of E2 in risk is very large. These imply that if n is large JS can be expected by successfully improving on E2 of Σ JS . then a substantial reduction in risk of Σ 4.2. Further dominance results
(Φ ) given in (3.2) by applying the IERD method to In this subsection, our goal is to derive further improvement on Σ estimation of the γ (i) ’s in the decomposed model (4.5). σ
(Φ ) is derived from (2.4) with the Assume here that p ≥ 4. Recall that Σ σi2SH ’s which are defined in (3.4), namely
2SH 1
= φ1 (w1 )s1 = dJS1 s1 and σi2SH = φi (wi )si for i = 2, . . . , p. To estimate the γ (i) ’s, we define their estimators as follows: ψi (w(i) ) γ SH x(i) for i = 1, . . . , p − 3, (i) = 1 − w(i) SH γ (i) = x(i) for i = p − 2 and p − 1,
−1 where w(i) = t(⊤i) S(−i+11) t(i) = si x⊤ (i) S(i+1) x(i) and the ψi (w)’s are, respectively, absolutely continuous functions of w . Let ψ = (ψ1 (w(1) ), . . . , ψp−3 (w(p−3) )). Denote the resulting estimator of Σ by
(Φ , ψ), Σ
(4.9)
which is constructed by the inductive relationship (4.6) with the σ and the γ Let T0 = diag(s1 , . . . , sp ) and define G as the lower triangular matrix with unit diagonal elements, where the elements below the ith unit diagonal element of G ⊤ are the same as those of γ SH (i) . Then Σ (Φ , ψ) can explicitly be expressed as GΦ T0 G . For i = 1, . . . , p − 3, the James and Stein [5] type shrinkage estimator of γ (i) is defined by ci p−i−2 γ JS(i) = 1 − x(i) , ci = . w(i) n−p+3 2SH ’s i
SH (i) ’s.
Let ψ JS = (c1 , . . . , cp−3 ). Then we obtain the following proposition, whose proof is given in Appendix A.5.
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
197
(Φ , ψJS ) dominates Σ (Φ ) relative to the Stein loss (1.2). Proposition 4.1. If p ≥ 4, Σ (Φ , ψJS ). Using the IERD method immediately provides sufficient conditions on (4.9) for improving on Σ Theorem 4.1. Suppose that p ≥ 4. Assume that ψ = (ψ1 (w(1) ), . . . , ψp−3 (w(p−3) )) satisfies, for i = 1, . . . , p − 3, (i) ψi (w) is nondecreasing in w and limw→∞ ψi (w) = ci = (p − i − 2)/(n − p + 3), (ii) ψi (w) ≥ ψ0,i (w) for any w > 0, where
w ψ0,i (w) = 0w 0
y(p−i)/2−1 (1 + y)−(n−i+1)/2−1 dy y(p−i)/2−2 (1 + y)−(n−i+1)/2−1 dy
.
(Φ , ψ) is a minimax estimator dominating Σ (Φ , ψJS ) relative to the Stein loss (1.2). Then Σ A short proof of Theorem 4.1 is given in Appendix A.5. (Φ , ψ) dominates Σ (Φ ) if ψ satisfies some conditions. There is, however, no guarantee Theorem 4.1 indicates that Σ (Φ , ψ) dominates Σ (DJS , ψ). This will be examined through Monte Carlo simulations in Section 6. that Σ (Φ , ψ) dominating Σ (Φ , ψJS ). It is obvious to observe that Next, we provide two examples of Σ
w ψ0,i (w) = w0 0
≤
w
y(p−i)/2−1 (1 + y)−(n−i+1)/2−1 dy
y−1 y(p−i)/2−1 (1
w
0 −1
+ y)−(n−i+1)/2−1 dy
y(p−i)/2−1 (1 + y)−(n−i+1)/2−1 dy
w 0
y(p−i)/2−1 (1 + y)−(n−i+1)/2−1 dy
= w.
Also, the ψ0,i (w)’s are nondecreasing in w , so that
∞ ψ0,i (w) ≤ 0∞ 0
y(p−i)/2−1 (1 + y)−(n−i+1)/2−1 dy y(p−i)/2−2 (1
+
y)−(n−i+1)/2−1 dy
=
p−i−2 n−p+3
= ci .
(4.10)
Define ψ PP = (ψ1PP (w(1) ), . . . , ψpPP−3 (w(p−3) )) with
ψiPP (w) = min(ci , w). The resulting estimators of the γ (i) ’s are written by
ci ψiPP (w(i) ) x = max 0 , 1 − x(i) for i = 1, . . . , p − 3. γ PP = 1 − ( i ) (i) w(i) w(i) JS
These are the well-known positive-part type estimators with respect to the γ (i) ’s (see [1]). Since the ψiPP (w)’s satisfy Assumptions (i) and (ii) of Theorem 4.1, it follows that
(Φ , ψPP ) dominates Σ (Φ , ψJS ) and Σ (Φ ) relative to the Stein loss (1.2). Proposition 4.2. If p ≥ 4, then Σ As seen in (4.10), the ψ0,i (w)’s satisfy Assumption (i) of Theorem 4.1. Thus we have
(Φ , ψ0 ) with ψ0 = (ψ0,1 (w(1) ), . . . , ψ0,p−3 (w(p−3) )) dominates Σ (Φ , ψJS ) and Σ (Φ ) Proposition 4.3. If p ≥ 4, then Σ relative to the Stein loss (1.2). (Φ , ψ) as a matrix-valued function of S. Therefore, We conclude this section by giving a remarkable note. Let us look on Σ (Φ , ψ) as δ(S ). Let P be a permutation matrix of order p. Note that the number of such permutation matrices is we write Σ p!. If δ(S ) is minimax relative to the Stein loss (1.2), then a permuted estimator P ⊤ δ(PSP ⊤ )P is also minimax, where P ̸= Ip . , an average of all the permuted estimators Since the Stein loss (1.2) is a strictly convex function of estimator Σ AA (Φ , ψ) = Σ
1 ⊤ P δ(PSP ⊤ )P p!
is also minimax, where denotes the summation over all the permutation matrices. This idea was discussed by Stein [16, Section 4]. AA (Φ , ψ) would not be practical in large dimensional case since the number of the permutation matrices However Σ becomes strikingly large. Therefore we here consider simple estimator 1
δ(S ) + δ UT (S ) with δ UT (S ) = P0⊤ δ(P0 SP0⊤ )P0 , (4.11) 2 where P0 is the permutation matrix such that, for i = 1, . . . , p, the (i, p − i + 1)th elements are ones and the other elements AP (Φ , ψ) is minimax if δ(S ) is minimax. It is also noted that Σ AP (Φ , ψ) does not always dominate are zeros. The estimator Σ UT AP both δ(S ) and δ (S ). See Section 6 for numerical comparison of Σ (Φ , ψ), δ(S ) and δ UT (S ). AP (Φ , ψ) = Σ
198
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
5. Applications 5.1. Autoregressive and moving average models with repeated measurements As seen in Sections 2 and 4, the problem of estimating Σ relative to the Stein loss (1.2) can be reduced to the problem of simultaneously estimating (A) the βi ’s and the σi2 ’s of the model (2.3) relative to the decomposed loss (2.7), or (B) the γ (i) ’s and the σi2 ’s of the model (4.5) relative to the decomposed loss (4.7). We can relate the decomposed estimation problems (A) and (B) to simultaneous estimation of some parameters of autoregressive and moving average models with repeated measurements. In this section the relationship is explained in brief. Let the time points be i = 1, . . . , p and the numbers of repeated measurements j = 1, . . . , n. Denote by zi,j the observation at time i with the jth measurement. Assume that the zi,j ’s have a generalized autoregressive model as follows:
z1,j = ε1,j , zi,j = βi,1 z1,j + · · · + βi,i−1 zi−1,j + εi,j
for i = 2, . . . , p.
Here, the εi,j ’s are mutually independent with εi,j ∼ N (0, σi2 ) and the βi,k ’s and the σi2 ’s are unknown parameters with −∞ < βi,k < ∞ and σi2 > 0. Let εij = (ε1,j , . . . , εi,j )⊤ . For i = 2, . . . , p and j = 1, . . . , n, let Σi be the covariance matrix of zij = (z1,j , . . . , zi,j )⊤ and define the i × i matrices Bi inductively as
Bi =
Bi−1
−β⊤ i
0i−1 1
where βi = (βi,1 , . . . , βi,i−1 )⊤ and B1 = 1. Note that, for each i, the observation vectors zij ’s are mutually independent with respect to j. Then it is seen that ⊤ diag(σ12 , . . . , σi2 ) = Cov[εij ] = Bi Cov[zij ]B⊤ i = Bi Σi Bi ,
which implies that −2 −2 Σi−1 = B⊤ i diag(σ1 , . . . , σi )Bi . −2 −2 It hence follows that Σ −1 = Σp−1 = B⊤ p diag(σ1 , . . . , σp )Bp , which is the same structure as (2.2).
For i = 1, . . . , p, let Yi be the n × i observation matrix such that Yi = (y1 , . . . , yi ), where yi = (zi,1 , . . . , zi,n )⊤ . For −1 ⊤ i = 2, . . . , p, denote by xi = (Yi⊤ −1 Yi−1 ) Yi−1 yi the least squares estimator of βi , which is a solution of min βi
n (zi,j − βi,1 z1,j − · · · − βi,i−1 zi−1,j )2 = min ∥yi − Yi−1 βi ∥2 .
βi
j =1
Let s1 = ∥y1 ∥2 and si = ∥yi − Yi−1 xi ∥2 for i = 2, . . . , p. Then we have
2 ⊤ −1 xi |Yi⊤ −1 Yi−1 ∼ Ni−1 (βi , σi (Yi−1 Yi−1 ) )
for i = 2, . . . , p,
si ∼ σ χ
for i = 1, . . . , p.
2 i
2 n−i+1
Note that, for each i, si and xi are mutually independent and also that S = Yp⊤ Yp ∼ Wp (n, Σ )
1 2 2 ⊤ −1 with Σ = B− . p diag(σ1 , . . . , σp )(Bp )
Define the Cholesky decomposition of Yi⊤ Yi as Ti Ti⊤ . It follows that, for i = 2, . . . , p,
⊤
Ti Ti =
Yi⊤ −1 Yi−1
Yi⊤ −1 y i
yi⊤ Yi−1
∥y i ∥2
=
Ii−1
0i−1
x⊤ i
1
Ti−1 Ti⊤ −1
0i−1
0⊤ i−1
si
Ii−1
xi
0⊤ i −1
1
,
which is the PID of Ti Ti⊤ = Yi⊤ Yi .
Let ω = ( β2 , . . . , βp , σ12 , . . . , σp2 ) be an estimator of ω = (β2 , . . . , βp , σ12 , . . . , σp2 ) based on the xi ’s and the si ’s. Also,
i−1 , where it depends on Ti−1 but not on si and xi . Consider here the for each i = 2, . . . , p, denote an estimator of Σi−1 by Σ estimation problem of ω relative to the loss L( ω, ω) =
p i=1
L0 ( σi2 /σi2 ) +
p i =2
Li ( βi , βi |σi2 , Ti−1 ),
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
199
where L0 (y) = y − ln y − 1 and 1
Li ( βi , βi |σi2 , Ti−1 ) =
σi2
i−1 ( βi − βi )⊤ . ( β i − β i )⊤ Σ
This estimation problem is equivalent to the problem (A). The problem (B) can be interpreted as follows. Let p and n be, respectively, the numbers of time points and repeated measurements. Denote by ui,j the observation at time i with the jth measurement, where 1 ≤ i ≤ p and 1 ≤ j ≤ n. Assume that the ui,j have a generalized moving average model
u1,j = ε1,j , ui,j = γi,1 ε1,j + · · · + γi,i−1 εi−1,j + εi,j
for i = 2, . . . , p,
(5.1)
where the γi,k ’s and the σi2 ’s are unknown parameters with −∞ < γi,k < ∞ and σi2 > 0 and the εi,j ’s are mutually and independently distributed as εi,j ∼ N (0, σi2 ). Denote uj = (u1,j , . . . , up,j )⊤ and εj = (ε1,j , . . . , εp,j )⊤ for j = 1, . . . , n. Let 1
γ2,1
G= ..
. γp,1
0
1
.. .
.
..
γp,2
. ···
1
It is observed that uj = Gεj and Cov[uj ] = GCov[εj ]G⊤ = Gdiag(σ12 , . . . , σp2 )G⊤ , ⊤ which implies that S = ∼ Wp (n, Σ ) with Σ = Gdiag(σ12 , . . . , σp2 )G⊤ . It is noted that Σ is the same as j=1 uj uj (4.3). Therefore, if we use the loss (4.7) as a basis for evaluation of estimators, then the problem (B) is equivalent to the simultaneous estimation problem of G and diag(σ12 , . . . , σp2 ) in the model (5.1). The discussion given in this section is similar to Pourahmadi’s [14, Section 3.5], who introduced generalized autoregressive and moving average models. See also [23,10] for related discussion and applications.
n
5.2. Covariance estimation with incomplete data The improving methods discussed in Sections 3 and 4 can be applied to an incomplete data model considered by Konno [6]. Assume that we observe S ∼ Wp (n, Σ ). Partition S into four blocks as (1.4) and, similarly, denote
Σ=
Σ11 Σ21
⊤ Σ21 , Σ22
−1 ⊤ Σ22·1 = Σ22 − Σ21 Σ11 Σ21 ,
where Σ11 is a q × q squared matrix. Assume additionally that we obtain the extra data z1 , . . . , zm following Np−q (0p−q , Σ22 ) m ⊤ independently. Denote V = i=1 zi zi , which is distributed as Wp−q (m, Σ22 ). Employing the Stein loss (1.2), we consider the problem of estimating Σ based on S and V . We use the same notation as in Section 4. It is important to be able to show that S22·1 = S(q+1) = T(q+1) T(⊤q+1) ,
Σ22·1 = Σ(q+1) .
From Lemma 4.1, it follows that
2 2 sj ∼ σj χn−j+1 and x(j) |sj ∼ Np−j (γ (j) , Σ(j+1) /sj ) S22·1 ∼ Wp−q (n − q, Σ22·1 ), V ∼ Wp−q (m, Σ22 ),
for j = 1, . . . , q,
where the (sj , x(j) )’s (j = 1, . . . , q), S22·1 and V are mutually independent. Using the same arguments as in (4.7), we can rewrite the Stein loss (1.2) as
, Σ ) = L( Σ
q
22·1 , Σ22·1 ), L0j ( σj2 , γ (j) ; σj2 , γ (j) ) + Ls (Σ
j =1
where L0j ( σj2 , γ (j) ; σj2 , γ (j) ) =
σj2 σ
2 j
− ln
σj2 σj2
−1+ σj2 ( γ (j) − γ (j) )⊤ Σ(−j+11) ( γ (j) − γ (j) ),
−1 −1 22·1 , Σ22·1 ) = tr Σ22 Ls (Σ ·1 Σ22·1 − ln det Σ22·1 Σ22·1 − (p − q).
200
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
JS can be decomposed into The James–Stein estimator Σ 2JS σj = dJSj sj ,
γ JS(j) = x(j) for j = 1, . . . , q,
JS ⊤ 22 Σ ·1 = T(q+1) D(q+1) T(q+1) , JS
JS
where D(q+1) = diag(dq+1 , . . . , dp ).
22·1 , Σ22·1 ), where an estimator Σ 22·1 is made Here we consider the problem of estimating Σ22·1 under the loss Ls (Σ from S22·1 and V . An improving method by Loh [12] is adaptable to the estimation problem. Let A and K be, respectively, (p − q) × (p − q) nonsingular and diagonal matrices such that A(S22·1 + V )A⊤ = Ip−q and AS22·1 A⊤ = K . Using Proposition 3 JS
LO −1 ⊤ −1 22 22·1 relative to the loss Ls (Σ 22·1 , Σ22·1 ). of [12], we can see that Σ improves Σ ·1 = A D(q+1) K (A ) LO be an estimator constructed from Let Σ
2LO σj = dJSj sj , γ LO γ SH (j) = (j) for j = 1, . . . , q, LO −1 ⊤ −1 Σ22·1 = A D(q+1) K (A ) , where the γ SH (j) ’s are suitable shrinkage improved estimators given in Section 4. Then we observe that
LO , Σ ) = E R(Σ
q
L0j ( σ
2LO j
LO 22 , γ ; σ , γ (j) ) + Ls (Σ ·1 , Σ22·1 ) LO (j)
2 j
JS (j)
2 j
j =1
≤E
q
L0j ( σ
2LO j
JS Ls Σ22·1
, γ ; σ , γ (j) ) + (
JS , Σ ). , Σ22·1 ) = R(Σ
j=1
is better than Σ JS relative to the Stein loss (1.2). Thus Σ Note that Σ22 − Σ22·1 is positive semi-definite. This restriction would be a valuable information to derive good estimators. See [22], who discussed estimation of several ordered covariance matrices. LO
6. Simulation studies Some results of simulation studies are here reported for examining the performance in risk of shrinkage minimax estimators. Our simulation studies investigate the following minimax estimators
JS , JS: Σ (Φ TR ), TR: Σ (DJS , ψPP ), PP: Σ (Φ TR , ψPP ), LT: Σ (Φ TR , ψPP ), given in (4.11), UT: δ UT (S ) with δ(S ) = Σ PP AP TR AP: Σ (Φ , ψ ), given in (4.11), and ST = RDJS FR⊤ . ST: Σ Here ST indicates Stein’s orthogonally invariant estimator, which has been proposed by Stein [18] and Dey and Srinivasan [3], and R is an orthogonal matrix of order p such that R⊤ SR = F = diag(f1 , . . . , fp ), where f1 ≥ · · · ≥ fp are the ordered eigenvalues of S. For the minimaxity of ST, see Theorem 3.1 of [3]. The following four cases have been assumed for Σ which should be estimated: (i) (ii) (iii) (iv)
Σ Σ Σ Σ
= G1 diag(p, . . . , 1)G⊤ 1 ; = G2 diag(1, 1 − r 2 , . . . , 1 − r 2 )G⊤ 2 with r = 0.5; with r = 1.1; = G2 diag(p2 , . . . , 12 )G⊤ 2 = aa⊤ + σ 2 Ip with a = (1, . . . , p)⊤ and σ 2 = 1,
where
1 r G2 = ...
G1 = Ip ,
0 1
..
r p−1
r p−2
The inverse matrix of G2 is given by 1 −r
0
1 G− 2 =
0
1
..
.
,
..
. −r
1
. ···
.
1
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
201
Table 2 Simulated risks in Case (i) with p = 5. n=5
JS TR PP LT UT AP ST
n = 100
E1
E2
Risk (SE)
E1
E2
Risk (SE)
3.161 3.031 3.161 3.031 3.384 2.943 2.564
1.614 1.516 1.444 1.349 0.985 1.121 1.017
4.775 (0.0225) 4.546 (0.0221) 4.605 (0.0232) 4.380 (0.0229) 4.369 (0.0230) 4.064 (0.0229) 3.582 (0.0234)
0.0523 0.0523 0.0523 0.0523 0.0538 0.0521 0.0504
0.0977 0.0973 0.0605 0.0602 0.0584 0.0555 0.0911
0.1500 (0.00055) 0.1496 (0.00054) 0.1128 (0.00051) 0.1125 (0.00051) 0.1122 (0.00051) 0.1076 (0.00050) 0.1415 (0.00052)
Table 3 Relative risks of minimax estimators over the James and Stein estimator. p=5
p = 10
p = 20
n
5
50
100
10
100
200
20
200
400
(i)
TR PP LT UT AP ST
0.952 0.964 0.917 0.915 0.851 0.750
0.996 0.766 0.763 0.763 0.727 0.896
0.998 0.752 0.750 0.748 0.717 0.943
0.961 0.900 0.865 0.863 0.780 0.722
0.997 0.451 0.451 0.451 0.408 0.895
0.999 0.432 0.432 0.432 0.394 0.941
0.972 0.843 0.822 0.820 0.719 0.711
0.998 0.252 0.254 0.254 0.220 0.897
0.999 0.233 0.234 0.234 0.205 0.942
(ii)
TR PP LT UT AP ST
0.955 0.975 0.930 0.928 0.857 0.758
1.000 0.979 0.979 0.977 0.955 0.916
1.000 0.990 0.990 0.989 0.977 0.956
0.964 0.925 0.890 0.890 0.798 0.719
1.000 0.922 0.922 0.923 0.883 0.901
1.000 0.957 0.957 0.958 0.936 0.946
0.975 0.879 0.856 0.859 0.745 0.698
1.000 0.871 0.871 0.870 0.821 0.895
1.000 0.927 0.927 0.927 0.898 0.941
(iii)
TR PP LT UT AP ST
1.000 0.992 0.991 0.964 0.896 0.867
1.000 0.997 0.997 0.992 0.980 0.987
1.000 0.998 0.998 0.996 0.990 0.994
1.000 0.970 0.970 0.927 0.843 0.847
1.000 0.985 0.985 0.954 0.940 0.980
1.000 0.992 0.992 0.975 0.968 0.991
1.000 0.946 0.946 0.902 0.796 0.842
1.000 0.972 0.972 0.929 0.912 0.977
1.000 0.985 0.985 0.961 0.952 0.989
(iv)
TR PP LT UT AP ST
0.996 0.987 0.984 0.954 0.892 0.778
1.000 0.994 0.994 0.988 0.977 0.855
1.000 0.997 0.997 0.994 0.989 0.892
1.000 0.956 0.956 0.905 0.827 0.705
1.000 0.979 0.979 0.809 0.843 0.784
1.000 0.989 0.989 0.860 0.889 0.838
1.000 0.901 0.901 0.853 0.753 0.663
1.000 0.938 0.938 0.600 0.669 0.745
1.000 0.967 0.967 0.671 0.742 0.808
which corresponds to B defined in (2.2). The covariance structure such as Cases (ii) and (iii) are applied to an autoregressive model and that of Case (iv) to a linear mixed model (see [14]). As an example of Case (iv), we here consider the case of p = 5, and can see that (σ12 , . . . , σ52 ), G and G−1 of Case (iv) are calculated as, respectively, (σ12 , . . . , σ52 ) = (2, 3, 2.5, 2.07, 1.81),
1 1 G4 = 1.5 2 2.5
0 1 1 1.33 1.67
1 0.8 1
1 0.65
, 1
1 −1 1 −0.5 G− 4 = −0.27 −0.16
0 1
−1 −0.53 −0.32
1 −0.8 −0.48
1
−0.65
.
1
Note also that the σi2 ’s of Case (iii) are more scattered than those of Case (iv). The risks of minimax estimators have been computed by summing the simulated values of E1 and E2 , given in (4.8), based on 10,000 independent replications. Table 2 shows how minimax estimators reduce E1 , E2 and the risk of JS in Case (i) for p = 5 with n = 5 and 100, where values in parentheses indicate estimates of standard errors. In Table 2, it follows that Risk = E1 + E2 . Since JS and PP have the same E1 , PP improves only on E2 of JS. LT has the same E1 of TR and improves on E2 of TR. The risk of UT is similar to that of LT, but for n = 5, E1 and E2 of UT are very different from those of LT. When n = 5, ST produces the most reduction in risk by considerably improving on both E1 and E2 of JS. When n = 100, AP substantially reduce E2 of JS and then it has the best performance. Table 3 illustrates the relative risks of minimax estimators over JS in Cases (i)–(iv) for p = 5, 10 and 20 with n = p, 10p has been computed by )/ JS ), where JS ) and ) stand, and 20p. The relative risk of a minimax estimator Σ R(Σ R(Σ R(Σ R(Σ . From Table 3, we can find out several interesting observations as follows. respectively, for the simulated risks of JS and Σ (1) TR produces only a slight improvement over JS in almost the cases and, moreover, the performance of TR worsens with increasing n. The relative risks of TR are smaller than those of PP only when n = p = 5 of Cases (i) and (ii).
202
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
(2) As analytically shown in Section 4, LT improves on TR. Our simulations suggest that the additional improvement of TR by LT is substantial in Case (i). (3) PP is not always dominated by LT (see Case (i) with p = 20). (4) In Case (i), PP, LT, UT and AP decrease their relative risks as n increases. For large p, it would be more effective to shrink the below-diagonal elements of T . However, the performances of the shrinkage estimators deteriorate when the belowdiagonal elements of G are away from zeros such as Cases (iii) and (iv). (5) In almost the cases, AP dominates both LT and UP. If there is quite a difference between performances of LT and UP (see Case (iv) with p = 10 and 20), then AP is superior to either LT or UP. (6) AP has nice performance in large n and p, particularly in Case (i). In Case (iii) except when n = p = 5, AP is better than the other estimators. (7) ST performs well in the very small sample case n = p or in Case (iv), but ST rapidly loses in performance as n is large. 7. Remarks on directions for future work We conclude this paper with pointing out directions for future work. This paper considered the partial Iwasawa decomposition (PID) and it was applied to covariance estimation. The estimation problem with the Stein loss was shown to be changed by simultaneous estimation of some variances and some mean vectors. The PID produced numerically effective methods for minimax estimation of the covariance matrix. In the next future study, the benefit of estimation methods via the PID will be examined by applying to real data. The purpose of this paper is to obtain minimax estimators under the Stein loss via the PID. The Stein loss is invariant under a scale transformation and the invariance plays an important role for the minimaxity proof. Improvements on , Σ ) = tr (Σ −1 Σ − Ip )2 and minimax estimators are very interesting under other invariant loss functions such as LQ (Σ −1 −1 LP (Σ , Σ ) = tr Σ Σ − ln det Σ Σ − p, where the LP loss has been employed for estimation of the precision matrix Σ −1 rather than of the covariance matrix Σ . On the other hand, recently, non-invariant loss functions are often handled in the
, Σ ) = ∥Σ − Σ ∥ = tr (Σ − Σ )2 literature. Typical non-invariant losses are the Frobenius-norm loss LF (Σ
1/2
and the
, Σ ) = ∥Σ − Σ ∥2 . In our future works, the PID will be considered for improvements squared Frobenius-norm loss LF 2 (Σ under these losses. In recent years, there are many studies being performed for estimation of a high-dimensional covariance matrix such that p (dimension of Σ ) is larger than n (degrees of freedom for the Wishart matrix). The p > n problem causes the singularity of the Wishart matrix and it is not easy to handle. However, the PID can be expected to be useful for the p > n problem. Therefore such research area will also be tried at all costs. Kubokawa’s [9] IERD method for variance estimation provides a unified procedure including Brewster–Zidek’s [2] Bayesian procedure which is closely related to a lower limit function for a sufficient condition of the IERD method. Therefore, in covariance estimation, the IERD method may be available to proving the minimaxity of generalized Bayes estimators. (Φ0 ) and Σ (Φ0 , ψ0 ) are characterized as generalized Bayes estimators. Specifically, it is an interesting question whether Σ Acknowledgments The author is grateful to the editor and two reviewers for their helpful comments and suggestions. The work is supported by Grant-in-Aid for Scientific Research (15K00055), Japan. Appendix A.1. LDL⊤ decomposition of Σ For i = 2, . . . , p, let B0,i be a lower triangular matrix of order p such that
B0,i
1
0
0 = −βi,1
..
.
.
1
···
−βi,i−1
1
..
.
0
(A.1)
1
Note that
A1 O(p−i)×i
Oi×(p−i) Ip−i
Ii A2
Oi×(p−i) A3
=
A1 A2
Oi×(p−i) Ip−i
Ii
O(p−i)×i
Oi×(p−i) A3
=
A1 A2
Oi×(p−i) A3
,
(A.2)
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
203
where A1 , A2 and A3 are, respectively, i×i, (p−i)×i and (p−i)×(p−i) matrices. It follows from (A.2) that B0,2 B0,3 · · · B0,p = B, where B is defined in (2.2). Repeatedly using (2.1) and (A.1), we can see that Σ −1 is expressed uniquely as
Σ −1 = Σp−1 = B⊤ 0,p
−1 Σp−1
0p−1
σp−2
0⊤ p−1
Σp−−12
B0,p
0
σp−−21
⊤ = B⊤ 0,p B0,p−1
B0,p−1 B0,p
σp
−2
0
= ··· −2 ⊤ ⊤ ⊤ −2 = B⊤ 0,p B0,p−1 · · · B0,3 B0,2 diag(σ1 , . . . , σp )B0,2 B0,3 · · · B0,p−1 B0,p
= B⊤ diag(σ1−2 , . . . , σp−2 )B. Hence Σ −1 can be decomposed as (2.2). Similarly, let 1
0
G0,i
=
..
.
1
γi+1,i .. . γp,i
0
1
..
.
0
(A.3)
1
for i = 1, . . . , p − 1. Note by (A.2) that G0,1 G0,2 · · · G0,p−1 = G, where G is defined in (4.3). Using (4.2) and (A.3), we can obtain the LDL⊤ decomposition for Σ , which is of the form (4.3). A.2. Proof of Lemma 2.1 The probability density function (p.d.f.) of T is given by p
1
1 exp − tr Σ −1 TT ⊤ tin,i−i , p ( n − 2 )/ 2 n / 2 2 (det Σ ) Γp (n/2) 2 i =1
(A.4)
where
Γp (n/2) = π p(p−1)/4
p n − i + 1 Γ
2
i =1
with the gamma function Γ (·). The p.d.f. (A.4) can be derived from combining Theorem 3.2.1 and Lemmas 3.2.2 and 3.2.3 of [15]. Using the same arguments as in (2.5), we obtain tr Σ −1 TT ⊤ =
p p si 1 + (xi − βi )⊤ Ti−1 Ti⊤−1 (xi − βi ). 2 2 σ σ i i i =1 i=2
(A.5)
1/2 For i = 2, . . . , p, the Jacobian of transformation ti → xi is given by J (ti → xi ) = det Ti−1 = (det Ti−1 Ti⊤ . Also for −1 )
−1/2 i = 1, . . . , p, the Jacobian of transformation si = ti2,i is given by J (ti,i → si ) = 2−1 si . Combining (A.4) and (A.5), we can express the joint density function of the si ’s and the xi ’s as p
1
2p(n−2)/2 (det Σ )n/2 Γp (n/2) i=1
=
p
(n−i)/2 −si /(2σ 2 ) i
si
e
e
(n−i+1)/2 σ n−i+1 Γ ((n − i + 1)/2) i i =1 2
which completes the proof.
p
e
2 −(xi −βi )⊤ Ti−1 Ti⊤ −1 (xi −βi )/(2σi )
i=2
(n−i+1)/2−1 −si /(2σ 2 ) i
si
×
×
p (det Ti−1 Ti⊤−1 )1/2 i =2
(2π σi2 )(i−1)/2
×
p
J (ti,i → si ) ×
i=1
e
2 −(xi −βi )⊤ Ti−1 Ti⊤ −1 (xi −βi )/(2σi )
p i=2
,
J (ti → xi )
204
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
A.3. Proof of Theorem 3.1 The proof of Theorem 3.1 requires the following lemma, which is given in [9, Lemma 2.1]. Lemma A.1. Let h(x) be integrable and nondecreasing in x. Suppose that, for an integrable function K (x), there exists x0 such that K (x) < 0 for x < x0 and K (x) ≥ 0 for x ≥ x0 . Then we have ∞
h(x)K (x)dx ≥ h(x0 )
∞
K (x)dx. 0
0
Proof of Theorem 3.1. The proof is performed along the same lines as in [9, Section 2.4]. JS (Φ ) as Write φ1 (w1 ) = d1 . Using (3.3), we express the decomposed loss (2.7) of Σ
(Φ ), Σ ) = L(Σ
p
L0 (φi (wi )si /σi2 ) +
i=1
p 1 (xi − βi )⊤ Ti−1 Φi−1 Ti⊤−1 (xi − βi ), 2 σ i i=2
where L0 (y) = y − ln y − 1. Denote by Ei−1 the expectation with respect to Ti−1 and by Ei|i−1 the conditional expectation −1 with respect to si and xi given Ti−1 . Recall that xi |Ti−1 ∼ Ni−1 (βi , σi2 (Ti−1 Ti⊤ −1 ) ) for i = 2, . . . , p. Since Φi−1 depends on Ti−1 but not on si and xi , we obtain E
p p ⊤ 2 (xi − βi )⊤ Ti−1 Φi−1 Ti⊤−1 (xi − βi )/σi2 = Ei−1 tr Ti−1 Φi−1 Ti⊤ −1 Ei|i−1 (xi − βi )(xi − βi ) /σi i=2
i=2
=
p
Ei−1 [tr Φi−1 ] = E
p i −1
i=2
p φj (wj ) = E (p − i)φi (wi ) ,
i=2 j=1
i =1
which gives that
(Φ ), Σ ) = E R(Σ
p
L0 (φi (wi )si /σi2 ) + (p − i)φi (wi )
.
i=1
Similarly, the risk of Σ JS is written as
JS , Σ ) = E R(Σ
p
JS L0 di si
(
/σ ) + (p − i) 2 i
JS di
.
i =1
JS and Σ (Φ ) can be represented by Thus the difference in risk of Σ JS , Σ ) − R(Σ (Φ ), Σ ) = R(Σ
p
Ei−1 [∆i (φi )],
i=2
where
JS JS ∆i (φi ) = Ei|i−1 L0 (di si /σi2 ) − L0 (φi (wi )si /σi2 ) + (p − i) di − φi (wi ) . Using the IERD method with Assumption (i) leads to
∆i (φi ) = Ei|i−1
∞
dt
1
∞
d
= Ei|i−1
L0 (φi (t wi )si /σi2 ) + (p − i)
d dt
φi (t wi ) dt
L′0 (φi (t wi )si /σi )φi′ (t wi )wi si /σi2 + (p − i)φi′ (t wi )wi dt ,
1
where L0 (y) = 1 − 1/y. ⊤ ⊤ 2 2 ⊤ 2 2 It is noted that, for i = 2, . . . , p, ui = x⊤ i Ti−1 Ti−1 xi /σi |Ti−1 ∼ χi−1 (λi ) with λi = βi Ti−1 Ti−1 βi /σi , where χi−1 (λi ) is the noncentral chi-square distribution with i − 1 degrees of freedom and the noncentrality parameter λi . We denote by f (u|λi ) the p.d.f. of ui and by p(v) the p.d.f. of si /σi2 , so that ′
∆i (φi ) =
∞
∞
0
0
∞
u u uu v φi′ t u + (p − i)φi′ t f (u|λi )p(v)dtdudv. v v v v
L′0 φi t
1
Making the change of variables w = (t /v)u with dw = (t /v)du yields that
∆i (φi ) =
∞
0
∞
0
∞
L′0 (φi (w)v)φi′ (w)
1
wv 2 t2
+ (p − i)φi′ (w)
wv wv f λi p(v)dtdv dw. 2 t
t
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
205
Next, we make the change of variables x = wv/t with dx = −(wv/t 2 )dt, which gives that
∆i (φi ) =
∞
0 ∞ =
∞
wv
L′0 (φi (w)v)φi′ (w)v + (p − i)φi′ (w) f (x|λi )p(v)dxdv dw
0
0
φi (w) ′
∞
L′0 (φi (w)v)v + p − i F (wv|λi )p(v)dv dw
0 ∞
0 ∞ v− φi (w)
0
0
=
1
′
φi (w)
+ p − i F (wv|λi )p(v)dv dw,
wv
where F (wv|λi ) = 0 f (x|λi )dx. Let hw (v) = F (wv|λi )/F (wv|0) and Kw (v) = (v − v0 )F (wv|0)p(v) with v0 = 1/φi (w) − p + i. Using Assumption (i) gives that φi (w) ≤ (n + p − 2i + 1)−1 < (p − i)−1 , which implies that v0 > 0. It also follows that Kw (v) < 0 for v < v0 and Kw (v) ≥ 0 for v ≥ v0 . It is easy to show that hw (v) is nondecreasing in v since f (x|λi )/f (x|0) is nondecreasing in x. Noting that φi′ (w) ≥ 0 and using Lemma A.1, we can see that
∆i (φi ) =
∞
φi′ (w)
0 ∞ ≥
∞
hw (v)Kw (v)dv dw 0
∞
φi′ (w)hw (v0 )
0
Kw (v)dv dw. 0
(Φ ) dominates Σ JS relative to the Stein loss (1.2) if, for i = 2, . . . , p, Hence Σ ∞
∞
Kw (v)dv = 0
v−
0
1
+ p − i F (wv|0)p(v)dv ≥ 0,
φi (w)
namely,
∞ φi (w) ≥ ∞ 0
F (wv|0)p(v)dv
0
(v + p − i)F (wv|0)p(v)dv
.
It is observed that
∞ 0
∞ 0
∞ wv
v F (wv|0)p(v)dv
0 = ∞0 wv
F (wv|0)p(v)dv
0 w 0 ∞ = w0 ∞0 0
0
y(i−1)/2−1 e−y/2 · v (n−i+1)/2 e−v/2 dy dv
y(i−1)/2−1 e−y/2 · v (n−i+1)/2−1 e−v/2 dy dv x(i−1)/2−1 · v n/2 e−v(1+x)/2 dv dx
x(i−1)/2−1 · v n/2−1 e−v(1+x)/2 dv dx
= n/g0,i (w), where the second equality is verified by the change of variables y = v x with dy = v dx. Thus we obtain
∞ φi (w) ≥ =
∞
F (wv|0)p(v)dv/
0
1 + (p − i)
∞ 0
v F (wv|0)p(v)dv ∞ F (wv|0)p(v)dv/ 0 v F (wv|0)p(v)dv
g0,i (w) n + (p − i)g0,i (w)
which completes the proof.
0
= φ0,i (w),
A.4. Proof of Lemma 4.1 It follows from (4.2) and (4.4) that tr Σ(−j)1 S(j) = tr
0p−j
= tr =
1
sj
σj2
σj
−2
x⊤ (j) Ip−j
Σ(−j)1
1 x(j)
+ (x(j) − γ (j) ) ∗
⊤
0⊤ p−j Ip−j
Σ(−j+11) (x(j)
sj
0p−j
− γ (j) )
0⊤ p−j S(j+1)
∗ Σ(−j+11)
sj
0p−j
+ sj (x(j) − γ (j) )⊤ Σ(−j+11) (x(j) − γ (j) ) + tr Σ(−j+11) S(j+1) .
0⊤ p−j S(j+1)
(A.6)
206
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
Repeatedly using (A.6) from j = 1 to a given i (≤ p − 1), we can rewrite tr Σ −1 S as i i sj + sj (x(j) − γ (j) )⊤ Σ(−j+11) (x(j) − γ (j) ) + tr Σ(−i+11) S(i+1) . σj2 j =1 j =1
tr Σ −1 S =
(A.7)
Combining (A.4) and (A.7), we complete the proof of Lemma 4.1 by the same arguments as in the proof of Lemma 2.1. A.5. Proofs of Proposition 4.1 and Theorem 4.1
(Φ ). To begin with, we state two lemmas that will be needed to establish a method for improvement on Σ In estimation of the mean vector of a multivariate normal distribution, the James and Stein [5] type shrinkage estimator is well-known as an improved estimator on the maximum likelihood estimator under a quadratic loss. This result is described in the following lemma: Lemma A.2. Let z ∼ Np (µ, Ω ) and V ∼ Wp (n, Ω ), where µ and Ω are unknown and z and V are mutually independent. Consider the problem of estimating the mean vector µ relative to quadratic loss LQ ( µ, µ) = ( µ − µ)⊤ Ω −1 ( µ − µ), where µ is an estimator of µ. If n ≥ p ≥ 3, then the James and Stein shrinkage estimator
c0 p−2 µJS = 1 − z with w = z ⊤ V −1 z and c0 = w n−p+3 dominates z relative to the quadratic loss LQ . The proof of Lemma A.2 is given in [5, Section 2]. In the literature, there are many approaches to improving on the James and Stein shrinkage estimator. Kubokawa [7] has proposed a unified approach based on the IERD (integral expression of risk difference) method. Let us use the same notation as Lemma A.2 and consider the class of estimators
ψ(w) µψ = 1 − z, w where ψ(v) is an absolutely continuous function of v . Kubokawa [7] has rewritten the difference in risk of µψ and µJS by means of an integral expression and constructed sufficient conditions on ψ(v) that µψ dominates µJS . This dominance result is summarized as the following lemma. Lemma A.3. Suppose that p ≥ 3. Assume that ψ(v) satisfies (i) ψ(v) is nondecreasing in v and limv→∞ ψ(v) = (p − 2)/(n − p + 3), (ii) ψ(v) ≥ ψ0 (v) for any v > 0, where
v ψ0 (v) = 0v 0
yp/2−1 (1 + y)−(n+1)/2−1 dy yp/2−2 (1 + y)−(n+1)/2−1 dy
.
Then µψ dominates µJS relative to the quadratic loss LQ . Proof of Proposition 4.1. It is important to note that T can be expressed by a block matrix ith column
Ti−1
ith row ti⊤ ∗
0i−1
O(i−1)×(p−i)
ti,i
0⊤ p−i
t(i)
T(i+1)
Ti−1 x⊤ Ti−1 = i ∗
0i−1 1/2
si
1/2
si
x(i)
O(i−1)×(p−i)
0⊤ p−i
T(i+1)
⊤ for a given i (= 2, . . . , p − 1), where the xi ’s are defined in Section 2. Recall that wi = x⊤ i Ti−1 Ti−1 xi /si is given in (3.1) and also that the columns of T are mutually independent, so that, for a given i,
wi ,
x(i)
and
S(i+1)
are conditionally independent given si = ti2,i . Hence, for each i, γ SH σi2SH are conditionally independent given si . (i) and
(Φ ) is equal to that of Σ (Φ , ψJS ). We can write E2 of Σ (Φ ) as Using (4.8) shows that E1 of Σ p−1 i=1
E i φi (wi )E (i+1)|i [si (x(i) − γ (i) )⊤ Σ(−i+11) (x(i) − γ (i) )] ,
H. Tsukuma / Journal of Multivariate Analysis 145 (2016) 190–207
207
where E (i+1)|i denotes the conditional expectation with respect to x(i) and S(i+1) given si and the first (i − 1) columns of T , (Φ , ψJS ) is written as and E i denotes the expectation with respect to si and the first (i − 1) columns of T . Similarly, E2 of Σ p−1
γ (i) − γ (i) )] . γ (i) − γ (i) )⊤ Σ(−i+11) ( E i φi (wi )E (i+1)|i [si ( JS
JS
i =1 JS
It can easily be shown by Lemma A.2 that, for each i, γ (i) dominates x(i) relative to the quadratic loss
γ (i) − γ (i) ), Li ( γ (i) , γ (i) |si ) = si ( γ (i) − γ (i) )⊤ Σ(−i+11) ( (Φ ) is larger than or equal to E2 of Σ (Φ , ψJS ). where the risk is defined by E (i+1)|i [Li ( γ (i) , γ (i) |si )], which implies that E2 of Σ Hence we observe that (Φ ), Σ ) ≥ R(Σ (Φ , ψJS ), Σ ). R(Σ Thus the proof is complete.
Proof of Theorem 4.1. The proof of Theorem 4.1 is achieved by using Lemma A.3 with replacing (n, p) by (n − i, p − i), where i = 1, . . . , p − 3. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23]
A.J. Baranchik, A family of minimax estimators of the mean of a multivariate normal distribution, Ann. Math. Statist. 41 (1970) 642–645. J.F. Brewster, J.V. Zidek, Improving on equivariant estimators, Ann. Statist. 2 (1974) 21–38. D.K. Dey, C. Srinivasan, Estimation of a covariance matrix under Stein’s loss, Ann. Statist. 13 (1985) 1581–1591. G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins University Press, Baltimore, 1996. W. James, C. Stein, Estimation with quadratic loss, in: Proc. Fourth Berkeley Symp. Math. Statist. Probab., Vol. 1, University of California Press, Berkeley, 1961, pp. 361–379. Y. Konno, Estimation of a normal covariance matrix with incomplete data under Stein’s loss, J. Multivariate Anal. 52 (1995) 308–324. T. Kubokawa, A unified approach to improving equivariant estimators, Ann. Statist. 22 (1994) 290–299. T. Kubokawa, The Stein phenomenon in simultaneous estimation: A review, in: S.E. Ahmed, M. Ahsanullah, B.K. Sinha (Eds.), Applied Statistical Science III, Nova Science Publishers, New York, 1998, pp. 143–173. T. Kubokawa, Shrinkage and modification techniques in estimation of variance and the related problems: A review, Comm. Statist. Theory Methods 28 (1999) 613–650. K. Lee, J.K. Yoo, Bayesian Cholesky factor models in random effects covariance matrix for generalized linear mixed models, Comput. Statist. Data Anal. 80 (2014) 111–116. S.P. Lin, M.D. Perlman, A Monte Carlo comparison of four estimators for a covariance matrix, in: P.R. Krishnaiah (Ed.), Multivariate Analysis VI, North-Holland, Amsterdam, 1985, pp. 411–429. W.L. Loh, Estimating covariance matrices, Ann. Statist. 19 (1991) 283–296. T. Ma, L. Jia, Y. Su, A new estimator of covariance matrix, J. Statist. Plann. Inference 142 (2012) 529–536. M. Pourahmadi, Foundations of Time Series Analysis and Prediction Theory, Wiley, New York, 2001. M.S. Srivastava, C.G. Khatri, An Introduction to Multivariate Statistics, North Holland, New York, 1979. C. Stein, Some problems in multivariate analysis, Part I, technical reports No.6, Department of Statistics, Stanford University, 1956. C. Stein, Inadmissibility of the usual estimator for the variance of a normal distribution with unknown mean, Ann. Inst. Statist. Math. 16 (1964) 155–160. C. Stein, Estimation of a covariance matrix, Rietz Lecture, in: 39th Annual Meeting IMS, Atlanta, GA, 1975. C. Stein, Lectures on the theory of estimation of many parameters, in: I.A. Ibragimov, M.S. Nikulin (Eds.), Studies in the Statistical Theory of Estimation I, in: Proceedings of Scientific Seminars of the Steklov Institute, Leningrad Division, vol. 74, 1977, pp. 4–65. A. Terras, Harmonic Analysis on Symmetric Spaces and Applications II, Springer-Verlag, New York, 1988. H. Tsukuma, Minimax covariance estimation using commutator subgroup of lower triangular matrices, J. Multivariate Anal. 124 (2014) 333–344. H. Tsukuma, T. Kubokawa, Modifying estimators of ordered positive parameters under the Stein loss, J. Multivariate Anal. 102 (2011) 164–181. W. Zhang, C. Leng, A moving average Cholesky factor model in covariance modelling for longitudinal data, Biometrika 99 (2012) 141–150.