Identifying the number of factors using a white noise test

Statistics and Probability Letters 152 (2019) 92–99

Shuangbo Li∗, Li-Xin Zhang
School of Mathematical Sciences, Zhejiang University, China
∗ Correspondence to: School of Mathematical Sciences, Yuquan Campus, Zhejiang University, Hangzhou, China. E-mail address: [email protected] (S. Li).

Article history: Received 7 March 2018; received in revised form 22 December 2018; accepted 20 April 2019; available online 11 May 2019.

Abstract: We propose a new method to estimate the number of factors in a factor model, based on a new test for vector white noise in high-dimensional settings. We provide the theoretical basis for this method, and simulation results indicate that it outperforms previous methods in some finite-sample cases.

MSC: 62M10; 62H15; 62F05; 62F40
Keywords: Autocorrelation; Normal approximation; White noise test; Parametric bootstrap; Factor models

1. Introduction

The analysis of high-dimensional time series has received considerable attention in recent years. Factor models play an important role in this field, mainly because of their capability for dimension reduction. Recent developments in factor analysis have highlighted the need for inference when the dimension $p$ of the time series is as large as, or larger than, the sample size $n$. A key aspect of factor models is the determination of the number $r$ of common factors, and several attempts have been made to estimate $r$ consistently. Bai and Ng (2002) were the first to propose a consistent estimator of $r$ for static models. Lam and Yao (2012) studied the matrix $\hat M = \sum_{k=1}^{k_0} \hat\Sigma_y(k)\hat\Sigma_y(k)^T$, where $\hat\Sigma_y(k) = \frac{1}{n}\sum_{t=1}^{n-k} Y_{t+k}Y_t^T$. They found that the ratio of two adjacent eigenvalues of $\hat M$ changes significantly at $r$, so they defined the estimator
$$\hat r = \mathop{\arg\min}_{1\le i\le R} \hat\lambda_{i+1}/\hat\lambda_i,$$

where $r < R < p$ is a constant, usually taken to be $p/2$. Xia et al. (2015) proposed two estimators of $r$: one is a ridge-type ratio estimator, which modifies the above estimator by adding a positive value $c$ to the eigenvalues $\hat\lambda_k$, $k = i, i+1$; the other is a BIC-type estimator. However, these estimators can only estimate factors of the same strength. Li et al. (2017) studied the limit of the eigenvalues of $\hat\Sigma_y(1)\hat\Sigma_y(1)^T$ and established a ``phase transition phenomenon''. Taking advantage of this phenomenon, they proposed an eigenvalue-ratio-based estimator and argued that, in the high-dimensional context, it can consistently estimate $r$ even when the factors are of different strengths.


However, this method only uses the eigenvalues of $\hat\Sigma_y(1)\hat\Sigma_y(1)^T$, and it does not work when the dependence is not predominant at lag 1; our simulation in Section 3 illustrates this point. Previous research has paid attention only to the eigenvalues of $\hat M$, and there has been little discussion of the use of its eigenvectors. In this paper, we use a white noise test to estimate $r$; the test makes use of the eigenvectors of $\hat M$ and works across different settings.

The rest of this paper is organized as follows. In Section 2, the model is introduced, the method is outlined, and its theoretical properties are established. In Section 3, detailed simulations are conducted to check the finite-sample properties of the proposed estimator and to compare it with the estimators of Lam and Yao (2012) and Li et al. (2017). Section 4 concludes.

2. Models

We consider the high-dimensional factor model: the observation $Y$ is a $p \times n$ matrix. For each $t = 1, 2, \ldots, n$, let $Y_t$ denote the $p$-dimensional vector observed at time $t$. We decompose it into two parts, an $r$-dimensional common factor time series $X_t$ and a $p$-dimensional vector white-noise process $\epsilon_t$:
$$Y_t = AX_t + \epsilon_t, \qquad (2.1)$$

where $A$ is the factor loading matrix of size $p \times r$, and our goal is to estimate $r$; we denote the true number of factors by $r_0$. The model (2.1) is not identifiable: it remains unchanged if we replace $(A, X_t)$ by $(AH, H^{-1}X_t)$ for any invertible $r_0 \times r_0$ matrix $H$, so we may assume that $A^T A = I_{r_0}$. The factor loading space, i.e. the $r_0$-dimensional linear space spanned by the columns of $A$ and denoted by $\mathcal{M}(A)$, is however uniquely defined. Estimating $\mathcal{M}(A)$ is equivalent to estimating its orthogonal complement $\mathcal{M}(B)$, where $B = (b_1, \ldots, b_{p-r_0})$ is a $p \times (p - r_0)$ matrix. Multiplying both sides of (2.1) by $A^T$ and $B^T$ separately, we obtain that $A^T Y_t = X_t + A^T\epsilon_t$ is not white noise while $B^T Y_t = B^T\epsilon_t$ is. Let $A$ and $B$ be formed by orthonormal columns and denote by $C = (A, B)$ the $p \times p$ orthogonal matrix combining the columns of $A$ and $B$. For any given $r_1$, let $C_{r_1}$ be the last $p - r_1 + 1$ columns of $C$. This suggests a sequential test: for $r_1 = 1, 2, 3, \ldots$ we test whether $C_{r_1}^T Y_t$ is white noise, which, from the above, is equivalent to $r_1 \ge r_0 + 1$. So for each $r_1$ we consider the hypothesis testing problem
$$H_0(r_1): r \le r_1 \quad \text{versus} \quad H_1(r_1): r > r_1,$$
and we take
$$\hat r = \{\text{first } r_1 \ge 1 \text{ such that } H_0(r_1) \text{ cannot be rejected}\} - 1.$$
Let $\hat\Sigma_y(k) = \frac{1}{n}\sum_{t=1}^{n-k} Y_{t+k}Y_t^T$. To get a consistent estimator of $\Sigma_y(k)$, we employ the threshold estimator

$$\mathrm{Tr}_u(\hat\Sigma_y(k)) = \big(\hat\sigma^k_{i,j}\, I\{|\hat\sigma^k_{i,j}| \ge u(k)\}\big)_{i,j=1,\ldots,p}.$$
Our thresholding method is similar to that of Bickel and Levina (2008), and we explain it explicitly in Section 3. Let $\hat M = \sum_{k=1}^{k_0} \mathrm{Tr}_u(\hat\Sigma_y(k))\mathrm{Tr}_u^T(\hat\Sigma_y(k))$ and consider its eigen-decomposition $\hat M = \hat J\hat\Gamma\hat J^T$, with the eigenvalues arranged in descending order. Similarly, we denote $\Sigma_y(k) = \mathrm{Cov}(Y_{t+k}, Y_t)$ and $M = \sum_{k=1}^{k_0}\Sigma_y(k)\Sigma_y(k)^T = J\Gamma J^T$.
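To make the construction concrete, the following is a minimal sketch of the estimator $\hat M$ in Python, assuming the data $Y$ are stored as a $p \times n$ array and a fixed threshold $u$ is supplied; the names lag_cov, hard_threshold and build_M_hat are ours, and the data-driven choice of $u(k)$ is deferred to Section 3.

```python
# Minimal sketch of M-hat; "u" is a user-supplied threshold here, not the
# data-driven u(k) of Section 3.
import numpy as np

def lag_cov(Y, k):
    """Sample lag-k autocovariance Sigma_y-hat(k) = (1/n) sum_t Y_{t+k} Y_t^T."""
    p, n = Y.shape
    return Y[:, k:] @ Y[:, :n - k].T / n

def hard_threshold(S, u):
    """Hard-thresholding operator Tr_u: keep entries with |s_ij| >= u."""
    return S * (np.abs(S) >= u)

def build_M_hat(Y, k0, u):
    """M-hat = sum_{k=1}^{k0} Tr_u(Sigma_y-hat(k)) Tr_u(Sigma_y-hat(k))^T,
    with eigenvalues/eigenvectors returned in descending order."""
    p = Y.shape[0]
    M = np.zeros((p, p))
    for k in range(1, k0 + 1):
        S = hard_threshold(lag_cov(Y, k), u)
        M += S @ S.T
    evals, evecs = np.linalg.eigh(M)          # eigh returns ascending order
    return M, evals[::-1], evecs[:, ::-1]     # flip to descending
```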

Let $\hat\gamma_i$ be the $i$th eigenvector of $\hat M$, $\hat B = (\hat\gamma_{r_1+1}, \ldots, \hat\gamma_p)$, and $\hat z_{t,i} = \hat\gamma_i^T Y_t$. In vector form, let $\hat Z_t = \hat B^T Y_t$, $\hat\Sigma_{\hat Z}(k) = \frac{1}{n}\sum_{t=1}^{n-k} \hat Z_{t+k}\hat Z_t^T$, and $\hat\Gamma(k) = \mathrm{diag}\{\hat\Sigma_{\hat Z}(0)\}^{-1/2}\,\hat\Sigma_{\hat Z}(k)\,\mathrm{diag}\{\hat\Sigma_{\hat Z}(0)\}^{-1/2}$. Denote $\hat\Gamma(k) = \{\hat\rho_{ij}(k)\}_{i,j=1}^{p-r_1}$, and let $T_{n,k}(\hat B) = \max_{1\le i,j\le p-r_1} \sqrt{n}\,|\hat\rho_{ij}(k)|$ and $T_n(\hat B) = \max_{1\le k\le K} T_{n,k}(\hat B)$, where $K \ge 1$ is a prescribed integer. Similarly, let $\gamma_i$ be the $i$th eigenvector of $M$, $B = (\gamma_{r_1+1}, \ldots, \gamma_p)$, $Z_t = B^T Y_t$, and $\hat\Sigma_Z(k) = \frac{1}{n}\sum_{t=1}^{n-k} Z_{t+k}Z_t^T$; $T_n(B)$ is defined in a similar way.
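The statistic $T_n(\hat B)$ is simply the largest scaled absolute cross-correlation over the first $K$ lags. A sketch under the same conventions as above, with $Z$ a $(p - r_1) \times n$ array holding the transformed observations $\hat B^T Y_t$ (the function name is ours):

```python
import numpy as np

def max_cross_corr_stat(Z, K):
    """T_n = max_{1<=k<=K} max_{i,j} sqrt(n) |rho_ij(k)| for the rows of Z."""
    d, n = Z.shape
    sd = np.sqrt(np.diag(Z @ Z.T / n))        # diag{Sigma_Z(0)}^{1/2}
    T = 0.0
    for k in range(1, K + 1):
        Sk = Z[:, k:] @ Z[:, :n - k].T / n    # Sigma_Z-hat(k)
        rho = Sk / np.outer(sd, sd)           # normalise to cross-correlations
        T = max(T, np.sqrt(n) * np.abs(rho).max())
    return T
```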

We reject $H_0(r_1)$ if $T_n(\hat B) > cv_\alpha$, where $cv_\alpha$ is the critical value determined by $P(T_n(\hat B) > cv_\alpha) = \alpha$ under $H_0(r_1)$, and $\alpha$ is the significance level of the test. To determine $cv_\alpha$, following the idea of Chang et al. (2017b), we first need to show that the Kolmogorov distance between the distribution of $T_n(\hat B)$ and that of the $L_\infty$-norm of an $N(0, \Xi_n)$ random vector converges to 0, where $\Xi_n = (I_K \otimes W_{\hat Z})E(\xi_n\xi_n^T)(I_K \otimes W_{\hat Z})$, $\xi_n = \sqrt{n}\,(\mathrm{vec}\{\hat\Sigma_{\hat Z}(1)\}^T, \ldots, \mathrm{vec}\{\hat\Sigma_{\hat Z}(K)\}^T)^T$, $W_{\hat Z} = \mathrm{diag}\{\Sigma_{\hat Z}(0)^{-1/2}\} \otimes \mathrm{diag}\{\Sigma_{\hat Z}(0)^{-1/2}\}$, and $\Sigma_{\hat Z}(0) = \mathrm{Cov}(\hat Z_t, \hat Z_t)$. We can then evaluate $cv_\alpha$ by drawing bootstrap samples from $N(0, \hat\Xi_n)$, where $\hat\Xi_n$ is an appropriate estimator of $\Xi_n$.

Theorem 2.1. Let Conditions 1–8 hold and $G \sim N(0, \Xi_n)$. Suppose $\log p \le Cn^{\delta_1}$ for some positive constants $\delta_1$ and $C$. Then it holds under $H_0(r_1)$ that

$$\sup_{s\ge 0}\big|P(T_n(\hat B) > s) - P(|G|_\infty > s)\big| \to 0, \quad \text{as } n \to \infty.$$

Proof. Let
$$\mu_Z = (I_K \otimes W_Z)\{\mathrm{vec}\{\hat\Sigma_Z(1)\}^T, \ldots, \mathrm{vec}\{\hat\Sigma_Z(K)\}^T\}^T, \qquad \hat\mu_Z = (I_K \otimes \hat W_Z)\{\mathrm{vec}\{\hat\Sigma_Z(1)\}^T, \ldots, \mathrm{vec}\{\hat\Sigma_Z(K)\}^T\}^T,$$


$$\mu_{\hat Z} = (I_K \otimes W_{\hat Z})\{\mathrm{vec}\{\hat\Sigma_{\hat Z}(1)\}^T, \ldots, \mathrm{vec}\{\hat\Sigma_{\hat Z}(K)\}^T\}^T, \qquad \hat\mu_{\hat Z} = (I_K \otimes \hat W_{\hat Z})\{\mathrm{vec}\{\hat\Sigma_{\hat Z}(1)\}^T, \ldots, \mathrm{vec}\{\hat\Sigma_{\hat Z}(K)\}^T\}^T,$$
and let $\hat\Omega_{\hat Z} = \sqrt{n}\max_{1\le l\le (p-r_1)^2K} |\hat\mu_l^{(\hat Z)}|$, $\Omega_{\hat Z} = \sqrt{n}\max_{1\le l\le (p-r_1)^2K} |\mu_l^{(\hat Z)}|$ and $V = \max_{1\le l\le (p-r_1)^2K} |G_l|$. Furthermore, let $d_n = \sup_{s\ge 0}|P(\Omega_{\hat Z} > s) - P(V > s)|$ and $\hat d_n = \sup_{s\ge 0}|P(\hat\Omega_{\hat Z} > s) - P(V > s)|$. Then for any $s \ge 0$ and $\varepsilon > 0$,
$$P(\hat\Omega_{\hat Z} > s) - P(V > s) \le P(\Omega_{\hat Z} > s - \varepsilon) - P(V > s - \varepsilon) + P(|\hat\Omega_{\hat Z} - \Omega_{\hat Z}| > \varepsilon) + P(s - \varepsilon < V \le s) \le d_n + P(|\hat\Omega_{\hat Z} - \Omega_{\hat Z}| > \varepsilon) + P(s - \varepsilon < V \le s).$$
Similarly, we can get the reverse inequality. Hence
$$\hat d_n \le d_n + P(|\hat\Omega_{\hat Z} - \Omega_{\hat Z}| > \varepsilon) + \sup_{s\ge 0} P(|V - s| \le \varepsilon).$$
By the anti-concentration inequality for Gaussian random variables, $\sup_{s\ge 0} P(|V - s| \le \varepsilon) \le C\varepsilon\{\log(p/\varepsilon)\}^{1/2}$, so it suffices to bound the other two terms. Note that
$$|\hat\Omega_{\hat Z} - \Omega_{\hat Z}| \le |\hat W_{\hat Z} - W_{\hat Z}|_\infty \max_k \sqrt{n}\,|\mathrm{vec}\,\hat\Sigma_{\hat Z}(k)|_\infty \le |\hat W_{\hat Z} - W_{\hat Z}|_\infty \max_k \sqrt{n}\,\big(|\mathrm{vec}\,\hat\Sigma_{\hat Z}(k) - \mathrm{vec}\,\hat\Sigma_Z(k)|_\infty + |\mathrm{vec}\,\hat\Sigma_Z(k) - \mathrm{vec}\,\Sigma_Z(k)|_\infty\big),$$
where the second step uses that $\Sigma_Z(k) = 0$ for $k \ge 1$.

Since $\hat z_{t,i} = z_{t,i} + o_P(1)$ for each fixed $t$ and $i$ by Lemma 2.3, $\hat z_{t,i}$ is also a $\beta$-mixing process with the same decay rate of the mixing coefficients for each $i$, by statement (2.4) of Bradley (2005). Following an argument similar to Lemma A1 of Chang et al. (2017b), we get $|\hat W_{\hat Z} - W_{\hat Z}|_\infty \le C\sqrt{\log p/n}$ with probability at least $1 - Cp^{-1}$. We can also repeat the argument of Lemma A2 of Chang et al. (2017b) to get $|\mathrm{vec}\,\hat\Sigma_{\hat Z}(k) - \mathrm{vec}\,\hat\Sigma_Z(k)|_\infty \le C\sqrt{\log p/n}$ and, since $Z_t = B^T Y_t$ is white noise, $|\mathrm{vec}\,\hat\Sigma_Z(k) - \mathrm{vec}\,\Sigma_Z(k)|_\infty \le C\sqrt{\log p/n}$, each with probability at least $1 - Cp^{-1}$. Hence $|\hat\Omega_{\hat Z} - \Omega_{\hat Z}| \le C\log p/\sqrt{n}$ with probability at least $1 - Cp^{-1}$. For $d_n$, following the argument of Lemma A4 of Chang et al. (2017b), we find that $d_n = o(1)$. This completes the proof of Theorem 2.1. □

Using the estimator $\hat\Xi_n$ defined in Section 2.1, we determine the critical value $\hat{cv}_\alpha$ by
$$P(|G|_\infty > \hat{cv}_\alpha) = \alpha,$$
where $G \sim N(0, \hat\Xi_n)$. In practice, we draw $G_1, \ldots, G_B$ independently from $N(0, \hat\Xi_n)$ for a large integer $B$ and take the $\lfloor B\alpha\rfloor$-th largest value among $|G_1|_\infty, \ldots, |G_B|_\infty$ as the critical value $\hat{cv}_\alpha$.

2.1. Estimation of $\Xi_n$

By Lemma 3.1 of Chernozhukov et al. (2013), our estimator is valid when $|\hat\Xi_n - \Xi_n|_\infty = o_p(1)$, so we can estimate $\Xi_n$ by a method similar to that of Chang et al. (2017b). To be explicit, let
$$f_t = \{\mathrm{vec}(\hat Z_{t+1}\hat Z_t^T)^T, \ldots, \mathrm{vec}(\hat Z_{t+K}\hat Z_t^T)^T\}^T \quad (t = 1, \ldots, n - K)$$
and
$$\hat J_n = \sum_{j=-n+K}^{n-K-1} \mathcal{K}\Big(\frac{j}{b_n}\Big)\hat H(j),$$
where $\hat H(j) = (n-K)^{-1}\sum_{t=j+1}^{n-K} f_t f_{t-j}^T$ if $j \ge 0$ and $\hat H(j) = (n-K)^{-1}\sum_{t=-j+1}^{n-K} f_{t+j}f_t^T$ otherwise. Here $\mathcal{K}(\cdot)$ is a symmetric kernel function that is continuous at 0 with $\mathcal{K}(0) = 1$, and $b_n$ is a bandwidth diverging with $n$. We adopt the quadratic spectral kernel

$$\mathcal{K}_{QS}(x) = \frac{25}{12\pi^2 x^2}\Big\{\frac{\sin(6\pi x/5)}{6\pi x/5} - \cos(6\pi x/5)\Big\},$$
which is optimal in the sense of minimizing the asymptotic truncated mean squared error of the estimator, as derived by Andrews (1991). We use this kernel in the numerical study with the data-driven bandwidth suggested in Andrews (1991); see also Chang et al. (2017b). With these constructions, we have

$$\hat\Xi_n = (I_K \otimes \hat W)\hat J_n(I_K \otimes \hat W). \qquad (2.2)$$
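A sketch of the whole bootstrap step, combining the kernel estimator (2.2) with the Gaussian draw described above; estimate_Xi and bootstrap_cv are illustrative names, $Z$ is the $(p - r_1) \times n$ array of $\hat Z_t$, and the bandwidth b_n is taken as given rather than chosen by the data-driven rule of Andrews (1991).

```python
import numpy as np

def quadratic_spectral(x):
    """Quadratic spectral kernel K_QS (Andrews, 1991), with K_QS(0) = 1."""
    if x == 0.0:
        return 1.0
    a = 6.0 * np.pi * x / 5.0
    return 25.0 / (12.0 * np.pi ** 2 * x ** 2) * (np.sin(a) / a - np.cos(a))

def estimate_Xi(Z, K, b_n):
    """Xi-hat = (I_K kron W-hat) J-hat_n (I_K kron W-hat), cf. (2.2)."""
    d, n = Z.shape
    m = n - K
    # f_t = {vec(Z_{t+1} Z_t^T)^T, ..., vec(Z_{t+K} Z_t^T)^T}^T, t = 1..n-K
    F = np.column_stack([
        np.concatenate([np.outer(Z[:, t + k], Z[:, t]).ravel()
                        for k in range(1, K + 1)])
        for t in range(m)])                          # (d^2 K) x (n - K)
    J = np.zeros((d * d * K, d * d * K))
    for j in range(-m + 1, m):                       # kernel-weighted H-hat(j)
        H = (F[:, max(j, 0):][:, :m - abs(j)]
             @ F[:, max(-j, 0):][:, :m - abs(j)].T) / m
        J += quadratic_spectral(j / b_n) * H
    w = 1.0 / np.sqrt(np.diag(Z @ Z.T / n))          # diag{Sigma(0)}^{-1/2}
    scale = np.tile(np.outer(w, w).ravel(), K)       # diagonal of I_K kron W
    return scale[:, None] * J * scale[None, :]

def bootstrap_cv(Xi, alpha, B=1000, seed=None):
    """floor(B*alpha)-th largest of |G_1|_inf, ..., |G_B|_inf, G_b ~ N(0, Xi)."""
    rng = np.random.default_rng(seed)
    G = rng.multivariate_normal(np.zeros(Xi.shape[0]), Xi, size=B,
                                check_valid="ignore", method="eigh")
    stats = np.sort(np.abs(G).max(axis=1))[::-1]
    return stats[int(np.floor(B * alpha)) - 1]
```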


2.2. Conditions

Condition 1. $\epsilon_t \sim \mathrm{WN}(\mu_\epsilon, \Sigma_\epsilon)$. If $c^T X_t$ is white noise, then $c^T\mathrm{Cov}(X_{t+k}, \epsilon_t) = 0$ for all $k > 0$. In addition, $A^T A = I_{r_0}$.

Condition 2. $X_t$ is weakly stationary and $\mathrm{Cov}(X_t, \epsilon_{t+k}) = 0$ for all $k \le 0$.

Condition 3. Denote $A\{\mathrm{Var}(X_t)\}^{1/2} = (a_{i,j})$. For some constant $\iota \in [0, 1)$,
$$\max_{1\le j\le r}\sum_{i=1}^p |a_{i,j}|^\iota \le s_1, \qquad \max_{1\le i\le p}\sum_{j=1}^r |a_{i,j}|^\iota \le s_2.$$

Condition 4. There exists $C_1$ independent of $p$ such that $\sigma_i^z \ge C_1$ for all $i = 1, \ldots, p - r_0$, where $Z_t = B^T Y_t = B^T\epsilon_t$ is white noise.

Condition 5. There exist $C_2, C_3 > 0$ and $\gamma_1 \in (0, 2]$ independent of $p$ such that, for any $z > 0$ and $\|\nu\|_2 = 1$, $\sup_t P(|\nu^T Y_t| > z) \le C_2\exp(-C_3 z^{\gamma_1})$.

Condition 6. $Z_t$ is $\beta$-mixing and there exist $C_4 > 0$ and $\gamma_2 \in (0, 1]$ independent of $p$ such that $\beta_k \le \exp(-C_4 k^{\gamma_2})$.

Condition 7. There exist $C_5 > 0$ and $\tau > 0$ independent of $p$ such that
$$C_5^{-1} < \liminf_{q\to\infty}\inf_{m>0} E\Big|\frac{1}{q^{1/2}}\sum_{t=m+1}^{m+q} z_{i,t+k}z_{j,t}\Big|^{2+\tau} \le \limsup_{q\to\infty}\sup_{m>0} E\Big|\frac{1}{q^{1/2}}\sum_{t=m+1}^{m+q} z_{i,t+k}z_{j,t}\Big|^{2+\tau} < C_5,$$
where $i, j = 1, \ldots, p - r_0$ and $k = 1, \ldots, K$.

Condition 8. The components of $Y_t$ are uniformly bounded; that is, there exist $m_1, m_2$ such that $m_1 \le Y_{it} \le m_2$ for all $i = 1, \ldots, p$.

Remark 1. Condition 1 is an identifiability condition. Condition 2 is a general stationarity condition and only requires that future white noise components be uncorrelated with the factors up to the present; it was adopted in Lam and Yao (2012). Conditions 3, 5 and 6 are similar to Conditions 3, 6 and 7 in Chang et al. (2014): Condition 3 reflects the sparsity of the factor loading matrix, while Conditions 5 and 6 concern the decay of the tail probability and of the mixing coefficients. Note that these conditions are stronger than the usual ones because we need to handle the estimator of $Z_t$ in ultra-high-dimensional cases (see also Chang et al. (2017a)); Conditions 4 and 7 are likewise slightly stronger than Conditions 1 and 4 in Chang et al. (2017a) for the same reason. Condition 8 is a technical condition and is easy to verify in practice.

We then make the eigen-decomposition of $M$,
$$Q^T M Q = \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix},$$
where $D_1$ contains the nonzero eigenvalues of $M$ and $D_2$ is in fact a zero matrix, so that $D_1$ is $r \times r$ and $D_2$ is $(p - r) \times (p - r)$. Denote
$$\psi = \min_{\lambda\in\sigma(D_1),\,\mu\in\sigma(D_2)} |\lambda - \mu|.$$

Following the remark below Lemma 1 in Chang et al. (2014), if we partition $Q$ as $(Q_1, Q_2)$, where $Q_1$ collects the eigenvectors corresponding to the nonzero eigenvalues and $Q_2$ the eigenvectors of the zero eigenvalue, then
$$\psi\, D(\mathcal{M}(\hat Q_2), \mathcal{M}(Q_2)) := \psi\sqrt{1 - \frac{1}{p-r}\,\mathrm{tr}(\hat Q_2\hat Q_2^T Q_2 Q_2^T)} = O_P(\|\hat M - M\|_2). \qquad (2.3)$$
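For completeness, the distance $D$ appearing in (2.3) can be computed directly from the two orthonormal bases; a small sketch, with the convention (assumed here) that the normalising dimension is $p - r$ and the function name ours:

```python
import numpy as np

def subspace_distance(Q2_hat, Q2):
    """D(M(Q2_hat), M(Q2)) = sqrt(1 - tr(Q2_hat Q2_hat^T Q2 Q2^T)/(p - r)),
    for p x (p - r) matrices with orthonormal columns."""
    d = Q2.shape[1]                                          # p - r
    overlap = np.linalg.norm(Q2_hat.T @ Q2, ord="fro") ** 2  # = tr(...)
    return np.sqrt(max(1.0 - overlap / d, 0.0))
```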

By the definition of the 2-norm,
$$\|\hat M - M\|_2 = \Big\|\sum_{k=1}^{k_0} \mathrm{Tr}_u(\hat\Sigma_y(k))\mathrm{Tr}_u^T(\hat\Sigma_y(k)) - \Sigma_y(k)\Sigma_y^T(k)\Big\|_2 \le 2\sum_{k=1}^{k_0} \|\mathrm{Tr}_u(\hat\Sigma_y(k)) - \Sigma_y(k)\|_2\,\|\Sigma_y(k)\|_2 + \sum_{k=1}^{k_0} \|\mathrm{Tr}_u(\hat\Sigma_y(k)) - \Sigma_y(k)\|_2^2.$$

Lemma 2.1. Under Condition 3, we have
$$\max_{1\le j\le p}\sum_{i=1}^p |\sigma_{ij}^{(k)}|^\iota \le rs_1(s_2 + 1) \quad\text{and}\quad \max_{1\le i\le p}\sum_{j=1}^p |\sigma_{ij}^{(k)}|^\iota \le rs_2(s_2 + 1). \qquad (2.4)$$


Proof. Write $\Sigma_x(k) = (\sigma^{(k)}_{x,i,j})_{i,j=1,\ldots,r}$ and $\Sigma_{x\epsilon}(k) = \mathrm{Cov}(X_{t+k}, \epsilon_t) = (\sigma^{(k)}_{x\epsilon,i,j})$. By the inequality $(x + y)^\iota \le x^\iota + y^\iota$ for all $x \ge 0$ and $y \ge 0$, it follows that
$$\sum_{i=1}^p |\sigma^{(k)}_{i,j}|^\iota \le \sum_{i=1}^p\sum_{v,l=1}^r |a_{i,v}|^\iota|\sigma^{(k)}_{x,v,l}|^\iota|a_{j,l}|^\iota + \sum_{i=1}^p\sum_{v=1}^r |a_{i,v}|^\iota|\sigma^{(k)}_{x\epsilon,v,j}|^\iota \le \sum_{v,l=1}^r\sum_{i=1}^p |a_{i,v}|^\iota|a_{j,l}|^\iota\|\Sigma_x(k)\|_\infty^\iota + \sum_{v=1}^r\sum_{i=1}^p |a_{i,v}|^\iota\|\Sigma_{x\epsilon}(k)\|_\infty^\iota \le rs_1(s_2 + 1).$$
Using similar methods, we can get the second inequality. □

Using these inequalities, we can apply methods similar to Lemma 7 in Chang et al. (2014) to show the following result.

Lemma 2.2. Under Conditions 3, 6 and 7, for all $k \le k_0$ we have
$$\|\mathrm{Tr}_u(\hat\Sigma_y(k)) - \Sigma_y(k)\|_2 = O_P\Big(\Big(\frac{\log p}{n}\Big)^{\frac{1-\iota}{2}} rs(s_2 + 1)\Big).$$

So by (2.4),
$$\|\hat M - M\|_2 = O_P\Big(\Big(\frac{\log p}{n}\Big)^{1-\iota} r^2s^2(s_2 + 1)^2 + \Big(\frac{\log p}{n}\Big)^{\frac{1-\iota}{2}} rs(s_2 + 1)\kappa\Big), \qquad (2.5)$$
where $\kappa = \max_{1\le k\le k_0}\|\Sigma_y(k)\|_2$.

Lemma 2.3. Under Conditions 3, 6, 7 and 8, for all $i = r_0 + 1, \ldots, p$ and $t = 1, \ldots, n$, if $D(\mathcal{M}(\hat\gamma_{r_0+1}, \ldots, \hat\gamma_p), \mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p)) = o_P(1)$, then
$$\hat z_{t,i} = z_{t,i} + o_P(1).$$

Proof. Note that, under Condition 8, we only need to prove that for all $i = r_0 + 1, \ldots, p$ there exists an orthogonal matrix $Q_i$ such that
$$\|\hat\gamma_i - Q_i\gamma_i\|_2 = O_P\big(D(\mathcal{M}(\hat\gamma_{r_0+1}, \ldots, \hat\gamma_p), \mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p))\big).$$
To prove this claim, we first make an orthogonal decomposition $\hat\gamma_i = \hat\gamma_i' + \hat\gamma_i''$, where $\hat\gamma_i' \in \mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p)$ and $\hat\gamma_i'' \perp \mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p)$. We can then take an orthogonal transformation $Q_i$ in $\mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p)$ such that $\hat\gamma_i'$ and $Q_i\gamma_i$ coincide in direction. By the definition of the metric $D$, we have
$$\|\hat\gamma_i''\|_2^2 = D^2(\mathcal{M}(\hat\gamma_{r_0+1}, \ldots, \hat\gamma_p), \mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p)).$$
So
$$\|\hat\gamma_i - Q_i\gamma_i\|_2^2 = \|\hat\gamma_i' - Q_i\gamma_i\|_2^2 + \|\hat\gamma_i''\|_2^2 = O_P(\|\hat\gamma_i''\|_2^2) = o_P(1). \quad\square$$

Remark 2. From the proof of Lemma 2.3, we know that our method is valid when $D(\mathcal{M}(\hat\gamma_{r_0+1}, \ldots, \hat\gamma_p), \mathcal{M}(\gamma_{r_0+1}, \ldots, \gamma_p)) = o_P(1)$. We also obtain from (2.3) that $D(\mathcal{M}(\hat Q_2), \mathcal{M}(Q_2)) = O_P(\|\hat M - M\|_2/\psi)$. Since the rate of $\|\hat M - M\|_2$ is given by (2.5), $\psi$, which equals the smallest nonzero eigenvalue of $M$, i.e. the strength of the factors, should be of larger order than $(\log p/n)^{1-\iota} r^2s^2(s_2+1)^2 + (\log p/n)^{(1-\iota)/2} rs(s_2+1)\kappa$. So our method can detect strong factors. For the definition of the strength of factors, we refer to Li et al. (2017) and Lam and Yao (2012).

Theorem 2.2. Let Conditions 1–8 hold, $|\mathcal{K}(x)| \asymp |x|^{-\tau}$ as $|x| \to \infty$ for some $\tau > 1$, and $b_n \asymp n^\rho$ for some $0 < \rho < \min\{(\tau - 1)/(3\tau), \gamma_2/(2\gamma_2 + 1)\}$. Let $\log p \le Cn^{\delta_2}$ for some positive constants $\delta_2$ and $C$, where $\delta_2$ depends only on the constants in Conditions 1–8. Then it holds under $H_0(r_1)$ that
$$\Pr(T_n(\hat B) > \hat{cv}_\alpha) \to \alpha, \quad n \to \infty.$$

Theorem 2.3. Assume that the conditions of Theorem 2.2 hold. Let $\varsigma$ be the largest element on the main diagonal of $\Xi_n$, and $\lambda(p, \alpha) = [2\log(p^2K)]^{1/2} + [2\log(1/\alpha)]^{1/2}$. Suppose that
$$\max_{1\le k\le K}\max_{1\le i,j\le p-r_1} |\rho_{ij}(k)| \ge \varsigma^{1/2}(1 + \epsilon_n)n^{-1/2}\lambda(p, \alpha)$$


for some positive $\epsilon_n$ satisfying $\epsilon_n \to 0$ and $\epsilon_n^2\log p \to \infty$. Then it holds under $H_1(r_1)$ that
$$\Pr(T_n(\hat B) > \hat{cv}_\alpha) \to 1, \quad n \to \infty.$$

Based on Lemma 2.3, the proofs of Theorems 2.2 and 2.3 proceed in the same manner as the proofs of Theorems 1 and 2 of Chang et al. (2017b). From Theorems 2.2 and 2.3 we know that, for each given $r_1$, our test is consistent. Suppose all the sequential tests are conducted at significance level $\alpha$, and denote the events $EV_i = \{\text{reject the } i\text{th null hypothesis}\}$ for $i \ge 1$. Then
$$\Pr(\hat r = r_0) = \Pr(\text{reject the first } r_0 - 1 \text{ null hypotheses and accept the } r_0\text{th}) = \Pr\Big(\Big(\bigcap_{i=1}^{r_0-1} EV_i\Big) \cap EV_{r_0}^c\Big) = 1 - \Pr\Big(\Big(\bigcup_{i=1}^{r_0-1} EV_i^c\Big) \cup EV_{r_0}\Big) \ge 1 - \sum_{i=1}^{r_0-1}\Pr(EV_i^c) - \Pr(EV_{r_0}).$$
By Theorems 2.2 and 2.3, $\Pr(EV_{r_0}) \to \alpha$ and $\Pr(EV_i^c) \to 0$ for $i = 1, \ldots, r_0 - 1$ as the sample size $n$ tends to infinity. Since $r_0$ is usually small relative to $n$ and $p$, we may treat it as fixed. Letting $n$ tend to infinity,
$$\liminf_{n\to\infty}\Pr(\hat r = r_0) \ge 1 - \alpha.$$
So the probability that our method recovers the correct number of factors is at least $1 - \alpha$ asymptotically.

3. Numerical properties

In this section, we illustrate the finite-sample properties of our estimator by simulation, using the following settings:
$$y_t = Ax_t + \epsilon_t, \qquad \epsilon_t \sim N_p(0, I_p),$$

where either
$$\text{(I)}\quad x_t = \Theta x_{t-1} + e_t, \quad e_t \sim N_r(0, \Gamma), \qquad\text{or}\qquad \text{(II)}\quad x_t = \Theta x_{t-3} + e_t, \quad e_t \sim N_r(0, \Gamma).$$

Without loss of generality, the variance $\sigma^2$ of the white noise $\epsilon_t$ is set to 1. Lam and Yao (2012) first generated the entries of $A$ independently from the uniform distribution on $[-1, 1]$ and then divided them by $p^{\delta/2}$ with $\delta \in [0, 1]$, so their factor strengths are of order $O(p^{1-\delta})$. In Li et al. (2017), by contrast, the coefficient matrix $A$ is fixed and the values of $\Theta$ and $\Gamma$ are adjusted to reflect the factor strength. We adopt settings similar to Li et al. (2017). Since the eigenvalues of $\hat M$ are invariant under orthogonal transformations (see also Theorem 2.1 in Li et al. (2017)), we let
$$A = \begin{pmatrix} I_r \\ 0_{(p-r)\times r} \end{pmatrix},$$
and we also fix
$$\Theta = \begin{pmatrix} 0.6 & 0 \\ 0 & 0.5 \end{pmatrix},$$

so we only need to adjust the value of Γ to obtain different strength. For each settings, we adopt the following three scenarios:

Γ1 = diag(4 × p Γ2 = diag(4 × p Γ3 = diag(4 × p

1−δ1 2 1−δ1 2 1−δ3 2

,4 × p ,4 × p ,4 × p

1−δ2 2

)

1−δ1 2

)

1−δ1 2

)
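A sketch of one replication of this design (setting (I) or (II), with $r = 2$ factors); simulate_setting is our name, and the burn-in length is an illustrative choice.

```python
import numpy as np

def simulate_setting(p, n, lag=1, deltas=(0.5, 0.9), burn=200, seed=None):
    """y_t = A x_t + eps_t, x_t = Theta x_{t-lag} + e_t, e_t ~ N_r(0, Gamma)."""
    rng = np.random.default_rng(seed)
    r = 2
    A = np.vstack([np.eye(r), np.zeros((p - r, r))])   # A = (I_r; 0)
    Theta = np.diag([0.6, 0.5])
    Gamma = np.diag([4 * p ** ((1 - d) / 2) for d in deltas])
    X = np.zeros((r, n + burn))
    for t in range(lag, n + burn):
        X[:, t] = Theta @ X[:, t - lag] + rng.multivariate_normal(np.zeros(r), Gamma)
    return A @ X[:, burn:] + rng.standard_normal((p, n))  # eps_t ~ N_p(0, I_p)
```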

As for the choice of the threshold, Bickel and Levina (2008) propose to select the thresholding parameter by minimizing the Frobenius norm of the difference between the thresholding estimator and a sample covariance matrix computed from independent data. We follow a similar idea. First, for each $k = 1, \ldots, k_0$, we use a rolling window of a certain size $n^*$ and split each window into two sub-samples of sizes $n_1$ and $n_2$, with the observations in between left out. Then we obtain $\Sigma_{y,1,\kappa,u(k)}(k)$ from the first sub-sample of the $\kappa$th rolling window with thresholding, and $\Sigma_{y,2,\kappa}(k)$ from the second sub-sample without thresholding. Finally, we choose the tuning parameter $u(k)$ which minimizes $\sum_{\kappa=1}^T \|\Sigma_{y,1,\kappa,u(k)}(k) - \Sigma_{y,2,\kappa}(k)\|_F^2$. We will report the results of two thresholding methods.
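A sketch of this tuning step, reusing lag_cov and hard_threshold from the earlier sketches; the candidate grid, the window stepping, and the window/sub-sample sizes are left as user inputs since the paper does not fix them.

```python
import numpy as np

def choose_threshold(Y, k, grid, n_star, n1, n2):
    """Pick u(k) minimising the summed squared Frobenius distance between the
    thresholded lag-k covariance from sub-sample 1 and the raw estimate from
    sub-sample 2, over rolling windows of size n_star."""
    p, n = Y.shape
    best_u, best_loss = grid[0], np.inf
    for u in grid:
        loss = 0.0
        for start in range(0, n - n_star + 1, n2):     # rolling windows
            W = Y[:, start:start + n_star]
            S1 = hard_threshold(lag_cov(W[:, :n1], k), u)
            S2 = lag_cov(W[:, n_star - n2:], k)        # no thresholding
            loss += np.linalg.norm(S1 - S2, "fro") ** 2
        if loss < best_loss:
            best_u, best_loss = u, loss
    return best_u
```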

Recall that we take
$$\hat r = \{\text{first } r_1 \ge 1 \text{ such that } H_0(r_1) \text{ is accepted}\} - 1.$$
In the simulation, we take $k_0 = 10$ and set $p = 50, 70, 90$ with $n = 100, 140, 180$, respectively. We repeat each setting 350 times and report the empirical frequencies of $\hat r = r_0$, $\hat r = r_0 \pm 1$ and $|\hat r - r_0| > 1$, where the true number of factors is $r_0 = 2$. The results are as follows: $r_{LY}$ is the estimator of Lam and Yao (2012), $r_{LWY}$ is the one of Li et al. (2017), and $\hat r$ denotes our estimators, with subscript ``hard'' for the hard-thresholding method and ``soft'' for the soft-thresholding method. Columns labelled ``lag 1'' refer to setting (I) and columns labelled ``lag 3'' to setting (II).

Scenario 1 ($\Gamma_1$)

                     lag 1               lag 3
p                    50    70    90      50    70    90
n                    100   140   180     100   140   180
r_LY < 1             0     0     0       0     0     0
r_LY = 1             253   273   311     247   260   301
r_LY = 2             97    77    39      103   90    49
r_LY = 3             0     0     0       0     0     0
r_LY >= 4            0     0     0       0     0     0
r_LWY < 1            0     0     0       21    15    0
r_LWY = 1            2     0     0       70    37    5
r_LWY = 2            346   350   343     140   96    23
r_LWY = 3            2     0     7       108   146   134
r_LWY >= 4           0     0     0       11    56    188
r̂_hard < 1          86    13    2       65    6     1
r̂_hard = 1          228   176   67      227   165   69
r̂_hard = 2          33    150   237     58    171   242
r̂_hard = 3          1     6     32      0     5     27
r̂_hard >= 4         2     5     12      0     3     11
r̂_soft < 1          90    11    1       65    8     0
r̂_soft = 1          223   173   63      221   172   65
r̂_soft = 2          37    157   241     62    158   239
r̂_soft = 3          0     7     31      1     8     27
r̂_soft >= 4         0     2     14      1     4     19

Scenario 2 ($\Gamma_2$)

                     lag 1               lag 3
p                    50    70    90      50    70    90
n                    100   140   180     100   140   180
r_LY < 1             0     0     0       0     0     0
r_LY = 1             61    37    27      48    35    20
r_LY = 2             289   313   323     302   315   330
r_LY = 3             0     0     0       0     0     0
r_LY >= 4            0     0     0       0     0     0
r_LWY < 1            0     0     0       27    15    1
r_LWY = 1            0     0     0       27    13    5
r_LWY = 2            346   348   339     65    16    2
r_LWY = 3            4     2     11      168   128   38
r_LWY >= 4           0     0     0       63    178   304
r̂_hard < 1          80    5     0       48    2     0
r̂_hard = 1          216   147   45      223   150   40
r̂_hard = 2          37    189   277     77    191   283
r̂_hard = 3          0     5     20      0     4     12
r̂_hard >= 4         0     4     8       2     3     15
r̂_soft < 1          77    5     0       48    4     0
r̂_soft = 1          212   148   43      228   136   28
r̂_soft = 2          60    186   265     74    199   275
r̂_soft = 3          0     5     25      0     9     30
r̂_soft >= 4         1     6     17      0     2     17

Scenario 3 ($\Gamma_3$)

                     lag 1               lag 3
p                    50    70    90      50    70    90
n                    100   140   180     100   140   180
r_LY < 1             0     0     0       0     0     0
r_LY = 1             204   224   261     184   216   253
r_LY = 2             146   126   89      166   134   97
r_LY = 3             0     0     0       0     0     0
r_LY >= 4            0     0     0       0     0     0
r_LWY < 1            0     0     0       9     5     1
r_LWY = 1            0     0     0       18    6     2
r_LWY = 2            343   347   338     36    4     26
r_LWY = 3            7     3     12      188   103   319
r_LWY >= 4           0     0     0       99    232   2
r̂_hard < 1          65    8     1       49    3     1
r̂_hard = 1          228   142   47      211   129   31
r̂_hard = 2          56    183   228     89    206   254
r̂_hard = 3          0     14    54      0     8     42
r̂_hard >= 4         1     3     20      1     4     22
r̂_soft < 1          70    9     2       54    4     0
r̂_soft = 1          226   146   46      205   126   29
r̂_soft = 2          53    180   226     90    205   251
r̂_soft = 3          0     11    47      0     8     43
r̂_soft >= 4         1     4     29      1     7     27

Scenarios 1–3 indicate that $r_{LY}$ works well when all factors are of the same strength; however, it tends to miss the weaker factors when they are of different strengths. $r_{LWY}$ is designed for models in which the dependence is predominant at lag 1, and it cannot cope when, for example, the dependence is only observable at lag 3. The proposed new method takes the dependence across different lags into account and can handle factors of different strengths, so it provides more stable performance across the different settings. The choice of thresholding method does not appear to affect the estimation accuracy.

4. Conclusion

This study set out to find a new method to estimate the number of factors in factor models. The results, both theoretical and in simulation, indicate that the new estimator is stable across different settings when the strengths of the factors are not extremely small. The major limitation of the present study is the computational burden; more research is needed to estimate the number of factors with a more stable and computationally feasible method.

Acknowledgments

The authors thank Professor Qiwei Yao for his constructive advice on this paper. The authors also thank the referees for their thoughtful comments, which have helped improve this article. This paper was supported by the National Natural Science Foundation of China (Grant No. 11731012), the 973 Program (Grant No. 2015CB352302) and the Fundamental Research Funds for the Central Universities (Grant No. 2018FZA3001).

References

Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59 (3), 817–858.
Bai, J., Ng, S., 2002. Determining the number of factors in approximate factor models. Econometrica 70 (1), 191–221.
Bickel, P.J., Levina, E., 2008. Covariance regularization by thresholding. Ann. Statist. 36 (6), 2577–2604.
Bradley, R.C., 2005. Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surv. 2, 107–144.
Chang, J., Guo, B., Yao, Q., 2014. Principal component analysis for second-order stationary vector time series. arXiv:1410.2323.
Chang, J., Qiu, Y., Yao, Q., 2017a. On the statistical inference for large precision matrices with dependent data. arXiv:1603.06663.
Chang, J., Yao, Q., Zhou, W., 2017b. Testing for high-dimensional white noise using maximum cross-correlations. Biometrika 104 (1), 111–127.
Chernozhukov, V., Chetverikov, D., Kato, K., 2013. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 (6), 2786–2819.
Lam, C., Yao, Q., 2012. Factor modeling for high-dimensional time series: Inference for the number of factors. Ann. Statist. 40 (2), 694–726.
Li, Z., Wang, Q., Yao, J., 2017. Identifying the number of factors from singular values of a large sample auto-covariance matrix. Ann. Statist. 45 (1), 257–288.
Xia, Q., Xu, W., Zhu, L., 2015. Consistently determining the number of factors in multivariate volatility modelling. Statist. Sinica 25, 1025–1044.