Squared-norm empirical processes

Statistics and Probability Letters 150 (2019) 108–113

Vincent Q. Vu (a,*), Jing Lei (b)

(a) Department of Statistics, The Ohio State University, Columbus, OH, United States
(b) Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, United States

(*) Corresponding author. E-mail addresses: [email protected] (V.Q. Vu), [email protected] (J. Lei).
This research was supported by National Science Foundation, United States grants DMS-0903120 and BCS-0941518.
https://doi.org/10.1016/j.spl.2019.02.008

Article history: Received 7 August 2018; Accepted 10 February 2019; Available online 10 March 2019.
Keywords: PCA; Principal components; Empirical process

Abstract. This note extends a result of Mendelson on the supremum of a quadratic process to squared norms of functions taking values in a Banach space. Our method of proof is a reduction by a symmetrization argument and a simple observation about the additivity of the generic chaining functional. We demonstrate an application to positive linear functionals of the sample covariance matrix and the apparent variance explained by principal components analysis (PCA).

1. Introduction

Let $\mathcal{F}$ be a class of real-valued functions on the probability space $(\mathcal{X}, P)$ and let $X, X_1, \ldots, X_n$ be independent, identically distributed random variables. Let $\|\cdot\|_{\psi_\alpha}$ be the Orlicz $\psi_\alpha$-norm, i.e.

$$\|X\|_{\psi_\alpha} := \inf\{c > 0 \mid \mathbb{E}\exp(|X/c|^\alpha) \le 2\}, \qquad \alpha \ge 1,$$

and let $d_{\psi_\alpha}$ be the associated distance.
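As a numerical aside (our illustration, not part of the original note), the $\psi_\alpha$-norm can be approximated by replacing the expectation with a sample average and bisecting over $c$; the helper `psi_alpha_norm` below is hypothetical and exists only to make the definition concrete.

```python
import numpy as np

def psi_alpha_norm(sample: np.ndarray, alpha: float = 2.0) -> float:
    """Monte Carlo approximation of inf{c > 0 : E exp(|X/c|^alpha) <= 2}."""
    def mean_exp(c: float) -> float:
        z = np.abs(sample / c) ** alpha
        return float(np.mean(np.exp(np.minimum(z, 700.0))))  # clip to avoid overflow

    lo, hi = 1e-3, 1e3
    for _ in range(60):               # bisection: mean_exp is decreasing in c
        mid = 0.5 * (lo + hi)
        if mean_exp(mid) > 2.0:
            lo = mid                  # c too small, constraint violated
        else:
            hi = mid                  # c feasible, try smaller
    return hi

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
# For X ~ N(0, 1), E exp((X/c)^2) = (1 - 2/c^2)^(-1/2), which equals 2 at
# c = sqrt(8/3) ~ 1.633; the printed estimate should land near that value
# (the Monte Carlo error is heavy-tailed here).
print(psi_alpha_norm(x, alpha=2.0))
```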

For a metric space $(T, d)$ and $q \ge 1$, let $\gamma_q(T, d)$ be Talagrand's generic chaining functional (see Chapter 2 of Talagrand, 2014, and Section 2 for its definition). Mendelson (2010) has proved the following theorem.

Theorem 1.1. Let $\mathcal{F}$ be a symmetric class of mean-zero functions (i.e. $\mathbb{E}f(X) = 0$ and if $f \in \mathcal{F}$, then $-f \in \mathcal{F}$). Then there exist absolute constants $c_1$, $c_2$, and $c_3$ such that for any $t \ge c_1$, with probability at least $1 - 2\exp(-c_2 t^{2/5})$,

$$\sup_{f \in \mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n f^2(X_i) - \mathbb{E}f^2(X)\right| \le c_3 t\left\{\sup_{f \in \mathcal{F}}\|f(X)\|_{\psi_1}\frac{\gamma_2(\mathcal{F}, d_{\psi_2})}{\sqrt{n}} + \frac{\gamma_2^2(\mathcal{F}, d_{\psi_2})}{n}\right\}.$$

The appeal of this theorem is that it uses information about the class $\mathcal{F}$ to bound the empirical process indexed by the squares of the class $\{f^2 : f \in \mathcal{F}\}$. In this note we extend the preceding theorem to classes of functions that take values in a Banach space. Our main result is the following theorem.

Theorem 1.2. Let $\mathcal{G}$ be a class of functions taking values in a separable Banach space with norm $\|\cdot\|$. If either $0 \in \mathcal{G}$ or $\mathcal{G}$ is symmetric, then there exist absolute constants $c_1$, $c_2$, and $c_3$ such that for all $t \ge c_1$, with probability at least $1 - 2\exp(-c_2 t^{2/5})$,


$$\sup_{g \in \mathcal{G}}\left|\frac{1}{n}\sum_{i=1}^n \|g(X_i)\|^2 - \mathbb{E}\|g(X)\|^2\right| \le c_3 t\left\{\sup_{g \in \mathcal{G}}\big\|\|g(X)\|\big\|_{\psi_1}\frac{\gamma_2(\mathcal{G}, d)}{\sqrt{n}} + \frac{\gamma_2^2(\mathcal{G}, d)}{n}\right\}, \tag{1}$$

where $d$ is the metric on $\mathcal{G}$ defined by $d(g_1, g_2) = \big\|\|g_1(X) - g_2(X)\|\big\|_{\psi_2}$.

The main application that we have in mind is principal components analysis (PCA). The goal of PCA is to find a lower-dimensional subspace of maximal variance, i.e. it seeks a $p \times k$ orthonormal matrix $U$ such that the variance $\mathbb{E}\|U^T X\|^2$ is maximized. (We assume for simplicity of presentation that $\mathbb{E}X = 0$.) When the distribution of $X$ is not known, as is usually the case in real applications, PCA instead maximizes the empirical variance

$$\frac{1}{n}\sum_{i=1}^n \|U^T X_i\|_2^2,$$

where $\|\cdot\|_2$ is the Euclidean norm. The difference between the apparent variance and the actual variance is

$$\left|\frac{1}{n}\sum_{i=1}^n \|U^T X_i\|_2^2 - \mathbb{E}\|U^T X\|_2^2\right|. \tag{2}$$
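To make (2) concrete before we continue, here is a small simulation sketch (ours, not from the paper): taking $U$ to be the top-$k$ sample eigenvectors, i.e. the $U$ that PCA itself selects, the apparent variance typically overshoots the actual variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 20, 2

X = rng.standard_normal((n, p))      # EX = 0 and Cov(X) = I_p in this toy setup
S = X.T @ X / n                      # sample covariance matrix
eigvecs = np.linalg.eigh(S)[1]
U = eigvecs[:, -k:]                  # top-k sample eigenvectors (the PCA choice)

apparent = np.sum((X @ U) ** 2) / n  # (1/n) sum_i ||U^T X_i||_2^2
actual = float(k)                    # E||U^T X||_2^2 = ||U||_F^2 = k when Cov(X) = I_p
print(apparent - actual)             # positive with high probability: PCA overfits
```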

Maximizing this error over $U$ belonging to a subset $\mathcal{U}$ of orthonormal matrices leads immediately to a problem suitable for Theorem 1.2 by considering the function class $\mathcal{G} = \{x \mapsto U^T x \mid U \in \mathcal{U}\}$. The typical size of this error will depend on the complexity of $\mathcal{G}$. In the case of ordinary PCA, $\mathcal{U}$ is the set of all $p \times k$ orthonormal matrices. Sparse PCA corresponds to the case where $\mathcal{U}$ is a set of orthonormal matrices with few nonzero rows (or entries). We have applied the results of this note to sparse PCA (Vu and Lei, 2013), and that was the impetus for us to develop the results presented here.

Our proof of Theorem 1.2, presented in Section 3, consists of two parts. In the first, we construct an auxiliary class $\mathcal{F}_0$ of functions obtained by symmetrizing the norm functions of $\mathcal{G}$. This allows us to apply Theorem 1.1. The remainder of the proof deals with relating $\gamma_2(\mathcal{F}_0, d_{\psi_2})$ to $\gamma_2(\mathcal{G}, d)$, which involves bounding $\gamma_2$ applied to a union of sets. In Section 2 we formally define the generic chaining functional and prove a key technical lemma. After proving the main theorem in Section 3, we move on to applications of Theorem 1.2 to positive linear functionals of the sample covariance matrix and to the principal components analysis problem (2) above. Throughout this note $c, C, c_2, c_3, \ldots$ refer to absolute positive constants whose values may change from line to line.

2. The generic chaining functional

Definition 2.1. Let $(T, d)$ be a metric space. An increasing sequence of partitions $\mathcal{A}_s$, $s \ge 0$, of $T$ is admissible if the cardinalities satisfy $|\mathcal{A}_0| = 1$ and $|\mathcal{A}_s| \le 2^{2^s}$ for $s \ge 1$. For $t \in T$, $\mathcal{A}_s(t)$ denotes the unique set of the partition containing $t$, and $\Delta(\mathcal{A}_s(t))$ is its diameter. The generic chaining functional is defined to be

$$\gamma_\alpha(T, d) := \inf \sup_{t \in T} \sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{A}_s(t)), \qquad \alpha \ge 1,$$

where the infimum is taken over all admissible sequences.
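For intuition (an illustration we add here, not part of the original), consider the smallest nontrivial case, a two-point space $T = \{t_1, t_2\}$ with $d(t_1, t_2) = \delta$. Taking $\mathcal{A}_0 = \{T\}$ and $\mathcal{A}_s = \{\{t_1\}, \{t_2\}\}$ for $s \ge 1$, only the $s = 0$ term is nonzero, so

$$\gamma_\alpha(T, d) = \Delta(T) = \delta.$$

More generally, since $\mathcal{A}_0(t) = T$ for every admissible sequence, the $s = 0$ term alone gives $\gamma_\alpha(T, d) \ge \Delta(T)$; this fact is used in the proof of the lemma below.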

Lemma 2.1. Let $(T_1 \cup T_2, d)$ be a metric space. Then

$$\gamma_\alpha(T_1 \cup T_2, d) \le 3\left\{\Delta(T_1 \cup T_2) + \gamma_\alpha(T_1, d) + \gamma_\alpha(T_2, d)\right\}.$$

Moreover, if $T_1 \cap T_2 \ne \emptyset$, then

$$\Delta(T_1 \cup T_2) \le 2\left\{\gamma_\alpha(T_1, d) + \gamma_\alpha(T_2, d)\right\},$$

and hence

$$\gamma_\alpha(T_1 \cup T_2, d) \le 9\left\{\gamma_\alpha(T_1, d) + \gamma_\alpha(T_2, d)\right\}.$$

Proof. Let $(\mathcal{A}_s)_{s \ge 0}$ be an admissible sequence for $(T_1, d)$ such that

$$\sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{A}_s(t)) \le 2\gamma_\alpha(T_1, d) \quad \text{for all } t \in T_1, \tag{3}$$

and $(\mathcal{B}_s)_{s \ge 0}$ an admissible sequence for $(T_2, d)$ such that

$$\sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{B}_s(t)) \le 2\gamma_\alpha(T_2, d) \quad \text{for all } t \in T_2. \tag{4}$$


We define partitions $\mathcal{C}_s$ of $T_1 \cup T_2$ as follows. Let $\mathcal{C}_0 = \{T_1 \cup T_2\}$, $\mathcal{C}_1 = \mathcal{C}_0$, and, for $s \ge 2$, let $\mathcal{C}_s$ be the partition consisting of the sets $A \cap (T_1 \setminus T_2)$ for $A \in \mathcal{A}_{s-2}$, $B \cap (T_2 \setminus T_1)$ for $B \in \mathcal{B}_{s-2}$, and $A \cap B$ for $A \in \mathcal{A}_{s-2}$ and $B \in \mathcal{B}_{s-2}$. It is straightforward to check that $(\mathcal{C}_s)_{s \ge 0}$ is an increasing sequence of partitions of $T_1 \cup T_2$, $|\mathcal{C}_0| = |\mathcal{C}_1| = 1$, and for $s \ge 2$,

$$|\mathcal{C}_s| \le |\mathcal{A}_{s-2}| + |\mathcal{B}_{s-2}| + |\mathcal{A}_{s-2}| \times |\mathcal{B}_{s-2}| \le 2^{2^{s-2}} + 2^{2^{s-2}} + 2^{2^{s-1}} \le 2^{2^s}.$$

Thus $(\mathcal{C}_s)_{s \ge 0}$ is an admissible sequence for $T_1 \cup T_2$. Let $t \in T_1 \cup T_2$. Since, for $s \ge 2$, every element of $\mathcal{C}_s$ is a subset of an element of $\mathcal{A}_{s-2}$ and/or $\mathcal{B}_{s-2}$,

$$\begin{aligned}
\sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{C}_s(t)) &\le 3\Delta(T_1 \cup T_2) + \sum_{s \ge 2} 2^{s/\alpha} \Delta(\mathcal{C}_s(t)) \\
&\le 3\Delta(T_1 \cup T_2) + \sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{A}_s(t)) + \sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{B}_s(t)) \\
&\le 3\left\{\Delta(T_1 \cup T_2) + \gamma_\alpha(T_1, d) + \gamma_\alpha(T_2, d)\right\},
\end{aligned}$$

by (3) and (4). Then

$$\gamma_\alpha(T_1 \cup T_2, d) \le \sup_{t \in T_1 \cup T_2} \sum_{s \ge 0} 2^{s/\alpha} \Delta(\mathcal{C}_s(t)) \le 3\left\{\Delta(T_1 \cup T_2) + \gamma_\alpha(T_1, d) + \gamma_\alpha(T_2, d)\right\}.$$

For the "moreover" part, if $u \in T_1 \cap T_2$, then by the triangle inequality

$$\begin{aligned}
\Delta(T_1 \cup T_2) = \sup_{t_1, t_2 \in T_1 \cup T_2} d(t_1, t_2) &\le \sup_{t_1, t_2 \in T_1 \cup T_2} \left\{d(t_1, u) + d(t_2, u)\right\} \\
&\le 2\left\{\Delta(T_1) + \Delta(T_2)\right\} \le 2\left\{\gamma_\alpha(T_1, d) + \gamma_\alpha(T_2, d)\right\}. \qquad \square
\end{aligned}$$



3. Proof of Theorem 1.2

Proof. Let $\epsilon, \epsilon_1, \ldots, \epsilon_n$ be independent, identically distributed Rademacher random variables that are independent of $X, X_1, \ldots, X_n$. Define the class $\mathcal{F}$ of real-valued functions on $\mathcal{X} \times \{-1, 1\}$ by

$$\mathcal{F} = \{f \mid f(x, y) = y\|g(x)\| \text{ for some } g \in \mathcal{G}\},$$

and let $\mathcal{F}_0 = \mathcal{F} \cup -\mathcal{F}$. Then $\mathcal{F}_0$ is a symmetric class of mean-zero functions by construction. So Theorem 1.1 implies that for $t \ge c_1$, with probability at least $1 - 2\exp(-c_2 t^{2/5})$,

$$\begin{aligned}
\sup_{g \in \mathcal{G}}\left|\frac{1}{n}\sum_{i=1}^n \|g(X_i)\|^2 - \mathbb{E}\|g(X)\|^2\right| &= \sup_{f \in \mathcal{F}_0}\left|\frac{1}{n}\sum_{i=1}^n f^2(X_i, \epsilon_i) - \mathbb{E}f^2(X, \epsilon)\right| \\
&\le c_3 t\left\{\sup_{g \in \mathcal{G}}\big\|\|g(X)\|\big\|_{\psi_1}\frac{\gamma_2(\mathcal{F}_0, d_{\psi_2})}{\sqrt{n}} + \frac{\gamma_2^2(\mathcal{F}_0, d_{\psi_2})}{n}\right\}.
\end{aligned}$$

(The first equality holds because $f^2(x, y) = y^2\|g(x)\|^2 = \|g(x)\|^2$, so squaring removes the Rademacher signs.) It remains for us to show that there exists a constant $C > 0$ such that

$$\gamma_2(\mathcal{F}_0, d_{\psi_2}) \le C\gamma_2(\mathcal{G}, d). \tag{5}$$

Toward this end, we first establish a simple fact that will be used for both cases of $\mathcal{G}$.


Let $f_1, f_2 \in \mathcal{F}$. There exist $g_1, g_2 \in \mathcal{G}$ such that $f_i(x, \epsilon) = \epsilon\|g_i(x)\|$ for $i = 1, 2$. Then we have by the triangle inequality that

$$d_{\psi_2}(f_1, f_2) = \big\|\|g_1(X)\| - \|g_2(X)\|\big\|_{\psi_2} \le \big\|\|g_1(X) - g_2(X)\|\big\|_{\psi_2} = d(g_1, g_2),$$

and it follows from Theorem 2.7.5 of Talagrand (2014) that there exists a constant $K$ such that

$$\gamma_2(\mathcal{F}, d_{\psi_2}) \le K\gamma_2(\mathcal{G}, d). \tag{6}$$

Now let us consider the case where $\mathcal{G}$ is symmetric. We have from Lemma 2.1 that

$$\begin{aligned}
\gamma_2(\mathcal{F}_0, d_{\psi_2}) = \gamma_2(\mathcal{F} \cup -\mathcal{F}, d_{\psi_2}) &\le 3\left\{\Delta(\mathcal{F} \cup -\mathcal{F}) + \gamma_2(\mathcal{F}, d_{\psi_2}) + \gamma_2(-\mathcal{F}, d_{\psi_2})\right\} \\
&= 3\left\{\Delta(\mathcal{F} \cup -\mathcal{F}) + 2\gamma_2(\mathcal{F}, d_{\psi_2})\right\} \\
&\le 3\left\{\Delta(\mathcal{F} \cup -\mathcal{F}) + 2K\gamma_2(\mathcal{G}, d)\right\}, \tag{7}
\end{aligned}$$

where (7) follows from (6). For the remaining term in (7), we apply the triangle inequality to get

$$\begin{aligned}
\Delta(\mathcal{F} \cup -\mathcal{F}) &= \max\left\{\sup_{g_1, g_2 \in \mathcal{G}}\big\|\|g_1(X)\| - \|g_2(X)\|\big\|_{\psi_2},\ \sup_{g_1, g_2 \in \mathcal{G}}\big\|\|g_1(X)\| + \|g_2(X)\|\big\|_{\psi_2}\right\} \\
&\le \max\left\{\sup_{g_1, g_2 \in \mathcal{G}} d(g_1, g_2),\ 2\sup_{g \in \mathcal{G}}\big\|\|g(X)\|\big\|_{\psi_2}\right\} \\
&= \max\left\{\sup_{g_1, g_2 \in \mathcal{G}} d(g_1, g_2),\ \sup_{g \in \mathcal{G}} d(g, -g)\right\},
\end{aligned}$$

and since $\mathcal{G}$ is symmetric, it follows that the above is

$$= \sup_{g_1, g_2 \in \mathcal{G}} d(g_1, g_2) \le \gamma_2(\mathcal{G}, d).$$

Combining this with (7) proves (5).

For the remaining case, if $0 \in \mathcal{G}$, then the zero function belongs to both $\mathcal{F}$ and $-\mathcal{F}$, so $\mathcal{F} \cap -\mathcal{F} \ne \emptyset$ and the "moreover" part of Lemma 2.1 gives the bound

$$\gamma_2(\mathcal{F}_0, d_{\psi_2}) \le 9\left\{\gamma_2(\mathcal{F}, d_{\psi_2}) + \gamma_2(-\mathcal{F}, d_{\psi_2})\right\} = 18\gamma_2(\mathcal{F}, d_{\psi_2}).$$

Then (5) follows from this and (6). $\square$



4. Positive linear functionals of the sample covariance matrix

In this section we consider applications of Theorem 1.2 to positive linear functionals of the sample covariance matrix. Let $X, X_1, \ldots, X_n$ be i.i.d. mean 0 random vectors in $\mathbb{R}^p$, and let

$$S = \frac{1}{n}\sum_{i=1}^n X_i X_i^T$$

be the sample covariance matrix. For matrices $A$ and $B$ of conformable dimension, let $\langle A, B\rangle := \mathrm{tr}(A^T B)$ be the trace inner product and $\|A\|_F := \langle A, A\rangle^{1/2}$ be the Frobenius norm. A positive linear functional $h$ of $S$ has the form $h(S) = \langle S, BB^T\rangle$ for some matrix $B$. Given a collection $\mathcal{H}$ of such functionals, we can choose a representation so that the coefficient matrices $B$ all have the same number of columns, say $k$, and we shall write

$$\mathcal{B} = \{B \in \mathbb{R}^{p \times k} \mid h(S) = \langle S, BB^T\rangle \text{ for some } h \in \mathcal{H}\}.$$

Clearly, if $B \in \mathcal{B}$ then $-B \in \mathcal{B}$, so the function class

$$\mathcal{G} = \{x \mapsto B^T x \mid B \in \mathcal{B}\}$$

is symmetric and

$$h(S) - \mathbb{E}h(S) = \frac{1}{n}\sum_{i=1}^n \|g(X_i)\|_2^2 - \mathbb{E}\|g(X)\|_2^2.$$

The following is an application of Theorem 1.2 to this setting.
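Before stating it, a quick numerical sanity check (our addition, not part of the paper) of the identity in the preceding display:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 400, 8, 3

X = rng.standard_normal((n, p))
S = X.T @ X / n                              # sample covariance matrix

B = rng.standard_normal((p, k))              # coefficient matrix of h
h_S = np.trace(B.T @ S @ B)                  # h(S) = <S, B B^T> = tr(B^T S B)

# g(x) = B^T x turns h(S) into an empirical squared norm, as in the display:
sq_norms = np.sum((X @ B) ** 2, axis=1)      # ||B^T X_i||_2^2 for each i
assert np.isclose(h_S, sq_norms.mean())

# Deviation from h(Sigma); here Sigma = I_p, so h(Sigma) = ||B||_F^2.
print(h_S - np.linalg.norm(B, "fro") ** 2)
```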


Theorem 4.1. Let $X, X_1, \ldots, X_n \in \mathbb{R}^p$ be i.i.d. mean 0 random vectors,

$$S_n = \frac{1}{n}\sum_{i=1}^n X_i X_i^T,$$

and $\Sigma = \mathbb{E}XX^T$. If $\mathcal{B}$ is a symmetric set of $p \times k$ matrices and

$$\sigma = \sup_{\|u\|_2 = 1}\|\langle X, u\rangle\|_{\psi_2} < \infty,$$

then there exist positive constants $c_1$, $c_2$, and $c_3$ such that for all $t \ge c_1$, with probability at least $1 - 2\exp(-c_2 t^{2/5})$,

$$\sup_{B \in \mathcal{B}}\left|\langle S_n - \Sigma, BB^T\rangle\right| \le c_3 t\left\{\frac{\sigma^2 \sup_{B \in \mathcal{B}}\|B\|_F}{\sqrt{n}}\left(\mathbb{E}\sup_{B \in \mathcal{B}}\langle Z, B\rangle\right) + \frac{\sigma^2}{n}\left(\mathbb{E}\sup_{B \in \mathcal{B}}\langle Z, B\rangle\right)^2\right\},$$

where $Z$ is a $p \times k$ matrix with i.i.d. standard Gaussian entries. Furthermore, we have under the same conditions, that

$$\mathbb{E}\sup_{B \in \mathcal{B}}\left|\langle S_n - \Sigma, BB^T\rangle\right| \le c\left\{\frac{\sigma^2 \sup_{B \in \mathcal{B}}\|B\|_F}{\sqrt{n}}\left(\mathbb{E}\sup_{B \in \mathcal{B}}\langle Z, B\rangle\right) + \frac{\sigma^2}{n}\left(\mathbb{E}\sup_{B \in \mathcal{B}}\langle Z, B\rangle\right)^2\right\},$$

with a universal constant $c$.
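Both terms in the bound are driven by the Gaussian width $\mathbb{E}\sup_{B \in \mathcal{B}}\langle Z, B\rangle$. As an aside (our illustration, using a toy finite class of our own choosing, not from the paper), this width is easy to estimate by simulation, which gives a cheap sanity check on analytic width bounds for structured classes:

```python
import numpy as np

rng = np.random.default_rng(3)
p, k, m, reps = 30, 2, 50, 2000

# A toy finite symmetric class B (random matrices together with their negations).
Bs = rng.standard_normal((m, p, k))
Bs = np.concatenate([Bs, -Bs])

# Monte Carlo estimate of the Gaussian width E sup_{B in B} <Z, B>.
Z = rng.standard_normal((reps, p, k))
inner = np.einsum("rij,mij->rm", Z, Bs)   # <Z_r, B_m> = tr(B_m^T Z_r)
print(inner.max(axis=1).mean())
```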

Proof of Theorem 4.1. We will need the following lemma, whose proof we defer to the end.

Lemma 4.1. For all matrices $B$, $\big\|\|B^T X\|_2\big\|_{\psi_2} \le \|B\|_F \sup_{\|u\|_2 \le 1}\|\langle X, u\rangle\|_{\psi_2}$.

The upper bound in Theorem 1.2 has the form

$$\sup_{B \in \mathcal{B}}\big\|\|B^T X\|_2\big\|_{\psi_1}\frac{1}{\sqrt{n}}\gamma_2(\mathcal{G}, d_{\psi_2}) + \frac{1}{n}\gamma_2^2(\mathcal{G}, d_{\psi_2}). \tag{8}$$

Using Lemma 4.1 and the fact that $\|\cdot\|_{\psi_1} \le \|\cdot\|_{\psi_2}(\log 2)^{1/2}$ (see e.g. Chapter 2.2 of van der Vaart and Wellner, 1996),

$$\sup_{B \in \mathcal{B}}\big\|\|B^T X\|_2\big\|_{\psi_1} \le (\log 2)^{1/2}\sigma\sup_{B \in \mathcal{B}}\|B\|_F.$$

We also have from Lemma 4.1, Theorem 2.7.5 of Talagrand (2014), and the Majorizing Measure Theorem (see Theorem 2.4.1 of Talagrand, 2014) that

$$\gamma_2(\mathcal{G}, d_{\psi_2}) \le C\sigma\gamma_2(\mathcal{B}, d_{\|\cdot\|_F}) \le C\sigma\,\mathbb{E}\sup_{B \in \mathcal{B}}\langle Z, B\rangle$$

for some constant $C > 0$. Substitute the preceding two displays into (8) and invoke Theorem 1.2 to complete the proof of Theorem 4.1. ■

Proof of Lemma 4.1. We will repeatedly use the following fact about the $\psi_1$- and $\psi_2$-norms. For all random variables $Y$,

$$\|Y^2\|_{\psi_1} = \|Y\|_{\psi_2}^2.$$

(Indeed, $\mathbb{E}\exp(|Y^2/c|) = \mathbb{E}\exp(|Y/\sqrt{c}|^2)$, so $c$ is feasible in the definition of $\|Y^2\|_{\psi_1}$ exactly when $\sqrt{c}$ is feasible in the definition of $\|Y\|_{\psi_2}$.) Let $B \in \mathcal{B}$ and denote its columns by $b_1, \ldots, b_k$. Then

$$\|B^T X\|_2^2 = \sum_i |\langle b_i, X\rangle|^2 = \sum_i \|b_i\|_2^2\,|\langle b_i/\|b_i\|_2, X\rangle|^2,$$

and hence, by the triangle inequality for $\|\cdot\|_{\psi_1}$ and the identity $\sum_i \|b_i\|_2^2 = \|B\|_F^2$,

$$\begin{aligned}
\big\|\|B^T X\|_2\big\|_{\psi_2}^2 = \big\|\|B^T X\|_2^2\big\|_{\psi_1} &\le \sum_{i=1}^k \|b_i\|_2^2\,\big\||\langle b_i/\|b_i\|_2, X\rangle|^2\big\|_{\psi_1} \\
&\le \|B\|_F^2 \sup_{\|u\|_2 \le 1}\big\||\langle u, X\rangle|^2\big\|_{\psi_1} \\
&= \|B\|_F^2 \sup_{\|u\|_2 \le 1}\big\|\langle u, X\rangle\big\|_{\psi_2}^2. \qquad \square
\end{aligned}$$

We conclude the note with an application to principal components analysis. The next result provides an upper bound on the expected worst-case difference between the apparent and actual variance in PCA.


Corollary 4.1 (Apparent Variance Explained by PCA). Let $\mathcal{U}$ be the set of $p \times k$ orthonormal matrices. Under the assumptions of Theorem 4.1, there exists an absolute constant $c > 0$ such that

$$\mathbb{E}\sup_{U \in \mathcal{U}}\left|\frac{1}{n}\sum_{i=1}^n \|U^T X_i\|_2^2 - \mathbb{E}\|U^T X\|_2^2\right| \le ck\sigma^2\left\{\sqrt{\frac{pk}{n}} + \frac{pk}{n}\right\}.$$

Proof. After applying Theorem 4.1, we compute

$$\sup_{U \in \mathcal{U}}\|U\|_F = \sqrt{k},$$

and, since $\mathbb{E}\|Z\|_F^2 = pk$,

$$\mathbb{E}\sup_{U \in \mathcal{U}}\langle Z, U\rangle \le \sup_{U \in \mathcal{U}}\|U\|_F\,\mathbb{E}\|Z\|_F \le \sqrt{k}\left(\mathbb{E}\|Z\|_F^2\right)^{1/2} = \sqrt{pk^2}.$$

Substituting these into the expectation bound of Theorem 4.1 yields the stated inequality. $\square$

References

Mendelson, S., 2010. Empirical processes with a bounded $\psi_1$ diameter. Geom. Funct. Anal. 20 (4), 988–1027.
Talagrand, M., 2014. Upper and Lower Bounds for Stochastic Processes. Springer-Verlag. https://doi.org/10.1007/978-3-642-54075-2
van der Vaart, A.W., Wellner, J.A., 1996. Weak Convergence and Empirical Processes. Springer-Verlag.
Vu, V.Q., Lei, J., 2013. Minimax sparse principal subspace estimation in high dimensions. Ann. Statist. 41 (6), 2905–2947. https://doi.org/10.1214/13-AOS1151