Linear Algebra and its Applications 594 (2020) 81–94
Concentration inequalities for random matrix products ✩

Amelia Henriksen, Rachel Ward *
Oden Institute for Computational Engineering and Sciences, University of Texas, Austin, TX, United States of America

Article history: Received 19 August 2019; Accepted 30 January 2020; Available online 11 February 2020; Submitted by A. Uschmajew
MSC: 65F99; 60F99
Keywords: Noncommutativity; Random matrix product; Stochastic gradient descent; Large deviations
Abstract. Suppose {X_k}_{k∈Z} is a sequence of bounded independent random matrices with common dimension d × d and common expectation E[X_k] = X. Under these general assumptions, the normalized random matrix product

    Z_n = (I_d + (1/n) X_n)(I_d + (1/n) X_{n−1}) ··· (I_d + (1/n) X_1)

converges to e^X as n → ∞. Normalized random matrix products of this form arise naturally in stochastic iterative algorithms, such as Oja's algorithm for streaming Principal Component Analysis. Here, we derive nonasymptotic concentration inequalities for such random matrix products. In particular, we show that the spectral norm error satisfies ‖Z_n − e^X‖ = O((log(n))² √(log(d/δ)) / √n) with probability exceeding 1 − δ. This rate is sharp in n, d, and δ, up to logarithmic factors. The proof relies on two key points of theory: the Matrix Bernstein inequality concerning the concentration of sums of random matrices, and Baranyai's theorem from combinatorial mathematics. Concentration bounds for general classes of random matrix products are hard to come by in the literature, and we hope that our result will inspire further work in this direction. © 2020 Elsevier Inc. All rights reserved.

✩ This material is based upon work supported in part by AFOSR MURI Award N00014-17-S-F006. AH and RW also thank the anonymous referee for helpful improvements.
* Corresponding author. E-mail addresses: [email protected] (A. Henriksen), [email protected] (R. Ward).

https://doi.org/10.1016/j.laa.2020.01.040
0024-3795/© 2020 Elsevier Inc. All rights reserved.
1. Introduction

A classical limit theorem from complex analysis reads: Let (u_n)_{n∈N} be a uniformly bounded complex sequence whose mean (1/n) Σ_{k=0}^{n−1} u_k converges towards μ. Then

    lim_{n→∞} ∏_{k=0}^{n−1} (1 + u_k/n) = e^μ.    (1)

This result is easily verified by taking the natural logarithm of each side, and observing that Σ_{k=0}^{n−1} log(1 + u_k/n) ≈ (1/n) Σ_{k=0}^{n−1} u_k → μ.
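As a quick sanity check of (1), the following minimal sketch (assuming NumPy is available; the particular sequence u_k = 2 + sin(k) is an arbitrary bounded choice whose running mean converges to μ = 2) compares the finite product with e^μ:

```python
import numpy as np

def partial_product(n):
    # u_k = 2 + sin(k): uniformly bounded, running mean converges to mu = 2
    u = 2 + np.sin(np.arange(n))
    return np.prod(1 + u / n)

mu = 2.0
for n in [10, 100, 1000, 10000]:
    print(n, abs(partial_product(n) - np.exp(mu)))
```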
A non-commutative extension of this result was recently proven by Emme and Hubert in [8]:

Proposition 1. Let (A_n)_{n∈N} be a sequence of d × d complex matrices satisfying

    lim_{n→∞} (1/n) Σ_{k=1}^{n} A_k = A

and such that ((1/n) Σ_{k=1}^{n} A_k)_{n∈N} is bounded for a norm ‖·‖ by a positive constant. Consider the matrix product

    Z_n = (I_d + (1/n) A_1) ··· (I_d + (1/n) A_n).

Then lim_{n→∞} Z_n = e^A.
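The following minimal numerical sketch of Proposition 1, in the random special case discussed next, assumes NumPy and SciPy are available; the bounded random model for the factors is an arbitrary illustrative choice, not one prescribed by the paper. It draws independent bounded random matrices with common mean X and compares the normalized product with e^X.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 4
X = rng.normal(size=(d, d)) / d                         # common expectation of the factors
for n in [100, 1000, 10000]:
    Z = np.eye(d)
    for _ in range(n):
        X_k = X + rng.uniform(-1, 1, size=(d, d)) / d   # bounded perturbation with mean X
        Z = (np.eye(d) + X_k / n) @ Z                   # each new factor multiplies on the left
    print(n, np.linalg.norm(Z - expm(X), 2))
```

The printed spectral-norm errors shrink as n grows, previewing the quantitative rates derived below.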
The proof of Proposition 1 is not a straightforward extension of the scalar result, due to noncommutativity of the matrix product. An important special case within the framework of Proposition 1 is when the A_k are uniformly bounded independent random matrices with common expectation E[A_k] = A. Then Z_n is also a random matrix, and has expectation E[Z_n] = (I_d + (1/n) A)^n. Within this framework, it is natural to ask about rates of convergence of Z_n to e^A. As far as we are aware, precise rates of convergence for matrix products of the form Z_n have not appeared in the literature before, despite such random matrix products naturally arising in stochastic iterative algorithms such as stochastic gradient descent; in particular, in
Oja's algorithm for estimating the top eigenvector of the covariance matrix of a distribution of matrices observed sequentially [11,14,5,13,15,10,3]. Here, as the main content of this paper, we derive a rate of convergence for matrix products of this form.

Theorem 1 (Main Theorem). Consider a sequence {X_k}_{k∈N} of independent (real or complex-valued) random matrices with common dimension d × d. Assume that

    E[X_k] = X    and    ‖X_k‖ ≤ L    for each index k.

Introduce the sequence of random matrices {Z_n}_{n∈N} given by

    Z_n = (I_d + (1/n) X_n)(I_d + (1/n) X_{n−1}) ··· (I_d + (1/n) X_1).    (2)

Suppose that L > 0, n, d ∈ N, and δ ∈ (0, 1/2] are such that

    max{3, Le²} ≤ log(n) + 1 ≤ (16n / (log(d/δ) + log(ne)))^{1/3}.    (3)

Then with probability exceeding 1 − 2δ, the following holds:

    ‖Z_n − e^X‖ ≤ (2Le^L log(n)/√n) ( √(2 log(2d/δ)) + (log(n))² + log(n)/√n ) + L²e^L/(2n),
where ‖·‖ denotes the matrix spectral norm.

Theorem 1 immediately implies a bound on the expected value of ‖Z_n − e^X‖. Note that ‖Z_n‖ ≤ e^L and ‖e^X‖ ≤ e^L, so for any δ > 0 satisfying (3),

    E‖Z_n − e^X‖ ≤ (1 − 2δ) [ (2Le^L log(n)/√n) ( √(2 log(2d/δ)) + (log(n))² + log(n)/√n ) + L²e^L/(2n) ] + 4δe^L.

In particular, setting δ = L²/(8n) gives

    E‖Z_n − e^X‖ ≤ (2Le^L log(n)/√n) ( √(2 log(2d/δ)) + (log(n))² + log(n)/√n ) + L²e^L/n.

Note that the O(1/√n) convergence rate is unavoidable under the stated assumptions. It remains open whether the log(n) factor in the rate given by Theorem 1 can be removed. The dependence on L is not optimal; an interesting open question is whether the e^L factor in the rate can be improved to e^{‖X‖}.
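As an informal illustration of the 1/√n scaling (this does not verify the constants or the logarithmic factors in Theorem 1), one can estimate a typical error by Monte Carlo and observe that √n times the error stays roughly flat as n grows. The sketch below assumes NumPy and SciPy are available and reuses an arbitrary bounded toy distribution:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, trials = 3, 50
X = rng.normal(size=(d, d)) / d
eX = expm(X)

def one_error(n):
    Z = np.eye(d)
    for _ in range(n):
        X_k = X + rng.uniform(-1, 1, size=(d, d)) / d
        Z = (np.eye(d) + X_k / n) @ Z
    return np.linalg.norm(Z - eX, 2)

for n in [200, 800, 3200]:
    med = np.median([one_error(n) for _ in range(trials)])
    print(n, med, np.sqrt(n) * med)    # sqrt(n) * error should stay roughly flat
```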
Remark 1. Limit laws for products of random matrices have been extensively analyzed in the context of ergodic theory or martingales on Markov chains – see for instance the book [7] or the extensive survey articles [9,12]. However, results in the form of quantitative rates of convergence of general random matrix products are quite scarce, apart from specialized cases such as products of i.i.d. Gaussian random matrices. For the random matrix product Z_n in (2) that we consider, the limiting behavior Z_n → exp(X) was studied in [6] and [8].

Notation. Throughout, ‖X‖ refers to the spectral norm of the matrix X. For an integer n ≥ 1, we use the notation [n] to refer to the set {1, 2, . . . , n}. We write Prob[E] to refer to the probability of the event E.

2. Preliminaries

A crucial ingredient of the proof of Theorem 1 is the matrix Bernstein inequality, a matrix-level extension of the classical scalar Bernstein inequality describing the upper tail of a sum of independent bounded or sub-exponential random variables. The first matrix Bernstein type bound was derived by Ahlswede and Winter [2], and subsequently improved by Tropp [16] by applying Lieb's theorem in place of the Golden-Thompson inequality. We use the variant of the matrix Bernstein inequality of Tropp stated below.

Proposition 2 (Matrix Bernstein Inequality (Theorem 6.1.1 in [17])). Consider a finite sequence {S_k} of independent random matrices with common dimension d_1 × d_2. Assume that

    E[S_k] = 0    and    ‖S_k‖ ≤ L    for each index k.

Introduce the random matrix

    Z = Σ_k S_k.

Let v(Z) be the matrix variance statistic of the sum:

    v(Z) = max{ ‖E[ZZ*]‖, ‖E[Z*Z]‖ }    (4)
         = max{ ‖Σ_k E[S_k S_k*]‖, ‖Σ_k E[S_k* S_k]‖ }.    (5)

Then, for all t ≥ 0,

    Prob{‖Z‖ ≥ t} ≤ (d_1 + d_2) exp( (−t²/2) / (v(Z) + Lt/3) ).
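To make Proposition 2 concrete, the following sketch compares an empirical tail probability for ‖Z‖ with the Bernstein bound. It assumes NumPy is available; the model S_k = ε_k A_k, with Rademacher signs ε_k and fixed matrices A_k of our own choosing, is just one convenient example for which E[S_k] = 0, ‖S_k‖ ≤ max_k ‖A_k‖, and E[S_k S_k*] = A_k A_k^T are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 5, 200, 2000
A = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(m)]     # fixed matrices A_k

L = max(np.linalg.norm(Ak, 2) for Ak in A)                       # uniform bound on ||S_k||
v = max(np.linalg.norm(sum(Ak @ Ak.T for Ak in A), 2),
        np.linalg.norm(sum(Ak.T @ Ak for Ak in A), 2))           # matrix variance statistic

t = 2.5 * np.sqrt(v)
bernstein = 2 * d * np.exp(-t**2 / 2 / (v + L * t / 3))

# S_k = eps_k * A_k with independent Rademacher signs: E[S_k] = 0, ||S_k|| <= L
norms = []
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=m)
    Z = sum(e * Ak for e, Ak in zip(eps, A))
    norms.append(np.linalg.norm(Z, 2))
print("empirical tail:", np.mean(np.array(norms) >= t), "  Bernstein bound:", bernstein)
```

The empirical tail typically sits far below the bound, as expected: the matrix Bernstein inequality is a worst-case bound.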
Another key theorem we rely on is Baranyai's theorem [4], stated below.

Proposition 3 (Baranyai, 1973). Let a_1, . . . , a_t be natural numbers such that Σ_{j=1}^{t} a_j = C(N, k). Then the set of k-subsets of [N] can be partitioned into disjoint families S_1, . . . , S_t with |S_j| = a_j and each i ∈ [N] is included in exactly ⌊a_j·k/N⌋ or ⌈a_j·k/N⌉ elements of S_j.
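The way Baranyai's theorem enters the proof below is easiest to see when k divides N: the C(N, k) k-subsets of [N] then split into families that are themselves partitions of [N] into N/k disjoint blocks. The following sketch (standard library only) constructs such a split for k = 2 and even N via the classical round-robin construction and checks the conclusion; it illustrates the statement in a special case and is not the general proof.

```python
from itertools import combinations

def one_factorization(N):
    """Partition the 2-subsets of {0,...,N-1} (N even) into N-1 perfect matchings:
    the k = 2, k | N special case of Baranyai's theorem (round-robin construction)."""
    rounds = []
    for r in range(N - 1):
        matching = [frozenset({N - 1, r})]
        for i in range(1, N // 2):
            matching.append(frozenset({(r + i) % (N - 1), (r - i) % (N - 1)}))
        rounds.append(matching)
    return rounds

N = 8
rounds = one_factorization(N)
pairs = [tuple(sorted(p)) for m in rounds for p in m]
assert sorted(pairs) == list(combinations(range(N), 2))   # every 2-subset appears exactly once
assert all(len(set().union(*m)) == N for m in rounds)     # each round is a partition of [N]
```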
2.1. Sketch of the proof of Theorem 1

Suppose that X_k, X, and Z_n satisfy the assumptions of Theorem 1. Write

    Z_n = Σ_{k=0}^{n} (1/n)^k Σ_{1≤j_1<···<j_k≤n} X_{j_k} X_{j_{k−1}} ··· X_{j_1}    (6)
        = I_d + Σ_{k=1}^{n} Z_{n,k}    (7)

where

    Z_{n,k} = (1/n)^k Σ_{1≤j_1<···<j_k≤n} X_{j_k} X_{j_{k−1}} ··· X_{j_1},    1 ≤ k ≤ n.    (8)

Because the X_k are independent, the expected values of Z_{n,k} and Z_n are easily calculated:

    E[Z_{n,k}] = (1/n)^k C(n, k) X^k,    E[Z_n] = I_d + Σ_{k=1}^{n} E[Z_{n,k}] = (I_d + (1/n) X)^n.    (9)

We then write

    ‖Z_n − e^X‖ ≤ ‖Z_n − E[Z_n]‖ + ‖E[Z_n] − e^X‖
                = ‖Z_n − E[Z_n]‖ + ‖(I_d + (1/n) X)^n − e^X‖    (10)
                ≤ Σ_{k=1}^{n} ‖Z_{n,k} − E[Z_{n,k}]‖ + ‖(I_d + (1/n) X)^n − e^X‖.    (11)

The approximation error ‖(I_d + (1/n) X)^n − e^X‖ is bounded deterministically using standard analysis, and converges to zero at rate O(1/n), as made precise by Lemma 2. The errors ‖Z_{n,k} − E[Z_{n,k}]‖ decay sufficiently quickly in k that the sum of all but the first ⌈log(n)⌉ many of them, Σ_{k=⌈log(n)⌉}^{n} ‖Z_{n,k} − E[Z_{n,k}]‖, is also bounded by O(1/n) deterministically (Lemma 3 below). The leading error term ‖Z_{n,1} − E[Z_{n,1}]‖ is bounded with high probability using the Matrix Bernstein inequality. The most interesting, and most difficult, part of the proof is in bounding the intermediate terms ‖Z_{n,k} − E[Z_{n,k}]‖, k = 2, . . . , ⌈log(n)⌉. To do this, we appeal to Baranyai's theorem, which implies that each such term can be approximately written as a sum of sums of independent matrix products, so that we may apply the matrix Bernstein inequality with properly tuned parameters to each sub-sum to achieve the final bound.
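The expansion (6)-(8) is simply the result of multiplying out the product, with the larger index always acting on the left. A brute-force check on a tiny instance (a sketch assuming NumPy is available; the matrices are arbitrary) confirms the bookkeeping:

```python
import numpy as np
from itertools import combinations
from functools import reduce

rng = np.random.default_rng(0)
n, d = 5, 3
Xs = [rng.normal(size=(d, d)) for _ in range(n)]   # X_1, ..., X_n (0-based in code)

# Product form: Z_n = (I + X_n/n)(I + X_{n-1}/n) ... (I + X_1/n)
Z_prod = np.eye(d)
for X in Xs:                                       # multiply each new factor on the left
    Z_prod = (np.eye(d) + X / n) @ Z_prod

# Expansion (6)-(8): Z_n = I + sum_k Z_{n,k}
Z_exp = np.eye(d)
for k in range(1, n + 1):
    for idx in combinations(range(n), k):          # indices j_1 < ... < j_k
        term = reduce(lambda A, B: A @ B, [Xs[j] for j in reversed(idx)])  # X_{j_k} ... X_{j_1}
        Z_exp += term / n ** k
assert np.allclose(Z_prod, Z_exp)
```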
3. Key ingredients

The first two lemmas use standard analysis tools; we defer the proofs to appendices.

Lemma 2. Let X be a square real or complex matrix with spectral norm ‖X‖. The following holds:

    ‖(I + (1/n) X)^n − e^X‖ ≤ e^{‖X‖} ‖X‖² / (2n).
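A quick numerical check of Lemma 2 (a sketch assuming NumPy and SciPy are available; the test matrix is arbitrary):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
s = np.linalg.norm(X, 2)                    # sigma = spectral norm of X
for n in [10, 100, 1000]:
    lhs = np.linalg.norm(np.linalg.matrix_power(np.eye(4) + X / n, n) - expm(X), 2)
    rhs = np.exp(s) * s**2 / (2 * n)        # bound from Lemma 2
    print(n, lhs, rhs, lhs <= rhs)
```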
The proof of Lemma 2 is found in Appendix B.

Lemma 3. Suppose that Z_n is as in Theorem 1, and let Z_{n,k} be as defined in (8). Suppose that ⌈log(n)⌉ ≥ max{3, Le²}. Then

    Σ_{k=⌈log(n)⌉}^{n} ‖Z_{n,k} − E(Z_{n,k})‖ ≤ 2Le² / (n(e − 1)).

The proof of Lemma 3 is found in Appendix A.

Proposition 4 contains the meat of the proof. By carefully combining the Matrix Bernstein inequality and Baranyai's theorem, we produce high probability bounds for the error terms ‖Z_{n,k} − E(Z_{n,k})‖.

Proposition 4. Assume X_1, X_2, . . . , X_n are d × d matrices satisfying the assumptions in Theorem 1, and suppose that n, k, d ∈ Z, and δ > 0 are such that

    k ≤ (16n / (log(d/δ) + log(ne)))^{1/3}.    (12)

Then

    Prob[‖Z_{n,k} − E(Z_{n,k})‖ > γ_k] ≤ δ^k,

where

    γ_k = 2 (eL/(k−1))^{k−1} [ (2L/√n) √( log( 2d (ne/(k−1))^{k−1} / δ ) ) + L(k−1)/n ],    (13)

and, for the k = 1 case, we treat 0⁰ = 1.
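It can help intuition to evaluate γ_k numerically for representative parameters and watch how the factor (eL/(k−1))^{k−1} eventually forces γ_k to decay in k; the parameter values in the sketch below (which assumes NumPy is available) are arbitrary illustrative choices, not ones used in the paper.

```python
import numpy as np

n, d, L, delta = 10**6, 10, 1.0, 0.01
for k in range(1, 9):
    km1 = max(k - 1, 1)          # avoid division by zero; 0^0 = 1 is handled by the exponent k-1
    gamma_k = 2 * (np.e * L / km1) ** (k - 1) * (
        2 * L / np.sqrt(n) * np.sqrt(np.log(2 * d * (n * np.e / km1) ** (k - 1) / delta))
        + L * (k - 1) / n)
    print(k, gamma_k)
```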
Proof. For simplicity of notation, we drop the subscript n in all matrix notation throughout; that is, we let Z = Z_n, we let Z_k = Z_{n,k}, and so on. Let p ∈ {0, 1, . . . , k − 1} be the unique integer such that k divides n − p, and write

    Z_k − E[Z_k] = (1/n)^k Σ_{1≤j_1<···<j_k≤n} ( X_{j_k} X_{j_{k−1}} ··· X_{j_1} − X^k )    (14)
                 = (1/n)^k Σ_{1≤j_1<···<j_k≤n−p} ( X_{j_k} X_{j_{k−1}} ··· X_{j_1} − X^k ) + D_k.    (15)

The random matrix D_k is a sum of C(n, k) − C(n−p, k) random matrix products, each of which contains at least one of the p matrices X_{n−p+1}, . . . , X_n. Each term is bounded in norm deterministically by 2(L/n)^k, so

    ‖D_k‖ ≤ ( C(n, k) − C(n−p, k) ) · 2 (L/n)^k
           ≤ 2(k − 1) C(n−1, k−1) (L/n)^k    (by Pascal's rule)
           ≤ 2(k − 1) ( (n−1)e/(k−1) )^{k−1} (L/n)^k
           ≤ 2 ( L(k−1)/n ) ( eL/(k−1) )^{k−1}.

We thus have so far that

    ‖Z_k − E[Z_k]‖ ≤ ‖ (1/n)^k Σ_{1≤j_1<···<j_k≤n−p} ( X_{j_k} X_{j_{k−1}} ··· X_{j_1} − X^k ) ‖ + 2 ( L(k−1)/n ) ( eL/(k−1) )^{k−1}.

Now, as a consequence of Baranyai's theorem, there exist m_k = C(n−p, k) / ((n−p)/k) = C(n−p−1, k−1) partitions of [n − p] = {1, 2, . . . , n − p}, denoted by P_r, r = 1, 2, . . . , m_k, such that

    Σ_{1≤j_1<···<j_k≤n−p} ( X_{j_k} ··· X_{j_1} − X^k ) = Σ_{r=1}^{m_k} Σ_{{j_1<···<j_k} ∈ P_r} ( X_{j_k} ··· X_{j_1} − X^k ).

Write, for ℓ = 1, . . . , (n−p)/k,

    Y_{r,ℓ} = (1/n)^k ( X_{j_k} ··· X_{j_1} − X^k ),    (16)

where {j_1 < ··· < j_k} is the ℓ-th set in the partition P_r. Because the X_j are independent and because each P_r constitutes a partition of [n − p], each subset of random matrices {Y_{r,ℓ}}_{ℓ=1}^{(n−p)/k} forms a mutually independent set of random matrices. We can use this to bound ‖ Σ_{r=1}^{m_k} Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ with high probability, using the Matrix Bernstein Inequality (Proposition 2). Indeed, we will apply the Matrix Bernstein Inequality separately to each sum Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} of independent random matrices. To do this, we employ the bounds

1. E[Y_{r,ℓ}] = (1/n)^k E[ X_{j_k} ··· X_{j_1} − X^k ] = (1/n)^k ( X^k − X^k ) = 0;
2. ‖Y_{r,ℓ}‖ = (1/n)^k ‖ X_{j_k} ··· X_{j_1} − X^k ‖ ≤ (1/n)^k ( ‖X_{j_k} ··· X_{j_1}‖ + ‖X‖^k ) ≤ 2 (L/n)^k;
3. ‖ Σ_{ℓ=1}^{(n−p)/k} E[Y_{r,ℓ} Y_{r,ℓ}*] ‖ ≤ Σ_{ℓ=1}^{(n−p)/k} ‖ E[Y_{r,ℓ} Y_{r,ℓ}*] ‖ ≤ Σ_{ℓ=1}^{(n−p)/k} E ‖Y_{r,ℓ}‖² ≤ 4 (n/k) (L/n)^{2k};
4. Similarly, ‖ Σ_{ℓ=1}^{(n−p)/k} E[Y_{r,ℓ}* Y_{r,ℓ}] ‖ ≤ 4 (n/k) (L/n)^{2k}.

We can now apply the Matrix Bernstein Inequality: for any τ > 0,

    Prob[ ‖ Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ ≥ τ ] ≤ 2d exp( (−τ²/2) / ( 4 (n/k) (L/n)^{2k} + (2τ/3) (L/n)^k ) ).

We take the union bound over all m_k = C(n−p−1, k−1) ≤ C(n, k−1) ≤ (ne/(k−1))^{k−1} sums to obtain

    Prob[ ∃ r ∈ [m_k] : ‖ Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ ≥ τ ] ≤ 2d (ne/(k−1))^{k−1} exp( (−τ²/2) / ( 4 (n/k) (L/n)^{2k} + τ (L/n)^k ) ).

Set τ = β_k (ne/(k−1))^{−(k−1)} (where, in case k = 1, we use 0⁰ = 1). Then

    Prob[ ‖ Σ_{r=1}^{m_k} Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ ≥ β_k ]
        ≤ Prob[ ∃ r ∈ [m_k] : ‖ Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ ≥ β_k / m_k ]
        ≤ Prob[ ∃ r ∈ [m_k] : ‖ Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ ≥ β_k (ne/(k−1))^{−(k−1)} ]
        ≤ 2d (ne/(k−1))^{k−1} exp( ( −β_k² (ne/(k−1))^{−2(k−1)} / 2 ) / ( 4 (n/k) (L/n)^{2k} + β_k (ne/(k−1))^{−(k−1)} (L/n)^k ) )
        = 2d (ne/(k−1))^{k−1} exp( ( −β_k² n ((k−1)/e)^{2(k−1)} / 2 ) / ( 4 L^{2k}/k + β_k ((k−1)/e)^{k−1} L^k ) ).

Set

    β_k = (4/√n) (e/(k−1))^{k−1} L^k √( log(2d) + (k−1) log(ne/(k−1)) − log(δ) ).    (17)

Under the assumption that

    k ≤ √( 4n / ( log(d) + (k−1) log(ne/(k−1)) − log(δ) ) ),    (18)

which is implied by the stated condition (12) on k, it follows that

    β_k ((k−1)/e)^{k−1} L^k ≤ 4 L^{2k}/k,

and so we can continue to bound

    Prob[ ‖ Σ_{r=1}^{m_k} Σ_{ℓ=1}^{(n−p)/k} Y_{r,ℓ} ‖ ≥ β_k ]
        ≤ 2d (ne/(k−1))^{k−1} exp( ( −β_k² n ((k−1)/e)^{2(k−1)} / 2 ) / ( 8 L^{2k}/k ) )
        ≤ 2d (ne/(k−1))^{k−1} exp( −k ( log(2d) + (k−1) log(ne/(k−1)) − log(δ) ) )
        ≤ δ^k.

Thus, we conclude that for each k = 1, 2, . . . satisfying assumption (18), it holds that

    Prob[ ‖Z_k − E[Z_k]‖ > β_k + ‖D_k‖ ] ≤ δ^k.

Recalling ‖D_k‖ ≤ 2 ( L(k−1)/n ) ( eL/(k−1) )^{k−1} yields the result.
4. Proof of Theorem 1

We can bound the error ‖Z_n − E[Z_n]‖ from Theorem 1 by combining Proposition 4 with Lemma 3.

Corollary 3.1. Suppose that L, n, and δ ∈ (0, 1/2] are such that

    max{3, Le²} ≤ log(n) + 1 ≤ (16n / (log(d/δ) + log(ne)))^{1/3}.    (19)

Then with probability exceeding 1 − 2δ,

    ‖Z_n − E[Z_n]‖ ≤ (2Le^L log(n)/√n) ( √( 2 log( 2d (ne)^{log(n)} / δ ) ) + log(n)/√n ).

Proof. First, ‖Z_n − E[Z_n]‖ ≤ Σ_{k=1}^{n} ‖Z_{n,k} − E[Z_{n,k}]‖ by the triangle inequality. By Lemma 3,

    Σ_{k=⌈log(n)⌉}^{n} ‖Z_{n,k} − E[Z_{n,k}]‖ ≤ 2Le² / (n(e − 1))    with probability 1.

Now, given (19), we can apply Proposition 4 to each of k = 1, 2, . . . , ⌈log(n)⌉, and via the union bound, we obtain that the following holds with probability at least 1 − Σ_{k=1}^{⌈log(n)⌉} δ^k ≥ 1 − δ/(1 − δ) ≥ 1 − 2δ:

    Σ_{k=1}^{⌈log(n)⌉} ‖Z_{n,k} − E(Z_{n,k})‖
        ≤ Σ_{k=1}^{⌈log(n)⌉} γ_k
        ≤ 2 Σ_{k=1}^{⌈log(n)⌉} (eL/(k−1))^{k−1} [ (2L/√n) √( log( 2d (ne/(k−1))^{k−1} / δ ) ) + L(k−1)/n ]
        ≤ 2L Σ_{k=1}^{⌈log(n)⌉} (eL/(k−1))^{k−1} [ (2/√n) √( log( 2d (ne)^{log(n)} / δ ) ) + log(n)/n ]
        ≤ (2Le^L log(n)/√n) ( √( 2 log( 2d (ne)^{log(n)} / δ ) ) + log(n)/√n ),

where in the final inequality we use that (eL/x)^x is maximized over x > 0 at x* = L. We have the stated result.
Proof of Theorem 1 from Corollary 3.1. Write ‖Z_n − e^X‖ ≤ ‖Z_n − E[Z_n]‖ + ‖E[Z_n] − e^X‖. Bound ‖Z_n − E[Z_n]‖ using Corollary 3.1 and bound ‖E[Z_n] − e^X‖ = ‖(I + (1/n) X)^n − e^X‖ using Lemma 2 to arrive at the statement of Theorem 1.

5. Conclusion and future directions

We derived a large deviations bound for the convergence rate of a certain type of product of random matrices toward its limiting distribution. Our results are quite general and nearly sharp with respect to the dependence on the matrix size d and the number of terms in the product, n. One particularly immediate application of our rates of convergence is in the analysis of random matrix products arising in stochastic iterative algorithms such as Oja's algorithm for streaming principal component analysis [14]. One area of future work would be to use our results to derive convergence rates for Oja's method under minimal assumptions – an area of ongoing research (see, for example, [1,10]). This is particularly important because of the fundamental role streaming PCA plays in high-dimensional data analysis.

Declaration of competing interest

There is no conflict of interest.

Appendix A. Proof of Lemma 3

Lemma 3. Suppose that Z_n is as in Theorem 1, and let Z_{n,k} be as defined in (8). Suppose that ⌈log(n)⌉ ≥ max{3, Le²}. Then

    Σ_{k=⌈log(n)⌉}^{n} ‖Z_{n,k} − E(Z_{n,k})‖ ≤ 2Le² / (n(e − 1)).
Proof. We have that

    ‖Z_{n,k} − E[Z_{n,k}]‖ = (1/n)^k ‖ Σ_{1≤j_1<···<j_k≤n} ( X_{j_k} X_{j_{k−1}} ··· X_{j_1} − X^k ) ‖
        ≤ (1/n)^k Σ_{1≤j_1<···<j_k≤n} ( ‖X_{j_k} X_{j_{k−1}} ··· X_{j_1}‖ + ‖X‖^k )
        ≤ (2L^k / n^k) C(n, k)    (using that ‖X_j‖ ≤ L)
        ≤ (2L^k / n^k) (en/k)^k
        = 2 (Le/k)^k.

Hence it remains to show that

    Σ_{k=⌈log(n)⌉}^{n} (Le/k)^k ≤ Le² / (n(e − 1)).

Let k_0 = ⌈log(n)⌉ in the remainder. First, we observe that (Le/k_0)^{k_0} ≤ Le/n:

    (Le/k_0)^{k_0} ≤ Le/n
        ⇔ k_0 log(Le) − k_0 log(k_0) ≤ log(Le) − log(n)
        ⇔ (k_0 − 1) log(Le) − k_0 log(k_0) ≤ −log(n)
        ⇔ (k_0 − 1) log(Le) − k_0 (log(k_0) − 1) − k_0 ≤ −log(n).

Since k_0 ≥ log(n), it suffices to show that

    (k_0 − 1) log(Le) − k_0 (log(k_0) − 1) ≤ 0.

We consider two cases:

1. Case 1: If L ≤ 1/e, then (k_0 − 1) log(Le) ≤ (k_0 − 1) log(1) ≤ 0. Thus we require −k_0 (log(k_0) − 1) ≤ 0. This clearly holds because k_0 ≥ e.
2. Case 2: If L > 1/e, then log(Le) > log(1) = 0, so −log(Le) < 0. Since k_0 ≥ Le², it follows that

    (k_0 − 1) log(Le) − k_0 (log(k_0) − 1) ≤ (k_0 − 1) log(Le) − k_0 (log(Le²) − 1)
                                          = (k_0 − 1) log(Le) − k_0 (log(Le) + 1 − 1)
                                          = −log(Le) < 0.

Now, for each k ≥ k_0,

    (Le/(k+1))^{k+1} ≤ (Le/k)^{k+1} = (Le/k) (Le/k)^k ≤ (Le/Le²) (Le/k)^k = e^{−1} (Le/k)^k.

By induction, it follows that for each k ≥ k_0,

    (Le/k)^k ≤ e^{k_0 − k} (Le/k_0)^{k_0} ≤ e^{k_0 − k} · Le/n.

Hence,

    Σ_{k=k_0}^{n} (Le/k)^k ≤ Σ_{k=k_0}^{n} e^{k_0 − k} · Le/n ≤ (Le/n) Σ_{k=0}^{∞} e^{−k} = Le² / (n(e − 1)).
Appendix B. Proof of Lemma 2

Lemma 2. Let X be a square real or complex matrix with spectral norm ‖X‖. The following holds:

    ‖(I + (1/n) X)^n − e^X‖ ≤ e^{‖X‖} ‖X‖² / (2n).

Proof. The proof uses only basic analytic tools and inequalities. Recall the matrix exponential: e^X := Σ_{k=0}^{∞} X^k / k!. Let σ = ‖X‖. Then we have

    ‖(I + (1/n) X)^n − e^X‖ = ‖ Σ_{k=0}^{n} C(n, k) X^k / n^k − Σ_{k=0}^{∞} X^k / k! ‖
        = ‖ Σ_{k=0}^{n} ( n! / (n^k (n − k)!) − 1 ) X^k / k! − Σ_{k=n+1}^{∞} X^k / k! ‖
        ≤ Σ_{k=0}^{n} | n! / (n^k (n − k)!) − 1 | ‖X‖^k / k! + Σ_{k=n+1}^{∞} ‖X‖^k / k!
        = Σ_{k=0}^{n} ( 1 − n! / (n^k (n − k)!) ) σ^k / k! + Σ_{k=n+1}^{∞} σ^k / k!
        = e^σ − Σ_{k=0}^{n} ( n! / (n^k (n − k)!) ) σ^k / k!
        = e^σ − Σ_{k=0}^{n} C(n, k) σ^k / n^k
        = e^σ − (1 + σ/n)^n
        = e^σ − exp( n log(1 + σ/n) )
        ≤ e^σ − exp( σ − σ²/(2n) )
        = e^σ ( 1 − exp( −σ²/(2n) ) ),

where in the final inequality we used that log(1 + σ/n) ≥ σ/n − (1/2)(σ/n)². Thus, using also that e^{−x} ≥ 1 − x for all x > 0,

    ‖(I + (1/n) X)^n − e^X‖ ≤ e^σ σ² / (2n).
References

[1] Z. Allen-Zhu, Y. Li, First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate, ArXiv e-prints, July 2016.
[2] R. Ahlswede, A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Inf. Theory 48 (2003) 569–579.
[3] Z. Allen-Zhu, Y. Li, First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate, in: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2017, pp. 487–492.
[4] Zs. Baranyai, On the factorization of the complete uniform hypergraph, in: Infinite and Finite Sets: to Paul Erdős on His 60th Birthday, in: Colloquia Mathematica Societatis János Bolyai, vol. 10, North-Holland Publishing Company, 1975, pp. 91–108.
[5] A. Balsubramani, S. Dasgupta, Y. Freund, The fast convergence of incremental PCA, in: Advances in Neural Information Processing Systems (NIPS), 2013, pp. 3174–3182.
[6] M. Berger, Central limit theorem for products of random matrices, Trans. Am. Math. Soc. 285 (1984) 777–803.
[7] Y. Benoist, J.-F. Quint, Random Walks on Reductive Groups, Results in Mathematics and Related Areas, vol. 62, Springer, 2016.
[8] J. Emme, P. Hubert, Limit laws for random matrix products, Math. Res. Lett. 25 (2018).
[9] A. Furman, Random walks on groups and random transformations, in: Handbook of Dynamical Systems, vol. 1A, 2002, pp. 931–1014.
[10] P. Jain, C. Jin, S.M. Kakade, P. Netrapalli, A. Sidford, Matching matrix Bernstein with little memory: near-optimal finite sample guarantees for Oja's algorithm, CoRR, arXiv:1602.06929 [abs], 2016.
[11] T. Krasulina, Method of stochastic approximation in the determination of the largest eigenvalue of the mathematical expectation of random matrices, Autom. Remote Control (1970) 50–56.
[12] F. Ledrappier, Some Asymptotic Properties of Random Walks on Free Groups, CRM Proc. Lecture Notes, vol. 1A, Amer. Math. Soc., 2001.
[13] I. Mitliagkas, C. Caramanis, P. Jain, Memory limited, streaming PCA, in: Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2886–2894.
[14] E. Oja, Simplified neuron model as a principal component analyzer, J. Math. Biol. 15 (3) (1982) 267–273.
[15] C.D. Sa, C. Re, K. Olukotun, Global convergence of stochastic gradient descent for some nonconvex matrix problems, in: Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 2332–2341.
[16] J. Tropp, User-friendly tail bounds for sums of random matrices, Found. Comput. Math. 12 (2010) 389–434.
[17] J.A. Tropp, An Introduction to Matrix Concentration Inequalities, in: M. Casey (Ed.), Foundations and Trends in Machine Learning, vol. 8, no. 1–2, 2015.