Reminiscences, and some explorations about the bootstrap


R.M. Dudley
Massachusetts Institute of Technology, Room E18-412, MIT, Cambridge, MA 02139, USA
Received 22 February 2015; accepted 7 January 2016

Abstract

The paper is a potpourri of short sections. There will be some reminiscences about Evarist (from the early 1970s), then some on infinite-dimensional limit theorems from 1950 through 1990. A section reviews a case of slow convergence in the central limit theorem for empirical processes (Beck, 1985) and another the "fast" convergence of Komlós–Major–Tusnády. The paper does an experimental exploration of bootstrap confidence intervals for the mean (of Pareto distributions) and (as less commonly seen) for the variance, of normal and Pareto distributions.
© 2016 Elsevier B.V. All rights reserved.

Keywords: Empirical measures; Confidence intervals; Bias-correction

1. About Evarist

Evarist was my fourth Ph.D. student and third at MIT. Originally from Catalonia, he arrived in the Fall of 1970 from Venezuela, where he had been teaching. Evarist finished his Ph.D. thesis (on testing for uniformity on compact Riemannian manifolds) in 1973 and published it in [19]. After finishing at MIT Evarist went back to Venezuela for a year. He spent the academic year 1974–75 in Berkeley on a short-term faculty appointment. Then he again returned to Venezuela to fulfill the obligation of his 3-year graduate fellowship to MIT. He was at "IVIC", the Venezuelan Institute of Scientific Research.


2. Infinite-dimensional limit theorems

The one previous meeting I had been to in the UK was in late July 1974, in Durham, a London Mathematical Society Symposium on Functional Analysis and Stochastic Processes. I was struck by the remarkably cool midsummer weather in Durham, as I was also by remarkably warm February weather in Cambridge (UK) on my one previous visit there, as a tourist for a day in 1976.

In the 1950s, Robert Fortet and Edith Mourier [18] had considered some infinite-dimensional central limit theorems. Let (S, d) be a complete separable metric space. For the Lipschitz seminorm $\|f\|_L := \sup_{x \neq y} |f(x) - f(y)|/d(x, y)$ and the bounded Lipschitz norm $\|f\|_{BL} := \|f\|_L + \sup_x |f(x)|$ on functions for which it is finite, a metric β between probability measures P, Q is defined by
$$\beta(P, Q) := \sup\left\{ \left| \int f \, d(P - Q) \right| : \|f\|_{BL} \le 1 \right\}.$$
Fortet and Mourier [17] first defined this metric (so I did not!) and showed that for any Borel probability measure P and its empirical measures P_n, β(P_n, P) → 0 almost surely (even though the norm ∥·∥_BL is not separable). They did not show that β metrizes weak convergence of probability measures. (Prohorov in 1956 first proved metrizability, by a different metric.) R. Ranga Rao [29] proved that weak convergence of probability measures on S implies β convergence. (I had the benefit, while in graduate school, of having been given Ranga Rao's paper to referee.) It seems I did first prove that bounded Lipschitz (β) convergence implies weak convergence in any separable metric space, which moreover need not be complete [11].

Fortet and Mourier [17] proved a CLT (central limit theorem), in a class of separable Banach spaces including L^p spaces for 2 ≤ p < ∞, for i.i.d. random variables X_j with E X_1 = 0 and E∥X_1∥^2 < ∞. Strassen and Dudley [33] proved that in C[0, 1] the CLT could fail even for bounded random variables but held under a metric entropy condition. In 1974, as far as I knew, in L^p for 1 < p < 2 the question was still open, but I thought it was due to be solved and said we should solve it there in Durham. At the end of my talk Gilles Pisier came up to me and told me it had been solved in the negative. Banach spaces in which the CLT held under the classical conditions were characterized as those of type 2 [25], which L^p spaces for 1 ≤ p < 2 are not.

In 1976, Evarist began to publish on central limit theorems in Banach spaces. I think that since then, he (and others) have known more than I do about limit theorems in separable Banach spaces. Araujo and Giné [3] was a book on such central limit theorems. The book is by far Evarist's most cited work (Google Scholar). I was lucky that my work on empirical processes was synergistic with that of people working at first in separable Banach spaces. In some fields of statistics I entered later I seemed to be more of a contrarian.

Related to my other topic are Evarist's second and third most cited works, both with Joel Zinn [21,22], the latter on bootstrapping empirical measures. When first reading the 1984 paper, I noticed a lot of technical facts on symmetrization, desymmetrization, and randomization such as Poissonization. In the bootstrap paper, I saw how the technical facts are useful. In January of some year, about 1989 or 1990, I visited the statistics departments in Seattle and Berkeley, lecturing on Evarist and Joel's work on the bootstrap, as I thought that was more interesting than any research I was doing around that time.


My handout for those lectures was a mixture of pages from a preprint of [22] and copies of proofs by others it relied on, aiming to give self-contained proofs of Evarist and Joel's theorems. In my book (second ed., 2014) I prove only their "in probability" bootstrap central limit theorem. The "almost sure" theorem would apply if, for an increasing sequence n_j of values of n, and a fixed sequence X_1, X_2, ..., one were applying the bootstrap to each sample (X_1, ..., X_{n_j}), but it seems to me that statisticians generally apply the bootstrap for one value of n at a time and not in a sequential way. There is what is called a "sequential bootstrap" [30], in which for a given n, sampling with replacement is done just until there are at least cn distinct samples, specifically for c = 0.632. Here still one apparently treats one n and sample of size n at a time, although the size of the bootstrap sample is a random variable. (I found the Rao et al. paper by Google search for 'sequential bootstrap'.) I have taught about the bootstrap from a more applied point of view, a week or two each time, in graduate nonparametric statistics courses since 2007, most recently in the spring of 2015, leaving me hardly an expert.

Let S be a sample space. Let P be a probability measure on S and suppose we have X_1, X_2, ..., S-valued, i.i.d. (P). Let $P_n := n^{-1} \sum_{j=1}^n \delta_{X_j}$. Suppose we have a functional T(·) defined on probability measures on S. A point estimate of T(P) is T(P_n). To estimate the amount of variability in T that could occur with different samples, there is the bootstrap, invented by B. Efron [14]. Let X^B_{ni}, i = 1, ..., m, be i.i.d. (P_n). The case I am familiar with is to take m = n, as I think statisticians generally do, although Evarist and others have treated m = o(n) and/or min(m, n) → +∞. Then let
$$P_n^B := \frac{1}{n} \sum_{i=1}^n \delta_{X^B_{ni}}$$
be the bootstrap empirical measure.

Under some conditions, the conditional distribution of P_n^B given P_n may be such that for some functional(s) T, T(P_n^B) − T(P_n) given P_n may behave as T(P_n') − T(P), where P_n' is a random empirical measure for P as contrasted with the fixed given P_n. More precisely, for some sequence of constants a_n such as √n,
(a_n(T(P_n^B) − T(P_n)) | P_n) ∼ a_n(T(P_n') − T(P)),
with ∼ indicating closeness in distribution, where a_n(T(P_n') − T(P)) is bounded away from 0 in probability.

Let F be a class of real-valued measurable functions on S. It is called pregaussian for P if F ⊂ L^2(P) and the Gaussian process G_P indexed by F, having the same covariance as P_1 or P_1 − P, can be taken to have sample functions bounded and uniformly continuous with respect to ρ_P, where ρ_P(f, g) = cov_P(f − g, f − g)^{1/2}. F is called a Donsker class for P if in addition the empirical process √n(P_n − P) converges in distribution to G_P on F with respect to uniform convergence over F. In 1978 I gave a definition of Donsker class (of sets), then later twice changed the definition, extended to classes of functions. The final definition, by Hoffmann-Jørgensen, defines convergence in distribution of a sequence of possibly nonmeasurable random elements to a measurable random variable limit such as G_P. Details are in [12] or in [34].

F is called a bootstrap P-Donsker class in probability if the bounded Lipschitz distance between the distribution of √n(P_n^B − P_n) given P_n and that of G_P converges to 0 in outer probability as n → ∞. Giné and Zinn [22] proved that P-Donsker implies bootstrap P-Donsker in probability. I include a proof of that in [12]. F is called a uniform Donsker class if the convergence is also uniform in all P on the same σ-algebra in S.


Giné and Zinn [23] characterized uniform Donsker classes as those which are uniformly pregaussian, as defined e.g. in [12, (10.31), (10.32)], or finitely uniformly pregaussian (ibid., (10.33), (10.34)), under a measurability condition. I put a proof of that into the second edition (2014, Theorem 10.22, with the image admissible Suslin condition) of my book. I also put in (as Theorems 10.27 and 10.28) a proof of the theorem of [7] that the convex hull of a uniformly bounded, image admissible Suslin uniform Donsker class is still uniform Donsker. It seems to me that in bootstrapping, little may be known about P except for the observed P_n, so the uniform Donsker property of F may be desirable, although it is highly restrictive, requiring F to be uniformly bounded up to additive constants. Pollard's entropy condition provides a useful sufficient condition for uniform Donsker.

3. Slow convergence in the central limit theorem for empirical processes

Beck [4] proved that for the uniform distribution P on the unit cube in R^d, for d ≥ 2, for the class B(d) of all balls, which as a nicely measurable VC class of sets is a uniform Donsker class of functions, we have for the limiting Gaussian process G_P that sup_{B ∈ B(d)} |[√n(P_n − P) − G_P](B)| is no smaller than of order n^{−1/(2d)}, which seems disappointingly slow. In a seminar in the Fall of 2006, which alternated between MIT and the University of Connecticut, Storrs, we verified that Beck's proof was essentially correct although some details needed fixing. I and Richard Nickl, then a postdoc with Evarist, gave most or perhaps all of the talks, but Dmitry Panchenko and a then graduate student, Wen Dong, contributed useful ideas. Richard was familiar with the very relevant series of papers by different authors called "Irregularities of distribution." There is a book with that title by Beck and Chen [5].

4. "Fast" convergence: O((log n)/√n)

For the 1-dimensional empirical process √n(F_n − F)(x), the rate of convergence in sup norm to b_{F(x)} for a Brownian bridge process b_t is O((log n)/√n), as stated by Komlós–Major–Tusnády and proved with not too large constants by Bretagnolle and Massart [8]. Just after the first edition of my book UCLT appeared (1999) I gave an exposition of the Bretagnolle–Massart theorem and proof with more details, which is now in the second edition. I also put in a proof of Massart's form of the Dvoretzky–Kiefer–Wolfowitz inequality with sharp constant, detailed except that in one step, a proof by calculus is replaced by numerical verification on a grid.

5. Quantiles and sample quantiles

If F is a continuous distribution function, strictly increasing on F^{−1}((0, 1)), and 0 < q < 1, the qth quantile of F is F^{−1}(q). If one observes X_1, ..., X_n i.i.d. (F), one can give precise confidence intervals for F^{−1}(q) with endpoints order statistics of the X_j, without bootstrapping, for example [26, p. 498], [28, p. 239]. The bootstrap gives a relatively easy, off-the-shelf method. For an empirical distribution function F_n, eight books, two on the bootstrap and six beginning statistics textbooks, give 9 different definitions of sample qth quantile for 0 < q < 1, as order statistics, e.g. X_{(⌈nq⌉)}, or as convex combinations of adjoining ones. The difference in definitions is not crucial here.
If instead of n we have R, the number of bootstrap replications, sometimes called B, two books on the bootstrap give definitions whereby the sample quantiles used are order statistics. In [15], for q ≤ 1/2, one takes X_{(k)}, k = Bq if that is an integer, which can be arranged for the usual q = α's and B a round number, and otherwise k = ⌊(B + 1)q⌋. Davison and Hinkley [9] give X_{(k)} for k = (R + 1)q if that is an integer, as they arrange for the usual q's, for example by setting R = 999 so that R + 1 is a round number. Also, having R odd gives a nice sample median (q = 1/2).
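For illustration, here is a minimal sketch in R (my own, not taken from the sources just cited) of these order-statistic quantile rules, applied both to the data and to bootstrap replicates, with R = 999 so that k = (R + 1)q is an integer for the usual q:

```r
set.seed(2)
x <- rnorm(50)                          # observed sample, n = 50
q <- c(0.05, 0.5, 0.95)
sort(x)[ceiling(length(x) * q)]         # the X_(ceil(nq)) definition applied to the data

R <- 999                                # number of bootstrap replications
Ti <- replicate(R, mean(sample(x, replace = TRUE)))   # bootstrap replicates T_i
Ts <- sort(Ti)                          # order statistics T_(1) <= ... <= T_(R)
k <- (R + 1) * q                        # Davison-Hinkley rule: k = (R + 1)q, here 50, 500, 950
Ts[k]                                   # sample quantiles of the replicates as order statistics
```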


6. Bootstrap confidence intervals

6.1. The percentile interval

Let T be a real-valued functional, defined on some set P of probability measures P on a set S, containing all possible empirical measures P_n. Let X_1, ..., X_n be i.i.d. P for some P ∈ P, and take the empirical measure P_n. To get an approximate confidence interval for T(P), form i.i.d. bootstrap empirical measures P_{ni}^B for i = 1, ..., R and let T_i := T(P_{ni}^B), i = 1, ..., R. Form the order statistics T_{(1)} ≤ T_{(2)} ≤ ··· ≤ T_{(R)}. Given 0 < α < 1/2, let T_L and T_U be the α/2 and 1 − (α/2) sample quantiles of the T_i. Then Efron had defined [T_L, T_U] as the 1 − α two-sided percentile confidence interval for T(P), as also in [15]. I have not seen any precise, general statements as to under what conditions and with what accuracy the percentile interval, or for that matter any bootstrap-based interval, works.

6.2. The BC_a interval

Efron and Tibshirani [15,16] defined, along with the percentile interval, the "bias-corrected, accelerated" BC_a intervals. DiCiccio and Efron [10, (2.3)] give a formula for the BC_a interval endpoints,
$$\hat\theta_{BC_a}[\alpha] = \hat G^{-1}\!\left( \Phi\!\left( z_0 + \frac{z_0 + z^{(\alpha)}}{1 - a\,(z_0 + z^{(\alpha)})} \right) \right),$$
where Φ is the standard normal distribution function, z^{(α)} its α quantile, Ĝ is the bootstrap empirical distribution function, α = 0.05 and 0.95 respectively for the left and right endpoints of the BC_a 90% confidence interval, z_0 is a bias-correction and a an acceleration, which themselves must be estimated. One estimate ẑ_0 is Φ^{−1}(q_n') where q_n' is the fraction of the T_i that are ≤ T(P_n) ([10, (2.8)]; they give another estimate in their Section 4). An estimate â of a is the sample skewness of the bootstrap sample [15, (14.15), p. 186].

6.3. The "basic" interval

If the bootstrap works, i.e. a_n(T(P_n^B) − T(P_n)) given P_n is close to a_n(T(P_n') − T(P)) in distribution, then since Pr(T_L < T(P_n^B) < T_U | P_n) has been estimated as 1 − α, we would also approximate
Pr(T_L − T(P_n) < T(P_n') − T(P) < T_U − T(P_n)) ∼ 1 − α.
Davison and Hinkley [9] proposed, it seems to me, to plug P_n in place of P_n' in the above approximation. That may be plausible, as in some ways we are treating P_n as a typical value of P_n', but it seems to me it is not clearly justified. One might say that P_n is fixed and P_n' is independent of it? The plug-in yields
Pr(2T(P_n) − T_U < T(P) < 2T(P_n) − T_L) ∼ 1 − α.
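The following is a by-hand sketch in R (my own illustration) of the percentile endpoints of Section 6.1, the BC_a endpoints from the formula above, and the reflected endpoints just derived (named the "basic" interval below). The jackknife formula used here for the acceleration â is one commonly used estimate and is an assumption of this sketch, not something specified above.

```r
set.seed(3)
x <- rnorm(100)                       # observed sample X_1, ..., X_n
n <- length(x)
Tfun <- function(z) mean(z)           # the functional T, here the mean
T_hat <- Tfun(x)                      # T(P_n)

R <- 1999
Ti <- replicate(R, Tfun(sample(x, n, replace = TRUE)))   # T_i = T(P^B_{ni})
alpha <- 0.10

# 6.1 Percentile interval: alpha/2 and 1 - alpha/2 sample quantiles of the T_i
perc <- quantile(Ti, c(alpha / 2, 1 - alpha / 2))

# 6.2 BC_a endpoints: z0_hat from the fraction of T_i <= T(P_n);
# acceleration a_hat from the jackknife (an assumed, commonly used estimate)
z0_hat <- qnorm(mean(Ti <= T_hat))
theta_jack <- sapply(seq_len(n), function(i) Tfun(x[-i]))
d <- mean(theta_jack) - theta_jack
a_hat <- sum(d^3) / (6 * sum(d^2)^(3/2))
bca_alpha <- function(alph) {
  z <- qnorm(alph)
  pnorm(z0_hat + (z0_hat + z) / (1 - a_hat * (z0_hat + z)))
}
bca <- quantile(Ti, c(bca_alpha(alpha / 2), bca_alpha(1 - alpha / 2)))

# 6.3 Reflection of the percentile endpoints about T(P_n) (the "basic" interval)
basic <- unname(c(2 * T_hat - perc[2], 2 * T_hat - perc[1]))

rbind(percentile = perc, BCa = bca, basic = basic)
```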


Davison and Hinkley call the interval [2T(P_n) − T_U, 2T(P_n) − T_L] the basic bootstrap confidence interval for T(P). It provides an interesting alternative to the percentile and other intervals, to be tried out to see how the intervals compare. The basic and percentile intervals are reflections of one another in T(P_n). If T(P_n) is near the midpoint of the percentile interval, then the basic and percentile intervals will be about the same. But if T(P_n) − T_L and T_U − T(P_n) are quite different, the basic and percentile intervals will be also, and it has been said that one is "backwards" with respect to the other. In Efron and Tibshirani's Section 13.4, titled "Is the percentile interval backwards?", as one book called it, they say that sometimes it is rather the basic interval that is backwards (skewed in the wrong direction).

The percentile, basic, BC_a, and a normal interval are all implemented in the R system library "boot" and treated by Venables and Ripley [35]. Venables and Ripley, and others, prefer the basic interval. But Davison and Hinkley [9] (its inventors?) do not find it best at all. They say "The studentized bootstrap and adjusted percentile [i.e. BC_a] methods for calculating confidence limits are inherently more accurate than the basic bootstrap and percentile methods". They empirically compare 8 different bootstrap-based confidence intervals for the functional
$$T(P, Q) = \int x \, dP(x) \Big/ \int y \, dQ(y)$$
for P and Q specified gamma distributions. In this case both the lower and upper endpoints of the basic interval tend to be substantially too small. The errors for the basic interval are the worst of those shown. "Studentized, log scale" intervals perform well. The referee points out that Shao and Tu [31, pp. 108–110] also consider the bootstrap for estimating the ratio of two means by the ratio of sample means.

The basic interval can give strange endpoints. Suppose for example that T takes values of any positive number, 0 < T < +∞. Then 0 < T_L < T_U. But 2T(P_n) − T_U can be negative. Also, if T(P_n) < T(P)/2, then T(P) will never be in the basic interval, for arbitrarily small α. But it often happens that Pr(T(P_n) < T(P)/2) > 0, showing a fault in the basic interval. Such things cannot occur for the percentile nor BC_a intervals, with endpoints order statistics of the bootstrap sample, unless T(P) < T_{(1)} or T_{(R)} < T(P), which would rarely occur for the R ≥ 1000 generally used. In teaching about the bootstrap since 2007 I had recommended the basic interval, following Venables and Ripley. Now, I think it may not even be the second-best of the intervals mentioned.

For p > 1 the Pareto(p) distribution has density (p − 1)/x^p for x ≥ 1 and 0 for x < 1. It has a finite mean if and only if p > 2 and finite variance if and only if p > 3. For sample means of 100 i.i.d. Pareto(7/2) variables I found by simulation that the basic interval has significantly worse coverage properties than the percentile interval. One might say that the basic interval is "backwards" in this case.

Efron and Tibshirani [15, Section 13.4] say neither the percentile nor the basic interval works well in all cases. In Section 14.3 they say that the BC_a method has "two significant theoretical advantages". For one, it is transformation respecting, meaning preserved by (monotone) transformations of T [such as taking log(T) if T > 0], as the percentile intervals also are.
The other stated advantage is second-order accuracy, meaning that coverage probabilities are approximated within O(1/n) as opposed to O(1/√n) for first-order accuracy with the other methods.


I do not know hypotheses under which the approximation is within C/n for a given finite C.

There are bootstrap confidence intervals assuming approximate normal distribution of the T_i, using normal or t distributions, but for lack of time I will not say much more about these. Venables and Ripley, p. 136, say that if the distribution of the T_i is asymmetric, "intervals based on normality are not adequate".

7. Bootstrapping of the mean

Here one considers means T(P) = ∫ f dP for a fixed function f, so that the T(P_n) are sample means. The bootstrap works in this case by the Giné–Zinn theorem if f ∈ L^2(P). Giné and Zinn [24] showed that the weaker condition that f be in the domain of attraction of a normal law suffices. Athreya [2] had shown that if the distribution of f for P is in the domain of attraction of a stable law of index α with 1 < α < 2, then the bootstrap fails, as the sample means can be centered and normalized to have a limiting α-stable law, but corresponding bootstrapped sample means have a random limit law depending on the sequence X_1, X_2, .... Using Athreya's result, Giné and Zinn [24] proved that for the bootstrap to work, being in the domain of attraction of a normal law is necessary. For statistics, it seems that the main interest will be in f ∈ L^2, or better.

In 200 experiments on 90% confidence intervals for the mean for Pareto(9/2), again with n = 100, the numbers of cases of non-coverage by the normal, basic, percentile, and BC_a intervals were 33, 37, 33, and 31 respectively. None of these is consistent with a true coverage probability of 0.9. Conditioning on disagreement in coverage between the kinds of intervals, which occurred in just 16 of the 200 experiments, the apparent superiority of the BC_a (relatively best, 11 coverages) over the basic (relatively worst, 5 coverages) interval was not significant in itself, even without a correction for multiple testing.

8. Bootstrap of the variance

Googling "Bootstrap of the mean" (with the quotes) gave me 47,300 hits (although beyond some point, Google may begin to disrespect the literal quotes), but "bootstrap of the variance" gave just two papers in June 2014, one of which was about time series. Another paper about i.i.d. data, [1], was concerned with estimates and not, as far as I saw, with confidence intervals. Repeating the search in June 2015 gave a larger, perhaps not well defined number. The topic is mentioned in some books, by Davison and Hinkley, p. 104, and by Shao and Tu [31, pp. 110–112], who consider the distribution of the ratio of the sample variance to the true variance, based on a paper by Srivastava and Chan [32]. If T is the variance functional T(P) = Var_P(X) for distributions P on the line,
$$T(P_n) = s_X'^2 := \frac{1}{n} \sum_{j=1}^n (X_j - \bar X)^2.$$
It is well known that E T(P_n) = ((n − 1)/n) T(P) for all P with finite variance. It follows, for P_n in place of P, that
$$E(T(P_n^B) \mid P_n) = \frac{n-1}{n}\, T(P_n).$$
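A quick numerical check of this bias relation, as a minimal sketch in R (my own illustration, not from the paper), averages T(P_n^B) over many bootstrap resamples of one fixed sample:

```r
set.seed(4)
n <- 20
x <- rnorm(n)                               # a fixed sample, so P_n is fixed
Tvar <- function(z) mean((z - mean(z))^2)   # plug-in variance T(P_n)

# Averaging T(P_n^B) over many resamples approximates E(T(P_n^B) | P_n)
TB <- replicate(20000, Tvar(sample(x, n, replace = TRUE)))
mean(TB)                                    # should be close to ((n - 1)/n) * T(P_n)
(n - 1) / n * Tvar(x)
```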


The unbiased sample variance is a U-statistic,
$$\frac{1}{n-1} \sum_{j=1}^n (X_j - \bar X)^2 = \binom{n}{2}^{-1} \sum_{1 \le i < k \le n} \frac{1}{2}(X_i - X_k)^2.$$
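A short numerical check of this identity (a sketch in R, my own) compares var(), which uses the divisor n − 1, with the average of the kernel (X_i − X_k)^2/2 over pairs:

```r
set.seed(5)
x <- rnorm(10); n <- length(x)
var(x)                                        # (1/(n-1)) * sum((x - mean(x))^2)
# Average of (X_i - X_k)^2 / 2 over ordered pairs i != k (diagonal terms are zero)
sum(outer(x, x, function(a, b) (a - b)^2 / 2)) / (n * (n - 1))
```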
I have that in Section 11.9 on U-statistics of my book Real Analysis and Probability (1989; second ed., Cambridge University Press, 2002) [13]. I do not recall now where I learned it. My source on the history of the bootstrap of U-statistics was Evarist's St.-Flour lectures from 1996 [20]. The central limit theorem for nondegenerate U-statistics of order 2, as this one is, holds under a hypothesis which in this case amounts to E(X_1^4) < ∞. It follows from results of [36] on what are now called V-statistics. The corresponding bootstrap central limit theorem for order 2 also holds under the same hypothesis, as shown by Bickel and Freedman [6].

I did 100 experiments, in each of which I generated 20 i.i.d. N(0, 1) variables, and noted which of the normal, basic, percentile, and BC_a 90% intervals for the variance covered the target σ^2 = 1 (a sketch of such an experiment is given at the end of this section). There were respectively 19, 20, 19, and 11 cases of non-coverage. Seemingly because of the bias(es), the normal, basic, and percentile intervals for the variance are to the left of where they should be and have inferior coverage frequencies. The probability of 81 or fewer successes in 100 independent trials with probability 0.9 of success on each is 0.00458. Multiplying by 3 to correct for multiple tests (Bonferroni) gives 0.0137. So there is evidence that the normal, basic, and percentile intervals with target 90% confidence do not attain the target in this case. The BC_a interval does much better. As its name implies, it does bias correction. A confidence interval for its coverage probability is [0.828, 0.932], including 0.9. Davison and Hinkley give a table of coverage probabilities of "confidence intervals for normal variance... for 10 samples each of size two", in which the BC_a interval does very well, much better than the basic or percentile intervals.

One should consider non-normal distributions of the X_j. I experimented with i.i.d. Pareto(11/2) variables, which do have fourth moments, with E X_1^4 = 9. I took samples of n = 200 of them at a time, checking the coverage of the normal, basic, percentile and BC_a target 90% confidence intervals for the variance. In 100 experiments, non-coverage of σ^2 occurred respectively 29, 33, 29, and 22 times. There is a strong tendency for intervals to be to the left of where they should be. Of the total 113 non-coverages, in 111 the right endpoint of the interval was less than σ^2. The other two cases were both for the BC_a interval. The target fraction of non-coverages on each side is 0.05, so one could say it is a merit of the BC_a interval that on the right it had two and the others none. The BC_a interval, intended to compensate for bias and skewness, has only partial success in this case; it is only relatively better than the other intervals. In 5 of the 100 experiments, the sample variance T(P_n) was less than half the true variance T(P), meaning that the basic 1 − α interval could not have covered T(P) no matter how small α > 0.

Shao and Tu [31, p. 168] write that "the nonparametric bootstrap BC_a does not do well in many problems", partly because the acceleration constant a cannot be precisely evaluated. They give some examples and refer to Loh and Wu [27] for others. They write "the bootstrap BC_a is only a slight improvement over the bootstrap percentile or BC" (the bias-corrected interval without acceleration), "and all of them perform poorly when the population is asymmetric", as I found for Pareto distributions.
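The following is a minimal sketch, in R with the "boot" package mentioned in Section 6.3, of a coverage experiment of the kind described above for 90% variance intervals with n = 20 i.i.d. N(0, 1) observations. It is my own reconstruction under those assumptions, not the author's code, so counts will differ from the figures reported in the text.

```r
library(boot)
set.seed(6)

Tvar <- function(d, idx) mean((d[idx] - mean(d[idx]))^2)  # plug-in variance T(P_n^B)

n_exp <- 100; n <- 20; sigma2 <- 1
miss <- matrix(0, nrow = n_exp, ncol = 4,
               dimnames = list(NULL, c("normal", "basic", "percentile", "BCa")))

for (e in seq_len(n_exp)) {
  x <- rnorm(n)
  b <- boot(x, Tvar, R = 1999)
  ci <- boot.ci(b, conf = 0.90, type = c("norm", "basic", "perc", "bca"))
  ends <- rbind(ci$normal[2:3], ci$basic[4:5], ci$percent[4:5], ci$bca[4:5])
  miss[e, ] <- as.integer(sigma2 < ends[, 1] | sigma2 > ends[, 2])  # non-coverage indicator
}
colSums(miss)          # counts of non-coverage out of 100; the text reports 19, 20, 19, 11
pbinom(81, 100, 0.9)   # probability of 81 or fewer successes, reported in the text as 0.00458
```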


References

[1] S. Amiri, D. von Rosen, S. Zwanzig, On the comparison of parametric and nonparametric bootstrap, Report, 2008.
[2] K.B. Athreya, Bootstrap of the mean in the infinite variance case, Ann. Statist. 15 (1987) 724–731.
[3] Aloisio Araujo, Evarist Giné, The Central Limit Theorem for Real and Banach Space Valued Random Variables, Wiley, New York, 1980.
[4] József Beck, Lower bounds on the approximation of the multivariate empirical process, Z. Wahrscheinlichkeitsth. verw. Gebiete 70 (1985) 289–306.
[5] József Beck, William W.L. Chen, Irregularities of Distribution, Cambridge University Press, 1987.
[6] P.J. Bickel, D.A. Freedman, Some asymptotic theory for the bootstrap, Ann. Statist. 9 (1981) 1196–1217.
[7] O. Bousquet, V. Koltchinskii, D. Panchenko, Some local measures of complexity of convex hulls and generalization bounds, in: COLT [Conference on Learning Theory], in: Lecture Notes in Artificial Intelligence, vol. 2375, 2002, pp. 239–256.
[8] Jean Bretagnolle, Pascal Massart, Hungarian constructions from the nonasymptotic viewpoint, Ann. Probab. 17 (1989) 239–256.
[9] A.C. Davison, D.V. Hinkley, Bootstrap Methods and their Application, Cambridge University Press, 1997.
[10] T.J. DiCiccio, B. Efron, Bootstrap confidence intervals, Statist. Sci. 11 (1996) 189–228.
[11] R.M. Dudley, Distances of probability measures and random variables, Ann. Math. Statist. 39 (1968) 1563–1572.
[12] R.M. Dudley, Uniform Central Limit Theorems, Cambridge University Press, first ed., 1999; second ed., 2014.
[13] R.M. Dudley, Real Analysis and Probability, first ed., Wadsworth and Brooks/Cole, 1989; second ed., Cambridge University Press, 2002.
[14] B. Efron, Bootstrap methods: another look at the jackknife, Ann. Statist. 7 (1979) 1–26.
[15] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York, 1993 (repr. 1994).
[16] B. Efron, R. Tibshirani, Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy, Statist. Sci. 1 (1986) 54–75.
[17] R. Fortet, E. Mourier, Convergence de la répartition empirique vers la répartition théorique, Ann. Sci. École Norm. Sup. 70 (1953) 266–285.
[18] R. Fortet, E. Mourier, Les fonctions aléatoires comme éléments aléatoires dans les espaces de Banach, Studia Math. 15 (1955) 62–79.
[19] E. Giné, Invariant tests for uniformity on compact Riemannian manifolds based on Sobolev norms, Ann. Statist. 3 (1975) 1243–1266.
[20] E. Giné, Lectures on some aspects of the bootstrap, in: P. Bernard (Ed.), Lectures on Probability Theory and Statistics, in: Lecture Notes in Math., vol. 1665, Springer, 1997, pp. 37–151. École d'été de probabilités de St.-Flour (1996).
[21] E. Giné, J. Zinn, Some limit theorems for empirical processes, Ann. Probab. 12 (1984) 929–981.
[22] E. Giné, J. Zinn, Bootstrapping general empirical measures, Ann. Probab. 18 (1990) 851–869.
[23] E. Giné, J. Zinn, Gaussian characterization of uniform Donsker classes of functions, Ann. Probab. 19 (1991) 758–782.
[24] E. Giné, J. Zinn, Necessary conditions for the bootstrap of the mean, Ann. Statist. 17 (1989) 684–691.
[25] J. Hoffmann-Jørgensen, G. Pisier, The law of large numbers and the central limit theorem in Banach spaces, Ann. Probab. 4 (1976) 587–594.
[26] R.V. Hogg, A.T. Craig, Introduction to Mathematical Statistics, fifth ed., Prentice Hall, 1995.
[27] W.Y. Loh, C.F.J. Wu, Discussion of "Better bootstrap confidence intervals" by B. Efron, J. Amer. Statist. Assoc. 82 (1987) 188–190.
[28] F. Mosteller, R.E.K. Rourke, Sturdy Statistics, Addison-Wesley, 1973.
[29] R. Ranga Rao, Relations between weak and uniform convergence of measures with applications, Ann. Math. Statist. 33 (1962) 659–680.
[30] C.R. Rao, P.K. Pathak, V.I. Koltchinskii, Bootstrap by sequential resampling, J. Statist. Plann. Inference 64 (1997) 257–281.
[31] Jun Shao, Dongsheng Tu, The Jackknife and Bootstrap, Springer, 1995.
[32] M.S. Srivastava, Y.M. Chan, A comparison of bootstrap method and Edgeworth expansion in approximating the distribution of sample variance — one sample and two sample cases, Comm. Statist. Simulation Comput. 18 (1989) 339–361.
[33] V. Strassen, R.M. Dudley, The central limit theorem and ε-entropy, in: Probability and Information Theory, in: Lecture Notes in Math., vol. 89, Springer, 1969, pp. 224–231.
[34] A. van der Vaart, J. Wellner, Weak Convergence and Empirical Processes, Springer, New York, 1996.
[35] W.N. Venables, B.D. Ripley, Modern Applied Statistics with S, fourth ed., Springer, New York, 2002.
[36] R. von Mises, On the asymptotic behavior of differentiable statistical functionals, Ann. Math. Statist. 18 (1947) 309–348.