Large-sample Statistical Methods
J M Singer, Universidade de São Paulo, São Paulo, Brazil
D F Andrade, Universidade Federal de Santa Catarina, Florianópolis, Brazil
© 2010 Elsevier Ltd. All rights reserved.
Introduction

In a broad sense, the objective of statistical inference is to draw conclusions about some characteristics of a (possibly conceptual) population of interest based on the information obtained from a sample conveniently selected therefrom. Most commonly, these characteristics correspond to parameters, such as the mean or the variance, but they may also involve more general features of the population, such as the associated (frequency) distribution itself. In general, the strategy for inference involves the selection of an appropriate family of stochastic models, an evaluation of their compatibility with the available data (goodness of fit), and the subsequent estimation of, or conduction of tests of hypotheses about, some components of the chosen family of models. Such models may have different levels of complexity and depend on assumptions with different degrees of restrictiveness, as illustrated by the following examples.

Example 1. Suppose that the proficiency of students in some subject (e.g., mathematics) is measured by the outcome in a given test and that we are interested in comparing the performance of males and females based on data obtained from independent random samples. We may assume that the outcome corresponds to a score $Y$ for which the underlying distribution function $F$ is normal with common variance $\sigma^2$ and mean $\mu_1$ for males or $\mu_2$ for females; the data correspond to the scores $Y_{i1}, \ldots, Y_{in_i}$, $i = 1, 2$, obtained from the $n_1$ male and $n_2$ female students in the sample. Alternatively, we may consider the outcome to be the classification of the test result for each student as failed, acceptable, good, or excellent, so that the data correspond to the frequencies $n_{ik}$, $i = 1, 2$, $k = 1, 2, 3, 4$, of male and female students classified in each category. Here, a product multinomial model with parameters $\pi_{ik}$, denoting the probabilities of classification in the $k$th response category for students in the $i$th group, is appropriate.

Example 2. In cases where the test is composed of $L$ dichotomous or polytomous questions or items sampled from a large item population, 'item response theory' models are often appropriate to measure the proficiency of students (Hambleton et al., 1991). In this context, the probability that a student answers an item correctly is modeled as a function of his/her proficiency (a latent trait) and of some characteristics of the item. For example, the one-parameter logistic model is
$$p_{lj} = p_{lj}(\theta_j, b_l) = P(U_{lj} = 1 \mid \theta_j, b_l) = \frac{1}{1 + e^{-(\theta_j - b_l)}}$$
where $p_{lj}$ is the conditional probability of a correct answer to item $l$ given by student $j$, $U_{lj}$ is a Bernoulli random variable with probability of success $p_{lj}$, $\theta_j$ is the proficiency of student $j$, and $b_l$ is the location (or difficulty) parameter for item $l$. The data consist of the values of $U_{lj}$, $j = 1, \ldots, n$, $l = 1, \ldots, L$, corresponding to the responses of all sampled students to all sampled items.

In Example 1, it is well known that the Student t-statistic
$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{S\sqrt{1/n_1 + 1/n_2}}$$
where $\bar{Y}_i = n_i^{-1}\sum_{j=1}^{n_i} Y_{ij}$, $i = 1, 2$, respectively denote the sample mean scores for males and females and $S^2 = (n_1 + n_2 - 2)^{-1}\sum_{i=1}^{2}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2$ represents the pooled sample variance, follows an exact t-distribution with $n_1 + n_2 - 2$ degrees of freedom under the hypothesis that $\mu_1 = \mu_2$ and, thus, may be employed to construct an exact test of such a hypothesis. In the second setup, to test for the equality of the response distributions of males and females, we usually employ Pearson's χ²-statistic $Q = \sum_{i=1}^{2}\sum_{k=1}^{4}(n_{ik} - e_{ik})^2/e_{ik}$, where $e_{ik} = n_{i\cdot}n_{\cdot k}/n$, $n_{i\cdot} = \sum_{k=1}^{4} n_{ik}$, and $n_{\cdot k} = \sum_{i=1}^{2} n_{ik}$. Unfortunately, the exact distribution of Q is difficult to specify, especially when the sample sizes are large, and therefore we must rely on an approximate χ² distribution to construct the test.
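As a concrete illustration, the following Python sketch computes both statistics; the scores and the 2 × 4 table of counts are invented for illustration and are not part of the original example, and we assume numpy and scipy are available.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for the two groups (invented for illustration)
males = np.array([62.0, 71.5, 55.0, 80.0, 66.5, 73.0])
females = np.array([68.0, 75.5, 61.0, 59.5, 82.0, 70.0, 64.5])

# Two-sample Student t-test with pooled variance, as in the normal
# model of Example 1; the reference t-distribution is exact here
t_stat, t_pval = stats.ttest_ind(males, females, equal_var=True)
print(f"t = {t_stat:.3f}, p = {t_pval:.3f}")

# Hypothetical 2 x 4 table: rows = males/females, columns = failed,
# acceptable, good, excellent
counts = np.array([[15, 30, 40, 15],
                   [10, 25, 45, 20]])

# Pearson chi-squared statistic Q; its chi-squared reference
# distribution is only a large-sample approximation
q, q_pval, dof, expected = stats.chi2_contingency(counts, correction=False)
print(f"Q = {q:.3f}, df = {dof}, p = {q_pval:.3f}")
```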
In Example 2, assuming that the answers of student $j$ to the $L$ items, conditionally on his/her proficiency, are independent, $\theta_j$ and $b_l$ can be estimated by maximum likelihood methods, that is, by maximizing the log-likelihood function
$$l(\theta, b) = \sum_{j=1}^{n}\sum_{l=1}^{L}\left\{u_{lj}\log[p_{lj}(\theta_j, b_l)] + (1 - u_{lj})\log[1 - p_{lj}(\theta_j, b_l)]\right\}$$
where $\theta$ and $b$ are vectors whose elements are the proficiency and item parameters, respectively, and $u_{lj}$ denotes the realization of $U_{lj}$. Here, the estimating equations
$$\sum_{j=1}^{n}[u_{lj} - p_{lj}(\hat{\theta}_j, \hat{b}_l)] = 0, \; l = 1, \ldots, L, \quad\text{and}\quad \sum_{l=1}^{L}[u_{lj} - p_{lj}(\hat{\theta}_j, \hat{b}_l)] = 0, \; j = 1, \ldots, n,$$
do not have a closed-form solution (Baker and Kim, 2004), and the maximum likelihood estimators $\hat{\theta}_j$ and $\hat{b}_l$ must be obtained by numerical algorithms.
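A minimal sketch of this numerical step is given below, using simulated 0/1 responses; the joint maximization with a general-purpose optimizer is our simplification of the specialized algorithms discussed by Baker and Kim (2004), and all sizes and settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, L = 200, 10                    # students and items (arbitrary sizes)
theta_true = rng.normal(0, 1, n)  # proficiencies
b_true = np.linspace(-2, 2, L)    # item difficulties

# Simulate 0/1 responses under the one-parameter logistic model
p = 1.0 / (1.0 + np.exp(-(theta_true[:, None] - b_true[None, :])))
u = rng.binomial(1, p)            # n x L response matrix

def negloglik(par):
    """Negative joint log-likelihood l(theta, b) of the 1PL model."""
    theta, b = par[:n], par[n:]
    eta = theta[:, None] - b[None, :]
    log_p = -np.logaddexp(0.0, -eta)   # log p_lj, computed stably
    log_q = -eta + log_p               # log(1 - p_lj)
    return -np.sum(u * log_p + (1 - u) * log_q)

# Joint maximization; note that (theta + c, b + c) leaves the model
# unchanged, so the solution is identified only up to a location shift
res = minimize(negloglik, np.zeros(n + L), method="L-BFGS-B")
theta_hat, b_hat = res.x[:n], res.x[n:]
print("corr(theta_hat, theta_true) =",
      round(np.corrcoef(theta_hat, theta_true)[0, 1], 2))
```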
Therefore, the quest for approximations to the distributions of such estimators is even more pressing.

With the exception of some simple cases, the exact statistical properties of many useful estimators and test statistics are not obtainable for models of practical interest. Therefore, methods for approximating the probability distributions (or some appropriate summary measures) of such statistics are of special concern. The main tools for such approximations are based on statistical properties derived by letting the sample size n grow indefinitely large and are investigated under the denomination of asymptotic statistical theory. Since, in practice, the asymptotic results generate methods that must be applied to samples of finite (although large) sizes, they are also known as large-sample methods and, therefore, must be considered as approximations. Nevertheless, they are associated with the following advantages:

• They allow for some flexibility in the selection of models, since large-sample properties of estimators and test statistics are usually less dependent on the particular functional form of the underlying distribution than their exact (small-sample) counterparts and, thus, are relatively robust against departures from some assumed model.
• They allow for some relaxation of the assumptions of independence and identical distribution of the sample elements, usually required by exact inferential procedures.
• They generally produce simple and well-studied limiting distributions, such as the normal, χ², Weibull, etc.

It is quite natural to expect that, as the sample size increases, an estimator should be closer to the parameter it is set to estimate; this property is known as consistency and is intimately related to the concept of stochastic convergence of random variables. Similarly, a test statistic should be able to detect false null hypotheses with increasing confidence when the sample size becomes large. A second natural requirement for large-sample procedures is that the corresponding exact sampling distribution can be adequately approximated by a simpler one, such as the normal or χ² distribution, for which tables or computational algorithms are available. This is known as convergence in distribution or weak convergence and constitutes the basis of some important results known as central limit theorems. Third, in a given setup, there are usually many competing statistics satisfying the requirements of consistency and convergence in distribution. In choosing an appropriate one within such a class, we seek optimality in some sense, such as minimum variance or minimum risk with respect to suitable loss functions in estimation problems, and this motivates the concept of asymptotic efficiency. Similarly, in testing problems, large-sample optimality properties are defined
in terms of alternative hypotheses that become closer to the null hypothesis as the sample size increases. In this context, we consider the two basic forms of stochastic approximation useful for large samples: the first concerns the approximation of a parameter or possibly a random variable by a sequence of statistics, and the second is related to the approximation of a sequence of distribution functions by another distribution function with known properties.
Stochastic Convergence

We say that a sequence $\{T_n\}$ of real numbers converges to a limit $T$ as $n$ becomes indefinitely large if, for every positive $\epsilon$, there exists a positive integer $n_0$, possibly depending on $\epsilon$, such that (Apostol, 1974)
$$|T_n - T| < \epsilon \quad \text{for } n \geq n_0 \qquad [1]$$
This may be equivalently expressed as
$$\sup_{N \geq n} |T_N - T| < \epsilon \quad \text{for } n \geq n_0 \qquad [2]$$
In dealing with a sequence $\{T_n\}$ of random elements (statistics), we note that, no matter how large an $n$ is chosen, the inequality $|T_n - T| < \epsilon$ may not always hold when $n$ is sufficiently large, but may do so with probability arbitrarily close to 1. Furthermore, in the stochastic context, [1] and [2] are not equivalent, and the difference in the probabilities of these two sets of events leads to the concepts of convergence in probability and almost sure convergence, which essentially address different degrees of stochastic approximation. The latter is stronger and, in general, more stringent conditions are required for its validity. In a similar spirit, we may consider the concepts of convergence in the rth mean and complete convergence (Sen et al., 2010). In practice, these concepts are usually employed to show that certain sequences of statistics, such as maximum likelihood or least-squares estimators, are consistent.

A second form of approximation refers to convergence in distribution or weak convergence. Here, we are not concerned with the convergence of the actual sequence of statistics $\{T_n\}$ to some constant or random variable $T$, but with the convergence of the corresponding distribution functions $\{G_n\}$ to some specific distribution function $F$. Thus, we say that the sequence of statistics $\{T_n\}$ converges in distribution (or weakly) to $T$ if, for every given $\epsilon > 0$, there exists an integer $n_0$, possibly depending on $\epsilon$, such that, at every point of continuity $x$ of $F$,
$$|G_n(x) - F(x)| < \epsilon \quad \text{for } n \geq n_0$$
Although this mode of convergence is the weakest among those mentioned so far, it is very important for statistical applications, since the related limiting distribution function F may generally be employed in the construction of
approximate confidence intervals for, and significance tests about, the parameters of interest.

Classical applications of convergence results are focused on statistics that may be expressed as sums of functions of independent random variables, such as the sample means $\bar{Y}_i$ in Example 1. The most important results on the convergence in probability or almost sure convergence of such statistics to the population parameters are known as laws of large numbers (LLNs). As an example, the Khintchine weak LLN states that the sample mean $\bar{Y}_i$ converges in probability to the (finite) population mean $\mu_i$, provided that the sample elements are independent and identically distributed. Essentially, this means that the probability that $\bar{Y}_i$ differs from $\mu_i$ by less than a specified (small) constant can be made arbitrarily close to 1 by increasing the sample size. The Khintchine strong LLN shows that statistics like $\bar{Y}_i$ converge almost surely to $\mu_i$ under the same conditions. Similar results are available even if the underlying random variables are not identically distributed; in general, the price to pay is to require the existence of moments of higher order. This is the case with the Markov weak LLN or with the Kolmogorov strong LLN. By choice of an appropriate norm, we may also extend the results to vector-valued random variables (Sen et al., 2010).

The proofs of such results generally rely on some probability inequalities, the simplest being the well-known Chebyshev inequality: given a sequence $\{T_n\}$ of random variables with mean $\mu$ and variance $\sigma_n^2$, it follows that, for all $n \geq 1$ and all real numbers $t > 0$,
$$P(|T_n - \mu| < t) \geq 1 - \sigma_n^2/t^2$$
Then, we conclude that, whenever $\sigma_n^2$ converges to 0, $T_n$ converges in probability to $\mu$. To obtain stronger results, we need sharper inequalities, such as the Bernstein or Hoeffding inequalities, for example. Although these inequalities were originally developed under the assumption that the underlying random variables are independent, they may be extended to more general cases, provided some martingale-type structure (Chow and Teicher, 1978) can be imposed on their dependence pattern. Likelihood ratio test statistics, U-statistics, and empirical distribution functions have such properties (Sen et al., 2010).
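The following simulation is a hedged sketch of the weak LLN and the Chebyshev bound for the sample mean of Exp(1) variables; the tolerance, sample sizes, and number of replicates are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2 = 1.0, 1.0   # mean and variance of the Exp(1) distribution
t = 0.1                 # tolerance in P(|Ybar_n - mu| < t)

for n in (100, 1000, 10000):
    ybar = rng.exponential(scale=1.0, size=(1000, n)).mean(axis=1)
    empirical = np.mean(np.abs(ybar - mu) < t)
    # Chebyshev: P(|Ybar_n - mu| < t) >= 1 - (sigma^2 / n) / t^2
    bound = max(0.0, 1.0 - (sigma2 / n) / t**2)
    print(f"n = {n:>5}: empirical {empirical:.3f}, Chebyshev bound {bound:.3f}")
```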
Asymptotic Distributions

The main result on this topic, known as the central limit theorem (CLT), essentially demonstrates that statistics expressed as sums of the underlying random variables, conveniently standardized, are asymptotically normally distributed, that is, converge weakly to the normal distribution. The CLT may be proved under different assumptions on the moments and on the dependence structure of the underlying random variables. The simplest CLT states that the (sampling) distribution of the sample mean of independent and identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$ may be approximated by a normal distribution with the same mean $\mu$ and variance $\sigma^2/n$. Although the limiting distribution is continuous, the underlying distribution may even be discrete. An interesting special case occurs when the underlying variable $Y$ has the Bernoulli distribution with probability of success $p$. Here, the expected value and the variance of $Y$ are $p$ and $p(1 - p)$, respectively, and the sample mean $\hat{p}$ is the proportion of sample elements for which $Y = 1$. It follows that the large-sample distribution of $\hat{p}$ may be approximated by a $N[p, p(1 - p)/n]$ distribution. This result is known as the De Moivre–Laplace CLT.

Weak convergence results are also available for independent, but not identically distributed (e.g., with different means and variances), underlying random variables, provided some (relatively mild) assumptions hold for their moments. The Liapounov CLT and the Lindeberg–Feller CLT are useful examples. Further extensions cover cases of dependent underlying random variables; in particular, the Hájek–Šidák CLT is extremely useful in regression analysis, where, as the sample size increases, the response variables form a triangular array in which, for each row (i.e., for given n), they are independent, but this is not true among rows (i.e., for different values of n). Extensions to cover cases where the underlying random variables have more sophisticated (e.g., martingale-type) dependence structures have been considered by Dvoretzky (1971), among others.

The Slutsky theorem is a handy tool to prove weak convergence of statistics that may be expressed as the sum, product, or ratio of two terms, the first known to converge weakly to some distribution and the second known to converge in probability to some constant. As an example, consider independent and identically distributed random variables $Y_1, \ldots, Y_n$ with mean $\mu$ and variance $\sigma^2$. Since the corresponding sample standard deviation $S$ converges in probability to $\sigma$ and the distribution of $\bar{Y}$ may be approximated by a $N(\mu, \sigma^2/n)$ distribution, we may apply Slutsky's theorem to show that the large-sample distribution of $\sqrt{n}(\bar{Y} - \mu)/S = [\sqrt{n}(\bar{Y} - \mu)/\sigma](\sigma/S)$ may be approximated by a $N(0, 1)$ distribution. This allows us to construct approximate confidence intervals for, and tests of hypotheses about, $\mu$ based on the standard normal distribution. A similar approach may be employed in the Bernoulli example by noting that $\hat{p}$ is a consistent estimator of $p$.
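A small simulation along these lines (a sketch; the skewed exponential parent and all settings are our choices) suggests that the studentized mean is approximately standard normal even for a markedly non-normal distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu = 500, 1.0                       # Exp(1) sample; its mean is 1
y = rng.exponential(size=(10000, n))   # 10000 replicated samples
ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)              # S converges in probability to sigma

# Slutsky: sqrt(n)(Ybar - mu)/S has the same N(0, 1) limit as
# sqrt(n)(Ybar - mu)/sigma
z = np.sqrt(n) * (ybar - mu) / s
for q in (0.05, 0.50, 0.95):
    print(f"q = {q}: simulated {np.quantile(z, q):+.3f}, "
          f"N(0,1) {stats.norm.ppf(q):+.3f}")
```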
An important application of Slutsky's theorem relates to statistics that can be decomposed as a sum of terms for which some CLT holds plus a term that converges in probability to 0. Assume, for example, that the variables $Y_i$ have a finite fourth central moment $\gamma$ and write the sample variance as
$$S^2 = \frac{n}{n-1}\Big\{n^{-1}\sum_{i=1}^{n}\big[(Y_i - \mu)^2 - \sigma^2\big] - \big[(\bar{Y} - \mu)^2 - \sigma^2/n\big]\Big\} + \sigma^2.$$
Since the first term within the curly brackets is known to converge weakly (after multiplication by $\sqrt{n}$) to a normal distribution by the CLT and the second term converges in probability to 0, we conclude that the distribution of $S^2$ may be approximated by a $N[\sigma^2, (\gamma - \sigma^4)/n]$ distribution. This is the basis of the projection results suggested by Hoeffding (1948) and extensively explored by Jurečková and Sen (1996) to obtain large-sample properties of U-statistics as well as of more general classes of estimators.

A convenient technique to obtain the asymptotic distributions of many (smooth) functions of asymptotically normal statistics is the Delta method: if $g$ is a differentiable function of a statistic $T_n$ whose distribution may be approximated (for large samples) by a $N(\mu, \tau^2)$ distribution, then the distribution of the statistic $g(T_n)$ may be approximated by a $N\{g(\mu), [g'(\mu)]^2\tau^2\}$ distribution, where $g'(\mu)$ denotes the first derivative of $g$ computed at $\mu$. In the context of Example 1, we may be interested in estimating the odds of a failed versus pass response, that is, $\pi_{i1}/(1 - \pi_{i1})$, for the students in the $i$th group. A straightforward application of the CLT may be used to show that the estimator of $\pi_{i1}$, namely $n_{i1}/n_i$, follows an approximate $N[\pi_{i1}, \pi_{i1}(1 - \pi_{i1})/n_i]$ distribution. Taking $g(x) = x/(1 - x)$, we may use the Delta method to show that the distribution of the sample odds $n_{i1}/(n_i - n_{i1})$ may be approximated by a $N\{\pi_{i1}/(1 - \pi_{i1}), \pi_{i1}/[n_i(1 - \pi_{i1})^3]\}$ distribution (a numerical check is sketched below). This type of result has further applications in variance-stabilizing transformations used in cases (as in the above example) where the variance of the original statistic depends on the parameter it is set to estimate.

These concepts and results may also be extended to the multivariate case; in particular, consideration must be given to the Cramér–Wold device, which is a useful process of reduction of multivariate problems to their univariate counterparts. In essence, it states that the asymptotic distribution of the former may be obtained by showing that the asymptotic distribution of every linear combination of the multivariate statistic under investigation may be approximated by a normal distribution. For some important cases, such as the Pearson χ²-statistic, or more general quadratic forms $Q = Q(\mu) = (Y - \mu)^t A (Y - \mu)$, where $Y$ is a p-dimensional random vector with mean vector $\mu$ and covariance matrix $V$ and $A$ is a p-dimensional square matrix of full rank, the (multivariate) Delta method may not be employed because the derivative of $Q$ computed at $\mu$ is null. If $A$ converges to an inverse of $V$, a useful result known as the Cochran theorem states that the distribution of $Q$ may be approximated by a χ² instead of a normal distribution. In fact, the theorem holds even if $A$ is not of full rank but converges to a generalized inverse of $V$. This is important for applications to categorical data.
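Returning to the odds example, the following sketch (with arbitrary hypothetical values $\pi_{i1} = 0.2$ and $n_i = 400$ of our own) checks the Delta-method variance against simulation, assuming numpy is available.

```python
import numpy as np

rng = np.random.default_rng(7)
pi1, n_i = 0.2, 400   # hypothetical failure probability and group size

# Delta method with g(x) = x/(1 - x), g'(x) = 1/(1 - x)^2:
# Var[g(p_hat)] ~ [g'(pi1)]^2 pi1 (1 - pi1)/n_i = pi1 / [n_i (1 - pi1)^3]
delta_var = pi1 / (n_i * (1 - pi1) ** 3)

# Monte Carlo check of the approximation
p_hat = rng.binomial(n_i, pi1, size=100000) / n_i
odds_hat = p_hat / (1 - p_hat)
print("delta-method variance:", round(delta_var, 6))
print("simulated variance:   ", round(odds_hat.var(), 6))
```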
The CLT also does not hold for extreme order statistics, such as the sample minimum or maximum; depending on some regularity conditions on the underlying random variables, the distribution of such statistics, conveniently normalized, may be approximated by one of three types of distributions, namely the extreme value distributions of the first, second, or third type, which, in this context, are the only possible limiting distributions, as shown by Gnedenko (1943).

Given that weak convergence has been established, a question of interest is whether the moments (e.g., mean and variance) of the statistics under investigation converge to the moments of the limiting distribution. Although the answer is negative in general, an important theorem, due to Cramér, indicates conditions under which the result is true (Sen et al., 2010).

For most practical purposes, the specification of an approximate distribution must be complemented with some discussion of the corresponding rates of convergence. In this direction, a bound on the error of approximation in the CLT is provided by the Berry–Esséen theorem; alternatively, such an evaluation may be carried out via Gram–Charlier or Edgeworth expansions (Cramér, 1946). Although this second approach might offer better insight into the problem than that provided by the former, it requires knowledge of the moments of the parent distribution and, thus, is less useful in practical applications.
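As a hedged illustration of the extreme-value limits mentioned above: for independent Exp(1) variables, the sample maximum centered by log n converges in distribution to the first-type (Gumbel) extreme value law. The settings below are arbitrary, and we assume numpy and scipy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 1000
# Maximum of n Exp(1) variables, centered by log n
m = rng.exponential(size=(5000, n)).max(axis=1) - np.log(n)

# Its limiting distribution function is the Gumbel law exp(-exp(-x))
for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}: simulated {np.mean(m <= x):.3f}, "
          f"Gumbel {stats.gumbel_r.cdf(x):.3f}")
```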
Asymptotic Properties of Estimators and Test Statistics

The convergence of an estimator (in any of the modes described previously) to the parameter being estimated is certainly desirable and, in general, holds under rather mild regularity conditions often satisfied in practice. A related concept, frequently employed in consistency proofs, is that of asymptotic unbiasedness: a sequence $\{T_n\}$ of estimators is asymptotically unbiased if the corresponding expected values converge to the parameter being estimated as n increases indefinitely. In practice, however, such properties are of limited value if not coupled with some form of convergence in distribution. Fortunately, it is possible to obtain such asymptotic distributions for a large class of estimators. In particular, the large-sample distribution of maximum likelihood estimators may be approximated by a normal distribution with variance given by the inverse of the Fisher information, provided a rather mild compactness condition on the second derivative of the log-likelihood is satisfied (Sen et al., 2010). This is generally true for estimators derived
under the exponential family of densities for the underlying random variables. Similar results are also available for other classes of parametric estimators (e.g., robust estimators or, more generally, M-estimators) or nonparametric estimators (e.g., those based on U- and V-statistics). Here, however, as shown in Jurečková and Sen (1996), more delicate assumptions are required.

Even if we restrict ourselves to the class of asymptotically normal estimators, we still need some further criteria in order to choose the best candidate; among them lies the concept of asymptotic relative efficiency (ARE), which, in the case of asymptotically normal estimators, corresponds to the ratio of the variances of the approximate distributions. In connection with such ideas, the concept of (absolutely) efficient estimators may be developed in a spirit similar to that of the Cramér–Rao–Fréchet inequality for finite samples; such estimators are termed best asymptotically normal (BAN) and, for practical purposes, estimators having this property may be used interchangeably. Note that, in general, the asymptotic relative efficiency of two estimators may not coincide with the Fisher asymptotic efficiency, defined as the ratio between the limit of the variance of the estimator (conveniently standardized) and the Cramér–Rao–Fréchet lower bound (Sen et al., 2010).

Large-sample properties of test statistics derived by Wald, score, and likelihood ratio methods may also be obtained along the same lines. Comparing the large-sample efficiency of tests, however, is slightly more complicated, since, for fixed alternative hypotheses, for example, $H_A: \mu = \mu_1$ where $\mu_1$ is a constant, the power of such tests converges to 1 as the sample size increases. Therefore, we must restrict ourselves to local Pitman-type alternatives, for example, $H_A: \mu_n = \mu_0 + \Delta/\sqrt{n}$, where $\mu_0$ and $\Delta$ are constants, which, in some sense, get closer to the null hypothesis $H_0: \mu = \mu_0$ as the sample size increases; this allows comparisons via the noncentrality parameters of the approximate distributions. Under this approach, we may show that the Wald, score, and likelihood ratio methods generate (large-sample) equivalent and locally most powerful tests.
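To fix ideas, the sketch below (our illustration, not drawn from the source) computes the Wald, score, and likelihood ratio statistics for $H_0: p = p_0$ in a Bernoulli sample; all three are asymptotically χ² with 1 degree of freedom and are large-sample equivalent. The data values are hypothetical.

```python
import numpy as np
from scipy import stats

def three_tests(successes, n, p0):
    """Wald, score, and likelihood ratio statistics for H0: p = p0."""
    p_hat = successes / n
    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)   # variance at p_hat
    score = (p_hat - p0) ** 2 / (p0 * (1 - p0) / n)        # variance at p0
    def loglik(p):
        return successes * np.log(p) + (n - successes) * np.log(1 - p)
    lr = 2 * (loglik(p_hat) - loglik(p0))
    return wald, score, lr

# Hypothetical data: 230 successes in n = 400 trials, testing H0: p = 0.5
for name, stat in zip(("Wald", "score", "LR"), three_tests(230, 400, 0.5)):
    print(f"{name:>5}: {stat:.3f}, approx p = {stats.chi2.sf(stat, df=1):.4f}")
```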
Extensions to Regression and Categorical Data Models

Linear regression and related models pose special problems, since the underlying random variables are not identically distributed and, in many cases, the exact functional form of their distributions is not completely specified. Least-squares methods are attractive under these conditions, since they may be employed in a rather general setup. In this context, the Hájek–Šidák CLT specifies sufficient conditions on the explanatory variables under which the distributions of the estimators of the regression parameters may be approximated by normal distributions.
As an illustration, consider the simple linear regression model
$$y_i = \alpha + \beta x_i + e_i, \quad i = 1, \ldots, n$$
where $y_i$ and $x_i$ represent observations of the response and explanatory variables, respectively, $\alpha$ and $\beta$ are the parameters of interest, and the $e_i$ correspond to uncorrelated random errors with mean 0 and variance $\sigma^2$. The least-squares estimators of $\beta$ and $\alpha$ are, respectively,
$$\hat{\beta} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\Big/\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad\text{and}\quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x},$$
where $\bar{x}$ and $\bar{y}$ correspond to the sample means of the explanatory and response variables. Irrespective of the form of the underlying distribution of the $e_i$, we may use standard results to show that $\hat{\alpha}$ and $\hat{\beta}$ are unbiased and have variances given by $\sigma^2\sum_{i=1}^{n}x_i^2\big/\big[n\sum_{i=1}^{n}(x_i - \bar{x})^2\big]$ and $\sigma^2\big[\sum_{i=1}^{n}(x_i - \bar{x})^2\big]^{-1}$, respectively. Furthermore, the covariance between $\hat{\alpha}$ and $\hat{\beta}$ is $-\sigma^2\bar{x}\big/\sum_{i=1}^{n}(x_i - \bar{x})^2$. When the underlying distribution of the $e_i$ is normal, we may use standard results to show that $\hat{\alpha}$ and $\hat{\beta}$ follow a bivariate normal distribution. If $\max_{1\leq i\leq n}(x_i - \bar{x})^2\big/\sum_{i=1}^{n}(x_i - \bar{x})^2$ converges to 0 (Noether's condition) and both $\bar{x}$ and $n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ converge to finite constants as n increases indefinitely, we may use the Hájek–Šidák CLT to conclude that the same bivariate normal distribution specified above serves as an approximation to the true distribution of $\hat{\alpha}$ and $\hat{\beta}$, whatever the form of the distribution of the $e_i$, provided that n is sufficiently large.

The results may also be generalized to cover alternative estimators obtained by means of generalized and weighted least-squares procedures, as well as via robust M-estimation procedures. They may also be extended to generalized linear and nonlinear models. Applications to categorical data pose additional problems, since many nonlinear models are attractive for their analysis. Although maximum likelihood estimators have optimal large-sample properties, they often require laborious computation because of the natural restrictions involving the parameters of the underlying multinomial distributions. In such cases, they are usually replaced by competitors such as minimum chi-squared, modified minimum chi-squared, or generalized least-squares estimators. Since all these methods generate BAN estimators, their large-sample properties are equivalent (Paulino and Singer, 2006), and the choice among them may rely on computational considerations.
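To illustrate the regression result numerically, the following sketch (with design, parameter values, and skewed errors chosen arbitrarily by us) compares the empirical variance of $\hat{\beta}$ under non-normal errors with the large-sample value $\sigma^2/\sum_{i=1}^{n}(x_i - \bar{x})^2$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha, beta, sigma2 = 200, 1.0, 0.5, 1.0
x = rng.uniform(0, 10, size=n)              # fixed design, reused throughout
sxx = np.sum((x - x.mean()) ** 2)

beta_hat = np.empty(5000)
for r in range(5000):
    e = rng.exponential(1.0, size=n) - 1.0  # skewed errors, mean 0, variance 1
    y = alpha + beta * x + e
    beta_hat[r] = np.sum((x - x.mean()) * (y - y.mean())) / sxx

print("empirical variance of beta_hat:   ", beta_hat.var())
print("large-sample variance sigma^2/Sxx:", sigma2 / sxx)
```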
Extensions to Empirical Distribution Functions and Order Statistics

Empirical distribution functions and order statistics have important applications in nonparametric regression models, resampling methods such as the jackknife and bootstrap, sequential testing, as well as in survival and
reliability analyses. In particular, they serve as the basis for the well-known Kolmogorov–Smirnov and Cramér–von Mises goodness-of-fit statistics and for L- and R-estimators such as trimmed or Winsorized means. Given the sample observations $Y_1, \ldots, Y_n$, assumed to follow some distribution function $F$, the empirical distribution function computed at a given real number $y$ is
$$F_n(y) = n^{-1}\sum_{i=1}^{n} I(Y_i \leq y)$$
where $I(Y_i \leq y)$ is an indicator function assuming the value 1 if $Y_i \leq y$ and 0 otherwise. It is intimately related to the order statistics $Y_{n:1} \leq Y_{n:2} \leq \cdots \leq Y_{n:n}$, where $Y_{n:1}$ is the smallest among $Y_1, \ldots, Y_n$, $Y_{n:2}$ is the second smallest, and so on. For each fixed sample, $F_n$ is a distribution function when considered as a function of $y$. For every fixed $y$, when considered as a function of $Y_1, \ldots, Y_n$, $F_n(y)$ is a random variable; in this context, since the $I(Y_i \leq y)$, $i = 1, \ldots, n$, are independent and identically distributed zero-one valued random variables, we may apply the CLT to conclude that, for each fixed $y$, the distribution of $F_n(y)$ may be approximated by a $N\{F(y), F(y)[1 - F(y)]/n\}$ distribution, provided that n is sufficiently large. To extend these results to the function $F_n$ computed at all real values $y$, more sophisticated methods are needed, as suggested in Jurečková and Sen (1996), among others.
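A short sketch of the pointwise result (with an arbitrary Exp(1) sample of our choosing) computes $F_n(y)$ at a fixed point and the CLT-based approximate 95% confidence interval for $F(y)$.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
sample = rng.exponential(size=n)   # arbitrary Exp(1) sample

y = 1.0
fn_y = np.mean(sample <= y)        # F_n(y) = n^{-1} sum_i I(Y_i <= y)
se = np.sqrt(fn_y * (1 - fn_y) / n)

print(f"F_n({y}) = {fn_y:.3f}")
print(f"approximate 95% CI for F({y}): "
      f"({fn_y - 1.96 * se:.3f}, {fn_y + 1.96 * se:.3f})")
print(f"true F({y}) = {1 - np.exp(-y):.3f}")   # Exp(1) distribution function
```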
Conclusion

When the sample size is small, properties of estimators and test statistics required for inferential purposes may be either hard to derive or based on assumptions that are difficult to verify in practical applications. Large-sample methods developed under mild assumptions, usually acceptable for most practical problems, provide approximations for such properties and generally lead to standard inferential procedures, such as confidence intervals based on the normal distribution or χ²-distributed test statistics. Although the underlying theory rests on asymptotic (i.e., sample sizes tending to infinity) arguments, the resulting approximate procedures may be satisfactorily employed in cases where the sample size is moderate or even small.

See also: Analysis of Covariance; Categorical Data Analysis; Decision Theory; Educational Data Modeling; Generalized Linear Models; Graphical Models; Multivariate Normal Distribution; Nonparametric Statistical Methods; Order Statistics; Probability Theory; Sequential Testing; Stochastic Processes; Survival Data Analysis; Time Series Analysis.
Bibliography

Apostol, T. M. (1974). Mathematical Analysis. Reading, MA: Addison-Wesley.
Baker, F. B. and Kim, S. (2004). Item Response Theory: Parameter Estimation Techniques, 2nd edn. New York: Marcel Dekker.
Chow, Y. S. and Teicher, H. (1978). Probability Theory: Independence, Interchangeability, Martingales. New York: Springer.
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton, NJ: Princeton University Press.
Dvoretzky, A. (1971). Asymptotic normality for sums of dependent random variables. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, vol. 2, pp. 513-535. Berkeley: University of California Press.
Gnedenko, B. V. (1943). Sur la distribution limite du terme maximum d'une série aléatoire. Annals of Mathematics 44, 423-453.
Hambleton, R. K., Swaminathan, H., and Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Annals of Mathematical Statistics 19, 293-325.
Jurečková, J. and Sen, P. K. (1996). Robust Statistical Procedures. New York: Wiley.
Paulino, C. D. and Singer, J. M. (2006). Análise de Dados Categorizados (in Portuguese). São Paulo: Blücher.
Sen, P. K., Singer, J. M., and Pedroso-de-Lima, A. C. (2010). From Finite Sample to Asymptotic Methods in Statistics. Cambridge: Cambridge University Press.
Further Reading

Ferguson, T. S. (1996). A Course in Large Sample Theory. London: Chapman and Hall.
Lehmann, E. L. (2004). Elements of Large Sample Theory. New York: Springer.
Reiss, R. D. (1989). Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics. New York: Springer.
Sen, P. K. (1981). Sequential Nonparametrics. New York: Wiley.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: Wiley.
van der Vaart, A. W. (1998). Asymptotic Statistics. New York: Cambridge University Press.