M-Estimation with Incomplete and Dependent Multivariate Data

Gabriel Frahm a, Klaus Nordhausen b,∗, Hannu Oja c

a Department of Mathematics and Statistics, Helmut Schmidt University, 22008 Hamburg, Germany
b CSTAT – Computational Statistics, Institute of Statistics & Mathematical Methods in Economics, Vienna University of Technology, Wiedner Hauptstraße 7, 1040 Vienna, Austria
c Department of Mathematics and Statistics, University of Turku, 20014 Turku, Finland

∗ Corresponding author. E-mail addresses: [email protected] (Gabriel Frahm), [email protected] (Klaus Nordhausen), [email protected] (Hannu Oja)
Abstract
We extend the theory of M-estimation to incomplete and dependent multivariate data. ML-estimation can still be considered a special case of M-estimation in this context. We show that, to guarantee the consistency of an M-estimator, the unobserved data must be missing completely at random and not merely missing at random, which is the typical assumption of ML-estimation. Further, we show that the weight functions for scatter must satisfy a critical scaling condition, which is implicitly fulfilled both by the Gaussian and by Tyler's weight function. We generalize this principal result by introducing the class of power weight functions, which contains the two aforementioned weight functions as limiting cases. A simulation study confirms our theoretical findings. If the data are heavy tailed or contaminated, the M-estimators turn out to be favorable compared to the ML-estimators that are based on the normal-distribution assumption.

Keywords: Dependent data, Incomplete data, Location, M-estimation, Missing data, Scatter, Shape, Spatial data, Time-series data.
2010 MSC: 62H12
1. Motivation
In multivariate data analysis, practitioners often deal with high-dimensional, incomplete, serially or spatially dependent, and heavy-tailed data that may even be contaminated by measurement errors. For example, this is a typical situation in modern portfolio optimization [37, 38], where it is a stylized fact that short-term asset returns are not serially independent. The aforementioned problems also frequently occur in many other scientific disciplines like meteorology, environmental sciences, life sciences, geophysics, and signal and image processing. This development may be attributed to modern computer networks, which produce an enormous flood of information. Nowadays, under the buzzword "big data," people try to find significant and useful patterns in large data sets. Hence, the development of appropriate statistical procedures is highly relevant both from a practical and from a theoretical point of view. The robust-statistics literature regarding complete and independent data is overwhelming. See, for example, the well-known textbooks of Hampel et al. [19] and Maronna et al. [40]. Robust estimation procedures for complete but dependent data are less widespread. Regarding the estimation of parameters of the stationary or marginal distribution see, for example, Gastwirth and Rubin [18], Koul [30], and Portnoy [44], who refer to univariate data, as well as Koul and Mukherjee [31], Wu [55], and Yuanxi [56] in the context of multivariate data. Robust procedures that are typically applied in time-series analysis, such as HAC estimation and the stationary bootstrap [see, e.g., 22, 43], aim at computing asymptotic covariance matrices or standard errors, but this is not the main goal of our work. Missing-data analysis is a well-established branch of statistics. Excellent textbooks in that field have been written by Little and Rubin [35] as well as Schafer [45].
However, most estimation procedures presented in those textbooks are not robust, and the robust alternatives to the standard methods of missing-data analysis that can be found in the literature typically presume that the data are independent. In this work, we focus on multivariate data analysis. More precisely, we discuss the estimation of location and scatter. Of course, this includes the univariate case. Moreover, we also discuss the estimation of shape matrices, i.e., of normalized scatter matrices. Estimating location and scatter is an essential task of multivariate data analysis, but there exist only a few contributions on robust analysis of incomplete multivariate data. For example, Frahm and Jaekel [16] generalize Tyler's M-estimator for scatter [50] to the case of incomplete data. Wang [53] discusses M-estimation for censored data, whereas Danilov et al. [8] refer to S-estimation with incomplete data. Flossmann [12] and [54] propose inverse probability weighting. This requires specifying selection probabilities, which might lead to inconsistent parameter estimates if the propensity model is misspecified. By contrast, M-estimation is consistent if some basic regularity conditions, which are discussed in this work, are satisfied. Other authors focus on regression analysis [2, 20, 47] or principal component analysis [46], but the general scope of regression and principal component analysis seems to be quite different from ours. The traditional estimation approach for location and scatter of incomplete multivariate data is based on multiple imputation [35, 45]. This estimation approach typically presumes that the data are multivariate normally distributed, and so the resulting estimators are not robust. For this reason, Branden and Verboven [4], Hron et al. [25], and Templ et al. [48] develop robust imputation algorithms. Little [33] assumes that the data are contaminated or multivariate t-distributed. He estimates location and scatter by maximum likelihood, whereas Little and Smith [36] propose an estimation method based on imputation. Cheng and Victoria-Feser [6] improve the algorithm used by Little and Smith [36] for high-dimensional data by applying high-breakdown estimators (e.g., Rousseeuw's minimum-volume ellipsoid estimator) and hybrid algorithms. See also Copt and Victoria-Feser [7], who propose a modified algorithm and use the orthogonalized Gnanadesikan-Kettenring estimator as a starting point for an adapted S-estimator. Another promising alternative is presented by Han [21], who combines multiple imputation with inverse probability weighting. All aforementioned contributions presume that the data are independent. By contrast, Palma and del Pino [42] consider incomplete (long-range) dependent data. However, they refer to univariate time series and their methods are not robust. Although the given list of contributions is not exhaustive, our observation is that most authors do not take serial or spatial dependence into account and, at least in some cases, the scope of their contributions seems to be somewhat limited. In our opinion, the main challenge for incomplete and dependent multivariate data is not only to guarantee the consistency of the estimators but also to obtain their asymptotic covariance matrices. Of course, this is an essential task if one is interested in confidence regions or wants to conduct hypothesis tests. To the best of our knowledge, a general theory of M-estimation with incomplete and dependent multivariate data is still missing. This work tries to fill this gap.
We demonstrate our method by deriving M-estimators for location and scatter. We also conduct an extensive simulation study in order to confirm our theoretical findings.

2. Notation, Definitions, and Regularity Conditions
Let X be a sample, i.e., an m × n real-valued random matrix. For example, X may consist of m attributes of n individuals at a specific point in time (cross-sectional data), of m attributes of a specific individual at n points in time (time-series data), or of a specific attribute of m individuals at n points in time (panel data). In the following, we assume without loss of generality that X is a sample of cross-sectional data. Let R be a response indicator, i.e., an m × n matrix of Bernoulli variables. It indicates which part of X is missing ("0") and which one is observed ("1"). As is usual in the statistics literature, random quantities are denoted by capital letters, whereas their realizations are symbolized by small letters. For example, x and r are realizations of the random matrices X and R, respectively. This means that x is an m × n matrix of real numbers and r is an m × n matrix of zeros and ones. The components of X and R may depend on one another. Let Xi and Ri be the ith column of X and R, respectively. This means that Xi and Ri belong to the ith individual, where Xi = (X1i, . . . , Xmi) denotes an m-dimensional column vector. Thus, Xi⊤ = (X1i · · · Xmi) ≠ (X1i, . . . , Xmi) for m > 1, and X = (X1 · · · Xn) is an m × n random matrix. The joint distribution of Xi and Ri is given by f(xi, ri; θ) = f(xi; θ) f(ri | Xi = xi; θ) for all i ∈ {1, . . . , n}, where θ ∈ Θ ⊆ R^p is some unknown parameter and Θ is an open subset of R^p. Hence, for notational convenience, we will say that "f(yi) is the distribution of Yi," where Yi is any random vector, although "yi ↦ f(yi)" would be more precise. We do not use the simple expression "f" for the distribution of Yi. This would be highly misleading in our context, since it omits the specific variable yi. Moreover, we drop the phrase "for all i ∈ {1, . . . , n}" in the subsequent analysis whenever it is clear from the context that the given statement refers to each individual.
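To fix ideas, the following minimal Python sketch shows one way to represent a realization of the sample X together with the response indicator R; the dimensions and the missingness probability are arbitrary illustration choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 4, 10                                  # attributes x individuals
x = rng.standard_normal((m, n))               # realization x of the sample X
r = (rng.random((m, n)) < 0.7).astype(int)    # realization r of the response indicator R

# Observed subvector x_{r_i} of individual i: the components of x_i with r_i = 1.
i = 0
x_obs_i = x[r[:, i] == 1, i]
print(r[:, i], x_obs_i)
```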
The distribution of Xi can be either discrete or continuous, whereas the distribution of Ri is always discrete by definition. It is assumed that the joint distribution of Xi and Ri is identical for each individual, i.e., f(xi, ri; θ) = f(xj, rj; θ) for all i, j ∈ {1, . . . , n}. Let xr be the observed and xr̄ the missing data of all individuals. The m × n matrix r denotes the given responses, whereas r̄ symbolizes the non-responses. This means that r̄ = 1 − r, where 1 is an m × n matrix of ones. Further, let ri be the response, xri the observed, and xr̄i the missing data of Individual i. Actually, "xri" is a shorthand notation for $x_{r_i i}$, i.e., it denotes the observed components of the vector xi. Analogously, "xr̄i" is a shorthand notation for $x_{\bar r_i i}$ and symbolizes the unobserved components of xi. The joint distribution of XRi and Ri is

$$f(x_{r_i}, r_i; \theta) = \int f(x_{r_i}, x_{\bar r_i}, r_i; \theta)\, dx_{\bar r_i} = \int \underbrace{f(x_{r_i}, x_{\bar r_i}; \theta)}_{=\, f(x_i;\, \theta)}\; \underbrace{f(r_i \mid X_{r_i} = x_{r_i}, X_{\bar r_i} = x_{\bar r_i}; \theta)}_{=\, f(r_i \mid X_i = x_i;\, \theta)}\, dx_{\bar r_i}. \tag{1}$$

Further, we often encounter the conditional distribution f(ri | Xri = xri; θ). This can be interpreted as the probability that Individual i provides the response ri given that his observed data are xri. Here, the response of Individual i, i.e., ri, is considered fixed. In the following, we use the typical abbreviations "ML" for maximum likelihood and "M" for maximum-likelihood type. Suppose that we want to estimate the unknown parameter θ by ML. The problem is that we can observe only xr and r. Thus, our (composite) likelihood is

$$L(\theta;\, X_R, R) = \prod_{i=1}^n f(X_{R_i}, R_i;\, \theta), \tag{2}$$
which represents a random variable. In many practical applications, the random vectors (XR1, R1), . . . , (XRn, Rn) are serially or spatially dependent. Hence, the joint distribution f(xr, r; θ) does not correspond to $\prod_{i=1}^n f(x_{r_i}, r_i; \theta)$. This depends on whether we work with cross-sectional, time-series, or panel data. However, treating the data as independent is a standard approach in econometrics [23] and, in general, the resulting ML-estimator is asymptotically inefficient but not inconsistent. Of course, if the statistician knows the sort of serial or spatial dependence, he or she should take the given dependence structure into consideration when estimating θ in order to obtain an asymptotically efficient ML-estimator. In this work, we do not presume that the dependence structure is known. Hence, our approach can be considered nonparametric regarding the serial or spatial dependence structure of the data. Our primary focus is on consistency and robustness rather than asymptotic efficiency. Typically, it is not possible to use the likelihood function L(· ; XR, R). This is not because we are unable to specify the distribution of Xi, i.e., f(xi; θ). The problem is that we do not know the distribution of Ri given Xi = xi, and so we cannot calculate f(xri, ri; θ) (see (1)). In fact, the dependence structure between Xi and Ri can be quite complicated. Of course, the same holds true for the serial or spatial dependence structure. In this work, we focus on the dependence between Xi and Ri. Thus, we ignore the dependence between (Xi, Ri) and (Xj, Rj) for i ≠ j. The distribution of Ri given Xi = xi is said to be the missingness mechanism of the experiment. Estimating θ would be much easier if we could ignore the missingness mechanism and use the observed-data likelihood [45, p. 12]

$$L(\theta;\, X_R) = \prod_{i=1}^n f(X_{R_i};\, \theta). \tag{3}$$
More precisely, f(xri; θ) is the distribution of Xri, i.e., of the subvector of Xi that is observed for Individual i. Note that Xri is a random vector, where ri symbolizes the realization of the response vector Ri for Individual i. Hence, f(xri; θ) denotes a marginal distribution. More precisely, it is the distribution of the observed part of the random vector Xi, i.e., of Xri. Further, XRi is a random quantity, too, but here both the response indicator Ri for Individual i and the corresponding observations, i.e., that part of Xi that is indicated by Ri, are considered stochastic. Finally, xri represents a realization of XRi and thus everything in that quantity is deterministic. Ignoring the missingness mechanism is possible if the so-called ignorability condition

$$f(r_i \mid X_{r_i} = x_{r_i};\, \theta) = f(r_i \mid X_{r_i} = x_{r_i}) \tag{4}$$
is satisfied [45, Section 2.3.1], in which case we have that f(xri, ri; θ) = f(xri; θ) f(ri | Xri = xri) ∝ f(xri; θ) and thus L(θ; XR, R) ∝ L(θ; XR). The resulting ML-estimator θ̂ is the solution of the estimating equation

$$\Phi_R\big(\hat\theta;\, X_R\big) = \frac{1}{n} \sum_{i=1}^n \phi_{R_i}\big(\hat\theta;\, X_{R_i}\big) = 0 \tag{5}$$
with φRi(θ̂; XRi) = ∂ log f(XRi; θ̂)/∂θ. We write "θ̂" instead of "θ̂n" just to avoid an abundant use of subscripts. Further, "0" denotes a scalar, a vector, or a matrix of zeros, whose dimension shall always be clear from the context. Hence, in the multivariate case, φRi(θ̂; XRi) is a vector and so the "0" on the right-hand side of (5) represents a vector of zeros. Whenever we refer to the ML-estimator θ̂, we suppose that the ignorability condition is satisfied. In Section 3.1 we discuss typical assumptions that guarantee that this important requirement is met. Let F be the joint cumulative distribution function of Xi and ϑ = T(F) ∈ R^q some parameter such that

$$E\big\{\psi_{r_i}(\vartheta;\, X_{r_i})\big\} = 0 \tag{6}$$
for every fixed response ri that is possible for Individual i. Here ψri(· ; Xri) is a function from R^q to R^q. An M-estimator ϑ̂ is the solution of the estimating equation

$$\Psi_R\big(\hat\vartheta;\, X_R\big) = \frac{1}{n} \sum_{i=1}^n \psi_{R_i}\big(\hat\vartheta;\, X_{R_i}\big) = 0, \tag{7}$$
where ΨR(ϑ̂; XR) represents a (composite) score. Hence, the ML-estimating equation (5) is just a special case of the M-estimating equation (7). In the ML-case we have that ψRi(ϑ; XRi) = ∂ log f(XRi; ϑ)/∂ϑ with ϑ ≡ θ. If integration and differentiation are interchangeable, it turns out that

$$E\big\{\psi_{r_i}(\vartheta; X_{r_i})\big\} = E\left\{\frac{\partial \log f(X_{r_i}; \vartheta)}{\partial \vartheta}\right\} = \int \frac{\partial \log f(x_{r_i}; \vartheta)}{\partial \vartheta}\, f(x_{r_i}; \vartheta)\, dx_{r_i} = \int \frac{\partial f(x_{r_i}; \vartheta)/\partial \vartheta}{f(x_{r_i}; \vartheta)}\, f(x_{r_i}; \vartheta)\, dx_{r_i} = \frac{\partial}{\partial \vartheta} \int f(x_{r_i}; \vartheta)\, dx_{r_i} = \frac{\partial}{\partial \vartheta}\, 1 = 0.$$

Hence, in the context of ML-estimation, the orthogonality condition (6) is always satisfied.
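The estimating equation (7) can be solved numerically once a score function is chosen. As a hedged one-dimensional illustration, the following Python sketch computes a Huber-type M-estimator of location for complete data by root finding; the sample, the tuning constant, and the bracketing interval are hypothetical choices made for this example only.

```python
import numpy as np
from scipy import optimize

def score(theta, x, k=1.345):
    """Composite score as in (7) for a univariate Huber-type location
    estimator: psi(theta; x) = clip(x - theta, -k, k), averaged over the sample."""
    return np.mean(np.clip(x - theta, -k, k))

rng = np.random.default_rng(1)
x = np.concatenate([rng.standard_normal(200), [50.0, 60.0]])  # two gross outliers

theta_hat = optimize.brentq(lambda t: score(t, x), -10.0, 10.0)
print(theta_hat)   # close to 0 despite the contamination
```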
The asymptotic results presented later are based on the following regularity conditions:

A1: We have that

$$n^{1/2}\, \Psi_R(\vartheta;\, X_R) \longrightarrow \mathcal{N}_q(0, F_\vartheta), \qquad n \to \infty,$$

with

$$F_\vartheta = \lim_{n \to \infty} \frac{1}{n} \sum_{i,j=1}^n \mathrm{Cov}\big(\psi_{R_i}(\vartheta; X_{R_i}),\, \psi_{R_j}(\vartheta; X_{R_j})\big).$$
A2: The q × q matrix ∂ΨR(ϑ; XR)/∂ϑ⊤ is regular and it holds that

$$H_\vartheta = \operatorname*{plim}_{n \to \infty}\, \frac{1}{n} \sum_{i=1}^n \frac{\partial \psi_{R_i}(\vartheta; X_{R_i})}{\partial \vartheta^\top} = E\left\{\frac{\partial \psi_{R_i}(\vartheta; X_{R_i})}{\partial \vartheta^\top}\right\}$$

with det Hϑ ≠ 0.

A3: We can apply the Taylor expansion

$$\Psi_R\big(\hat\vartheta;\, X_R\big) = \Psi_R(\vartheta;\, X_R) + \frac{\partial \Psi_R(\vartheta;\, X_R)}{\partial \vartheta^\top}\, \big(\hat\vartheta - \vartheta\big) + O_p\big(n^{-1}\big).$$
In the ML-case, we denote Fϑ and Hϑ by Fθ and Hθ, respectively. First of all, we would like to explain A1. We could have written, equivalently,

$$n^{1/2}\, \big(\Psi_R(\vartheta;\, X_R) - E\big\{\Psi_R(\vartheta;\, X_R)\big\}\big) \longrightarrow \mathcal{N}_q(0, F_\vartheta), \qquad n \to \infty. \tag{8}$$
In fact, we implicitly assume that

$$E\big\{\psi_{R_i}(\vartheta;\, X_{R_i})\big\} = 0 \tag{9}$$

and thus E{ΨR(ϑ; XR)} = 0, since the distribution of ψRi(ϑ; XRi) is the same for all i ∈ {1, . . . , n}. Note that (9) is essentially different from (6), which considers the response of Individual i fixed, whereas in (9) the response of that individual represents a random vector. In general, (6) does not imply (9). The property expressed by (9) is called Fisher consistency because it guarantees that the resulting estimator is consistent. Later on we will see under which circumstances the score function ΨR(· ; XR) is, indeed, Fisher consistent for ϑ. A1 and A2 can be motivated by ergodic theory. To be more precise, in many practical applications we observe an ergodic stationary process (Zt) with Zt ∼ Z. This means that for every measurable function h with E|h(Z)| < ∞ we have that $n^{-1} \sum_{t=1}^n h(Z_t) \to E\{h(Z)\}$ as n → ∞. The mode of convergence may depend on the chosen framework, but at least the convergence holds true in probability. We can assume also that the process (Zt) is strong mixing, which means that the one-sided processes (Zt)t≤0 and (Zt)t≥l (l ∈ ℕ) are asymptotically independent, i.e., they become independent as the lag l between (Zt)t≤0 and (Zt)t≥l grows to infinity. If the convergence rate is high enough, $n^{-1/2} \sum_{t=1}^n \big(Z_t - E(Z_t)\big)$ is asymptotically normally distributed as n → ∞. This leads us to the central limit theorem expressed by (8). For more details on that topic see, e.g., the review article by Bradley [3]. The aforementioned properties are typically used in the context of time-series analysis, but they have a meaningful interpretation also if we deal with cross-sectional data. In that case, strong mixing says that the attributes of any individual become independent from the attributes of another individual as the distance between the two individuals grows to infinity. Here, the term "distance" can be understood, e.g., in a social or a regional sense. Hence, the regularity conditions A1 and A2 are satisfied if ψRi(ϑ; XRi) is strong mixing (with a sufficiently high convergence rate) and ∂ψRi(ϑ; XRi)/∂ϑ⊤ is ergodic stationary. We hope that these minimal conditions are satisfied in most practical applications. The limit Fϑ is influenced by the linear dependence structure of the elementary scores ψR1(ϑ; XR1), . . . , ψRn(ϑ; XRn), which follows by

$$\mathrm{Cov}\big(n^{1/2}\, \Psi_R(\vartheta;\, X_R)\big) = \mathrm{Cov}\left(n^{-1/2} \sum_{i=1}^n \psi_{R_i}(\vartheta;\, X_{R_i})\right) = \frac{1}{n} \sum_{i,j=1}^n \mathrm{Cov}\big(\psi_{R_i}(\vartheta; X_{R_i}),\, \psi_{R_j}(\vartheta; X_{R_j})\big) \longrightarrow F_\vartheta$$
as the sample size n grows to infinity. Hence, the stronger the dependence of the data, the bigger Fϑ. As we will see below, this has an essential impact on the asymptotic covariance matrix of n^{1/2}(ϑ̂ − ϑ). By contrast, the limit Hϑ is not influenced by the dependence structure of the elementary scores. This is an immediate consequence of the ergodic theorem, which states that the limit of an arithmetic mean of an ergodic stationary process equals the expected value of each element of that process. A3 is a typical regularity condition of asymptotic theory [52, Chapter 5]. There are many possibilities to guarantee that the remainder of the Taylor expansion expressed by A3 is bounded in probability at the rate n⁻¹. Sufficient conditions can be found, e.g., in Huber [26]. The same arguments can be applied to the incomplete-data case. In order to keep things as general as possible we avoid any specific requirement on the score function ΨR(· ; XR). Now, A3 basically states that

$$0 = \Psi_R(\vartheta;\, X_R) + \frac{\partial \Psi_R(\vartheta;\, X_R)}{\partial \vartheta^\top}\, \big(\hat\vartheta - \vartheta\big) + O_p\big(n^{-1}\big),$$

which means that

$$n^{1/2}\, \big(\hat\vartheta - \vartheta\big) = -\left(\frac{\partial \Psi_R(\vartheta;\, X_R)}{\partial \vartheta^\top}\right)^{-1} n^{1/2}\, \Psi_R(\vartheta;\, X_R) + O_p\big(n^{-1/2}\big).$$

Finally, from A1 and A2 we conclude that
$$n^{1/2}\, \big(\hat\vartheta - \vartheta\big) \longrightarrow -H_\vartheta^{-1}\, \mathcal{N}_q(0, F_\vartheta) \sim \mathcal{N}_q\big(0,\, H_\vartheta^{-1} F_\vartheta H_\vartheta^{-1}\big), \qquad n \to \infty,$$

which explains why the asymptotic covariance matrix of n^{1/2}(ϑ̂ − ϑ) is determined by the dependence structure of the data through Fϑ but not through Hϑ.
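In practice Fϑ and Hϑ must be estimated from the data. The following Python sketch assembles the sandwich Hϑ⁻¹FϑHϑ⁻¹ from elementary scores; the Bartlett-kernel (HAC-type) correction for serial dependence and the truncation lag are our own illustrative assumptions, since the paper itself does not prescribe a specific estimator of Fϑ.

```python
import numpy as np

def sandwich_cov(scores, dscores, max_lag=5):
    """Plug-in estimate of H^{-1} F H^{-1}.
    scores:  n x q array of elementary scores psi_{R_i}(theta_hat; X_{R_i})
    dscores: n x q x q array of derivatives d psi / d theta^T
    """
    n = scores.shape[0]
    H = dscores.mean(axis=0)                   # estimates H_theta
    F = scores.T @ scores / n                  # lag-0 covariance
    for lag in range(1, max_lag + 1):          # Bartlett-weighted lag terms
        w = 1.0 - lag / (max_lag + 1)
        G = scores[lag:].T @ scores[:-lag] / n
        F += w * (G + G.T)
    H_inv = np.linalg.inv(H)
    return H_inv @ F @ H_inv.T                 # asymptotic covariance (divide by n for SEs)
```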
3. Theory of M-Estimation for Incomplete and Dependent Data

3.1. Maximum-Likelihood Estimation

Now, we discuss typical assumptions for the ignorability condition (4). We have that

$$f(r_i \mid X_{r_i} = x_{r_i};\, \theta) = \int \underbrace{f(r_i \mid X_{r_i} = x_{r_i}, X_{\bar r_i} = x_{\bar r_i};\, \theta)}_{=\, f(r_i \mid X_i = x_i;\, \theta)}\; f(x_{\bar r_i} \mid X_{r_i} = x_{r_i};\, \theta)\, dx_{\bar r_i}.$$
First of all, we assume that the missingness mechanism is not determined by the parameter θ. This is the so-called distinctness assumption.

DIS: f(ri | Xi = xi; θ) = f(ri | Xi = xi).

Indeed, the missingness mechanism f(ri | Xi = xi) can be parametric, too, but we are interested only in estimating the parameter θ. Thus, we may ignore the parameter in the conditional distribution f(ri | Xi = xi). Another typical assumption of missing-data analysis is that, conditional on his observed data, the response of Individual i does not depend on his missing data. In this case, we say that xr̄ is missing at random.

MAR: f(ri | Xri = xri, Xr̄i = xr̄i; θ) = f(ri | Xri = xri; θ).

MAR requires only that the response of each individual is conditionally independent of his own missing data. If MAR is violated, we say that the unobserved data are not missing at random (NMAR). From DIS and MAR it follows that f(ri | Xi = xi; θ) = f(ri | Xi = xi) = f(ri | Xri = xri) and thus

$$f(r_i \mid X_{r_i} = x_{r_i};\, \theta) = \int f(r_i \mid X_{r_i} = x_{r_i})\, f(x_{\bar r_i} \mid X_{r_i} = x_{r_i};\, \theta)\, dx_{\bar r_i} = f(r_i \mid X_{r_i} = x_{r_i}) \int f(x_{\bar r_i} \mid X_{r_i} = x_{r_i};\, \theta)\, dx_{\bar r_i} = f(r_i \mid X_{r_i} = x_{r_i}).$$

This means that the ignorability condition (4) is satisfied and so we can use (3) instead of (2) in order to estimate θ. The following interchangeability assumption is familiar in ML-theory.

INT: Integration and differentiation are twice interchangeable, i.e.,

$$\int\!\!\int \frac{\partial f(r_i, x_{r_i};\, \theta)}{\partial \theta}\, dx_{r_i}\, dr_i = \frac{\partial}{\partial \theta} \int\!\!\int f(r_i, x_{r_i};\, \theta)\, dx_{r_i}\, dr_i$$

and

$$\int\!\!\int \frac{\partial^2 f(r_i, x_{r_i};\, \theta)}{\partial \theta\, \partial \theta^\top}\, dx_{r_i}\, dr_i = \frac{\partial}{\partial \theta} \int\!\!\int \frac{\partial f(r_i, x_{r_i};\, \theta)}{\partial \theta^\top}\, dx_{r_i}\, dr_i.$$

The observed-data likelihood L(θ; XR) leads to a consistent ML-estimator, provided DIS, MAR, and INT are satisfied. This is guaranteed by the following proposition, which establishes the Fisher consistency of ΦR(· ; XR). Recall that this is an implicit argument for A1 (see the end of Section 2).
Proposition 1. Under the assumptions DIS, MAR, and INT the score function ΦR(· ; XR) is Fisher consistent for θ, which means that E{φRi(θ; XRi)} = 0.

We also wish to guarantee that θ̂ is asymptotically normally distributed after its usual standardization, i.e., n^{1/2}(θ̂ − θ). Asymptotic normality and efficiency are established by the following theorem.
Theorem 1. Under the assumptions A1–A3, DIS, MAR, and INT we have that

$$n^{1/2}\, \big(\hat\theta - \theta\big) \longrightarrow \mathcal{N}_p\big(0,\, H_\theta^{-1} F_\theta H_\theta^{-1}\big), \qquad n \to \infty.$$

Moreover, if the elementary scores φR1(θ; XR1), . . . , φRn(θ; XRn) are uncorrelated, it holds that

$$n^{1/2}\, \big(\hat\theta - \theta\big) \longrightarrow \mathcal{N}_p\big(0,\, F_\theta^{-1}\big), \qquad n \to \infty,$$

and θ̂ is an asymptotically efficient estimator for θ.
3.2. Maximum-Likelihood-Type Estimation

Now, we aim at estimating the parameter ϑ = T(F) by solving (7). Although the condition E{ψri(ϑ; Xri)} = 0 is satisfied by definition (see (6)), it can happen that E{ψRi(ϑ; XRi)} ≠ 0. In this case ΨR(· ; XR) is not Fisher consistent for ϑ. Hence, we need an appropriate regularity condition in the context of M-estimation. To be more precise, we have to assume that the unobserved data are missing completely at random.

MCAR: f(ri | Xi = xi; θ) = f(ri; θ).

This means that the response of Individual i must depend neither on his observed nor on his unobserved data. It is required only that the response of each individual is independent of his own data. Hence, the response of some individual may very well depend on the data of another individual.
Proposition 2. Under the assumption MCAR the score function ΨR (· ; XR ) is Fisher consistent for ϑ, which means that E ψRi (ϑ; XRi ) = 0 .
Proposition 2 does not require DIS. Nonetheless, MCAR guarantees that the M-estimator ϑ̂ is consistent. The ML-estimator θ̂ is an M-estimator, but DIS is no longer necessary for the consistency of θ̂ if MCAR is satisfied. Hence, MCAR makes DIS and MAR superfluous. Indeed, MCAR implies MAR. However, MCAR alone does not guarantee the asymptotic efficiency of θ̂: besides A1–A3, Theorem 1 requires the additional assumptions DIS and INT. Further, DIS together with MAR implies that we can ignore the missingness mechanism when we calculate the likelihood (2). However, this is not sufficient for the consistency of an M-estimator, which can be seen as follows: Under the ignorability condition it turns out that

$$E\big\{\psi_{R_i}(\vartheta;\, X_{R_i})\big\} = \int\!\!\int f(r_i \mid X_{r_i} = x_{r_i})\, \psi_{r_i}(\vartheta;\, x_{r_i})\, f(x_{r_i};\, \theta)\, dx_{r_i}\, dr_i,$$

but the term f(ri | Xri = xri) is still determined by xri unless the unobserved data are MCAR. That is, it cannot be extracted from the inner integral in order to make use of the fact that

$$\int \psi_{r_i}(\vartheta;\, x_{r_i})\, f(x_{r_i};\, \theta)\, dx_{r_i} = E\big\{\psi_{r_i}(\vartheta;\, X_{r_i})\big\} = 0.$$
Thus, it is worth emphasizing that we do not make the assumption MCAR just because it is a sufficient condition for the consistency of M-estimators. In general, M-estimators are inconsistent if the unobserved data are not MCAR. Hence, MCAR is a requirement that must not be ignored in practical applications of missing-data analysis. In many real-life situations the parametric family of Xi is unknown to us and then applying an ML-estimator can be misleading. The following theorem completes our general results on M-estimation with incomplete and dependent data.
Theorem 2. Under the assumptions A1–A3 and MCAR we have that

$$n^{1/2}\, \big(\hat\vartheta - \vartheta\big) \longrightarrow \mathcal{N}_q\big(0,\, H_\vartheta^{-1} F_\vartheta H_\vartheta^{-1}\big), \qquad n \to \infty.$$
Whether or not MCAR might be violated in a real-life situation often follows from practical considerations. For example, if some respondents in a questionnaire refuse to answer a question because they feel that their answer would be embarrassing, we can expect that MCAR is violated. Before using an M-estimator for incomplete data, one can apply some test for the null hypothesis that the missing data are MCAR. A well-known test for MCAR is presented by Little [34]. This test requires the data to be multivariate normally distributed and thus it is not robust. Jamshidian and Jalal [27] propose two hypothesis tests for MCAR. One is based on the normal-distribution assumption and the other is distribution-free. Jamshidian et al. [28] provide an R package based on Jamshidian and Jalal [27]. Further, Listing and Schlittgen [32] present a nonparametric test for MCAR that combines several Wilcoxon rank-sum tests. Since MCAR is a relatively simple independence assumption, we can imagine several other parametric and nonparametric testing procedures [see, e.g., 1, p. 3]. Now, one might ask why not use an ML-estimator right from the start. Indeed, M-estimation requires MCAR, whereas ML-estimation needs only MAR. However, the problem is that Theorem 1 is valid only if we know the parametric family of Xi. Unfortunately, if this is unknown to us, we cannot guarantee that the ML-estimator θ̂ is consistent if the missing data are MAR. Hence, it does not help much to conclude that MAR is weaker than MCAR
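The distinction between MAR and MCAR can be made concrete with a small Monte Carlo experiment. In the hedged Python sketch below, the observed-data mean of the second attribute (a simple M-estimator) is biased when the response depends on an observed, correlated attribute (MAR but not MCAR), yet unbiased under MCAR; the bivariate normal model and the specific response rules are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 100_000, 0.8

x1 = rng.standard_normal(n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # E(X2) = 0

mar = x1 > 0                  # response of X2 depends on the observed X1: MAR, not MCAR
mcar = rng.random(n) < 0.5    # response independent of all data: MCAR

print(x2[mar].mean())   # approx. rho * sqrt(2 / pi) = 0.64, i.e., inconsistent
print(x2[mcar].mean())  # approx. 0, i.e., consistent
```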
and thus to propagate ML-estimation if our distributional assumption is wrong. In this case, the given ML-estimator θ̂ typically turns into an asymptotically inefficient but consistent M-estimator ϑ̂, provided the prerequisites of Theorem 2 are satisfied, which implies that the unobserved data are MCAR. The technical details are elaborated in the subsequent analysis, where we concentrate on the estimation of location and scatter.

4. Estimation of Location and Scatter
Let U be a d-dimensional random vector that is uniformly distributed on the unit hypersphere S^{d−1} = {u ∈ R^d : u⊤u = 1}. A random vector X is said to be elliptically distributed if and only if there exist a vector µ ∈ R^m, a matrix Λ ∈ R^{m×d}, a non-negative random variable V, and a random vector U ∼ S^{d−1} being independent of V such that

$$X = \mu + \Lambda V U.$$

The parameter µ is called the location vector, whereas Σ = ΛΛ⊤ is referred to as the scatter matrix of X. Further, V is the generating variate of X. In our context, we have to guarantee only that X − µ is not concentrated on a linear subspace of R^m. Hence, we assume that rank Λ = m ≤ d and thus Σ > 0. The distribution of an elliptically distributed random vector depends on Λ only through Σ = ΛΛ⊤ and so we may choose d = m without loss of generality. Further, we assume that V has no atom at 0, i.e., P(V = 0) = 0. Now, let X1, . . . , Xn be identically elliptically distributed.
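The stochastic representation X = µ + ΛVU translates directly into a sampler. The following Python sketch draws elliptically distributed vectors for an arbitrary generating variate; the multivariate t example uses V = (mF_{m,ν})^{1/2} as in Section 4.1, while the concrete values of m, ν, and Σ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_elliptical(n, mu, Lambda, gen_variate):
    """Draw n vectors X = mu + Lambda * V * U, with U uniform on the sphere."""
    d = Lambda.shape[1]
    u = rng.standard_normal((n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)      # U ~ uniform on S^{d-1}
    v = gen_variate(n)                                 # generating variate V >= 0
    return mu + (v[:, None] * u) @ Lambda.T

m, nu = 3, 4
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.0]])
Lambda = np.linalg.cholesky(Sigma)                     # Sigma = Lambda Lambda^T

# Multivariate t with nu degrees of freedom: V = sqrt(m * F_{m,nu})
X = sample_elliptical(10_000, np.zeros(m), Lambda,
                      lambda size: np.sqrt(m * rng.f(m, nu, size)))
```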
4.1. Maximum-Likelihood Estimation

In the context of ML-estimation, we assume that X has an absolutely continuous distribution. The density of X is

$$f(x;\, \mu, \Sigma) = \det \Sigma^{-\frac{1}{2}}\; g\big((x - \mu)^\top \Sigma^{-1} (x - \mu)\big),$$

where

$$g: \xi \longmapsto \frac{\Gamma\!\big(\frac{m}{2}\big)}{\pi^{\frac{m}{2}}}\; f_{V^2}(\xi)\; \xi^{-\left(\frac{m}{2} - 1\right)}, \qquad \xi > 0,$$

is the density generator of X and f_{V²} is the density function of V² [49]. For example, V = (mF_{m,ν})^{1/2} is the generating variate of the multivariate t-distribution with ν > 0 degrees of freedom, where F_{m,ν} is an F-distributed random variable with m numerator and ν denominator degrees of freedom. Correspondingly, its density generator reads

$$g: \xi \longmapsto \frac{\Gamma\!\big(\frac{m+\nu}{2}\big)}{\Gamma\!\big(\frac{\nu}{2}\big)\, (\nu\pi)^{\frac{m}{2}}} \left(1 + \frac{\xi}{\nu}\right)^{-\frac{m+\nu}{2}}.$$

Another example is the multivariate power-exponential distribution, whose generating variate is V = G_{α,β}^{1/(2γ)} with parameter γ > 0, where G_{α,β} is a Gamma-distributed random variable with shape α = m/(2γ) and scale β = mΓ(m/(2γ))/Γ((m + 2)/(2γ)). Its density generator is

$$g: \xi \longmapsto \frac{\Gamma\!\big(\frac{m}{2}\big)\, \gamma}{\Gamma\!\big(\frac{m}{2\gamma}\big)\, \beta^{\frac{m}{2\gamma}}\, \pi^{\frac{m}{2}}}\; \exp\!\left(-\frac{\xi^\gamma}{\beta}\right).$$
4.1.1. Complete-Data Case

First of all, we consider the complete-data case. As is shown in Frahm [13, Section 4.1], we have that

$$\phi_\mu(\mu, \Sigma;\, X_i) = \frac{\partial \log f(X_i;\, \mu, \Sigma)}{\partial \mu} = w\big((X_i - \mu)^\top \Sigma^{-1} (X_i - \mu)\big)\, \Sigma^{-1}(X_i - \mu)$$

and

$$\phi_\Sigma(\mu, \Sigma;\, X_i) = \frac{\partial \log f(X_i;\, \mu, \Sigma)}{\partial \Sigma} = A_i - \frac{1}{2}\, \mathrm{diag}\, A_i$$

with

$$A_i = w\big((X_i - \mu)^\top \Sigma^{-1} (X_i - \mu)\big)\, \Sigma^{-1}(X_i - \mu)(X_i - \mu)^\top \Sigma^{-1} - \Sigma^{-1},$$
where w(ξ) = −2 ∂ log g(ξ)/∂ξ. To obtain the corresponding ML-estimators for µ and Σ we have to solve the equation

$$0 = \Phi\big(\hat\mu, \hat\Sigma;\, X\big) = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} \phi_\mu\big(\hat\mu, \hat\Sigma;\, X_i\big) \\[2pt] \phi_\Sigma\big(\hat\mu, \hat\Sigma;\, X_i\big) \end{pmatrix}.$$

This leads us to the usual ML-estimating equations

$$0 = \frac{1}{n} \sum_{i=1}^n w(\xi_i)\, (X_i - \hat\mu) \qquad \text{and} \qquad \hat\Sigma = \frac{1}{n} \sum_{i=1}^n w(\xi_i)\, (X_i - \hat\mu)(X_i - \hat\mu)^\top \tag{10}$$

with ξi = (Xi − µ̂)⊤Σ̂⁻¹(Xi − µ̂), where w can be considered a weight function.
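A common way to solve (10) numerically is a fixed-point iteration that alternates between the location and scatter updates. The Python sketch below is a minimal implementation for complete data under the assumption that the iteration converges; the stopping rule and the starting values are our own choices, and the Student weight in the usage comment anticipates Section 4.2.

```python
import numpy as np

def ml_location_scatter(X, w, n_iter=500, tol=1e-10):
    """Fixed-point iteration for the ML-estimating equations (10).
    X: n x m data matrix (one row per individual); w: weight function."""
    n, m = X.shape
    mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
    for _ in range(n_iter):
        D = X - mu
        xi = np.einsum('ij,jk,ik->i', D, np.linalg.inv(Sigma), D)
        wts = w(xi)
        mu_new = (wts[:, None] * X).sum(axis=0) / wts.sum()
        D = X - mu_new
        Sigma_new = (wts[:, None, None] * (D[:, :, None] * D[:, None, :])).mean(axis=0)
        done = np.abs(mu_new - mu).max() + np.abs(Sigma_new - Sigma).max() < tol
        mu, Sigma = mu_new, Sigma_new
        if done:
            break
    return mu, Sigma

# Example with Student weights (multivariate t, nu = 4, m = X.shape[1]):
# mu_hat, Sigma_hat = ml_location_scatter(X, lambda xi: (X.shape[1] + 4) / (xi + 4))
```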
4.1.2. Incomplete-Data Case

In the incomplete-data case, we can observe only xri for Individual i. We denote the number of attributes that are observable for Individual i by mi. Actually, mi represents a realization of a random variable Mi ∈ {1, . . . , m}. Since the number of observations, mi, may change with each individual, it is not appropriate to choose the same weight function for all i ∈ {1, . . . , n}. Otherwise, the resulting ML-estimators for µ and Σ might be inconsistent. Thus, we have

$$\phi_{\mu, R_i}(\mu, \Sigma;\, X_{R_i}) = \frac{\partial \log f(X_{R_i};\, \mu, \Sigma)}{\partial \mu} = \Big\langle w_i\big((X_{R_i} - \mu_{R_i})^\top \Sigma_{R_i}^{-1} (X_{R_i} - \mu_{R_i})\big)\; \Sigma_{R_i}^{-1}(X_{R_i} - \mu_{R_i}) \Big\rangle$$

and

$$\phi_{\Sigma, R_i}(\mu, \Sigma;\, X_{R_i}) = \frac{\partial \log f(X_{R_i};\, \mu, \Sigma)}{\partial \Sigma} = \Big\langle A_{R_i} - \frac{1}{2}\, \mathrm{diag}\, A_{R_i} \Big\rangle$$

with

$$A_{R_i} = w_i\big((X_{R_i} - \mu_{R_i})^\top \Sigma_{R_i}^{-1} (X_{R_i} - \mu_{R_i})\big)\; \Sigma_{R_i}^{-1}(X_{R_i} - \mu_{R_i})(X_{R_i} - \mu_{R_i})^\top \Sigma_{R_i}^{-1} - \Sigma_{R_i}^{-1}.$$
Here, f(xri; µ, Σ) is the density of Xri, i.e., of the observed part of Xi, which we express in terms of the parameters µ and Σ. The symbols µri and Σri denote those parts of µ and Σ that are relevant for calculating f(xri; µ, Σ), i.e., that are associated with the response of Individual i. For example, if we observe only the first component of Xi, i.e., X1i, we have that µri = µ1 and Σri = Σ11. Since Xri is a subvector of Xi, the entries in µ and Σ that are not associated with any response of Individual i are redundant for f(xri; µ, Σ). Because we express the density of Xri in terms of µ and Σ, we have to use the inflation operator ⟨·⟩. It inflates an array ("·") by inserting zeros at those parts of the array that are associated with the nonresponse of the corresponding individual. In the previous example, ∂ log f(Xri; µ, Σ)/∂µ is an m × 1 vector that contains ∂ log f(X1i; µ, Σ)/∂µ1 in the first position and zeros elsewhere (because ∂ log f(X1i; µ, Σ)/∂µj = 0 for all j ∈ {2, . . . , m}), which can be written as ⟨∂ log f(X1i; µ, Σ)/∂µ1⟩. Similarly, ∂ log f(Xri; µ, Σ)/∂Σ is an m × m matrix that contains ∂ log f(X1i; Σ)/∂Σ11 on the upper left and zeros elsewhere (because ∂ log f(X1i; Σ)/∂Σjk = 0 for all j ∈ {2, . . . , m} or k ∈ {2, . . . , m}), i.e., ⟨∂ log f(X1i; Σ)/∂Σ11⟩. The inflation operator ⟨·⟩ guarantees that the M-estimating equations are properly specified and thus it is an essential instrument of our estimation approach. Suppose that Xi = (Xri, Xr̄i) for the sake of simplicity but without loss of generality. Further, let us assume that Λ is a lower triangular matrix, so that

$$\Lambda = \begin{pmatrix} \Lambda_{11} & 0 \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}$$

with Λ11 ∈ R^{mi×mi} and Σri = Λ11Λ11⊤. According to Cambanis et al. [5], we have the stochastic representation Xri = µri + Λ11Vβ^{1/2}U, where β ∼ Beta(mi/2, (m − mi)/2) and U ∼ S^{mi−1}. Moreover, V, β, and U are mutually independent. This means that Xri does not have the same generating variate as Xi. Hence, the corresponding weight function is wi: ξ ↦ −2 ∂ log gi(ξ)/∂ξ with

$$g_i(\xi) = \frac{\Gamma\!\big(\frac{m_i}{2}\big)}{\pi^{\frac{m_i}{2}}}\; f_{V^2\beta}(\xi)\; \xi^{-\left(\frac{m_i}{2} - 1\right)}, \qquad \xi > 0.$$
Now, the ML-estimators µ̂ and Σ̂ represent the solution of

$$0 = \Phi\big(\hat\mu, \hat\Sigma;\, X_R\big) = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} \phi_{\mu, R_i}\big(\hat\mu, \hat\Sigma;\, X_{R_i}\big) \\[2pt] \phi_{\Sigma, R_i}\big(\hat\mu, \hat\Sigma;\, X_{R_i}\big) \end{pmatrix},$$

which leads to the ML-estimating equations

$$0 = \frac{1}{n} \sum_{i=1}^n w_i(\xi_i)\, \Big\langle \hat\Sigma_{R_i}^{-1}(X_{R_i} - \hat\mu_{R_i}) \Big\rangle \qquad \text{and} \qquad \frac{1}{n} \sum_{i=1}^n \Big\langle \hat\Sigma_{R_i}^{-1} \Big\rangle = \frac{1}{n} \sum_{i=1}^n w_i(\xi_i)\, \Big\langle \hat\Sigma_{R_i}^{-1}(X_{R_i} - \hat\mu_{R_i})(X_{R_i} - \hat\mu_{R_i})^\top \hat\Sigma_{R_i}^{-1} \Big\rangle \tag{11}$$
with ξi = (XRi − µ̂Ri)⊤Σ̂Ri⁻¹(XRi − µ̂Ri), where µ̂Ri and Σ̂Ri are those parts of µ̂ and Σ̂ that are associated with the observations of Individual i. Moreover, we have that wi(ξ) = −2 ∂ log gi(ξ)/∂ξ. Actually, the weight function for Individual i depends on his number of observations, i.e., mi. However, we write "wi" instead of "wmi" just for notational convenience. Note that (11) makes use of the inflation operator ⟨·⟩. If the data are complete, (11) simplifies to (10) and the inflation operator becomes superfluous. This completes the ML-estimation of location and scatter with incomplete and dependent data. The next section proceeds further with M-estimation. We maintain our assumption that Xi is elliptically distributed and still focus on µ and Σ.

4.2. Maximum-Likelihood-Type Estimation

4.2.1. Complete-Data Case

Now, we can drop the assumption that X has an absolutely continuous distribution. If the data are complete, we have that

$$\psi_\mu(\mu, \Sigma;\, X_i) = v\Big(\big((X_i - \mu)^\top \Sigma^{-1} (X_i - \mu)\big)^{\frac{1}{2}}\Big)\; \Sigma^{-1}(X_i - \mu)$$
and ψΣ(µ, Σ; Xi) = Ai − ½ diag Ai with

$$A_i = w\big((X_i - \mu)^\top \Sigma^{-1} (X_i - \mu)\big)\, \Sigma^{-1}(X_i - \mu)(X_i - \mu)^\top \Sigma^{-1} - \Sigma^{-1}.$$

The corresponding score function Ψ is

$$\Psi\big(\hat\mu, \hat\Sigma;\, X\big) = \frac{1}{n} \sum_{i=1}^n \begin{pmatrix} \psi_\mu\big(\hat\mu, \hat\Sigma;\, X_i\big) \\[2pt] \psi_\Sigma\big(\hat\mu, \hat\Sigma;\, X_i\big) \end{pmatrix}$$
and thus we obtain the M-estimating equations

$$0 = \frac{1}{n} \sum_{i=1}^n v\big(\xi_i^{\frac{1}{2}}\big)\, (X_i - \hat\mu) \qquad \text{and} \qquad \hat\Sigma = \frac{1}{n} \sum_{i=1}^n w(\xi_i)\, (X_i - \hat\mu)(X_i - \hat\mu)^\top, \tag{12}$$
where v and w must satisfy a set of regularity conditions [see, e.g., 39]. The following weight functions for Σ can frequently be found in the literature [see, e.g., 29, 50]: the Gaussian weight function w: ξ ↦ 1, Tyler's weight function w: ξ ↦ m/ξ, Student's weight function w: ξ ↦ (m + ν)/(ξ + ν) with ν > 0, and Huber's weight function

$$w: \xi \longmapsto \begin{cases} \kappa, & \xi < \lambda, \\ \kappa\lambda/\xi, & \xi \geq \lambda, \end{cases}$$

where the parameters κ > 0 and λ > 0 are such that E{w(χ²m) χ²m} = m. See Dümbgen et al. [11] for a comprehensive survey on M-estimation of scatter.
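For reference, the four scatter weight functions just listed can be written down directly; the Python sketch below implements them for array-valued arguments (complete-data case, with m denoting the dimension).

```python
import numpy as np

def w_gauss(xi):
    return np.ones_like(xi)                              # Gaussian: w(xi) = 1

def w_tyler(xi, m):
    return m / xi                                        # Tyler: w(xi) = m / xi

def w_student(xi, m, nu):
    return (m + nu) / (xi + nu)                          # Student: w(xi) = (m + nu)/(xi + nu)

def w_huber(xi, kappa, lam):
    return np.where(xi < lam, kappa, kappa * lam / xi)   # Huber, piecewise
```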
In the context of M-estimation, the distribution of the generating variate V is unknown. Since ΛVU = (τΛ)(V/τ)U for all τ > 0, we have a well-known identification problem regarding the scatter matrix Σ. This problem is typically solved by the choice of the weight function w. Another alternative is to require σ²(Σ) = 1, where σ² represents a certain scale function. This point will be detailed below. In fact, the population version of the second part of (12) reads

$$\Sigma = E\Big\{ w\big((X - \mu)^\top \Sigma^{-1} (X - \mu)\big)\, (X - \mu)(X - \mu)^\top \Big\}.$$

From (X − µ)⊤Σ⁻¹(X − µ) = V², (X − µ)(X − µ)⊤ = V²ΛUU⊤Λ⊤, and E(UU⊤) = Im/m it follows that

$$E\big\{\varphi(V^2)\big\} = m, \qquad \varphi(\xi) = w(\xi)\, \xi. \tag{13}$$

Hence, by applying the second M-estimating equation in (12), we implicitly assume that the generating variate V satisfies the scaling condition (13). Condition C of Maronna [39] usually guarantees that there is no positive number τ ≠ 1 such that E{φ(V²/τ²)} = m. Otherwise, we could substitute the generating variate V with V/τ and Λ with τΛ without changing the distribution of X. Maronna's Condition C fails to solve the identification problem if there exists a threshold ζ ≥ 0 such that φ(ξ) is constant for all ξ > ζ. For example, Tyler's weight function implies that φ(ξ) = m for all ξ > 0. In this case we have to fix the scale of Σ̂, i.e., to normalize Σ̂, which can be done, e.g., by requiring that tr Σ̂ = m or det Σ̂ = 1 [14, 41].
4.2.2. Incomplete-Data Case

Things become more complicated in the case of incomplete data, where the M-estimating equations are similar to (11), i.e.,

$$0 = \frac{1}{n} \sum_{i=1}^n v_i\big(\xi_i^{\frac{1}{2}}\big)\, \Big\langle \hat\Sigma_{R_i}^{-1}(X_{R_i} - \hat\mu_{R_i}) \Big\rangle \qquad \text{and} \qquad \frac{1}{n} \sum_{i=1}^n \Big\langle \hat\Sigma_{R_i}^{-1} \Big\rangle = \frac{1}{n} \sum_{i=1}^n w_i(\xi_i)\, \Big\langle \hat\Sigma_{R_i}^{-1}(X_{R_i} - \hat\mu_{R_i})(X_{R_i} - \hat\mu_{R_i})^\top \hat\Sigma_{R_i}^{-1} \Big\rangle. \tag{14}$$

A keynote of this work is that the choice of the weight functions w1, . . . , wn is not arbitrary. More precisely, we have to guarantee that the basic condition expressed by (6) is satisfied. Thus, we have to guarantee that

$$E\Big\{ v_i\Big(\big((X_{r_i} - \mu_{r_i})^\top \Sigma_{r_i}^{-1} (X_{r_i} - \mu_{r_i})\big)^{\frac{1}{2}}\Big)\; \Sigma_{r_i}^{-1}(X_{r_i} - \mu_{r_i}) \Big\} = 0 \tag{15}$$

and

$$E\Big\{ w_i\big((X_{r_i} - \mu_{r_i})^\top \Sigma_{r_i}^{-1} (X_{r_i} - \mu_{r_i})\big)\; \Sigma_{r_i}^{-1}(X_{r_i} - \mu_{r_i})(X_{r_i} - \mu_{r_i})^\top \Sigma_{r_i}^{-1} - \Sigma_{r_i}^{-1} \Big\} = 0 \tag{16}$$

for every fixed response ri that is possible for Individual i. To the best of our knowledge, this issue has not yet been considered in the literature. Now, we have that Xri − µri = Λ11Vβ^{1/2}U, (Xri − µri)⊤Σri⁻¹(Xri − µri) = V²β, and

$$(X_{r_i} - \mu_{r_i})(X_{r_i} - \mu_{r_i})^\top = V^2 \beta\, \Lambda_{11} U U^\top \Lambda_{11}^\top$$

with E(U) = 0 and E(UU⊤) = I_{mi}/mi, where V, β ∼ Beta(mi/2, (m − mi)/2), and U ∼ S^{mi−1} are mutually independent. Hence, the condition expressed by (15) is always satisfied, whereas (16) leads us to the critical scaling condition

$$E\big\{\varphi_i(V^2\beta)\big\} = m_i, \qquad \varphi_i(\xi) = w_i(\xi)\, \xi. \tag{17}$$

There exist two well-known weight functions that satisfy this scaling condition implicitly: the Gaussian weight function and Tyler's weight function. For the Gaussian weight function we have that φi(ξ) = ξ. From (13) we know that E(V²) = E{φ(V²)} = m and thus we obtain

$$E\big\{\varphi_i(V^2\beta)\big\} = E\big(V^2\beta\big) = E(V^2)\, E(\beta) = m \cdot \frac{m_i}{m} = m_i.$$

Moreover, Tyler's weight function leads to φi(ξ) = mi and thus E{φi(V²β)} = mi. Hence, in our context, these weight functions can be considered canonical.
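The scaling condition (17) is easy to check by Monte Carlo for a given generating variate. The hedged Python sketch below does this for the Gaussian weight φi(ξ) = ξ, using a t-type generating variate that is rescaled numerically so that E(V²) = m holds; the concrete values of m, mi, and ν are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
m, m_i, nu = 5, 3, 7

v2 = m * rng.f(m, nu, size=1_000_000)          # V^2 = m F_{m,nu} (multivariate t)
v2 *= m / v2.mean()                            # enforce E(V^2) = m numerically
beta = rng.beta(m_i / 2, (m - m_i) / 2, size=v2.size)

# Gaussian weight: phi_i(xi) = xi, so E{phi_i(V^2 beta)} should be close to m_i = 3.
print(np.mean(v2 * beta))
```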
If the location vector µ is known, the Gauss- and Tyler-type M-estimators for scatter can always be considered ML-estimators, irrespective of whether the data are complete or incomplete [16]: The Gauss-type M-estimator is an (observed-data) ML-estimator under the assumption that the data are multivariate normally distributed, whereas the Tyler-type M-estimator maximizes the likelihood function after projecting the observed data of each individual onto the unit hypersphere, in which case we obtain an angular central Gaussian distribution [51]. Further, when applying the Tyler-type M-estimator it has to be assumed only that the data are generalized elliptically distributed [13]. The finite-sample distribution of the Tyler-type M-estimator does not depend on the generating variate V at all if the data are generalized elliptically distributed and the location vector is known. For this reason, the Tyler-type M-estimator performs remarkably well compared with the Gauss-type M-estimator if the data are heavy tailed. In this work, we generalize the insights given by Frahm and Jaekel [16] by changing from ML-estimation to M-estimation of location and scatter. 5. The Power Weight Functions
Pr e-
5.1. Theoretical Properties

In order to obtain M-estimators for location and scatter that are consistent in the case of incomplete elliptically distributed data, we construct a class of weight functions that satisfy the critical scaling condition (17). The Gaussian and Tyler's weight function represent two extreme elements of this class. Hence, this work closes a gap left open by Frahm and Jaekel [16]. B(a, b) denotes Euler's beta function with parameters a, b > 0. We define B(a, 0)/B(b, 0) = 1 for all a, b > 0. It can easily be seen that B(a, x)/B(b, x) → 1 as x ↘ 0.

Theorem 3. Consider any real number 0 ≤ α ≤ 1 and suppose that

$$E\left\{\left(\frac{V^2}{m}\right)^{-\alpha} V^2\right\} = m.$$

Then, for every d ∈ {1, . . . , m}, we have that

$$\frac{B\big(\frac{d}{2} + 1,\, \frac{m-d}{2}\big)}{B\big(\frac{d}{2} + 1 - \alpha,\, \frac{m-d}{2}\big)}\; E\left\{\left(\frac{V^2\beta}{m}\right)^{-\alpha} V^2\beta\right\} = d, \tag{18}$$

where V and β ∼ Beta(d/2, (m − d)/2) are stochastically independent.
Thus, a natural weight function for scatter is

$$w_i: \xi \longmapsto \frac{B\big(\frac{m_i}{2} + 1,\, \frac{m - m_i}{2}\big)}{B\big(\frac{m_i}{2} + 1 - \alpha,\, \frac{m - m_i}{2}\big)} \left(\frac{\xi}{m}\right)^{-\alpha}, \qquad 0 \leq \alpha \leq 1.$$
In the complete-data case, we simply have that wi(ξ) = (ξ/m)^{−α}. Hence, the scaling condition expressed by (18) is an immediate consequence of (13). The parameter α can be considered a tail index. For α = 0 we obtain the Gaussian weight wi(ξ) = 1. Moreover, since

$$B\Big(\frac{m_i}{2} + 1,\, \frac{m - m_i}{2}\Big) = \frac{m_i}{m}\, B\Big(\frac{m_i}{2},\, \frac{m - m_i}{2}\Big),$$

Tyler's weight wi(ξ) = mi/ξ can readily be obtained by setting α = 1. Since (15) is always satisfied, the choice of the weight function for µ is quite arbitrary, but it is tempting to choose vi: ξ^{1/2} ↦ ξ^{−α/2} (0 ≤ α ≤ 1). If the data are complete, α = 0 leads to the empirical mean vector, whereas for α = 1 we obtain the M-estimator for location proposed by Hettmansperger and Randles [24]. Similarly, if we choose α = 0, the resulting M-estimator for Σ corresponds to the empirical covariance matrix, whereas for α = 1 we obtain Tyler's M-estimator for scatter [50]. In the following, vi and wi will be referred to as power weight functions and each M-estimator that is based on a power weight function will be called a power M-estimator.
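The power weight functions can be implemented in a few lines. The Python sketch below follows the definitions above, including the convention B(a, 0)/B(b, 0) = 1 for the complete-data case; the function names are our own.

```python
import numpy as np
from scipy.special import beta

def w_power(xi, m_i, m, alpha):
    """Power weight function for scatter with tail index alpha in [0, 1];
    m_i of m components observed. For m_i = m the beta-ratio equals 1
    by the convention B(a, 0)/B(b, 0) = 1."""
    if m_i == m:
        c = 1.0
    else:
        c = beta(m_i / 2 + 1, (m - m_i) / 2) / beta(m_i / 2 + 1 - alpha, (m - m_i) / 2)
    return c * (xi / m) ** (-alpha)

def v_power(xi_sqrt, alpha):
    """Power weight function for location: v(xi^{1/2}) = xi^{-alpha/2}."""
    return xi_sqrt ** (-alpha)

# Sanity checks: alpha = 0 gives the Gaussian weight 1; for complete data,
# alpha = 1 gives Tyler's weight m / xi.
print(w_power(np.array([2.0]), 5, 5, 0.0), w_power(np.array([2.0]), 5, 5, 1.0))
```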
5.2. Asymptotic Distributions

The following theorems establish the joint asymptotic distribution of n^{1/2}(µ̂ − µ) and n^{1/2}(Σ̂ − Σ) given that the M-estimators µ̂ and Σ̂ are based on the power weight functions with common tail index α. It is straightforward to obtain similar results if the tail indices of vi and wi differ from one another. If the data are complete and independent, we can apply the standard results given by Huber [26] and Maronna [39]. Alternatively, the asymptotic covariance matrices can be derived by Theorem 2, which can be used even if the data are incomplete and dependent. We concentrate on the case of complete and independent data in order to obtain closed-form expressions. The m² × m² identity matrix is symbolized by I_{m²}. Let e_{ij} be the m × m matrix with 1 in the ijth position and zeros elsewhere. The m² × m² commutation matrix is defined as $K_{m^2} = \sum_{i,j=1}^m e_{ij} \otimes e_{ji}$, where ⊗ denotes the Kronecker product. For any m × m random matrix M, the m²-dimensional vector vec M is obtained by stacking the columns of M on top of each other, and n^{1/2}M → N_{m×m}(0, C) means that n^{1/2} vec M is asymptotically normally distributed with asymptotic covariance matrix C ∈ R^{m²×m²}.
Theorem 4. Let µ̂ and Σ̂ be the power M-estimators for location and scatter with tail index 0 ≤ α < 1. Suppose that X1, . . . , Xn are complete, independent, and identically elliptically distributed. Further, let the assumptions A1–A3 given in Section 2 be satisfied. Then we have that

$$n^{1/2}\, \big(\hat\mu - \mu\big) \longrightarrow \mathcal{N}_m\!\left(0,\; \frac{m\, E\big\{V^{2(1-\alpha)}\big\}}{(m - \alpha)^2\, E^2\big\{V^{-\alpha}\big\}}\; \Sigma\right), \qquad n \to \infty,$$

and n^{1/2}(Σ̂ − Σ) → N_{m×m}(0, C) as n → ∞ with

$$C = \gamma_1\, \big(I_{m^2} + K_{m^2}\big)\, (\Sigma \otimes \Sigma) + \gamma_2\, (\mathrm{vec}\, \Sigma)(\mathrm{vec}\, \Sigma)^\top.$$
The numbers γ1 and γ2 are

$$\gamma_1 = \frac{(m + 2)^2\, \tau_1}{(m + 2\tau_2)^2} \qquad \text{and} \qquad \gamma_2 = \frac{(\tau_1 - 1) - 2\tau_1 (\tau_2 - 1)\, \big\{m + (m + 4)\, \tau_2\big\} / (m + 2\tau_2)^2}{\tau_2^2}$$

with

$$\tau_1 = \frac{m^{2\alpha}\, E\big\{V^{4(1-\alpha)}\big\}}{m(m + 2)} \qquad \text{and} \qquad \tau_2 = \frac{(1 - \alpha)\, E\big\{V^{2(1-\alpha)}\big\}}{m^{1-\alpha}}.$$

Moreover, µ̂ and Σ̂ are asymptotically independent.
Theorem 4 does not cover the limiting case α = 1 because Tyler's M-estimator for scatter has to be normalized. Let σ² be an appropriate scale function. This means that σ² is a differentiable homogeneous function, i.e., we have that σ²(τΣ) = τσ²(Σ) > 0 for all τ > 0 and every positive definite m × m matrix Σ, such that σ²(Im) = 1. Each positive definite m × m matrix Ω with σ²(Ω) = 1 is said to be a shape matrix. Now, Tyler's M-estimator can be normalized by Ω̂ = Σ̂/σ²(Σ̂), which represents an estimator for the shape matrix Ω = Σ/σ²(Σ) [14, 41]. Apparently, estimating the shape matrix makes sense only if m > 1, i.e., for multivariate data. In the following, Υ = I_{m²} − (vec Ω)J_{σ²} symbolizes an m² × m² matrix, where J_{σ²} = ∂σ²(Ω)/∂(vec Ω)⊤ denotes the Jacobian of the chosen scale function σ² [14].

Theorem 5. Let µ̂ and Σ̂ be the power M-estimators for location and scatter with tail index 0 ≤ α ≤ 1. Further, let Ω̂ = Σ̂/σ²(Σ̂) be the corresponding shape matrix estimator. Suppose that X1, . . . , Xn are complete, independent, and identically elliptically distributed with m > 1. Further, let the assumptions A1–A3 given in Section 2 be satisfied. Then we have that

$$n^{1/2}\, \big(\hat\mu - \mu\big) \longrightarrow \mathcal{N}_m\!\left(0,\; \frac{m\, E\big\{V^{2(1-\alpha)}\big\}}{(m - \alpha)^2\, E^2\big\{V^{-\alpha}\big\}}\; \Sigma\right), \qquad n \to \infty,$$

and n^{1/2}(Ω̂ − Ω) → N_{m×m}(0, C) as n → ∞ with

$$C = \frac{m + 2}{m} \cdot \frac{E\big\{V^{4(1-\alpha)}\big\}}{\big\{m^{1-\alpha} + 2(1 - \alpha)\, E\big\{V^{2(1-\alpha)}\big\}/m\big\}^2}\; \Upsilon\, \big(I_{m^2} + K_{m^2}\big)\, (\Omega \otimes \Omega)\, \Upsilon^\top.$$

Moreover, µ̂ and Ω̂ are asymptotically independent.
According to Theorem 4 and Theorem 5, the asymptotic normality of n^{1/2}(Σ̂ − Σ) and n^{1/2}(Ω̂ − Ω) requires the generating variate V to have a finite moment of order 4(1 − α). For this reason we recommend choosing a sufficiently high tail index α ∈ (0, 1] if the data are heavy tailed. Interestingly, if α is close to 1, the data must not be too heavily concentrated around µ. To be more precise, for the asymptotic normality of the power M-estimator for location, we must guarantee that E{V^{−α}} < ∞. In the case of α = 1, this phenomenon was already observed by Tyler [50]. Conversely, if the data are light tailed, the tail index should be sufficiently low. How do the power M-estimators behave if m, i.e., the number of dimensions, is high? It is clear that this question cannot be answered in a meaningful way if we do not fix µ and Σ. For example, we could assume that µ = 0 and Σ = Im. Moreover, at least if we analyze the estimator µ̂ for the location vector and the estimator Σ̂ for the scatter matrix, it makes sense to assume that E(V²) = m, which guarantees that Cov(X) = Σ = Im for all m ∈ ℕ. For example, in the case of α = 0 the asymptotic covariance matrix of n^{1/2}(µ̂ − µ) is just Im. Hence, the asymptotic variances do not depend on m at all if we use the Gauss-type M-estimator for µ, which is a standard result of multivariate analysis. By contrast, if we use the Tyler-type M-estimator, i.e., α = 1, the asymptotic covariance matrix of n^{1/2}(µ̂ − µ) is

$$\frac{m}{(m - 1)^2\, E^2\big\{V^{-1}\big\}}\; I_m.$$
We conclude that the question of whether the asymptotic variances increase or decrease with m essentially depends on V. To be more precise, in the case that E²{V⁻¹} = o(m⁻¹), the asymptotic variances increase with m. Otherwise it can happen that the asymptotic variances of n^{1/2}(µ̂ − µ) decrease with the number of dimensions. The Tyler-type M-estimator for µ can be considered asymptotically more efficient than the corresponding Gauss-type M-estimator if and only if

$$E^2\big\{V^{-1}\big\} > \frac{m}{(m - 1)^2}.$$

The asymptotic covariance matrix of n^{1/2}(Σ̂ − Σ) is provided by Theorem 4. It essentially depends on the numbers γ1 and γ2, which are influenced by m in a quite complicated way, since not only m but also V matters. We may assume that Ω = Σ and thus Ω = Im. The equality Ω = Σ is satisfied whenever the scale of Σ equals 1. In our case, this holds true for the scale function σ²: Σ ↦ Σ11 because σ²(Im) = 1 for all m ∈ ℕ. This enables us to analyze the asymptotic covariance matrix of n^{1/2}(Ω̂ − Ω), which is provided by Theorem 5, in a very convenient form. Note that J_{σ²} corresponds to an m²-dimensional row vector with 1 in the first position and zeros elsewhere [14]. Now, the asymptotic covariance matrix is essentially driven by the constant

$$\frac{m + 2}{m} \cdot \frac{E\big\{V^{4(1-\alpha)}\big\}}{\big\{m^{1-\alpha} + 2(1 - \alpha)\, E\big\{V^{2(1-\alpha)}\big\}/m\big\}^2}.$$

For example, in the case of α = 0 this constant increases with m if E{V⁴} increases faster to infinity than m². A well-known counterexample is the multivariate normal distribution, in which case we have that V = (χ²m)^{1/2} and so the constant equals 1. Hence, in that special case, the number of dimensions has no influence at all. By contrast, for the Tyler-type M-estimator, i.e., α = 1, the constant equals (m + 2)/m and so the asymptotic covariance matrix essentially depends on the number of dimensions. To be more precise, the constant decreases and tends to 1 as m → ∞. This explains why the Tyler-type M-estimator for the shape matrix always, i.e., for any arbitrary generating variate V, has almost the same asymptotic efficiency as the Gauss-type M-estimator for Ω in the case of multivariate normally distributed data, i.e., for the generating variate V = (χ²m)^{1/2}, provided that the number of dimensions, m, is high enough. This phenomenon will be illustrated in the next section. Note that all previous arguments hold true only if the effective sample size n/m is sufficiently high. Otherwise, the asymptotic normality of the M-estimators is questionable. More precisely, if n and m tend to infinity but the ratio n/m is bounded above, it can happen that the asymptotic results obtained so far no longer hold. In fact, standard asymptotic theory requires m to be fixed. These phenomena are thoroughly discussed in random matrix theory [see, e.g., 15, 17], which goes far beyond the scope of this work. Note also that we do not consider the super-high-dimensional case, in which the effective sample size n/m is bounded above by 1. In that case our M-estimators do not exist at all. The power M-estimates can be computed by the fixed-point algorithm developed by Frahm and Jaekel [16]. This algorithm turns out to be very fast and reliable even if the number of dimensions is high. Since every ML-estimator for location and scatter is an M-estimator, of course, the same estimation procedure can be applied in order to compute the ML-estimator. However, in quite simple cases (for example, if the data are multivariate normally distributed), it could be more efficient to apply some standard algorithm, e.g., the EM-algorithm [9]. For example, in the case of α = 0, i.e., when computing the Gauss-type M-estimates, we recommend an algorithm based on the sweep operator [45, Chapter 6.5]. This leads to exact solutions and is even faster than our fixed-point algorithm. Dümbgen et al. [10] propose an alternative procedure based on a Taylor expansion, which could be useful also in the case of missing data.

5.3. Graphical Illustration
Now, we want to illustrate the impact of the tail index α. For this purpose, we simulate two samples of n = 1000 multivariate t-distributed data with m = 100 dimensions. In the first sample the number of degrees of freedom of the t-distribution amounts to ν = 2, which means that the data are heavy tailed. By contrast, in the second sample the data are multivariate normally distributed (ν = ∞). The location vector corresponds to µ = 0 and the shape matrix is given by the Toeplitz matrix

$$\Omega = \begin{pmatrix} 1 & 0.99 & \cdots & 0.02 & 0.01 \\ 0.99 & 1 & \ddots & & 0.02 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0.02 & & \ddots & 1 & 0.99 \\ 0.01 & 0.02 & \cdots & 0.99 & 1 \end{pmatrix}.$$
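For reproducibility, such a Toeplitz shape matrix can be generated in one line; the sketch below is an assumption about how the matrix is built, inferred from its first row 1, 0.99, . . . , 0.01.

```python
import numpy as np
from scipy.linalg import toeplitz

m = 100
Omega = toeplitz(np.linspace(1.0, 0.01, m))   # first row: 1, 0.99, ..., 0.02, 0.01
```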
Fig. 1: Responses (black lines) and non-responses (white space) in a sample with m = 100 dimensions and size n = 1000. There are 36727 missing values, i.e., about 37% of the entire sample cannot be observed.
Fig. 1 indicates which part of both samples is observed (black) and which part is missing (white). We can see that this is a monotone missingness pattern, which typically occurs when analyzing time-series or panel data. The unobserved data are MCAR. The different power M-estimates for the location vector µ are depicted in Fig. 2. If the data are heavy tailed, a higher tail index is preferable. By contrast, if the data are not heavy tailed, a lower tail index leads to an outcome that is slightly better, but the differences are not really substantial. In order to make the power M-estimates for scatter with different choices of α comparable, we need to normalize Σ̂. Here, we choose the scale function σ²: Σ ↦ tr Σ/m and consider the shape matrix estimator Ω̂ = Σ̂/σ²(Σ̂). Due to this normalization, we have that tr Ω̂ = m. Note that the true shape matrix Ω already satisfies the condition tr Ω = m by construction. The results for the multivariate t-distributed data are depicted in Fig. 3, whereas the results for the multivariate normally distributed data are given by Fig. 4. The Gauss-type M-estimate on the upper right of Fig. 3
Fig. 2: Power M-estimates of µ for multivariate t-distributed data (left) and multivariate normally distributed data (right): the Gauss-type M-estimates (dashes), the power M-estimates with α = 0.5 (dots and dashes), and the Tyler-type M-estimates (dots). The horizontal lines indicate the true location vector.
Fig. 3: Power M-estimates of Ω for multivariate t-distributed data: The Gauss-type M-estimate (upper right), the power M-estimate with α = 0.5 (lower left), and the Tyler-type M-estimate (lower right). The upper left is the true shape matrix. Violet cells indicate small numbers and red cells represent large numbers.
Fig. 4: Power M-estimates of Ω for multivariate normally distributed data: The Gauss-type M-estimate (upper right), the power M-estimate with α = 0.5 (lower left), and the Tyler-type M-estimate (lower right). The upper left is the true shape matrix. Violet cells indicate small numbers and red cells represent large numbers.
is heavily distorted. This is due to the fact that ν = 2. That is, the number of degrees of freedom of the multivariate t-distribution is very low. Hence, we can expect to get a better estimate by choosing a higher tail index α for the power weight functions. If we choose the common tail index α = 0.5, the result looks a little better than the Gauss-type M-estimate (see the lower left of Fig. 3). However, the Tyler-type M-estimator, i.e., the power M-estimator with α = 1, clearly provides the best result (see the lower right of Fig. 3). By contrast, if the data are not heavy tailed but multivariate normally distributed (see Fig. 4), Theorem 1 implies that the Gaussian weight functions are superior because, in this case, they lead to an ML-estimator. Does this mean that the power M-estimates with α > 0 are much worse? Interestingly, the answer is "No"! As we can see on the right-hand side of Fig. 2, as well as throughout Fig. 4, the power M-estimates with tail index α > 0 are almost indistinguishable from the Gauss-type M-estimates. We already provided an analytical argument for this phenomenon in the previous section. Indeed, the number of dimensions in our illustration is m = 100 and thus quite high, which explains why the power M-estimators do not perform essentially differently compared to the Gauss-type M-estimator.

6. Simulation Study
The graphical illustration of the power M-estimates does not say much about which M-estimator should be favored in different real-life situations. In order to answer this question we have to conduct an extensive simulation study.

6.1. Design of the Study
To be able to compare the power M-estimators for scatter with each other, we apply the canonical scale function σ : Σ ↦ (det Σ)^{1/m} [41]. Hence, we focus on the M-estimator Ω̂ = Σ̂/(det Σ̂)^{1/m} for the shape matrix Ω = Σ/(det Σ)^{1/m}. Apart from its theoretical advantages, which have been thoroughly investigated by Paindaveine [41], the canonical scale function turns out to be convenient from a numerical perspective: It guarantees that det Ω̂ = 1 and thus Ω̂ can never be singular, even if the number of dimensions, m, is very high. In our simulation study, we always consider µ unknown when estimating Ω. This is in contrast to the simulation study conducted by Frahm and Jaekel [16], where the location vector is considered known.
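A corresponding sketch for the canonical scale function, again hypothetical and using the log-determinant to avoid numerical under- or overflow in high dimensions, might read:

```python
# Hedged sketch: canonical (determinant-based) shape normalization,
# sigma(Sigma) = det(Sigma)^(1/m), so that det(Omega_hat) = 1.
import numpy as np

def shape_det(Sigma_hat: np.ndarray) -> np.ndarray:
    sign, logdet = np.linalg.slogdet(Sigma_hat)  # stable log-determinant
    if sign <= 0:
        raise ValueError("Sigma_hat must be positive definite")
    m = Sigma_hat.shape[0]
    return Sigma_hat / np.exp(logdet / m)
```

Working on the log scale is what keeps this normalization harmless even for large m, since det Σ̂ itself may underflow to zero in double precision.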
The power M-estimators for µ and Ω are symbolized by µ̂_α and Ω̂_α, respectively, where α represents the tail index of the power weight functions v_i and w_i. In the simulation study, we take each tail index α ∈ {0, 0.25, 0.50, 0.75, 1} into account. In the limiting cases α = 0 and α = 1 we obtain the Gauss-type and the Tyler-type M-estimators for location and shape. Those power M-estimators are symbolized by µ̂_G, Ω̂_G (α = 0) and µ̂_T, Ω̂_T (α = 1), respectively. If the data are complete, µ̂_G is the empirical mean vector. Moreover, in this case, we have that Ω̂_G = Σ̂_G/(det Σ̂_G)^{1/m}, where Σ̂_G represents the empirical covariance matrix. Analogously, µ̂_T is the M-estimator for location proposed by Hettmansperger and Randles [24], whereas Σ̂_T is Tyler's (normalized) M-estimator for scatter [50]. Actually, Tyler [50] uses the scale function σ² : Σ ↦ tr Σ/m for normalization, but this does not alter the overall conclusion of our simulation study. A similar study can be found in Frahm and Jaekel [16], where only the Gauss- and Tyler-type M-estimators for shape are investigated. Hence, the study presented here can be considered a substantial generalization of the results documented by Frahm and Jaekel [16].

For simulating heavy-tailed data we use the multivariate t-distribution with ν > 0 degrees of freedom. The multivariate t-distribution converges to the multivariate normal distribution as ν → ∞, which is indicated by "ν = ∞." By contrast, for simulating light-tailed data, we apply the multivariate power-exponential distribution with parameter γ > 0. In the case of γ = 1 it coincides with the multivariate normal distribution; for γ > 1 its tails are lighter and for 0 < γ < 1 they are heavier than the tails of the multivariate normal distribution. We consider scenarios in which the data are multivariate t-distributed with ν ∈ {1, 2, 3, 4, ∞} degrees of freedom (t₁, t₂, t₃, t₄, t∞) and multivariate power-exponentially distributed with γ = 5 (p₅).

Our working hypothesis is that the elements of the process (Xᵢ, Rᵢ) are independent and identically distributed (i.i.d.), but Xᵢ and Rᵢ may still depend on one another. Hence, we focus on the i.i.d. case in our simulation study. The reason is that, according to Theorem 2, serial or spatial dependence only blows up the asymptotic covariance matrices of the M-estimators, but it has no impact on consistency and asymptotic normality. To be more precise, we have that
\[
n^{\frac{1}{2}}\bigl(\hat\vartheta - \vartheta\bigr) \longrightarrow \mathcal{N}_q\bigl(0,\, H_\vartheta^{-1} F_\vartheta H_\vartheta^{-1}\bigr), \qquad n \longrightarrow \infty.
\]
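To make the estimators of this subsection concrete, the following sketch implements the complete-data fixed-point iteration for the power M-estimators with the power weights v(d) = d^{−α} for location and w(d²) = (d²/m)^{−α} for scatter. It is our own illustration, not the authors' implementation; the scatter iterate is renormalized to trace m in every step, which only fixes the scale of the shape matrix and stabilizes the iteration for α close to 1:

```python
# Hedged sketch of the power M-estimation fixed point for complete data.
# alpha = 0 reproduces the Gauss-type, alpha = 1 the Tyler-type estimators.
import numpy as np

def power_m_estimate(X, alpha, n_iter=500, tol=1e-9):
    """Return (mu_hat, S_hat) with tr(S_hat) = m, i.e., S_hat estimates the
    shape matrix; for alpha = 1 the scale is not identifiable anyway."""
    n, m = X.shape
    mu, S = X.mean(axis=0), np.eye(m)
    for _ in range(n_iter):
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
        d2 = np.maximum(d2, 1e-12)              # guard against exact zeros
        v = d2 ** (-alpha / 2)                  # location weights v(d)
        w = (d2 / m) ** (-alpha)                # scatter weights w(d^2)
        mu_new = (v[:, None] * X).sum(axis=0) / v.sum()
        S_new = (w[:, None, None] * diff[:, :, None] * diff[:, None, :]).mean(axis=0)
        S_new *= m / np.trace(S_new)            # fix the scale: tr = m
        if np.abs(mu_new - mu).max() + np.abs(S_new - S).max() < tol:
            mu, S = mu_new, S_new
            break
        mu, S = mu_new, S_new
    return mu, S
```

For incomplete data, the estimating equations additionally run over the observed components of each observation only; this bookkeeping is omitted here for brevity.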
Recall that the dependence structure of the data determines only the matrix F_ϑ and thus it has no influence on the matrix H_ϑ (see our explanations at the end of Section 2). In particular, since F_ϑ is the "meat" of the sandwich expression H_ϑ⁻¹F_ϑH_ϑ⁻¹, the impact of serial or spatial dependence is linear but not quadratic. For this reason, our comparative results should not differ substantially for dependent data, e.g., if (Xᵢ, Rᵢ) followed an autoregressive process or any other process that has an impact on F_ϑ.

The number of dimensions amounts to m = 5 and the true parameters of location and scatter are µ = 0 and Σ = I₅, respectively. Besides the six aforementioned scenarios, t₁, t₂, t₃, t₄, t∞, p₅, we consider three additional scenarios in which the data are contaminated. More precisely, we substitute cn (0 < c < 1) of the multivariate normally distributed observations with the outlier (10, …, 10) ∈ ℝ⁵. We consider the contamination rates c = 0.01, 0.05, 0.10. We also distinguish between a small sample (n = 100), a moderate sample (n = 1000), and a large sample (n = 10000). The number of Monte Carlo replications is always 10000, irrespective of the sample size n. As is done in Frahm and Jaekel [16], the estimators are evaluated by their absolute bias (AB), i.e.,
\[
\mathrm{AB}\bigl(\hat\mu\bigr) = \frac{1}{m}\, \mathbf{1}^{\top} \bigl|E(\hat\mu - \mu)\bigr|
\qquad \text{and} \qquad
\mathrm{AB}\bigl(\hat\Omega\bigr) = \frac{1}{m^{2}}\, \mathbf{1}^{\top} \bigl|E(\hat\Omega - \Omega)\bigr|\, \mathbf{1},
\]
where |A| is the matrix of absolute values of A and 1 is an appropriate vector of ones. The absolute bias can be relatively large for small or moderate sample sizes, although it is supposed to vanish in large samples. Our second quantity of interest is the mean squared error (MSE), i.e., the average mean squared error of all components:
\[
\mathrm{MSE}\bigl(\hat\mu\bigr) = \frac{E\bigl[(\hat\mu - \mu)^{\top}(\hat\mu - \mu)\bigr]}{m}
\qquad \text{and} \qquad
\mathrm{MSE}\bigl(\hat\Omega\bigr) = \frac{E\bigl[\operatorname{tr}\bigl((\hat\Omega - \Omega)(\hat\Omega - \Omega)^{\top}\bigr)\bigr]}{m^{2}}.
\]
Finally, we investigate the relative efficiency (RE) of the Tyler-type M-estimators with respect to the Gauss-type M-estimators by
\[
\mathrm{RE}\bigl(\hat\mu_T\bigr) = \frac{\mathrm{MSE}(\hat\mu_G)}{\mathrm{MSE}(\hat\mu_T)}
\qquad \text{and} \qquad
\mathrm{RE}\bigl(\hat\Omega_T\bigr) = \frac{\mathrm{MSE}(\hat\Omega_G)}{\mathrm{MSE}(\hat\Omega_T)}.
\]
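In code, these three criteria can be computed from the Monte Carlo draws as follows (a hypothetical sketch; the array shapes are our own convention):

```python
# Hedged sketch of the evaluation criteria; mu_hats has shape (K, m) and
# Omega_hats has shape (K, m, m), where K is the number of Monte Carlo
# replications (here K = 10000).
import numpy as np

def absolute_bias(mu_hats, mu, Omega_hats, Omega):
    m = mu.shape[0]
    ab_mu = np.abs((mu_hats - mu).mean(axis=0)).sum() / m
    ab_Omega = np.abs((Omega_hats - Omega).mean(axis=0)).sum() / m**2
    return ab_mu, ab_Omega

def mean_squared_error(mu_hats, mu, Omega_hats, Omega):
    m = mu.shape[0]
    mse_mu = ((mu_hats - mu) ** 2).sum(axis=1).mean() / m
    mse_Omega = ((Omega_hats - Omega) ** 2).sum(axis=(1, 2)).mean() / m**2
    return mse_mu, mse_Omega

# Relative efficiency of the Tyler-type w.r.t. the Gauss-type estimators:
# RE(mu_T) = mse_mu_G / mse_mu_T and RE(Omega_T) = mse_Omega_G / mse_Omega_T.
```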
The reader can easily derive the relative efficiency of any power M-estimator with respect to any other power M-estimator by taking the mean squared errors reported in the online supplement into account. The complete-data case is denoted by COM. For investigating the performance of the power M-estimators in the case of incomplete data, we simulate three different missingness mechanisms that satisfy MCAR, MAR (but not MCAR), and NMAR. Let xi be a realization of Xi. We allow only the first component of xi, i.e., x1i, to be missing. More precisely, we have that r1i = 0 if x1i is missing and r1i = 1 if x1i is observed. This means that r1i is the realization of the first component of the response vector Ri. The unobserved data are MCAR if Ri is stochastically independent of Xi. It is worth emphasizing that, in principle, we need not assume that Ri is stochastically independent of Xj with j ≠ i, but this assumption is implicitly satisfied in our simulation study. If the distribution of Ri depends only on the observed part of Xi, the missing data are MAR, and if the response is determined by the unobserved part of Xi, the missing data are NMAR. For the MCAR case, we simulate n mutually independent Bernoulli variables R11, …, R1n with probability of success π = 0.5, where R1i is stochastically independent of Xi. In the MAR case, we have that r1i = 0, i.e., x1i is considered missing, if x2i < 0. Finally, in the NMAR case we set r1i = 0 if x1i < 0. This procedure guarantees that approximately 50% of the data in the first row of the sample are missing under each missingness mechanism.
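The three response mechanisms can be simulated along the following lines (a sketch under our own naming conventions; the seed is arbitrary):

```python
# Hedged sketch of the three response mechanisms; r1 = 1 means x1 is observed.
import numpy as np

rng = np.random.default_rng(1)  # hypothetical seed

def response_indicator(X, mechanism):
    if mechanism == "MCAR":           # R_i independent of X_i, pi = 0.5
        return rng.binomial(1, 0.5, size=X.shape[0])
    if mechanism == "MAR":            # depends only on the observed part x2
        return (X[:, 1] >= 0).astype(int)
    if mechanism == "NMAR":           # depends on the possibly missing part x1
        return (X[:, 0] >= 0).astype(int)
    raise ValueError(f"unknown mechanism: {mechanism}")
```

Since the simulated distributions are centered, each of the three mechanisms leaves roughly 50% of the first components observed, as stated above.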
6.2. Numerical Results

The results of our simulation study can be found in the online supplement (see Tables S.1–S.8). Tables S.1, S.3, S.5, and S.7 provide the results regarding the location vector µ, whereas Tables S.2, S.4, S.6, and S.8 refer to the shape matrix Ω. It is well-known that the generating variate of a multivariate t-distribution with ν degrees of freedom has finite moments only of orders lower than ν. In Section 5.2 we demonstrated that the joint asymptotic normality of the standardized power M-estimators with common tail index α requires the generating variate to have a finite moment of order 4(1 − α). In fact, our fixed-point algorithm sometimes diverges if α ≤ 1 − ν/4. This could be due to the fact that a solution of the power M-estimating equations need not exist if the data are heavy tailed but the tail index of the power weight functions is too low. Hence, for ν = 1 we do not apply the power M-estimators with tail indices α = 0.25, 0.50, 0.75. Similarly, for ν = 2 we ignore the power M-estimators with α = 0.25, 0.50, etc. Nonetheless, we always compute the Gauss-type M-estimators by using the sweep operator [45, Chapter 6.5], which cannot diverge by construction (a textbook sketch of one sweep step is given below).

The accuracy of the presented results highly depends on the distributional assumption. This shall be illustrated by Table S.1, in which we consider the M-estimators for the location vector in the complete-data case. Let us take a look at the upper left entry of Table S.1, i.e., 0.5261, which quantifies the absolute bias of the Gauss-type M-estimator for µ in the case of multivariate t-distributed data with ν = 1 degree of freedom and a sample size of n = 100. This number is significantly different from all other numbers that refer to uncontaminated data in the same row, and its accuracy is poor. Although the Monte Carlo study yields 10000 observations of µ̂_G, they have a standard deviation ranging from 45 to 65. Hence, for each component of µ̂_G, the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Gk} is between 0.45 and 0.65. By contrast, the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Tk} is only about 0.0013. The same phenomenon can be observed for the other numbers in the first column of Table S.1. For example, the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂²_{Gk} is between 500 and 2000. By contrast, the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂²_{Tk} is only about 0.00026. To sum up, the accuracy of the numbers given in the first column essentially depends on the numbers themselves.

The second column represents the case of multivariate t-distributed data with ν = 2 degrees of freedom. Here, the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Gk} is about 0.004, the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂_{0.75,k} is about 0.0013, and the standard error of 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Tk} is about 0.0012, given the sample size n = 100. Hence, the accuracy of our results becomes better the more robust the M-estimator and the less heavy the distribution. For example, in the case of the multivariate power-exponential distribution with sample size n = 100 we have a standard error of about 0.00047 for 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Gk} and 0.00055 for 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Tk}. In the contaminated-data case we still have a good accuracy. For example, regarding the last column of Table S.1 we obtain a standard error of only 0.001 both for 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Gk} and for 10000⁻¹ ∑_{k=1}^{10000} µ̂_{Tk}.

Similar observations can be made throughout Table S.1 with respect to the mean squared error of the M-estimators, and these observations do not change, essentially, if we consider the subsequent tables. To sum up, the accuracy of our results becomes better the more robust the M-estimator and the less heavy the data. If the reported absolute bias is large or the mean squared error is big, we can expect a bad accuracy and vice versa. At least for those cases that are relevant from a practical point of view, i.e., ν > 2, the given accuracy seems to be fairly good. In any case, although not all numbers are accurate, we believe that our overall conclusion is reliable.
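For reference, one sweep step SWP[k] on a symmetric matrix may be sketched as follows; this is a textbook version in the sign convention of Little and Rubin [35] and Schafer [45], not the authors' implementation:

```python
# Hedged sketch of the sweep operator: one step SWP[k] on a symmetric G.
import numpy as np

def sweep(G, k):
    H = G.astype(float).copy()
    pivot = G[k, k]
    others = [i for i in range(G.shape[0]) if i != k]
    H[k, k] = -1.0 / pivot                       # swept pivot
    for i in others:
        H[i, k] = H[k, i] = G[i, k] / pivot      # regression coefficients
        for j in others:
            H[i, j] = G[i, j] - G[i, k] * G[k, j] / pivot  # residual block
    return H
```

Sweeping the augmented cross-product matrix on the observed variables yields the regression coefficients needed for the factored-likelihood computations under a monotone pattern; each step involves only a pivot division, which is why the procedure cannot diverge.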
Table 1: Most favorable tail index of the power M-estimators under different scenarios.
Small sample
                            Location vector              Shape matrix
                     Crit.  COM   MCAR  MAR   NMAR    COM   MCAR  MAR   NMAR
  Light tails        AB     0     0     0     0       0     0     0     0
                     MSE    0     0     1     0       0     0     0     0
  Normal tails       AB     0     0.50  0     1       0     0     0     0
                     MSE    0     0     1     1       0     0     0     0
  Heavy tails        AB     1     1     0     1       0.75  0.75  1     0.50
                     MSE    1     1     1     1       0.75  0.75  1     0.75
  Contaminated 1%    AB     1     1     1     1       1     1     1     1
                     MSE    0.75  0.75  1     1       0.75  0.75  1     0.75
  Contaminated 5%    AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1
  Contaminated 10%   AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1

Moderate sample
                            Location vector              Shape matrix
                     Crit.  COM   MCAR  MAR   NMAR    COM   MCAR  MAR   NMAR
  Light tails        AB     0     0     0     0       0     0     0     0
                     MSE    0     0     0     0       0     0     0     0
  Normal tails       AB     0.75  0.25  0     1       0.25  0     0     0
                     MSE    0     0     0     1       0     0     0     0
  Heavy tails        AB     1     0.75  0     1       0.75  0.75  0.75  0
                     MSE    1     1     1     1       0.75  0.75  1     0.50
  Contaminated 1%    AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     0.75
  Contaminated 5%    AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1
  Contaminated 10%   AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1

Large sample
                            Location vector              Shape matrix
                     Crit.  COM   MCAR  MAR   NMAR    COM   MCAR  MAR   NMAR
  Light tails        AB     0     0     0     0       0     0     0     0
                     MSE    0     0     0     0       0     0     0     0
  Normal tails       AB     0     0     0     1       0     0     0.25  0
                     MSE    0     0     0     1       0     0     0     0
  Heavy tails        AB     1     1     0     1       0.75  0.75  0.25  0
                     MSE    1     1     0.25  1       0.75  0.75  0.75  0
  Contaminated 1%    AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1
  Contaminated 5%    AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1
  Contaminated 10%   AB     1     1     1     1       1     1     1     1
                     MSE    1     1     1     1       1     1     1     1
First of all, we refer to Table S.1 and Table S.2, which contain the results of the complete-data case. If the data are clean, i.e., uncontaminated, the absolute bias of the power M-estimators with α > 0 decreases with the sample size, n, and eventually vanishes in the large samples. The Gauss-type M-estimators remain biased if ν is too low, i.e., ν < 2 (Table S.1) and ν < 4 (Table S.2), but this result can be attributed to the bad accuracy that is typically observed in that case. If the data are contaminated, the absolute bias essentially decreases with α, but it does not vanish with n. Indeed, the power M-estimators cannot be expected to be consistent if the data are contaminated, but in this case the Tyler-type M-estimators always turn out to be preferable in terms of bias.

Table S.3 and Table S.4 provide the results under the assumption that the unobserved data are MCAR. The overall findings do not differ essentially from Table S.1 and Table S.2. Our numerical results confirm that the power M-estimators are consistent if the unobserved data are MCAR. This picture changes substantially in Table S.5 and Table S.6, which cover the case in which the unobserved data are MAR. In the previous sections, we have argued that in this case the M-estimators can be inconsistent, even if the data are uncontaminated. This is confirmed by the results of our simulation study. The only exceptions are the Gauss-type M-estimators in the case of multivariate normally distributed data. Indeed, in this special case the Gauss-type M-estimators are ML-estimators and their consistency under MAR is guaranteed by Theorem 1. Otherwise, we must accept that the power M-estimators are biased even in large samples. Finally, Table S.7 and Table S.8 are based on the assumption that the unobserved data are NMAR. In this case, also the Gauss-type M-estimators are inconsistent, even if the data stem from a clean multivariate normal distribution. Similar arguments hold true for the mean squared error of the power M-estimators.

The results of our simulation study are summarized in Table 1. This table indicates which power M-estimator is favorable in different situations. The M-estimators have been evaluated by means of their absolute bias and mean squared error. Each number in Table 1 represents the optimal choice of the tail index α with respect to the efficiency of the power M-estimator relative to the Gauss-type M-estimator. We can see that the Tyler-type M-estimators are nearly always preferable for contaminated data, whereas the Gauss-type M-estimators should be preferred only if the data have normal or light tails. If the data are heavy tailed, one should choose a large tail index α. However, regarding the shape matrix it is usually better not to choose α = 1, i.e., to avoid the Tyler-type M-estimator. This holds true for complete and incomplete data, irrespective of whether the unobserved data are MCAR, MAR, or NMAR.

Acknowledgments
Klaus Nordhausen and Hannu Oja would like to thank the Academy of Finland very much for its financial support. Gabriel Frahm would like to thank his co-authors very much for their kind invitations to Finland. Moreover, we would like to thank the reviewers and the editor very much for their splendid contribution.

Appendix A. Proofs
Proof of Proposition 1: We have that
\[
E\!\left[\frac{\partial \log f(X_{R_i}; \theta)}{\partial \theta}\right]
= \int\!\!\int \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta}\, f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \int\!\!\int \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta}\, f(r_i \mid X_{r_i} = x_{r_i})\, f(x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
\]
\[
= \int\!\!\int f(r_i \mid X_{r_i} = x_{r_i})\, \frac{\partial f(x_{r_i}; \theta)}{\partial \theta}\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \int\!\!\int \frac{\partial f(r_i, x_{r_i}; \theta)}{\partial \theta}\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \frac{\partial}{\partial \theta} \int\!\!\int f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \frac{\partial}{\partial \theta}\, 1 = 0.
\]

Proof of Theorem 1: The ML-estimator of θ is the solution of
\[
\Phi_R\bigl(\hat\theta; X_R\bigr) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial \log f\bigl(X_{R_i}; \hat\theta\bigr)}{\partial \theta} = 0.
\]
Assumption A3 given in Section 2 guarantees that
\[
\Phi_R\bigl(\hat\theta; X_R\bigr) = \Phi_R(\theta; X_R) + \frac{\partial \Phi_R(\theta; X_R)}{\partial \theta^{\top}} \bigl(\hat\theta - \theta\bigr) + O_p\bigl(n^{-1}\bigr).
\]
Hence, we have that
\[
n^{\frac{1}{2}}\bigl(\hat\theta - \theta\bigr) = -\left(\frac{\partial \Phi_R(\theta; X_R)}{\partial \theta^{\top}}\right)^{-1} n^{\frac{1}{2}}\, \Phi_R(\theta; X_R) + O_p\bigl(n^{-\frac{1}{2}}\bigr).
\]
Due to Assumption A1, we conclude that
\[
n^{\frac{1}{2}}\, \Phi_R(\theta; X_R) \longrightarrow \mathcal{N}_p(0, F_\theta), \qquad n \longrightarrow \infty.
\]
Assumption A2 in connection with Slutsky's theorem implies that
\[
n^{\frac{1}{2}}\bigl(\hat\theta - \theta\bigr) \longrightarrow \mathcal{N}_p\bigl(0, H_\theta^{-1} F_\theta H_\theta^{-1}\bigr), \qquad n \longrightarrow \infty.
\]
Note that
\[
\frac{\partial}{\partial \theta}\!\left(\frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta^{\top}}\, f(x_{r_i}; \theta)\right)
= \frac{\partial^2 \log f(x_{r_i}; \theta)}{\partial \theta\, \partial \theta^{\top}}\, f(x_{r_i}; \theta)
+ \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta}\, \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta^{\top}}\, f(x_{r_i}; \theta)
\]
and
\[
f(r_i \mid X_{r_i} = x_{r_i})\, f(x_{r_i}; \theta) = f(r_i, x_{r_i}; \theta).
\]
Moreover, due to the assumptions DIS, MAR, and INT, we obtain
\[
0 = \frac{\partial^2}{\partial \theta\, \partial \theta^{\top}} \underbrace{\int\!\!\int f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i}_{=\,1}
= \int\!\!\int f(r_i \mid X_{r_i} = x_{r_i})\, \frac{\partial^2 f(x_{r_i}; \theta)}{\partial \theta\, \partial \theta^{\top}}\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \int\!\!\int f(r_i \mid X_{r_i} = x_{r_i})\, \frac{\partial}{\partial \theta}\!\left(\frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta^{\top}}\, f(x_{r_i}; \theta)\right) \mathrm{d}x_{r_i}\, \mathrm{d}r_i
\]
and thus
\[
\int\!\!\int \frac{\partial^2 \log f(x_{r_i}; \theta)}{\partial \theta\, \partial \theta^{\top}}\, f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
+ \int\!\!\int \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta}\, \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta^{\top}}\, f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i = 0.
\]
We have that
\[
\int\!\!\int \frac{\partial^2 \log f(x_{r_i}; \theta)}{\partial \theta\, \partial \theta^{\top}}\, f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= E\!\left[\frac{\partial \phi_{R_i}(\theta; X_{R_i})}{\partial \theta^{\top}}\right] = H_\theta,
\]
and Proposition 1 guarantees that E[φ_{R_i}(θ; X_{R_i})] = 0. Thus, if the elements of the process (φ_{R_i}(θ; X_{R_i})) are uncorrelated, the double integral
\[
\int\!\!\int \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta}\, \frac{\partial \log f(x_{r_i}; \theta)}{\partial \theta^{\top}}\, f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
\]
equals
\[
E\bigl[\phi_{R_i}(\theta; X_{R_i})\, \phi_{R_i}^{\top}(\theta; X_{R_i})\bigr] = F_\theta.
\]
This means that Fisher's information equality H_θ = −F_θ is satisfied, and so we obtain n^{1/2}(θ̂ − θ) → N_p(0, F_θ⁻¹) as n → ∞. Finally, the Cramér–Rao Theorem implies that θ̂ is asymptotically efficient.

Proof of Proposition 2: Assumption MCAR implies that
\[
f(r_i, x_{r_i}; \theta) = \int f(r_i, x_{r_i}, x_{\bar r_i}; \theta)\, \mathrm{d}x_{\bar r_i}
= f(r_i; \theta) \int f(x_{r_i}, x_{\bar r_i}; \theta)\, \mathrm{d}x_{\bar r_i}
= f(r_i; \theta)\, f(x_{r_i}; \theta)
\]
and thus
\[
E\bigl[\psi_{R_i}(\vartheta; X_{R_i})\bigr]
= \int\!\!\int \psi_{r_i}(\vartheta; x_{r_i})\, f(r_i, x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \int f(r_i; \theta) \int \psi_{r_i}(\vartheta; x_{r_i})\, f(x_{r_i}; \theta)\, \mathrm{d}x_{r_i}\, \mathrm{d}r_i
= \int f(r_i; \theta)\, E\bigl[\psi_{r_i}(\vartheta; X_{r_i})\bigr]\, \mathrm{d}r_i
= \int f(r_i; \theta)\, 0\, \mathrm{d}r_i = 0.
\]
Proof of Theorem 2: This proof is skipped, since it follows from the same arguments as those in the first part of the proof of Theorem 1. The second part of that proof, which refers to the asymptotic efficiency of the ML-estimator, is void in the case of M-estimation.
Proof of Theorem 3: We obtain
\[
E\!\left[\frac{B\bigl(\tfrac{d}{2}+1, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)} \left(\frac{V^2 \beta}{m}\right)^{\!-\alpha} V^2 \beta\right]
= \frac{B\bigl(\tfrac{d}{2}+1, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)}\, m^{\alpha}\, E\bigl[(V^2 \beta)^{1-\alpha}\bigr],
\]
where E[(V²β)^{1−α}] = E[V^{2(1−α)}] E(β^{1−α}). From (18) it follows that E[V^{2(1−α)}] = m^{1−α}. Thus, we have that
\[
\frac{B\bigl(\tfrac{d}{2}+1, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)}\, m^{\alpha}\, E\bigl[(V^2 \beta)^{1-\alpha}\bigr]
= m\, E\bigl(\beta^{1-\alpha}\bigr)\, \frac{B\bigl(\tfrac{d}{2}+1, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)}.
\]
Moreover, since B(d/2, (m−d)/2) = (m/d) B(d/2 + 1, (m−d)/2) and thus
\[
E\bigl(\beta^{1-\alpha}\bigr)
= \frac{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}, \tfrac{m-d}{2}\bigr)}
= \frac{d}{m}\, \frac{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}+1, \tfrac{m-d}{2}\bigr)},
\]
we conclude that
\[
m\, E\bigl(\beta^{1-\alpha}\bigr)\, \frac{B\bigl(\tfrac{d}{2}+1, \tfrac{m-d}{2}\bigr)}{B\bigl(\tfrac{d}{2}+1-\alpha, \tfrac{m-d}{2}\bigr)} = d.
\]
Proof of Theorem 4: The asymptotic covariance matrix of n^{1/2}(µ̂ − µ) follows from Theorem 6 of Maronna [39]. We have that v(ξ^{1/2}) = ξ^{−α/2} and thus ϕ_µ(ξ^{1/2}) = v(ξ^{1/2}) ξ^{1/2} = ξ^{(1−α)/2}. It follows that ϕ²_µ(ξ^{1/2}) = ξ^{1−α} and ϕ′_µ(ξ^{1/2}) = (1 − α) ξ^{−α/2}, where ϕ′_µ denotes the first derivative of ϕ_µ with respect to ξ^{1/2}. Further, we already know that ξ = V². This means that a = E[ϕ²_µ(V)]/m = E[V^{2(1−α)}]/m and
\[
b = E\!\left[v(V)\Bigl(1 - \frac{1}{m}\Bigr) + \frac{\varphi_\mu'(V)}{m}\right]
= \frac{E\bigl[(m-1)V^{-\alpha} + (1-\alpha)V^{-\alpha}\bigr]}{m}
= \frac{m-\alpha}{m}\, E\bigl[V^{-\alpha}\bigr].
\]
Hence, we obtain
\[
\frac{a}{b^2} = \frac{m\, E\bigl[V^{2(1-\alpha)}\bigr]}{(m-\alpha)^2\, E^2\bigl[V^{-\alpha}\bigr]},
\]
which leads us to the given asymptotic covariance matrix of n^{1/2}(µ̂ − µ). Now, we turn to the asymptotic covariance matrix of n^{1/2}(Σ̂ − Σ). The numbers γ₁ and γ₂ are given by Tyler [49, p. 432]. We have that w(ξ) = (ξ/m)^{−α} and thus ϕ_Σ(ξ) = w(ξ) ξ = m^α ξ^{1−α}. This implies that ϕ²_Σ(ξ) = m^{2α} ξ^{2(1−α)} and ϕ′_Σ(ξ) = (1 − α) m^α ξ^{−α}, where ϕ′_Σ is the first derivative of ϕ_Σ with respect to ξ. For calculating the numbers γ₁ and γ₂ we need
\[
\tau_1 = \frac{E\bigl[\varphi_\Sigma^2(V^2)\bigr]}{m(m+2)} = \frac{m^{2\alpha}}{m(m+2)}\, E\bigl[V^{4(1-\alpha)}\bigr]
\]
and
\[
\tau_2 = \frac{E\bigl[\varphi_\Sigma'(V^2)\, V^2\bigr]}{m} = (1-\alpha)\, m^{\alpha-1}\, E\bigl[V^{2(1-\alpha)}\bigr].
\]
Finally, according to Theorem 6 of Maronna [39], µ̂ and Σ̂ are asymptotically independent.
Proof of Theorem 5: For 0 ≤ α < 1, the asymptotic covariance matrix of n^{1/2}(µ̂ − µ) has already been established by Theorem 4. The case α = 1 is investigated by Hettmansperger and Randles [24]. It can be verified that the resulting asymptotic covariance matrix is covered by Theorem 4. Furthermore, according to Frahm [14], the asymptotic covariance matrix of n^{1/2}(Ω̂ − Ω) corresponds to C = γ₁ Υ(I_{m²} + K_{m²})(Ω ⊗ Ω)Υᵀ, where γ₁ is already given by Theorem 4. The numbers τ₁ and τ₂ have been derived in the proof of Theorem 4 and lead to
\[
\gamma_1 = \frac{m+2}{m}\, \frac{E\bigl[V^{4(1-\alpha)}\bigr]}{\bigl(m^{1-\alpha} + 2(1-\alpha)\, E\bigl[V^{2(1-\alpha)}\bigr]/m\bigr)^{2}}.
\]
Finally, according to Theorem 4, µ̂ and Σ̂ are asymptotically independent in the case of 0 ≤ α < 1. Since Ω̂ is a function of Σ̂, we conclude that µ̂ and Ω̂ are asymptotically independent, too. This holds irrespective of the chosen scale function. Moreover, for α = 1 the asymptotic independence of µ̂ and Ω̂ is proved by Hettmansperger and Randles [24]. We can switch from one scale function to another by re-scaling the shape matrix, and so their result does not depend on the chosen scale function either.

Online supplement
The online supplement contains the detailed results of the simulation study.

References
[1] P. Allison, Missing Data, Sage Publications, 2001.
[2] G. Boente, W. González-Manteiga, A. Pérez-González, Robust nonparametric estimation with missing data, Journal of Statistical Planning and Inference 139 (2009) 571–592.
[3] R. Bradley, Basic properties of strong mixing conditions. A survey and some open questions, Probability Surveys 2 (2005) 107–144.
[4] K. Branden, S. Verboven, Robust data imputation, Computational Biology and Chemistry 33 (2009) 7–13.
[5] S. Cambanis, S. Huang, G. Simons, On the theory of elliptically contoured distributions, Journal of Multivariate Analysis 11 (1981) 368–385.
[6] T.-C. Cheng, M.-P. Victoria-Feser, High-breakdown estimation of multivariate mean and covariance with missing observations, British Journal of Mathematical and Statistical Psychology 55 (2002) 317–335.
[7] S. Copt, M.-P. Victoria-Feser, Fast algorithms for computing high breakdown covariance matrices with missing data, in: M. Hubert, G. Pison, A. Struyf, S. van Aelst (Eds.), Theory and Applications of Recent Robust Methods, Birkhäuser, 2004, pp. 71–82.
[8] M. Danilov, V. Yohai, R. Zamar, Robust estimation of multivariate location and scatter in the presence of missing data, Journal of the American Statistical Association 107 (2012) 1178–1186.
[9] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1977) 1–38.
[10] L. Dümbgen, K. Nordhausen, H. Schuhmacher, New algorithms for M-estimation of multivariate scatter and location, Journal of Multivariate Analysis 144 (2016) 200–217.
[11] L. Dümbgen, M. Pauly, T. Schweizer, M-functionals of multivariate scatter, Statistics Surveys 9 (2015) 32–105.
[12] A. Flossmann, Accounting for missing data in M-estimation: a general matching approach, Empirical Economics 38 (2010) 85–117.
[13] G. Frahm, Generalized Elliptical Distributions: Theory and Applications, Ph.D. thesis, University of Cologne, 2004.
[14] G. Frahm, Asymptotic distributions of robust shape matrices and scales, Journal of Multivariate Analysis 100 (2009) 1329–1337.
[15] G. Frahm, K. Glombek, Semicircle law of Tyler's M-estimator for scatter, Statistics and Probability Letters 82 (2012) 959–964.
[16] G. Frahm, U. Jaekel, A generalization of Tyler's M-estimators to the case of incomplete data, Computational Statistics and Data Analysis 54 (2010) 374–393.
[17] G. Frahm, U. Jaekel, Tyler's M-estimator in high-dimensional financial-data analysis, in: K. Nordhausen, S. Taskinen (Eds.), Modern Nonparametric, Robust and Multivariate Methods, Springer, 2015, pp. 289–305.
[18] J. Gastwirth, H. Rubin, The behavior of robust estimators on dependent data, Annals of Statistics 3 (1975) 1070–1100.
[19] F. Hampel, E. Ronchetti, P. Rousseeuw, W. Stahel, Robust Statistics, John Wiley, 1986.
[20] P. Han, Multiply robust estimation in regression analysis with missing data, Journal of the American Statistical Association 109 (2014) 1159–1173.
[21] P. Han, Combining inverse probability weighting and multiple imputation to improve robustness of estimation, Scandinavian Journal of Statistics 43 (2016) 246–260.
[22] B. Hansen, Consistent covariance matrix estimation for dependent heterogeneous processes, Econometrica 60 (1992) 967–972.
[23] L. Hansen, Large sample properties of generalized method of moments estimators, Econometrica 50 (1982) 1029–1054.
[24] T. Hettmansperger, R. Randles, A practical affine equivariant multivariate median, Biometrika 89 (2002) 851–860.
[25] K. Hron, M. Templ, P. Filzmoser, Imputation of missing values for compositional data using classical and robust methods, Computational Statistics and Data Analysis 54 (2010) 3095–3107.
[26] P. Huber, Robust Statistics, John Wiley, 2003.
[27] M. Jamshidian, S. Jalal, Tests of homoscedasticity, normality, and missing completely at random for incomplete multivariate data, Psychometrika 75 (2010) 649–674.
[28] M. Jamshidian, S. Jalal, C. Jansen, MissMech: An R package for testing homoscedasticity, multivariate normality, and missing completely at random (MCAR), Journal of Statistical Software 56 (2014) 1–31.
[29] J. Kent, D. Tyler, Redescending M-estimates of multivariate location and scatter, Annals of Statistics 19 (1991) 2102–2119.
[30] H. Koul, Behavior of robust estimators in the regression model with dependent errors, Annals of Statistics 5 (1977) 681–699.
[31] H. Koul, K. Mukherjee, Asymptotics of R-, MD- and LAD-estimators in linear regression models with long range dependent errors, Probability Theory and Related Fields 95 (1993) 535–553.
[32] J. Listing, R. Schlittgen, A nonparametric test for random dropouts, Biometrical Journal 45 (2003) 113–127.
[33] R. Little, Robust estimation of the mean and covariance matrix from data with missing values, Journal of the Royal Statistical Society, Series C 37 (1988) 23–38.
[34] R. Little, A test of missing completely at random for multivariate data with missing values, Journal of the American Statistical Association 83 (1988) 1198–1202.
[35] R. Little, D. Rubin, Statistical Analysis with Missing Data, John Wiley, 2nd edition, 2002.
[36] R. Little, P. Smith, Editing and imputation for quantitative survey data, Journal of the American Statistical Association 82 (1987) 58–68.
[37] H. Markowitz, Portfolio Selection: Efficient Diversification of Investments, Wiley, 1959.
[38] H. Markowitz, Mean-Variance Analysis in Portfolio Choice and Capital Markets, Basil Blackwell, 1987.
[39] R. Maronna, Robust M-estimators of multivariate location and scatter, Annals of Statistics 4 (1976) 51–67.
[40] R. Maronna, D. Martin, V. Yohai, Robust Statistics, John Wiley, 2006.
[41] D. Paindaveine, A canonical definition of shape, Statistics and Probability Letters 78 (2008) 2240–2247.
[42] W. Palma, G. del Pino, Statistical analysis of incomplete long-range dependent data, Biometrika 86 (1999) 965–972.
[43] D. Politis, The impact of bootstrap methods on time series analysis, Statistical Science 18 (2003) 219–230.
[44] S. Portnoy, Robust estimation in dependent situations, Annals of Statistics 5 (1977) 22–43.
[45] J. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997.
[46] S. Serneels, T. Verdonck, Principal component analysis for data containing outliers and missing elements, Computational Statistics and Data Analysis 52 (2008) 1712–1727.
[47] M. Sued, V. Yohai, Robust location estimation with missing data, Canadian Journal of Statistics 41 (2013) 111–132.
[48] M. Templ, A. Kowarik, P. Filzmoser, Iterative stepwise regression imputation using standard and robust methods, Computational Statistics and Data Analysis 55 (2011) 2793–2806.
[49] D. Tyler, Radial estimates and the test for sphericity, Biometrika 69 (1982) 429–436.
[50] D. Tyler, A distribution-free M-estimator of multivariate scatter, Annals of Statistics 15 (1987) 234–251.
[51] D. Tyler, Statistical analysis for the angular central Gaussian distribution on the sphere, Biometrika 74 (1987) 579–589.
[52] A. van der Vaart, Asymptotic Statistics, Cambridge University Press, 1998.
[53] J.-L. Wang, Asymptotic properties of M-estimators based on estimating equations and censored data, Scandinavian Journal of Statistics 26 (1999) 297–318.
[54] J. Wooldridge, Inverse probability weighted estimation for general missing data problems, Journal of Econometrics 141 (2007) 1281–1301.
[55] B. Wu, M-estimation of linear models with dependent errors, Annals of Statistics 35 (2007) 495–521.
[56] Y. Yuanxi, Robust estimation for dependent observations, Manuscripta Geodaetica 19 (1994) 10–17.