Estimating heterogeneity variances to select a random effects model

Journal of Statistical Planning and Inference 202 (2019) 1–13

Andrew L. Rukhin
Statistical Engineering Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

Article history: Received 13 April 2018; received in revised form 7 September 2018; accepted 10 December 2018; available online 5 February 2019.

Keywords: Bayes estimator; Information criterion; Maximum likelihood; Model selection; Research synthesis

Abstract

There are many collaborative studies in which the reported within-study uncertainty estimates are unreliable but can be considered as lower bounds on the true uncertainties. This work is motivated by such examples; it provides a method to determine the common mean of heterogeneous observations whose variances are unknown but admit the given lower bounds. In this situation, the classical maximum likelihood estimator and the restricted maximum likelihood estimator are derived. These procedures lead to the choice of a random effects model in which the unknown heterogeneity variance may depend on the individual study. The Bayes procedures against the noninformative prior restricted to the appropriate parametric subset are recommended for practical use. Published by Elsevier B.V.

1. Introduction

In a meta-analytic setting the estimators of the common mean reported by, say, n studies, each using possibly its own technique, are to be combined to get a consensus estimate of this parameter. The popular random effects model posits these estimators x_k, k = 1, …, n, to have the form

x_k = µ + λ_k + ϵ_k.   (1)

The parameter of primary interest (the focus parameter) is µ, the unknown common mean: the treatment effect in meta-analysis or the reference value in interlaboratory comparisons. The measurement error ϵ_k of the kth study has zero mean and some positive variance representing the within-study variability; the protocol usually demands that, besides their estimates of µ, the participants also report the estimates s_k² of this variance (the within-study uncertainty). In some cases the degrees of freedom associated with these estimators are also required, but it is not assumed here that they are present. The additional noise (the between-study effect), realized as λ_k in (1), is commonly assumed to have zero mean and the so-called heterogeneity variance τ², τ² ≥ 0.

In clinical trials involving the comparison of a drug and placebo, the treatment effect is often measured by the number of independent successful outcomes, which automatically provides the variance of this effect estimator. However, in many other instances the reported uncertainties tend to underestimate the actual standard errors. Indeed, the multitude of unrealistically small standard errors which do not take into account all possible sources of error is widely acknowledged in measurement science. In some applications the variances of systematic, laboratory-specific errors cannot be estimated by statistical methods, or the observations are too scarce as they are taken close to the detection limit. The between-study effect (termed ''hidden errors'', ''excess variance'', or ''dark uncertainty'') is sometimes evident; it represents the additional variability which is a fairly common metrological phenomenon.

(E-mail address: [email protected]. https://doi.org/10.1016/j.jspi.2018.12.003)

In some studies the degree


of heteroskedasticity is quite substantial (e.g. Thompson and Sharp, 1999). Many different procedures to estimate the heterogeneity variance in meta-analysis are reviewed in Higgins et al. (2009), Rukhin (2013) or Veroniki et al. (2016). However, the assumption that all λ's have the same dispersion seems to be violated in many collaborative studies where the smallest reported values s_k² correspond to the cases which are most deviant from the bulk of data. These aberrant data points unduly influence the consensus estimate of the common mean, shifting it closer to their laboratory value.

One of the important examples is the collaborative effort to determine the values of fundamental physical constants (e.g. Mohr and Taylor, 2005). In the 1998 study of Newton's gravitational constant, one of the institutes presented an outlying measurement of this constant accompanied by a small uncertainty. The present (2017) estimate of Newton's gravitational constant (https://physics.nist.gov/) obtained after the removal of this data point is about three standard deviations away from the 1998 value! Many more examples of this kind can be found in Cox (2007) or in Thompson and Ellison (2011). These authors analyzed heterogeneous data sets from BIPM International Key Comparisons (http://www.kcdb.bipm.org) where a group of national metrology laboratories reports measurement results on the same material along with their uncertainty estimates. All these examples suggest that the main reason for a constant-variance between-study effect is merely the mathematical convenience of making this additional error quantifiable.

There is a body of work aimed at extending the traditional random effects exemplar for the needs of meta-analysis. For example, Ohlssen et al. (2007) and Lee and Thompson (2008) provide more general models. This paper assumes the random effects to be normally distributed with unknown, possibly different variances which admit the given positive lower bounds.
By independence of λ_k and ϵ_k, the variance of x_k cannot be smaller than the variance of ϵ_k, which is estimated by s_k². In many situations the estimate s_k² is not very reliable, so it is suggested to use it only as the mentioned lower bound for the unknown variance of x_k. In some cases such bounds can be obtained from physical considerations, archival data or expert judgment. Hoaglin (2016) discusses possible dangers when s_k² is treated as exactly known but the sample size is small.

The restricted and classical maximum likelihood estimators of the heterogeneity variances τ_k² are derived in Sections 2 and 3. They lead to procedures to select the model where only a given number m of the τ_k² are positive. The conditions under which some variance estimators exceed their fixed lower bounds are discussed in Section 4. The likelihood ratio tests of this wider paradigm and of the traditional random effects model are discussed in Section 6. Noninformative Bayes estimators are developed in Section 5; they are much smoother than the likelihood estimators, produce good confidence intervals for the common mean, and are recommended for practical use. Under the heteroskedasticity scenario described in Section 8, the maximum likelihood estimators are compared to the Bayes procedures numerically. The confidence intervals based on these procedures outperform notoriously accurate intervals based on higher order asymptotics obtained from the classical random effects model. The numerical algorithm for the evaluation of the maximum likelihood estimators is provided in the Supplement. The Appendix gives two proofs and some comments about particular cases.

2. Classical maximum likelihood estimator

Let the data be represented as a sequence of independent normally distributed random variables x_k ∼ N(µ, σ_k²), k = 1, …, n, with the common mean µ which is the object of interest.
After the discussion in Section 1, we assume that the unknown (nuisance) variances σ_k² have given lower bounds s_k² > 0, say σ_k² = τ_k² + s_k², τ_k² ≥ 0. Put differently, in this setting the random between-study effect is allowed to affect each case individually. The first goal is to find the maximum likelihood estimator σ̂_k² = τ̂_k² + s_k², which leads to the estimator µ̂ of the common mean,

µ̂ = [ Σ_k x_k/(τ̂_k² + s_k²) ] [ Σ_k 1/(τ̂_k² + s_k²) ]⁻¹.

Indeed, if the heterogeneity variances τ_k² are known, the best unbiased estimator of µ is the weighted means statistic

µ̂₀ = [ Σ_k x_k/(τ_k² + s_k²) ] [ Σ_k 1/(τ_k² + s_k²) ]⁻¹,   (2)

so that µ̂ presents the plug-in version of (2). The negative log-likelihood function (times two) has the form

L(µ, τ₁², …, τ_n²) = Σ_k [ (x_k − µ)²/(τ_k² + s_k²) + log(τ_k² + s_k²) ].   (3)

Thus one has to determine the constrained minimum,

min_{τ₁², …, τ_n² ≥ 0, µ} L(µ, τ₁², …, τ_n²) = min_{z₁ ≤ s₁⁻², …, z_n ≤ s_n⁻²} Σ_k [ (x_k − µ̂₀)² z_k − log z_k ],

with z_k = (τ_k² + s_k²)⁻¹. The maximum likelihood estimator without any side conditions is not well defined since L is unbounded below at each data point whose variance is estimated by zero.

Assume that the minimum is attained when z_j = s_j⁻² for j ∈ J, with J denoting a subset of {1, …, n}. Then the number m of elements in the set I = J^c is the number of active constraints: z_i < s_i⁻² for i ∈ I. To simplify, we use the following notation: Z_I = Σ_I z_i, µ̂_I = Σ_I z_i x_i / Z_I, S_J = Σ_J s_j⁻², µ̂_J = Σ_J s_j⁻² x_j / S_J, so that

µ̂ = (Z_I µ̂_I + S_J µ̂_J)/(Z_I + S_J)   (4)

is the weighted mean of the two estimators µ̂_I and µ̂_J. Direct differentiation of (3) shows that for any stationary point z_i,

z_i = 1/(x_i − µ̂)² = (Z_I + S_J)² / [Z_I(x_i − µ̂_I) + S_J(x_i − µ̂_J)]²,  i ∈ I,   (5)

with the last equality proven in the Appendix. Therefore, for a given set I, the goal becomes to find Z_I and µ̂_I so that (5) holds. Indeed these two variables determine z_i, i ∈ I; they satisfy the two simultaneous equations

µ̂_I = Σ_I (Z_I + S_J)² x_i / { Z_I [Z_I(x_i − µ̂_I) + S_J(x_i − µ̂_J)]² },   (6)

Z_I = Σ_I (Z_I + S_J)² / [Z_I(x_i − µ̂_I) + S_J(x_i − µ̂_J)]².   (7)

Thus the original multivariate optimization program is reduced to a series of two-dimensional problems. Provided that a solution exists for a given set I, the nonlinear equations (6) and (7) can be easily solved by an iteration method as in the Supplement.

To formulate the main result of this section, let, for a non-empty set I,

L(I) = min_{µ, z_i ≤ s_i⁻², i ∈ I} Σ_I [ (x_i − µ)² z_i − log z_i ].   (8)

Because of formulas in the Appendix,

L(I) = min_{µ̂_I, Z_I} { Z_I S_J (µ̂_I − µ̂_J)²/(Z_I + S_J) + Σ_I [ (Z_I + S_J)²(x_i − µ̂_I)² / [Z_I(x_i − µ̂_I) + S_J(x_i − µ̂_J)]² − log( (Z_I + S_J)² / [Z_I(x_i − µ̂_I) + S_J(x_i − µ̂_J)]² ) ] }.

If for some i ∈ I, z_i satisfies (5) and z_i s_i² ≥ 1, then we put L(I) = ∞.

Theorem 2.1. One has

min_{z₁ ≤ s₁⁻², …, z_n ≤ s_n⁻², µ} L(µ, z₁, …, z_n) = min_I { L(I) + Σ_J [ s_j⁻²(x_j − µ̂_J)² + log s_j² ] },

where I denotes a proper subset of {1, …, n}, J = I^c, and L(I) is defined in (8) with µ̂_I and Z_I satisfying the simultaneous equations (6), (7). The sufficient condition for the minimum in (8) to be attained is that

Z_I < S_J.   (9)

If Z_I is the minimizer of (8), then Z_I ≤ S_J.

To find the optimizing z's, enumerate all proper subsets I of {1, …, n}, and determine for each I all solutions of (6) and (7) satisfying (9) for which min_{i∈I} |µ̂ − x_i| s_i⁻¹ > 1 and max_{j∈J} |µ̂ − x_j| s_j⁻¹ ≤ 1. Then choose the solution which provides the global minimum. This procedure also applies to the Bayesian setting discussed in Section 5, where for some prior weights w_I one has to minimize the weighted likelihood. The proof of Theorem 2.1 in the Appendix deals with this situation.

3. Restricted maximum likelihood estimator

The form of the restricted likelihood function L̃ is well known. With µ̂₀ defined by (2),

L̃(τ₁², …, τ_n²) = Σ_k [ (x_k − µ̂₀)²/(τ_k² + s_k²) + log(τ_k² + s_k²) ] + log( Σ_k 1/(τ_k² + s_k²) ).   (10)


To obtain the restricted maximum likelihood estimator τ̃_k², one has to determine the minimizers in the following program,

min_{z₁ ≤ s₁⁻², …, z_n ≤ s_n⁻²} L̃(z₁, …, z_n) = min_{z₁ ≤ s₁⁻², …, z_n ≤ s_n⁻²} { Σ_k [ (x_k − µ̂₀)² z_k − log z_k ] + log( Σ_k z_k ) },

where, as in Section 2, z_k = (τ_k² + s_k²)⁻¹. Assume that the minimum of L̃(z₁, …, z_n) is attained when z_j = s_j⁻² for j ∈ J. We use the notation given in Section 2, but with µ̃_I = Σ_I z_i x_i / Z_I. Now the point with coordinates z_i, i ∈ I, is stationary if

z_i = (Z_I + S_J) / [1 + (Z_I + S_J)(x_i − µ̃)²] = (Z_I + S_J)² / { [Z_I(x_i − µ̃_I) + S_J(x_i − µ̂_J)]² + Z_I + S_J },   (11)

where µ̃ = (Z_I µ̃_I + S_J µ̂_J)/(Z_I + S_J). One gets from (11) two equations for µ̃_I and Z_I,

µ̃_I = Σ_I (Z_I + S_J)² x_i / ( Z_I { [Z_I(x_i − µ̃_I) + S_J(x_i − µ̂_J)]² + Z_I + S_J } ),   (12)

Z_I = Σ_I (Z_I + S_J)² / { [Z_I(x_i − µ̃_I) + S_J(x_i − µ̂_J)]² + Z_I + S_J },   (13)

whose solution can be found by iterations. With z_i, i ∈ I, determined by (11), let

LR(I) = min_{µ̃_I, Z_I} { Σ_I [ z_i(x_i − µ̃_I)² − log z_i ] + Z_I S_J (µ̃_I − µ̂_J)²/(Z_I + S_J) + log(Z_I + S_J) }.   (14)

If for some i ∈ I, z_i s_i² ≥ 1, put LR(I) = ∞. The main result of this section is

Theorem 3.1. In the notation of Theorem 2.1 one has

min_{τ₁², …, τ_n²} L̃(τ₁², …, τ_n²) = min_I { LR(I) + Σ_J [ s_j⁻²(x_j − µ̂_J)² + log s_j² ] },

where I runs through all proper subsets of {1, …, n}, µ̃_I and Z_I satisfy the simultaneous equations (12), (13), and LR(I) is given by (14). The sufficient condition for the minimum in (14) to be attained is

Z_I < S_J + Σ_I z_i²/(Z_I + S_J) + 2[ Σ_I z_i² · Σ_I z_i²(x_i − µ̃)² − ( Σ_I z_i²(x_i − µ̃) )² ] / (Z_I + S_J)².   (15)

A non-strict inequality in (15) is necessary for the minimum in (14).

4. Discussion and example

An important feature of Theorems 2.1 and 3.1 is that the minimum sought cannot be attained when I = {1, …, n}, which means that at least one reported variance retains its value in the maximum likelihood procedure. The restricted likelihood estimator cannot let all s's stay the same (I ≠ ∅) if max_i [ (x_i − µ̂_J)² − s_i² − (Σ_{j≠i} s_j⁻²)⁻¹ ] > 0.

Conditions (9) and (15) have the same interpretation: they give upper bounds on the weight ω̂ = Z_I/(Z_I + S_J) of µ̂_I in (4) and on the similar weight ω̃ of µ̃_I in the restricted likelihood version. Because of (9), ω̂ cannot exceed 1/2, i.e., the updated variances are not allowed to dominate in the µ-estimator. Clearly (9) implies (15), but the values of Z_I and S_J depend on the chosen method. Still, the restricted maximum likelihood estimator τ̃_i² = z_i⁻¹ − s_i² = (x_i − µ̃)² + (Z_I + S_J)⁻¹ − s_i² as a rule exceeds τ̂_i² = (x_i − µ̂)² − s_i². Both are strictly positive when (x_i − µ̂)² is large but s_i² is relatively small. Large values of S_J indicate validity of (9) or (15), suggesting to increase some s_i², i ∉ J.

If s_i² → 0, then both estimators approach x_i, but for the maximum likelihood I ≠ {i}, although typically i ∈ I. For the restricted maximum likelihood estimator I = {i}, provided that x_i is not close to other data points. When s_i² → ∞ (x_i fixed), one has i ∉ I. Indeed, for the maximum likelihood estimator this happens when s_i² ≥ max_{j≠i} (x_i − x_j)² > (x_i − µ̂)² = z_i⁻¹; the same is true for the restricted maximum likelihood estimator if s_i² ≥ max_{j≠i} [ (x_i − x_j)² + (Σ_{j≠i} s_j⁻²)⁻¹ ] > z_i⁻¹. This fact suggests how to handle cases when an uncertainty, say s_i², is missing: take the imputed s_i² as the smallest value such that i ∉ I for all larger values of s_i².

To illustrate the dependence of the estimators and of the sets I on a particular variance, Fig. 1 depicts the behavior of the weights ω̂ and ω̃ as a function of s² = s₃² when n = 3, x₁ = 1, s₁² = 0.5, x₂ = −1, s₂² = 0.1, and x₃ = 0. Whereas ω̃ is


Fig. 1. The plots of the weights ω̂ (line marked by +) and ω̃ (line marked by ∗) shown as a function of s² = s₃² when n = 3, x₁ = 1, s₁² = 0.5, x₂ = −1, s₂² = 0.1, x₃ = 0 (left panel). The right panel portrays the relative weights of x₃ in the estimator µ̂_GD (solid line), in µ̂ (line marked by +), and in µ̃ (line marked by ∗).

a continuous increasing function of s², ω̂ is not. Fig. 1 also shows the weights of x₃ in the likelihood estimators; both of these decrease monotonically but not in a continuous fashion. For small s² they coincide with the weight of x₃ in the traditional weighted mean estimator µ̂_GD = (Σ x_i/s_i²)/(Σ s_i⁻²) (the Graybill–Deal estimator). Both weights practically vanish for large s².

When s² is small, I = {1, 2} for both maximum likelihood estimators, which increase s₁² and s₂² practically up to 1. If s² is large, for the restricted maximum likelihood estimator I = {2}, and for the maximum likelihood estimator I = {1}. In this example the observation x₃ lies between x₁ and x₂, so that |x_i − µ̂_J| is not large enough to have I = {3}. If x₃ is an outlier, the picture changes: the restricted likelihood estimator modifies the variance s² when it is small, i.e., I = {3} for all s², with the maximum likelihood estimator following it for larger s². The discontinuous nature of the likelihood estimators (especially that of the classical maximum likelihood estimator) motivates the use of the smoother Bayes procedures considered in the next section.

5. Model averaging: Bayesian setting and non-informative priors

Theorems 2.1 and 3.1 provide the maximum likelihood solution in the situation where one has to select one of the models

I : σ_i² = τ_i² + s_i² > s_i², i ∈ I; σ_j² = s_j², j ∈ J = I^c.

The frequentist approach to model selection (e.g. Claeskens and Hjort, 2008) usually requires large samples. We assume the Bayes setting with the prior weights w_I depending only on the cardinality m of I. The model I corresponds to a proper subset of {1, …, n} which plays the role of the part of the data with underreported uncertainties. If it minimizes

−2 log w_I + L(I) + Σ_J [ s_j⁻²(x_j − µ̂_J)² + log s_j² ],   (16)

or

−2 log w_I + LR(I) + Σ_J [ s_j⁻²(x_j − µ̂_J)² + log s_j² ],   (17)

the model I is chosen. Along with the uniform distribution, w_I ∝ 1, we consider two different sets of weights, w_I^A ∝ exp(−m/2) and w_I^B ∝ exp(−m log(n)/2), to be attached to exp(−L/2) in (3) or to exp(−L̃/2) in (10). Then the quantities above provide estimates of Akaike's information criterion (AIC) or of the Bayesian information criterion (BIC) based on the classical or restricted likelihoods (Burnham and Anderson, 2002). The models for which the minimum is attained at τ_i², i ∈ I, with some vanishing components are rejected.

In a fully Bayesian setting, given the model I, the non-informative prior dµ ∏_I dτ_i²/(τ_i² + s_i²) is employed. This prior corresponds to the restriction of the traditional right invariant prior dµ ∏ dσ_i²/σ_i² to the relevant parametric region which excludes troublesome small σ_i². See Gelman (2006) or Garcia-Donato and Sun (2007) for a review of various noninformative prior distributions for variance components and further references.
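The selection rule (16) can be sketched as follows, given a precomputed value of L(I) (the function name and interface are ours, and the L(I) input in the test is a hypothetical placeholder):

```python
import math

def criterion(L_of_I, I, x, s2, penalty="BIC"):
    # Eq. (16): L(I) plus the contribution of the fixed-variance part J,
    # plus -2*log(w_I): m for the AIC-type weights w^A, m*log(n) for w^B.
    n = len(x)
    J = [k for k in range(n) if k not in I]
    SJ = sum(1.0 / s2[j] for j in J)
    muJ = sum(x[j] / s2[j] for j in J) / SJ
    val = L_of_I + sum((x[j] - muJ) ** 2 / s2[j] + math.log(s2[j]) for j in J)
    m = len(I)
    return val + (m * math.log(n) if penalty == "BIC" else m)
```

The model I minimizing this quantity over proper subsets of {1, …, n} is then chosen; a deviant observation typically ends up in I because removing it from J sharply reduces the quadratic term.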


The posterior probability of model I is

P(I | data) = P(data | I) w_I / Σ_K P(data | K) w_K,

where K runs through all proper subsets of {1, …, n}, and

P(data | I) = (2π)^{−(n−m)/2} ∫_{−∞}^{∞} ∏_I [ γ( (x_i − µ)²/(2s_i²) ) / |x_i − µ| ] ∏_J exp{ −(x_j − µ)²/(2s_j²) } dµ   (18)

is the integrated likelihood of I. Here γ(y) = γ(y, 1/2) = π^{−1/2} ∫₀^y e^{−u} u^{−1/2} du denotes an incomplete gamma-function (which can be replaced by the error function, Erf(z) = γ(z²)). One can compare this likelihood to the BIC approximation commonly used in model selection,

−2 log P(data | I) ≈ L(I) + Σ_J [ s_j⁻²(x_j − µ̂_J)² + log s_j² ] + m log n = BIC_I   (19)
(Raftery, 1996). However, this approximation can be poor for some I, since the integral in (18) can be non-negligible when L(I) is taken to be infinite; conversely, when (18) is very small, L(I) could correspond to the maximum likelihood solution. Put differently, under the non-informative prior the models for which the likelihood maximum is attained at τ_i², i ∈ I, with some vanishing components are to be rejected, but these cases can have non-negligible posterior probability. If µ_B(I) = E(µ | data, I) is determined from (18), then the posterior mean and variance of µ are

δ_D = E(µ | data) = Σ_I µ_B(I) P(I | data),   (20)

and

Var(µ | data) = Σ_I Var(µ | data, I) P(I | data) + Σ_I [µ_B(I) − δ_D]² P(I | data).   (21)
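Given per-model posterior probabilities, means, and variances, the mixture formulas (20) and (21) reduce to a few lines (an illustrative sketch; the function name and dictionary interface are ours):

```python
def average_models(post_prob, post_mean, post_var):
    # Eqs. (20)-(21): posterior mean and variance of mu by model averaging.
    # Each argument maps a model label I to P(I|data), E(mu|data,I), Var(mu|data,I).
    delta_D = sum(post_prob[I] * post_mean[I] for I in post_prob)
    var = sum(post_prob[I] * (post_var[I] + (post_mean[I] - delta_D) ** 2)
              for I in post_prob)
    return delta_D, var
```

The second sum in (21) is the between-model spread, which is exactly what a single-model plug-in variance misses.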

To get Var(µ | data, I), we use the estimate which corresponds to the plug-in version of the Fisher information,

Var(µ | data, I) = 1 / [ Σ_I (τ̂_i² + s_i²)⁻¹ + Σ_J s_j⁻² ].   (22)

The estimator (22) cannot be recommended in the traditional setup with one active random effects model because it underestimates the true variance of the weighted means statistic. However, the behavior of (22) is much better when the models are averaged.

In the traditional Bayes formulation, when no model choice is to be made but the lower bound s_k² for σ_k² is available, assume that the variances τ_k² are independent, with τ_k² given the noninformative prior dτ_k²/(τ_k² + s_k²), k = 1, …, n. If the parameter µ has the independent uniform distribution dµ, the posterior mean takes the form

δ_B = ∫_{−∞}^{∞} µ ∏_k [ γ( (x_k − µ)²/(2s_k²) ) / |x_k − µ| ] dµ × [ ∫_{−∞}^{∞} ∏_k [ γ( (x_k − µ)²/(2s_k²) ) / |x_k − µ| ] dµ ]⁻¹.   (23)
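Since γ(z²) = Erf(z), the one-dimensional integrals in (23) involve only the error function, and δ_B can be approximated by simple quadrature. This is a sketch under our own choices of grid and integration range; the finite limit √(2/π)/s_k is substituted at µ = x_k:

```python
import math

def posterior_density(mu, x, s2):
    # Unnormalized posterior of mu in (23): product over k of
    # gamma((x_k-mu)^2/(2 s_k^2))/|x_k-mu| = erf(|x_k-mu|/(sqrt(2) s_k))/|x_k-mu|.
    f = 1.0
    for xk, v in zip(x, s2):
        d = abs(xk - mu)
        if d < 1e-12:
            f *= math.sqrt(2.0 / math.pi) / math.sqrt(v)  # limit as mu -> x_k
        else:
            f *= math.erf(d / math.sqrt(2.0 * v)) / d
    return f

def delta_B(x, s2, spread=50.0, n=20001):
    # Posterior mean (23) by trapezoidal quadrature on a wide interval.
    smax = max(math.sqrt(v) for v in s2)
    lo, hi = min(x) - spread * smax, max(x) + spread * smax
    h = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        mu = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        f = posterior_density(mu, x, s2)
        num += w * mu * f
        den += w * f
    return num / den
```

The tails of the integrand decay like |µ|^{−n}, so the posterior mean exists for n ≥ 3 studies, consistent with the remark in the Conclusions about small numbers of studies.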

In Section 8, the statistics δ_B, δ_D, the restricted maximum likelihood estimator µ̃ and δ_R (based on (19)) are compared numerically. Fig. 2 illustrates the non-smooth behavior of µ̃, which it shares with µ̂, δ_R and δ_L, in the simple example of Section 4. One of the estimators considered there is µ̃_RE, commonly used in the random effects setting, which is discussed in the next section.

6. Random effects model and Bayes factor

The popular random effects model for meta-analysis has the negative log-likelihood function (times two)

L_RE(µ, τ²) = Σ_i [ (x_i − µ)²/(τ² + s_i²) + log(τ² + s_i²) ],

with a similar expression for the restricted log-likelihood.


Fig. 2. The plots, as a function of s₃² = s², of the estimators δ_D (line marked by ⋄), δ_B (line marked by o), µ̂_GD (solid line), µ̃ (line marked by ∗), δ_R (dash-dotted line), and µ̃_RE (dotted line) for n = 3, x₁ = 1, s₁² = 0.5, x₂ = −1, s₂² = 0.1, and x₃ = 0.

An easy recursive method to evaluate µ̂ and τ̂² is to use simple iterations determined by the formulas

µ̂_RE = Σ_k x_k/(τ̂² + s_k²) [ Σ_k 1/(τ̂² + s_k²) ]⁻¹,   (24)

τ̂² = Σ_k [ (x_k − µ̂_RE)² − s_k² ]/(τ̂² + s_k²)² [ Σ_i 1/(τ̂² + s_i²)² ]⁻¹,   (25)

with truncation at zero if the iteration process for τ̂² converges to a negative number. Thus one can get the information criteria numbers for this model, e.g., BIC_RE = L_RE(µ̂_RE, τ̂²) + 2 log n. A similar method to evaluate µ̃ and τ̃² uses

µ̃_RE = Σ_k x_k/(τ̃² + s_k²) [ Σ_k 1/(τ̃² + s_k²) ]⁻¹,   (26)

τ̃² = Σ_k [ (x_k − µ̃_RE)² − s_k² ]/(τ̃² + s_k²)² [ Σ_i 1/(τ̃² + s_i²)² ]⁻¹ + [ Σ_k 1/(τ̃² + s_k²) ]⁻¹.   (27)
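The iterations (26) and (27) are easily coded. A minimal sketch (the function name is ours; we simply truncate the τ̃² update at zero and start from τ̃² = 0 rather than from an initial estimate):

```python
def reml_random_effects(x, s2, iters=500):
    # Iterations (26)-(27) for the random effects model with a single
    # heterogeneity variance tau2; the update is truncated at zero.
    tau2 = 0.0
    for _ in range(iters):
        w = [1.0 / (tau2 + v) for v in s2]
        sw = sum(w)
        mu = sum(wi * xi for wi, xi in zip(w, x)) / sw            # eq. (26)
        num = sum(wi ** 2 * ((xi - mu) ** 2 - v)
                  for wi, xi, v in zip(w, x, s2))
        tau2 = max(0.0, num / sum(wi ** 2 for wi in w) + 1.0 / sw)  # eq. (27)
    return mu, tau2
```

With equal reported variances, the weights stay equal, so µ̃_RE is the plain average and τ̃² solves the fixed-point equation of (27) directly.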

In both of these recursive procedures one can take as an initial approximation the DerSimonian–Laird µ-estimator

µ̂_DL = Σ_i x_i/(s_i² + τ_DL²) [ Σ_i 1/(s_i² + τ_DL²) ]⁻¹,

along with

τ̂_DL² = max{ 0, [ Σ_i (x_i − µ̂_GD)² s_i⁻² − n + 1 ] / [ Σ_i s_i⁻² − Σ_i s_i⁻⁴ / ( Σ_i s_i⁻² ) ] }.

Here

µ̂_GD = Σ_i x_i s_i⁻² [ Σ_i s_i⁻² ]⁻¹

denotes the familiar Graybill–Deal estimator. There are general results on maximum likelihood and restricted maximum likelihood estimation in mixed models, and methodologies for their calculation (Pinheiro and Bates, 2000). The default version is the restricted maximum likelihood estimator. To evaluate this estimator, Rukhin (2015) suggested a data transformation with a numerical procedure to get


Table 1
1998 data on the Newtonian gravitational constant for 10 labs in 10⁻¹¹ m³ kg⁻¹ s⁻² units.

i     1       2       3       4       5       6       7       8       9       10
x_i   6.6699  6.6726  6.6729  6.6735  6.6740  6.6742  6.6754  6.6830  6.6873  6.7154
s_i   0.0007  0.0008  0.0005  0.0029  0.0007  0.0007  0.0015  0.0011  0.0094  0.0005

Table 2
2002 data on the Newtonian gravitational constant for 8 labs.

i     1       2       3       4       5       6       7       8
x_i   6.6709  6.6729  6.6739  6.6740  6.6741  6.6742  6.6743  6.6756
s_i   0.0005  0.0007  0.0015  0.0008  0.0110  0.0007  0.0094  0.0007

Table 3
CCT-K4 data for 12 labs in mK units.

i     1      2      3      4      5      6      7      8      9      10     11    12
x_i   −3.96  −0.42  −0.53  −4.30  −0.11  −8.39  −14.3  −5.16  −0.04  −4.01  0     −3.43
s_i   1.52   1.17   1.29   1.56   0.43   1.93   6.42   1.79   2.45   1.58   0.71  1.08

accurate confidence intervals for µ. These intervals are based on higher order asymptotic inference; they are compared in Section 8 to the credible intervals based on the Bayes rules δ_D in (20) and δ_B in (23). An alternative model selection setting based on homogeneity clusters is considered in Rukhin (2019).

To test more general hypotheses H_I : σ_i² > s_i², i ∈ I; σ_j² = s_j², j ∉ I, one can use the approximation (19) for the Bayes factors

B_{IK} = 2 [ log P(data | I) − log P(data | K) ] ≈ BIC_K − BIC_I.   (28)

The random effects model (σ_k² = τ² + s_k²) can be included in the selection process along with the models H_I by employing BIC_RE − BIC_I. Notice that it is possible that, say, I ⊂ K but P(data | I) > P(data | K); Section 7.2 provides a practical example of such a situation.

7. Examples

7.1. Determination of Newton's gravitational constant

We look here at an example important in physics, the Newtonian gravitational constant. The experimental data (in 10⁻¹¹ m³ kg⁻¹ s⁻² units) on n = 10 measurements reported by Mohr and Taylor (2000) are reproduced in Table 1. The consensus value of Newton's constant in this study was given as 6.6731 × 10⁻¹¹ m³ kg⁻¹ s⁻², which coincided with the previous (1986) estimate. However, the former had a larger overall relative uncertainty (6.0 × 10⁻⁵) since the values x₁₀ and s₁₀ were suspected to be not very reliable. As a matter of fact, this data point was removed later (Mohr and Taylor, 2005).

Both for the uniform prior and for the prior weights w_I^B, the maximum likelihood estimator as well as the corresponding restricted likelihood estimator modify the uncertainty of the questionable lab 10 only (I = {10}), increasing it to 0.0419 (i.e., multiplying it by a factor of 84). This changes the consensus mean estimate to µ̃ = 6.67351. The Bayes procedure δ_B is practically equal to the maximum likelihood estimators; the estimator δ_D = 6.67488 gives the interval [6.66226, 6.68751]. The Bayes factor (28) for the null hypothesis of the random effects model versus the alternative, τ₁₀² > 0, is of order exp(−60), making the null hypothesis untenable.

Each of our estimates µ̂ and µ̃ is quite different (about 75 standard deviations apart) from the consensus value determined in this study (6.6731). However, they are much closer to the current (2017) value, 6.67408, which is three standard deviations (of the 2002 study) from the consensus value given then. (See Table 2.) Now the maximum likelihood and restricted maximum likelihood estimators increase only lab 1's reported standard error (by a modest factor of 6.3). Both of the estimates are practically equal to δ_B = δ_L = δ_R = 6.67413, which in its turn is very close to the 2017 value.
The confidence interval for δ_D ranges from 6.67313 to 6.67524, with a wider set [6.67192, 6.67476] for δ_B. Now the null hypothesis of the random effects model cannot be rejected, with a Bayes factor of 0.19.

7.2. Key comparison CCT-K4

The temperature interlaboratory comparison CCT-K4 was carried out to compare the difference of the silver (Ag) fixed-point temperature realized at the participating national institutes relative to a master circulated cell (Nubbemeyer and Fisher, 2002). The results in mK units are given in Table 3. For both maximum likelihood estimators (the uniform prior), I = {5, 11}, increasing the uncertainties of these two labs to 2.72 and 2.80 respectively. Then µ̂ = −2.8933 and µ̃ = −2.8981.


Table 4
Confidence intervals for µ in the CCT-K4 study.

              µ̂        µ̃        δ_D      δ_B      δ_L      δ_R
Upper limit   −1.9455  −1.9495  0.5219   0.9660   0.3399   0.5971
Center        −2.8933  −2.8981  −1.5060  −1.7709  −0.4609  −0.5861
Lower limit   −3.8410  −3.8467  −3.5339  −4.5078  −3.1978  −1.7693

Fig. 3. Mean squared errors of the estimators µ̃ (line marked by ∗), δ_D (line marked by circles), δ_B (line marked by diamonds), δ_R (dash-dotted line), and µ̃_RE (dotted line) for n = 7 (left panel) and n = 14 (right panel).

The same highly heterogeneous data set was analyzed by Cox (2007), with the conclusion that the four labs 2, 3, 5, and 11 are to be excluded from the largest consistent subset. That set was defined to be formed by the cases for which Σ_k (x_k − µ̂)²/s_k² does not exceed the critical point of the χ²-distribution for some predetermined significance level (0.05 in this case). This approach, which has been criticized by Toman and Possolo (2009), gives the estimate −4.26. However, neither the maximum of the likelihood function nor that of the restricted likelihood is attained at I = {2, 3, 5, 11}! From the perspective of this work, a better definition of the largest consistent subset is J = I^c corresponding to the (restricted) maximum likelihood. Along with the corresponding confidence intervals, our six estimators of the temperature difference are given in the center line of Table 4. They are considerably closer to zero than the above-mentioned value −4.26.

8. Simulation results

To compare the properties of the considered estimators, we performed Monte Carlo simulations involving the mean squared errors of the µ-estimators, the coverage probability of the corresponding confidence intervals, and their widths. These estimators are: (i) µ̃ which minimizes (17) with w_I^B ∝ exp(−m log(n)/2); (ii) the two exact Bayes rules, δ_D from (20) and δ_B from (23); (iii) the restricted likelihood Bayes model average δ_R based on the approximations (19) and (22); (iv) the random effects model based restricted maximum likelihood estimator µ̃_RE. The classical maximum likelihood estimators µ̂ and δ_L are not shown as their behavior is not very different from that of the restricted likelihood estimators. The weights w_I^B were chosen as they possess smaller mean squared error than those corresponding to the uniform weights or to the AIC weights w_I^A.

The width of the 95%-confidence intervals centered at µ̃ was taken as t_{0.025}(n − 1) [ Σ_k ω_k (x_k − µ̃)²/(n − 1) ]^{1/2} with normalized weights, Σ_k ω_k = 1; see Section 7.3.4 in Hartung et al. (2008). For the model I, ω_i ∝ (τ̂_i² + s_i²)⁻¹, i ∈ I, ω_j ∝ s_j⁻², j ∈ J, and n is replaced by n − m. The posterior distribution based credible intervals were employed for the Bayes rules in (ii), and (22) was used under a t(n − 1)-approximation for the approximate pivotal quantity (δ_L − µ)/[Var(µ | data)]^{1/2}.

The reported numerical results were obtained for n = 7 (25,000 runs) and n = 14 (5000 runs) as a function of p, p = 0 : 0.1 : 1, which represents the proportion of cases affected by additional between-study errors. More exactly, scaled χ² realizations s_i² ∼ χ_i²/i, independent Bernoulli(p) random variables ϵ_i, independent normal Z_i, and independent exponential τ_i² were generated to obtain x_i = (ϵ_i τ_i² + s_i²)^{1/2} Z_i, i = 1, …, n. In Figs. 3–5 this noise is given by


Fig. 4. Coverage probabilities versus p for six confidence intervals for n = 7 (left panel), and n = 14 (right panel).

Fig. 5. The widths of confidence intervals versus p under the same demarcation of lines for n = 7 (left panel), and n = 14 (right panel).

exponentially distributed random variables, but the results for other moderately large perturbations are similar. Thus, when p = 0, the x's and s's correspond to the fixed effects model (τ_i² ≡ 0); when p = 1, all x's have their variance increased by independent exponential errors.

In Fig. 3 the mean squared errors of five estimators are shown. Clearly µ̃_RE and µ̃ have the largest such error, and the group (ii) has the smallest. The estimators δ_B and δ_D behave similarly. Fig. 4 displays the coverage probabilities of the corresponding intervals for the nominal 95% confidence coefficient. In all simulations the credible interval based on the Bayes model averaging δ_D maintains this nominal coefficient quite well. The interval which corresponds to δ_B also seems to be reasonable. The intervals for the estimators µ̃_RE and µ̃ have inadequate coverage for larger p. Fig. 5, which portrays the widths of the confidence intervals, supports these statements. The intervals corresponding to µ̃ may be unrealistically short; those for δ_D and δ_B are unsurprisingly wider.

The reported and other simulation results confirm that the Bayes rules have noticeably smaller mean squared error and better coverage probability than the maximum likelihood estimators (or the DerSimonian–Laird estimator) derived from the classical random effects setting. For this reason we compare in Fig. 6 the intervals based on δ_D and δ_B to two intervals based on higher order asymptotic inference, namely the Bartlett adjusted interval and the Skovgaard interval (Brazzale et al., 2007). Under our setting these famously accurate intervals obtained from the random effects model have somewhat smaller coverage probability than the credible δ_D-interval but are considerably wider for large p. The performance of the latter interval, when the random effects model is in fact true, is quite good, especially when the heterogeneity variance is not very small.
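The data-generating mechanism of this section can be sketched as follows (our own minimal implementation; the rng interface and seeding are illustrative):

```python
import math
import random

def simulate_study(n, p, rng):
    # s_i^2 ~ chi^2_i / i, eps_i ~ Bernoulli(p), tau_i^2 ~ Exp(1),
    # and x_i = sqrt(eps_i * tau_i^2 + s_i^2) * Z_i with Z_i standard normal.
    xs, s2s = [], []
    for i in range(1, n + 1):
        s2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(i)) / i
        eps = 1 if rng.random() < p else 0
        tau2 = rng.expovariate(1.0)
        xs.append(math.sqrt(eps * tau2 + s2) * rng.gauss(0.0, 1.0))
        s2s.append(s2)
    return xs, s2s
```

With p = 0 every ϵ_i vanishes and the fixed effects model is recovered; with p = 1 every study receives an exponential heterogeneity variance.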

A.L. Rukhin / Journal of Statistical Planning and Inference 202 (2019) 1–13

11

Fig. 6. The widths (left panel) and the coverage probabilities (right panel) of the confidence intervals based on δD (line marked by circles), on δB (line marked by diamonds), and two procedures based on higher order asymptotic inference, the Bartlett adjusted interval (x-marked line), and the Skovgaard interval (solid line) for n = 7.

9. Conclusions

To combine information from heterogeneous data sets whose reported uncertainties are not reliable, the suggestion is to treat these uncertainties only as lower bounds, some of which are to be increased by maximum likelihood. In all considered examples this choice appears to be justified: a study with a small reported uncertainty si has its standard error enlarged if its summary result xi is quite different from the consensus value. One can contemplate a more realistic setting where the uncertainties are taken to be random realizations of, say, χ²-variables.

The derived maximum likelihood estimators are useful for selecting a model or performing a Bayes meta-analysis. The estimator (20), based on the non-informative reference prior in (18), can be employed, unlike the traditional Jeffreys prior, which does not yield a proper posterior for a small number of studies. The Bayes estimator (23) can also be recommended. According to the simulations, both of these rules provide confidence intervals with good coverage even when a substantial proportion of data points is affected by additional noise. They are easy to evaluate numerically via one-dimensional integration, without recourse to MCMC methods. Indeed these procedures show promise for meta-analysis when a few heterogeneous studies are to be combined. When the number of studies is large, the exact calculation of the likelihood procedures becomes computationally infeasible, but the present methodology can still be used if the number m of cases with allowed increased uncertainty is taken to be fairly small.

Acknowledgments

Thanks are due to W. Guthrie for his encouragement to consider Bayes model averaging methods, to two referees for penetrating comments, and to L. Benz for help with Monte Carlo simulations. The discussion with J. Splett of the procedure mentioned in Appendix A.1 is also acknowledged. This work has been supported by the Information Technology Laboratory Grant BTF ''Taming Model Uncertainty'' at the National Institute of Standards and Technology, United States.

Appendix

A.1. Proof of Theorem 2.1 and comments

In the notation of Section 2, one gets
$$Z_I(\hat\mu_I - \hat\mu)^2 + S_J(\hat\mu_J - \hat\mu)^2 = \frac{Z_I S_J}{Z_I + S_J}\,(\hat\mu_I - \hat\mu_J)^2.$$
Therefore one can rewrite the objective function in the form
$$\sum_I \left[ z_i (x_i - \hat\mu_I)^2 - \log z_i \right] + \frac{Z_I S_J (\hat\mu_I - \hat\mu_J)^2}{Z_I + S_J} + \sum_J \left[ s_j^{-2} (x_j - \hat\mu_J)^2 + \log s_j^2 \right],$$
where the last sum does not involve $z_i$, $i \in I$, implying (5).


The solution for $\hat\mu_I$ must belong to the interval $R_I = [\min_I x_i, \max_I x_i]$, and that for $Z_I$ cannot exceed $\sum_I s_i^{-2}$, suggesting convenient initial approximations: $\hat\mu_I^{(0)} = \sum_I x_i s_i^{-2} / \sum_I s_i^{-2}$, $Z_I^{(0)} = 0.5 \sum_I s_i^{-2}$, with $\hat\mu_I^{(p+1)}$ and $Z_I^{(p+1)}$ obtainable from the right-hand side of (6) and (7) for $\hat\mu_I^{(p)}$ and $Z_I^{(p)}$.

The matrix $\hat H = \{\hat h_{i\ell}\}$, $i, \ell = 1, \dots, m$, formed by the second derivatives of the likelihood function evaluated at $z_i$ has to be non-negative definite. Because of the equality
$$\frac{d \hat\mu}{d z_i} = \frac{x_i - \hat\mu}{Z_I + S_J},$$
its elements have the form
$$\hat h_{i\ell} = \frac{\delta_{i\ell}}{z_i z_\ell} - \frac{2 (x_i - \hat\mu)(x_\ell - \hat\mu)}{Z_I + S_J},$$
where $\delta_{i\ell}$ denotes the Kronecker symbol, $\delta_{i\ell} = 1$ if $i = \ell$, $= 0$ if $i \ne \ell$.

Let $a$ be the $m$-dimensional vector with coordinates $2^{1/2} z_i (x_i - \hat\mu)(Z_I + S_J)^{-1/2}$. The Hessian $\hat H$ is non-negative definite if and only if the matrix $I - a a^T$ is. The form of the characteristic polynomial of this matrix shows that this condition means that $a^T a \le 1$, or $2 \sum_I z_i^2 (x_i - \hat\mu)^2 \le Z_I + S_J$. Because of (5), this inequality can be written as $\sum_I (x_i - \hat\mu)^{-2} = Z_I \le S_J$, i.e. a non-strict version of (9) holds.

Since $\hat H$ has to be non-negative definite at the point of minimum, (9) implies that the minimum of $L$ cannot be attained when $J = \emptyset$, so that $m < n$. ■

When $I = \{i\}$ is a one-point set, $S_J = \sum_{j \ne i} s_j^{-2}$, and the simultaneous equations (6), (7) are reduced to one equation for $z_i$,
$$z_i = \frac{(z_i + S_J)^2}{S_J^2 (x_i - \hat\mu_J)^2}.$$
This equation, quadratic in $z_i$, has a (positive) root if and only if
$$d_i = S_J (x_i - \hat\mu_J)^2 > 4, \qquad (29)$$
in which case the solution satisfying (9) is
$$z_i^{-1} = S_J^{-1} \left( d_i/2 - 1 + \sqrt{d_i^2/4 - d_i} \right).$$
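As a numerical sanity check (our sketch, with an illustrative helper name and sample values, not code from the paper), this root can be computed and substituted back into the quadratic equation above:

```python
import math

def z_one_point(d, S_J):
    """Root of z * (S_J * d) = (z + S_J)**2 satisfying (9), where
    d = S_J * (x_i - mu_J)**2; it exists only when d > 4."""
    if d <= 4:
        return None
    inv_z = (d / 2 - 1 + math.sqrt(d * d / 4 - d)) / S_J
    return 1.0 / inv_z

S_J, d = 3.0, 6.5                      # illustrative values
z = z_one_point(d, S_J)
t = d / S_J                            # t = (x_i - mu_J)**2
residual = z - (z + S_J) ** 2 / (S_J ** 2 * t)
print(abs(residual))                   # numerically zero
```

The formula picks the smaller of the two quadratic roots (the one with the larger $z_i^{-1}$), which is the one compatible with (9).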

For this solution one must have $z_i < s_i^{-2}$. Under these two conditions, i.e. when $|x_i - \hat\mu_J| > s_i + s_i^{-1} S_J^{-1}$, $L(\{i\})$ is determined from (8). The inequality $L(\{i\}) + w < L(\emptyset)$ is valid if and only if
$$\frac{d_i - \sqrt{d_i^2 - 4 d_i}}{2} + \log\!\left( \frac{d_i - 2 + \sqrt{d_i^2 - 4 d_i}}{2} \right) + 1 + w \le \frac{d_i}{s_i^2 S_J + 1} + \log\!\left( s_i^2 S_J \right),$$
meaning that $d_i \ge d_i(s_i^2 S_J)$, a cutoff determined by $s_i^2 S_J$. Then the maximum likelihood estimator $\hat\tau_i^2 = z_i^{-1} - s_i^2$ is positive. If $d_i \le 4$ or $z_i \ge s_i^{-2}$, $L(\{i\}) = \infty$.

Actually the two-dimensional optimization scheme above can be replaced by a one-dimensional program. Introduce two polynomials $P_I(x) = \prod_{i \in I} (x - x_i)^2$ and $Q_J(x) = \sum_{j \in J} (x - x_j)^2 s_j^{-2}$. Then $\hat\mu = x$ in (6) is a solution of the following equation,

$$0 = \sum_{k=1}^n (x - x_k) z_k = \sum_{i \in I} \frac{1}{x - x_i} + \sum_{j \in J} \frac{x - x_j}{s_j^2} = \frac{1}{2} \left[ \frac{P_I'(x)}{P_I(x)} + Q_J'(x) \right], \qquad (30)$$
for which $\min_I s_i^{-2} (x - x_i)^2 > 1$. The inequality (9) means that $(\log P_I)''(\hat\mu) + Q_J''(\hat\mu) \ge 0$, so that $\hat\mu$ can be taken as a local minimizer of $\log P_I(x) + Q_J(x)$. The polynomial equation obtained from (30) has at most $m + 1$ real solutions. However the inconvenient disconnected domain in this problem makes the two-dimensional optimization more practical.
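The one-dimensional program can be sketched numerically. The helper below (`mu_candidates` is our hypothetical name, assuming numpy) clears denominators in (30) to get the polynomial equation $P_I'(x) + P_I(x) Q_J'(x) = 0$, takes its real roots, and keeps those satisfying the constraint $(\mu - x_i)^2 > s_i^2$, i.e. $z_i < s_i^{-2}$, for every $i \in I$:

```python
import numpy as np
from numpy.polynomial import Polynomial

def mu_candidates(x, s2, I):
    """Candidate values of mu from the polynomial form of Eq. (30):
    real roots of P_I'(x) + P_I(x) * Q_J'(x) = 0, filtered by the
    admissibility constraint (mu - x_i)^2 > s_i^2 for i in I."""
    x = np.asarray(x, float)
    s2 = np.asarray(s2, float)
    J = [k for k in range(len(x)) if k not in I]
    P_I = Polynomial([1.0])
    for i in I:
        P_I = P_I * Polynomial([-x[i], 1.0]) ** 2      # (x - x_i)^2 factors
    Q_J = Polynomial([0.0])
    for j in J:
        Q_J = Q_J + Polynomial([-x[j], 1.0]) ** 2 / s2[j]
    eq = P_I.deriv() + P_I * Q_J.deriv()
    roots = eq.roots()
    real = roots[np.isclose(roots.imag, 0.0)].real
    return [r for r in real if all((r - x[i]) ** 2 > s2[i] for i in I)]

# Toy example: with I = {2}, the admissible stationary point solves
# 1/(mu - x_2) + (mu - x_0) + (mu - x_1) = 0.
print(mu_candidates([0.0, 1.0, 5.0], [1.0, 1.0, 1.0], [2]))
```

Among the surviving candidates one would still select a local minimizer of $\log P_I(x) + Q_J(x)$, as in the text.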

A.2. Proof of Theorem 3.1 and comments

By differentiation of (10), one obtains (11). Provided that $z_i s_i^2 < 1$ for $i \in I$, (11) gives the minimum of $L$ only when the Hessian $\tilde H$ is non-negative definite. The elements of this matrix are
$$\tilde h_{i\ell} = \frac{\delta_{i\ell}}{z_i z_\ell} - \frac{2 (x_i - \tilde\mu)(x_\ell - \tilde\mu)}{Z_I + S_J} - \frac{1}{(Z_I + S_J)^2}, \quad i, \ell = 1, \dots, m,$$
so that $\tilde H$ is non-negative definite if the matrix $I - a a^T - b b^T$ is non-negative definite. Here $a$ is the vector defined in Appendix A.1 (with $\tilde\mu$ replacing $\hat\mu$) and $b$ is the vector with coordinates $z_i (Z_I + S_J)^{-1}$. The two eigenvalues $\lambda$ of the latter matrix which are different from 1 satisfy the equation
$$(1 - \lambda)^2 - (a^T a + b^T b)(1 - \lambda) + (a^T a)(b^T b) - (a^T b)^2 = 0,$$


and they are non-negative if and only if $a^T a + b^T b \le \min[2,\; 1 + (a^T a)(b^T b) - (a^T b)^2]$. Since
$$(Z_I + S_J)\, a^T a = 2 \sum_I z_i^2 (x_i - \tilde\mu)^2 = 2 \Big[ Z_I - \sum_I z_i^2 / (Z_I + S_J) \Big],$$
this inequality can be written as
$$(Z_I - S_J)(Z_I + S_J)^2 < \sum_I z_i^2 \Big[ 3 Z_I + S_J - \frac{2 \sum_I z_i^2}{Z_I + S_J} \Big] - 2 \Big[ \sum_I z_i^2 (x_i - \tilde\mu) \Big]^2,$$
which is equivalent to (15).

When $m = n$, i.e. $J = \emptyset$, one gets $S_J = 0$, and (15) cannot be satisfied provided that $\sum_I z_i^2 (x_i - \tilde\mu) \ne 0$. ■

When $I = \{i\}$, the simultaneous equations (12), (13) collapse into one,
$$z_i = \frac{(z_i + S_J)^2}{S_J^2 (x_i - \hat\mu_J)^2 + z_i + S_J}.$$

This equation, linear in $z_i$, has a (unique) positive solution if and only if $d_i > 1$, where $d_i$ is defined in (29). Then
$$z_i = \frac{S_J}{d_i - 1}.$$
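This closed form is easy to verify numerically; the sketch below (our illustration, with arbitrary sample values) substitutes the solution back into the fixed-point equation above:

```python
def z_reml_one_point(d, S_J):
    """Unique positive solution z_i = S_J / (d_i - 1) of the linear
    fixed-point equation for z_i; it exists only when d_i > 1."""
    if d <= 1:
        return None
    return S_J / (d - 1)

S_J, d = 3.0, 2.5                  # illustrative values
z = z_reml_one_point(d, S_J)
t = d / S_J                        # t = (x_i - mu_J)**2, so d = S_J * t
residual = z - (z + S_J) ** 2 / (S_J ** 2 * t + z + S_J)
print(abs(residual))               # numerically zero
```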

If in addition $z_i < s_i^{-2}$, meaning that $(x_i - \hat\mu_J)^2 > s_i^2 + S_J^{-1}$, (14) implies that $LR(\{i\}) + w < LR(\emptyset)$ if and only if
$$\log(d_i) + 1 + w \le \frac{d_i}{s_i^2 S_J + 1} + \log\!\left( 1 + s_i^2 S_J \right).$$

For $w > 0$ this inequality holds whenever $d_i$ is large enough, in which case the reported uncertainty $s_i^2$ is to be increased. For $w = 0$ the inequality is always true, so that if $\max_i [(x_i - \hat\mu_J)^2 - s_i^2 - S_J^{-1}] > 0$, the restricted maximum likelihood solution does not allow for $I = \emptyset$.

When $n = 2$, the restricted maximum likelihood estimator $\tilde\tau_i^2 = z_i^{-1} - s_i^2$ is not defined uniquely when $(x_1 - x_2)^2 > s_1^2 + s_2^2$. Indeed, any pair $\tilde\tau_1^2, \tilde\tau_2^2$ such that $\tilde\tau_1^2 + \tilde\tau_2^2 = (x_1 - x_2)^2 - s_1^2 - s_2^2$ then minimizes $\tilde L$. In particular one can take $\tilde\tau_1^2 = (x_1 - x_2)^2 - s_2^2 - s_1^2$, $\tilde\tau_2^2 = 0$, or $\tilde\tau_1^2 = \tilde\tau_2^2 = [(x_1 - x_2)^2 - s_1^2 - s_2^2]/2$. If $(x_1 - x_2)^2 \le s_1^2 + s_2^2$, these variance estimators vanish: $\tilde\tau_1^2 = \tilde\tau_2^2 = 0$.

Appendix B. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jspi.2018.12.003.

References

Brazzale, A.R., Davison, A.C., Reid, N., 2007. Applied Asymptotics: Case Studies in Small Sample Statistics. Cambridge University Press, Cambridge, UK.
Burnham, K.P., Anderson, D.R., 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, second ed. Springer, New York.
Claeskens, G., Hjort, N., 2008. Model Selection and Model Averaging. Cambridge University Press, Cambridge, UK.
Cox, M.G., 2007. The evaluation of key comparison data: determining the largest consistent subset. Metrologia 44, 187–200.
Garcia-Donato, G., Sun, D., 2007. Objective priors for hypothesis testing in one-way random effects models. Canad. J. Statist. 35, 303–320.
Gelman, A., 2006. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1, 515–534.
Hartung, J., Knapp, G., Sinha, B.K., 2008. Statistical Meta-Analysis with Applications. Wiley, New York.
Higgins, J., Thompson, S.G., Spiegelhalter, D.J., 2009. A re-evaluation of random effects meta-analysis. J. Roy. Statist. Soc. Ser. A 172, 137–159.
Hoaglin, D.C., 2016. Misunderstandings about Q and 'Cochran's Q test' in meta-analysis. Stat. Med. 35, 485–495.
Lee, K., Thompson, S., 2008. Flexible parametric models for random-effects distributions. Stat. Med. 27, 418–434.
Mohr, P.J., Taylor, B.N., 2000. CODATA recommended values of the fundamental physical constants: 1998. Rev. Modern Phys. 72, 351–495.
Mohr, P.J., Taylor, B.N., 2005. CODATA recommended values of the fundamental physical constants: 2002. Rev. Modern Phys. 77, 1–108.
Nubbemeyer, H.G., Fisher, J., 2002. Final report on key comparison CCT-K4 of local realizations of aluminum and silver freezing point temperatures. Metrologia 39, Tech. Suppl. 03001.
Ohlssen, D.I., Sharpless, L.D., Spiegelhalter, D.J., 2007. Flexible random-effects models using Bayesian semi-parametric models: applications to institutional comparisons. Stat. Med. 26, 2088–2112.
Pinheiro, J., Bates, D., 2000. Mixed Effects Models in S and S-Plus. Springer, New York.
Raftery, A.E., 1996. Approximate Bayes factors and accounting for model uncertainty in generalized linear models. Biometrika 83, 251–266.
Rukhin, A.L., 2013. Estimating heterogeneity variance in meta-analysis studies. J. R. Stat. Soc. Ser. B Stat. Methodol. 75, 451–469.
Rukhin, A.L., 2015. Confidence regions and intervals for meta-analysis model parameters. Technometrics 57, 547–558.
Rukhin, A.L., 2019. Homogeneous data clusters in interlaboratory studies. Metrologia 56, in press.
Thompson, M., Ellison, S.L.R., 2011. Dark uncertainty. Accred. Qual. Assur. 16, 483–487.
Thompson, S., Sharp, S., 1999. Explaining heterogeneity in meta-analysis: a comparison of methods. Stat. Med. 18, 2693–2708.
Toman, B., Possolo, A., 2009. Laboratory effects model for interlaboratory studies. Accred. Qual. Assur. 14, 553–563.
Veroniki, A.A., Jackson, D., Viechtbauer, W., Bender, R., Bowden, J., Knapp, G., Kuss, O., Higgins, J., Langan, D., Salanti, G., 2016. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Res. Synth. Meth. 7, 55–79.