Sample size methods for constructing confidence intervals for the intra-class correlation coefficient

Sample size methods for constructing confidence intervals for the intra-class correlation coefficient

Computational Statistics and Data Analysis 85 (2015) 67–83 Contents lists available at ScienceDirect Computational Statistics and Data Analysis jour...

468KB Sizes 0 Downloads 43 Views

Computational Statistics and Data Analysis 85 (2015) 67–83

Contents lists available at ScienceDirect

Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda

Sample size methods for constructing confidence intervals for the intra-class correlation coefficient Kevin K. Dobbin a,∗ , Alexei C. Ionan b a

Department of Epid. and Biostat., University of Georgia, Athens GA, Georgia

b

Department of Statistics, University of Georgia, Athens GA, Georgia

highlights • • • • •

Sample size methods and programs for reliability studies are developed. The two-way balanced analysis of variance model without interaction is the focus. The sample size guarantees a user-specified mean confidence interval width. Modified large sample and generalized confidence interval methods are used. Novel computational algorithms are developed, studied, and implemented.

article

info

Article history: Received 25 February 2014 Received in revised form 13 November 2014 Accepted 14 November 2014 Available online 9 December 2014 Keywords: Sample size Intraclass correlation coefficient Variance components Generalized confidence intervals (GCI) Modified large sample (MLS) Confidence intervals Control variates Rao-Blackwellization

abstract The intraclass correlation coefficient (ICC) in a two-way analysis of variance is a ratio involving three variance components. Two recently developed methods for constructing confidence intervals (CI’s) for the ICC are the Generalized Confidence Interval (GCI) and Modified Large Sample (MLS) methods. The resulting intervals have been shown to maintain nominal coverage. But methods for determining sample size for GCI and MLS intervals are lacking. Sample size methods that guarantee control of the mean width for GCI and MLS intervals are developed. In the process, two variance reduction methods are employed, called dependent conditioning and inverse Rao-Blackwellization. Asymptotic results provide lower bounds for mean CI widths, and show that MLS and GCI widths are asymptotically equivalent. Simulation studies are used to investigate the new methods. A real data example is used and application issues discussed. The new methods are shown to result in adequate sample size estimates, the asymptotic estimates are accurate, and the variance reduction techniques are effective. A sample size program is developed.1 Future extensions of these results are discussed. © 2014 Elsevier B.V. All rights reserved.

1. Introduction The two-way crossed analysis of variance layout is commonly encountered in psychometry (McGraw and Wong, 1996), radiology (Belge et al., 2006; Hing et al., 2007), inter-rater reliability studies (Potempa et al., 1995; Eliasziw et al., 1994), assay reproducibility studies (McShane et al., 2000; Dobbin et al., 2005), and gauge repeatability and reproducibility studies (Burdick et al., 2005). The intra-class correlation coefficient (denoted ICC b below) represents agreement between



Corresponding author. E-mail address: [email protected] (K.K. Dobbin).

1 R program can be downloaded at http://dobbinuga.com. http://dx.doi.org/10.1016/j.csda.2014.11.010 0167-9473/© 2014 Elsevier B.V. All rights reserved.

68

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

laboratories, raters, or instruments in this layout, depending on the context. The ICC b is a function of three variance components. A common goal of inter-rater reliability studies is to construct a confidence interval for the ICC b . But traditional interval methods based on maximum likelihood asymptotics perform poorly (for examples of this poor performance, see supplement (Appendix B) and (Cappelleri and Ting, 2003)). As a result, both Modified Large Sample (MLS) interval (Cappelleri and Ting, 2003) and Generalized Confidence Interval (GCI) (Weerahandi, 1993; Burdick et al., 2005) methods were developed. MLS and GCI provide nominal coverage for the ICC b (Cappelleri and Ting, 2003), (Ionan et al., 2014). But, no methods currently exist for determining the sample size required to construct GCI or MLS intervals in this layout. The GCI and MLS intervals are constructed using exact statistical methods (Weerahandi, 1995). Sample sizes for confidence intervals based on exact methods have been developed for some important contexts (Bonett, 2002; Zou, 2012), but not the one addressed in this paper. Bonett (2002) developed a sample size method for the intraclass correlation coefficient for settings in which the ICC is a function of two variance components. None of the models in Bonett (2002) match the setting addressed in this paper. Also, Bonett’s approach cannot be adopted to this setting because it is based on an asymptotic approximation to the variance of the ICC estimate. But no adequate approximation to this variance is available for the setting studied in this paper. Moreover, the MLS and GCI methods, since they do not use such a variance estimate in their construction (that is, they are based on exact calculations), may have widths unrelated to this variance. Nor can the MLS or GCI widths be expressed algebraically as simple functions of the sample size. Turning to the work of Zou (2012), he addressed the important setting of a one-way layout; the approach used an estimate of the width of the Wald interval for the sample size calculations. Zou also interestingly controlled the probability that the width is below a bound, rather than the expected width. Zou’s Wald interval approximation approach is not well suited to the context addressed in this paper because the ICC b estimator in the two-way layout is often highly skewed, the asymptotic convergence in this context is slow, and Wald intervals do not perform well in this setting. The ICC b in a two-way analysis of variance layout without interaction was discussed in McGraw and Wong (1996) and studied in Saito et al. (2006). McGraw and Wong (1996) did not give sample size formulas; they provided formulas for the lower and upper bounds of a confidence interval (their Table 7, Case 2A); in that context, the ICC b is called the ‘‘degree of absolute agreement among measurements’’, the ‘‘criterion-referenced reliability’’, or the ‘‘Type I ICC’’. Importantly, the formulas from Table 7 in McGraw and Wong for the specific setting ICC(A, 1) and case 2A, which were taken from Shrout and Fleiss (1979) and developed by Haggard (1958) in 1958, only work well when the degrees of freedom for each mean square is large (Montgomery, 2013, p. 600; Cappelleri and Ting, 2003; Ionan et al., 2014). The MLS and GCI methods studied in this paper, on the other hand, work well even if the degrees of freedom for one or more mean squares are small; thus the MLS and GCI are more robust and appropriate to use for sample size estimation. Saito et al. (2006) studied different experimental designs for this model when r0 = 1. To compare the efficiency of different designs, they used an approximation to Var(Ln(ICC b )) based on the delta method (that is, a Taylor series expansion). But since the GCI and MLS methods are based on exact calculations, their widths may not be related to the efficiencies (or standard errors) of the point estimates. At one point in their discussion, Saito et al. (2006) did use exact methods to construct the intervals and compare widths (their Table III), but strictly in a limited context of comparing optimal allocation patterns, not sample size estimation. Since the primary objective of the 2006 paper was to compare designs, they did not present an explicit sample size method as we do here, or a set of tools for determining sample size for specific settings. In our study, we also allow r0 ≥ 1 because this is a reasonable design in some cases and is not uncommon in biomarker studies (MAQC Consortium, 2006; Dobbin et al., 2005; Polley et al., 2013; McShane et al., 2000). The sample size estimates produced by our method ensure that the mean confidence interval width is below a user-specified target width. Ideally, one may wish to control the actual width of an interval to be below a target, rather than the mean width. But the actual width of both the GCI and MLS intervals are functions of the observed mean squares; since the observed mean squares are unknown at the study design phase, it is not feasible to control the actual interval width itself. One approach might be to ‘‘plug in’’ estimates of the values of these variance components that will be observed when the experiment is run. But such an approximation approach seems suboptimal since it leaves open the question of the sensitivity of the approximation to mis-specification of the variance component values that will be observed. Controlling the expected widths as we do in this paper averages over the observed values of mean squares. This averaging requires integrating the unobserved mean squares out of the width formulas. For both the MLS and GCI, this results in complex multiple integrations without clean analytic solutions. A variety of statistical computation methods are employed to make the integrations computationally tractable. Statistical computation methods are reviewed in Robert and Casella (2013). As shown below, classic Monte Carlo integration is inadequate for sample size determination because of the computational demands. For the GCI sample size method, importance sampling (Ripley, 2006) was investigated but produced relatively minor computational gains. The method of Rao-Blackwellization (Robert and Casella, 2013), which involves conditioning on variables so as to reduce the variance, produced substantial variance reduction for estimates of the cumulative distribution function. We modified this method and call the resulting procedure inverse Rao-Blackwellization. Control variates (Rothery, 1982) are another widely used variance reduction technique. They depend on identification of an effective control variate. By investigation we identified a control variate for the ICC b . For the MLS procedure, we find that the ICC b width – which is initially a function of three variance components – can be rewritten as a function of two dependent F distributions. Subsequently, using mathematical manipulations we develop a method which we call ‘‘dependent conditioning’’. This method is closely related to two-stage Gibbs sampling (e.g., Robert and Casella, 2013). Gibbs sampling requires assessment of the convergence

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

69

Table 1 ANOVA table for two-way, balanced, crossed layout, no interaction. Source

DF

MS

EMS

Biology/subject

df2 = b0 − 1

s2b

Lab/rater

df1 = l0 − 1

s2l

Error

df3 = b0 l0 r0 − b0 − l0 + 1

s2e

σe2 +l0 r0 σb2 σe2 + b0 r0 σl2 σe2

of a Markov chain Monte Carlo, which can be problematic; our approach avoids this issue by using a single Monte Carlo instead of an MCMC. Because applications of analysis of variance models for studying the ICC b are so diverse, from manufacturing to psychometry to biomarkers, we add some contextualization for applied researchers in these three areas. In the context of manufacturing, gauge repeatability and reproducibility studies are carried out in order to monitor and improve a manufacturing process (Burdick et al., 2005). The sources of variability in these experiments are typically the operators, parts and replicates. The model studied in this paper assumes no interaction between operators and parts, and that these are random effects. The ICC b in this context is the proportion of the total variation attributable to the operator-to-operator variation. In the context of psychometry, the ICC b is the proportion of total variance attributable to the objects of measurement (a.k.a., targets) (McGraw and Wong, 1996). The two-way model assumes two sources of variance; for example, one source may represent items on a mathematics test, and the other source the different children taking the test; or one source might represent different judges with varying anchor points, and the other source different individuals who are the objects of judgments. In the context of biomarkers or clinical assays, the sources of variability are typically patient-to-patient, laboratory-to-laboratory, and within-laboratory. The ICC b in that context is the ratio of the variance between patients relative to the overall variance due to all sources, and it is assumed that there is no interaction between patients and laboratories. The primary new material in this paper is (1) the presentation of the first methods and programs for generating sample size for the GCI and MLS intervals for the ICC b in this layout, (2) the derivation of the asymptotic distributions of the MLS and GCI width distributions, (3) the development of two statistical variance reduction techniques called inverse Rao-Blackwellization and dependent conditioning; these variance reduction techniques are closely related to existing techniques, but of potential future utility in expanding the results of this paper to other GCI and MLS sample size contexts. This paper is organized as follows: Section 2 presents the basic model, interval methods, and objective functions for sample size determination; Section 3 presents the asymptotic results; Section 4 discusses the interaction setting; Section 5 presents the methods for estimating the confidence interval widths and the sample sizes; Section 6 presents an application; Section 7 presents results of simulation studies; Section 8 presents discussion and conclusions. Proofs and other details appear in the Supplement (see Appendix B). 2. Methods 2.1. The statistical model: two-way layout without interaction The model for the two-way layout without interaction is Yblr = µ + Bb + Ll + eblr

(1)

where Bb ∼ N (0, σb2 ), Ll ∼ Normal(0, σl2 ) and eblr ∼ Normal(0, σe2 ) are all mutually independent. A balanced design has b = 1 . . . , b0 , l = 1, . . . , l0 and r = 1, . . . , r0 . The analysis of variance is given in Table 1. Here, Ll represents the lab-to-lab or rater-to-rater or device-to-device variability; Bb presents the subject-to-subject or sample-to-sample variability. The between-laboratory (or -rater, or -device, etc., depending on context) intraclass correlation coefficient is σb2

ρb = ICC b =

σb2 +σl2 +σe2

l0 r0 s2b



(Y¯b·· − Y¯··· )2

b

=

b0 − 1

 s2e =

. The observed mean squares in Table 1 are

b

l

b0 r 0

,

s2l



(Y¯·l· − Y¯··· )2

l

=

l0 − 1

(Yblr − Y¯b·· − Y¯·l· + Y¯··· )2

r

b0 l0 r0 − b0 − l0 + 1

.

Under the normal model assumptions (e.g., Burdick et al., 2005), df1 ∗ s2l

σe2 + b0 r0 σl2

∼ χ(2df 1) ,

df2 ∗ s2b

σe2 + l0 r0 σb2

∼ χ(2df 2) ,

df3 ∗ s2e

σe2

∼ χ(2df 3)

(2)

where χ(2m) denotes a chi-square distribution with m degrees of freedom, df1 = l0 − 1, df2 = b0 − 1 and df3 = b0 l0 r0 − b0 − l0 + 1.

70

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

2.2. Two methods for constructing an interval The ICC b = ρb is typically estimated using restricted maximum likelihood (REML). Two interval estimation procedures that have demonstrated good finite sample coverage performance for the ICC b are the Modified Large Sample (Polley et al., 2013) and the Generalized Confidence Interval (Weerahandi, 1993) methods. 2.3. The MLS method for ICC b Following notation in Burdick et al. (2005), let Fa:b,c be the a(100)th percentage point of an F distribution with b numerator and c denominator degrees of freedom. Use the conventions that Fa:b,∞ is the percentage point of a chi-square distribution with b degrees of freedom divided by its degrees of freedom (that is χ(2b) /b), and Fa:∞,c is the percentage point of c /χ(2c ) . Let t = (s2b , s2l , s2e ) be the vector of observed mean squares. The MLS method is based on the following four equations: b0 (1 − G2 )s4b − b0 s2b s2e + b0 [F5 − (1 − G2 )F52 ]s4e

L∗ (t ) =

l0 (b0 r0 − 1)s2b s2e + l0 (1 − G2 )s2b s2l /F4

U ∗ (t ) = L(t ) = U (t ) =

b0 (1 + H2 )s4b − b0 s2b s2e + b0 [F6 − (1 + H2 )F62 ]s4e l0 (b0 r0 − 1)s2b s2e + l0 (1 + H2 )s2b s2l /F3 max{0, L∗ (t )}

1 + max{0, L∗ (t )} max{0, U ∗ (t )} 1 + max{0, U ∗ (t )}

.

Here G2 = 1 − Fα/2:∞,b0 −1 , F5 = F1−α/2:b0 −1,b0 l0 r0 −b0 −l0 +1 , F4 = Fα/2:l0 −1,b0 −1 , H2 = F1−α/2:∞,b0 −1 − 1, F6 = Fα/2:b0 −1,b0 l0 r0 −b0 −l0 +1 , F3 = F1−α/2:l0 −1,b0 −1 . The equal-tailed (1 − α)100% MLS confidence interval for the ICC b is (L(t ), U (t )). The volume of the MLS interval, conditional on the observed mean squares t = (s2b , s2l , s2e ), is

vˆ MLS (t ) = U (t ) − L(t ). 2.3.1. The GCI method for ICC b Let t be a sample statistic and G(t ) a generalized pivotal quantity (GPQ) for parameter ρb as defined in Appendix A, with distribution function FG(t ) (g ) and quantile function FG−(1t ) (p) = infg {g : FG(t ) (g ) ≥ p}. Weerahandi (1993) showed that, under a.s.

a.s.

certain conditions, P (FG−(1t ) (α/2) < ρb < FG−(1t ) (1 − α/2)) −→ 1 − α , where → is almost sure convergence. This convergence assumes repeated independent experiments. For the model of Eq. (1), with t = (s2b , s2l , s2e ), the generalized confidence interval (GCI) is built from the GPQ for ρb (Burdick et al., 2005):



G(t ) =

max 0, df1 s2l b 0 r 0 W1

where W1 ∼ χ

+

df2 s2b l0 r0 W2 df2 s2b

l 0 r 0 W2



+

df3 s2e l0 r0 W3



df3 (df3 −1)s2e b0 l0 r0 W3

(3)

, W2 ∼ χ(2df2 ) , and W3 ∼ χ(2df3 ) , are mutually independent. An equal-tailed 100(1 − α)% GCI for ρb is − α/2)). Let

2 (df1 )

(FG−(1t ) (α/2), FG−(1t ) (1

vGCI (t ) = FG−(1t ) (1 − α/2) − FG−(1t ) (α/2). Note that vGCI (t ) is the volume of the GCI. We here do not put a ‘‘hat’’ on the v because the estimate of the GCI uses an estimated quantile function Fˆ −1 , rather than F −1 itself, since F −1 is unknown for this setting. The quantile function can be estimated using the type 7 estimator in Robert and Casella (2013), which is the default used in the R language (Ripley, 2006) (version 3.0.2). If the GCI is formed properly the difference between these Fˆ −1 and F −1 will be trivial. 2.4. Defining the objective functions Both the MLS and the GCI confidence interval widths, denoted vGCI (t ) and vˆ MLS (t ) above, are deterministic functions of the observed mean squares vector t. The observed mean squares is a random vector, and these width functions map this vector from t ∈ ℜ3 to the unit interval, [0, 1]. The sample size is chosen that controls the expected interval width for a given set of parameters (σb2 , σl2 , σe2 ). The experimental design is fully determined by the three design values (b0 , l0 , r0 ). Let q = (σb2 , σl2 , σe2 , b0 , l0 , r0 ). If t has

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

71

density f with respect to a measure µq (t ) on ℜ3 , then the objective functions are:

µGCI (q) = E [vGCI (t )|q] =



µMLS (q) = E [ˆvMLS (t )|q] =

vGCI (t )f (t |q)dµq (t )



(4)

vˆ MLS (t )f (t |q)dµq (t ).

(5)

Since there are three design values that must be chosen in Table 1, b0 , l0 and r0 , the question becomes how to create a useful program for determining one or all of these. The total number of observations is of course b0 × l0 × r0 . One could fix this total number of observations, as done in Saito et al. (2006). But this approach, while leading to interesting insights on design allocation in the 2006 paper, does not seem well suited to sample size determination because fixing this product is artificial, and the designs for a fixed value of the product may be limited. For example, if the product value is 55, then the possible designs are (b0 , l0 , r0 ) ∈ {(11, 5, 1), (5, 11, 1), (1, 5, 11), (1, 11, 5), (11, 1, 5), (5, 1, 11)}. This set of designs does not seem intuitively reasonable. Another option would be to specify a cost function, and determine the sample size for a fixed cost; but this method is limiting in that cost functions may vary from one application to another, and users may have difficulty specifying an explicit set of costs. Instead, to maximize the potential usefulness of our approach, we focus on settings where two of the design parameters are fixed, and the third design parameter is free. More motivation for this approach is presented in the Discussion section. We develop for a GCI-based design sample sizes that will satisfy bˆ 0 = min{b|µGCI (σb2 , σl2 , σe2 , b, l0 , r0 ) < µtarget }

ˆl0 = min{l|µGCI (σb2 , σl2 , σe2 , b0 , l, r0 ) < µtarget } rˆ0 = min{r |µGCI (σb2 , σl2 , σe2 , b0 , l0 , r ) < µtarget }. Similarly, for the MLS-based method, replace µGCI above with µMLS . 3. Asymptotic results In this section we derive the asymptotic distributions of the confidence interval widths as one design parameter goes to infinity and the other two remain fixed. These results are important because in some cases it is not possible to achieve the targeted mean width of Eqs. (4) and (5) by increasing only one design parameter (e.g., b0 → ∞). Therefore, searching for a sample size would be futile. These theorems can be used to identify these settings. Theorems are given here and proofs appear in the Supplement (see Appendix B). 3.1. The GCI method asymptotics First, note that when s2b ̸= 0, s2l ̸= 0 and s2e ̸= 0, the conditional distribution of the GPQ G(t ) given t is a mixture of a point mass at zero and an absolutely continuous distribution with positive density on (0, 1). Theorem 1. Define l2lim = limb0 →∞ b0 → ∞, d



G(t ) → max 0,

1 df1

df1

l =1

(Y ·l· − Y ··· )2 . Under the assumptions of the model of Eq. (1), if σb2 > 0, then as

σb2 2 2 σb + σe + (l2lim )/[W1 /df 1]



= G(l2lim ; ∞, l0 , r0 ) d

where W1 is chi-squared with df1 = l0 − 1 degrees of freedom. The notation → means convergence in distribution. l2lim .



l2lim

The limiting distribution in the theorem is a function of The value will have a distribution proportional to a chi-square. Define F −12 as the corresponding quantile function. From this theorem it follows that G(llim ;∞,l0 ,r0 )

Corollary 1. Under the assumptions of Theorem 1, lim v(t ) = F −12

b0 →∞

G(llim ;∞,l0 ,r0 )

lim µGCI (q) =

b0 →∞



(1 − α/2) − FG−(1l2

lim

F −12

G(llim ;∞,l0 ,r0 )

σl2

where fl2 is the density of ( df )χ(2df ) . lim

1

1

;∞,l0 ,r0 )

(1 − α/2) − FG−(1l2

lim

(α/2) and

;∞,l0 ,r0 )

(α/2)fl2 (l2lim )dl2lim lim



The utility of this result can be seen in the following observation: if the targeted confidence interval width is µtarget > limb0 →∞ µGCI (q), then the targeted width cannot be achieved for any b0 .

72

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

Theorem 2. Define b2lim = liml0 →∞ l0 → ∞,

1 df2

b0

b =1

(Y b·· − Y ··· )2 . Under the assumptions of the model of Eq. (1), if σb2 > 0, then as

(b2lim )/(W2 /df 2) G(t ) → max 0, 2 (blim )/(W2 /df 2) + σl2 + σe2 

d



where W2 is chi-squared with df2 = b0 − 1 degrees of freedom. Corollary 2. liml0 →∞ (vGCI (t )) = F −12

G(blim ;b0 ,∞,r0 )

(1 − α/2) − FG−(1b2

;b ,∞,r0 ) lim 0

σb2

vGCI (t )fb2 (b2lim )db2lim where fb2 is the density of ( df2 )χ(2df2 ) .

(α/2) and liml0 →∞ µGCI (b0 , l0 , r0 ) =

b

0 2 2 Theorem 3. Define b2lim,r = limr0 →∞ df1 b=1 (Y b·· − Y ··· ) and llim,r = limr0 →∞ 2 2 of the model of Eq. (1), if σb > 0, then as r0 → ∞,

 G(t ) → max 0,

liml0 →∞

1 df1



(Y ·l· − Y ··· )2 . Under the assumptions



b2lim,r /(W2 /df2 ) − b0 σe2

d





lim

lim



l2lim,r /(W1 /df1 ) + b2lim,r /(W2 /df2 ) + σe2

where W1 is chi-squared with df1 degrees of freedom and W2 is chi-squared with df2 degrees of freedom. Corollary 3. limr0 →∞ (vGCI (t )) = F −12

G(blim,r ,l2lim,r :b0 ,l0 ,∞)

lim µGCI (b0 , l0 , r0 ) =



r0 →∞

where fb2

lim,r

σb2

is the density of

df2

(1 − α/2) − FG−(1b2

lim (vGCI (t ))fb2

lim,r

r0 →∞

χ(2df 2) and fl2

lim,r

,l2 :b ,l ,∞) lim,r lim,r 0 0

(l2lim,r )db2lim,r dl2lim,r

(b2lim,r )fl2

lim,r

is the density of

(α/2), and

σl2 df1

χ(2df1 ) .

3.2. The MLS method asymptotics We develop similar results for the MLS intervals. Theorem 4. Define l2b0 = s2l /(b0 r0 ), and l2lim = limb0 →∞ l2b0 = limb0 →∞ r0 fixed, d

σb2



L → max 0,



σb2 + σe2 + l2lim /Fα/2:df1 ,∞

2 l (Y ·l· −Y ··· )



l 0 −1

. If σb2 > 0, then as b0 → ∞, with l0 and

. 

Theorem 5. If σb2 > 0, then as b0 → ∞, with l0 and r0 fixed, d



U → max 0,

 σb2 .  σb2 + σe2 + l2lim /F1−α/2:df1 ,∞

Corollary 4. lim µMLS (t ) =

 

b0 →∞

σb2 max 0, 2 σb + σe2 + l2lim /F1−α/2:df1 ,∞

where fl2 is the density for ( lim



r0 σl2 df1

)χ(2df1 ) .



Theorem 6. As l0 → ∞, with b0 and r0 fixed, then



Fα/2:∞,df2 b2lim,3

d

L → max 0,

Fα/2:∞,df2 b2lim + σl2 + σe2

where b2lim = liml0 →∞

2 b (Y b·· −Y ··· )



b 0 −1

.









− max 0,

σb2 σb2 + σe2 + l2lim /Fα/2:df1 ,∞



fl2 (l2lim )dl2lim lim

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

73

Theorem 7. As l0 → ∞, with b0 and r0 fixed, then F1−α/2:∞,b0 −1 b2lim



d

U → max 0,



F1−α/2:∞,b0 −1 b2lim + σe2 + σl2

. 

Corollary 5. lim µMLS (t ) =

l0 →∞

 

F1−α/2:∞,b0 −1 b2lim



max 0,



− max 0,

F1−α/2:∞,b0 −1 b2lim + σe2 + σl2 Fα/2:∞,df2 b2lim



Fα/2:∞,df2 b2lim + σl2 + σe2

where fl2 is the density for the distribution lim



r0 σb2 df2

χdf2 2 .

fl2 (l2lim )dl2lim lim



Theorem 8. As r0 → ∞, with b0 and l0 fixed, then b2lim,r Fα/2:∞,b0 −1

d

L→

b2lim,r Fα/2:∞,b0 −1 + σe2 + l2lim,r Fα/2:∞,b0 −1 /Fα/2:df1 ,df2

.

Theorem 9. As r0 → ∞, with b0 and l0 fixed, then b2lim,r F1−α/2:∞,b0 −1

d

U →

b2lim,r F1−α/2:∞,b0 −1 + σe2 + l2lim,r /F1−α/2:df1 ,df2

.

Corollary 6. lim µMLS =

r0 →∞

where fb2

lim,r



[max {0, L(Theorem 9)} − max {0, L(Theorem 8)}] fb2

lim,r

(b2lim,r )fl2

lim,r

(l2lim,r )db2lim,r dl2lim,r

is the density of σb2 × χ(2df ) /df2 and fl2 is the density of σl2 × χ(2df ) /df1 . 2 1 lim,r

3.3. Relating the asymptotics From Theorem 1, we can note that the limiting distribution of the GCI width is a monotone increasing function of W1 . Therefore, if we have a very large Monte Carlo sample, we will have the approximate equality Fˆ −12 (1 − α/2) ≈ max(0,

σb2 σb2 +σe2 +(l2lim )/F1−α/2;df 1,∞

G(llim ;∞,l0 ,r0 )

) which is the corresponding term from Theorem 4 for the MLS. Similarly, the results of

Corollaries 1 and 4 lead to similar estimates, as do the results of Corollaries 2 and 5. In fact, the MLS and GCI methods are asymptotically equivalent in respect to these interval widths. To our knowledge, we are the first to make this observation. 4. Estimation A formula for the sample size could be derived if Eqs. (4) and (5) could be written explicitly as functions of b0 , l0 and r0 . But this is not the case. On the other hand, if the intervals could be calculated quickly, then an iterative search could be used to determine the sample size. This is our approach. Table 2 shows why this approach is needed. Brute force computation methods can take over 6 h. Our computation methods produce the same answer in less than one minute. 4.1. Estimating µMLS Our approach to estimating the mean µMLS is based on recasting the MLS formula as a function of two dependent F distributions. By explicitly deriving the conditional distribution of one F given the other, this permits us to integrate out one of the F variables, avoiding one Monte Carlo integration. Moreover, the correlation between the F ’s results in further variance reduction and the MLS mean can be calculated from a relatively small simulation from a single F distribution. We call this method ‘‘dependent conditioning’’. Here is an outline of the dependent conditioning approach: 1. Rewrite MLS equations for L(t ), U (t ), and U (t ) − L(t ) as functions of 2 F distributions. 2. Calculate the bivariate joint density, and the conditional density function for one F distribution given the other F distribution.

74

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83 Table 2 Computing time comparison: Each method uses the sample size algorithm described in the paper with k = 50 and bmin = 5. Times are system times from computer clock. SR is savings ratio, defined as pure MC time over our method’s time. Pure MC computes widths using a pure Monte Carlo with 1000 replications. ICC b = 0.9 and ICC w = 0.5.

µtarget

l0

0.20 0.20 0.20 0.20 0.15 0.15

r0

20 20 30 30 10 10

2 2 30 30 5 5

Interval type

System times

MLS GCI MLS GCI MLS GCI

Pure MC (min)

Our method (min)

1.09 404.08 18.21 372.60 8.26 417.46

0.29 0.40 0.26 0.36 0.46 0.38

SR

bˆ 0

4 1010 70 1035 18 1099

12 11 9 9 83 40

3. Using the conditional density, integrate out one variable so that the MLS limits, U (t ) and L(t ), are each functions of a single variable. 4. Perform a single Monte Carlo using an F distribution. First, note that under the model of Eq. (1),



s2e σe2



=

s2l s2e

s2b σe2 +l0 r0 σb2

/



s2e

σe2



=

s2b

σe2

s2e (σe2 +l0 r0 σb2 )

= W has an F distribution, and

s2l

σe2 +b0 r0 σl2

/

σe2

= Y also has an F distribution. Also  k1 W + k2 + k3 /W L = max 0, k1 W + k2 + k3 /W + k4 + k5 Y (σe2 +b0 r0 σl2 )



where k1 = b0 (1 − G2 )

(σe2 + l0 r0 σb2 ) σe2

k2 = −b0 k3 = b0 [F5 − (1 − G2 )F52 ]



σe2 2 (σe + l0 r0 σb2 )



k4 = l0 (b0 r0 − 1) k5 = (l0 (1 − G2 )/F4 )



 (σe2 + b0 r0 σl2 ) . σe2

After the calculations in Appendix A, the conditional distribution is: f (y|w) =

df2 /2+df3 /2

df3 + df2 w



df3 + df2 w + df1 y

(df3 + df2 w + df1 y)−df1 /2 df1 /2−1 df1 /2 y df1 . Beta(df1 /2, df2 /2 + df3 /2)

The expectation conditional on W = w is m1 (w)

 E [L|w] =

m2 (w) + m3 y

f (y|w)dy

where m1 (w) = k1 w + k2 + k3 /w, m2 (w) = m1 (w) + k4 and m3 = k5 . An analytic solution to the integral that uses the hypergeometric function is presented in Appendix A. Alternatively, the integration can be performed numerically. Similarly, for the upper limit: k6 W + k7 + k8 /W



U = max 0,



k6 W + k7 + k8 /W + k9 + k10 Y

where k6 = b0 (1 + H2 )

(σe2 + l0 r0 σb2 ) σe2

k7 = −b0 k8 = b0 [F6 − (1 +

σe2 ) ] (σe2 + l0 r0 σb2 )

H2 F62



k9 = l0 (b0 r0 − 1) k10 = (l0 (1 + H2 )/F3 )



 (σe2 + b0 r0 σl2 ) . σe2



K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

Then

m4 (w)

 E [U |w] =

m5 (w) + m6 y

75

f (y|w)dy

where m4 (w) = k6 w + k7 + k8 /w, m5 (w) = m5 (w) + k9 and m6 = k10 . The analytic solution is easily derived from the formula for E [L|w] presented in Appendix A; just replace (m1 , m2 , m3 ) with (m4 , m5 , m6 ). Alternatively, one can use numerical integration to solve. The final width estimator is

vˆ MLS =

m0 1 

m0

E [U |wm ] − E [L|wm ]

m=1

where m0 is the number of Monte Carlo, and the wm are independent and identically distributed from F(df2 ,df3 ) . 4.2. Estimating µGCI We want to estimate µGCI =



vGCI (t )f (t |q)dµq (t ). Our approach can be outlined as follows.

1. The quantity vGCI (t ) inside the integral is typically estimated with simple Monte Carlo. We propose a new method based on a Rao-Blackwellization (Robert and Casella, 2013) of the cumulative distribution function, and taking the function inverse. We term this method ‘‘inverse Rao-Blackwellization’’.  2. The outer integral in vGCI (t )f (t |q)dµq (t ) displays excessive variance. But vGCI (t ) is correlated with s2l /s2b , and this correlation is very high when the ICC b is large. Therefore, utilizing this quantity as a control variate further reduces the variance and computation time. We describe the inverse Rao-Blackwellization. Let (W1m , W2m , W3m ) for m = 1, . . . , m0 be the Monte Carlo observations. Consider the two estimators of the quantile function FG−(1t ) based on Monte Carlo samples: 1. Monte Carlo estimator: For a Monte Carlo sample, m = 1, . . . , m0 , the estimator is FˆG−(1t ) (α), the quantile function such as (Hyndman and Fan, 1996) estimator 7. 2. Inverse Rao-Blackwellization estimator: For a Monte Carlo sample, m = 1, . . . , m0 , the estimator is:

 FˆG−(1t ):RB (α)

= inf g : α ≤

m0 1 

m0

 P [G(t ) ≤ g |W2m , W3m ] .

(6)

m=1

Estimator (1) is the traditional simple Monte Carlo estimator. Estimator (2) is our proposed inverse Rao-Blackwellization estimator. This estimator conditions on the W2 and W3 in Eq. (3). It turns out that the conditional probability in Eq. (6) can be calculated exactly from percentage points of an inverted gamma distribution. Specifically, as shown in Appendix A. Lemma 1. Under the model of Eq. (1): P (G(t ) ≤ g0 |W2 , W3 ) = 1 − Finv χ 2 (df 1) where M2 = 1/W2 , M3 = 1/W3 , a =

df2 s2b



max {0, M2 a − bM3 } − ag0 M2 − dg0 M3



cg0

/(l0 r0 ), b =

df3 s2e

/(l0 r0 ), c = df1 s2l /(b0 r0 ) and d = b(df3 + 1)/b0 .

Lemma 1 provides a way to estimate the quantile function, and subsequently vGCI (t ), with a much smaller number of replications than the traditional 100,000 or more needed for simple Monte Carlo in this context (Burdick et al., 2005). Now we turn to estimating the outer integral v(t )f (t )dt, the average of these v(t ) over the distribution of the observed mean squares. There is an approximately linear relationship between v(t ) = FG−(1t ) (1 − α/2) − FG−(1t ) (α/2) and s2l /s2b , when

ICC b is over 0.7. Let Lˆ GCI = FˆG−(1t ):RB (α/2) be the estimated lower endpoint of the confidence interval based on the inverse Rao-Blackwellization described above. Then consider Lˆ ctrlvar = Lˆ GCI − GCI

cov(Lˆ GCI , s2l /s2b ) Var(s2l /s2b )

× (s2l /s2b − E [s2l /s2b ]).

For the upper endpoint, Uˆ GCI = FˆG−(1t ):RB (1 − α/2). Then, ctrlvar Uˆ GCI = Uˆ GCI −

cov(Uˆ GCI , s2l /s2b ) Var(s2l /s2b )

× (s2l /s2b − E [s2l /s2b ]).

The mean and variance of s2l /s2b are E [s2l /s2b ] =

σe2 +b0 r0 σl2 σe2 +l0 r0 σb2

·

b0 −1 , b0 −3

and Var(s2l /s2b ) =



σe2 +b0 r0 σl2 σe2 +l0 r0 σb2

2

2(b0 −1)2 (b0 +l0 −4) . (l0 −1)(b0 −3)2 (b0 −5)

covariance terms are estimated with the sample covariances in the Monte Carlo output. The final estimator is ctrlvar ctrlvar vˆ GCI = Uˆ GCI − Lˆ ctrlvar . GCI

The

76

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

4.3. Sample size estimation The sample size is estimated from the asymptotic results and the width functions developed above. Here is an example outline for determining b0 to ensure that, for a user-specified width µtarget , the interval will satisfy µtarget > µMLS .

Sample size algorithm 1. Use the theorems in Section 3 to evaluate whether the targeted width is achievable. If it is not, then stop and return a message indicating the smallest possible width. 2. If the width is achievable, let bmin be a minimum acceptable value for b0 (e.g., b0 = 5 replicates). Define g (b) = vˆ MLS (σb2 , σl2 , σe2 , b, l0 , r0 ) − µtarget . If g (bmin ) ≤ 0, return bmin and exit. Otherwise, let a = bmin and b = bmin + k, with k a positive integer (e.g., k=100). 3. Update g (b) = vˆ MLS (σb2 , σl2 , σe2 , b, l0 , r0 ) − µtarget . (a) If g (b) ≤ 0, then perform a search for the zero on (a, b), e.g., by bisection, and find b∗ such that g (b∗ ) ≈ 0. (b) If g (a) > 0 and g (b) > 0 then let a = b, b = b + k (e.g., k = 100), and return to (3). 4. Return bˆ 0 = b∗ + 1.

5. The effect of an interaction A natural extension of Eq. (1) is to add an interaction term between the factors, resulting in the model equation: yblr = µ + Bb + Ll + BLbl + eblr

(7)

with BLbl ∼ Normal(0, σ ) an interaction random effect. This corresponds to Case 2 of McGraw and Wong (1996). For 2 bl

this model, in many applications interest may be in ICC ∗b =

σb2 σb2 +σl2 +σbl2 +σe2

which corresponds to the between-lab intraclass

correlation coefficient. McGraw and Wong (1996) call this the degree of absolute agreement, and Burdick et al. (2005) the ratio of individual variance to total variability. A potential issue that may arise is that during data analysis an interaction may be detected by testing, and this may lead to constructing a GCI or MLS interval for ICC ∗b instead of for ICC b . We investigated the question of how robust our sample size methods will be if that occurs. In particular, will the mean width of the interval for ICC ∗b be smaller than the targeted mean width if a detectable interaction is present? Formulae for both the MLS and GCI interval construction for ICC ∗b are presented in Burdick et al. (2005). 6. Results of simulation studies For all simulations, the total variance was set to 10 = σb2 + σl2 + σe2 . We have found that the MLS and GCI intervals do not depend on the total variance (Ionan et al., 2014). For all simulations, we used 0.5 = ICC w =

σb2

σb2 +σe2

. Computations were

carried out in R version 3.0.2 and Mathematica version 7. The asymptotic results of Section 3 can be used to estimate the minimum possible width when design parameter b0 → ∞ or l0 → ∞. A single Monte Carlo integration can be implemented using Corollaries 1–6. The resulting minimum widths as b0 → ∞ for both the MLS and GCI are presented in Table 3. Also shown are the widths for the MLS interval when b0 is very large, specifically b0 = 500, for comparison. As can be seen from the table, the asymptotic widths for the MLS and GCI are very similar to each other. This makes sense since the two approaches are asymptotically equivalent, as noted in Section 3.3. Since the width is a monotone decreasing function of b0 , the widths when b0 = 500 should all be larger than the estimates when b0 = ∞. This is the case for all lines except the first, where the tiny discrepancy can be attributed to simulation error. Therefore, both Corollaries 1 and 4 are supported by the simulation and produce nearly identical asymptotic estimates. Next, we evaluated the MLS method, both in terms of accuracy of the CI width estimate and variance reduction (computational savings). The total variance reductions shown in Table 4 due to both the integration step and the reduced variance from the conditioning are shown in the rightmost column. The reductions depend on the study size, and are two orders of magnitude for larger studies, and an order of magnitude or less for smaller studies (b0 = 30, l0 = 5, r0 = 1). Since larger studies require more computational overhead, this means that most width calculations can be performed in a relatively short timeframe. The accuracy of the mean interval width estimates are evaluated by comparison of the MLS width estimation method with a pure Monte Carlo estimation of the mean width; this comparison shows that the MLS estimates are very close to the Monte Carlo means; the new method produces very accurate mean widths over these simulations. The two are equal to two decimal places for larger sample size. The MLS method is slightly conservative for some of the smaller sample sizes, probably due to the discrete nature of sample sizes and associated widths in that context. Our GCI method is based on inverse Rao-Blackwellization and a control variate. We sought to assess both the accuracy of our approach and the computational savings. As can be seen in Table 5, the total variance reduction from the two methods

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

77

Table 3 Minimum width asymptotics. Widths as b0 → ∞ based on Corollaries 1 and 4, and using Monte Carlo integration. For all simulations, r0 = 1 and ICC b = 0.50. Rightmost column is a pure Monte Carlo evaluation of the mean width using the MLS interval construction method. l0

As b0 → ∞

ICC b

MC MLS width b0 = 500

MLS min. width

GCI min. width

5 10 20

0.90 0.90 0.90

0.244 (1e−3) 0.114 (4e−4) 0.067 (2e−4)

0.243 (4e−3) 0.114 (1e−3) 0.067 (6e−4)

0.242 (1e−2) 0.122 (4e−3) 0.076 (2e−3)

5 10 20

0.70 0.70 0.70

0.387 (1e−3) 0.227 (2e−4) 0.145 (3e−4)

0.387 (4e−3) 0.228 (2e−3) 0.146 (1e−3)

0.420 (1e−2) 0.254 (6e−3) 0.167 (2e−3)

5 10 20

0.50 0.50 0.50

0.374 (8e−4) 0.244 (4e−4) 0.164 (2e−4)

0.373 (2e−3) 0.242 (2e−3) 0.164 (1e−3)

0.412 (6e−3) 0.277 (4e−3) 0.196 (2e−3)

5 10 20

0.30 0.30 0.30

0.275 (4e−4) 0.191 (2e−4) 0.133 (1e−4)

0.275 (1e−3) 0.189 (1e−3) 0.133 (4e−4)

0.311 (3e−3) 0.223 (2e−3) 0.165 (8e−4)

5 10 20

0.10 0.10 0.10

0.109 (1e−4) 0.078 (6e−5) 0.056 (2e−5)

0.110 (4e−4) 0.079 (2e−4) 0.056 (7e−5)

0.139 (2e−3) 0.100 (1e−3) 0.074 (4e−4)

Table 4 MLS method interval widths: Evaluation of the dependent conditioning (DC) method for estimating the MLS CI width. Table based on 1000 MC runs. Confidence level is 95%. Total variance is 10 and lab variance equals error variance. VR is the ratio of the variances of the estimators, pure Monte Carlo divided by DC method. TR is the computing time ratio of the methods, pure Monte Carlo divided by DC method. Total is the total computation savings ratio.

ρb

b0

l0

r0

MC mean

Est. mean

VR

TR

Total VR

0.90 0.70 0.30 0.10 0.90 0.70 0.30 0.10 0.90 0.70 0.30 0.00

100 100 100 100 30 30 30 30 30 30 30 30

10 10 10 10 5 5 5 5 30 30 30 30

5 5 5 5 1 1 1 1 30 30 30 30

0.143 0.290 0.258 0.115 0.299 0.497 0.433 0.259 0.120 0.260 0.259 0.119

0.147 0.299 0.266 0.117 0.324 0.529 0.447 0.253 0.120 0.260 0.260 0.120

9.9 19.5 11.4 1.3 5.1 15.6 1.1 1.3 1.4 1.6 1.2 1.2

50.9 44.2 41.4 45.4 1.7 1.8 2.1 2.4 152.8 168.0 211.9 239.4

504 860 474 59 9 28 2 3 216 266 257 287

Table 5 GCI Method interval widths: MC is pure Monte Carlo. RB is the Rao-Blackwellization method. CV is the control variate method. Confidence level is 95%. Using 1000 Monte Carlo for estimation, summary statistics from 100 replications.

ρb

b0

l0

r0

RB mean

MC mean

RB VR

CV after RB VR

Total VR

0.9 0.7 0.5 0.3 0.1 0.9 0.7 0.5 0.3 0.1 0.9 0.7 0.5 0.3 0.1

30 30 30 30 30 100 100 100 100 100 30 30 30 30 30

5 5 5 5 5 10 10 10 10 10 30 30 30 30 30

1 1 1 1 1 5 5 5 5 5 30 30 30 30 30

0.303 0.510 0.513 0.412 0.222 0.132 0.272 0.298 0.238 0.101 0.107 0.242 0.287 0.278 0.112

0.299 0.508 0.508 0.409 0.221 0.131 0.272 0.298 0.237 0.101 0.106 0.240 0.285 0.247 0.112

158.8 71.8 21.3 5.7 2.5 106.4 69.0 51.9 26.4 14.0 4.0 2.7 2.1 1.7 2.4

9.4 2.1 1.2 1.0 2.2 51.0 8.0 2.2 1.1 1.6 9.9 3.2 1.0 2.2 2.7

1493 151 26 6 6 5426 552 114 29 22 40 9 2 4 6

78

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83 Table 6 Sample size estimates and evaluations. Evaluation of the dependent conditioning-based (DC-based) sample size estimates for the Modified Large Sample method. Here the b0 is the free parameter and l0 = 10, r0 = 5 are fixed. For the DC-based method, the estimation process is repeated 20 times, each with a different random seed. For the pure Monte Carlo evaluations, a total of 1000 Monte Carlo replications to estimate the mean. In each case, σb2 + σl2 + σe2 = 10 and σl2 = σe2 . Sample size estimations

Monte Carlo evaluations

ICC b

Target

bˆ 0 mean (SD)

b0

Mean (SEM) width

0.9 0.7 0.5 0.3 0.1

0.50 0.50 0.50 0.50 0.50

4 (0.0) 7.9 (0.2) 11.0 (0.0) 9.0 (0.0) 5 (0.0)

4 8 11 9 5

0.38 (6e−4) 0.49 (2e−4) 0.50 (1e−4) 0.48 (2e−4) 0.43 (5e−4)

0.9 0.7 0.5 0.3 0.1

0.40 0.40 0.40 0.40 0.40

4 (0.0) 17 (0.0) 27 (0.0) 15 (0.0) 6 (0.0)

4 17 27 15 6

0.38 (6e−4) 0.39 (2e−4) 0.39 (9e−5) 0.40 (2e−4) 0.37 (4e−4)

0.9 0.7 0.5 0.3 0.1

0.30 0.30 0.30 0.30 0.30

5.8 (0.4) 95.6 (1.3) 95 (0) 47.1 (0.4) 8.3 (1.0)

6 96 95 47 8

0.31 (4e−4) 0.29 (2e−4) 0.32 (1e−4) 0.29 (8e−5) 0.30 (3e−4)

combined ranges from three orders of magnitude to less than one order of magnitude. Unlike the MLS method, the amount of variance reduction is strongly dependent on the true value of the ICC b . The closer the ICC b is to 1, the greater the variance reduction. This is not too surprising since our control variate has a linear relation with ICC b only for large values of the parameter (not shown); but one can also see here that the inverse Rao-Blackwellization is more efficient when the ICC b is close to one as well. In terms of design settings, the variance reduction is smallest for the design with b0 = l0 = r0 = 30. Comparing the means from our method to the pure Monte Carlo means also presented in the table, our method is accurate in most cases to two decimal points. Our method appears to provide potentially large variance reduction and very accurate estimates of the interval widths over all simulations examined. The objective of our program is to estimate the sample size requirements. We therefore used simulation to assess the adequacy of the sample size estimates. Table 6 shows the MLS sample sizes generated by our method along with a Monte Carlo evaluation of the adequacy of the sample sizes. First, it is clear that the sample size estimates achieve the targeted mean widths because the Monte Carlo evaluations produced mean MLS CI widths almost identical to the targeted widths (or slightly conservative for smaller sample sizes). To make our method reproducible, the R program uses a fixed random seed (that the user can vary manually) so that if it is run twice in a row with the same input, the same sample size estimate will result. The table shows the variation in sample size estimates (see SD) as these seeds are varied. The variation is 0 or close to 0 in all cases across all the twenty replications. It is apparent from the table that our method produces sample size estimates with very small standard deviations, often with no observed variation over Monte Carlo replications. In sum, the program provided high quality sample size estimates across all simulations considered. Table 7 shows the GCI sample sizes generated by our method with a Monte Carlo evaluation of the adequacy of the sample sizes. The results indicate that the method produces sample size estimates that achieve the targeted mean width. Comparing the results of Tables 6 and 7, we see that there are some differences in the sample size requirements between the MLS and GCI methods. The GCI sample size estimates are equal to or smaller than the MLS sample size estimates. The Monte Carlo indicates that both sample size estimates produce adequate mean widths however. This might seem to suggest that the GCI method is preferable to the MLS method overall, but we have found that the two methods in practice are very similar and such a broad conclusion seems unwarranted from this limited study. Comparing these methods is outside the scope of this paper. Also note that others have suggested that the MLS is superior to the GCI (Cappelleri and Ting, 2003). Next, we addressed the setting of an interaction effect in the two-way layout. If the interaction exists, and the methods of Burdick et al. (2005) for interactions are used to construct a GCI interval instead of the no-interaction model, how will the resulting widths of the two intervals compare? Results are presented in Fig. 1. One can see that for small values of b0 , the mean interval widths for the interaction-based intervals can be quite a bit larger than the mean widths for the no-interaction intervals (negative values on the y-axis). As b0 increases, the bias decreases to below 0.01. In sum, if one is concerned about an interaction then a rule of thumb may be to decrease the targeted width by, say, 0.01, and insist on b0 ≥ 40. 7. Application To explore how our programs may be used in practice, we revisit the dataset of Barzman et al. (2012). As background, this study looked at the reproducibility of different emergency room staffers who rated a common set of children (actually, videos of the children), on the Brief Rating of Aggression by Children and Adolescents (BRACHA) instrument. The two sources

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

79

Table 7 Sample size estimates and evaluations. Evaluation of the Rao-Blackwellization plus control variate-based (RB + CV-based) sample size estimates for the Generalized Confidence Interval method. Here the b0 is the free parameter and l0 = 10, r0 = 5 are fixed. For the RB + CV-based method, the estimation process is repeated 20 times, each with a different random seed. For the pure Monte Carlo evaluations, a total of 1000 Monte Carlo replications to estimate the mean. In each case, σb2 + σl2 + σe2 = 10 and σl2 = σe2 . Sample size estimations

Monte Carlo evaluations

ICC b

Target

bˆ 0 mean (SD)

0.9 0.7 0.5 0.3 0.1

0.50 0.50 0.50 0.50 0.50

4(0) 6(0) 9(0) 8(0) 5(0)

4 6 9 8 5

0.37 (5e−3) 0.52 (3e−3) 0.50 (2e−3) 0.48 (3e−3) 0.41 (5e−3)

0.9 0.7 0.5 0.3 0.1

0.40 0.40 0.40 0.40 0.40

4(0) 13(0) 19(0) 13(0) 5(0)

4 13 19 13 5

0.37 (5e−3) 0.39 (2e−3) 0.39 (1e−3) 0.39 (2e−3) 0.41 (5e−3)

0.9 0.7 0.5 0.3 0.1

0.30 0.30 0.30 0.30 0.30

5(0) 38(0) 65(0) 28(0) 7(0)

5 38 65 28 7

0.32 (5e−3) 0.30 (2e−3) 0.30 (1e−3) 0.30 (1e−3) 0.32 (4e−3)

b0

Mean (SEM) width

Fig. 1. Robustness to interaction evaluation. Bias of the width estimates based on no-interaction model when an interaction model is fitted. Negative values mean the estimated mean width is smaller than the true mean width. Here L = 5 and R = 2. ‘‘Ratio’’ is the ratio of interaction σbl2 /σe2 . Also, σl2 = σe2 and 10 = σb2 + σl2 + σbl2 + σe2 . The value of the bias is above (−0.01) when B ≥ 40. Mean estimates based on 100 Monte Carlo simulations.

of variation are the children and the raters. For this example, we assume that we are designing a new reproducibility study of the BRACHA instrument which will use a new set of children and emergency room staffers, with each set drawn from the same population used in the 2012 paper. The analysis of variance table for this study is shown in the supplement (see Appendix B), and the mean squares for videos, raters, and residuals were s2b = 8.6610, s2l = 0.3965 and s2e = 0.0722 respectively, with degrees of freedom 23, 9 and 207, respectively. Point estimates for the variance components are:

σˆ b2 = σˆ l2 =

s2b − s2e LR s2l − s2e BR

= 0.85888 = 0.0135125

σˆ e2 = s2e = 0.0722.

80

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

We assume that logistics limits us to L = 10 emergency room staffers. Then our goal, at the design stage, is to determine how many children each emergency room staffer should evaluate to achieve a sufficient level of precision in terms of the interval width on the ICC. To achieve a mean interval widths of 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, plugging the estimates into the MLS and GCI sample size functions produced sample size estimates of B = 6, 7, 11, 18, 47, 142 and B = 5, 7, 9, 13, 31, 236, respectively. Thus, if we find an interval width of 0.10 sufficient, we could use 31 children, 10 emergency room staffers, and GCI method for the interval calculation. However, if we want a higher level of precision with an expected interval width of 0.05, we will need 142 children, 10 emergency room staffers, and MLS method for the interval calculation. The estimated asymptotic width for the MLS using Corollary 4 is 0.0359, and for the GCI using Corollary 1 is 0.0358. The methods display instability in the sample size estimates as the targeted width gets close to the optimal width, in these cases within 0.02. For comparison, the Barzman study used n = 24 and had a Bayesian credible interval width of 0.11. If such a study is to be repeated again, we would advise to take a larger sample to ensure that the nominal coverage and the expected interval width are achieved. 8. Discussion In this paper were presented the first methods and accompanying computer programs to calculate sample size to obtain a desired mean confidence interval width for the intraclass correlation coefficient in a two-way crossed analysis of variance without interaction. The methods were shown in simulations to work well and to produce appropriate sample sizes. Code in R to implement the methods is being made available with this publication. In addition, asymptotic results were presented that can be used to estimate the minimum achievable width as one design parameter goes to infinity, and simulations also showed that these methods perform well. These asymptotic results also showed the asymptotic equivalence of GCI and MLS method widths. In a two way balanced layout there are three free design quantities, which we have denoted (b0 , l0 , r0 ), the first two representing the levels of each main effect and the third representing the number of replicates per cell. In this paper, we developed methods for determining the number of replicates for one parameter when the other two are held fixed. We think this approach produces the most useful methods and software for experimental designers for a number of reasons. First, often in practice only one design parameter can be feasibly increased; in that setting, our method can be used to determine how large to make that parameter. Second, often the basic design will be driven by the experimental objective, logistics, or cost. For example, in laboratory assay comparison studies, it may simply be impractical to have more than l0 = 6 labs participate. In that context, a program that sought over every possible combination of (b0 , l0 , r0 ) would likely produce a design solution that is impractical. Our program could be used in this context, however; to see this, note that one can run our program with l0 ∈ {3, 4, 5, 6} and get 4 different b0 values. (l0 < 3 probably would not make sense, and we assume r0 = 1); then, among the 4 resulting designs, use the one that is best (e.g., cheapest). Alternative approaches that search for an optimal design over the design space, or that optimize a cost function, are future directions that may also provide practical design solutions. In this paper, two new methods were developed for variance reduction called inverse Rao-Blackwellization and dependent conditioning. Inverse Rao-Blackwellization is motivated by contexts in which estimation of a function f (X ) of a random variable X is of interest. In our motivating example, f was a quantile function. The method may be useful when estimation of f (X ) results in high variance estimators, but f −1 (X ) can be estimated precisely using Rao-Blackwellization; in other words, using a random variable Y such that Var(f −1 (X )|Y ) < Var(f −1 (X )). In the context of quantile estimation by Monte Carlo, this method could be useful whenever a Y can be found with Var(Fˆ (X )|Y ) < Var(Fˆ (X )), as was done in this paper. This provides a potentially simple and widely applicable method for increasing the efficiency of Monte Carlo simulations that estimate quantiles. On the other hand, dependent conditioning is appropriate for multidimensional contexts in which the goal is to estimate a function f of a set of independent random variables (X1 , X2 , . . . , Xk ). If the random variables can be mapped to a lower dimensional space by some continuous mapping g : ℜk → ℜh with h < k, say (Y1 , . . . , Yh ) = g (X1 , . . . , Xk ). If, further, the distribution of Y2 , . . . , Yk |Y1 can be derived, then f (g −1 (Y2 , . . . , Yk |Y1 )) may be estimated with a single Monte Carlo on Y1 using this conditional density. Dependence among the Y ’s may increase the variance reduction. We focused on the mean interval width as the criterion for sample size determination. In fact, for a given set of design parameters (b0 , l0 , r0 ), and variance parameters (σb2 , σl2 , σe2 ) the MLS and GCI interval widths will be random variables; the widths will depend on the (as yet unobserved) sample mean squares. Average interval width is a common criterion used for sample size determination, but it is possible to use an alternative criterion, such as the 90th percentile of the interval widths; a similar suggestion was made by Zou (2012). Such an approach should be more conservative, and would guarantee with 90% probability that the width will be no larger than the target. Our approach could be extended to this context with some additional work. In particular, the integrals in Eqs. (4) and (5) would be replaced with quantile functions; subsequently, some of the variance reduction methods used in this paper specifically designed for integration may need to be modified to quantile estimation. These results can potentially be expanded from confidence intervals to hypothesis testing by inverting the intervals. Calculating the power to detect a user-specified alternative, or the sample size required to achieve that power, would be a very direct extension of the results of this paper. But an issue that would need to be addressed in such an extension is

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

81

how to form the alternative hypothesis. For example, the null hypothesis may be that H0 : ICC b = ρ0 and an alternative H1 : ICC b ̸= ρ0 . But the power will depend on the parameter ICC w = σb2 /(σb2 + σe2 ). One could address this by setting this parameter to a specific value, or by using minimax procedures. Careful consideration would be needed to produce a useful solution to this problem and is an area of future research. This paper has focused on the balanced, two-way layout without interaction and the intraclass correlation coefficient. But the methods developed here could be extended to ratios of other variance components based on MLS or GCI procedures, such as the many presented in Burdick et al. (2005). They could also be extended to unbalanced designs using the unweighted sum of squares approach (Burdick et al., 2005). In some cases, simple sample size formulas may result; but in many other cases, there may be need to address statistical computational challenges like those described above. The GCI and MLS methods used in this paper are only moderately robust to departures from normality (Ionan et al., 2014). If data have heavy tails and/or are highly skewed, then the MLS and GCI methods should not be used. This is a limitation of the sample size methods presented here. Acknowledgment Dobbin and Ionan’s work was partially funded by the Georgia Research Alliance through the GCC Distinguished Cancer Scholars program (Grant number 045395-03). Appendix A A.1. Definition of GPQ Let G = g (T ; t , θ , δ) be a function of T , t , θ and δ , where T is a statistic, t is an observed value of T , θ is a parameter of interest, and δ is a nuisance parameter. If G has the following two properties, then it is said to be a generalized pivotal quantity: (1) G has a probability distribution free of unknown parameters; (2) The observed pivotal, gobs = g (x; x, θ , δ) does not depend on the nuisance parameter δ . A.2. Proof of Lemma 1 Proof. P (G(t ) ≤ g0 |W2 , W3 ) = P



max 0,

 =P



a W2





b W3

≤ g0 c /W1 + g0 a/W2 + g0 d/W3 |W2 , W3

max {0, M2 a − bM3 } − ag0 M2 − dg0 M3 cg0

 = 1 − Finv χ 2 (df 1)

≤ M1 |W2 , W3

max {0, M2 a − bM3 } − ag0 M2 − dg0 M3 cg0







.

A.3. MLS conditional density calculation We want the joint distribution of (W , Y ). Note that we can write:

 W Y Z

 

χ(2df2 ) /df2



 χ 2 /df   (df3 ) 3    d  2  =  χ(df1 ) /df1     χ 2 /df3   (df3 )  χ(2df3 ) /df3

where Z = s2e /σe2 . Then χ(2df ) = WZdf2 and χ(2df ) = YZdf1 and χ(2df ) = Zdf3 . Which leads to the Jacobian, 2 1 3



Zdf2

J = 0



0

0 Zdf1 0

Wdf2

Ydf1 



df3

and det(J ) = Z (df1 df2 df3 ). 2



82

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

The joint distribution is f (w, y, z ) = z 2 (df1 df2 df3 )fχ 2 (w zdf2 )fχ 2 (yzdf1 )fχ 2 (zdf3 ) df1

df 2

df3

 (w zdf2 )df2 /2−1 Exp[−wzdf2 /2] = z 2 (df1 df2 df3 ) Γ (df2 /2)2df2 /2     df1 /2−1 (zdf3 )df3 /2−1 Exp[−zdf3 /2] (yzdf1 ) Exp[−yzdf1 /2] · · Γ (df1 /2)2df1 /2 Γ (df3 /2)2df3 /2 df1 /2+df2 /2+df3 /2−1 z Exp[−z (ydf1 + w df2 + df3 )/2] = df1 /2+df2 /2+df3 /2  Γ (df1 /2 + df2 /2 + df3 /2) ydf +w2df +df 1 2 3  df1 /2+df2 /2+df3 /2 2 · Γ (df1 /2 + df2 /2 + df3 /2) ydf1 + w df2 + df3 

df3 /2−1

·

df1 df2 df3 (w df2 )df2 /2−1 (ydf1 )df1 /2−1 df3

Γ (df1 /2)Γ (df2 /2)Γ (df3 /2)2df1 /2+df2 /2+df3 /2

.

Recognizing the gamma distribution function density, we see that, f (w, y) =



f (w, y, z )dz df /2

df /2

df /2

Γ (df1 /2 + df2 /2 + df3 /2) w df2 /2−1 ydf1 /2−1 df2 2 df1 1 df3 3 . Γ (df1 /2)Γ (df2 /2)Γ (df3 /2) (ydf1 + w df2 + df3 )df1 /2+df2 /2+df3 /2 Now we know since w ∼ F(df2 ,df3 ) that    df 2/2 Γ df2 +2 df3 df2 w(df2 −2)/2     f (w) = df3 (1 + (df2 /df3 )w)(df2 /2+df3 /2) Γ df22 Γ df23

=

so that, f (y|w) =

f (w, y) f (w) y

Γ 

df1 /2−1



df1 +df2 +df3 2



   · df1df1 /2 df3df3 /2+df 2/2 · (1 + (df2 /df3 )w)df2 /2+df3 /2 (ydf1 + w df2 + df3 )df1 /2+df2 /2+df3 /2 Γ df2 +df3 Γ df1 2 2  df2 /2+df3 /2 −df1 /2 df3 + df2 w (df3 + df2 w + df1 y) df /2 f (y|w) = ydf1 /2−1 df1 1 df3 + df2 w + df1 y Beta(df1 /2, df2 /2 + df3 /2) =

where the last line is a more numerically stable version of the second-to-last line, and Beta is the beta function. A.4. Analytic formula for E [L|w]



m1 m2 + m3 y

f (y|w)dy = (df3 + df2 w)

×

 

1−



df1 +df2 +df3 2

 1−

m3 (df3 + df2 w)



m3 (df3 + df2 w) df1 m2

df1 +df2 +df3 2

df1 m2



− df1 +df22 +df3

Γ (df1 /2)Γ (df2 /2 + df3 /2)

× Hypergeom [1, df1 /2, (2 − df2 − df3 )/2, m3 (df3 + df2 w)/(df1 m2 )]  df2 /2+df3 /2 m3 (df3 + df2 w) −π Γ ((df1 + df2 + df3 )/2) df1 m2      df1 /2   π (df2 + df3 − 1) df1 df1 + df2 + df3 · Sec m2 Γ 2 df3 + df2 w 2 where Hypergeom is the hypergeometric function is defined as: Hypergeom(a, b, c , d) =

Γ (c ) Γ (b)Γ (c − b)

1



t b−1 (1 − t )c −b−1 (1 − tz )−a dt . 0

K.K. Dobbin, A.C. Ionan / Computational Statistics and Data Analysis 85 (2015) 67–83

83

Appendix B. Supplementary data Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.csda.2014.11.010. References Barzman, D., Mossman, D., Sonnier, L., Sorter, M., 2012. Brief rating of aggression by children and adolescents (BRACHA): a reliability study. J. Amer. Acad. Psychiatry Law 40, 375–382. PMID: 22960920. Belge, B., Coche, E., Pasquet, A., Vanoverschelde, J.J., Gerber, B.L., 2006. Accurate estimation of global and regional cardiac function by retrospectively gated multidetector row computed tomography. Eur. Radiol. 16, 1424–1433. http://dx.doi.org/10.1007/s00330-006-0169-6. Bonett, D.G., 2002. Sample size requirements for estimating intraclass correlations with desired precision. Stat. Med. 21, 1331–1335. http://dx.doi.org/10.1002/sim.1108. Burdick, R.K., Borror, C.M., Montgomery, D.C., 2005. Design and Analysis of Gauge R & R Studies. SIAM and ASA, Philadelphia, Alexandria, http://dx.doi.org/10.1137/1.9780898718379. Cappelleri, J.C., Ting, N., 2003. A modified large-sample approach to approximate interval estimation for a particular intraclass correlation coefficient. Stat. Med. 22, 1861–1877. Dobbin, K.K., Beer, D.G., Meyerson, M., Yeatman, T.J., Gerald, W.L., Jacobson, J.W., Conley, B., Buetow, K.H., Heiskanen, M., Simon, R.M., Minna, J.D., Girard, L., Misek, D.E., Taylor, J.M.G., Hanash, S., Naoki, K., Hayes, D.N., Ladd-Acosta, C., Enkemann, S.A., Viale, A., Giordano, T.J., 2005. Interlaboratory comparability study of cancer gene expression analysis using oligonucleotide microarrays. Clin. Cancer Res. 11, 565–572. Eliasziw, M., Young, S.L., Woodbury, M.G., Fryday-Field, K., 1994. Statistical methodology for the concurrent assessment of interrater and intrarater reliability: using goniometric measurements as an example. Phys. Ther. 74, 777–788. Haggard, E., 1958. Intraclass Correlation and the Analysis of Variance. Dryden Press, New York. Hing, C.B., Young, D.A., Dalziel, R.E., Bailey, M., Back, D.L., Shimmin, A.J., 2007. Narrowing of the neck in resurfacing arthroplasty of the hip. J. Bone Joint Surg. 89-B (8), 1019–1024. http://dx.doi.org/10.1302/0301-620X.89B8.18830. Hyndman, R.J., Fan, Y., 1996. Sample quantiles in statistical packages. Amer. Statist. 50, 361–365. http://dx.doi.org/10.2307/2684934. Ionan, A.C., Polley, M.Y., McShane, L.M., Dobbin, K.K., 2014. Comparison of interval methods for the intra-class correlation coefficient (ICC). BMC Med. Res. Methodol. 14 (1), 121. http://dx.doi.org/1186/1471-2288-14-121. MAQC Consortium, 2006. The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnol. 24 (9), 1151–1161. McGraw, K.O., Wong, S.P., 1996. Forming inferences about some intraclass correlation coefficients. Psychol. Methods 1, 30–46. http://dx.doi.org/1082989X/96. McShane, L.M., Aamodt, R., Cordon-Cardo, C., Cote, R., Faraggii, D., Fradet, Y., Grossman, H.B., Peng, A., Taube, S.E., Waldman, F.M., 2000. Reproducibility of p53 immunohistochemistry in bladder tumors. Clin. Cancer Res. 6, 1854–1864. Montgomery, D.C., 2013. Design and Analysis of Experiments, eighth ed. John Wiley and Sons, New York. Polley, M.C., Leung, S.C.Y., McShane, L.M., Gao, D., Hugh, J.C., Mastropasqua, M.G., Viale, G., Zabaglo, L.A., Penault-Llorca, F., Bartlett, J.M.S., Gown, A.M., Symmans, W.F., Piper, T., Mehl, E., Enos, R.A., Hayes, D.F., Dowsett, M., Nielsen, T.O., 2013. On behalf of the International Ki67 in Breast Cancer Working Group. An international Ki67 reproducibility study. J. Natl. Cancer Inst. USA 105, 1897–1906. Potempa, K., Lopez, M., Braun, L.T., Szidon, J.P., Fogg, L., Tincknell, T., 1995. Physiological outcomes of aerobic exercise training in hemiparetic stroke patients. Stroke 26, 101–105. http://dx.doi.org/10.1161/01.STR.26.1.101. Ripley, B., 2006. Stochastic Simulation. Wiley, New York, http://dx.doi.org/10.1002/9780470316726. Robert, C.P., Casella, G., 2013. Monte Carlo Statistical Methods. Springer, New York, http://dx.doi.org/10.1007/978-1-4757-3071-5. Rothery, P., 1982. The use of control variates in Monte Carlo estimation of power. Appl. Stat. 31, 125–129. http://dx.doi.org/10.2307/2347974. Saito, Y., Sozu, T., Hamada, C., Yoshimura, I., 2006. Effective number of subjects and number of raters for inter-rater reliability studies. Stat. Med. 25, 1547–1560. http://dx.doi.org/10.1002/sim.2294. Shrout, P.E., Fleiss, J.L., 1979. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86 (2), 420–428. Weerahandi, S., 1993. Generalized confidence intervals. J. Amer. Statist. Assoc. 88, 899–905. http://dx.doi.org/10.1080/2290779. Weerahandi, S., 1995. Exact Statistical Methods for Data Analysis. Springer-Verlag, New York, http://dx.doi.org/10.1007/978-1-4612-0825-9. Zou, G.Y., 2012. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat. Med. 31, 3972–3981. http://dx.doi.org/10.1002/sim.5466.

Further reading Gelfand, A.E., Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 410, 398–409. http://dx.doi.org/10.2307/2289776. Graybill, F.A., Wang, C.M., 1980. Confidence intervals for nonnegative linear combinations of variances. J. Amer. Statist. Assoc. 75, 869–873. http://dx.doi.org/10.2307/2287174. R Core Team, 2013. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.Rproject.org/.