Statistical power in two-level hierarchical linear models with arbitrary number of factor levels

Statistical power in two-level hierarchical linear models with arbitrary number of factor levels

Journal of Statistical Planning and Inference ( ) – Contents lists available at ScienceDirect Journal of Statistical Planning and Inference journa...

2MB Sizes 4 Downloads 39 Views

Journal of Statistical Planning and Inference (

)



Contents lists available at ScienceDirect

Journal of Statistical Planning and Inference journal homepage: www.elsevier.com/locate/jspi

Statistical power in two-level hierarchical linear models with arbitrary number of factor levels Yongyun Shin a, *, Jennifer Elston Lafata b , Yu Cao c a

Department of Biostatistics, Virginia Commonwealth University, P.O. Box 980032, 830 East Main Street, Richmond, VA 23298-0032, United States b University of North Carolina at Chapel Hill, United States c Virginia Commonwealth University, United States

article

info

Article history: Received 2 January 2017 Received in revised form 5 September 2017 Accepted 5 September 2017 Available online xxxx Keywords: Cluster randomized trial Multisite randomized trial Split-plot randomized trial Interaction effects Omnibus test Sample sizes

a b s t r a c t As the US health care system undergoes unprecedented changes, the need for adequately powered studies to understand the multiple levels of main and interaction factors that influence patient and other care outcomes in hierarchical settings has taken center stage. We consider two-level models where n lower-level units are nested within each of J higher-level clusters (e.g. patients within practices and practices within networks) and where two factors may have arbitrary a and b factor levels, respectively. Both factors may represent a × b treatment combinations, or one of them may be a pretreatment covariate. Consideration of both factors at the same higher or lower hierarchical level, or one factor per hierarchical level yields a cluster (C ), multisite (M) or split-plot randomized design (S). We express statistical power to detect main, interaction, or any treatment effects as a function of sample sizes (n, J), a and b factor levels, intraclass correlation ρ and effect sizes δ given each design d ∈ {C , M , S }. The power function given a, b, ρ , δ and d determines adequate sample sizes to achieve a minimum power requirement. Next, we compare the impact of the designs on power to facilitate selection of optimal design and sample sizes in a way that minimizes the total cost given budget and logistic constraints. Our approach enables accurate and conservative power computation with a priori knowledge of only three effect size differences regardless of how large a × b is, simplifying previously available computation methods for health services and other researches. © 2017 Elsevier B.V. All rights reserved.

1. Introduction The Affordable Care Act (ACA) and other pressures within the American health care market have created unprecedented incentives for physician practices and other health care providers to devise new ways of delivering care to patients. With these care innovations and the desire to understand their impact has come an increasing awareness that barriers to effectiveness can arise at multiple hierarchical levels including the patient, provider-team, organization, and local market or policy levels (Ferlie and Shortell, 2001; Taplin et al., 2012; Nutting et al., 2011; Miller et al., 2010). The inherently hierarchical structure of the health care system demands multilevel research methods that enable an understanding of how the multiple hierarchical contexts affect the impact of new care innovations as well as an understanding of the interplay between characteristics at the different hierarchical levels. The latter is particularly important to reducing known disparities in care outcomes, as it is the latter that enables an understanding of how different care innovations or organizational contexts

*

Corresponding author. E-mail address: [email protected] (Y. Shin).

https://doi.org/10.1016/j.jspi.2017.09.004 0378-3758/© 2017 Elsevier B.V. All rights reserved.

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

2

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



may impact outcomes for vulnerable or otherwise important population subgroups (e.g., those of lower socio-economic status, from a minority race, or with multiple chronic conditions). Furthermore, the impact of care innovations is highly dependent on the ability to successfully implement them across diverse settings and patient populations (Damschroder et al., 2009; Glasgow et al., 2001). As such, it is often important to consider not just whether or not a population was exposed to a given care innovation (i.e., treatment), but the extent to which they were exposed (i.e., treatment intensities or factor levels) (McHugh et al., 2015). Therefore, it is important for statistical methods, including power, to consider the possibility of not just the presence or absence of a given care innovation but the multiple relevant intensities of that innovation. Despite the advances in statistical methods for analyzing hierarchical or multilevel data (Raudenbush and Bryk, 2002; Goldstein, 2003; Snijders and Bosker, 2012), recent multilevel power analyses have focused mainly on a single binary treatment factor in either multisite or cluster randomized designs (Donner and Klar, 1996; Raudenbush, 1997; Raudenbush and Liu, 2000, 2001; Raudenbush et al., 2007; Hedeker et al., 1999; Moerbeek, 2008; Schochet, 2008; Heo and Leon, 2008; Konstantopoulos, 2009; Hedges, 2011; Usami, 2014a, b; Cunningham and Johnson, 2016), thereby limiting their applicability in the context of studying complex health care innovations. It is not surprising, therefore, that recent evaluations of new care models and other system redesign strategies have often been hindered by inadequate statistical power (Jackson et al., 2013; Peikes et al., 2011). Usami (2011) considered a multisite design where two factors may represent the arbitrary number of factor levels and where power to detect the main or interaction effects is based on multiparameter contrasts using Wald statistics (Raudenbush and Bryk, 2002). Although useful for large samples, Usami’s formula may yield inaccurate power for small samples, and depends on sample sizes, effect sizes and variance components. As we will show, the exact power will, in addition, depend on the number of factor levels and vary widely across designs. We derive power to detect main, interaction, or any (i.e. at least one) treatment effects of factors A and B in a two-level hierarchical linear model (HLM), also known as a multilevel or mixed effects model, where the factors represent arbitrary a and b factor levels or treatment intensities, respectively (Raudenbush and Bryk, 2002; Goldstein, 2003; Snijders and Bosker, 2012). One of the factors may be a pretreatment covariate such as the size of a practice, availability of electronic health records, and the patient’s socio-economic status or race/ethnicity. Randomizing both factors at the same higher or lower hierarchical level, or one factor per hierarchical level yields a cluster (C ), multisite (M) or split-plot randomized design (S) (Montgomery, 2005). Consequently, statistical power depends on the type of design d ∈ {C , M , S }. We express power as a function of sample sizes, a and b factor levels, intraclass correlation ρ , effect sizes δ , and design d, and compare the impact of the determinants on power. Given the determinants without constraints, a multisite design produces higher power than other designs. It is important, however, to learn the impact of designs on power as sample sizes are typically constrained due to logistic and/or cost constraints, in particular, when a × b treatment combinations are large. For example, solo or small physician practices can provide a limited number of patients for any given study, and access to physician practices or other higher level clusters may be even more limited. Such practical limitations make comparison of power across the designs relevant and useful. Given a, b, ρ , δ and constraints, the power function determines optimal design and sample sizes in a way that achieve a minimum power requirement and, also, in a way that minimize the research costs. In particular, we derive a single uniform closed-form formula for power to detect main, interaction or any treatment effects given each design. The resulting formula facilitates not only computation of power to detect any of the effects given any of the designs, but also comparison of power across the designs given logistic and cost constraints. Throughout this paper, we assume perfect compliance to the treatment assignment (Angrist et al., 1996). Illustrative of the methodological challenges faced was recent request by the Agency for Health Care Research and Quality (AHRQ) for applications to evaluate programs designed to accelerate the dissemination and implementation of results from patient-centered, evidence-based cardio-vascular care outcomes research findings into primary care (http://grants.nih.gov/ grants/guide/rfa-files/RFA-HS-14-009.html). Under this Initiative, eight ‘‘Health Cooperatives (Coops)’’ or ‘‘networks’’ each bringing with them some 250 primary care practices, are used to test how externally provided quality improvement (QI) support (i.e., treatment) can accelerate the dissemination and implementation of patient-centered, evidence-based care. By design (http://grants.nih.gov/grants/guide/rfa-files/RFA-HS-14-008.html), the Coops differ in their QI support approaches. That diversity and vast number of practices created a unique opportunity for an external, cross-cutting evaluation to provide new insights into how different QI support approaches, and primary care practices’ scope and intensity of engagement with those supports, could impact outcomes among diverse practices and patients. Central to a comprehensive evaluation is an understanding not only of the Initiative’s overall effectiveness, but also the comparative effectiveness of different QI support strategies for different practices, recognizing that practices will differ in their intensity of engagement with the Initiative. Thus, in addition to using a simple binary variable reflective of exposure to the Initiative for a given practice in a given time period (i.e., overall treatment vs. comparison group contrasts), it would be advantageous to consider both types of QI support approaches available (i.e., ‘‘treatment’’) and each practice’s level of exposure to them (i.e., ‘‘treatment intensity’’ such as how frequently they participated in each activity and their level of commitment to the activity). Additionally, when planning such an evaluation, it is important to consider the cost implications of the number of Coops, practices and patients. We revisit design of this evaluation in Section 6. Because our power computation is not approximate, but exact, it is useful for both large and small sample sizes (Raudenbush and Liu, 2001). Explicit dependence of the power function on the arbitrary a and b treatment intensities and designs Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



3

distinguishes our approach from the existing ones. In addition, we show that setting b = 1 in the power formula yields power for a single factor A representing arbitrary a factor levels, and derive power to detect treatment effects when there are no interaction effects as a corollary to the power formula. Furthermore, we illustrate how to determine an adequately powered minimum cost design via power of the omnibus test to detect any (i.e. at least one) treatment effects. What is more, the power formula given arbitrary a and b factor levels enables a screening or pilot experiment that identifies active factor levels for a later substantive experiment (Brennan, 2001; Campbell et al., 2007; Budin et al., 2008; Shin and Raudenbush, 2012; Garrett et al., 2013). In the next two sections, we introduce our model and derive the hypothesis tests of interest and power. Section 4 presents comparatively easy specification of effect sizes that depends on three or less effect size differences regardless of the number of a × b treatment combinations and that produces conservative power. Section 5 illustrates the impact of the determinants on power. In Section 6, we revisit the AHRQ proposed evaluation to illustrate how to select optimal design and sample sizes that yield an adequately powered minimum-cost design. The final section discusses limitations and future extensions of the approach. 2. Model For concreteness, we consider patients at lower level 1 nested within practices at cluster level 2 as our hierarchical setting in this and next sections although our power formula is generally applicable to many other two-level hierarchical settings such as practices within Coops, children attending schools, and occasions within persons. Treatment combination (A = k, B = l) is randomly assigned to patient i nested within practice j for k = 1, 2, . . . , a, l = 1, 2, . . . , b, i = 1, 2, . . . , n and j = 1, 2, . . . , J such that a two-level HLM of interest is Yijkl = µ + αk + βl + (αβ )kl + ujkl + ϵijkl

(1)

where Yijkl is a desired outcome, µ is the mean outcome, αk is the factor A main effect, βl is the factor B main effect, (αβ )kl is the interaction effect, and practice-specific random effect ujkl ∼ N(0, τ ) is independent of patient-specific random error ϵijkl ∼ N(0, σ 2 ). One of the factors, say B, may be a pretreatment covariate such as b practice sizes and b patient race/ethnicities. In that case, investigators are interested in adequate power to detect the factor A main and interaction effects controlling for the main effects of B. The interaction effects help distinguish subpopulations of practices or patients that benefit comparatively more or less from the care innovations. Such B may improve power to detect treatment effects (Raudenbush et al., 2007). In this paper, we consider level-2 or wholeplot factor A, and level-1 or subplot factor B for a split-plot design (Montgomery, 2005). 3. Hypothesis tests and power Based on the model (1), we derive hypothesis tests to find main, interaction and any treatment effects of factors A and B given each design, and power of each test. Following the literature (Raudenbush et al., 2007), we express power as a function τ of the intraclass correlation ρ = τ +σ 2 and (standardized) effect sizes in units of standard deviation (αβ )kl δαβ kl = √ . τ +σ τ +σ τ + σ2 To derive power of each test uniformly, let a generic vector δ of effect sizes refer to one of

δα k = √

αk

2

,

δβ l = √

βl

2

,

(2)

δα = (δα1 , . . . , δαa ), δβ = (δβ 1 , . . . , δβ b ), δαβ = (δαβ 11 , . . . , δαβ ab ) and (δα , δβ , δαβ ), and the average squared effect size δ¯ 2 equal the corresponding one of

δ¯α2 =

a ∑ δ2

αk

k=1

a

, δ¯β2 =

b ∑ δβ2 l l=1

b

2 , δ¯αβ =

a b 2 ∑ ∑ δαβ kl k=1 l=1

ab

2 and δ¯ α2 + δ¯ β2 + δ¯ αβ .

The generic δ enables a uniform expression of each hypothesis test H0 : δ = 0 versus H1 : not H0

(3)

for the null hypothesis referring to one of δα = 0, δβ = 0, δαβ = 0 and δα = δβ = δαβ = 0. To derive the test statistic and power of the null hypothesis against an alternative, we denote χp2 (λ) and Fp,q (λ) as the noncentral χ 2 distribution with p degrees of freedom (df) and F distribution with p numerator and q denominator df, respectively, for the noncentrality parameter λ. Let χp2 = χp2 (0) and Fp,q = Fp,q (0). The test statistic and uniform power function of each test are based on the sums of squares SSA =

a ∑ nJ αˆ 2 k

k=1

a

, SSB =

b ∑ nJ βˆ l2 l=1

b

, SSAB =

a b ∑ ∑ ˆ )2kl nJ(αβ k=1 l=1

ab

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

4

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



where the maximum likelihood estimators explained in Appendix A are

αˆ k = Y¯..k. − Y¯.... , βˆ l = Y¯...l − Y¯.... , (αˆ β )kl = Y¯..kl − Y¯..k. − Y¯...l + Y¯....

(4)

for the overall mean outcome Y¯.... and treatment mean outcomes Y¯..kl , Y¯..k. and Y¯...l . 3.1. Cluster randomized design In this /(ab) practices are randomly assigned to treatment combination (A, B) = (k, l). Distributions (17) where ∑b J∑ ∑adesign, J /(ab) ¯ ¯ ˆ2 ˆ SSU = j=1 nujkl for ujkl = Y.jkl − Y..kl imply l=1 k=1 SSA nτ + σ

∼ χa2−1 (λCα ),

2

SSB nτ + σ

∼ χb2−1 (λCβ ),

2

SSAB nτ + σ

2

∼ χ(a2 −1)(b−1) (λCαβ ),

SSU nτ + σ 2

∼ χJ2−ab

(5)

independent where E [SSA/(nτ + σ 2 )] = (a − 1) + λCα , E [SSB/(nτ + σ 2 )] = (b − 1) + λCβ and E [SSAB/(nτ + σ 2 )] = (a − 1)(b − 1) + λCαβ for the noncentrality parameters

λCα =

αk2 , a(nτ + σ 2 ) nJ



λCβ =

k

βl2 , b(nτ + σ 2 ) nJ



λCαβ =

l

αβ )2kl . ab(nτ + σ 2 )

nJ

∑ ∑ k

l(

The sums of squares (5) imply MSA

FA = FB = FAB =

FAll =

MSU MSB MSU MSAB

MSU SSA + SSB + SSAB (ab − 1)MSU

∼ Fa−1, J −ab (λCα ),

(6)

∼ Fb−1, J −ab (λCβ ), ∼ F(a−1)(b−1), J −ab (λCαβ ), ∼ Fab−1, J −ab (λCα + λCβ + λCαβ )

for mean squares MSA = SSA/(a − 1), MSB = SSB/(b − 1), MASB = SSAB/[(a − 1)(b − 1)] and MSU = SSU /(J − ab). 3.2. Multisite randomized design Now, n/(ab) patients are randomly assigned to treatment combination , B) ∑ = (k, l) within each practice. Because ∑b ∑a (A∑ n/(ab) 2 J ujkl = uj for all (k, l) in model (1), SSA, SSB and SSAB, and SSE = ˆijkl in Eqs. (18) are functions of j=1 i=1 ϵ l=1 k=1 2 level-1 random errors ϵijkl ∼ N(0, σ ), not uj ∼ N(0, τ ), to imply independent SSA

σ2

∼ χa2−1 (λM α ),

SSB

σ2

∼ χb2−1 (λM β ),

SSAB

σ2

∼ χ(a2 −1)(b−1) (λM αβ ),

SSE

σ2

∼ χnJ2 −J −ab+1

(7)

M M 2 2 where E(SSA/σ 2 ) = (a − 1) + λM α , E(SSB/σ ) = (b − 1) + λβ and E(SSAB/σ ) = (a − 1)(b − 1) + λαβ for the noncentrality parameters

λα = M

nJ

∑ aσ

k 2

αk2

,

λβ = M

nJ



β

2 l l 2



,

λαβ = M

nJ

∑∑ l

k( 2

abσ

αβ )2kl

.

The sums of squares (7) imply FA = FB = FAB =

FAll =

MSA MSE MSB

MSE MSAB

MSE SSA + SSB + SSAB (ab − 1)MSE

∼ Fa−1, nJ −J −ab+1 (λM α ),

(8)

∼ Fb−1, nJ −J −ab+1 (λM β ), ∼ F(a−1)(b−1), nJ −J −ab+1 (λM αβ ), M M ∼ F(ab−1), nJ −J −ab+1 (λM α + λβ + λαβ )

for the MSA, MSB and MSAB in Eqs. (6) and MSE = SSE /(nJ − J − ab + 1). 3.3. Split-plot randomized design In this design, J /a practices and n/b patients within each practice are to treatment B = l, ∑arandomly ∑J /a assigned ∑b ∑aA = ∑kJ /and a ∑n/b 2 ˆ2 respectively. The model (1) has ujkl = ujk for all l. Eqs. (19) for SSU = ˆijkl j=1 nujk and SSE = j=1 i=1 ϵ k=1 l=1 k=1 Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



5

yield independent SSA nτ + σ 2 SSU nτ + σ 2

∼ χa2−1 (λCα ),

SSB

∼ χJ2−a ,

SSE

σ2 σ2

SSAB

∼ χb2−1 (λM β ),

σ2

∼ χ(a2 −1)(b−1) (λM αβ ),

∼ χnJ2 −J −ab+a

to imply MSA

FA = FB = FAB =

FAll =

MSU MSB

MSE MSAB

MSE SSA/(nτ + σ 2 ) + (SSB + SSAB)/σ 2 nJ − ab SSU /(nτ + σ 2 ) + SSE /σ 2

for MSA, MSB and MASB in Eqs. (6), MSU =

ab − 1 SSU J −a

∼ Fa−1, J −a (λCα ),

(9)

∼ Fb−1, nJ −J −ab+a (λM β ), ∼ F(a−1)(b−1), nJ −J −ab+a (λM αβ ), M ∼ Fab−1, nJ −ab (λCα + λM β + λαβ )

and MSE =

SSE . nJ −J −ab+a

3.4. Test statistics Each statistic in Eqs. (6), (8) and (9) under H0 : δ = 0 for δ = δα , δβ , δαβ or (δα , δβ , δαβ ) follows a central F distribution to yield the test statistic FA , FB , FAB or FAll . Without interaction effects, H0 : δα = 0, H0 : δβ = 0 and H0 : δα = δβ = 0 imply respective test statistics MSA MSB SSA+SSB ′ ′ FA′ = MSU ′ ∼ Fa−1, J −a−b+1 , FB = MSU ′ ∼ Fb−1, J −a−b+1 and FAll = (a+b−2)MSU ′ ∼ Fa+b−2,J −a−b+1 in Eqs. (6) for design d = C and SSAB+SSU ; J −a−b+1

MSB SSA+SSB ′ ∼ Fa−1, nJ −J −a−b−2 , FB′ = MSE ′ ∼ Fb−1, nJ −J −a−b+2 and FAll = (a+b−2)MSE ′ ∼ Fa+b−2,nJ −J −a−b+2 SSAB+SSE ′ in Eqs. (8) for design d = M and MSE = nJ −J −a−b+2 . In Eqs. (9) for d = S, H0 : δβ = 0 and H0 : δα = δβ = 0 imply respective

MSU ′ =

MSB MSE ′

and FA′ =

MSA MSE ′

SSA/(nτ +σ 2 )+SSB/σ 2 nJ −a−b+1 SSU /(nτ +σ 2 )+(SSAB+SSE)/σ 2 a+b−2

+SSE ∼ Fa+b−2, nJ −a−b+1 for MSE ′ = nJSSAB . −J −b+1 MSA For a single factor A which amounts to setting b = 1 in model (1), H0 : δα = 0 implies test statistic FA′′ = MSU ′′ ∼ Fa−1, J −a +SSU MSA SSB+SSAB+SSE ′′ ′′ in Eqs. (6) for d = C and MSU ′′ = SSB+SSAB , and F = ∼ F in Eqs. (8) for d = M and MSE = . a−1, nJ −J −a+1 A J −a MSE ′′ nJ −J −a+1 ′ In Eqs. (9) for d = S, model (1) having a single factor A or B yields FA ∼ Fa−1, J −a or FB ∼ Fb−1, nJ −J −b+1 under H0 : δα = 0 or H0 : δβ = 0, respectively.

FB′ =

′ = ∼ Fb−1, nJ −J −b+1 and FAll

3.5. Power function Each statistic in Eqs. (6), (8) and (9) implies power of the corresponding H0 : δ = 0 against H1 : δ = δ1 ̸ = 0. To derive the power, we divide the numerator and denominator of the noncentrality parameter by τ + σ 2 to express

λC (δ¯ 2 ) =

nJ δ¯ 2 (n − 1)ρ + 1

,

λM (δ¯ 2 ) =

nJ δ¯ 2 1−ρ

,

⎧ C 2 δ = δα ⎨ λ (δ¯ ), λS = λM (δ¯ 2 ), δ = δβ or δαβ ⎩ S λAll , δ = (δα , δβ , δαβ )

(10)

2 for λSAll = λC (δ¯ α2 ) + λM (δ¯ β2 ) + λM (δ¯ αβ ). The λd is uniform across all hypothesis tests (3) for d = C or M, but depends on the d effects tested for d = S. Each λ for d ∈ {C , M , S } increases in sample sizes and effect sizes although the impact of n appears comparatively weak on λC . The λC decreases, but λM increases with a rise in ρ although the impact of ρ appears greater on

λC than on λM . For the omnibus test of the split-plot design, we solve for ρ in minimized at the intraclass correlation

ρ = ∗

−(R + 1) ±



(R + 1)2 + Rn − R − 1

(Rn−R−1)(n−R−1) n−1

∂λSAll ∂ρ

= 0, and check

∂ 2 λSAll ∂ρ 2

> 0 to find λSAll

(11)

2 where R = (δ¯ β2 + δ¯ αβ )/δ¯ α2 . Thus, λSAll is a decreasing function of ρ for ρ < ρ ∗ dominated by the negative association between 2 C ¯2 λ (δα ) and ρ , but an increasing function of ρ for ρ > ρ ∗ dominated by the positive association between λM (δ¯β2 ) + λM (δ¯αβ ) ∗ C M S and ρ . We explore the implication of ρ on sample size determination in Section 5. For ρ = 0, λ = λ = λ . Let fp,q,1−s be the 1 − s quantile of Fp,q . Given the noncentrality parameter λd of design d, we present the uniform power formula for each test (3):

Theorem 3.1. Suppose that we want to perform hypothesis test (3) at a significance level s given design d ∈ {C , M , S }. Under the H1 , the power to detect δ = δ1 ̸ = 0 is P Fp,q (λd ) > fp,q,1−s

[

]

(12)

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

6

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



for λd in Eqs. (10),

⎧ a − 1, ⎪ ⎨ b − 1, p= ⎪ ⎩(a − 1)(b − 1), ab − 1,

δ δ δ δ

⎧ J − ab, ⎪ ⎪ = δα ⎪ ⎨(n − 1)J − ab + 1, = δβ q = J − a, = δαβ ⎪ ⎪ ⎪(n − 1)J − ab + a, ⎩ = (δα , δβ , δαβ ), nJ − ab,

d d d d d

=C =M = S , δ = δα = S , δ = δβ or δαβ = S , δ = (δα , δβ , δαβ ).

Because the distribution Fp,q (λd ) is stochastically increasing in the noncentrality parameter λd , power (12) is an increasing function of λd (Casella, 2008). Consequently, power (12) increases in n, positively associated with reliability τ +στ 2 /n , ceteris paribus (Raudenbush and Liu, 2001). The higher the impact of practices on the patient outcome Yijkl , the higher the ρ . Factor B highly correlated with Yijkl reduces the random practice differences and, thus, ρ to increase power to detect other factor effects for d = C (Raudenbush et al., 2007). The ρ ∗ that minimizes λSAll will require larger sample sizes to achieve adequate power than any other value of ρ , ceteris paribus. The p numerator df, uniform across all designs, is a function of the number of main, interaction or all treatment effects to test while the q denominator df is a function of the number of random effects to which the treatment effects are compared. Theorem 3.1 facilitates not only computation of power given each design but comparison of power across the designs. For example, power to detect factor A main effects for d ∈ {C , S }, or factor B main or AB interaction effects for d ∈ {M , S } 2 depends on the same λd and p, but different q. The λd of the omnibus test, an increasing function of δ¯ 2 = δ¯ α2 + δ¯ β2 + δ¯ αβ , yields power at least as large as that of main or interaction effects, in general. With no interaction effects, i.e., δαβ = 0, Theorem 3.1 implies a corollary: Corollary 3.2. Suppose that we want to perform the hypothesis test in Theorem 3.1 where δαβ = 0. Under the H1 , the power to detect δ = δ1 ̸ = 0 is power (12) for λd in Eqs. (10),

⎧ J − a − b + 1, ⎪ ⎪ ⎪ ⎨(n − 1)J − a − b + 2, a − 1, δ = δα δ = δβ q = J − a, p = b − 1, ⎪ ⎪ a + b − 2, δ = (δα , δβ ), ⎪(n − 1)J − b + 1, ⎩ nJ − a − b + 1, {

d d d d d

=C =M = S , δ = δα = S , δ = δβ = S , δ = (δα , δβ ).

4. Specifying effect sizes Computation of power (12) requires specification of effect sizes for λd in Eqs. (10). In planning an experiment, investigators may not have enough information to specify all effect sizes at each treatment combination, in particular, when the number of a × b treatment combinations is large. In this section, we suggest a comparatively easy specification of effect sizes that yields conservative power. Our effect sizes are based on a contrast coding that yields (Kutner et al., 2005, ch.16) a ∑

αk =

k=1

b ∑

βl =

l=1

a ∑

b ∑

k=1

l=1

(αβ )kl =

(αβ )kl = 0.

(13)

To see the implication on effect sizes, we consider a simple case of a = 2 and βl = (αβ )kl = 0 for all k and l where E(Yijkl ) = µ + αk and α1 = −α2 to yield the factor A treatment effect E(Yij2l − Yij1l ) = 2α2 . Thus, main effect α2 is half of the treatment effect, and the effect size δα 2 is half of that of Raudenbush et al. (2007). In general, Eqs. (13) imply

− α1 =

a ∑

αk , −β1 =

k=2

b ∑

βl ,

−(αβ )k′ 1 =

l=2

b ∑

(αβ )k′ l ,

l=2

−(αβ )1l′ =

a ∑

(αβ )kl′

k=2

for k′ = 1, . . . , a and l′ = 1, . . . , b to require specification of (a − 1) + (b − 1) main and (a − 1)(b − 1) interaction effects. For a large number of a × b treatment combinations, it is error-prone to specify all main and interaction effects, which may produce inaccurate power. Instead, following Kutner et al. (2005) for specifying main effects, we suggest that investigators supply the minimum pairwise differences

α ∗ = min|αk − αk′ |, k̸ =k′

β ∗ = min|βl − βl′ | l̸ =l′

that they wish to detect with high power. The α ∗ and β ∗ amount to the minimum factor A and B treatment effects, respectively. The next step is to set

− α1 = α2 =

α∗ 2

, −β1 = β2 =

β∗ 2

,

α3 = · · · = αa = β3 = · · · = βb = 0.

(14)

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



7

Among all specifications of main effects that satisfy Eqs. (13) and that include α ∗ and β ∗ , this specification results in the lowest average squared effect sizes δ¯ α2 and δ¯ β2 in Eqs. (16), thereby, yielding the lowest power to detect main effects given others fixed (Kutner et al., 2005). We extend this approach to specification of the interaction effects. We suggest that investigators supply the desired minimum pairwise difference (αβ )∗ = mink̸=k′ , l̸=l′ |(αβ )kl − (αβ )k′ l′ | to detect with high power, and set

− (αβ )11 = (αβ )12 =

(αβ )∗ 2

,

(αβ )k1 = −(αβ )k2 =

(αβ )∗ 2(a − 1)

,

(αβ )k′ l = 0

(15)

for k = 2, . . . , a, k′ = 1, . . . , a, l = 3, . . . , b and a ≥ b. Among all specifications that satisfy Eqs. (13) and that include the 2 (αβ )∗ , this specification yields the minimum δ¯ αβ in Eqs. (16) to result in the lowest power to detect interaction effects, ceteris paribus. See Appendix A for proof. The result implies that given two factors, in general, we assign one with more factor levels to factor A. A single exception is the splitplot design when the subplot factor has more factor levels than the wholeplot factor because we described model (1) with the wholeplot factor A and subplot factor B in Section 3. In that case, we assign the subplot factor as factor A and the wholeplot factor as factor B. Consequently, the results including test statistics and their distributions about factors A and B in Section 3 become those of B and A in the splitplot design. This approach requires that investigators supply the three minimum pairwise differences α ∗ , β ∗ and (αβ )∗ regardless. Subsequent setting of Eqs. (14) and (15) yields

δ¯α2 =

β ∗2 (αβ )∗2 α ∗2 2 , δ¯β2 = , δ¯αβ = . 2 2 2a(τ + σ ) 2b(τ + σ ) 2(a − 1)b(τ + σ 2 )

(16)

Investigators may further simplify the specification by supplying a single minimum pairwise difference α ∗ = β ∗ = (αβ )∗ to detect with high power, or specify more detailed effect sizes to compute comparatively less conservative power given information to do so. 5. Determinants of power This section illustrates the impact of sample sizes, design d ∈ {C , M , S }, a and b treatment intensities, intraclass 2 correlation ρ , and effect sizes δ on power √ (12). Without loss of generality, we assume τ + σ = 1 to imply ρ = τ as in analysis of a standardized outcome Yijkl / τ + σ 2 , and set the minimum effect size difference δ ∗ = α ∗ = β ∗ = (αβ )∗ for a minimum power requirement of 0.8 at a significance level s = 0.05. To show associations between power and determinants clearly as well as make power and sample sizes legible in graphs, we plot power against J up to 120 clusters on the X axis in Fig. 1 to 3. For the detailed number of J clusters that meets the minimum power requirement, see Table 1. We compute power by statistical software package R and provide the R code in the online supplementary material. Omnibus Test. Omnibus test (3) for δ = (δα , δβ , δαβ ) tests if any, main or interaction, effects will be found. Fig. 1 shows the impact of the determinants on power. Each panel graphs power against the number of J clusters for cluster sizes n changing from 4 to 8 to 16 as the legend in panel (1) shows. The horizontal dotted line draws the minimum power requirement. Across the columns, design d changes from C to M to S while across the rows, a and b factor levels, ρ and δ ∗ change. Detailed sample size requirements are tabulated under H0 : δα = δβ = δαβ = 0 in Table 1. Panels (1) to (3) draw power given a = b = 2, ρ = 0.05, and δ ∗ = 0.2 equivalent to a small effect size (Raudenbush and Liu, 2000; Cohen, 1988). Given others fixed, the number of factor levels rises to a = b = 3 in panels (4) to (6). We see that the a and b factor levels are negatively associated with power in each design. For the solid line (n = 4) of a cluster randomized design from panel (1) to (4), for example, the number of J clusters to achieve the minimum power requirement has to more than double from 109 to 267 clusters as shown in Table 1. Although the multisite design in the next panel (5) yields higher power than does the cluster randomized design in panel (4), it is not feasible for n < ab = 9 so that panel (5) draws power only for n = 16. On the other hand, the split-plot design in panel (6) not only yields comparable power to that of the multisite design, but is feasible for all three cluster sizes. Sample size constraints to study a × b = 9 treatment combinations are J > 9 for d = C , n ≥ 9 for d = M, and comparatively low J > 3 and n ≥ 3 for d = S. Given others fixed as in panels (4) to (6), the intraclass correlation ρ rises to 0.2 in panels (7) to (9), which decreases power for the cluster randomized design, but increases power for the multisite design. We see that the impact of ρ is greater for the 2 cluster randomized design than for other designs. For the split-plot design, R = (δ¯ β2 + δ¯ αβ )/δ¯ α2 = 2 such that ρ ∗ = 0.053 at ∗ n = 4 in Eq. (11). Consequently, ρ = 0.05 near the power-minimizing ρ given n = 4 and others fixed in panel (6) requires large J = 233 clusters for the adequate power of 0.8. Because power is higher for ρ less than or greater than the ρ ∗ , ceteris paribus, ρ = 0.2 in panel (9) requires comparatively low J = 228 clusters for the adequate power. Given n = 8 and others fixed from Table 1, ρ ∗ = 0.10 requires J = 126 clusters for adequate power while both ρ = 0.05 and ρ = 0.2 shown in Panels (6) and (9) requires comparatively low 123 and 124 clusters to meet the minimum power requirement, respectively. From panels (4) to (6), the minimum effect size difference δ ∗ increases to 0.3 in panels (10) to (12), ceteris paribus, which raises power rapidly to halve the number of J clusters for the adequate power of 0.8 in each design. Main Effects. Fig. 2 displays power to detect main effects in the same way as did Fig. 1 for the omnibus test. It displays power to detect the factor A main effects for d ∈ {C , M } in the first two columns, and the sub-plot factor B main effects for d = S in the last column. Because of the orthogonal design, the a = b factor levels yield identical power for the factor A and Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

8

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



Table 1 J clusters to achieve power of 0.8 at a significance level 0.05. Hypothesis test

H0 : δα = 0

(a, b)a

(2,2)

na 4

(4,4)

228 258 316 375 103 116 142 168 135 169 238 307 61 77 107 138 88 125 199 272 41 57 90 122

1035 1169 1437 1706 463 522 642 761 610 766 1079 1393 274 344 483 622 397 565 900 1236 179 254 403 552

2709 3061 3765 4469 1209 1365 1678 1991 1594 2004 2826 3648 713 896 1261 1626 1036 1476 2356 3237 466 661 1052 1444

109 123 150 177 51 57 69 81 66 82 114 145 32 39 53 67 44 61 95 129 22 30 45 60

267 301 369 436 124 139 169 199 161 200 279 358 76 94 129 164 107 149 234 318 53 71 109 146

477 537 658 779 219 246 300 354 285 356 497 638 134 166 228 291 190 265 416 567 92 125 192 260

Multisite-randomized design 188 NAb NA 178 NA NA 158 NA NA 138 NA NA 84 NA NA 80 NA NA 71 NA NA 62 NA NA 94 NA NA 89 NA NA 79 NA NA 69 NA NA 42 NA NA 40 NA NA 36 NA NA 31 NA NA 47 87 130 45 82 123 40 73 110 35 64 96 21 39 58 20 37 55 18 33 49 16 29 43

188 178 158 138 84 80 71 62 94 89 79 69 42 40 36 31 47 45 40 35 21 20 18 16

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 213 202 180 157 95 90 80 70

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 559 529 471 412 249 236 210 184

88 84 74 65 40 38 34 30 44 42 37 33 20 19 17 15 22 21 19 17 10 10 9 8

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 55 52 46 40 25 24 21 19

NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 97 92 82 72 44 42 37 33

0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30

Split-plot randomized design 228 419 631 258 473 713 316 582 877 375 690 1040 103 188 283 116 212 319 142 260 392 168 309 465 135 247 372 169 311 468 238 437 659 307 564 849

188 178 158 138 84 80 71 62 94 89 79 69

852 808 718 629 380 360 320 281 426 404 359 314

2233 2116 1881 1647 994 942 838 734 1117 1058 941 823

93 93 89 82 42 42 40 37 49 49 48 44

233 234 228 213 105 105 103 96 123 126 124 116

417 422 414 391 188 189 186 176 221 229 228 213

0.20

0.20

0.20

0.20

0.20

0.30

8

(3,3)

0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30

0.20

0.30

4

(2,2)

0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30

0.30

16

(4,4)

0.20

0.30

8

(3,3)

Cluster randomized design 228 419 631 258 473 713 316 582 877 375 690 1040 103 188 283 116 212 320 142 261 392 168 309 465 135 247 373 169 311 468 238 437 659 307 564 849 61 112 168 77 140 211 107 196 295 138 252 380 88 162 243 125 229 345 199 365 550 272 500 754 41 74 111 57 104 156 90 164 247 122 224 338

0.30

4

H0 : δα = δβ = δαβ = 0

(2,2)

ρa

0.30

16

(4,4)

δ∗ a

0.30

8

H0 : δαβ = 0 (3,3)

0.20

(continued on next page)

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



9

Table 1 (continued) Hypothesis test

H0 : δα = 0

(a, b)a

(2,2)

(3,3)

(4,4)

(2,2)

(3,3)

(4,4)

(2,2)

(3,3)

(4,4)

61 77 107 138 88 125 199 272 41 57 90 122

112 140 196 252 162 229 365 500 74 104 164 224

168 210 295 380 243 345 550 754 110 156 247 338

42 40 36 31 47 45 40 35 21 20 18 16

190 180 160 140 213 202 180 157 95 90 80 70

497 471 419 367 559 529 471 412 249 236 210 184

22 23 22 20 26 27 26 23 12 12 12 11

55 57 56 52 66 69 67 62 30 31 30 28

100 103 102 96 120 126 124 114 54 57 56 51

0.30

16

0.20

0.30

0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30 0.05 0.10 0.20 0.30

H0 : δαβ = 0

a

n: cluster size; δ ∗ : minimum effect size ; (a, b): factor levels; ρ : intra-cluster correlation.

b

NA: an infeasible multisite design having n < a × b.

H0 : δα = δβ = δαβ = 0

B main effects of d = C and of d = M, the factor A main effects of d ∈ {C , S }, and the factor B main effects of d ∈ {M , S }, respectively. Table 1 displays detailed sample size requirements under columns H0 : δα = 0 and H0 : δβ = 0. Panels (1) to (3) draw power given a = b = 2, ρ = 0.05 and δ ∗ = 0.2. Given others fixed, the number of factor levels increases to a = b = 3 in Panels (4) to (6). We see that the a and b factor levels are negatively associated with power in each design. Panel (5) draws power only for n = 16 in a multisite design for the sample size restrictions explained above. On the other hand, the split-plot design in panel (6) yields comparable power to that of the multisite design, and is feasible for all three cluster sizes. Given others fixed at the levels of panels (4) to (6), the intraclass correlation ρ rises to 0.2 in panels (7) to (9). We see that ρ is negatively associated with power of the main effects for the cluster randomized design, but positively so with power of the main effects for the multisite design and the subplot factor B main effects for the split-plot design. The impact of change in ρ appears greatest on power for the cluster randomized design. Given others fixed at the levels of panels (4) to (6) again, the minimum effect size difference δ ∗ increases to 0.3 in panels (10) to (12) to raise power rapidly in each design. Interaction Effects. Fig. 3 draws power to detect interaction effects for a = b = 3 in the same way as did panels (4) to (6) in Fig. 2. Power is overall lower than that of the main effect counterparts in Fig. 2. Therefore, it appears harder to detect the interaction than main effects for a, b > 2, holding others constant. We do not show power plots for a = b = 2 because they are identical to the corresponding main-effect plots in Fig. 2. Detailed sample size requirements appear under H0 : δαβ = 0 in Table 1. Overall, power is sensitive to changes in the number of a and b factor levels, desired minimum effect sizes δ ∗ and intraclass correlation ρ in each design. Power also varies much across designs. In particular, power is more sensitive to changes in ρ for the cluster randomized design than for other designs. Therefore, researchers planning a cluster randomized experiment should take great care to base the appropriate level of ρ on the past research for accurate power computation. 6. Finding an optimal design We revisit planning of an evaluation for the large-scale AHRQ-sponsored Initiative. ‘‘Health Cooperatives (Coops)’’ or networks provide quality improvement support to small and medium sized primary care practices to accelerate the dissemination and implementation of patient-centered, evidence-based cardiovascular care. Our goal is to determine the optimal design and sample sizes that will achieve a minimum power requirement of 0.8 at a significance level s = 0.05 to detect treatment effects on each of two outcome variables and that will minimize the evaluation costs. One outcome is at the practice level where practices are nested within Coops, and the other at the patient level where patients are nested within practices. Our strategy is first to determine optimal design and sample sizes for the former, which will impose a logistic constraint on the number of practices in finding an optimal design for the latter. Because main and interaction effects on each outcome are considered equally important, it is desirable to have high power to detect the minimum pairwise effect difference α ∗ = β ∗ = (αβ )∗ , which equals the minimum effect size difference δ ∗ under our assumption τ + σ 2 = 1. We show how to find the optimal numbers of Coops, practices and patients that will minimize the evaluation costs via power (12) of the omnibus test. Following Raudenbush (1997), we define the total evaluation cost as $J(C2 + C1 n) where C2 and C1 are the costs of adding extra level-2 and -1 units to the sample, respectively. Dividing by C1 , we simplify the total cost as T = J(C2∗ + n) in terms of the number of level-1 units where C2∗ = C2 /C1 is the cost of adding an extra level-2 cluster in terms of the number of level-1 units. Consequently, sampling an extra level-2 cluster and n level-1 units within the cluster incurs an additional cost equivalent to C2∗ + n level-1 units. Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

10

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



Fig. 1. Each panel graphs power (12) of the omnibus test to detect any effects against J for n changing from 4 to 8 to 16 as the legend in panel (1) shows. The horizontal dotted line indicates a power of 0.8. Across the rows, treatment intensities change from a = b = 2 to a = b = 3; ρ is either 0.05 or 0.2; and the minimum effect size δ ∗ either 0.2 or 0.3. Across the columns, the design changes from cluster (d = C ), to multisite (d = M) and to split-plot randomized designs (d = S).

6.1. Practices nested within cooperatives Reflecting the fact that it costs much higher to add a Coop than a practice for the practice-level outcome, we let C2∗ = 50. Based on the current AHRQ design involving 8 Coops (J = 8) and 250 practices per Coop (n = 250), we compute the total cost constraint as J(C2∗ + n) = 8(50 + 250) = 2400 units (T ≤ 2400). With a logistic constraint that AHRQ has resources to evaluate 8 to 12 Coops, the outcome variable is an overall practice-level quality measure where two QI support strategies each with three treatment intensities (a = 3, b = 3) may be randomized to Coops or practices. We set δ ∗ = 0.2, equal to a small effect size (Cohen, 1988), for conservative power computation and anticipate that 10% of the outcome variance will Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



11

Fig. 2. Each panel graphs power (12) to detect the main effects of A or B against J in the same way as does each panel of Fig. 1 except that the first two columns graph power to detect the main effects of A for the cluster and multisite designs while the last column draws power to detect the sub-plot factor B main effects of the split-plot design.

be at the cluster level (ρ = 0.1) based on the past research (Peikes et al., 2011). Our aim is to find the optimal design and sample sizes subject to constraints T ≤ 2400 and 8 ≤ J ≤ 12. For overall effectiveness of the Initiative, we compute power of the omnibus test to detect any, main or interaction, effects. Panels (1) to (3) in Fig. 4 display power of the omnibus test for each design. Each panel graphs power against n given J changing from 8 to 10 to 12 as the legend⌊in Panel (1) ⌋ shows. A vertical line indicates the upper-bound cluster size imposed by the budget constraint given J, i.e. n ≤

2400 J

− 50 , where ⌊A⌋ denotes the greatest integer ≤ A. Because J = 8 does not

allow investigation of the 9 treatment combinations for d = C , panel (1) draws power only for J > 8 and reveals that a cluster randomized design cannot achieve the minimum power requirement for J ≤ 12. On the other hand, other designs give adequate power for all J ≥ 8. With an extra Coop more expensive than an extra practice per Coop for J ≤ 12, the minimum costs of the multisite and split-plot designs are T = 1224 and T = 1720 units at sample sizes (n, J) = (103, 8) and (165, 8), respectively. Therefore, the optimal design and sample sizes are d = M and (n, J) = (103, 8) at the lowest evaluation cost T = 1224. The current AHRQ design involving (n, J) = (250, 8) is comparatively overpowered and too expensive for the omnibus test. Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

12

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



Fig. 3. Each panel graphs power (12) to detect the AB interaction effects against J in the same way as does that of Fig. 2 for the main effects.

Fig. 4. Each panel graphs power (12) to detect treatment effects against n for J changing from 8 to 10 to 12 as the legend in the first panel shows, given a = b = 3, ρ = 0.1 and δ ∗ = 0.2. Panels (1) to (3) draw power of the omnibus test in each design; panels (4) and (5) of the main effects for d ∈ {M , S }; and panel (6) of the interaction effects for d = M.

Power of the omnibus test indicates that the cluster randomized design does not have adequate power to detect main or interaction effects. Panels (4) to (6) draw power to detect the main and interaction effects for d ∈ {M , S }. Panel (4) shows adequate power to detect the factor A main effects of design d = M for J = 8, 10, 12. The power also equals that of the factor B main effects for d ∈ {M , S } due to the orthogonal design and a = b. Panel (5) reveals low power to detect whole-plot factor A main effects for d = S, which is also the corresponding power for d = C due to the orthogonal design. An important difference is that the split-plot design enables investigation of all treatment combinations for J > a while the cluster randomized design is limited to J > ab as power formula (12) reveals. Panel (6) graphs inadequate power, although conservative, to detect the interaction effects for d = M, which is practically identical to the corresponding power for d = S. By comparing panels (4) and (6), we see that it is harder to detect interaction than main effects for a, b > 2. Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



13

Fig. 5. Each panel graphs power (12) of the omnibus test of each design against J for n changing from 6 to 8 to 10 given a = b = 3, ρ = 0.05 and δ ∗ = 0.25 in the same way as did the counterparts in Fig. 4.

6.2. Patients nested within practices An important part of the AHQR evaluation is to identify whether there are subgroups of the population who benefit relatively more or less from the specific QI support strategies chosen. In this section, we consider a QI support strategy with three treatment intensities (a = 3) that may be randomly assigned to practices or patients; and low, medium and high levels of patient socioeconomic status (SES, b = 3). Patients at each SES level may be randomly selected within practices or SES may be randomly assigned to practices from the populations of which such patients are sampled. To facilitate sampling from ∗ small practices, each practice may sample 6 to 10 patients (6 ≤ n ≤ 10). The cost of⌊sampling ⌋ a practice is C2 = 10 with the 4000 total cost constraint T = J(10 + n) ≤ 4000 units yielding the upper bound for J ≤ 10 practices. A logistic constraint is +n J ≥ 103 practices from the selected design in the previous section. We set ρ = 0.05 based on the previous research (Peikes et al., 2011; Thompson et al., 2012) and expect to detect the minimum effect size difference δ ∗ = 0.25 with high power. In each panel of Fig. 5, we draw power of the omnibus test against J practices for n changing from 6 to 8 to 10 patients given design d ∈ {C , M , S }. A vertical line indicates the J upper bound given each n. The cluster randomized design produces adequate power for each cluster size within the J bound in panel (1). The lowest cost of the cluster randomized designs is T = 2064 at (n, J) = (6, 129). Given a × b = 9 treatment combinations, a multisite design is feasible for a single cluster size n = 10 displayed in the next panel. Although power is adequate for J much lower than 103, the logistic constraint J ≥ 103 identifies the minimum-cost multisite design having T = 2060 at (n, J) = (10, 103). The split-plot design in the last panel achieves adequate power for all cluster sizes considered within the J bounds. Because of the logistic constraint J ≥ 103, the minimum-cost split-plot design incurs the total cost T = 1648 at (n, J) = (6, 103). Therefore, the optimal design and sample sizes that meet the minimum power requirement and that also minimize the evaluation costs are d = S and (n, J) = (6, 103). 7. Discussion In this paper, we confronted a problem in planning evaluation of a care delivery innovation that requires understanding how two factors, representing arbitrary a and b factor levels, respectively, and their interaction influence patient and other care outcomes in two-level hierarchical settings. Consideration of the two factors at the same higher or lower hierarchical level, or one per hierarchical level gave rise to three designs to investigate: cluster (C), multisite (M) and split-plot randomized designs (S). The challenge was to find the optimal design and sample sizes given cost and logistic constraints in a way that satisfies a minimum power standard and also in a way that minimizes the associated research costs. We derived statistical power (12) to detect main, interaction, or any treatment effects as a function of sample sizes, the arbitrary a and b factor levels, intraclass correlation ρ , effect sizes δ , and design d ∈ {C , M , S } to determine sample sizes that meet a minimum power requirement in each design. Next, we compared the impact of the designs on power of the omnibus test to detect any treatment effects in a way that minimizes the research costs given the constraints. Our approach has limitations that identify important future research areas. First, power formula (12) is based on a balanced hierarchical model (1). A valuable future research topic is extension of the formula to unbalanced designs, possibly, due to missing data. In addition, our power computation assumes that treatment effects are fixed. With main or interaction effects randomly varying across clusters, we have to take the variance as an additional determinant of the power (12). This will be another valuable future research area. Finally, we computed power separately for the outcome variable of practices nested within networks and another outcome of patients nested within practices to determine the optimal designs and numbers of networks, practices and patients in a way that minimized the related research costs. If one outcome is an aggregate of the other, it is desirable to have a three-level method for power computation. For simultaneous tests of two outcome variables, the Type I error based on Theorem 3.1 for a single hypothesis test will be higher than desired. A power formula for the simultaneous tests is also desirable as future research. Importantly, our findings illustrate the dependence of power on the number of a and b factor levels or treatment intensities. Despite increasing awareness of the importance of considering multiple doses or intensities of treatments (McHugh Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

14

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



et al., 2015), we showed the negative association between the number of treatment intensities and power (12). The explicit negative association enables determination of whether arbitrary a and b treatment intensities meet a minimum power requirement, ceteris paribus. In addition, we illustrated in Figs. 2–4 that it is harder to detect interaction than main effects when the factors represent more than two treatment intensities. Thus, calls to consider differing treatment intensities and effectiveness within given contexts or for specific population subgroups are not without associated research costs. The fact that prior studies, which have not considered these complexities, have themselves been underpowered (Jackson et al., 2013; Peikes et al., 2011), likely illustrates both the need for practical tools to assist with power estimations and the challenges of identifying adequate funding to conduct such studies. With our approach, accurate power computation can be greatly simplified with a priori knowledge of only three or less effect size differences regardless of how large a × b is. The challenge of adequate funding to enable well powered studies to be conducted remains and is well beyond the scope of this study. Acknowledgments This research was supported by William T. Grant Foundation (183631) through the project ‘‘Learning from Variation in Program Effects: Methods, Tools, and Insights from Recent Multisite Trial’’, and the National Institutes of Health through the project ‘‘A Post-Visit Patient Portal Tool to Promote Colorectal Cancer Screening’’ and the grant U01HL101064. Appendix A A.1. Estimation in a cluster randomized design

∑J /(ab)

∑n

∑b

∑a

¯ ¯ = ¯..kl /b, Y¯...l = ¯ a and Y¯.... = ¯ ¯ Let Y¯.jkl = j=1 abY.jkl /J, Y..k. ∑ l=1 Y∑ l Y...l /b. The maximum i=1 Yijkl /n, Y..kl = ∑ k=1 Y..kl /∑ likelihood estimators (MLE) (4) under assumptions α = β = ( αβ ) = ( αβ ) = 0 are of the same form, k l kl kl k l k l but has some the averages different across the three designs, as indicated below. The residual sums of squares ∑of∑ ∑sample ∑ ∑ ∑ ∑ 2 ˆ2 are SSU = ˆijkl for uˆ jkl = Y¯.jkl − Y¯..kl and ϵˆijkl = Yijkl − Y¯.jkl . Equating the mean k l j nujkl and SSE = k l j iϵ SSE SSU squares MSU = J −ab and MSE = J(n−1) to respective expectations nτ + σ 2 and σ 2 yields variance estimators σˆ 2 = MSE and



τˆ = (MSU − MSE)/n that are of the same form across the three designs. Because αˆ k , βˆ j , (αˆ β )kl and uˆ jkl , normally distributed with zero pairwise covariances, are independent, so are their respective functions SSAc

∼ χa2−1 ,

a(nτ + σ 2 )

SSBc b(nτ + σ 2 )

∼ χb2−1 ,

SSABc ab(nτ + σ 2 )

∼ χ(a2 −1)(b−1) ,

SSU nτ + σ 2

∼ χJ2−ab

(17)

for the sums of mean-centered squares SSAc =



nJ(αˆ k − αk )2 , SSBc =



k

nJ(βˆ l − βl )2 , SSABc =

l

∑∑ k

ˆ nJ [(α β )kl − (αβ )kl ]2 .

l

A.2. Estimation in a multisite randomized design

∑n/(ab)

∑J

∑ ∑b ¯ /J and Y¯.j.. = a Y¯ /(ab). Then, Y¯..k. , Y¯...l and Y¯.... are of ∑J k=1 2 l=1 .jkl ∑b ∑a ∑J ∑n/(ab) 2 ˆ the same form in the MLE (4). The residual sums of squares are SSU = ˆijkl for j=1 nuj and SSE = j=1 i=1 ϵ l=1 k=1 SSE uˆ j = Y¯.j.. − Y¯.... and ϵˆijkl = Yijkl − Y¯..kl − Y¯.j.. + Y¯.... . The mean squared errors are MSU = JSSU and MSE = . The MLE −1 nJ −J −ab+1 In this design, Y¯.jkl =

i=1

abYijkl /n, Y¯..kl =

j=1 Y.jkl

(4) imply independent SSAc aσ

2

∼ χa2−1 ,

SSBc bσ

2

∼ χb2−1 ,

SSABc abσ

2

∼ χ(a2 −1)(b−1) ,

SSE

σ2

∼ χnJ2 −J −ab+1 .

(18)

A.3. Estimation in a split-plot randomized design

∑ ¯ / and Y¯ = J /a aY¯.jkl /J. Other sample averages stay the j=1 ∑ ∑J /a ..kl2 ∑b ∑a ∑J /a ∑n/b 2 ˆ same in the MLE (4). The residual sums of squares SSU = n u and SSE = ˆijkl yield the jk j=1 j=1 i=1 ϵ l=1 k=1 SSE ¯.jk. − Y¯..k. and ϵˆijkl = Yijkl − Y¯..kl − Y¯.jk. + Y¯..k. . The MLE (4) ˆ and MSE = for u = Y mean squared errors MSU = JSSU jk −a nJ −J −ab+a In the split-plot design, Y¯.jkl =

∑n/b

i=1 bYijkl

/n, Y¯.jk. =

∑b

l=1 Y.jkl b a k=1

imply independent SSAc a(nτ + σ SSU nτ + σ 2

2)

∼ χa2−1 ,

∼ χJ2−a ,

SSBc

bσ SSE

σ2

2

∼ χb2−1 ,

SSABc abσ 2

∼ χnJ2 −J −ab+a .

∼ χ(a2 −1)(b−1) , (19)

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



15

2 A.4. Proof of interaction effects (15) minimizing δ¯ αβ

Among all specifications of interaction effects that satisfy Eqs. (13) and that include the (αβ )∗ , the specification (15) yields 2 the minimum δ¯ αβ in Eqs. (16) to result in the lowest power to detect interaction effects, ceteris paribus. To show this, we

∑b

αβ )21l according to Eqs. (15). Next, set −(αβ )11 = (αβ )12 = (αβ )∗ /2 and (αβ )13 = · · · = (αβ )1b = 0 to minimize l=1 (∑ ∑a ∑a ∑a a (αβ )∗ (αβ )∗ 2 2 we minimize k=2 (αβ )k1 subject to k=2 (αβ )k1 = − 2 , and k=2 (αβ )k2 subject to k=2 (αβ )k2 = 2 ; and, then, let ∑ ∑ a b (αβ )k′ l = 0 for k′ = 2, . . . , a and l = 3, . . . , b. That will minimize l=1 (αβ )2kl at each k and k=1 (αβ )2kl at each l to yield the 2 ∗ ¯ specification (15) that minimizes δαβ and that includes (αβ ) , ceteris paribus. 2 For a ̸ = b, it remains to determine which one of a and b to assign a larger value for the minimum δ¯ αβ =

(αβ )∗2 . 2(a−1)b(τ +σ 2 )

2 Minimizing δ¯ αβ implies maximizing (a − 1)b = ab − b given others fixed. The ab − b is larger when a > b than when a < b.

Appendix B. Supplementary data Supplementary material related to this article can be found online at https://doi.org/10.1016/j.jspi.2017.09.004. References Angrist, J.D., Imbens, G.W., Rubin, D.B., 1996. Identification of causal effects using instrumental variables. J. Amer. Statist. Assoc. 91, 444–455. Brennan, R.L., 2001. Generalizability Theory. Springer-Verlag, New York. Budin, W.C., Hoskins, C.N., Haber, J., Sherman, D.W., Maislin, G., Cater, J.R., Cartwright-Alcarese, F., Kowalski, M.O., McSherry, C.B., Fuerbach, R., Shukla, W., 2008. Breast cancer education, counseling, and adjustment among patients and partners: A randomized clinical trial. Nurs. Res. 57, 199–213. Campbell, L.C., Keefe, F.J., Scipio, C., McKee, D.C., Edwards, C.L., Herman, S.H., Johnson, L.E., Colvin, O.M., McBride, C.M., Donatucci, C., 2007. Facilitating research participation and improving quality of life for African American prostate cancer survivors and their intimate partners: A pilot study of telephone-based coping skills training. Cancer 109, 414–424. Casella, G., 2008. Statistical Design. Springer Science & Business Media, New York. Cohen, J., 1988. Statistical Power Analysis for the Behavioral Sciences, second ed. Cunningham, T.D., Johnson, R.E., 2016. Design effects for sample size computation in three-level designs. Stat. Methods Med. Res. 25 (2), 505–519. Damschroder, L.J., Aron, D.C., Keith, R.E., Kirsh, S.R., Alexander, J.A., Lowery, J.C., 2009. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implementation Sci. 4 (1), 50. Donner, A., Klar, N., 1996. Statistical considerations in the design and analysis of community intervention trials. J. Clin Epidemiol. 49, 435–439. Ferlie, E.B., Shortell, S.M., 2001. Improving the quality of health care in the United Kingdom and the United States: A framework for change. Milbank Q 79(2), 281–315. Garrett, K., Okuyama, S., Jones, W., Barnes, D., Tran, Z., Spencer, L., Lewis, K., Maroni, P., Chesney, M., Marcus, A., 2013. Bridging the transition from cancer patient to survivor: Pilot study results of the Cancer Survivor Telephone Education and Personal Support (C-STEPS) program. Patient Educ. Couns. 92, 266–272. Glasgow, R.E., McKay, H.G., Piette, J.D., Reynolds, K.D., 2001. The re-aim framework for evaluating interventions: what can it tell us about approaches to chronic illness management?. Patient Educ. Couns. 44 (2), 119–127. Goldstein, H., 2003. Multilevel Statistical Models. Edward Arnold, London. Hedeker, D., Gibbons, R.D., Waternaux, C., 1999. Sample size estimation for longitudinal designs with attrition: Comparing time-related contrasts between two groups. J. Educ. Behav. Stat. 24, 70–93. Hedges, L.V., 2011. Effect sizes in three-level cluster-randomized experiments. J. Educ. Behav. Stat. 36 (3), 346–380. Heo, M. and Leon, A.C., 2008. Statistical power and sample size requirements for three level hierarchical cluster randomized trials. Biometrics 64, 1256–1262. Jackson, G.L., Powers, B.J., Chatterjee, R., Bettger, J.P., Kemper, A.R., Hasselblad, V., et al., 2013. Improving patient care. The patient centered medical home: A systematic review. Ann. Internal Med. 158 (3), 169–178. Konstantopoulos, S., 2009. Incorporating cost in power analysis for three-level cluster-randomized designs. Eval. Rev. 33, 335–357. Kutner, M., Nachtsheim, C., Neter, J., Li, W., 2005. Applied Linear Statistical Models, fifth ed. McHugh, M., Harvey, J.B., Kang, R., Shi, Y., Scanlon, D.P., 2015. Measuring the dose of quality improvement initiatives. Med. Care Res. Rev. 10775587 15603567. Miller, W.L., Crabtree, B.F., Nutting, P.A., Stange, K.C., Jaén, C.R., 2010. Primary care practice development: a relationship-centered approach. Ann. Fam. Med. 8 (Suppl. 1), S68–S79. Moerbeek, M., 2008. Powerful and cost-efficient designs for longitudinal intervention studies with two treatment groups. J. Educ. Behav. Stat. 33, 41–61. Montgomery, D.C., 2005. Design and Analysis of Experiments, sixth ed.. Nutting, P.A., Crabtree, B.F., Miller, W.L., Stange, K.C., Stewart, E., Jaén, C., 2011. Transforming physician practices to patient-centered medical homes: lessons from the national demonstration project. Health Aff (Millwood) 30 (3), 439–445. Peikes, D., Dale, S., Lundquist, E., Genevro, J., Meyers, D., 2011. Building the Evidence Base for the Medical Home: What Sample and Sample Size Do Studies Need? Technical Report, Mathematica Policy Research. Raudenbush, S.W., 1997. Statistical analysis and optimal design for cluster randomized trials. Psychol. Methods 2, 173–185. Raudenbush, S.W., Bryk, A.S., 2002. Hierarchical Linear Models. Sage, Newbury Park, CA. Raudenbush, S.W., Liu, X., 2000. Statistical power and optimal design for multisite randomized trials. Psychol. Methods 5, 199–213. Raudenbush, S.W., Liu, X., 2001. Effects of study duration, frequency of observation, and sample size on power in studies of group differencess in polynomial change. Psychol. Methods 6, 387–401. Raudenbush, S.W., Martinez, A., Spybrook, J., 2007. Strategies for improving precision in group-randomized experiments. Educ. Eval. Policy Anal. 29, 5–29. Schochet, P.A., 2008. Statistical power for random assignment evaluations of education programs. J. Educ. Behav. Stat. 33, 62–87. Shin, Y., Raudenbush, S.W., 2012. Confidence bounds and power for the reliability of observational measures on the quality of a social setting. Psychometrika 77 (3), 543–560. Snijders, T., Bosker, R., 2012. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling, second ed. SAGE, Los Angeles; London. Taplin, S.H., Yabroff, K.R., Zapka, J., 2012. A multilevel research perspective on cancer care delivery: the example of follow-up to an abnormal mammogram. Cancer Epidemiol. Biomarkers Prevent. 21(10), 1709–1715.

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.

16

Y. Shin et al. / Journal of Statistical Planning and Inference (

)



Thompson, D.M., Fernald, D.H., Mold, J.W., 2012. Intraclass correlation coefficients typical of cluster-randomized studies: Estimates from the robert wood johnson prescription for health projects. Ann. Family Med. 10, 235–240. Usami, S., 2011. Statistical power of experimental research with hierarchical data. Behaviormetrika 38, 63–68. Usami, S., 2014a. A convenient method and numerical tables for sample size determination in longitudinal-experimental research using multilevel models. Behav. Res. Methods 46 (4), 1207–1219. Usami, S., 2014b. Generalized sample size determination formulas for experimental research with hierarchical data. Behav. Res. Methods 46, 346–356.

Please cite this article in press as: Shin Y., et al., Statistical power in two-level hierarchical linear models with arbitrary number of factor levels. J. Statist. Plann. Inference (2017), https://doi.org/10.1016/j.jspi.2017.09.004.