Journal of Statistical Planning and Inference 142 (2012) 2953–2964
Contents lists available at SciVerse ScienceDirect
Journal of Statistical Planning and Inference journal homepage: www.elsevier.com/locate/jspi
Design and relative efficiency in two-phase studies Yang Zhao a,n, Jerald F. Lawless b, Donald L. McLeish b a b
Department of Mathematics and Statistics, University of Regina, College West 307.14, Regina, SK, Canada S4S 0A2 Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada N2L 3G1
a r t i c l e i n f o
abstract
Article history: Received 28 October 2010 Received in revised form 14 March 2012 Accepted 26 April 2012 Available online 6 May 2012
Many two-phase sampling designs have been applied in practice to obtain efficient estimates of regression parameters while minimizing the cost of data collection. This research investigates two-phase sampling designs for so-called expensive variable problems, and compares them with one-phase designs. Closed form expressions for the asymptotic relative efficiency of maximum likelihood estimators from the two designs are derived for parametric normal models, providing insight into the available information for regression coefficients under the two designs. We further discuss when we should apply the two-phase design and how to choose the sample sizes for two-phase samples. Our numerical study indicates that the results can be applied to more general settings. & 2012 Elsevier B.V. All rights reserved.
Keywords: Asymptotic relative efficiency Expensive variable Maximum likelihood Normal model One-phase design Surrogate variable Two-phase design
1. Introduction Parametric regression models are commonly used to study the relationship between a response variable and a vector of covariates. In health and medical research, when certain variables are difficult or expensive to measure two-phase sampling designs are often used to collect information from a target population, where cheap variables are measured for a large phase I sample and expensive variables are measured only for a small subsample, called the phase II sample. The expensive variables can be response variables or covariates, while the cheap variables may include other variables in the regression model and/or surrogate measurements of the expensive variables. In the context of missing data problems, individuals included in the phase II sample have complete observations, while individuals included only in the phase I sample have incomplete observations. A recent literature review in Morara et al. (2007) indicated that the focus of most of the literature in two-phase or multi-phase designs has been on innovative analysis methods and only a few research papers discussed how to use two-phase design to achieve high estimation efficiency while minimizing the cost of data collection. For example, Chen et al. (2008) investigated estimation efficiency in settings where both the response and certain covariates are missing completely at random (Rubin, 1976). They compared the complete case (CC) analysis, the complete response (CR) analysis and the all case (AC) analysis based on the normal linear regression models, where expected information matrices were derived and the efficiency gains of the AC analysis over the CR analysis and the CR analysis over the CC analysis were addressed through expected information matrices and the asymptotic variances of the estimates. However, they did not consider design strategies which could efficiently reduce the cost of data collection and at the same time increase estimation efficiency for two-phase or multi-phase studies, and they did not consider surrogate measurements in the study. Cheaper surrogate measurements can be used in two-phase or multi-phase designs to improve estimation efficiency while reducing the cost of data collection.
n
Corresponding author. Tel.: þ1 306 5854348; fax: þ 1 306 5854020. E-mail address:
[email protected] (Y. Zhao).
0378-3758/$ - see front matter & 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.jspi.2012.04.013
2954
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
For regression models with expensive covariates and cheaper surrogate variables, Morara et al. (2007) proposed optimal multi-phase design, using a numerical approach to minimize the asymptotic variance of the maximum likelihood estimator (MLE) of the regression parameter of interest (e.g. the coefficient of an expensive covariate) under an appropriate constraint (e.g. the total budget available). They considered multi-phase sequential designs where a response variable was observed in phase I, surrogate variables were observed in sequential subsamples and the expensive covariates were only measured in the final subsample. A general likelihood function was provided under the assumption that the response variable and the surrogate variables were conditionally independent given the expensive covariates. A logistic regression model was used to model the selection probability at each sampling phase, which may depend on the response and surrogate variables observed in previous stages. Strategies for computing Hessian matrices and developing computational software were also discussed in their paper. The numerical procedures are general and very important especially when analytic results are not available or too complex. When surrogate variables are fully observed as is common in many applications, however, one possible drawback of their method (see the likelihood function in their Eq. (5)) is that their likelihood function conditions on the expensive covariates with missing values instead of the fully observed surrogate variables. For example, let S be a response variable, D be an expensive covariate with missing values and X be a fully observed covariate. Chatterjee et al. (2003) and Zhao et al. (2009) considered estimating the joint conditional distribution PrðS,D9XÞ ¼ PrðS9D,XÞPrðD9XÞ, given the fully observed covariate X, where estimation of the marginal distribution of X was not required. This is more efficient than estimating the joint distribution PrðS,D,XÞ ¼ PrðS9D,XÞPrðX9DÞPrðDÞ as in Morara et al. (2007) in general. An analytic study was conducted by Thomas (2007) in investigating two-phase design for estimation of a latent variable effect in regression models, where surrogate measurements of the latent variable were measured on a phase II subsample. Thomas considered normal and binary response data with normal covariates. He used closed form expressions for the asymptotic relative cost efficiency based on the simple random sampling to compare the efficiency in two-phase and one-phase designs, and some nice analytic results for the two-phase design were obtained. For missing data problems, Reilly and Pepe (1995) did an analytic comparison of two-phase and one-phase designs where the MLE based on data from a one-phase design was compared with a mean score estimator from a two-phase design. However, their study did not address all the information available in the twophase samples, since the mean score estimating function method is not fully efficient (e.g. Lawless et al., 1999). Zhao et al. (2009) described a semiparametric (MLE) method for regression models with expensive variables missing in twophase designs. They considered the expensive response and the expensive covariate problems. Following Zhao et al. (2009) in this research, we discuss design strategies for both the expensive response model and the expensive covariate model, which are different from those in Thomas (2007). We consider both analytic and numerical approaches. We assume that the phase II sample is a simple random subsample from a phase I sample with some fixed sampling fraction. In the context of missing data settings, we say that the expensive variables have values missing completely at random. We consider normal regression models and we express the asymptotic relative efficiency (ARE) of the MLE’s from the two-phase and one-phase designs explicitly as a function of the phase II sampling fraction, the correlations between the response and the covariates in the regression models and the correlations between the expensive variable and the surrogate measurements given the other variables in the regression model. This provides insight into the available information for regression coefficients based on the two-phase and one-phase designs and indicates factors related to efficiency gains from the two-phase design. We further discuss optimal two-phase design for fixed budget and fixed precision problems using a cost effective ARE which takes the costs of data collection into account. Some analytic results for the two-phase design are presented. For most settings analytic results cannot be obtained, for example, where regression models are non-normal and where the selection of phase II sample depends on the phase I sample. We use numerical methods to investigate whether the analytic results can be extended to other practical settings. We consider non-normal models and general linear models with correlated covariates. We further illustrate a numerical optimization procedure for stratified two-phase sampling design where the selection of phase II sample depends on the observed variables from the phase I sample. The rest of the article is organized as follows. Section 2 introduces notation and the normal models. It derives the ARE of the MLE’s of the two designs and examines the factors related to the ARE for the expensive response and the expensive covariate problems. Section 3 introduces the cost of data collection and defines the cost effective ARE and discusses the optimal two-phase designs for both fixed budget and fixed precision problems. Section 4 considers extending the results to other models and illustrates numerical approaches. Section 5 concludes with a brief discussion and some recommendations. 2. Asymptotic relative efficiency for normal models Let S, D and X be the random variables, where D is an expensive to measure variable and S and X are cheap-to-measure variables. In two-phase designs, the variables S and X are observed in the phase I sample and the expensive variable D is only observed in the phase II sample. Therefore, a complete observation on an individual selected for the phase II sample consists of ðS,D,XÞ and an incomplete observation for an individual not in the phase II sample consists of (S,X). We consider the following normal models: ðD9X; bÞ Nðb0 þ b1 X, s21 Þ, ðS9D,X; aÞ Nða0 þ a1 D, s22 Þ,
ð1Þ ð2Þ
where ðD9X; bÞ denotes the conditional distribution of D given X, and ðS9D,X; aÞ denotes the conditional distribution of S given ðD,XÞ. We assume that variable X has mean m and variance s2 . Here, b and a are the vectors of regression parameters.
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
2955
The model for ðS,D9XÞ given by (1) and (2) can represent several practical settings, as we indicate below. The normal model is chosen for its analytic tractability but as we discuss in Section 4, we believe that it provides insights for other models as well. We have for convenience let S, D and X be the scalars; we comment on this in Section 4 as well. Finally, in model (2) we assume that ðS9D,X; aÞ ¼ ðS9D; aÞ, that is, S and X are independent given D (see Morara et al., 2007 for examples). If we treat model (1) as the regression model of interest, then we get an expensive response problem, where D is the expensive response, X is a covariate and S is a cheap-to-measure surrogate for D. Model (2) gives the relationship between the response D and the surrogate S, and we assume that S is quite highly correlated with D. We are interested in estimating b ¼ ðb0 , b1 Þ0 , but in the two-phase design we need to estimate b, a ¼ ða0 , a1 Þ0 and ðs1 , s2 Þ jointly. If model (2) is instead the regression model of interest, we have an expensive covariate problem, where S is the cheap response variable, D is the expensive covariate and we are interested in estimating a. In this case, X is a surrogate for D and model (1) gives the relationship between D and X. In practical settings, we want X to be highly correlated with D. Similarly, a, b and ðs1 , s2 Þ will be estimated jointly in the two-phase design. We assume that the phase II sample is a simple random subsample from a phase I sample with a fixed sampling fraction p. We will compare the MLE’s from the two designs based on the same number of observations, that is, n independent complete observations ðS,D,XÞ from the one-phase sampling design compared with np complete plus nð1pÞ incomplete independent observations ðS,XÞ from the two-phase design. We will later introduce costs of measuring variables, and compare designs of equal cost. From models (1) and (2), we have that ðS9X; a, bÞ Nða0 þ a1 ðb0 þ b1 XÞ, a21 s21 þ s22 Þ:
ð3Þ
In the discussion that follows, X is always an observed covariate, so we condition on its observed values. The contribution to the log likelihood function from a complete observation ðS,D,XÞ can then be written as a constant plus lc ðhÞ ¼ logðD,S9X; b, aÞ ¼ logðD9X; bÞ þ logðS9D,X; aÞ ¼ logðs1 Þ
ðDb0 b1 XÞ2 ðSa0 a1 DÞ2 logðs2 Þ , 2 2s1 2s22
where h ¼ ðb0 , b1 , s1 , a0 , a1 , s2 Þ0 . The observed information matrix for the complete observation can be derived as Ic ðhÞ ¼ @2 lc ðhÞ=@h@h0 and its expectation is I c ðhÞ ¼ EfIc ðhÞg. For an incomplete observation, the contribution to the log likelihood function is, from (3), a constant plus 1 fSa0 a1 ðb0 þ b1 XÞg2 : lc ðhÞ ¼ logff ðS9X; b, aÞg ¼ logða21 s21 þ s22 Þ 2 2ða21 s21 þ s22 Þ The observed information matrix for the incomplete observation is Ic ðhÞ ¼ @2 lc ðhÞ=@h@h0 and let I c ðhÞ ¼ EfIc ðhÞg. In the one-phase design with n independent complete observations, the observed information matrix is I ð1Þ ðyÞ ¼ nI c ðhÞ. The asymptotic variance–covariance matrix for the maximum likelihood estimator h^ ð1Þ based on the one-phase sample is Asvarðy^ ð1Þ Þ ¼ I 1 ð1Þ ðyÞ ¼
1 1 I ðyÞ: n c
ð4Þ
In the two-phase design with np complete and nð1pÞ incomplete independent observations, the expected information matrix is I ð2Þ ðhÞ ¼ npI c ðhÞ þ nð1pÞI c ðhÞ. The asymptotic variance–covariance matrix for the maximum likelihood estimator h^ ð2Þ from the two-phase sampling design can be derived as Asvarðy^ ð2Þ Þ ¼ I 1 ð2Þ ðhÞ ¼
1 ½pI c ðhÞ þ ð1pÞI c ðhÞ1 : n
ð5Þ
Explicit expressions for the information matrices and the asymptotic variance–covariance matrices of the MLE’s are given in the Appendix. To compare the above two designs, we define the ARE of the MLE’s for a given scalar parameter yi for the two-phase design relative to the one-phase design as AREðy^ i Þ ¼
Asvarðy^ ið1Þ Þ , Asvarðy^ Þ
ð6Þ
ið2Þ
where the asymptotic variance of y^ ið1Þ and y^ ið2Þ , Asvarðy^ ið1Þ Þ and Asvarðy^ ið2Þ Þ, is given in (4) and (5), respectively. Next, we will examine how the ARE’s depend on the phase II sampling fraction p, the correlation between response variable and covariate, and the correlation between the expensive variable and the surrogate variable, given other variables in the regression model. We consider the expensive response problem and the expensive covariate problem separately. 2.1. Expensive response model In the expensive response model (1), the estimator of main interest is b^ 1 . We let s ¼ a21 s21 =s22
and
2
t ¼ b1 s2 =s21 :
2956
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
Using expressions for (4) and (5) in the Appendix with some algebra, we find the ARE of b^ 1 as AREðb^ 1 Þ ¼
pðs þ 1Þftðs2 þ 2psþ 1Þ þ ðs þpÞðs þ 1Þg
: ð7Þ tfps3 þð2p2 þ1Þs2 þ3ps þ 1g þ pðs þ 1Þ3 One may expect that the ARE of the two-phase and one-phase designs depends on the correlation between the expensive variable (the response) D and its surrogate S, given X, as well as p and the strength of the relationship between the response D and the covariate X. It can be shown that s¼
r2ðD,S9XÞ , 1r2ðD,S9XÞ
t¼
r2ðD,XÞ , 1r2ðD,XÞ
ð8Þ
where rðD,XÞ denotes the correlation between the response D and the covariate X, and rðD,S9XÞ denotes the correlation between the expensive variable D and its surrogate S, given X. Substituting (8) into (7) we can write AREðb^ 1 Þ as a function of the correlations and the phase II sampling fraction, ðrðD,S9XÞ , rðD,XÞ ,pÞ. In Fig. 1(a) and (b), we plot AREðb^1 Þ as a function of rðD,S9XÞ and p, respectively, at selected values of the other factors. We see that (i) AREðb^1 Þ 4p when rðD,S9XÞ 4 0, which indicates that the incomplete observations are informative for the regression parameter when the expensive response and the surrogate variable are correlated given the other covariates in the model; (ii) AREðb^ 1 Þ increases with rðD,S9XÞ significantly especially when both rðD,XÞ and p are small; and (iii) highly efficient two-phase design can be achieved at small p when rðD,S9XÞ is high and rðD,XÞ is low. For example, at ðrðD,XÞ , rðD,S9XÞ Þ ¼ ð0:2,0:9Þ, we can achieve 75% efficiency at p ¼0.05 in the two-phase study. The plot of the ARE as a function of p can be used to choose the optimal p in the two-phase sampling design which will be discussed later. 2.2. Expensive covariate model For the expensive covariate model (2), the estimator of interest is a^ 1 . Using the information in matrices (4) and (5), the ARE of a^ 1 can be computed as pðs þ 1Þfðs þ 1Þðs þ pÞ þ tð2psþ s2 þ 1Þg : ðp þ sÞðt þ 1Þð2ps þ s2 þ 1Þ
1.0
0.8
0.8 ARE for beta_1
1.0
0.6 0.4 (cor (D, X), p) = (0.2, 0.2) (cor (D, X), p) = (0.9, 0.2) (cor (D, X), p) = (0.2, 0.6) (cor (D, X), p) = (0.9, 0.6)
0.2 0.0 0.0
ARE for alpha_1
ð9Þ
0.2
0.4 0.6 cor (D, S|X)
0.8
0.6 0.4
0.0
1.0
0.0
1.0
1.0
0.8
0.8
0.6 0.4 (cor (S, D), p) = (0.2, 0.2) (cor (S, D), p) = (0.9, 0.2) (cor (S, D), p) = (0.2, 0.6) (cor (S, D), p) = (0.9, 0.6)
0.2 0.0 0.0
0.2
0.4 0.6 cor (D, X|S)
0.8
1.0
(cor (D, X), cor (D, S|X)) = (0.2, 0.2) (cor (D, X), cor (D, S|X)) = (0.9, 0.2) (cor (D, X), cor (D, S|X)) = (0.2, 0.9) (cor (D, X), cor (D, S|X)) = (0.9, 0.9)
0.2
ARE for alpha_1
ARE for beta_1
AREða^ 1 Þ ¼
0.4 0.6 0.2 0.8 Phase II sampling fraction p
1.0
0.6 0.4 (cor (S, D), cor (D, X|S)) = (0.2, 0.2) (cor (S, D), cor (D, X|S)) = (0.9, 0.2) (cor (S, D), cor (D, X|S)) = (0.2, 0.9) (cor (S, D), cor (D, X|S)) = (0.9, 0.9)
0.2 0.0 0.0
0.4 0.6 0.2 0.8 Phase II sampling fraction p
Fig. 1. Plots of ARE’s: (a and b) expensive response model and (c and d) expensive covariate model.
1.0
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
To examine factors related to the AREða^ 1 Þ, when X has a normal distribution we can show that 8 r2ðS,DÞ ð1r2ðD,X9SÞ Þ > > > >s¼ , > > 1r2ðS,DÞ ð1r2ðD,X9SÞ Þ < > r2ðD,X9SÞ > > > , t ¼ > > : ð1r2ðD,X9SÞ Þð1r2ðS,DÞ Þ
2957
ð10Þ
where rðS,DÞ denotes the correlation between the response S and the covariate D, and rðD,X9SÞ is the correlation between the expensive variable D and its surrogate X, given S. Some details are given in the Appendix. Substituting (10) into (9) we can write the AREða^ 1 Þ as a function of ðrðD,X9SÞ , rðS,DÞ ,pÞ. In Fig. 1(c) and (d), we plot the AREða^ 1 Þ as a function of rðD,X9SÞ and p, respectively, at different values of the other factors. Similar to the expensive response model, we see that AREða^ 1 Þ increases when the correlation rðD,X9SÞ increases. However, at rðD,X9SÞ ¼ 0, we observe that the AREða^ 1 Þ 4 p which is different from what we see in the expensive response model. To better see the association between AREða^ 1 Þ and rðS,DÞ at rðD,X9SÞ ¼ 0, using (9) and (10) we get a simplified AREða^ 1 Þ at rðD,X9SÞ ¼ 0 as AREða^ 1 Þ ¼
pðs þ1Þ2 ðs þ 1Þ2 2sð1pÞ
¼
p 12ð1pÞr2ðS,DÞ ð1r2ðS,DÞ Þ
pffiffiffi and we can show that AREða^ 1 Þ reaches its maximum 2p=ð1þ pÞ at rðS,DÞ ¼ 7 1= 2 7 0:7, and it reaches its minimum p at rðS,DÞ ¼ 0 or 71. The result indicates that for the expensive covariate model the incomplete observations are informative even when there is no valuable surrogate for the expensive covariate; the response can provide some information about the unobserved expensive covariates as long as the response and expensive covariate are correlated and the correlation does not equal to 71. So, the incomplete observations should be used in the regression analysis even when there is no surrogate variable. This result provides a precise justification for the comments in Little (1992) (see Section 1.1 before Example 1). 3. Two-phase designs for normal model The results of our study about the ARE’s indicate that in certain settings we can achieve quite high efficiency using the two-phase design, which can be much cheaper to conduct than the one-phase design with the same sample size. In this section, we further discuss when we should use the two-phase sampling design and how to choose sample sizes for twophase samples. We consider optimal two-phase design issues for fixed budget and fixed precision problems separately. 3.1. Fixed budget designs Let Ct be the fixed total budget, Cc and C c denote the cost for observing a complete observation ðS,D,XÞ and an incomplete observation (S,X ), respectively, and let R ¼ C c =C c . In general, C c b C c so 0 oR o1. For a given budget Ct, we can either observe n independent complete observations in the one-phase design or observe n1 p complete plus n1 ð1pÞ incomplete independent observations in the two-phase design. For the two designs to have equal cost, we require C t ¼ nC c ¼ n1 pC c þ n1 ð1pÞC c , which implies that the phase I sample size in the two-phase design is n1 ¼
n Ct ¼ p þ ð1pÞR C c ðp þ ð1pÞRÞ
ð11Þ
and n1 4n. Recall that the asymptotic relative efficiency AREðy^ i Þ that we discussed in Section 2 is based on the same sample size for the two designs. Using the sample size in (11), we can therefore define the cost effective ARE for the twophase design relative to the one-phase design based on the same cost as Asvarðy^ ið1Þ Þ=n AREðy^ i Þ : AREc ðy^ i Þ ¼ ¼ ^ fp þð1pÞRg Asvarðy ið2Þ Þ=n1
ð12Þ
The expression for AREc ðy^ i Þ in (12) indicates that the relative cost R is a significant factor for the cost effective ARE, and when R decreases the AREc ðy^ i Þ increases. In Fig. 2(a), we plot the cost effective ARE of b^ 1 in the expensive response model, AREc ðb^ 1 Þ, for different values of ðR, rðD,S9XÞ Þ at fixed rðD,XÞ ¼ 0:2. We see that the curve of AREc ðb^ 1 Þ is above 1 when p is bigger than some threshold value, and it reaches its maximum at some optimal p. The results indicate that for the fixed budget problem: (i) the two-phase design can be much more efficient than the one-phase design especially when rðD,S9XÞ is high and R is small and (ii) there exists an optimal p which maximizes AREc ðb^ 1 Þ, and when R decreases and/or rðD,S9XÞ increases the optimal p decreases and the maximum AREc ðb^ 1 Þ increases. For example, the maximum AREc ðb^ 1 Þ increases from 1.07 at ðR, rðD,S9XÞ Þ ¼ ð0:2,0:5Þ to 3.14 at ðR, rðD,S9XÞ Þ ¼ ð0:2,0:9Þ, and the corresponding optimal p decreases from 0.2 to 0.04. The cost effective ARE of a^ 1 in the expensive covariate problem, AREc ða^ 1 Þ, is plotted in Fig. 2(b) for different values of ðR, rðD,X9SÞ Þ at fixed rðS,DÞ ¼ 0:2. We observe very similar results as for the expensive response model.
2958
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
3.0 ARE for alpha_1
ARE for beta_1
3.0
2.0
1.0 (R, cor (D, S|X)) = (0.1, 0.5) (R, cor (D, S|X)) = (0.2, 0.5) (R, cor (D, S|X)) = (0.2, 0.9)
0.0 0.0
0.2 0.4 0.6 0.8 Phase II sampling fraction p
2.0
1.0 (R, cor (D, X|S)) = (0.1, 0.5) (R, cor (D, X|S)) = (0.2, 0.5) (R, cor (D, X|S)) = (0.2, 0.9)
0.0
1.0
0.0
0.2 0.4 0.6 0.8 Phase II sampling fraction p
1.0
3.0 ARE for beta_1
ARE for alpha_1
1.5
1.0
0.5 (R, cor (S, D)) = (0.1, 0.2) (R, cor (S, D)) = (0.1, 0.7) (R, cor (S, D)) = (0.01, 0.7)
0.0 0.0
0.2 0.4 0.6 0.8 Phase II sampling fraction p
1.0
2.0
1.0 gamma = 1 gamma = 0.318
0.0 0.0
0.2 0.4 0.6 0.8 Phase II sampling fraction p
1.0
Fig. 2. Plots of cost effective ARE’s: (a and d) expensive response model and (b and c) expensive covariate model.
To further examine the expensive covariate problem in settings where there is no valuable surrogate, that is rðD,X9SÞ ¼ 0, we obtain the simplified AREc ða^ 1 Þ as AREc ða^ 1 Þ ¼
p : f12ð1pÞr2ðS,DÞ ð1r2ðS,DÞ Þgfp þ ð1pÞRg
ð13Þ
In Fig. 2(c), we plot the simplified AREc ða^ 1 Þ as a function of p for several values of ðR, rðS,DÞ Þ. We see that when S and D are moderately correlated and R is small the two-phase design can be more efficient than the one-phase design even when there is no surrogate variable. Especially for given ðR, rðS,DÞ Þ and the optimal p can be obtained as sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 11=f2r2ðS,DÞ ð1r2ðS,DÞ Þg : ð14Þ p¼ 11=R 3.2. Fixed precision designs Section 3.1 shows that for a fixed total budget Ct and a given relative cost R, we can achieve the maximum efficiency in the two-phase study using the optimal phase II sampling fraction p based on the cost effective AREc ðy^ i Þ. We notice that each point on the same curve of the cost effective ARE plot has the same cost, and the maximum of AREc ðy^ i Þ is proportional to the total budget Ct. Therefore, for the fixed precision problem we can change the total budget Ct to a multiple of Ct, say gC t with 1 4 g 4 0, such that the maximum of gAREc ðy^ i Þ reaches the required precision, then the corresponding total cost gC t is the minimum cost required in the two-phase design such that the fixed precision can be achieved. For example, for the expensive response model we want to find the minimum cost for the two-phase design such that b1 can be estimated as efficiently as in the one-phase design using n complete observations with a total cost Ct. Assume that we have a ,r ,RÞ ¼ ð0:9,0:2,0:2Þ. In Fig. 2(d), we plot gAREc ðb^ Þ as a function of p, and see that (i) when g ¼ 1 the model with ðr ðD,X9SÞ
1
ðS,DÞ
maximum of gAREc ðb^ 1 Þ is 3.14, which means that the two-phase sampling design can be 3.14 times as efficient as the one-phase design based on the same total cost C , and (ii) when g ¼ 1=3:14 ¼ 0:318 the maximum of gAREc ðb^ Þ reaches 1 at p¼0.04, which t
1
indicates that in the two-phase design with a cost of gC t ¼ 0:318C t we can achieve the same efficiency as the one-phase design
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
2959
with a total cost of Ct. We calculate the phase I and phase II sample size as n1 ¼
gn 0:318 ¼ n ¼ 1:37n p þ ð1pÞR 0:04þ 0:2ð10:04Þ
and
n1 p ¼ 0:04ð1:37nÞ ¼ 0:055n
respectively. In practice, we will need to get initial estimates of s and t, or the correlations, say ðrðD,S9XÞ , rðD,XÞ Þ for the expensive response model, and ðrðD,X9SÞ , rðS,DÞ Þ for the expensive covariate model with a normal X. Then for given R, we can obtain the optimal p for the two-phase design using the cost effective ARE in (12) or (13). 4. Other models and some extensions Analytic results as in Sections 2 and 3 are limited to special models in certain settings. There is, for example, no analytic results for logistic regression models. However, as mentioned by the referee, in certain settings often one can treat the binary response variable as being generated from some underlying model for a continuous outcome: the response is 1 if the outcome is larger than a certain number and 0 otherwise. In addition, if the underlying model is normal, then the results in Sections 2 and 3 can be applied directly. In this section, we use numerical methods to study models where analytic results are not available or too complex. We demonstrate that the analytic results in Sections 2 and 3 have more general application. We further illustrate numerical procedures which can be applied to achieve more efficient two-phase designs: the stratified two-phase sampling designs. 4.1. Simulation study for other models In practice, it is common that (i) there are other covariates in the regression model except for X and D and (ii) the conditional distributions, ðD9XÞ and ðS9D,XÞ, may not be exactly normal. In this section, we use simulation study to exam if the analytic results in Sections 2 and 3 can be extended to deal with these two cases. In particular, for case (i) we assume that there is an other covariate Z in model (1) and (2), and EðD9X,ZÞ ¼ b0 þ b1 X þ b2 Z and EðS9D,X,ZÞ ¼ a0 þ a1 D þ a2 Z. We assume that X and Z have a joint normal distribution and each has mean 0 and variance 1. We let c denote the correlation between X and Z. For case (ii), we assume that D ¼ b0 þ b1 X þ E1 and S ¼ a0 þ a1 D þ E2 , and we consider the following three settings: ðaÞ E1 tð4Þ
and
ðbÞ E1 Nð0, s21 Þ
E2 Nð0, s22 Þ, and
E2 tð4Þ,
and ðcÞ E1 tð3Þ
and
E2 tð3Þ,
where tð4Þ and tð3Þ denotes the Student-t distribution with degree of freedom 4 and 3, respectively. We let b ¼ a ¼ ð0:1,1; 1Þ0 . In case (i), for the expensive response model we use s21 ¼ 25 and s22 ¼ 6:25, and for the expensive covariate model we use s21 ¼ 0:25 and s22 ¼ 25. The selected values for the variances in each case represent the settings where the expensive variable and its surrogate variable are highly correlated given the other variables in the model. We consider different correlation between X and Z, say c¼0.5 and 0.3. In case (ii), we let s21 ¼ s22 ¼ 1. For each case, we generate a large phase I sample with size n¼100,000. Then, phase II sample is selected under the simple random sampling design with selection probability starting from p¼ 0.05. For each model, we estimate asymptotic relative efficiency of the MLE’s, AREðb^ 1 Þ or AREða^ 1 Þ, based on the simulated data. The simulated
ARE for beta_1
0.8
0.4 Theoretical ARE Simulated ARE (c = −0.3)
Expensive covariate model
ARE for alpha_1
Expensive response model
0.8
0.4 Theoretical ARE Simulated ARE (c = −0.3)
Simulated ARE (c = 0.5)
0.0
Simulated ARE (c = 0.5)
0.0 0.2 0.4 0.6 0.8 Phase II sampling fraction p
0.2 0.4 0.6 0.8 Phase II sampling fraction p
Fig. 3. Plots of ARE’s for models in case (i).
2960
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
ARE’s and the corresponding theoretical ARE’s computed from (7) and (9) are plotted in Figs. 3 and 4 for cases (i) and (ii), respectively. We see that the simulated ARE’s, for models in cases (i) and (ii), are very close to the theoretical ARE’s, which including the plots not shown in here for low or moderate correlation between X and Z, say 9c9 o0:5 in case (i). The only case where the simulated ARE diverges slightly from the theoretical ARE is for the expensive covariate model when the correlation between X and Z is high, here c¼0.5. Our simulation indicates that extending the analytic results to more general models are practically acceptable. More investigations in this area are valuable. 4.2. Stratified two-phase design In settings where fine surrogate variables are observed in phase I sample, one often use stratified two-phase sampling design, which stratifies on the fine surrogate variables for the selection of phase II sample to maximize the information in Expensive covariate model (a) 1.0
0.8
0.8
0.6
0.4 Theoretical ARE Simulated ARE
ARE for alpha_1
ARE for beta_1
Expensive response model (a) 1.0
0.6
0.4 Theoretical ARE Simulated ARE
0.2
0.2
0.0
0.0 0.2
0.4 0.6 Phase II sampling fraction p
0.8
0.2
0.8
0.8
0.6
0.4 Theoretical ARE Simulated ARE
ARE for alpha_1
ARE for beta_1
1.0
0.4 Theoretical ARE Simulated ARE 0.2
0.0
0.0 0.6
0.2
0.8
0.8
0.8
0.6
0.4 Theoretical ARE Simulated ARE
ARE for alpha_1
ARE for beta_1
1.0
0.8
0.6
0.4 Theoretical ARE Simulated ARE
0.2
0.2
0.0
0.0 0.6
0.6
Expensive covariate model (c)
Expensive response model (c) 1.0
0.4
0.4
Phase II sampling fraction p
Phase II sampling fraction p
0.2
0.8
0.6
0.2
0.4
0.6
Expensive covariate model (b)
Expensive response model (b) 1.0
0.2
0.4
Phase II sampling fraction p
0.8
0.2
Phase II sampling fraction p
Fig. 4. Plots of ARE’s for models in case (ii).
0.4 0.6 Phase II sampling fraction p
0.8
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
2961
the observed data. For example, in the ‘‘balanced design’’ (Breslow and Cain, 1988) the phase II selection probabilities are proportional to the inverses of stratum sizes, where the strata are defined in terms of the phase I observations. The numerical study of Breslow and Chatterjee (1999) indicated that the ‘‘balanced design’’ is more efficient than the simple random sampling design. In this section, we illustrate a numerical procedure which can be used to achieve the optimal stratified two-phase design. We consider the models in (1) and (2). We assume that X is binary and PrðX ¼ 1Þ ¼ m and s2 ¼ mð1mÞ. We let the phase II selection probability depend on X, say the phase II selection probability is p1 if X ¼1 and p0 otherwise. Let I c0 ðhÞ ¼ EfIc ðhÞ9X ¼ 0g, I c1 ðhÞ ¼ EfIc ðhÞ9X ¼ 1g, I c 0 ðhÞ ¼ EfIc ðhÞ9X ¼ 0g, and I c 1 ðhÞ ¼ EfIc ðhÞ9X ¼ 1g. Explicit expressions for these expected information matrices are given in the Appendix. In the stratified two-phase design with nfmp1 þ ð1mÞp0 g complete and nfmð1p1 Þ þ ð1mÞð1p0 Þg incomplete independent observations, the expected information matrix can be written as I Sð2Þ ðhÞ ¼ n½mfp1 I c1 ðhÞ þð1p1 ÞI c 1 ðhÞg þ ð1mÞfp0 I c0 ðhÞ þ ð1p0 ÞI c 0 ðhÞg: The asymptotic variance of the MLE from the stratified two-phase design is Asvarðh^ Sð2Þ Þ ¼ I Sð2Þ ðhÞ1 : The corresponding ARE’s of the MLE’s comparing the stratified two-phase design with one-phase design is ARES ðy^ i Þ ¼ Asvarðy^ ið1Þ Þ=Asvarðy^ iSð2Þ Þ. For both the expensive response and the expensive covariate models, the closed form expressions of the ARE’s are very complex. However, with known or estimated values of h and m the ARE’s are just functions of ðp1 ,p0 Þ, which can be optimized numerically. As an example, we consider the fixed budget problem in Section 3.1. The cost effective ARE under the stratified design is AREcS ðy^ i Þ ¼
ARES ðy^ i Þ : ½fmp1 þð1mÞp0 g þ fmð1p1 Þ þ ð1mÞð1p0 ÞgR
For the expensive covariate model, let say that the known (or estimated) values for the parameters are b ¼ a ¼ ð0:1,1; 1Þ0 , m ¼ 0:1, s21 ¼ 0:0225, s22 ¼ 2:25, and R ¼0.2. We can obtain the optimal stratified two-phase design by maximizing AREcS ða^ 1 Þ numerically with respect to ðp0 ,p1 Þ using the Matlab function ‘‘fmincon’’. The maximum of AREcS ða^ 1 Þ is 3.249 at ðp0,p1Þ ¼ ð0:017,0:156Þ. To compare with the simple random sampling design, that is p0 ¼ p1 ¼ p, we get the maximum AREcS ða^ 1 Þ ¼ 2:849 at p¼ 0.052. We see that the ARE is increased 14% in the stratified design compared to the simple random sampling design. We note that when p0 ð1mÞ ¼ p1 m we get the ‘‘balanced design’’. In this particular example, the optimal stratified two-phase design is just the ‘‘balanced design’’. However, this is not always the case. The stratified two-phase design can be more efficient than the ‘‘balanced design’’ especially when the selection probability depends on both the response and the covariate and there are other highly significant covariates in the model (Breslow and Cain, 1988; Zhao, 2005). In practice, often the closed form expressions for the conditional expected information matrices are not available. To derive the optimal stratified two-phase design, we may need to compute the AREcS from a pilot study. 5. Discussion This article discusses both analytic and numerical approaches to compare two-phase designs with one-phase designs for the expensive response and the expensive covariate problems. The results of investigations indicate that the relative cost of measurement and the correlation between an expensive variable and its surrogate, given the other variables in the model, are related to the cost effective ARE’s for the two designs. The two-phase sampling design is in general more efficient than the one complete sampling design especially when the relative cost R is low and the correlation between the expensive variable and its surrogate, given other variables in the model, is high. In cases where analytic results cannot be obtained or are too complex, the numerical approaches are valuable. The numerical investigation in Section 4.1 indicates that the analytic results in Sections 2 and 3 may be extended to models in certain settings as cases (i) and (ii). Following the same procedure as in Section 4.1, we can examine other parametric regression models by comparing the simulated ARE’s with the theoretical ARE’s in (7) and (9). If the theoretical ARE’s fit the simulated ARE’s well, then we can use the theoretical ARE’s to derive the optimal two-phase design as in Section 3. We have some investigations on stratified two-phase design for regression models with only binary variables, say D, S and X are all binary variables. We let the selection probability at phase II depend on ðS,XÞ. The closed form expression for the ARE of the MLE’s from the two designs is complex. Our numerical study in this setting indicates that the stratified design is more efficient than the ‘‘balanced design’’ and the ‘‘balanced design’’ is more efficient than the simple random sampling design. With modern fast computing facility the numerical approaches as illustrated in Section 4.2 should be applied to obtain the optimal stratified two-phase sampling design in practice especially when fine surrogates are observed in phase I sample. Developing computational software for the stratified two-phase sampling design is valuable.
Acknowledgments This research was partially supported by an Ontario Graduate Scholarship (Y.Z.) and by grants from the Natural Sciences and Engineering Research Council of Canada (J.F.L., D.L.M., Y.Z.). The authors thank the anonymous referee and the Associate Editor and the Editor for their constructive comments.
2962
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
Appendix The expected information matrix for a complete observation ðS,D,XÞ is 01 1 m 0 0 0 0 s21 s21 B C B m m2 þ s2 C B s2 0 0 0 0C s21 B 1 C B C 2 B0 C 0 0 0 0 B C s21 B C I c ðhÞ ¼ B C: b þ b m 1 0 1 0 0 0 B0 C 2 2 s s B C 2 2 B C 2 2 2 B C b s þ s21 þ ðb0 þ b1 mÞ B0 C 0 0 0 b0 þs2b1 m 1 2 s B C 2 2 @ A 2 0 0 0 0 0 s2 2
Here, the expectation is also taken over X. However, the ‘‘fixed X’’ case corresponds to putting m1 ¼ X , s2 ¼ ð1=nÞ The expected information matrix for an incomplete observation (S,X) can be derived as ! I b I ba I c ðyÞ ¼ , I ab I a
P
i ðX i X Þ
where 0
a21
2 1 2 2 þ 1ð
a m a m s2 Þ
1B I b ¼ @ a21 m A 0
a1
2 1 =A
2a s
C A,
ðb0 þ b1 mÞa1 fb0 m þ b1 ðs2 þ m2 Þga1
1B I ba ¼ I 0ab ¼ @ a1 m A 0 0 1 1B B b þ I a ¼ @ 0 b1 m A 0
0 4 1
0 0
1
0
3 1
3 1 =A
2a s
1
0 0 2 1
2a s1 s2 =A
b0 þ b1 m
C A,
1
0
2 Þ2 þ b1
ðb0 þ b1 m s2 þ 2a21 s41 =A 2a1 s2 s21 =A
C 2a1 s2 s21 =A C A, 2s22 =A
with A ¼ a21 s21 þ s22 . The asymptotic variance–covariance matrix for the MLE h^ ð1Þ from the 0 ðm2 þ s2 Þs2 ms2 1 s21 0 0 0 s2 B B ms2 s21 B s21 0 0 0 s2 B B s21 B 0 0 0 0 2 1B B I 1 ðb21 s2 þ s21 þ ðb0 þ b1 mÞ2 Þs22 ðb0 þ b1 mÞs22 B ð1Þ ðyÞ ¼ 2 2 2 0 0 0 nB b21 s2 þ s21 b1 s þ s1 B B B ðb0 þ b1 mÞs22 s22 B 0 0 0 2 2 2 b1 s þ s1 b21 s2 þ s21 B @ 0 0 0 0 0
one-phase sample is 1 0 C C 0C C C 0C C C C: 0C C C C 0C C A 2 s2 2
The asymptotic variance–covariance matrix for the MLE h^ ð2Þ from the two-phase sampling design can be derived as ! 1 I b I ba 1 I ð2Þ ðhÞ ¼ , n I ab I a where 0 Ib ¼
1
B B B
2A H B 1 3 @
s
s41 ðps þ 1ÞA3 H pðs þ 1Þ
2pm2 A2
2pmA2
2pmA2
2pA2
2pð1pÞ2 s2 s2 b1 m
2 2 2 2pð1pÞs1s s b1
s1
2pð1pÞ2 s2 s2 b1 m
s1
1
C C 2 2 2 2pð1pÞs1s s b1 C C A 2 ps A1
2
.
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
2963
and 0 Ia ¼
1 B @ pðs þ 1ÞB2
ðp þ sÞs22 fB2 þ m2D B1 g
s22 mD B1
s22 mD B1
ðp þ sÞs22 B1
2 1
ð1pÞa1 s s2 mD
2 1
ðp þsÞð1pÞs s2 a1
ð1pÞa1 s21 s2 mD
1
ðp þ sÞð1pÞs21 s2 a1 C A,
s21 s22 B3 =2
with A1 ¼ tðs þ 1Þðps2 þ 2ps þ1Þ þðp þ sÞðps2 þ 2s þ 1Þ, A2 ¼ tðpsþ 1Þð2psþ s2 þ 1Þ þ pðs þ 1Þ3 , A3 ¼ tðpsþ 1Þðps2 þ 2ps þ 1Þ þ pðs þ 1Þðps2 þ 2s þ 1Þ, H ¼ 2p2 s2 f2ð1pÞ4 s4 tA1 A2 g=ðs41 A23 Þ,
mD ¼ b0 þ b1 m, B1 ¼ 2psþ s2 þ 1, B2 ¼ s21 fðs þ 1Þðs þ pÞ þ tB1 g, and B3 ¼ tðs þ 1Þð2psþ s2 þ pÞ þðp þ sÞðp þ s2 þ 2sÞ: When X has normal distribution with mean m and variance s2 , we can derive the following probability distributions: 2
ðDÞ Nðb0 þ b1 m, b1 s2 þ s21 Þ, 2
ðSÞ Nða0 þ a1 ðb0 þ b1 mÞ, s22 þ a21 ðs21 þ b1 s2 ÞÞ, !
ðD9SÞ N
a1 ðSa0 Þðb21 s2 þ s21 Þ þ s22 ðb0 þ b1 mÞ s22 ðs21 þ b21 s2 Þ , , s22 þ a21 ðs21 þ b21 s2 Þ s22 þ a21 ðs21 þ b21 s2 Þ
ðX9SÞ N
a1 b1 s2 ðSa0 a1 b0 Þ þ mða21 s21 þ s22 Þ s2 ða21 s21 þ s22 Þ , : s22 þ a21 ðs21 þ b21 s2 Þ s22 þ a21 ðs21 þ b21 s2 Þ
and !
Then for the expensive covariate model the correlation between the expensive variable D and its surrogate variable X given S, rðD,X9SÞ , and the correlation between the response S and the covariate D, rðS,DÞ, can be obtained as sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ss2 b1 t rðD,X9SÞ ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi , ð15Þ ¼ ðs þ 1Þðt þ 1Þ 2 2 2 2 2 2 ðb1 s þ s1 Þða1 s1 þ s2 Þ sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi sðt þ 1Þ : rðS,DÞ ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ¼ sðt þ 1Þ þ 1 2 2 2 2 2 2 2 2 ðb1 s þ s1 Þða1 ðb1 s þ s1 Þ þ s2 Þ
a1 ðs21 þ b21 s2 Þ
The conditional expected information matrices for a complete observation ðS,D,XÞ, given X, are 0 2 1 s1 0 0 0 0 0 B C B 0 0 0 0 0 0 C B C 2 B 0 0 0 0 C 0 2s1 B C B C b0 2 C I c0 ðhÞ ¼ B 0 0 0 s 0 B C 2 s22 B C 2 B C 2 s1 þ b0 b0 B 0 C 0 0 0 2 2 B C s2 s2 @ A 0 0 0 0 0 2s2 2
ð16Þ
2964
Y. Zhao et al. / Journal of Statistical Planning and Inference 142 (2012) 2953–2964
and
0
s2 s2 1 1 B 2 s2 B s1 1
B B 0 B B I c1 ðhÞ ¼ B B 0 B B B 0 B @ 0
0 0
0
2s
0
0
2 1
0 0
0 0
0 0
0
0
0
s2 2
b0 þ b1
0
s22
0
0
b0 þ b1
s22
s21 þ ðb0 þ b1 Þ s22
0
0
0
0
1
2
0 2s2 2
C C C C C C C: C C C C C A
The conditional expected information matrices for an incomplete observation (S,X), given X, are 0 2 1 a1 a1 b0 a1 0 0 0 s2 s2 B s2 C B 0 C 0 0 0 0 0 B C 2 4 3 3 2 B s1 a 1 a1 s1 s1 s2 a1 C B 0 0 2 s4 2 s4 C 0 2 s4 B C C b0 I c 0 ðhÞ ¼ B 2 B sa12 0 C 0 s 0 2 s B C Ba b 2 3 3 4 2 2 s1 a1 s a s C b0 b0 B 1 0 0 2 a 1 s1 2 1 s14 2 C B s2 C s4 s2 s2 þ2 s4 @ A s1 s2 a21 s21 a1 s2 s22 0 0 2 s4 0 2 s4 2 s4 and
0
a21 s2 a21 s2
B B B B B B 0 B I c 1 ðhÞ ¼ B B a1 B s2 B B a1 ðb0 þ b1 Þ B B s2 @ 0
a21 s2 a21 s2
0 a1 s2 a1 ðb0 þ b1 Þ s2
0
0 0 s2 a4
2 s1 4 1 0 a3 s3 2 s1 4 1 s s a2 2 1 s42 1
a1 s2 a1 s2
0
a1 ðb0 þ b1 Þ s2 a1 ðb0 þ b1 Þ s2 a31 s31 2 s4 b0 þ b1
s2 b0 þ b1
s2
0
s2
ðb0 þ b1 Þ s2
2
s4 a2
þ 2 s1 4 1
s2 a s2
2 1 s14
0
1
C C C C C s1 s2 a21 C 2 s4 C C: 0 C C C s21 a1 s2 C 2 s4 C C A s2 2 s24 0
References Breslow, N.E., Cain, K.C., 1988. Logistic regression for two-stage case–control data. Biometrika 75 (1), 11–20. Breslow, N.E., Chatterjee, N., 1999. Design and analysis of two-phase studies with binary outcome applied to wilms tumour prognosis. Applied Statistics 48 (4), 457–468. Chatterjee, N., Chen, Y., Breslow, N.E., 2003. A pseudo-score estimator for regression problems with two-phase sampling. Journal of the American Statistical Association 98, 158–168. Chen, Q., Ibrahim, J.G., Chen, M.H., Senchaudhuri, P., 2008. Theory and inference for regression models with missing responses and covariates. Journal of Multivariate Analysis 99, 1302–1331. Lawless, J.F., Kalbfleisch, J.D., Wild, C.J., 1999. Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society B 61 (2), 413–438. Little, R.J.A., 1992. Regression with missing x’s: a review. Journal of the American Statistical Association 87, 1227–1237. Morara, M., Ryan, L., Houseman, A., Strauss, W., 2007. Optimal design for epidemiological studies subject to designed missingness. Lifetime Data Analysis 13, 583–605. Reilly, M., Pepe, M., 1995. A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82 (2), 299–314. Rubin, D., 1976. Inference and missing data. Biometrika 63 (3), 581–592. Thomas, D.C., 2007. Multistage sampling for latent variable models. Lifetime Data Analysis 13, 565–581. Zhao, Y., 2005. Design and efficient estimation in regression analysis with missing data in two-phase studies. Unpublished Ph.D. Thesis, University of Waterloo. Zhao, Y., Lawless, J.F., McLeish, D.L., 2009. Likelihood methods for regression models with expensive variables missing by design. Biometrical Journal 51, 123–136.