Bias analysis for misclassification in a multicategorical exposure in a logistic regression model


Statistics and Probability Letters 83 (2013) 2621–2626



Yaqing Liu^a, Juxin Liu^b,*, Fuxi Zhang^c

a Sasktel and TRlabs, Canada
b Department of Mathematics and Statistics, University of Saskatchewan, Canada
c School of Mathematical Sciences, Peking University, China

Article history: Received 18 January 2013; Received in revised form 26 June 2013; Accepted 30 August 2013; Available online 5 September 2013.

Abstract: We derive expressions for the asymptotic biases that arise when misclassification in a multicategory exposure is ignored. For a model with a misclassified exposure variable only, we give a general conclusion on the direction of the biases under the nondifferential misclassification assumption. A numerical example illustrates the bias formulas.

Crown Copyright © 2013 Published by Elsevier B.V. All rights reserved.

Keywords: misclassification; nondifferential misclassification; bias; multicategorical exposure; logistic regression

1. Introduction

In many epidemiological studies, the central interest is the association between a binary health outcome and a categorical exposure variable. One common difficulty in practice is that exposures are not always assessed correctly. For example, self-reported tobacco use is widely used to measure smoking prevalence in epidemiological studies, but some people may not honestly report their smoking habits in an interview. Because of this possible error, the actual relationship between smoking and certain diseases (e.g., lung cancer) may go undetected, and the misleading result may in turn influence decision makers allocating resources for smoking prevention. The consequences of ignoring errors in the measurement of variables have been studied theoretically for over 50 years; for a textbook treatment of this topic, see Carroll et al. (2006) and the references therein. It is well known that the techniques to adjust for misclassification differ across types of variables. For discrete or categorical variables, there has been extensive work on the misclassification of a binary exposure. To name a few contributions: Greenland (1988) proposed a method for variance estimation of the estimates under misclassification; Davidov et al. (2003) studied the asymptotic bias caused by a misclassified binary exposure in the logistic regression context; Prescott and Garthwaite (2005) developed a Bayesian method to account for misclassification in a binary risk factor under a prospective logistic regression model. However, there has been very limited work on the misclassification of a multicategory covariate (Reade-Christopher and Kupper, 1991; Viana, 1994; Ruiz et al., 2008). To the best of our knowledge, no work has addressed the large-sample behavior of the model parameters when misclassification in a multicategory exposure is ignored.

Our work here concerns this asymptotic bias. In this paper, we



Corresponding author. E-mail address: [email protected] (J. Liu).



extend the formulas derived in Davidov et al. (2003) to a more general setting in which the error-prone exposure has two or more categories. The bias formulas give some insight into how severe the consequences of simply ignoring the misclassification of the exposure variable can be. The remainder of this paper is organized as follows. Section 2 derives the asymptotic biases due to misclassification of a multicategory covariate under two models, with and without an error-free covariate. Section 3 discusses the sign of the bias under the latter model. Section 4 presents a numerical example demonstrating how the biases change with the misclassification rates. Section 5 summarizes the findings and discusses possible future work.

2. Derivation of the asymptotic bias

2.1. Asymptotic bias for a model with a misclassified exposure variable only

In this subsection we consider a logistic regression model with a binary outcome, denoted by Y, and a categorical exposure subject to misclassification. Let E denote the true value of the exposure and X its observed, error-prone version; note that in practice it is often X, not E, that is observable. Suppose E has m (≥ 2) categories, labeled 1, 2, ..., m. These categories may or may not have a meaningful ordering; that is, E can be either nominal or ordinal. We define indicator variables for E, treating category 1 as the reference: for i = 2, ..., m, let I_{i-1} = 1 if E = i and zero otherwise. The logistic regression model describing the association between Y and E is

\mathrm{logit}\, P(Y = 1 \mid E) = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \cdots + \beta_{m-1} I_{m-1}. \quad (1)

Similarly, we define indicator variables for X: for i = 2, ..., m, let I^*_{i-1} = 1 if X = i and zero otherwise. The logistic regression model for (Y, X) is

\mathrm{logit}\, P(Y = 1 \mid X) = \gamma_0 + \gamma_1 I^*_1 + \gamma_2 I^*_2 + \cdots + \gamma_{m-1} I^*_{m-1}. \quad (2)

Note that the parameters γ in (2) stand for the large-sample limits of the regression coefficient estimates based on the error-prone X. The objective of our study is to find the difference between the regression parameters β and γ; that is, the bias incurred by simply treating the error-prone variable as the true one. For k = 0, 1, ..., m − 1, let Δ_k = β_k − γ_k denote the asymptotic bias in each regression coefficient. The magnitude of Δ_k indicates how harmful the consequences of misclassification would be. Before deriving the formulas for Δ_k, we introduce some further notation:

\pi_{di} = P(Y = d, E = i), \quad d = 0, 1; \; i = 1, 2, \ldots, m,
p_{di} = P(Y = d, X = i),
\theta_{d,ij} = P(X = i \mid E = j, Y = d), \quad i, j = 1, 2, \ldots, m,
\xi_{d,ab} = \pi_{da} / \pi_{db}, \quad a, b = 1, 2, \ldots, m.

Here θ_{d,ij} (i ≠ j) stand for the misclassification rates. The diagonal elements θ_{d,ii} are the correct classification rates, which are simply the sensitivity and specificity when m = 2. Obviously \sum_{i=1}^m \theta_{d,ij} = 1 for any given d and j. If X and Y are independent conditional on E, the resulting misclassification mechanism is called nondifferential; otherwise it is differential. In the nondifferential case, the subscript d in θ_{d,ij} can be suppressed because P(X = i | E = j, Y = d) = P(X = i | E = j). By the definition of conditional probability, we have

p_{di} = P(Y = d, X = i) = \sum_j P(X = i \mid E = j, Y = d)\, P(Y = d, E = j) = \sum_j \theta_{d,ij}\, \pi_{dj}. \quad (3)
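Formula (3) is a matrix–vector product per outcome class. A minimal numerical sketch, where every probability value is hypothetical and chosen only for illustration:

```python
import numpy as np

# pi[d, j] = P(Y = d, E = j+1), rows d = 0, 1 (hypothetical m = 3 table)
pi = np.array([[0.30, 0.15, 0.10],
               [0.20, 0.15, 0.10]])
assert np.isclose(pi.sum(), 1.0)

# theta[d, i, j] = P(X = i+1 | E = j+1, Y = d); each column is a distribution
theta = np.array([[[0.9, 0.1, 0.0],
                   [0.1, 0.8, 0.1],
                   [0.0, 0.1, 0.9]]] * 2)       # nondifferential in this sketch
assert np.allclose(theta.sum(axis=1), 1.0)

# Formula (3): p[d, i] = sum_j theta[d, i, j] * pi[d, j]
p = np.einsum('dij,dj->di', theta, pi)
assert np.isclose(p.sum(), 1.0)                 # still a proper joint distribution
```

Because the columns of each θ matrix sum to one, the misclassified table p remains a valid joint distribution for (Y, X).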

Note that the notation ξ_{d,ab} is introduced to simplify the expressions of the asymptotic biases derived later. Its meaning is clearer when rewritten as ξ_{d,ab} = P(E = a | Y = d) / P(E = b | Y = d), the ratio of the exposure risks of categories a and b given Y = d. When the number of categories m = 2, ξ_{d,ab} is simply the odds in favor of category a. Moreover, the ratio ξ_{1,ab}/ξ_{0,ab} can be linked to the regression coefficients. According to model (1), for any k ≥ 1, l ≥ 1, we have

e^{\beta_k} = \frac{\mathrm{odds}(Y = 1 \mid E = k+1)}{\mathrm{odds}(Y = 1 \mid E = 1)} = \frac{\xi_{1,(k+1)1}}{\xi_{0,(k+1)1}}, \quad (4)

e^{\beta_k - \beta_l} = \frac{\mathrm{odds}(Y = 1 \mid E = k+1)}{\mathrm{odds}(Y = 1 \mid E = l+1)} = \frac{\xi_{1,(k+1)(l+1)}}{\xi_{0,(k+1)(l+1)}}. \quad (5)
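Identities (4) and (5) are algebraic facts about the joint distribution of (Y, E), so they can be checked numerically for any valid π table. A quick sketch with hypothetical numbers:

```python
import numpy as np

# pi[d, j] = P(Y = d, E = j+1); any valid joint table works (these are hypothetical)
pi = np.array([[0.30, 0.15, 0.10],
               [0.20, 0.15, 0.10]])

xi = lambda d, a, b: pi[d, a - 1] / pi[d, b - 1]          # xi_{d,ab}

# beta_k implied by model (1): logit P(Y=1|E=k+1) - logit P(Y=1|E=1)
beta = lambda k: np.log(pi[1, k] / pi[0, k]) - np.log(pi[1, 0] / pi[0, 0])

for k in (1, 2):
    # identity (4): e^{beta_k} = xi_{1,(k+1)1} / xi_{0,(k+1)1}
    assert np.isclose(np.exp(beta(k)), xi(1, k + 1, 1) / xi(0, k + 1, 1))

# identity (5) with k = 2, l = 1
assert np.isclose(np.exp(beta(2) - beta(1)), xi(1, 3, 2) / xi(0, 3, 2))
```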


Taking the reciprocal of both sides of (4), we have

e^{-\beta_k} = \frac{\xi_{0,(k+1)1}}{\xi_{1,(k+1)1}}. \quad (6)

These equations will be used in Section 3 when discussing the sign of the bias. Let π_{d|c} = P(Y = d | E = c) and p_{d|c} = P(Y = d | X = c), d = 0, 1, c = 1, 2, ..., m. According to models (1) and (2), we have

\Delta_0 = \log\!\left(\frac{\pi_{1|1}}{\pi_{0|1}}\right) - \log\!\left(\frac{p_{1|1}}{p_{0|1}}\right) = \log\!\left(\frac{\pi_{11}}{\pi_{01}}\right) - \log\!\left(\frac{p_{11}}{p_{01}}\right).

By formula (3), which expresses p_{di} in terms of π_{di} and θ_{d,ij}, we have

\Delta_0 = \log\!\left(\frac{\pi_{11}}{\pi_{01}} \cdot \frac{\sum_{j=1}^m \theta_{0,1j}\,\pi_{0j}}{\sum_{j=1}^m \theta_{1,1j}\,\pi_{1j}}\right) = \log\!\left(\frac{\sum_{j=1}^m \theta_{0,1j}\,\xi_{0,j1}}{\sum_{j=1}^m \theta_{1,1j}\,\xi_{1,j1}}\right).

Similarly, we can find the expressions for Δ_c = β_c − γ_c, c = 1, ..., m − 1. Models (1) and (2) imply that logit P(Y = 1 | E = c + 1) = β_0 + β_c and logit P(Y = 1 | X = c + 1) = γ_0 + γ_c. Therefore, we have

\Delta_c = \mathrm{logit}(\pi_{1|c+1}) - \mathrm{logit}(p_{1|c+1}) - (\beta_0 - \gamma_0) = \log\!\left(\frac{\sum_{j=1}^m \theta_{0,(c+1)j}\,\xi_{0,j(c+1)}}{\sum_{j=1}^m \theta_{1,(c+1)j}\,\xi_{1,j(c+1)}}\right) - \Delta_0.
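The closed-form expressions for Δ_0 and Δ_c can be checked numerically: given the joint law of (Y, E) and the misclassification matrix, the limiting γ's are just the logits implied by the (Y, X) table from formula (3). A sketch, with all numbers hypothetical:

```python
import numpy as np

# Hypothetical m = 3 setup
pi = np.array([[0.30, 0.15, 0.10],               # pi[d, j] = P(Y = d, E = j+1)
               [0.20, 0.15, 0.10]])
theta = np.array([[[0.9, 0.1, 0.0],
                   [0.1, 0.8, 0.1],
                   [0.0, 0.1, 0.9]]] * 2)        # theta[d, i, j]; nondifferential here
p = np.einsum('dij,dj->di', theta, pi)           # formula (3)

def coefs(joint):
    """(beta_0, ..., beta_{m-1}) implied by a joint (Y, category) table."""
    logits = np.log(joint[1] / joint[0])         # logit P(Y = 1 | category)
    return np.concatenate(([logits[0]], logits[1:] - logits[0]))

beta, gamma = coefs(pi), coefs(p)
delta = beta - gamma                             # asymptotic biases Delta_0..Delta_2

# Compare with the closed-form expressions
xi = pi[:, :, None] / pi[:, None, :]             # xi[d, a, b] = pi_da / pi_db
d0 = np.log((theta[0, 0] * xi[0, :, 0]).sum() / (theta[1, 0] * xi[1, :, 0]).sum())
assert np.isclose(delta[0], d0)
for c in (1, 2):
    dc = np.log((theta[0, c] * xi[0, :, c]).sum() /
                (theta[1, c] * xi[1, :, c]).sum()) - d0
    assert np.isclose(delta[c], dc)
```

The direct limits β − γ and the θ, ξ expressions agree exactly, as the derivation promises.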

2.2. Asymptotic bias under a model with a misclassified exposure and an error-free covariate

Extending the models considered in Davidov et al. (2003) from a binary to a multicategory exposure, we now consider the model with an error-free covariate Z included. That is,

\mathrm{logit}\, P(Y = 1 \mid E, Z) = \beta_0 + \sum_{i=1}^{m-1} \beta_i I_i + \beta_m Z + \sum_{i=1}^{m-1} \beta_{i+m} I_i Z. \quad (7)

Analogously, a logistic regression model is built for (X, Z, Y):

\mathrm{logit}\, P(Y = 1 \mid X, Z) = \gamma_0 + \sum_{i=1}^{m-1} \gamma_i I^*_i + \gamma_m Z + \sum_{i=1}^{m-1} \gamma_{i+m} I^*_i Z. \quad (8)

Similar to the notation introduced in Section 2.1, we define the following for k = 0, 1:

\pi_{d|jk} = P(Y = d \mid E = j, Z = k), \quad p_{d|ik} = P(Y = d \mid X = i, Z = k),
\xi_{d,ab|k} = \frac{\pi_{da|k}}{\pi_{db|k}} = \frac{P(E = a, Y = d \mid Z = k)}{P(E = b, Y = d \mid Z = k)} = \frac{P(E = a \mid Y = d, Z = k)}{P(E = b \mid Y = d, Z = k)}.

Assume that X and Z are independent conditional on E and Y, i.e., P(X = i | E = j, Y = d, Z = k) = P(X = i | E = j, Y = d) = θ_{d,ij}. This assumption means that the misclassification in X does not depend on Z. Similar to the derivation for the model without Z in Section 2.1, we have, for t = 2, ..., m,

\Delta_0 = \log\!\left(\frac{\sum_{j=1}^m \theta_{0,1j}\,\xi_{0,j1|0}}{\sum_{j=1}^m \theta_{1,1j}\,\xi_{1,j1|0}}\right),

\Delta_{t-1} = \log\!\left(\frac{\sum_{j=1}^m \theta_{0,tj}\,\xi_{0,jt|0}}{\sum_{j=1}^m \theta_{1,tj}\,\xi_{1,jt|0}}\right) - \Delta_0,

\Delta_m = \log\!\left(\frac{\sum_{j=1}^m \theta_{0,1j}\,\xi_{0,j1|1}}{\sum_{j=1}^m \theta_{1,1j}\,\xi_{1,j1|1}}\right) - \Delta_0,

\Delta_{m+t-1} = \log\!\left(\frac{\sum_{j=1}^m \theta_{0,tj}\,\xi_{0,jt|1}}{\sum_{j=1}^m \theta_{1,tj}\,\xi_{1,jt|1}}\right) - \Delta_0 - \Delta_{t-1} - \Delta_m.

Similar to the discussion in Davidov et al. (2003), we provide approximation formulas for the biases in the Supplementary Material. There is no intuitive way to determine the direction of the above biases (especially Δ_{m+t−1}). Therefore, in the following section we focus on the model with one misclassified exposure only, as considered in Section 2.1.

3. The direction of bias under the nondifferential misclassification assumption

As indicated later in Section 4, differential misclassification is more complicated than nondifferential. Therefore, we focus on nondifferential misclassification in this section and suppress the first subscript in θ_{d,ij}, writing θ_{ij}. The asymptotic bias Δ_0 simplifies to

\Delta_0 = \log\!\left(\frac{\sum_{j=1}^m \theta_{1j}\,\xi_{0,j1}}{\sum_{j=1}^m \theta_{1j}\,\xi_{1,j1}}\right) = \log\!\left(\frac{\sum_{j=1}^m \theta_{1j}\,\xi_{1,j1} \cdot \frac{\xi_{0,j1}}{\xi_{1,j1}}}{\sum_{j=1}^m \theta_{1j}\,\xi_{1,j1}}\right).

Regard the θ_{1j} ξ_{1,j1} in the numerator as coefficients and use the denominator to normalize them as a_j := \theta_{1j}\xi_{1,j1} / \sum_{i=1}^m \theta_{1i}\xi_{1,i1}. Then e^{\Delta_0} = \sum_{j=1}^m a_j\, \xi_{0,j1}/\xi_{1,j1}, a convex combination of the ratios ξ_{0,j1}/ξ_{1,j1}. Recall from (6) that ξ_{0,j1}/ξ_{1,j1} = e^{-β_{j-1}} for j ≥ 2, and ξ_{0,11}/ξ_{1,11} = 1 = e^0. By Jensen's inequality and the convexity of e^x, we have

e^{\Delta_0} = a_1 e^0 + a_2 e^{-\beta_1} + \cdots + a_m e^{-\beta_{m-1}} \ge e^{a_1 \cdot 0 + a_2(-\beta_1) + \cdots + a_m(-\beta_{m-1})},

which gives

\Delta_0 \ge -(a_2\beta_1 + \cdots + a_m\beta_{m-1}).

To ensure Δ_0 ≥ 0, it is sufficient that the right-hand side above is nonnegative; equivalently, \sum_{j=2}^m \theta_{1j}\xi_{1,j1}\beta_{j-1} \le 0, since a_j ∝ θ_{1j}ξ_{1,j1}. To derive a sufficient condition ensuring Δ_0 ≤ 0, we use the fact that

-\Delta_0 = \log\!\left(\frac{\sum_{j=1}^m \theta_{1j}\,\xi_{1,j1}}{\sum_{j=1}^m \theta_{1j}\,\xi_{0,j1}}\right).

Analogously, we normalize θ_{1j}ξ_{0,j1} as b_j := \theta_{1j}\xi_{0,j1} / \sum_{i=1}^m \theta_{1i}\xi_{0,i1}. By the same reasoning as above, we conclude that e^{-\Delta_0} \ge e^{b_1 \cdot 0 + b_2\beta_1 + \cdots + b_m\beta_{m-1}}. Thus, to ensure Δ_0 ≤ 0 (i.e., −Δ_0 ≥ 0), it is sufficient that \sum_{j=2}^m \theta_{1j}\xi_{0,j1}\beta_{j-1} \ge 0.

Now let us study the sign of Δ_c, c = 1, ..., m − 1. Notice that



\Delta_c = \log\!\left(\frac{\sum_{j=1}^m \theta_{(c+1)j}\,\xi_{0,j(c+1)}}{\sum_{j=1}^m \theta_{(c+1)j}\,\xi_{1,j(c+1)}}\right) - \Delta_0 = \log\!\left(\frac{\sum_{j=1}^m \theta_{(c+1)j}\,\xi_{0,j(c+1)}}{\sum_{j=1}^m \theta_{(c+1)j}\,\xi_{1,j(c+1)}}\right) + \log\!\left(\frac{\sum_{j=1}^m \theta_{1j}\,\xi_{1,j1}}{\sum_{j=1}^m \theta_{1j}\,\xi_{0,j1}}\right). \quad (9)

Normalize θ_{(c+1)j}ξ_{1,j(c+1)} as \tilde{a}_j := \theta_{(c+1)j}\xi_{1,j(c+1)} / \sum_{i=1}^m \theta_{(c+1)i}\xi_{1,i(c+1)}. Recalling (4) and (5), we have \xi_{0,j(c+1)}/\xi_{1,j(c+1)} = e^{\beta_c - \beta_{j-1}} for j ≥ 2, and = e^{\beta_c} for j = 1. Similarly, Jensen's inequality gives

e^{\Delta_c} \ge \exp\{\tilde{a}_1\beta_c + \tilde{a}_2(\beta_c - \beta_1) + \cdots + \tilde{a}_m(\beta_c - \beta_{m-1}) + b_1 \cdot 0 + b_2\beta_1 + \cdots + b_m\beta_{m-1}\}.

To ensure Δ_c ≥ 0, it is sufficient that

\beta_c + b_2\beta_1 + \cdots + b_m\beta_{m-1} \ge \tilde{a}_2\beta_1 + \cdots + \tilde{a}_m\beta_{m-1}. \quad (10)
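Both Jensen bounds, the one for Δ_0 above and condition (10) for Δ_c, are easy to confirm numerically in the nondifferential case. A sketch with a hypothetical m = 3 table:

```python
import numpy as np

# Nondifferential m = 3 illustration (all numbers hypothetical)
pi = np.array([[0.30, 0.15, 0.10],              # pi[d, j] = P(Y = d, E = j+1)
               [0.20, 0.15, 0.10]])
theta = np.array([[0.9, 0.1, 0.0],              # theta[i, j], subscript d suppressed
                  [0.1, 0.8, 0.1],
                  [0.0, 0.1, 0.9]])

beta = np.log(pi[1] / pi[0]) - np.log(pi[1, 0] / pi[0, 0])   # (0, beta_1, beta_2)
xi = lambda d, c: pi[d] / pi[d, c]              # vector of xi_{d, j(c+1)} over j

# Delta_0 and its Jensen lower bound
d0 = np.log((theta[0] * xi(0, 0)).sum() / (theta[0] * xi(1, 0)).sum())
a = theta[0] * xi(1, 0) / (theta[0] * xi(1, 0)).sum()        # weights a_j
assert d0 >= -(a * beta).sum() - 1e-12

# the reverse bound for -Delta_0
b_ = theta[0] * xi(0, 0) / (theta[0] * xi(0, 0)).sum()       # weights b_j
assert -d0 >= (b_ * beta).sum() - 1e-12

# Delta_c and bound (10), for c = 1
c = 1
dc = np.log((theta[c] * xi(0, c)).sum() / (theta[c] * xi(1, c)).sum()) - d0
at = theta[c] * xi(1, c) / (theta[c] * xi(1, c)).sum()       # weights a-tilde_j
assert dc >= beta[c] - (at * beta).sum() + (b_ * beta).sum() - 1e-12

# special case discussed below: here 0 <= beta_j <= beta_c, so Delta_c >= 0
assert dc >= 0
```

In this particular table Δ_0 is negative while Δ_1 is positive, and both respect their Jensen bounds.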


Analogously, to derive a sufficient condition for Δ_c ≤ 0, we rewrite (9) as

\Delta_c = -\left[\log\!\left(\frac{\sum_{j=1}^m \theta_{(c+1)j}\,\xi_{1,j(c+1)}}{\sum_{j=1}^m \theta_{(c+1)j}\,\xi_{0,j(c+1)}}\right) + \log\!\left(\frac{\sum_{j=1}^m \theta_{1j}\,\xi_{0,j1}}{\sum_{j=1}^m \theta_{1j}\,\xi_{1,j1}}\right)\right].

Clearly, if \beta_c + a_2\beta_1 + \cdots + a_m\beta_{m-1} \le \tilde{b}_2\beta_1 + \cdots + \tilde{b}_m\beta_{m-1}, where \tilde{b}_j := \theta_{(c+1)j}\xi_{0,j(c+1)} / \sum_{i=1}^m \theta_{(c+1)i}\xi_{0,i(c+1)}, then Δ_c ≤ 0.

In the special case 0 ≤ β_j ≤ β_c, j = 1, 2, ..., m − 1, we have

\tilde{a}_2\beta_1 + \cdots + \tilde{a}_m\beta_{m-1} \le (\tilde{a}_2 + \cdots + \tilde{a}_m)\beta_c \le \beta_c - \tilde{a}_1\beta_c \le \beta_c + b_2\beta_1 + \cdots + b_m\beta_{m-1}.

Note that the second inequality is due to the fact that \sum_{i=1}^m \tilde{a}_i = 1, and the third holds because β_j ≥ 0, b_j ≥ 0 for all j ≥ 1, and \tilde{a}_1\beta_c ≥ 0. According to (10), we have Δ_c ≥ 0. Otherwise, if β_c ≤ β_j ≤ 0, j = 1, 2, ..., m − 1, it can be shown

Δ_c ≤ 0 by analogous reasoning applied to \tilde{b}_2\beta_1 + \cdots + \tilde{b}_m\beta_{m-1}. In other words, if category c (> 1) is the most (least) risky category associated with the disease (Y = 1) while the reference category is the least (most) risky one, the estimate of its risk is deflated (inflated).

4. A numerical example

The asymptotic bias formulas derived in Section 2 are complex, so few insights can be perceived from them directly. In this section, we use a numerical example to gain more insight. We consider a model with both a misclassified exposure and an error-free covariate. The data (Y, X, E, Z) are generated from the following decomposition: the conditionals X | Y, E, Z and Y | E, Z, and the marginal of (E, Z). Here E has three categories and Z is binary:

P(X = l \mid E = j, Y = d, Z) = P(X = l \mid E = j, Y = d) = \theta_{d,lj},
\mathrm{logit}\, P(Y = 1 \mid E, Z) = \beta_0 + \beta_1 I_1 + \beta_2 I_2 + \beta_3 Z + \beta_4 I_1 Z + \beta_5 I_2 Z,
P(E = t, Z = s) = \pi_{E,Z}(t, s).

The parameter values used to generate the data are set as follows:

\beta_0 = 0.1, \quad \beta_1 = 0.5, \quad \beta_2 = 0.25, \quad \beta_3 = 0.5, \quad \beta_4 = 0.2, \quad \beta_5 = 0.2,
\pi_{E,Z}(t, s) = 0.1 + 0.1|t - 2|, \quad t = 1, 2, 3; \; s = 1, 2.
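The adjacent-category misclassification scheme described next leaves the boundary columns ambiguous: with three categories, category 1 has only one neighbour. The helper below resolves this by giving a single neighbour the whole off-diagonal mass so that each column still sums to one; that boundary convention is our assumption, not a detail stated in the paper.

```python
import numpy as np

def adjacent_theta(m, diag):
    """theta[i, j] = P(X = i+1 | E = j+1): mass 1 - diag split equally among the
    categories adjacent to the true one (assumption: a lone boundary neighbour
    receives all of the off-diagonal mass, keeping columns stochastic)."""
    th = np.zeros((m, m))
    for j in range(m):
        nbrs = [i for i in (j - 1, j + 1) if 0 <= i < m]
        th[j, j] = diag
        for i in nbrs:
            th[i, j] = (1.0 - diag) / len(nbrs)
    return th

# Case 1 below: theta_ii = 0.95
th = adjacent_theta(3, 0.95)
assert np.allclose(th.sum(axis=0), 1.0)   # each column is a distribution
```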

For easy manipulation of the misclassification rates, we assume that misclassification occurs only between adjacent categories and that the two rates are equal, that is, P(X = j | E = i, Y = d) = (1 − θ_{d,ii})/2, j = i ± 1. The diagonal elements θ_{d,ii} are the correct classification rates, so their values indicate the amount of misclassification error. For the nondifferential case, we consider three cases. Case 1: θ_{ii} = 0.95; Case 2: θ_{ii} = 0.80; Case 3: θ_{ii} = 0.65. For differential misclassification, we consider three further cases. Case 4: θ_{0,ii} = 0.95, θ_{1,ii} = 0.80; Case 5: θ_{0,ii} = 0.80, θ_{1,ii} = 0.70; Case 6: θ_{0,ii} = 0.70, θ_{1,ii} = 0.60. For the three nondifferential cases, we find that the magnitude of the bias increases as the misclassification rates increase. There is no such simple trend in the differential cases, so the behavior of the bias is more complicated under differential misclassification. For details of the numerical results, please see the Supplementary Material.

5. Discussion

We have derived formulas for the asymptotic biases in the regression coefficients when misclassification in a multicategory exposure is simply ignored in a logistic regression model. Unlike the binary exposure considered in Davidov et al. (2003), our work focuses on an error-prone multicategory exposure. In the model with a misclassified exposure only, the expressions for the asymptotic biases can be simplified, which leads to an interesting discussion of their signs. Conclusions on the direction of the biases (positive or negative) tell us when inference based on the error-prone variable tends to overstate (understate) the actual risk associated with a certain exposure. In practice, it is often impossible to obtain a large sample of validation data. Nonetheless, when some empirical knowledge about the misclassification rates is available, the derived bias formulas can indicate how severe the consequences of ignoring the misclassification would be.

There are a couple of issues to be studied further. First, more investigation of the direction of the asymptotic biases under differential misclassification is required; as shown in Section 3, we have established the direction of the biases under the nondifferential assumption for a model with an exposure variable only.


Second, the evaluation of the bias in the regression parameter estimates for finite samples may be of interest. When performing the estimation, one cautionary note concerns the constraints on the misclassification parameters. For a binary exposure, it is well known that a constraint on the misclassification parameters is necessary to ensure model identifiability, namely sensitivity > 1 − specificity (Gustafson, 2003). For a multicategory covariate, the constraints on θ_{d,ij} are certainly more complex and may complicate the search for the maximum likelihood estimate. If a priori knowledge about the misclassification parameters is available, one may consider a Bayesian approach.

Acknowledgments

We would like to thank the two anonymous reviewers for their valuable comments, which greatly improved the manuscript. The second author acknowledges an NSERC Canada Discovery Grant and the New Faculty Graduate Student Support Program of the University of Saskatchewan. The third author acknowledges the Center of Statistical Science and LMEQF of Peking University.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.spl.2013.08.014.

References

Carroll, R.J., Ruppert, D., Stefanski, L.A., Crainiceanu, C.M., 2006. Measurement Error in Nonlinear Models: A Modern Perspective, second ed. Chapman & Hall/CRC.
Davidov, O., Faraggi, D., Reiser, B., 2003. Misclassification in logistic regression with discrete covariates. Biometrical Journal 45, 541–553.
Greenland, S., 1988. Variance estimation for epidemiologic effect estimates under misclassification. Statistics in Medicine 7 (7), 745–757.
Gustafson, P., 2003. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman & Hall/CRC.
Prescott, G.J., Garthwaite, P.H., 2005. A simple Bayesian analysis of misclassified binary data with a validation substudy. Biometrics 58 (2), 454–458.
Reade-Christopher, S.J., Kupper, L.L., 1991. Effects of exposure misclassification on regression analyses of epidemiologic follow-up study data. Biometrics 47 (2), 535–548.
Ruiz, M., Girón, F.J., Pérez, C.J., Martin, J., Rojano, C., 2008. A Bayesian model for multinomial sampling with misclassified data. Journal of Applied Statistics 35 (4), 369–382.
Viana, M.A.G., 1994. Bayesian small-sample estimation of misclassified multinomial data. Biometrics 50 (1), 237–243.