CHAPTER 6

Residual and influence diagnostics
CHAPTER OUTLINE
6.1 Residual Diagnostics
    6.1.1 Types of Residuals in Linear Regression Models
    6.1.2 Semivariogram in Random Intercept Linear Models
    6.1.3 Semivariogram in the Linear Random Coefficient Model
6.2 Influence Diagnostics
    6.2.1 Cook's D and Related Influence Diagnostics
    6.2.2 Leverage
    6.2.3 DFFITS, MDFFITS, COVTRACE, and COVRATIO Statistics
    6.2.4 Likelihood Displacement Statistic Approximation
    6.2.5 LMAX Statistic for Identification of Influential Observations
6.3 Empirical Illustrations on Influence Diagnostics
    6.3.1 Influence Checks on Linear Mixed Model Concerning Effectiveness of Acupuncture Treatment on PCL Score
    6.3.2 Influence Diagnostics on Linear Mixed Model Concerning Marital Status and Disability Severity Among Older Americans
6.4 Summary

One of the primary objectives in creating a linear mixed model is to describe the time trend of a response measurement on the continuous scale, either by experimental units (such as treatments) or by population groups (such as currently married vs. currently not married). Given within-subject dependence in longitudinal data, specification of the random effects or of a residual covariance structure is required to account for intraindividual correlation, even though those parameters are usually not of direct interest. In the estimation of the fixed and the random effects, some aberrant observations can have an unduly large influence on the fit of a linear mixed model, thereby affecting the quality of parameter estimates. Therefore, once a linear mixed model is fitted with longitudinal data, regression diagnostics are necessary to verify whether the statistical model fits the data appropriately and meets various assumptions on model specification. In general linear models, assessments of model adequacy regularly focus on checking linearity, normality, homogeneity of variance, and independence of errors. With respect to longitudinal data analysis, regression diagnostics are more complex because an individual's response is measured repeatedly over time, so the results of regression diagnostics may vary over different time points. The general principles of statistical modeling, however, are universal for all regression models, and linear mixed modeling is no exception. Standard diagnostic techniques, such as examining the deviation of the observed value from the expected, assessing the overall model fit, and identifying influential observations, also need to be performed in longitudinal data analysis. In this chapter, statistical methods of residual diagnostics are described first, starting with an introduction to the different types of residuals applied in linear mixed models. Next, the statistical methods for influence diagnostics are presented. I then provide an empirical illustration to display the application of various regression diagnostics in linear mixed models, based on the same longitudinal data previously used. Lastly, I summarize the chapter with comments on the regression diagnostics applied in linear mixed models.

Methods and Applications of Longitudinal Data Analysis. http://dx.doi.org/10.1016/B978-0-12-801342-7.00006-X
Copyright © 2016 Higher Education Press. Published by Elsevier Inc. All rights reserved.

6.1  RESIDUAL DIAGNOSTICS

In this section, a variety of residual types applied in linear mixed models are introduced. Next, I delineate a popular residual diagnostic technique in longitudinal data analysis, referred to as the semivariogram, which is used to check whether serial correlation among repeated measurements for the same subject is present given the specification of the fixed effects and the covariance parameters. This method can be applied in both the random intercept and the random coefficient models.

6.1.1  TYPES OF RESIDUALS IN LINEAR REGRESSION MODELS

In regression modeling, residuals are closely related to random errors; the two concepts, however, differ distinctly. While the random error, or disturbance, of an observed value is defined as the deviation of the observed value from the (unobservable) true function value, the residual of an observed value is the difference between the observed value and the estimated function value. A regression model, based on various assumptions and with certain properties imposed on the random errors, yields parameter estimates and model-based predictions from a statistical procedure, which, in turn, yield residuals, not random errors. Let the predicted mean response for subject i at time point j be $\hat{\mu}_{ij} = X_{ij}'\hat{\beta}$. An $n_i$-dimensional vector of residuals for each subject can then be derived from a linear mixed model, given by

$$ r_{mi} = Y_i - X_i'\hat{\beta}, \qquad (6.1) $$

where $r_{mi}$, briefly discussed in Chapter 4, is referred to as the vector of marginal residuals because $X_i'\hat{\beta}$ is the marginal mean vector in linear mixed models. If the random-effects component is considered, residuals for each subject become conditional on the random effects, written as

$$ r_{ci} = Y_i - X_i'\hat{\beta} - Z_i\hat{b}_i = r_{mi} - Z_i\hat{b}_i, \qquad (6.2) $$

where $r_{ci}$ represents the conditional residuals.


Given the specifications of linear mixed models, the variance–covariance matrix of marginal residuals is

$$ \mathrm{var}(\hat{r}_{mi}) = \hat{V}_i - X_i\left(X_i'\hat{V}_i^{-1}X_i\right)^{-1}X_i', \qquad (6.3) $$

where $\hat{V}_i$ is the estimated total variance of $Y_i$, obtained from either the maximum likelihood (ML) or the restricted maximum likelihood (REML) estimator. The variance–covariance matrix of conditional residuals can be specified according to Gregoire et al. (1995) as follows:

$$ \mathrm{var}(\hat{r}_{ci}) = \left(I_{n_i} - Z_i\hat{G}Z_i'\hat{V}_i^{-1}\right)\mathrm{var}(\hat{r}_{mi})\left(I_{n_i} - Z_i\hat{G}Z_i'\hat{V}_i^{-1}\right)'. \qquad (6.4) $$
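As a concrete illustration of Eqs. (6.1)-(6.4), the following is a minimal numpy sketch (not from the book; the design matrices, parameter values, and variable names are all illustrative, and the fixed effects are treated as known rather than estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
n_i = 5                                     # repeated measures for subject i
X_i = np.column_stack([np.ones(n_i), np.arange(n_i)])  # intercept + time
Z_i = np.ones((n_i, 1))                     # random-intercept design
G_hat = np.array([[2.0]])                   # assumed random-effect variance
sigma2_eps = 1.0                            # assumed residual variance

# Total variance of Y_i under the random-intercept model
V_i = Z_i @ G_hat @ Z_i.T + sigma2_eps * np.eye(n_i)

beta_hat = np.array([1.0, 0.5])             # treated as the estimated fixed effects
Y_i = X_i @ beta_hat + rng.multivariate_normal(np.zeros(n_i), V_i)

r_m = Y_i - X_i @ beta_hat                  # marginal residuals, Eq. (6.1)
b_hat = G_hat @ Z_i.T @ np.linalg.solve(V_i, r_m)      # empirical BLUP of b_i
r_c = r_m - Z_i @ b_hat                     # conditional residuals, Eq. (6.2)

# var(r_hat_mi) as in Eq. (6.3), here with a single subject so X = X_i
var_rm = V_i - X_i @ np.linalg.inv(X_i.T @ np.linalg.solve(V_i, X_i)) @ X_i.T
```

In this random-intercept setting the BLUP shrinks the subject's mean residual toward zero: the conditional residuals average exactly $\sigma_\varepsilon^2/(n_i\sigma_b^2 + \sigma_\varepsilon^2)$ times the marginal average.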

In linear regression models, the distributions of residuals at different data points may not necessarily follow an expected pattern, even if the random errors from the true function value are identically distributed. For example, in the conditional residuals, the random-effect estimate $\hat{b}_i$, obtained from the empirical Bayes approximation, depends heavily on the normality assumption and is also influenced by the assumed covariance structure $\hat{V}_i$. The elements in $\hat{b}_i$ are the empirical best linear unbiased predictors (BLUPs), thereby being shrunk toward zero, the population mean of the random effects. As a result, the distribution of the BLUPs does not accurately represent the distribution of the true random effects. In the application of linear mixed models, residuals often need to be adjusted by their expected variability for comparative purposes at different time points. A popular approach for such an adjustment is to standardize the raw residuals by the estimated residual variance, referred to as studentizing in statistical analysis, given by

$$ r_{mi}^{\mathrm{student}} = \frac{r_{mi}}{\sqrt{\mathrm{var}(r_{mi})}}, \qquad (6.5\mathrm{a}) $$

$$ r_{ci}^{\mathrm{student}} = \frac{r_{ci}}{\sqrt{\mathrm{var}(r_{ci})}}, \qquad (6.5\mathrm{b}) $$

where $r_{mi}^{\mathrm{student}}$ is the vector of studentized marginal residuals for subject i and $r_{ci}^{\mathrm{student}}$ is the corresponding vector of studentized conditional residuals. Standardization of residuals is particularly important in longitudinal data with distinctive outliers. Some scientists suggest external studentization of residuals, in which the residual variance is estimated with the subject in question excluded. Raw residuals can also be scaled by the estimate of the variance of the observed response y, referred to as Pearson-type residuals, given by

$$ r_{mi}^{\mathrm{pearson}} = \frac{r_{mi}}{\sqrt{\mathrm{var}(y_i)}}, \qquad (6.6\mathrm{a}) $$

$$ r_{ci}^{\mathrm{pearson}} = \frac{r_{ci}}{\sqrt{\mathrm{var}(y_i)}}. \qquad (6.6\mathrm{b}) $$
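The element-wise scaling in Eqs. (6.5a) and (6.6a) divides each residual by the square root of the corresponding diagonal element of its variance matrix. A minimal sketch with simulated inputs (all names illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(1)
n_i = 4
X_i = np.column_stack([np.ones(n_i), np.arange(n_i)])
V_i = 2.0 * np.ones((n_i, n_i)) + np.eye(n_i)      # assumed var(Y_i)
r_m = rng.multivariate_normal(np.zeros(n_i), V_i)  # simulated marginal residuals

# var(r_hat_mi) from Eq. (6.3)
var_rm = V_i - X_i @ np.linalg.inv(X_i.T @ np.linalg.solve(V_i, X_i)) @ X_i.T

r_student = r_m / np.sqrt(np.diag(var_rm))   # studentized residuals, Eq. (6.5a)
r_pearson = r_m / np.sqrt(np.diag(V_i))      # Pearson-type residuals, Eq. (6.6a)
```

Because $\mathrm{var}(\hat r_{mi})$ subtracts a positive semidefinite term from $\hat V_i$, the studentized residuals are never smaller in absolute value than the Pearson-type residuals.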


A more refined type of standardized residual is the scaled residual. Construction of this type of residual is based on the argument that, given the specification of the random effects or the inclusion of a selected residual covariance structure in linear mixed models, residuals should behave as pure measurement error; therefore, they should reflect variability that is not explained by the specified random effects or the covariance parameters (Verbeke et al., 1998). Thus, it is necessary to eliminate all sources of correlation by appropriate scaling before checking residuals. The classical approach in this regard is to apply the Cholesky decomposition to generate transformed residuals that have constant variance and zero correlation. The application of the Cholesky decomposition to the variance–covariance matrix starts with the construction of a lower triangular matrix for each subject, denoted by $\tilde{C}_i$, which satisfies the condition

$$ \hat{V}_i = \tilde{C}_i\tilde{C}_i', $$

where $\tilde{C}_i$ represents the Cholesky root of $\hat{V}_i$, with $\tilde{C}_i^{-1}Y_i$ having constant variance and zero correlation. Given these properties, the $\tilde{C}_i$ matrix can be used to transform the correlated residuals into correlation-free transformed residuals. For the marginal distribution of longitudinal data, the scaled residuals, denoted by $R_{mi}$, are defined as

$$ R_{mi} = \tilde{C}_i^{-1}r_{mi} = \tilde{C}_i^{-1}\left(Y_i - X_i'\hat{\beta}\right), \qquad (6.7) $$

which have unit variance and zero correlation. For practical purposes, the scaled residuals can be plotted against the transformed predictions of Y, denoted by $\hat{Y}_i^*$ and defined as $\hat{Y}_i^* = \tilde{C}_i^{-1}X_i'\hat{\beta}$. If a linear mixed model is correctly specified, the plot of the transformed residuals against $\hat{Y}_i^*$ should be scattered randomly around zero with a constant range of variation. In contrast, if this scatter-plot displays a systematic trend, residuals remain correlated even with the specified random effects and/or a selected residual covariance structure. In the latter case, more random-effect terms need to be considered in the specification of the linear mixed model and in statistical inference. Sometimes the researcher might want to fit a smooth curve to the scatter-plot of the absolute values of the transformed residuals. If the linear mixed model is adequately specified, the fitted line should not display distinctive systematic departures from a horizontal line centered at about 0.8, the expected absolute value of a standard normal deviate (Fitzmaurice et al., 2004). The transformed residuals can also be used to identify skewness, detect potential outliers, and assess the normality hypothesis.
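The Cholesky scaling behind Eq. (6.7) can be sketched numerically as follows (an assumed AR(1)-type covariance matrix; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_i = 6
rho = 1.0 / 1.5
V_i = 2.0 * rho ** np.abs(np.subtract.outer(np.arange(n_i), np.arange(n_i)))

C_i = np.linalg.cholesky(V_i)       # lower triangular Cholesky root: V_i = C_i C_i'
r_m = rng.multivariate_normal(np.zeros(n_i), V_i)  # correlated residuals
R_m = np.linalg.solve(C_i, r_m)     # scaled residuals, Eq. (6.7)

# Scaling removes all correlation: cov(C_i^{-1} r) = C_i^{-1} V_i C_i^{-T} = I
cov_scaled = np.linalg.solve(C_i, V_i) @ np.linalg.inv(C_i).T
```

The covariance of the scaled residuals reduces exactly to the identity matrix, which is the "unit variance and zero correlation" property used throughout this section.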

6.1.2  SEMIVARIOGRAM IN RANDOM INTERCEPT LINEAR MODELS

Based on the random intercept linear model, Diggle (1988) introduces the approach of the empirical semivariogram to assess residuals in longitudinal data analysis. This method is designed to check serial correlation in residuals, conditionally on the random components already included in the model. Construction of this residual diagnostic method begins with the specification of a general linear model with an error term that is assumed to be independently distributed as multivariate normal, referred to as the generalized multivariate linear model. Let $Y_i = (Y_{i1},\ldots,Y_{in_i})'$ be an $n_i$-dimensional vector of repeated measurements for subject i and $e_i = (e_{i1},\ldots,e_{in_i})'$ be an $n_i \times 1$ column vector of assumed errors. A generalized multivariate linear regression model is then written as

$$ Y_i = X_i'\beta + e_i, \qquad (6.8) $$

where $X_i$ is a known $n_i \times M$ matrix of covariates and $\beta$ is an $M \times 1$ vector of unknown population parameters. Given the correlation between serial observations for the same subject, $e_i$ is assumed to follow a multivariate normal distribution with mean 0 and $n_i \times n_i$ covariance matrix $V_i$. Correspondingly, $Y_i$ can be expressed as $Y_i \sim \mathrm{MVN}(X_i'\beta, V_i)$. As summarized by Diggle (1988) and Diggle et al. (2002), an appropriate $V_i$ should accommodate at least three different sources of random variation. First, average responses usually vary randomly between subjects, with some subjects being intrinsically high and some being low. Second, a subject's observed measurement profile may be a response to time-varying stochastic processes. Third, as the individual measurements involve some kind of subsampling within subjects, the measurement process adds a component of variation to the data. This perspective on the decomposition of variability is briefly mentioned in Chapter 1. The above decomposition of stochastic variations in repeated measurements of the response facilitates the formulation of correlation between pairs of measurements for the same subject. A generalized multivariate linear model, specifying all three features in $V_i$, can be written as

$$ Y_{ij} = X_{ij}'\beta + b_i + W_i\left(T_{ij}\right) + \varepsilon_{ij}, \qquad (6.9) $$

where $X_{ij}'\beta$ represents the model-based mean response for subject i at time point j, $b_i$ indicates the i.i.d. variation in average response between subjects with mean 0 and variance $\sigma_b^2$, and $\varepsilon_{ij}$ represents the subsampling variation within the subject with mean 0 and variance $\sigma_\varepsilon^2$. The term $W_i(T_{ij})$ reflects an independent stationary Gaussian process with zero expectation and covariance function $\sigma^2\rho\left(\left|T_{j'} - T_j\right|\right)$. Clearly, the terms $b_i$, $W_i(T_{ij})$, and $\varepsilon_{ij}$ correspond to the random intercept, autoregressive correlation, and within-subject random error, respectively, as described previously. Given the specification of Equation (6.9), the variance matrix for subject i can be specified as

$$ V_i = \sigma_b^2 J + \sigma^2 R\left(T_i\right) + \sigma_\varepsilon^2 I, \qquad (6.10) $$

where J is a square matrix with all its elements being unity, $T_i = (T_{i1},\ldots,T_{in_i})'$ is the time vector, $R(T_i)$ is a symmetric matrix with element $\rho\left(\left|T_{j'} - T_j\right|\right)$, and I is the identity matrix. As indicated in Chapter 5, the form of $\rho\left(\left|T_{j'} - T_j\right|\right)$ can be parameterized differently in different situations. When $\sigma^2 = 0$, the above generalized multivariate linear model reduces to the uniform correlation model (Diggle, 1988). If $\sigma_b^2$ is also equal to zero, the model reduces further to a typical general linear model with independent random errors. Such cases, however, are empirically very rare in longitudinal data. Let $T \in \mathcal{T}$ be a continuous time variable. Then the semivariogram of a random process $\{Y(T): T \in \mathcal{T}\}$ is defined as the function

$$ g(u) = \frac{1}{2}E\left[\left\{Y(T) - Y(T-u)\right\}^2\right], \quad u \ge 0, \qquad (6.11) $$

which is assumed to be independent of T. Given this specification, the empirical semivariogram of a time series or a time trend is specified as a scatter-plot of the squared differences

$$ d_{jj'}^2 = \frac{1}{2}\left[y\left(T_j\right) - y\left(T_{j'}\right)\right]^2 $$

plotted against corresponding quantities such as the time separations. Empirically, the marginal residuals, denoted by $r_{mi}$, can be used to generate the empirical semivariogram. With $r_{ij} = Y_{ij} - X_{ij}'\hat{\beta}$, the empirical variogram can be written as

$$ \frac{1}{2}E\left[\left(r_{ij} - r_{ij'}\right)^2\right] = \sigma_\varepsilon^2 + \sigma^2\left[1 - \rho\left(\left|T_{ij} - T_{ij'}\right|\right)\right]. \qquad (6.12) $$

The empirical semivariogram is originally a spatial-data technique. The rationale of this method is that, given the specification of the random intercepts across subjects, the empirical semivariogram displays ordinary-least-squares (OLS)-type residuals with the term for correlation in repeated measurements removed. Consequently, if the random intercept model is correctly assumed, the errors should follow a constant pattern, and, correspondingly, the empirical semivariogram should be scattered randomly rather than systematically. Equation (6.12) also implies that the process variance can be estimated by half the average squared difference between pairs of observations from different subjects. As a nonparametric diagnostic method, the empirical semivariogram provides a useful graphical check on the adequacy of a specific covariance structure in R, provided that there are no random effects except a random intercept term. This approach is particularly useful for checking residuals in longitudinal data with unequal time intervals.
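To make Eq. (6.12) concrete, here is a small simulation sketch (not from the book; all names and parameter values are illustrative). Data come from a random-intercept model with no serial correlation ($\sigma^2 = 0$), so the semivariogram of within-subject residual differences should be roughly flat at $\sigma_\varepsilon^2$ for every lag:

```python
import numpy as np

rng = np.random.default_rng(3)
n_subj, n_time = 500, 6
sigma_b, sigma_eps = 1.5, 1.0
t = np.arange(n_time, dtype=float)

b = rng.normal(0.0, sigma_b, size=(n_subj, 1))           # random intercepts
eps = rng.normal(0.0, sigma_eps, size=(n_subj, n_time))  # measurement error
y = b + eps                                              # response (zero mean curve)
r = y - y.mean(axis=0)                                   # OLS-type marginal residuals

# Half the mean squared within-subject difference, grouped by time lag
lags, gamma = [], []
for j in range(n_time):
    for k in range(j + 1, n_time):
        lags.append(t[k] - t[j])
        gamma.append(0.5 * np.mean((r[:, j] - r[:, k]) ** 2))
# The random intercept cancels within subjects, so gamma stays near sigma_eps**2
```

A scatter-plot of `gamma` against `lags` is exactly the empirical semivariogram described above; a systematic trend in that plot would signal unmodeled serial correlation.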

6.1.3  SEMIVARIOGRAM IN THE LINEAR RANDOM COEFFICIENT MODEL

The specification of the generalized multivariate linear model provides a flexible framework for checking residuals in modeling normal longitudinal data. The development of this approach is based on the thought that the effect of serial correlation is dominated by the combination of the random intercept and random errors, and that inclusion of more random-effect terms is therefore unnecessary. More recently, some more flexible formulations of covariance structures have been proposed to handle complex longitudinal data patterns (Chi and Reinsel, 1989; Verbeke et al., 1998). The majority of these methods introduce an additional random term in the expression of total errors, containing both the random coefficients across subjects and serial correlation in within-subject random errors, written as

$$ Y_i = X_i'\beta + Z_i'b_i + e_i, \qquad (6.13) $$

where $e_i$ is assumed to follow a multivariate normal distribution with mean 0 and $n_i \times n_i$ covariance matrix $R_i$. As indicated earlier, in the standard linear random coefficient model, $\mathrm{cov}(e_i) = \sigma^2 I$ is usually assumed given the specified covariates and the random effects. In performing residual diagnostics, however, specifying $\mathrm{cov}(e_i) = R_i$ as multivariate is considered essential to check whether residuals remain correlated after the random coefficients are specified. Chi and Reinsel (1989) propose a score test to examine the random coefficient model with $\mathrm{cov}(e_i) = \sigma^2 I$ against the random coefficient multivariate model with AR(1) errors for $e_i$. This approach provides a simple check for possible autocorrelation in residuals, which, in linear mixed models, are generally assumed to be conditionally independent. It is found from the score test that the specification of the random effects generally accounts for the serial correlation among repeated measurements, and that a random coefficient model adding a serial correlation term therefore overparameterizes the covariance structure. In the framework of the random coefficient multivariate model, the error structure can be decomposed more flexibly (Verbeke et al., 1998), given by

$$ e_i = Z_i'b_i + \varepsilon_{1i} + \varepsilon_{2i}, \qquad (6.14) $$

where $\varepsilon_{1i}$ is the $n_i \times 1$ residual vector indicating time-varying stochastic processes operating within the subject (serial correlation), assumed to be normally distributed with mean zero and covariance matrix $\tilde{H}_i$. The covariance matrix $\tilde{H}_i$ has element $\tilde{h}_{ijj'}$ taking the form $\tau^2 g\left(\left|T_{ij} - T_{ij'}\right|\right)$, where $\tau^2$ is the profiled variance for all elements of $\varepsilon_{1i}$ and g(.) is an unknown positive decreasing function. The vector $\varepsilon_{2i}$ represents random errors assumed to be independent and identically distributed with zero expectation. The q-dimensional vector $b_i$, the subject-specific random effects, is assumed to follow a multivariate normal distribution with mean zero and covariance matrix G, as regularly specified. Given Equation (6.14), the response vector $y_i$ marginally follows a normal distribution with mean vector $X_i'\beta$ and covariance matrix

$$ V_i = Z_i G Z_i' + \tilde{H}_i + \sigma_\varepsilon^2 I_{n_i}, \qquad (6.15) $$

where $I_{n_i}$ is the $n_i$-dimensional identity matrix. This marginal random coefficient multivariate model can be estimated by using either the maximum likelihood (ML) or the REML approach. As discussed by Verbeke et al. (1998), this regression model can be used to check whether a classical linear mixed model sufficiently accounts for intraindividual correlation without considering an additional residual covariance structure. Empirically, one might compare the analytic and graphical results from different regression models, such as the ordinary least squares, the random intercept, the random coefficient, and the random coefficient plus a specific residual covariance structure. If both the OLS and the random-intercept residuals are found to deviate markedly from normality, serial correlation needs to be further specified and some regression coefficients should be considered random across subjects. Likewise, if residuals from a random coefficient model still deviate notably from a normal distribution, the researcher might want to add an appropriate residual covariance matrix to the linear random coefficient model to yield efficient and consistent parameter estimates. The latter case, however, rarely occurs in longitudinal data analysis. With the inclusion of the random coefficients, the semivariogram based on the random intercept model is extended to the random coefficient perspective. As the covariance structure specified in Equation (6.15) has been found to be dominated by its first component $Z_i G Z_i'$, it is necessary to remove all variability explained by the random effects $b_i$ before proceeding with the serial correlation check. The scaled residuals, denoted by $R_{mi}$ and described in Section 6.1.1, can be used for this removal. These standardized residuals are independent of any distributional assumptions on $b_i$, and therefore their computation does not require an estimate of the random-effects covariance matrix G (Verbeke et al., 1998). Because the scaled residuals $R_{mi}$ have unit variance and zero correlation, the semivariogram based on the linear random coefficient model can be written as

$$ \frac{1}{2}E\left[\left(R_{ij} - R_{ij'}\right)^2\right] = \frac{1}{2}\mathrm{var}\left(R_{ij}\right) + \frac{1}{2}\mathrm{var}\left(R_{ij'}\right) - \mathrm{cov}\left(R_{ij}, R_{ij'}\right) = \frac{1}{2} + \frac{1}{2} - 0 = 1. \qquad (6.16) $$

Equation (6.16) indicates that if a random coefficient linear model is correctly specified, the plot of the semivariogram of the transformed residuals against time should be scattered randomly around unity without displaying any systematic time trend. With this property, the semivariogram can be applied to check whether the specified random effects explain all serial correlation between repeated measurements. In performing this residual diagnostic method, the conditional residuals, defined as $Y_i - X_i'\hat{\beta} - Z_i\hat{b}_i$, are not recommended for use because the BLUP $\hat{b}_i$ from the empirical Bayes approximation depends heavily on the normality assumption and is also influenced by the assumed covariance structure $\hat{V}_i$. The semivariogram, based on either the random intercept or the random coefficient model, is usually applied to spatial data and to longitudinal data in which the time intervals are unequally spaced. For highly unbalanced longitudinal data, a smooth curve may be fitted to the scatter-plot of the empirical semivariogram of the transformed residuals, which should be centered at unity and display no systematic time trend. Because the empirical semivariogram is sensitive to outliers, influence diagnostics need to be performed before fitting a smooth curve to the scatter-plot of the empirical semivariogram. Large-scale longitudinal data are needed to conduct this residual diagnostic technique.
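The flat-at-unity property in Eq. (6.16) can be sketched with a quick simulation (illustrative names; the true $V_i$ is used for scaling here, whereas in practice an estimate $\hat{V}_i$ would be):

```python
import numpy as np

rng = np.random.default_rng(4)
n_subj, n_time = 400, 5
sigma_b2, sigma_eps2 = 2.0, 1.0
Z = np.ones((n_time, 1))
V = sigma_b2 * (Z @ Z.T) + sigma_eps2 * np.eye(n_time)  # random-intercept V_i

C = np.linalg.cholesky(V)
r = rng.multivariate_normal(np.zeros(n_time), V, size=n_subj)  # marginal residuals
R = np.linalg.solve(C, r.T).T                                  # scaled residuals, Eq. (6.7)

# Semivariogram of scaled residuals: should hover around one, per Eq. (6.16)
gammas = [0.5 * np.mean((R[:, j] - R[:, k]) ** 2)
          for j in range(n_time) for k in range(j + 1, n_time)]
```

A systematic departure of `gammas` from one would indicate residual serial correlation not absorbed by the specified random effects.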

6.2  INFLUENCE DIAGNOSTICS

Another important area in regression diagnostics is the identification of influential observations. For various regression models, it is essential to identify particular observations that have extraordinary influences on the analytic results. Identification of influential observations in linear mixed models differs slightly from the classical approaches applied in general linear models. Most significantly, diagnostics on longitudinal data involve individuals with multiple data points, rather than a single observation per individual. Consequently, removal of one individual can affect a series of observations, thereby magnifying the case's influence on parameter estimates, both the fixed effects and the random components. Therefore, more refined techniques are sometimes required to identify influential observations for a linear mixed model. In most situations, however, influence diagnostics applied in longitudinal data analysis follow the same standard perspectives as those used in general linear models. This section describes a variety of basic diagnostic techniques for identifying influential observations in linear mixed models, including Cook's distance statistic, leverage, the DFFITS score, the MDFFITS statistic, COVTRACE, COVRATIO, the likelihood displacement (LD) statistic and its approximations, and the LMAX standardized statistic. An illustration is provided to check whether there are any influential observations in fitting the two linear mixed models described in the preceding three chapters.

6.2.1  COOK'S D AND RELATED INFLUENCE DIAGNOSTICS

In fitting a linear mixed model, some observations may have an undue impact on the inferential process used to derive parameter estimates, as frequently encountered in fitting other types of regression models. Traditionally, these influential cases can be identified by the change in the estimated regression coefficients after deleting each observation in sequence (Cook, 1977, 1979). Let $\hat{\beta}$ be the estimate of $\beta$ that maximizes the log-likelihood or the log restricted likelihood function and $\hat{\beta}_{(-i)}$ be the same estimate of $\beta$ when subject i is eliminated from the estimating process. For a single covariate $X_m$, the distance in the estimated regression coefficient after removing the subject, denoted $\tilde{d}_{mi}$, is written as

$$ \tilde{d}_{mi} = \left|\hat{\beta}_m - \hat{\beta}_{m(-i)}\right|. \qquad (6.17) $$

Equation (6.17) provides an exact measurement of the absolute influence of deleting subject i from the estimating process on the regression coefficient estimate, referred to as Cook's distance statistic. This statistic can also be expressed in terms of the entire vector of the estimated regression coefficients, given by $\tilde{d}_i = \left\|\hat{\beta} - \hat{\beta}_{(-i)}\right\|$. A greater value of $\tilde{d}_i$ suggests that subject i has a stronger influence on the estimate of $\beta$; likewise, a lower value indicates that subject i's impact on the model fit is limited. For convenience in performing the significance test, Cook's distance statistic is often scaled by the estimates of the coefficient standard errors after removing subject i. Mathematically, this scaled or standardized distance score is written as

$$ d_i = \frac{\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]'\left(X'X\right)\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]}{\left[1 + \mathrm{rank}(X)\right]s^2}, \qquad (6.18) $$

where $d_i$ is the scaled Cook's distance, or simply Cook's D, and $s^2$ is the mean square error of the data. In the influence diagnostics literature, the two Cook's distance statistics, $\tilde{d}_i$ and $d_i$, are also referred to as DFBETA$_i$ and DFBETAS$_i$, respectively (Belsley et al., 1980; Fox, 1991). Like the original statistic $\tilde{d}_i$, a large value of $d_i$ indicates that the parameter estimates are sensitive to removal of the ith subject. With standardization, $d_i$ approximately follows an F-distribution with numerator degrees of freedom $\mathrm{rank}(X)$ and denominator degrees of freedom $N - \mathrm{rank}(X)$. Given the F-distribution, the significance of Cook's D can be statistically tested. Specifically, assuming X to have full rank, the F statistic can be used to test the null hypothesis with threshold $F_{M,\,N-M,\,\alpha}$. For linear mixed models, the specification of the scaled Cook's D is slightly more complex, given by

$$ d_i = \frac{\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]'\,\widehat{\mathrm{var}}\left(\hat{\beta}\right)^{-1}\left[\hat{\beta} - \hat{\beta}_{(-i)}\right]}{\mathrm{rank}(X)}, \qquad (6.19) $$

where $\widehat{\mathrm{var}}(\hat{\beta})^{-1}$ is the matrix obtained by sweeping $\left[X'V\left(\hat{G},\hat{R}\right)^{-1}X\right]$.

If V is known, Cook's D can be evaluated according to a chi-square distribution with degrees of freedom equal to the rank of X (Christensen et al., 1992). If V is unknown, an estimate of V needs to be obtained from the approach described in Chapters 3 and 4, and the statistical evaluation of $d_i$ should be based on the $F_{M,\,N-M,\,\alpha}$ distribution. For large samples, however, checking the statistical significance of $d_i$ for each subject in sequence is unrealistic. In these situations, plotting the scaled statistics graphically is a more practical approach for a quick diagnostic check. An analytic approach to checking the statistical significance of $d_i$ without removing subjects in sequence will be described in Section 6.2.4.
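In the ordinary-least-squares special case, the deletion definition of Cook's D can be cross-checked against a closed form that avoids refitting, $D_i = e_i^2 h_i / [p\,s^2(1-h_i)^2]$ (a common variant of Eq. (6.18) with denominator $p \cdot s^2$, where $p = \mathrm{rank}(X)$ and $h_i$ is the leverage defined in Section 6.2.2). A minimal numpy sketch with simulated data (all names illustrative, not from the book):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages
s2 = e @ e / (N - p)                               # mean square error

# Deletion-based Cook's D: refit without each subject in turn
D_del = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    diff = beta - b_i
    D_del[i] = diff @ (X.T @ X) @ diff / (p * s2)

# Closed form without refitting
D_closed = e ** 2 * h / (p * s2 * (1 - h) ** 2)
```

The two computations agree exactly, which is why in practice Cook's D is obtained from a single fit rather than N refits.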


The influential cases can also be identified by the change in the predicted value after deleting each subject in sequence (Cook, 1977, 1979). One useful diagnostic in this regard is the PRESS statistic, defined as

$$ \mathrm{PRESS} = \sum_{i} \hat{r}_{i(-i)}^2, \qquad (6.20) $$

where each PRESS residual excludes subject i from the fit used to predict that subject's own response, $\hat{r}_{i(-i)} = y_i - X_i'\hat{\beta}_{(-i)}$.
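For ordinary least squares, the PRESS residuals in Eq. (6.20) need not be computed by N separate refits; the identity $r_{i(-i)} = e_i/(1 - h_i)$ gives them directly from a single fit. A sketch with simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=N)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Deletion-based PRESS residuals: r_{i(-i)} = y_i - x_i' beta_{(-i)}
press_del = np.empty(N)
for i in range(N):
    keep = np.arange(N) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_del[i] = y[i] - X[i] @ b_i

PRESS = np.sum((e / (1 - h)) ** 2)   # same sum via the leverage shortcut
```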

There are some other influence diagnostics based on the predicted value when a subject is removed from the regression. Because they are more or less related to the leverage statistic, they are described in Section 6.2.3, after leverage is delineated.

6.2.2  LEVERAGE

In linear regression models, leverage is used to assess outliers with respect to the independent variables by identifying observations that are distant from the average predictor values. While potentially impactful on the parameter estimates and the model fit, a high-leverage point does not necessarily exert a strong influence on the regression coefficient estimates, because a subject whose predictor values are far from those of the others can still lie on the same regression line as the other observations (Fox, 1991). Therefore, identifying a substantial influence must combine high leverage with discrepancy of the case from the rest of the data. The basic measurement of leverage is the so-called hat-value, denoted by $h_i$. In general linear models, the hat-value is specified as a weight in the expression of the fitted value of the predicted response $\hat{y}_j$, given by

$$ \hat{y}_j = \sum_{i=1}^{N} h_{ij}y_i, \qquad (6.21) $$

where $h_{ij}$ is the weight of subject i in predicting the outcome Y at data point j ($j = 1, 2, \ldots, N$), so that $\hat{y}_j$ is a weighted average of the N observed values. The weight $h_{ij}$ therefore displays the influence of $y_i$ on $\hat{y}_j$, with a higher score indicating a greater impact on the fitted value. Let

$$ h_i = h_{ii} = \sum_{j=1}^{N} h_{ij}^2. \qquad (6.22) $$

According to Equation (6.22), the hat-value $h_i$, with the property $0 \le h_i \le 1$, is the leverage score of $y_i$ on all fitted values. In general linear models including a number of independent variables, leverage measures distance from the means of the independent variables and can be expressed as a matrix quantity given the covariance structure of the X matrix. Correspondingly, the hat-value $h_i$ is the ith diagonal element of the hat matrix H, which is given by

$$ H = X\left(X'X\right)^{-1}X'. \qquad (6.23) $$

The diagonal of H provides a standardized measure of the distance of the ith subject from the center of the X-space, with a large value indicating that the subject is potentially influential. If all cases had equal influence, each subject would have a leverage score of M/N, where M is the number of independent variables (including the intercept) and N is the number of observations. In the influence diagnostics literature, leverage values exceeding 2M/N for large samples, or 3M/N for samples of N ≤ 30, are roughly regarded as influential cases. Given the H matrix, the predicted values of y in general linear models can be written as

$$ \hat{y} = Hy. \qquad (6.24) $$

Therefore, the H matrix determines the variances and covariances of the fitted values and residuals, given by

$$ \mathrm{var}(\hat{y}) = \sigma^2 H, \qquad (6.25\mathrm{a}) $$

$$ \mathrm{var}(r) = \sigma^2\left(I - H\right). \qquad (6.25\mathrm{b}) $$
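The basic properties of the hat matrix in Eqs. (6.21)-(6.25), idempotence, trace equal to M, leverages between 1/N and 1 when an intercept is included, and average leverage M/N, can be verified numerically. A sketch with an arbitrary simulated design (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 25, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M - 1))])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix, Eq. (6.23)
h = np.diag(H)                         # leverage scores h_i
cutoff = 2 * M / N                     # rough large-sample flag for influential cases
```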

In longitudinal data analysis, the specification of the H matrix becomes more complex due to the inclusion of the covariance matrix V(R, G). Let $\hat{\Theta}$ be an available estimate of the parameters in R and G (also specified in Chapter 4). The leverage score for subject i can be expressed as the ith diagonal element of the following hat matrix:

$$ H = X\left[X'V\left(\hat{\Theta}\right)^{-1}X\right]^{-1}X'V\left(\hat{\Theta}\right)^{-1}. \qquad (6.26) $$

The ith diagonal of the above matrix is the leverage score for subject i displaying the degree of the case’s difference from others in one or more independent variables.
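Equation (6.26) can be sketched the same way. The compound-symmetry covariance, its variance and correlation values, and all variable names below are illustrative assumptions for this example, not the book's data:

```python
import numpy as np

def mixed_hat_matrix(X, V):
    """Generalized hat matrix H = X (X'V^{-1}X)^{-1} X'V^{-1} (Eq. 6.26)."""
    Vinv = np.linalg.inv(V)
    return X @ np.linalg.inv(X.T @ Vinv @ X) @ X.T @ Vinv

# Toy setting: 4 repeated measures with an assumed compound-symmetry
# covariance (unit variance, intraclass correlation 0.5).
X = np.column_stack([np.ones(4), np.arange(4.0)])
rho = 0.5
V = rho * np.ones((4, 4)) + (1 - rho) * np.eye(4)
H = mixed_hat_matrix(X, V)
leverage = np.diag(H)            # leverage scores under V(theta_hat)
```

H remains a projection even with V plugged in: H² = H and trace(H) = rank(X), so the leverage scores still sum to the number of fixed-effect columns.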

6.2.3  DFFITS, MDFFITS, COVTRACE, AND COVRATIO STATISTICS

In addition to Cook’s D, a number of other case-deletion diagnostics are frequently used in linear regression models. These diagnostics include the DFFITS, the MDFFITS, the COVTRACE, and the COVRATIO statistics (Belsley et al., 1980; Christensen et al., 1992; Fox, 1991; SAS, 2012, Chapter 59). The DFFITS statistic is defined as the change in the predicted value after removing a case, standardized by dividing by the estimated standard error of the fit. In general linear models, the statistic is defined as

DFFITS_i = (ŷ_i − ŷ_{i(−i)}) / (s_{(−i)} √h_i),  (6.27)


where yˆi is the predicted value of Y for subject i from the linear regression including i, yˆi( − i ) is the same prediction after removing data point i in fitting the regression model, s( − i ) is the standard error estimate without i, and hi is the corresponding leverage score. Given studentizing, the DFFITS statistic follows a Student t distribution multiplied by a leverage factor, given by



DFFITS_i = t_i √[h_i / (1 − h_i)].  (6.28)
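As a hedged illustration of Equations (6.27) and (6.28), the following computes DFFITS by explicit case deletion and then lets the studentized shortcut be verified numerically. The toy data and the helper name are invented for this sketch:

```python
import numpy as np

def dffits(X, y):
    """DFFITS by explicit case deletion (Eq. 6.27), returning the pieces
    needed to check the studentized form of Eq. 6.28."""
    n, m = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    d = np.empty(n)
    s_del = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
        # residual standard error estimated without case i
        s_del[i] = np.sqrt(np.sum((y[keep] - X[keep] @ beta_i) ** 2) / (n - 1 - m))
        d[i] = (X[i] @ beta - X[i] @ beta_i) / (s_del[i] * np.sqrt(h[i]))
    return d, h, resid, s_del

# Toy data: a straight-line fit where the last case is a gross outlier.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1.0, 1.9, 3.2, 3.9, 5.1, 9.0])
d, h, resid, s_del = dffits(X, y)
```

The identity DFFITS_i = t_i √(h_i/(1 − h_i)) can be confirmed from the externally studentized residuals t_i = e_i / (s_{(−i)} √(1 − h_i)); the planted outlier dominates on both versions.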

In longitudinal data analysis, the statistical procedure to fit a regression model is more complex than that of a simple linear regression given the specification of Vi. For example, in linear mixed models, the DFFITS statistic is specified as



DFFITS_i = (ŷ_i − ŷ_{i(−i)}) / ese(ŷ_i),  (6.29)

where ese ( yˆi ) is the asymptotic standard error estimate for yˆi . This statistic can be approximated by

ese(ŷ_i) = {x_i′ [X′V(Θ̂_{(−i)})⁻¹X]⁻¹ x_i}^{1/2},  (6.30)

where x_i is the observed matrix of X for subject i. As a standardized diagnostic measure, the DFFITS statistic indicates the number of standard deviations by which the fitted value changes after the removal of subject i. The MDFFITS statistic is used when multiple data points are removed from the regression (Belsley et al., 1980). In longitudinal data, case deletion generally implies removal of multiple data points, thereby indicating that MDFFITS is an appropriate statistic for performing influence diagnostics in linear mixed models. Christensen et al. (1992) specify this statistic on the fixed effects in the context of linear mixed models, given by



−  βˆ − βˆ ′ var  βˆ   βˆ − βˆ  (− i)  (− i)   (− i)    MDFFITS  β ( − i )  = . rank ( X )

(6.31)

There is a striking similarity between the above MDFFITS[β_{(−i)}] formulation and Cook’s D, specified by Equation (6.19). Both statistics measure the influence of data points on the vector of fixed effects, with the only difference being the use of var[β̂_{(−i)}] in Equation (6.31). If the covariance parameters are assumed to be fixed, the MDFFITS score for each subject can be estimated by a noniterative procedure to check only the fixed effects and the residual variance. The MDFFITS score, however, can be underestimated if a subject’s impact on the covariance parameters is overlooked. Therefore, the iterative approach is preferred to assess the overall impact of a subject, including


the influence on the covariance parameter estimates. For the iterative influence analysis, var[β̂_{(−i)}] is evaluated at Θ̂_{(−i)}, and therefore, an MDFFITS score can be computed specifically for the covariance parameters, written as

MDFFITS[Θ_{(−i)}] = (Θ̂ − Θ̂_{(−i)})′ [var(Θ̂_{(−i)})]⁻¹ (Θ̂ − Θ̂_{(−i)}).  (6.32)
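A minimal, noniterative sketch of the MDFFITS quadratic form of Equation (6.31), transplanted to an ordinary regression setting for clarity. The data and function name are illustrative assumptions, and the deleted-case covariance is the usual OLS estimate rather than a mixed-model one:

```python
import numpy as np

def mdffits(X, y):
    """Quadratic distance between full-data and case-deleted coefficients,
    scaled by var[beta_(-i)] and rank(X), in the spirit of Eq. 6.31."""
    n, m = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    r = np.linalg.matrix_rank(X)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xd, yd = X[keep], y[keep]
        XtXd = Xd.T @ Xd
        beta_d = np.linalg.solve(XtXd, Xd.T @ yd)
        s2_d = np.sum((yd - Xd @ beta_d) ** 2) / (n - 1 - m)
        var_beta_d = s2_d * np.linalg.inv(XtXd)      # var[beta_(-i)]
        diff = beta - beta_d
        scores[i] = diff @ np.linalg.solve(var_beta_d, diff) / r
    return scores

# Toy data: the last case is a gross outlier and shifts the coefficients most.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1.0, 1.9, 3.2, 3.9, 5.1, 9.0])
scores = mdffits(X, y)
```

Because var[β̂_{(−i)}] is positive definite, every score is nonnegative, and the outlying case stands out by orders of magnitude.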

The covariance trace and the ratio statistics, referred to as COVTRACE and COVRATIO, respectively, are the other two widely used diagnostics for identifying influential cases in linear regression models (Belsley et al., 1980; Christensen et al., 1992; Fox, 1991; SAS, 2012, Chapter 59). While Cook’s D, DFFITS, and MDFFITS statistics are used to measure a subject’s influence on the parameter estimates and the fitted values, the COVTRACE and COVRATIO statistics are applied to assess the influence on the precision of parameter estimates. For linear mixed models, Christensen et al. (1992) define the corresponding COVTRACE and COVRATIO statistics, given by

COVTRACE[β_{(−i)}] = |trace{[var(β̂)]⁻¹ var(β̂_{(−i)})} − rank(X)|,  (6.33)

COVRATIO[β_{(−i)}] = det(var[β̂_{(−i)}]) / det(var[β̂]),  (6.34)

where det(var[β̂_{(−i)}]) indicates the determinant of the nonsingular part of the var[β̂_{(−i)}] matrix. The COVRATIO statistic can be used to assess the precision of the fixed effects with the following criteria: if COVRATIO > 1, inclusion of subject i in the regression improves the precision of the parameter estimates; if COVRATIO < 1, incorporation of the subject in the estimating process reduces the precision of the estimation, so that subject i may be deleted from the model fit. In the iterative influence analysis, the COVTRACE and COVRATIO statistics can also be computed for the covariance parameter estimates:

COVTRACE[Θ_{(−i)}] = |trace{[var(Θ̂)]⁻¹ var(Θ̂_{(−i)})} − rank(var[Θ̂])|,  (6.35)

COVRATIO[Θ_{(−i)}] = det(var[Θ̂_{(−i)}]) / det(var[Θ̂]).  (6.36)

Empirically, iterative computations of COVTRACE and COVRATIO for the covariance parameter estimates are burdensome and tedious, particularly for a large sample. When difficulty in performing the iterative methods arises, the researcher might want to consider using another diagnostic approach.
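The same toy setting used above can illustrate the COVTRACE and COVRATIO definitions for the fixed effects (Equations 6.33 and 6.34). This is a sketch of the formulas in an ordinary regression, not the iterative mixed-model implementation; data and names are illustrative:

```python
import numpy as np

def cov_diagnostics(X, y):
    """COVTRACE and COVRATIO for the fixed effects by case deletion."""
    n, m = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    s2 = np.sum((y - X @ beta) ** 2) / (n - m)
    var_full = s2 * np.linalg.inv(X.T @ X)           # var[beta]
    r = np.linalg.matrix_rank(X)
    covtrace = np.empty(n)
    covratio = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xd, yd = X[keep], y[keep]
        beta_d = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)
        s2_d = np.sum((yd - Xd @ beta_d) ** 2) / (n - 1 - m)
        var_d = s2_d * np.linalg.inv(Xd.T @ Xd)      # var[beta_(-i)]
        covtrace[i] = abs(np.trace(np.linalg.solve(var_full, var_d)) - r)
        covratio[i] = np.linalg.det(var_d) / np.linalg.det(var_full)
    return covtrace, covratio

# Same toy data as above: the last case is a gross outlier.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([1.0, 1.9, 3.2, 3.9, 5.1, 9.0])
ct, cr = cov_diagnostics(X, y)
```

Deleting the outlier shrinks the residual variance dramatically, so its COVRATIO falls far below 1: by the criterion above, including that case reduces the precision of the estimates.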


6.2.4  LIKELIHOOD DISPLACEMENT STATISTIC APPROXIMATION

The above influence diagnostics are useful for graphically identifying influential cases. Some of those diagnostic statistics are linked to certain types of probability distributions by means of standardization, and empirically based cut-points may be created from the standardized results. In this respect, a popular diagnostic approach is LD, which is directly associated with the likelihood ratio statistic. This likelihood-type diagnostic statistic, originally proposed by Cook (1977), produces both graphical plots and analytic results simultaneously. In the literature of influence diagnostics, the statistic is generally referred to as LD if it is generated from the log-likelihood function or as RLD if it is derived from the restricted log-likelihood function. Let l be the log-likelihood function log L and l_R be the restricted log-likelihood function log L_R. In general linear models, the LD and RLD scores for subject i, denoted by LD_i and RLD_i, respectively, can then be defined as

LD_i = 2l(β̂) − 2l[β̂_{(−i)}],  (6.37a)

RLD_i = 2l_R(β̂) − 2l_R[β̂_{(−i)}],  (6.37b)

where l(β̂) is the log-likelihood function and l_R(β̂) is the restricted log-likelihood function evaluated at the estimated regression coefficients β̂. In simple linear regression models, the LD statistic is distributed as chi-square with one degree of freedom under the null hypothesis that β̂_{(−i)} = β̂. With this desirable property, the observations having a strong impact on β̂ can be statistically tested as well as graphically identified. Consequently, this statistic provides sufficient information on whether an identified influential case should be removed from the regression. In longitudinal data analysis, the use of this exact statistic is often not realistic. When the sample size is large, the case-deletion process becomes extremely tedious and time-consuming. Consider, for example, a sample of 2000 subjects: the analyst needs to fit 2001 linear mixed models to identify which case or cases have an exceptionally strong impact on the parameter estimates. If the distance score θ̂ − θ̂_{(−i)} can be statistically approximated by a scalar measurement in linear mixed models, influential observations can be identified without removing each case in sequence from the estimating process. Cook (1986) developed a method to approximate Equation (6.37) by introducing weights into the likelihood function, with different individuals allowed different weights. In this method, an N-dimensional vector w = (w_1, w_2, ..., w_N)′ is created, with element w_i (i = 1, 2, ..., N) being the weight for subject i. The w vector in the classical log-likelihoods can be denoted by w_0 = (1, 1, ..., 1)′, in which each subject has weight one. This approach can be readily extended to linear mixed models. Let θ be the parameter vector consisting of β, G, and R. The log-likelihood and the log restricted


likelihood functions with weights, referred to as perturbed log-likelihoods (Verbeke and Molenberghs, 2000), are then given by

l(θ|w) = Σ_{i=1}^{N} w_i l_i(θ),  (6.38a)

l_R(θ|w) = Σ_{i=1}^{N} w_i l_{Ri}(θ),  (6.38b)

where l(θ|w) is the perturbed log-likelihood function given the inclusion of the weight vector w, and l_R(θ|w) is the corresponding perturbed restricted log-likelihood function. In longitudinal data analysis, w_i is set at zero if the entire set of observations for subject i is excluded from the linear mixed model. Given the specification of weights, the influential cases can be identified by measuring the distance between θ̂_w and θ̂ in the LD formulation. The approximated LD and RLD scores for all subjects, denoted by LD(w) and RLD(w), are given by

LD(w) = 2l(θ̂) − 2l(θ̂_w),  (6.39a)

RLD(w) = 2l_R(θ̂) − 2l_R(θ̂_w).  (6.39b)

Obviously, it is still unrealistic to evaluate LD(w) or RLD(w) for all elements in w. Cook (1986) suggests studying the local behavior of LD(w) around w_0 because it describes the sensitivity of LD(w) with respect to slight perturbations of the weights. Accordingly, LD(w) at w_0 can be approximated by the classical likelihood expressions without directly including w. In the context of linear mixed models, the LD score for subject i can then be approximated by

LD_i(w_i) ≈ U_i′ I(θ̂)⁻¹ U_i,  (6.40)

where U_i is the score function for subject i, defined as the first partial derivative of the log-likelihood function with respect to θ, and I(θ̂)⁻¹ is the inverse of the observed Fisher information matrix, defined as the negative of the second partial derivative of the log-likelihood function with respect to θ. Both quantities are evaluated at θ = θ̂. As both U_i and I(θ̂) can be obtained from standard linear mixed modeling, this approximation requires no additional model fitting, thereby avoiding the specification of w_i. Given Equation (4.16), the RLD_i(w_i) statistic can be obtained in the same fashion with minor modifications (Lesaffre and Verbeke, 1998). As discussed in Lesaffre and Verbeke (1998) and Verbeke and Molenberghs (2000), the change in the log-likelihoods between θ̂_w and θ̂ reflects the variability of θ̂. If the LD(w) or the RLD(w) score is large, l(θ) or l_R(θ) is strongly curved at θ̂, thereby suggesting that θ is estimated with high precision. If the statistic is small, θ is estimated with high variability. As a result, a graph of LD(w)


or RLD(w) approximations plotted against the w vector can provide important information on the total local influence of each subject, thereby identifying influential cases. For statistical details concerning this graphical method, the reader is referred to Gruttola et al. (1987), Lesaffre and Verbeke (1998), and Verbeke and Molenberghs (2000, Chapter 11).
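The payoff of Equation (6.40) is easiest to see in the simplest possible model. The sketch below is illustrative, not from the text: i.i.d. normal data with a known variance, where the only parameter is the mean. In that toy case the one-step scores U_i²/I reproduce the exact case-deletion LD up to the known factor N²/(N − 1)², with no refitting:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                          # variance assumed known in this sketch
y = rng.normal(10.0, np.sqrt(sigma2), size=50)
N = y.size
mu_hat = y.mean()

# Score U_i and Fisher information I for mu, evaluated at the full-data MLE.
U = (y - mu_hat) / sigma2
I = N / sigma2
ld_approx = U**2 / I                  # Equation (6.40), scalar-parameter case

# Exact likelihood displacement by brute-force case deletion, for comparison.
def loglik(mu):
    return -0.5 * np.sum((y - mu) ** 2) / sigma2   # up to an additive constant

ld_exact = np.array([2 * loglik(mu_hat) - 2 * loglik(np.delete(y, i).mean())
                     for i in range(N)])
```

In this toy case ld_exact = ld_approx · N²/(N − 1)² holds exactly, so the cheap one-step scores rank subjects identically to the expensive deletion loop.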

6.2.5  LMAX STATISTIC FOR IDENTIFICATION OF INFLUENTIAL OBSERVATIONS

Based on the LD approximation, a more refined, innovative technique for identifying influential observations is the LMAX statistic, originally developed by Cook (1986) as a standardized diagnostic method in general regression modeling and later introduced and advanced into linear mixed models by Lesaffre and Verbeke (1998). Methodologically, this method maximizes the approximation to the LD(w) statistic over perturbations standardized to unit length. Cook (1986) suggests the use of a standardized LD statistic to more adequately measure the influence of particular cases in the estimating process. Given the LD statistic, Cook first defines a symmetric matrix B, given by

B ≈ U′I(θ̂)⁻¹U,  (6.41)

where U is the matrix with rows containing the score residual vectors U_i. With B defined, Cook considers the direction of the N × 1 vector l that maximizes l′Bl, where l is standardized to have unit length. Because the M × M matrix I(θ̂)⁻¹ is positive definite, the N × N symmetric matrix B is positive semidefinite, with rank no more than M. The statistic l_max corresponds to the unit-length eigenvector of B associated with its largest eigenvalue λ_max. Therefore, l_max′ B l_max maximizes l′Bl and satisfies the equation

B l_max = λ_max l_max and l_max′ l_max = 1,

where λ_max is the largest eigenvalue of B, and l_max is the eigenvector associated with λ_max. The elements of l_max, standardized to unit length, measure the sensitivity of the model fit to each observation. The absolute value of l_i, the ith element in l_max, is used as the LMAX score for subject i. Given the unit length, the expected value of the squared LMAX statistic for each case is 1/N, where N is the sample size; correspondingly, a value considerably greater than this expected value indicates a strong influence on the overall fit of a regression model and is thereby identified. If M = 1, l_max is proportional to U and λ_max = I(θ̂)⁻¹U′U. When M > 1, an advantage of examining elements of the LMAX statistic is that each case has a single summary measurement of influence. As the LMAX score is a standardized statistic, it is useful to plot the elements of the LMAX scores against time points and/or the values of other covariates. The


standardization of l_max to unit length means that the squares of the elements in l_max sum to unity; therefore, the signs of the elements of l_max are not of concern, and for l_i only the absolute value needs to be plotted. Observations that have the most undue influence on parameter estimates and the model fit can be readily identified by examining the relative influence of the elements in l_max. If none of the subjects has an undue impact on the model fit, the plot of the LMAX scores should approximate a horizontal line. There is a lack of analytic expressions for l_max. Because graphically influential cases contribute high proportions to the overall fit of a linear mixed model, it is necessary to check the exact LD for them. The researcher can perform the significance test to further analyze those influential cases in two steps (Liu, 2012). First, graphically identify the most influential observations by using the LMAX approximation (there are usually only a few distinctive outliers in linear models). Next, analytically examine the exact changes in the estimates of the fixed effects and the covariance parameters after deleting each of those few cases, using the classical LD criterion. Following this strategy, a decision can be made regarding whether those influential cases should be removed in fitting a linear mixed model.
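The eigenvector computation behind LMAX can be sketched in the same toy normal-mean setting used for the LD approximation. The planted outlier, data values, and variable names below are illustrative assumptions, not the book's example:

```python
import numpy as np

# Deterministic toy data: 29 well-behaved points plus one gross outlier,
# in the normal-mean setting with known (assumed) variance.
y = np.concatenate([np.linspace(-1.0, 1.0, 29), [8.0]])
sigma2 = 1.0
N = y.size
mu_hat = y.mean()

U = ((y - mu_hat) / sigma2).reshape(-1, 1)   # N x M score matrix, M = 1
I_inv = sigma2 / N                           # inverse information for mu
B = I_inv * (U @ U.T)                        # N x N, symmetric, rank <= M

eigvals, eigvecs = np.linalg.eigh(B)
lmax_scores = np.abs(eigvecs[:, np.argmax(eigvals)])  # unit-length LMAX scores
```

The squared scores sum to one, so each is compared against the reference value 1/N; here the planted outlier dominates. For M = 1 the eigenvector is just U normalized to unit length, matching the closed form noted above.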

6.3  EMPIRICAL ILLUSTRATIONS ON INFLUENCE DIAGNOSTICS

In this empirical illustration, I provide two examples of performing influence diagnostics in longitudinal data analysis. I continue to use the two longitudinal datasets introduced in the preceding chapters: one on the effectiveness of acupuncture treatment on PTSD and one concerning the effect of marital status on an older person’s disability severity.

6.3.1  INFLUENCE CHECKS ON LINEAR MIXED MODEL CONCERNING EFFECTIVENESS OF ACUPUNCTURE TREATMENT ON PCL SCORE

The first example is a diagnostic check of influence on the linear mixed model fitted in Chapter 5 (Section 5.5.1). As a follow-up analysis, the model specifications, covariate definitions, and the use of an appropriate covariance structure are all the same as described in Chapter 5. Specifically, the dependent variable is PCL_SUM, measured at four time points: 0 = baseline survey, 1 = 4-week follow-up, 2 = 8-week follow-up, and 3 = 12-week follow-up. The treatment factor is dichotomous, with 1 = receiving acupuncture treatment and 0 = else. Of the three covariance pattern models, CS, AR(1), and TOEP, CS yields the smallest values on all four information criteria and is therefore selected as the residual covariance pattern model for this analysis. The objective of this diagnostic analysis is to identify whether there are influential cases for both the fixed effects and the covariance parameters in the model fit. The following SAS program performs the diagnostics using an iterative approach.


SAS Program 6.1: . . . . . .

In SAS Program 6.1, ODS Graphics is enabled to create diagnostic plots. As the analysis is focused on the identification of influential cases, the options to derive the fixed effects and the covariance parameters are not specified. The ML estimator is applied in the diagnostic checks because the REML estimator cannot be used to compare nested regression models for the mean (Fitzmaurice et al., 2004). The INFLUENCE option is specified in the MODEL statement, telling SAS to compute influence statistics. The influence suboption EFFECT = ID specifies that the classification effect is the subject. Additionally, the suboption ITER = 5 informs SAS that the fixed effects and the covariance parameters are to be updated by refitting the mixed model for up to five iterations. It may be mentioned that for ITER = 0, SAS performs a noniterative influence analysis on the fixed effects only, assuming the covariance parameters or their ratios to be fixed. Given SAS Program 6.1, SAS constructs a table of influence diagnostics at the subject level. This diagnostic summary table contains eleven diagnostic statistics, including Cook’s D for both the fixed effects and the covariance parameters, PRESS, MDFFITS for both the fixed effects and the covariance parameters, COVRATIO for both the fixed effects and the covariance parameters, COVTRACE for both the fixed effects and the covariance parameters, the likelihood distance (LD), and the root mean square error (RMSE). While the diagnostic table includes a large amount of data, a portion of the statistics is selected for display, including Cook’s D, MDFFITS for the fixed effects, COVRATIO for the fixed effects, and the LD statistic. Table 6.1 presents those statistics for the first ten subjects. Table 6.1 shows that each of the ten subjects has multiple observations except subject 271, and eight of the ten have complete information on the PCL score. All subjects have relatively small values of Cook’s D and the MDFFITS statistics on the fixed effects.
As the two statistics differ only in the specification of the covariance matrix for the fixed effects, the values of Cook’s D and MDFFITS are very close, with the MDFFITS value slightly smaller given the use of var[β̂_{(−i)}] rather than var[β̂] in computation. The COVRATIO and the LD statistics are also fairly stable, thereby suggesting that none of these ten subjects has a strong impact on the fit of the linear mixed model on PCL_SUM. From the results of other influence diagnostics and for other subjects, no distinctively influential cases can be found. From SAS Program 6.1, a set of diagnostic plots is displayed. First, a plot of the likelihood distance is presented in Fig. 6.1.


Table 6.1  Selected Influence Statistics for 10 Subjects: DHCC Acupuncture Treatment Study (N = 10)

Subject’s ID   Number of Observations   Number of Iterations   Cook’s D   MDFFITS on β   COVRATIO on β   LD Statistic
132            4                        2                      0.0533     0.0524         1.0080          0.5396
167            4                        1                      0.0114     0.0108         1.2674          0.1007
193            4                        2                      0.0242     0.0236         1.1246          0.2139
220            4                        2                      0.0134     0.0132         1.2960          0.1690
227            4                        2                      0.0195     0.0188         1.1914          0.1657
230            2                        1                      0.0006     0.0006         1.1728          0.0158
235            4                        1                      0.0263     0.0252         1.1858          0.2130
245            4                        2                      0.0358     0.0351         1.0913          0.3399
271            1                        1                      0.0000     0.0000         1.0785          0.0057
276            4                        2                      0.0324     0.0315         1.1090          0.3003

FIGURE 6.1  Likelihood Distance for Each Subject


FIGURE 6.2  Other Influence Statistics on PCL_SUM

As there are 55 subjects in the analysis, not all the ID numbers are displayed in Fig. 6.1. Based on the output table from SAS Program 6.1, influential cases can be easily identified. As judged from both Fig. 6.1 and the output table, subjects 791 and 814 have the strongest impact on the model fit. Given the large value of the overall model fit statistic, however, those relatively outstanding values, 1.1328 and 1.5064, cannot lead to a firm conclusion that the two cases exert actual influence on the fit of the linear regression model. Figure 6.2 displays Cook’s D and the COVRATIO statistics for both the fixed effects and the covariance parameters considered in the model. In Fig. 6.2, the aforementioned two subjects also display high Cook’s D values, thereby suggesting a strong impact on the results of the model fit. These two cases are shown to be influential not only on the fixed effects but also on the estimates of the variance and covariance parameters. With respect to Cook’s D, subject 623 is identified as an additional influential case in the estimation of the fixed effects but not of the covariance parameters. The three subjects, 791, 814, and 623, are also linked to low precision both in the fixed effects and in the covariance parameters, given their low COVRATIO scores. These diagnostic statistics, however, display the relative contribution to the model fit but do not necessarily translate into an actual influence. Given the relative importance of the two most influential cases to the overall fit of the linear mixed model, it may be necessary to check the exact LD for both. Accordingly,


two additional linear mixed models are created, each deleting one of those influential cases. The SAS program for this step is not presented because programming of the two additional models follows the standard procedure, as previously exhibited, except for the removal of a single subject. Table 6.2 summarizes the results. In Table 6.2, the first linear mixed model uses the full data, with results previously reported. The second model is fitted after removing subject 791, who has four observations.

Table 6.2  Maximum Likelihood Estimates and the Likelihood Displacement Statistic for Three Linear Mixed Models

Explanatory Variable   Parameter Estimate   Standard Error   Degrees of Freedom   t-value   p-value

Linear Mixed Model With Full Data (−2 LL = 1411.3; p < 0.0001)
Intercept              55.444               2.509            53                   22.10     <0.0001
Treatment              2.698                3.516            53                   0.77      0.4463
Time 1                 −3.963               2.056            128                  −1.93     0.0561
Time 2                 −2.977               2.168            128                  −1.37     0.1721
Time 3                 −9.233               2.137            128                  −4.32     <0.0001
Treat × Time 1         −14.494              3.007            128                  −4.82     <0.0001
Treat × Time 2         −15.803              3.232            128                  −4.89     <0.0001
Treat × Time 3         −8.666               3.177            128                  −2.73     0.0073

Linear Mixed Model Deleting Subject 791 (−2 LL = 1371.3; p < 0.0001)
Intercept              56.231               2.550            52                   22.05     <0.0001
Treatment              1.912                3.542            52                   0.54      0.5916
Time 1                 −4.885               2.007            125                  −2.43     0.0164
Time 2                 −4.359               2.122            125                  −2.05     0.0420
Time 3                 −10.613              2.090            125                  −5.08     <0.0001
Treat × Time 1         −13.537              2.908            125                  −4.65     <0.0001
Treat × Time 2         −14.369              3.129            125                  −4.59     <0.0001
Treat × Time 3         −7.229               3.075            125                  −2.35     0.0203
LD statistic: 40.0 (df = 4; p < 0.01)

Linear Mixed Model Deleting Subject 814 (−2 LL = 1368.6; p < 0.0001)
Intercept              54.923               2.535            52                   21.67     <0.0001
Treatment              3.220                3.520            52                   0.91      0.3646
Time 1                 −4.231               1.992            125                  −2.12     0.0356
Time 2                 −3.338               2.106            125                  −1.59     0.1154
Time 3                 −7.959               2.074            125                  −3.84     0.0002
Treat × Time 1         −14.189              2.886            125                  −4.92     <0.0001
Treat × Time 2         −15.388              3.105            125                  −4.96     <0.0001
Treat × Time 3         −9.880               3.052            125                  −3.24     0.0015
LD statistic: 42.7 (df = 4; p < 0.01)


The exact LD statistic, computed from the formula 2l(θ̂) − 2l[θ̂_{(−i)}], is statistically significant with four degrees of freedom at α = 0.05 (LD = 40.0, p < 0.01), thereby indicating that this subject makes a very strong statistical impact on the overall fit of the first linear mixed model. Likewise, the third model is fitted after deleting subject 814, with the exact LD statistic also statistically significant by the same criterion (LD = 42.7, p < 0.01). Furthermore, in both the second and the third models, the parameter estimates, including the regression coefficients and the standard errors, vary notably after removing each of those two cases. Obviously, those two cases make a genuine impact on the fit of the linear mixed model. Given the considerable changes in the estimated regression coefficients as well as the exceptionally strong statistical contributions, I would recommend that the two influential cases be eliminated from the regression. The lack of statistical stability may also be linked to the small sample size of the data. For large samples, a strong relative influence of a few particular cases can usually be washed out by the effects of the vast majority of normal observations in the estimating process. This argument will be further discussed in the next example, where a large sample is analyzed. Influence diagnostics can also be applied in the random coefficient model, in which time is treated as a continuous variable. For example, I can graphically examine the fixed effects and the covariance parameters in the random coefficient model on the PCL score specified in Chapter 3. By adapting SAS Program 4.1, the code is displayed below. SAS Program 6.2: . . . . . .

In SAS Program 6.2, some options are selected to request an influence analysis of the parameter estimates in the random coefficient model. In the PROC MIXED statement, the PLOT(ONLY) = INFLUENCESTATPANEL option requests panels of influence statistics, with the ONLY suboption in parentheses informing SAS that only the requested plots be produced while the default plots are suppressed. For the iterative analysis, the panel displays Cook’s D and the COVRATIO statistic for the fixed effects and the covariance parameters. In the INFLUENCE option in the MODEL statement, the suboption EST is added, requesting SAS to write the updated parameter estimates to the ODS output dataset. The resulting plots from this program are displayed in Fig. 6.3. Generally, Fig. 6.3 demonstrates the same results as those from the influence analysis for the linear model using a residual covariance pattern model. Such a


FIGURE 6.3  Influence Statistics on PCL_SUM in the Random Coefficient Model

similarity is not surprising, as both types of linear mixed models usually yield very close estimates for the fixed effects and the covariance parameters. If the researcher is interested in graphically checking the detailed influence statistics for each fixed-effect and covariance parameter in the random coefficient model, additional analyses are needed for a thorough assessment. This influence analysis can be achieved by making some minor modifications in SAS Program 6.2. Specifically, by replacing the option PLOT(ONLY) = INFLUENCESTATPANEL with the PLOT(ONLY) = INFLUENCEESTPLOT option, the following graphs are produced for the fixed effects and the 3 × 3 variance–covariance matrix of the random coefficients. In the interpretation of Fig. 6.4, the focus is on the behaviors of the two most influential cases, subjects 791 and 814. Specifically, subject 791 has a strong impact on the model fit of the intercept and the slopes for CT, CT_2, TREAT, and the interactions between CT and TREAT and between CT_2 and TREAT. Removal of this subject, however, has no influence at all on the fixed effect of CT_3 and the interaction between CT_3 and TREAT. On the other hand, subject 814 has a very strong influence on the regression coefficient estimates of the intercept, CT_2, TREAT, and the interaction between CT_2 and TREAT. At the same time, subject 814 has some impact on the fixed effects of CT_3 and the interactions between CT and TREAT and


FIGURE 6.4  Fixed-Effects Deletion Estimates on PCL_SUM


between CT_3 and TREAT, but has no influence at all on the fixed effect of CT. In this analysis, subject 998 is identified as an additional influential case, since it has a strong impact on half of the fixed effects. Figure 6.5 displays the results of the deletion estimates for each of the covariance parameters on the PCL score. In Fig. 6.5, all cases except subjects 132, 227, and 998 contain no information, because the estimation of G after removal of those subjects does not yield a positive definite matrix, so the deletion estimates cannot be computed. With the small sample size of a longitudinal dataset, an influence analysis on the covariance parameter deletion estimates is often inefficient for checking the fit of the random coefficient model.

6.3.2  INFLUENCE DIAGNOSTICS ON LINEAR MIXED MODEL CONCERNING MARITAL STATUS AND DISABILITY SEVERITY AMONG OLDER AMERICANS

In the second example of influence diagnostics, the AHEAD data are used (2000 randomly selected subjects and six waves: 1998, 2000, 2002, 2004, 2006, and 2008). In this analysis, the dependent variable remains the health-related difficulty in performing five activities of daily living (dressing, bathing/showering, eating, walking across a room, and getting in/out of bed), measured at six time points and named ADL_COUNT. Each item is scored one if an older person has difficulty, personal help, or equipment help for health-related reasons, and zero otherwise; therefore, the ADL count ranges from 0 to 5 at each time point. Marital status, named married, is the same dichotomous variable with 1 = currently married and 0 = else, specified as a time-varying covariate. The controls are the three centered variables Age_mean, Educ_mean, and Female_mean. Given the same set of covariates, the same linear mixed model described in Chapter 5 (Section 5.6.2) is applied. The ML estimator is used to perform an influence analysis on the fixed effects and the covariance parameters. The following SAS program, an adapted version of SAS Program 5.4, is used for this analysis. SAS Program 6.3: . . . . . .

SAS Program 6.3 is similar to SAS Program 6.1, with contextual modifications. In the PROC MIXED statement, the PLOTS(MAXPOINTS = 20,000) option tells


FIGURE 6.5  Covariance Parameter Deletion Estimates on PCL_SUM


SAS to increase the maximum number of data points in plotting the least squares means from the default 5,000 to 20,000, given the large sample size of the AHEAD longitudinal data. As indicated in Chapter 5 (Section 5.6.2), the TOEP covariance pattern model is selected to address the covariance structure in the residuals. Given the large sample size, SAS Program 6.3 yields a substantial quantity of output, containing eleven diagnostic statistics for each of the 2000 subjects. The diagnostic table for this influence analysis takes the same format as that of the first example, in which a rich body of influence diagnostics, including tables and graphs, is presented. Given the large quantity of output, the detailed contents of the diagnostic statistics and the figures cannot all be reported. Instead, the focus of this influence analysis is placed upon identification of the most influential cases rather than on interpretation of the output tables and graphs. It follows that the exact LD scores are computed by removing each of the most influential cases from the linear mixed model. Furthermore, the actual influences of those outstanding cases are examined by checking changes in the parameter estimates after removing each of them from the regression. After a careful examination of the results from SAS Program 6.3, two influential subjects are identified: subject 200520020 (HHIDPN number) and subject 207669010. Given their relatively high LD scores, it is necessary to check the exact LD for both cases. Analogous to the approach applied in the first example, two additional linear mixed models are created, each deleting one of the two influential cases. The SAS program for this step is not presented because the two mixed models follow the standard procedure, as previously displayed, except for the removal of a single subject. Table 6.3 summarizes the results for the fixed effects.
In Table 6.3, the first linear mixed model uses the full data, while the second model is fitted after deleting the most influential case (subject ID: 200520020). The exact LD statistic, computed from the formula LD(i) = 2l(θ̂) − 2l(θ̂(−i)), is statistically significant with two degrees of freedom (this subject has two observations) at α = 0.05 (LD = 15.8, p < 0.01), indicating that the most influential case makes a very strong statistical impact on the overall fit of the fixed effects. Likewise, the third mixed model is fitted after deleting the second most influential case (subject ID: 207669010), with the exact LD statistic also statistically significant, with six degrees of freedom, at the same criterion (LD = 31.3, p < 0.01). In both the second and the third models, however, the fixed-effects estimates, including the regression coefficients and the standard errors, hardly vary after removing each of those two cases. The fixed effects of the three control variables, not presented in Table 6.3, are also strikingly consistent after removing each of the two cases; the three sets of estimated regression coefficients are almost identical. Obviously, deleting those two cases makes no genuine impact on the fixed-effects estimates, despite their strong statistical influence. Next, the impacts of the two influential cases on the covariance parameter estimates are examined. The following SAS program output displays three panels of covariance parameter estimates, derived from the three linear mixed models described above.
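The exact LD values quoted above can be reproduced directly from the −2 log-likelihoods reported in Table 6.3. The following sketch (an illustration, not the chapter's SAS code) computes the drop in −2LL when a case is deleted and refers it to a chi-square distribution with degrees of freedom equal to the number of deleted observations; the chi-square survival function is implemented in closed form, which is available because both degrees of freedom here are even.

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function P(X > x), closed form for even df:
    exp(-x/2) * sum_{k < df/2} (x/2)^k / k!"""
    assert df % 2 == 0 and x >= 0
    m = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= m / k
        total += term
    return math.exp(-m) * total

# -2 log-likelihood values reported in Table 6.3
neg2ll_full = 20016.3   # full data
neg2ll_del1 = 20000.5   # subject 200520020 removed (2 observations)
neg2ll_del2 = 19985.0   # subject 207669010 removed (6 observations)

# Exact LD, computed here as the drop in -2LL when the case is deleted
ld1 = neg2ll_full - neg2ll_del1   # 15.8
ld2 = neg2ll_full - neg2ll_del2   # 31.3
p1 = chi2_sf(ld1, df=2)           # well below 0.01
p2 = chi2_sf(ld2, df=6)           # well below 0.01
```

Both p-values fall far below the 0.01 threshold, matching the significance claims in the text.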


Table 6.3  Maximum Likelihood Estimates and the Likelihood Displacement Statistic for Three Linear Mixed Models on ADL Count: Older Americans

Explanatory        Parameter   Standard   Degrees of   t-value   p-value
Variable           Estimate    Error      Freedom

Linear Mixed Model With Full Data (−2 LL = 20,016.3; p < 0.0001)
Intercept            0.673      0.044      1726         15.31    <0.0001
Married              0.029      0.059       300          0.49     0.6223
Time 1               0.257      0.036      4814          7.19    <0.0001
Time 2               0.513      0.052      4814          9.87    <0.0001
Time 3               0.579      0.058      4814          9.94    <0.0001
Time 4               0.768      0.061      4814         12.56    <0.0001
Time 5               0.932      0.060      4814         15.60    <0.0001
Treat × Time 1      −0.159      0.053      4814         −3.01     0.0026
Treat × Time 2      −0.252      0.078      4814         −3.24     0.0012
Treat × Time 3      −0.233      0.090      4814         −2.58     0.0099
Treat × Time 4      −0.060      0.098      4814         −0.61     0.5441
Treat × Time 5      −0.309      0.101      4814         −3.07     0.0021

Linear Mixed Model Deleting Subject 200520020 (−2 LL = 20,000.5; p < 0.0001)
Intercept            0.673      0.044      1725         15.29    <0.0001
Married              0.028      0.059       300          0.48     0.6335
Time 1               0.257      0.036      4809          7.18    <0.0001
Time 2               0.513      0.052      4809          9.87    <0.0001
Time 3               0.580      0.058      4809          9.94    <0.0001
Time 4               0.768      0.061      4809         12.56    <0.0001
Time 5               0.932      0.060      4809         15.60    <0.0001
Treat × Time 1      −0.159      0.053      4809         −3.01     0.0026
Treat × Time 2      −0.250      0.078      4809         −3.21     0.0013
Treat × Time 3      −0.234      0.090      4809         −2.59     0.0096
Treat × Time 4      −0.059      0.098      4809         −0.60     0.5478
Treat × Time 5      −0.302      0.101      4809         −3.00     0.0027
LD statistic         15.8 (df = 2; p < 0.01)

Linear Mixed Model Deleting Subject 207669010 (−2 LL = 19,985.0; p < 0.0001)
Intercept            0.675      0.044      1725         15.35    <0.0001
Married              0.026      0.059       300          0.44     0.6599
Time 1               0.255      0.036      4809          7.11    <0.0001
Time 2               0.513      0.052      4809          9.86    <0.0001
Time 3               0.578      0.058      4809          9.92    <0.0001
Time 4               0.770      0.061      4809         12.58    <0.0001
Time 5               0.923      0.060      4809         15.47    <0.0001
Treat × Time 1      −0.156      0.053      4809         −2.96     0.0031
Treat × Time 2      −0.252      0.078      4809         −3.23     0.0012
Treat × Time 3      −0.231      0.090      4809         −2.56     0.0105
Treat × Time 4      −0.061      0.098      4809         −0.62     0.5324
Treat × Time 5      −0.300      0.100      4809         −2.99     0.0028
LD statistic         31.3 (df = 6; p < 0.01)
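The claim that the fixed-effects estimates barely move after case deletion can be checked numerically from Table 6.3 itself. The sketch below (an illustration, not part of the original analysis) copies the three columns of estimated coefficients from the table and reports the largest absolute shift in any coefficient relative to the full-data model.

```python
# Fixed-effects estimates from Table 6.3, in row order:
# Intercept, Married, Time 1-5, Treat x Time 1-5
full = [0.673, 0.029, 0.257, 0.513, 0.579, 0.768, 0.932,
        -0.159, -0.252, -0.233, -0.060, -0.309]
drop_200520020 = [0.673, 0.028, 0.257, 0.513, 0.580, 0.768, 0.932,
                  -0.159, -0.250, -0.234, -0.059, -0.302]
drop_207669010 = [0.675, 0.026, 0.255, 0.513, 0.578, 0.770, 0.923,
                  -0.156, -0.252, -0.231, -0.061, -0.300]

def max_abs_change(a, b):
    """Largest absolute shift in any coefficient between two fits."""
    return max(abs(x - y) for x, y in zip(a, b))

shift1 = max_abs_change(full, drop_200520020)   # 0.007
shift2 = max_abs_change(full, drop_207669010)   # 0.009
```

No coefficient moves by more than 0.01 in either deletion model, which quantifies the "almost identical" judgment in the text.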


SAS Program Output 6.1  Covariance parameter estimates in three panels: for the full data, after deleting the first case, and after deleting the second case

As shown in SAS Program Output 6.1, the three sets of covariance parameter estimates are very close; deletion of the two statistically influential cases therefore does not affect the quality of the covariance parameter estimates either. Given the remarkable similarity of the estimated regression coefficients and covariance parameters, I do not see any reason that the two statistically influential cases should be eliminated from the linear mixed model, although they make exceptionally strong statistical contributions to the model fit. Indeed, in large samples, the strong relative influence of a few particular cases is easily averaged out by the effects of the other cases in the estimating process. Consequently, deleting an influential observation can hardly make any actual impact on the regression coefficient estimates or the covariance parameter estimates in linear mixed models.

6.4 SUMMARY

In this chapter, a number of statistical methods are described for performing residual and influence diagnostics in linear mixed models. The chapter starts with an introduction to the variety of residual types that are widely used in longitudinal data analysis. The semivariogram approaches are then delineated to help the reader further command techniques for checking the behavior of residuals in linear mixed models. The semivariogram can be applied to check whether serial correlation in the repeated measurements is present given the fixed effects and the specified random effects. Unlike the classical residual diagnostics, however, the semivariogram was originally proposed for the analysis of spatial data, and it therefore sees few applications in the analysis of longitudinal data with equal time intervals. In longitudinal data analysis, plotting various residuals is effective in revealing inconsistencies between the observed data and the model-based predictions (Diggle et al., 2002).

Compared to residual diagnostics, the identification of cases that influence the goodness-of-fit of linear mixed models is more complex. In Section 6.2, I describe a variety of popular approaches in this area. For large samples, the actual impact of a few statistically influential cases is often found to be very limited, even though they contribute significantly to the model fit. Given this finding, the techniques for identifying influential cases are particularly important in the analysis of longitudinal data with a small sample size, where a few influential cases can significantly modify the analytic results of a linear mixed model. I recommend the following steps for identifying influential cases in such studies. First, identify the most influential cases by using the influence diagnostic methods described in this chapter. Next, examine the exact change in the estimated regression coefficients and the covariance parameter estimates after deleting each of the identified cases. Following this strategy, a decision can be made regarding whether those cases should be removed in the application of a linear mixed model.
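The case-deletion logic underlying this recommended strategy can be sketched in a few lines. The snippet below is a deliberately minimal, hypothetical illustration rather than the chapter's SAS procedure: it fits a mean-only normal model by maximum likelihood to toy subject data (all values invented), deletes each subject in turn, and ranks subjects by the resulting drop in −2 log-likelihood, the exact LD criterion used in this chapter.

```python
import math

# Toy data: subject id -> repeated measurements (hypothetical values;
# subject 3 is constructed to be an obvious outlier)
data = {1: [2.0, 2.1], 2: [1.9, 2.2], 3: [8.0, 9.5], 4: [2.0, 1.8]}

def neg2_loglik(values):
    """-2 log-likelihood of a mean-only normal model at its MLE:
    n * (log(2*pi*sigma2_hat) + 1)."""
    n = len(values)
    mu = sum(values) / n
    s2 = sum((v - mu) ** 2 for v in values) / n
    return n * (math.log(2 * math.pi * s2) + 1)

all_values = [v for vals in data.values() for v in vals]
full = neg2_loglik(all_values)

# Step 1: screen cases by the exact likelihood displacement,
# LD_i = (-2LL of full fit) - (-2LL of fit without subject i)
ld = {}
for sid in data:
    reduced = [v for s, vals in data.items() if s != sid for v in vals]
    ld[sid] = full - neg2_loglik(reduced)

# Step 2: rank subjects; the largest LD flags the most influential case,
# whose deletion effects on the parameter estimates are then inspected
most_influential = max(ld, key=ld.get)
```

In this toy example the outlying subject 3 dominates the LD ranking; in a real application, Step 2 would proceed by refitting the mixed model without that subject and comparing the coefficient and covariance parameter estimates, exactly as done for Table 6.3.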
