Imputation of missing data in time series for air pollutants

Atmospheric Environment 102 (2015) 96e104 Contents lists available at ScienceDirect Atmospheric Environment journal homepage: www.elsevier.com/locat...

Download PDF

281KB Sizes 0 Downloads 87 Views

Report

PDF Reader
Full Text

Atmospheric Environment 102 (2015) 96e104

Contents lists available at ScienceDirect

Atmospheric Environment journal homepage: www.elsevier.com/locate/atmosenv

Imputation of missing data in time series for air pollutants W.L. Junger*, 1, A. Ponce de Leon Rio de Janeiro State University, Department of Epidemiology, Brazil

h i g h l i g h t s We propose a method for imputation of missing values in times series. Simulations showed adequate goodness-of-ﬁt. The ﬁndings also suggest good accuracy and precision. We implemented the method as an open source R library.

a r t i c l e i n f o

a b s t r a c t

Article history: Received 10 June 2014 Received in revised form 13 November 2014 Accepted 21 November 2014 Available online 24 November 2014

Missing data are major concerns in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of normal distribution. Different approaches are considered for ﬁltering the temporal component. A simulation study was performed to assess validity and performance of proposed method in comparison with some frequently used methods. Simulations showed that when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Air pollution Data imputation EM algorithm Environmental epidemiology Missing data Particulate matter Time series

1. Introduction Missing data are a major concern in epidemiological studies (Eekhout et al., 2012), especially of the health effects of environmental air pollutants, which are often caused by equipment failure or data corruption. Under the Bayesian framework, missing data are extra parameters to be estimated (Gelman et al., 2014) but it is not as trivial otherwise. The analysis of incomplete data has been studied widely and many methods have been developed (Schafer, 1997; Little and Rubin, 1989; Little, 1992; Dempster et al., 1977; Rubin, 1976; Beale and Little, 1975; Hartley and Hocking, 1971), but it has received little attention in epidemiological contexts

* Corresponding author. E-mail addresses: [email protected] (W.L. Junger), [email protected] (A. Ponce de Leon). 1 This author was sponsored by the Brazilian Research Council e CNPq. http://dx.doi.org/10.1016/j.atmosenv.2014.11.049 1352-2310/© 2014 Elsevier Ltd. All rights reserved.

(Miettinen, 1985; Rothman et al., 2008). However, several studies have addressed the impact of incomplete data analysis on epidemiological measures using stochastic simulations (Gorelick, 2006; Plaia and Bondì, 2006; Junninen et al., 2004; Engels and Diehr, 2003) and new methods have been proposed. Simple methods are likely to yield biased estimates and the most sophisticated depend on strong assumptions about the sources of the missing data, while they also involve complex computations (Little and Rubin, 1989; Schafer, 1997). Rubin (1976) classiﬁed incomplete data according to their generating mechanisms. Data can be missing at random (MAR), missing completely at random (MCAR) or missing not at random (MNAR). The MCAR condition is too restrictive because it assumes that the missing data comprise a random sample taken from the observed values. In epidemiological research, the distributions of missing data are often related to the disease status or the exposure. Thus, the MAR assumption may be more realistic (Greenland and Finkle, 1995).

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

The default method used by most statistical software is complete case analysis, i.e., the exclusion of incomplete observations. Under the MCAR assumption, this may yield unbiased estimates. However, a higher proportion of incomplete observations may result in a loss of precision (Rothman et al., 2008; Greenland and Finkle, 1995). Under MAR, the complete case analysis no longer relies on a random sample of the source population and selection bias is likely to occur (Donders et al., 2006). In time series analysis, this problem can be exacerbated because excluding incomplete observations may corrupt temporal structures such as autocorrelation, trends, and seasonality (Box et al., 1994). The approaches used to estimate parameters in a missing data situation can be classiﬁed into two major groups: likelihood-based and imputation-based (Little and Rubin, 1989). Likelihood-based methods are ﬂexible, they do not require ad hoc methods, and they yield an adequate estimate of the variance, but it may be necessary to solve highly complex likelihood equations (McLachlan and Krishnan, 1997; Little and Rubin, 1989; Hartley and Hocking, 1971). These methods generally use computational routines that are tailored for speciﬁc analyses; thus it is difﬁcult to make them readily available in general purpose statistical software systems. By contrast, imputation-based methods are usually simpler, and they can be implemented in most commercial statistical software systems. Some methods may be computationally intensive, e.g., multiple imputation (Schafer, 1997). Despite the simplicity, imputation-based methods do have some drawbacks. The data analysis takes place post-imputation; thus, the extra variability due to imputation is usually not considered, so the variance of the estimated association is usually underestimated. Another important characteristic of imputation-based methods is that the simplest types often yield biased estimates of the association (Donders et al., 2006). Multiple imputation estimation may consider the extra variability, thereby obtaining more precise conﬁdence intervals (Schafer, 1997) but these methods are not addressed in the current study. The simplest and the often misused method is replacing the missing values with the unconditional mean (UM) of the variable. Under MAR, this yields inconsistent estimates of the variance of the regression coefﬁcients. If the MCAR assumption holds, the variance estimates are consistent but underestimated. Thus, hypothesis testing and the estimates of the conﬁdence intervals will be distorted by both the bias and the overestimated precision (Little and Rubin, 1989; Little, 1992). Imputation using the median (MD) may yield better results for skewed distributions (Miettinen, 1985). Single imputation based on unconditional or conditional means tends to distort the marginal distribution of the data due to the higher concentration of observations around the mean. This may be a major concern if one is interested in the tails of the distribution, e.g., hypothesis testing (Little and Rubin, 1989). The latter method can be improved by using the information from measured covariates of the same study unit to impute the missing value based on the prediction from a linear regression model. The regression coefﬁcients are estimated using the complete case analysis. Under the MCAR assumption, this yields consistent estimates of the association (Little, 1992). Hartley and Hocking (1971) proposed a set of iterative equations for likelihood estimation of the mean vector and the covariance matrix of a multivariate normal distribution with missing data. Later, this method was extended to accommodate any distribution from the exponential family, which is now referred to as the expectationemaximization (EM) algorithm (Dempster et al., 1977). Under the assumptions of a multivariate normal distribution, the

97

EM algorithm is an iterative version of Buck's calculator (McLachlan and Krishnan, 1997; Buck, 1960). In this article, we present an imputation-based method that is suitable for multivariate time series data, using the EM algorithm for estimation of the mean vector and covariance matrix of the normal distribution for the underlying framework. In addition to the correlations among covariates, the algorithm also considers the temporal components of the time series. Different approaches are implemented for ﬁltering the temporal components. Simulations were performed to assess the procedure's validity and comparisons were made with some frequently used methods. Our method has been implemented as a package called mtsdi (multivariate time series data imputation) for the statistical software R (R Core Team, 2013), which is available at R repositories. This article is organized as follows: in Section 2 the statistical concepts of the imputation, ﬁltering models, and simulation details are presented; in Section 3 results from the validity and performance analyses as well as the penalization are showed; the discussion on the ﬁndings is presented in Section 4. 2. Methods 2.1. Imputation procedure Let xt, (t ¼ 1, … ,n), be the t-th realization of the p-variate normal random vector X with m unobserved components. The vector xt can be rearranged such that the m missing elements will be in the ﬁrst positions, i.e., xt ¼ (xt1, … ,xtm, xt(mþ1), … ,xtp)T, which are denoted by xt ¼ (xt1, xt2)T. Furthermore, we consider that the observed period can be spanned over B time windows, with indices b, (b ¼ 1, … ,B), and each time window has a different underlying regime of covariance over time. Thus, the mean vector at time t and the covariance matrix at window b can be partitioned by following the same conﬁguration of xt, i.e.,

m ~t ¼

m ~ t1 m ~ t2

~b ¼ and S

~ b11 S ~ b21 S

~ b12 S ~ b22 : S

Our proposed imputation method is a modiﬁcation of the EM algorithm for estimating the mean vector and the covariance matrix of a multivariate normal distribution with missing data (Dempster et al., 1977). The algorithm comprises the following steps: (i) replace the missing values by estimates; (ii) estimate the parameters m and S; (iii) estimate the level for each of the univariate time series; (iv) re-estimate the missing values using updated estimates of the parameters and the level of the time series. These steps are iterated until some convergence criterion is reached. ~ 0 are the mean vector ~ 0 and S In general, the initial estimates m and the covariance matrix estimated from the observed incomplete data. At the (k þ 1)-th iteration of the E (estimation) step of the EM algorithm, the missing values are imputed with conditional means given the observed values and the previous estimates of the parameters given by

i h ðkÞ ~ ðkÞ ðkÞ ðkÞ ~ ðkÞ S ~ ðkÞ1 xt2 m ~ ðkþ1Þ ¼m ~ t1 þ S ¼ E Xt1 xt2 ; m ~t ; S ~ t2 : x b b12 b22 t1 The contributions to the covariance matrix are given by

g xt1 xTt1

ðkþ1Þ

i h ðkÞ ~ ðkÞ ¼ E Xt1 XTt1 xt2 ; m ~t ; S b ~ ðkÞ S ~ ðkÞ S ~ ðkÞ1 ~ ðkÞ ~ ~T ¼S b11 b12 b22 Sb21 þ x t1 xt1

98

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

regression coefﬁcients of the linear partition of the model and g($) are smooth functions of the covariates, e.g., splines.

and

g xt1 xTt2

ðkþ1Þ

i h ðkÞ ~ ðkÞ ~ t1 x ~ Tt2 : ¼x ¼ E Xt1 XTt2 xt2 ; m ~t ; S b

2.3. Penalizing the loss of information

In the M (maximization) step, the revised maximum likelihood estimates of mb and Sb are computed. By dropping the index, at the P P ~ ¼ nb x g ~ =n and S ~T . ~ m ~ ¼ nb x xT =n m iteration (k þ 1), m b

t¼1 bt

b

b

t¼1 bt bt

b

b b

~ b and it is replaced by the ~ b is only used to compute S The estimate m predicted level afterwards. 2.2. Level estimation The temporal contribution to the level of each time series mt is independently estimated using an ad hoc method. Basically, any time series ﬁlter can be used to estimate mt. In this study, we implemented and evaluated three methods: the autoregressive integrated moving average (ARIMA) models, natural cubic splines and regression models. Let Xt ¼ (Xt1, Xt2, … ,Xtp)T and each component Xtj, with j ¼ 1, … ,p, be normally distributed. Each Xtj can be modelled by an ARIMA (p, d, q) denoted by

Vd xjt ¼ 41 xjt1 þ 42 xjt2 þ / þ 4p xjtp þ ajt q1 ajt1 q2 ajt2 / qq ajtq ; where Vd is the d-th order difference operator, 4 are the autoregressive coefﬁcients and q are the moving average coefﬁcients. The level estimate for the covariate Xj at time t is the one-step ahead prediction given by the ARIMA model, which is denoted by m ~ jt ¼ E½Xjt xjðt1Þ ; xjðt2Þ ; …. The level is estimated using previous information for each time series. Seasonality can easily be incorporated into the modelling process as well as exogenous variables via transfer functions. Identiﬁcation and estimation using ARIMA models were discussed thoroughly in Box et al. (1994). The level of each covariate can also be estimated by a natural cubic spline. Let us consider that mjt can be estimated by a smooth function gj, with j ¼ 1, … ,p. The curve gj is obtained by minimizing the functional

Zb n o K X 2 00 2 g S gj ¼ fXt gðnk Þg þ l dx; k¼1

a

where g00 is the second derivative of g. The knots n1,n2, … ,nK are ordered over the interval [a,b] and l is the smoothness parameter. The solution for this optimization problem is a natural cubic spline. The level of each covariate Xj is given by mjt ¼ g(xjt). In this application, [a,b] is the observed time indices of the series, i.e., [1,n]. The parameter l is expressed in terms of degrees of freedom (d.f.), which is related to the number of knots, and can be either arbitrarily chosen following some heuristics or estimated via crossvalidation (Green and Silverman, 1994). Finally, the level mjt can also be estimated using a full-ﬂedged regression model. The main advantage of this approach is the ability to easily include extra information from exogenous covariates in the imputation model to explain part of the time series variability. Both generalized linear models (McGullagh and Nelder, 1989) and generalized additive models (GAM) (Hastie and Tibishirani, 1990) are implemented as ﬁlters. The latter is quite ﬂexible for modelling non-linear relationships because smooth functions of the covariates can be included in the predictor. The estimates of mjt, with j ¼ 1, … ,p, are given by P P mjt ¼ b0 þ bu Xu þ gv ðXv Þ, where the constants b are the u

v

In order to obtain estimates of variances that consider the extra variability in the ﬁnal analysis because some observations are imputed rather than observed, a penalizing criterion is proposed for the imputed values. These quantities can be used as weights in the regression model or to compute weighted means of the pollutant's concentrations. A naive but reasonable weighting function is given by a quantity that is inversely proportional to the number of imputed values in the observation vector, which is denoted by wt ¼ 1 k(mt/p), where mt is the number of missing values in the pvariate observation vector at time t and k is an arbitrary constant. The constant k controls the penalty attributed to the observation. The value k ¼ 0.5 yields weights ranging from 0.5, if all the data on a given time t were imputed, to 1, if all the covariates were observed. Informal analysis (results not shown) suggests that this value for k is a reasonable compromise between no penalty at all and the case deletion approach. Non-linear weights can be easily obtained by using either an exponential or a power function of the number of missing values. 2.4. Simulation study To evaluate the proposed method and to compare its performance with published methods, a simulation study based on different missing data patterns was conducted. The results obtained from the imputed datasets were compared with those produced by the full dataset analysis. The resulting patterns in the missing data considered combinations of the following: generating mechanism, proportion of missing data and conﬁguration. Missing data levels of 5%, 10%, 20%, 30% and 40% were considered in the simulation study. According to (Rothman et al., 2008), complete case analysis with a small amount of missing data yields good results, thus 5% was used as the best case scenario. By contrast, 40% was used to evaluate the procedure with the most extreme loss of information, which was considered to be the worst case scenario. To emulate the MCAR mechanism, a random sample of observed data from the original dataset was replaced by a missing value during each iteration. In order to emulate MAR, the random sample was drawn according to the condition that the average of the observed values of the other covariates for the same observation exceeded some quantile of the distribution of the covariate that was being set as missing. The MNAR was generated in the same manner as MAR, except the missing value that was set as missing exceeded some quantile of the distribution of the missing values, i.e., the distribution of the missing data was conditional on the values that would be observed. The times series for daily data of air pollutants may have gaps that extend across many days for a given monitoring station or many monitoring sites on the same day. In order to assess the effect of these missing data patterns, three different gap patterns were considered in the present study. The sparse pattern was formed using sparsely scattered non-contiguous missing values, although this pattern is not very common in real applications. This pattern comprised gaps with length 1 scattered throughout the entire dataset, i.e., there were no contiguous in-row or in-column missing values. The in-row pattern was formed by three, ﬁve or seven contiguous missing values in the same row (monitoring stations), and the in-column pattern was formed by three, ﬁve or seven contiguous missing values throughout the same column (time series). The choice of gap lengths represents a gradient for an extreme situation of contiguousness among the missing values, where seven

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

represents a more severe missing data situation. To generate these patterns, only the ﬁrst element of the gap was drawn randomly. The dispersed pattern was generated without any restriction in terms of length and location, thereby resulting in combinations of the aforementioned patterns. The latter is probably the most realistic missing data pattern in real-life applications. The standard methodology for estimating the effects of air pollutants on health using the time series design is the semiparametric Poisson regression (Schwartz et al., 1996), where natural cubic splines are used for modelling non-linear associations (Dominici et al., 2002). After controlling for potential confounders, the exponential of the exposure coefﬁcient in the regression model can be interpreted as the relative risk for one unit variation in the pollutant concentration (Rothman et al., 2008). The same statistical model is used to estimate the association between particulate matter (PM10) and the daily counts of hospital admissions for children aged under ﬁve years due to respiratory diseases. During each iteration, the estimate of the association and its standard error under different missing data scenarios were compared with those estimated from the full dataset analysis. Each simulation scenario was replicated 100 times and imputation was applied during each replicate using different univariate and multivariate methods. The univariate methods were the unconditional mean (UM), the median (MD) and the nearest neighbour (NN). The latter was implemented as either the last value carried forward or the next value carried backward, depending on the length of the gap and the index of the observation being imputed. The multivariate methods were the conditional mean (CM), based on a linear regression of the complete cases, and the usual EM algorithm for normally distributed data (EM). The proposed extended EM algorithm used univariate time ﬁltering based on cubic splines, ARIMA models and generalized additive models (GAM) with exogenous variables, with and without moving variance (MV). The temperature and relative humidity are used in the association model; thus, they were excluded from the imputation model to avoid over-ﬁtting. At the end of each replicate, the Monte Carlo means of the estimates and standard errors were compared with those obtained from the full data analysis. In summary, the simulation comprised 100 replications of 720 combinations of the following parameters: 3 generating mechanisms (MCAR, MAR and MNAR); 4 pattern conﬁgurations (sparse, in-row, in-column and dispersed); 5 different proportions of missing values (5%, 10%, 20%, 30% and 40%); and 12 procedures for dealing with missing data in the analysis (CC, UM, MD, NN, CM, EM, EMeSpline, EMeARIMA, EMeGAM, EMeMVeSpline, EMeMVeARIMA and EMeMVeGAM).

2.5. Performance analysis In addition to the validity analysis using several replicates, the methods were also evaluated in terms of their performance with a single imputation pattern that we selected by chance for the dispersed conﬁguration. This conﬁguration was selected because of its representativeness for the missing data problem in environmental epidemiology. The imputed values were evaluated in terms of their accuracy, agreement and precision. The root mean square qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ Pm 1 ~ 2 error, RMSE ¼ m i¼1 ðxi xi Þ , was used to estimate the mean error for each imputation procedure. A more sensitive measure of the imputation error is the mean absolute error given by Pm 1 ~ MAE ¼ m i¼1 jxi xi j, which is affected less by larger differences between the observed and imputed values. The bias was computed as the mean difference between the observed and the

imputed values, which we denoted as BIAS ¼

99

Pm 1 m

i¼1 ðxi

~ xi Þ.

The precision of each method was assessed by comparing the sample variance of the imputed values with the variance of the original data. The proportional variance was computed as PV ¼ varð~ xÞ=varðxÞ. pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ Pearson's correlation coefﬁcient, r ¼ covðx; ~ xÞ= varðxÞvarð~ xÞ, is the most common indicator of agreement when evaluating imputation procedures. However, the values of r might not be related to the magnitude of the difference between the observed and the imputed values. To avoid this issue, Willmott (1982) proposed an alternative index of agreement given by P Pm 2 ~ 2 ~ d2 ¼ 1 m i¼1 ðxi xi Þ = i¼1 ðjxi xj þ jxi xjÞ , where d2 ranges between 0 and 1, which indicate the lack of agreement and perfect agreement, respectively. In these equations, m denotes the number xi are the imputed of missing values, xi are the observed values, ~ values, with i ¼ 1, … ,m, and x denotes the mean of the imputed values. 2.6. Data source Evaluating imputation methods is a challenging task because the use of imputation implies that a complete dataset is not available. Furthermore, simulating a dataset with both temporal and spatial dependencies is not trivial, and even the best model may not fully capture the dynamics of the underlying stochastic process. As the full dataset, it was used a chunk that comprised 366 days of the daily concentrations of particulate matter with an aerodynamic volume up to 10 mg/m3 (PM10), which were measured at ten ~o Paulo, Brazil. Data access was monitoring stations in the city of Sa granted by the S~ ao Paulo State Environmental Agency (CETESB). The stations were labelled from S1 to S10. 2.7. Computational aspects Various computational routines were tailored speciﬁcally to generate the patterns of missing values, for univariate and CM imputation, and performance assessment. All of the programming was conducted in R (R Core Team, 2013). Our proposed imputation method was implemented for R as the package mtsdi, which is available for installation from CRAN at www.r-project.org. 3. Results 3.1. Data description Each imputation method and conﬁguration was compared with a reference value, which was estimated using the full dataset analysis. Linear terms were included to control for weekdays and calendar effects, a natural cubic spline of the time index with 6 degrees of freedom (d.f.) to account for trend and seasonality, a spline of 2-day lagged mean temperature with 4 d.f., and a spline of relative humidity with 4 d.f. to control for meteorological effects. The effect of exposure was estimated by a linear term with a coefﬁcient of 0.004321 and standard error of 0.001063. In terms of the percentage relative risk (%RR), the increment of the risk for an increase of 1 mg/m3 of PM10 was 0.433% with a 95% conﬁdence interval of (0.224, 0.643). Table 1 shows descriptive statistics (in mg/m3) for the ten ~o Paulo during monitoring stations for PM10 concentrations in Sa 2004, which we used in this study. The dataset comprised 3660 observed values. The Pearson's correlation coefﬁcients for the ten monitoring stations over the whole year are shown in Table 2. Only seven pairs of stations had correlation coefﬁcients less than 0.7 and

100

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

Table 1 ~o Paulo during 2004. Descriptive statistics for ten PM10 monitoring stations in Sa Station

Mean

Std. dev.

Min.

1st Quartile

Median

3rd Quartile

Max.

S1 S2 S3 S4 S5 S6 S7 S8 S9 S10

37.2 44.1 36.4 38.9 45.0 45.5 45.3 51.6 49.7 38.0

16.9 25.1 16.0 18.5 24.3 20.0 23.6 26.5 25.7 18.5

9.7 10.6 8.7 10.7 10.7 6.9 8.7 7.3 9.9 6.7

23.8 26.7 24.5 25.0 27.2 30.9 29.0 31.4 30.3 25.6

34.5 38.0 33.4 34.1 38.4 43.1 41.0 46.3 45.8 34.0

46.3 55.5 44.8 49.6 57.3 55.5 55.8 68.8 62.9 46.9

91.5 160.6 98.0 112.5 149.6 132.4 173.0 155.9 164.3 129.2

only one less than 0.6. This pattern suggests the better performance of multivariate imputation procedures. However, the correlation coefﬁcients based on the quarters of the year indicated some variability throughout the year. This justiﬁed using an imputation procedure that accounts for multiple covariance regimes. The air pollutant PM10 is a complex mixture, which comprises a variety of particles with different chemical compositions (Fuller et al., 2002); thus, the prediction of concentrations is a daunting task. Table 2 shows that the concentrations measured at ten sites ~o Paulo during 2004 were fairly well correlated across the city of Sa throughout most of the year. 3.2. Validity analysis 3.2.1. Sparse pattern A general result from the simulation study was that a higher proportion of missing data resulted in greater underestimation of the association. The analysis using only the complete case (CC) yielded accurate estimates of the association when 5% the data was missing. Under MAR with 10% missing data, the mean of the estimates was 0.432, a difference down to the third decimal digit. Imputation with the UM or MD tended to overestimate the association, even with as few as 5% missing data. The mean of the coefﬁcients increased with the proportion of missing data, where scenarios with more than 20% missing data yielded estimated associations of around 0.500 using any method. The imputations obtained using these procedures resulted in estimates with the highest variance. The NN method also yielded good estimates with differences down to the second decimal digit. Imputation based on the CM underestimated the effect when the proportions of missing data were higher, where the mean estimate was 0.414 with 30% missing data. However, the differences were less than 0.005 under MAR. In general, the CM yielded mean estimates with differences down to the second decimal digit and lower dispersion. The regular EM algorithm and our methods proposed yielded estimates that were very close to the reference relative risk under

Table 2 Pearson's correlation coefﬁcients for ten monitoring stations in S~ ao Paulo during 2004.

S1 S2 S3 S4 S5 S6 S7 S8 S9 S10

S1

S2

S3

S4

S5

S6

S7

S8

S9

S10

1 0.81 0.93 0.81 0.80 0.73 0.81 0.80 0.74 0.78

0.81 1 0.89 0.87 0.70 0.82 0.81 0.85 0.71 0.82

0.93 0.89 1 0.84 0.77 0.79 0.83 0.83 0.74 0.83

0.81 0.87 0.84 1 0.63 0.72 0.72 0.82 0.68 0.79

0.80 0.70 0.77 0.63 1 0.72 0.80 0.75 0.65 0.68

0.73 0.82 0.79 0.72 0.72 1 0.73 0.79 0.72 0.76

0.81 0.81 0.83 0.72 0.80 0.73 1 0.75 0.58 0.82

0.80 0.85 0.83 0.82 0.75 0.79 0.75 1 0.60 0.75

0.74 0.71 0.74 0.68 0.65 0.72 0.58 0.60 1 0.58

0.78 0.82 0.83 0.79 0.68 0.76 0.82 0.75 0.58 1

all combinations of methods and proportions of missing values. The differences between the estimated and reference risk were less than 0.01. These differences were less than 0.005 under MAR. Even under MNAR, the estimates were accurate and precise, where the highest standard deviation did not exceed 0.02. However, a lower precision was observed under MCAR. Procedures based on the use of splines for the level estimation of time series tended to be more precise because the prediction stabilizes due to smoothing. By contrast, procedures based on ARIMA models tended to be less precise due to the higher prediction variability. It was not possible to generate 40% missing data using this pattern. 3.2.2. In-row pattern The in-row patterns of missing values produced similar behaviour, but with different magnitudes. With 5% missing data, all of the imputation methods worked well, except for the UM and MD, which overestimated the association, and this tended to be higher as the proportion of missing values increased. With 40% missing values, the estimated association was over 0.5%. The NN and CC also produced the same behaviour. The CM was based on the use of the complete data for prediction, thus CC and CM yielded the same results under this conﬁguration. Multivariate procedures yielded the best results. For the worst case scenario, i.e., MNAR and 40% missing data, the regular EM yielded the highest difference of 0.008. However, the bias tended to increase according to the amount of missing values. The method based on splines with multiple covariance regimes yielded estimates that were closest to the reference value, where the largest absolute difference was 0.004, except for the conﬁguration with 40% missing data under MNAR. The in-row conﬁguration with gaps of three and ﬁve observations using 40% missing values under MCAR with a temporal component and multiple covariance regimes exhibited convergence problems using the EM, particularly when GAM or ARIMA were used. These models yielded predictions with high variability. The estimates were considerably lower than the reference values and they exhibited high dispersion. This problem appeared to be related to the sample dataset, and it did not occur when a different seed of the random number generator was set. Under MAR or MNAR, our proposed methods obtained good accuracy and precision, with absolute differences as low as 0.01, and there was a slight tendency to underestimate the association as the amount of missing data increased. However, the method based on ARIMA tended to overestimate the association with a higher dispersion. The methods based on splines with different covariance regimes produced the best results with these conﬁgurations. Even under MNAR, the differences between the estimates and the reference values did not exceed 0.005. The simulation of gaps with seven missing observations in-row, which represented a loss of 70% of the daily data, produced varying results among the multivariate procedures. The CM underestimated the association even for 5% missing data and worsened as this proportion increased. The EM-based methods yielded fairly accurate estimates, but those based on ARIMA tended to degenerate, both in terms of accuracy and precision, as the proportion of missing data increased. The univariate methods (UM, MD and NN) along with CC and CM tended to underestimate the association with all amounts of missing data, although the bias was negligible with 5% missing data. The magnitude of the estimates decreased considerably as the proportion of missing data increased. The procedures that obtained higher differences were UM, MD and NN. The standard EM estimates were quite accurate for practical applications. The largest difference was 0.005 with 40% missing values under MNAR, but there was a slight tendency for the difference to increase with the amount of missing data. The

CC: complete case analysis; UM: unconditional mean; MD: median; NN: nearest neighbour; CM: condition mean (linear regression); EM: regular ExpectationeMaximization algorithm; EMeSpline: EM algorithm ﬁltered by splines; EMeARIMA: EM algorithm ﬁltered by ARIMA models; EMeGAM: EM algorithm ﬁltered by generalised additive models; EMeMV: ExpectationeMaximization algorithm with moving variance. MCAR: missing completely at random; MAR: missing at random; MNAR: missing not at random.

(0.005) (0.008) (0.011) (0.014) (0.020) 0.433 0.431 0.432 0.431 0.430 0.433 0.431 0.431 0.430 0.429 (0.007) (0.011) (0.015) (0.019) (0.024) 0.431 0.428 0.423 0.418 0.408

0.445 0.459 0.487 0.517 0.539

(0.012) (0.020) (0.030) (0.041) (0.049)

0.445 0.459 0.487 0.517 0.538

(0.012) (0.020) (0.031) (0.042) (0.049)

0.436 0.439 0.446 0.453 0.457

(0.008) (0.012) (0.021) (0.023) (0.025)

0.433 0.430 0.425 0.419 0.408

(0.007) (0.010) (0.015) (0.019) (0.024)

0.433 0.432 0.433 0.434 0.433

(0.005) (0.008) (0.012) (0.016) (0.022)

(0.005) (0.008) (0.011) (0.014) (0.018)

0.433 0.433 0.432 0.435 0.430

(0.007) (0.011) (0.017) (0.024) (0.029)

0.432 0.431 0.430 0.428 0.427

(0.005) (0.008) (0.011) (0.013) (0.019)

0.433 0.432 0.433 0.433 0.431

(0.005) (0.008) (0.011) (0.014) (0.018)

0.432 0.432 0.431 0.433 0.425

(0.008) (0.013) (0.019) (0.026) (0.033)

(0.004) (0.007) (0.010) (0.012) (0.018) 0.432 0.432 0.433 0.434 0.433 0.433 0.433 0.434 0.434 0.435 (0.006) (0.010) (0.015) (0.018) (0.022) 0.432 0.431 0.429 0.422 0.419

0.445 0.463 0.490 0.518 0.551

(0.012) (0.017) (0.028) (0.040) (0.047)

0.445 0.463 0.490 0.519 0.552

(0.012) (0.017) (0.028) (0.041) (0.048)

0.436 0.439 0.441 0.447 0.454

(0.008) (0.010) (0.017) (0.021) (0.025)

0.433 0.432 0.430 0.422 0.419

(0.005) (0.009) (0.014) (0.018) (0.022)

0.433 0.435 0.436 0.438 0.443

(0.004) (0.007) (0.009) (0.013) (0.015)

(0.004) (0.006) (0.009) (0.012) (0.015)

0.433 0.435 0.438 0.437 0.439

(0.005) (0.009) (0.013) (0.020) (0.024)

0.433 0.433 0.433 0.433 0.433

(0.004) (0.007) (0.009) (0.011) (0.017)

0.432 0.432 0.433 0.434 0.435

(0.004) (0.007) (0.010) (0.013) (0.016)

0.432 0.435 0.437 0.436 0.437

(0.006) (0.012) (0.016) (0.022) (0.028)

(0.006) (0.009) (0.013) (0.019) (0.023) 0.433 0.431 0.431 0.431 0.433 (0.009) (0.015) (0.017) (0.026) (0.032) 0.432 0.431 0.429 0.428 0.422 (0.005) (0.009) (0.013) (0.020) (0.026) 0.433 0.431 0.431 0.431 0.432 (0.006) (0.009) (0.014) (0.018) (0.021) 0.433 0.431 0.428 0.429 0.430 (0.007) (0.014) (0.016) (0.025) (0.030) 0.433 0.431 0.430 0.431 0.426 (0.005) (0.009) (0.013) (0.018) (0.022) 0.433 0.431 0.429 0.431 0.431 (0.006) (0.010) (0.014) (0.019) (0.022) (0.008) (0.014) (0.017) (0.023) (0.030)

MCAR 5% 10% 20% 30% 40% MAR 5% 10% 20% 30% 40% MNAR 5% 10% 20% 30% 40%

0.431 0.428 0.423 0.415 0.406

0.448 0.464 0.502 0.547 0.575

(0.015) (0.023) (0.037) (0.053) (0.073)

0.448 0.464 0.502 0.547 0.575

(0.015) (0.023) (0.037) (0.052) (0.074)

0.436 0.439 0.441 0.447 0.452

(0.012) (0.015) (0.024) (0.032) (0.037)

0.433 0.428 0.424 0.416 0.406

(0.007) (0.013) (0.017) (0.023) (0.030)

0.434 0.431 0.431 0.433 0.434

EMeMV ARIMA EMeMV spline EM GAM EM ARIMA EM spline EM CM NN MD UM CC

Table 3 Summary of the results of the simulations using the dispersed pattern. The Monte Carlo mean percentage relative risk (%RR) and standard deviation of the estimates (shown in parentheses).

EMeMV GAM

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

101

imputation with the temporal component based on splines again yielded better estimates, with differences less than 0.004 and lower dispersion. The methods based on GAM were similar, but those based on ARIMA overestimated the association and they exhibited higher dispersions. 3.2.3. In-column pattern This pattern is common in air quality monitoring data due to malfunctions that affect equipment, telemetry, or data storage, which can take days to repair. With 5% missing data, all the methods produced differences of þ/ 0.001, except for imputation using the UM and MD, which overestimated the association, and CC with a gap of seven days, that obtained a difference of 0.003. For all the in-column conﬁgurations, the CC and CM estimates increased with the amount of missing data. However, the estimates were less than 0.4 under MCAR. The univariate methods tended to produce higher estimates as the proportion of missing values increased. Under MCAR with 40% missing data and gaps of three days, the estimates obtained using UM and MD were as high as 0.55. The regular EM algorithm had a slight tendency to underestimate the association as the proportion of missing data increased under MCAR. However, the regular EM and the EM with a temporal component ﬁltered by splines or GAM produced differences less than 0.01, although the spline-based estimates exhibited lower dispersion. The method based on ARIMA exhibited higher dispersion, and it tended to underestimate the association as the proportion of missing data increased. Under MAR and MNAR, the EMbased methods did not produce differences greater than 0.006 for up to 40% missing data and gaps of seven days. The estimates were most precise for all of the methods that employed temporal component prediction, except when the ARIMA ﬁlter was used. 3.2.4. Dispersed pattern This pattern is probably more realistic for real applications because it combines the different patterns, i.e., sparse, in-row and in-column, with gaps of different lengths. These conﬁgurations were generated by placing no restrictions on the positions or the lengths of the gaps, which yielded very complex patterns of missing data. For this reason, a summarising table is presented only for this pattern; however, tables for the other aforementioned patterns may be obtained upon request. Table 3 presents a summary of the results of the simulations using the dispersed pattern. With 5% missing values, all of the methods except UM and MD yielded estimates with differences of less than 0.002. Under MCAR, CC and CM tended to underestimate the association as the amount of missing data increased. By contrast, the univariate methods overestimated the association considerably. Imputation by UM and MD yielded more severe differences, where the estimates were as high as 0.575 with 40% missing data. Multivariate methods with temporal adjustment had higher accuracy, except the ARIMA-based methods, which tended to underestimate the association as the amount of missing data increased and they also exhibited higher dispersion. Under MAR, CC and CM tended to underestimate the association as the proportion of missing values increased. Univariate methods yielded larger estimates, which also increased with the amount of missing values. The regular EM behaved in a similar manner. The methods with temporal adjustment had higher accuracy and precision, where the differences were smaller than 0.006 with 40% missing data using ARIMA. Imputation of the missing data under MNAR also produced good results. However, CC and CM tended to underestimate the associations, whereas the univariate methods tended to overestimate them. The regular EM algorithm obtained a difference of 0.001. The

102

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

spline-based procedures had higher accuracy and precision. The methods based on GAM and ARIMA with multiple covariance regimes yielded differences of less than 0.003, but the latter exhibited higher dispersion. 3.3. Performance analysis Table 4 shows the performance analysis for the most extreme missing data situations considered in this study, i.e., 5% and 40% under MCAR, MAR and MNAR. In general, the performance indicators decreased as the amount of missing data increased. The performance was not assessed for the CC analysis. Under MCAR, the univariate methods (UM, MD and NN) obtained high root mean squared error (RMSE) values and mean absolute error (MAE) values. Imputation by UM and MD implies the replacement of a missing value by a constant; therefore, these methods exhibited high under-dispersion compared with the original values. This was supported by the low values of the proportional variance (PV). By contrast, using the NN tended to induce over-dispersion, particularly if the generating mechanism was nonignorable. The univariate methods also yielded the lowest correlation coefﬁcient (r) and agreement index (d2). These methods also yielded poor performance under MAR. The regular EM and the CM obtained good performance with the lowest RMSE and MAE values,

although they were slightly over-dispersed. The correlation coefﬁcients were above 0.85 and the agreement index exceeded 0.92. The imputation methods with temporal component adjustment had the lowest RMSE, MAE and BIAS values, and the highest values of r and d2, except those based on GAM and ARIMA with multiple covariance regimes, which also exhibited severe over-dispersion. As expected, the performance of the multivariate methods improved considerably when the missing data were MAR. There was a fair balance between the variance of the imputed values and the variance of the original values, including those methods based on ﬁlters with higher variability in their predictions, i.e., GAM and ARIMA. Even under MNAR, the proposed methods delivered good performance, particularly when compared with the univariate methods. 3.4. Penalization Table 5 shows the estimated coefﬁcients for the association regression model and their standard errors under MAR, with and without considering the penalty factor for lost information. Only the methods with temporal component adjustment were considered in this analysis. The extreme situations of 5% and 40% were evaluated. A naive penalty function was adopted: wt ¼ 1 0:5ðmt =10Þ. Thus, the observation had a weight of 1 if no

Table 4 Performance indicators for a typical missing data pattern, 5% and 40% under MCAR, MAR and MNAR. Root mean square error (RMSE), mean absolute error (MAE), bias (BIAS), proportional variance (PV), Pearson's correlation coefﬁcient (r), and Willmott's index of agreement (d2).

MCAR 5% RMSE MAE BIAS PV r d2 40% RMSE MAE BIAS PV r d2 MAR 5% RMSE MAE BIAS PV r d2 40% RMSE MAE BIAS PV r d2 MNAR 5% RMSE MAE BIAS PV r d2 40% RMSE MAE BIAS PV r d2

UM

MD

NN

CM

EM

EM spline

EM ARIMA

EM GAM

EMeMV spline

EMeMV ARIMA

EMeMV GAM

0.458 0.365 0.057 0.053 0.248 0.346 0.475 0.382 0.016 0.045 0.216 0.293

0.462 0.367 0.070 0.066 0.232 0.359 0.476 0.382 0.030 0.055 0.211 0.313

0.421 0.309 0.067 1.041 0.617 0.784 0.424 0.322 0.014 1.100 0.638 0.795

0.181 0.088 0.005 0.907 0.870 0.928 0.012 0.001 0.000 0.773 0.953 0.969

0.239 0.169 0.005 0.732 0.861 0.921 0.252 0.193 0.006 0.756 0.856 0.919

0.230 0.161 0.010 0.818 0.872 0.930 0.232 0.175 0.000 0.834 0.880 0.935

0.219 0.167 0.017 0.839 0.886 0.938 0.317 0.219 0.010 1.166 0.807 0.895

0.219 0.167 0.017 0.839 0.886 0.938 0.317 0.219 0.010 1.166 0.807 0.895

0.220 0.156 0.016 0.851 0.885 0.938 0.246 0.179 0.008 0.887 0.866 0.929

0.237 0.172 0.014 0.906 0.868 0.930 0.663 0.309 0.035 2.646 0.551 0.674

0.237 0.172 0.014 0.906 0.868 0.930 0.663 0.309 0.035 2.646 0.551 0.674

0.421 0.346 0.286 0.094 0.499 0.526 0.517 0.422 0.383 0.077 0.416 0.500

0.416 0.342 0.281 0.114 0.504 0.534 0.521 0.425 0.390 0.095 0.419 0.503

0.344 0.256 0.059 1.633 0.660 0.790 0.403 0.304 0.136 1.658 0.640 0.767

0.124 0.076 0.008 0.783 0.885 0.937 0.039 0.007 0.002 0.770 0.889 0.971

0.150 0.117 0.001 0.804 0.903 0.946 0.206 0.156 0.001 0.694 0.839 0.906

0.138 0.108 0.010 0.961 0.920 0.959 0.186 0.140 0.002 0.827 0.871 0.930

0.155 0.119 0.006 0.988 0.900 0.948 0.191 0.143 0.019 0.876 0.866 0.928

0.155 0.119 0.006 0.988 0.900 0.948 0.191 0.143 0.019 0.876 0.866 0.928

0.133 0.103 0.011 0.910 0.925 0.960 0.181 0.136 0.004 0.852 0.879 0.935

0.159 0.117 0.005 0.979 0.895 0.945 0.197 0.150 0.019 0.927 0.860 0.925

0.159 0.117 0.005 0.979 0.895 0.945 0.197 0.150 0.019 0.927 0.860 0.925

0.424 0.352 0.352 0.158 0.511 0.514 0.539 0.464 0.464 0.121 0.384 0.461

0.420 0.346 0.346 0.191 0.490 0.518 0.554 0.480 0.480 0.144 0.364 0.456

0.368 0.278 0.120 2.506 0.593 0.705 0.415 0.316 0.165 2.304 0.545 0.674

0.193 0.099 0.066 1.283 0.713 0.808 0.133 0.038 0.034 1.099 0.572 0.765

0.213 0.139 0.092 1.283 0.787 0.863 0.255 0.189 0.116 1.267 0.747 0.828

0.194 0.131 0.063 1.388 0.821 0.890 0.226 0.168 0.091 1.324 0.797 0.868

0.189 0.139 0.046 1.264 0.806 0.888 0.223 0.167 0.083 1.191 0.780 0.862

0.189 0.139 0.046 1.264 0.806 0.888 0.223 0.167 0.083 1.191 0.780 0.862

0.182 0.124 0.048 1.222 0.817 0.895 0.214 0.159 0.076 1.269 0.805 0.879

0.200 0.146 0.042 1.278 0.779 0.874 0.266 0.185 0.066 1.540 0.718 0.824

0.200 0.146 0.042 1.278 0.779 0.874 0.266 0.185 0.066 1.540 0.718 0.824

CC: complete case analysis; UM: unconditional mean; MD: median; NN: nearest neighbour; CM: condition mean (linear regression); EM: regular ExpectationeMaximization algorithm; EMeSpline: EM algorithm ﬁltered by splines; EMeARIMA: EM algorithm ﬁltered by ARIMA models; EMeGAM: EM algorithm ﬁltered by generalised additive models; EMeMV: ExpectationeMaximization algorithm with moving variance. MCAR: missing completely at random; MAR: missing at random; MNAR: missing not at random.

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

103

Table 5 Estimated percentage relative risk (%RR) and standard error (s.e.) with and without penalization for different amounts of missing data (%NA). % NA

Penalty

Estimate

EM spline

EM ARIMA

EM GAM

EMeMV spline

EMeMV ARIMA

EMeMV GAM

5%

No

%RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR) %RR s.e.(%RR)

0.4326 0.1066 0.4310 0.1071 0.4333 0.1066 0.4352 0.1081 0.4343 0.1062 0.4349 0.1093 0.4372 0.1069 0.4362 0.1118 0.4446 0.1079 0.4476 0.1142

0.4257 0.1065 0.4245 0.1070 0.4211 0.1068 0.4238 0.1083 0.4146 0.1062 0.4170 0.1094 0.4112 0.1060 0.4133 0.1112 0.4058 0.1054 0.4135 0.1123

0.4339 0.1066 0.4323 0.1071 0.4330 0.1065 0.4349 0.1081 0.4316 0.1060 0.4324 0.1091 0.4403 0.1068 0.4385 0.1116 0.4483 0.1078 0.4504 0.1141

0.4305 0.1066 0.4290 0.1071 0.4316 0.1068 0.4337 0.1083 0.4294 0.1062 0.4305 0.1093 0.4345 0.1067 0.4335 0.1117 0.4435 0.1078 0.4465 0.1142

0.4270 0.1063 0.4256 0.1068 0.4247 0.1075 0.4273 0.1091 0.4123 0.1064 0.4154 0.1096 0.4105 0.1070 0.4127 0.1122 0.4014 0.1035 0.4099 0.1108

0.4324 0.1066 0.4308 0.1071 0.4314 0.1066 0.4334 0.1082 0.4286 0.1058 0.4296 0.1090 0.4408 0.1069 0.4385 0.1117 0.4483 0.1079 0.4503 0.1141

Yes 10%

No Yes

20%

No Yes

30%

No Yes

40%

No Yes

EMeSpline: EM algorithm ﬁltered by splines; EMeARIMA: EM algorithm ﬁltered by ARIMA models; EMeGAM: EM algorithm ﬁltered by generalised additive models; EMeMV: ExpectationeMaximization algorithm with moving variance.

data were missing for any of the 10 monitoring stations on any given day, whereas the weight was 0.5 if all the observations were missing on any given day because only the temporal component ﬁlter predicted these values. The values from Table 5 may be compared with the estimates of the percentage relative risk (%RR) obtained from the full data analysis, i.e., 0.4330 with a standard error of 0.1068. The standard errors for (%RR) were using the delta rule. The missing data pattern used was the same as that employed in the performance analysis. When the amount of missing data was as low as 5%, imputation yielded slightly underestimates of the standard error. However, the standard error was consistently higher when the penalty factor was included. This behaviour was observed with all the proportions of missing data and all imputation methods. With the highest proportions of missing data, i.e., 30% and 40%, the standard error was much more inﬂated. Similar results were observed under MCAR. However, the standard error was slightly underestimated under MNAR using an ARIMA ﬁlter, with 5% missing data. In most situations, despite its naivety, the penalty criterion yielded standard error estimates that were equal to or greater than that estimated from the full data analysis. Therefore, the penalty decreased the probability of rejecting the null hypothesis when it was true, i.e., type I error. 4. Discussion The completeness of environmental datasets is a major issue that limits the development of epidemiological studies of the effects of air pollution on health. The missing data problem can even occur when pollutant concentrations are measured by large and well-maintained monitoring networks. During the last two decades, time series studies have been developed around the world to quantify the short-term effects of air pollution on health. This measure of population exposure to air contaminants is usually deﬁned as the average of the concentrations measured by various monitoring stations distributed throughout a region on a given day. The results of this simulation study demonstrate that even a small amount of missing data can yield biased estimates of effects and overestimate the precision. Our simulations showed that when the amount of missing data was as low as 5%, the complete case analysis yielded satisfactory

results, regardless of the method used to generate the data. In this case, the statistical efﬁciency was not compromised. However, even with this small amount of missing data, imputation using the UM and the MD should be avoided. The validity of the estimates degenerated when the proportions of missing values exceeded 10%. Thus, multivariate imputation methods are recommended when the amount of missing data is higher, or to restore the data distribution, thereby lessening the impact of lost information on the estimated precision. Our proposed imputation methods exhibited good accuracy and precision in different settings in terms of patterns and gap lengths, both within or among observations. Sequences of missing data that represent gaps of many days are frequent in air quality monitoring. Even with gaps as long as seven days, the estimates of the association after imputation were very similar to the reference values using our methods. Imputation based on the NN yielded accurate estimates with a small amount of missing data. The use of the CM also resulted in high accuracy with small amounts of missing data for many of the conﬁgurations that were analysed, particularly under MAR. However, the prediction model depends on the complete data; thus only a few observations may be available to estimate the regression coefﬁcients if large quantities of data are missing. Therefore, the estimated association may decrease rapidly as the amount of missing data increases. Imputation using the regular EM algorithm with normal multivariate data yielded accurate estimates with all the conﬁgurations analysed. The particulate matter concentrations tended to be highly correlated in the city of S~ ao Paulo, which contributed for the good results obtained with the multivariate methods. The major contribution of the temporal component adjustment to this analysis was an improvement in the efﬁciency by obtaining lower dispersion. The imputation method based on the use of ARIMA for level prediction yielded the most variable association estimates. By contrast, the method based on GAM had the lowest variability, but this method relies on exogenous variables, which are often not available if data are missing, e.g., meteorological data. Moreover, the covariates that need to be included in the association model should not be used in the imputation model to avoid over-ﬁtting. Spline-based methods had the highest accuracy and precision due to the low variability of their predictions.

104

W.L. Junger, A. Ponce de Leon / Atmospheric Environment 102 (2015) 96e104

In general, imputation based on multiple covariance regimes produced the best results. However, using quarter-year windows when only one year of daily data was available produced poorly estimated covariance matrices if more severe missing value patterns were present. This phenomenon caused some convergence issues, mainly for the methods based on ARIMA. The performance analysis highlighted the low quality of the univariate imputation methods, which exhibited low accuracy, under-dispersion and low correlation compared with the original data. These methods performed poorly even with 5% missing data. By contrast, multivariate methods exhibited better performance with larger amounts of missing data and larger gaps. The overestimated precision can be attenuated by applying a penalty factor for lost information. Thus, despite the naivety of the penalty approach employed, our proposed linear penalty function slightly improved the variance of the estimates, thereby accommodating some degree of uncertainty and yielding wider conﬁdence intervals, which were similar to those obtained from the full data analysis. Even under MNAR, most of the imputations obtained using our proposed methods were valid results. However, the analyst must remember that imputed data are at best only good estimates of the values that would be observed if all had progressed smoothly during the measurement phase of the study. Thus, the imputation of complex patterns and large amounts of missing values must be given special attention. Our proposed method may be applied to any missing data problem that can be formulated as a multivariate normal time series, e.g., by applying some transformation. In many situations, the logarithm transformation will sufﬁce; see (Box and Cox, 1964) for further details. Our proposed EM-based methods have been implemented as a package called mtsdi for the statistical software R, which is available at www.r-project.org. Acknowledgements ~o Paulo State Environmental Agency We are grateful to the Sa (CETESB) for granting the data sets of air pollutants. We also appreciate the reviewers' comments, which vastly improved this manuscript. References Beale, E., Little, R., 1975. Missing values in multivariate analysis. J. R. Stat. Soc. Ser. B Methodol. 37 (1), 129e145. Box, G., Cox, D., 1964. An analysis of transformations. J. R. Stat. Soc. Ser. B Methodol. 26 (2), 211e252.

Box, G., Jenkins, G., Reinsel, G., 1994. Time Series Analysis. Forecasting and Control, third ed. Prentice Hall. Buck, S., 1960. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. J. R. Stat. Soc. Ser. B Methodol. 22 (2), 302e306. Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood estimation from incomplete data via the em algorithm. J. R. Stat. Soc. 39 (1), 1e38. Dominici, F., McDermott, A., Zeger, S.L., Samet, J.M., Aug 2002. On the use of generalized additive models in time-series studies of air pollution and health. Am. J. Epidemiol. 156 (3), 193e203. Donders, A.R.T., van der Heijden, G.J.M.G., Stijnen, T., Moons, K.G.M., Oct 2006. Review: a gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59 (10), 1087e1091. Eekhout, I., de Boer, R.M., Twisk, J.W.R., de Vet, H.C.W., Heymans, M.W., Sep. 2012. Missing data: a systematic review of how they are reported and handled. Epidemiol. Camb. Mass 23 (5), 729e732. http://www.ncbi.nlm.nih.gov/ pubmed/22584299. Engels, J.M., Diehr, P., Oct 2003. Imputation of missing longitudinal data: a comparison of methods. J. Clin. Epidemiol. 56 (10), 968e976. Fuller, G.W., Carslaw, D.C., Lodge, H.W., 2002. An empirical approach for the prediction of daily mean PM10 concentrations. Atmos. Environ. 36 (9), 1431e1441. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B., Jul 2014. Bayesian Data Analysis, third ed. Chapman and Hall/CRC. Gorelick, M.H., Oct 2006. Bias arising from missing data in predictive models. J. Clin. Epidemiol. 59 (10), 1115e1123. Green, P.J., Silverman, B.W., 1994. Nonparametric Regression and Generalized Linear Models: a Roughness Penalty Approach. Chapman and Hall. Greenland, S., Finkle, W.D., Dec 1995. A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am. J. Epidemiol. 142 (12), 1255e1264. Hartley, H.O., Hocking, R.R., 1971. The analysis of incomplete data. Biometrics 27 (4), 783e823. Hastie, T.J., Tibishirani, R.J., 1990. Generalized Additive Models. Chapman and Hall. Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J., Kolehmainen, M., 2004. Methods for imputation of missing values in air quality data sets. Atmos. Environ. 38 (18), 2895e2907. Little, R.J.A., 1992. Regression with missing x's: a review. J. Am. Stat. Assoc. 87 (420), 1227e1237. Little, R.J.A., Rubin, D.B., 1989. Statistical Analysis with Missing Data. Wiley. McGullagh, P., Nelder, J.A., 1989. Generalized Linear Models. Chapman and Hall. McLachlan, G.J., Krishnan, T., 1997. The EM Algorithm and Extensions. John Wiley and Sons. Miettinen, O.S., 1985. Theoretical Epidemiology. Principle of Occurrence Research in Medicine. Wiley. Plaia, A., Bondì, A., 2006. Single imputation method of missing values in environmental pollution data sets. Atmos. Environ. 40 (38), 7316e7330. R Core Team, 2013. R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project. org/. Rothman, K.J., Greenland, S., Lash, T.L., 2008. Modern Epidemiology, third ed. Lippincott Williams & Wilkins. Rubin, D.B., 1976. Inference and missing data. Biometrika 63 (3), 581e592. Schafer, J.L., 1997. Analysis of Incomplete Multivariate Data. Chapman & Hall. rova , L., Barumamdzadeh, T., le Tertre, A., Schwartz, J., Spix, C., Touloumi, G., Bacha € nk€ Piekarksi, T., de Leon, A.P., Po a, A., Rossi, G., Saez, M., Schouten, J.P., Apr 1996. Methodological issues in studies of air pollution and daily counts of deaths or hospital admissions. J. Epidemiol. Community Health 50 (Suppl. 1), S3eS11. Willmott, C.J., Nov. 1982. Some comments on the evaluation of model performance. Bull. Am. Meteorol. Soc. 63, 1309e1369.

Imputation of missing data in time series for air pollutants

Imputation of missing data in time series for air pollutants

Recommend Documents