Environmental Pollution 233 (2018) 464–473
Spatiotemporal prediction of daily ambient ozone levels across China using random forest for human exposure assessment
Yu Zhan a, Yuzhou Luo b, Xunfei Deng c, Michael L. Grieneisen b, Minghua Zhang b, Baofeng Di a,*
a Department of Environmental Science and Engineering, Sichuan University, Chengdu, Sichuan 610065, China
b Department of Land, Air, and Water Resources, University of California, Davis, CA 95616, USA
c Institute of Digital Agriculture, Zhejiang Academy of Agricultural Sciences, Hangzhou, Zhejiang 310021, China
Article info
Article history: Received 10 July 2017; received in revised form 11 September 2017; accepted 8 October 2017
Abstract
In China, ozone pollution shows an increasing trend and has become the primary air pollutant in warm seasons. Leveraging the air quality monitoring network, a random forest model is developed to predict the daily maximum 8-h average ozone concentrations ([O3]MDA8) across China in 2015 for human exposure assessment. This model captures the observed spatiotemporal variations of [O3]MDA8 by using data on meteorology, elevation, and recent-year emission inventories (cross-validation R2 = 0.69 and RMSE = 26 μg/m3). Compared with chemical transport models, which require many input variables and expensive computation, the random forest model shows comparable or higher predictive performance based on only a handful of readily available variables at much lower computational cost. The nationwide population-weighted [O3]MDA8 is predicted to be 84 ± 23 μg/m3 annually, with the highest seasonal mean in summer (103 ± 8 μg/m3). The summer [O3]MDA8 is predicted to be the highest in North China (125 ± 17 μg/m3). Approximately 58% of the population lives in areas with more than 100 nonattainment days ([O3]MDA8 > 100 μg/m3), and 12% of the population is exposed to [O3]MDA8 > 160 μg/m3 (WHO Interim Target 1) for more than 30 days. As the most populous zones in China, the Beijing-Tianjin Metro, Yangtze River Delta, Pearl River Delta, and Sichuan Basin are predicted to have 154, 141, 124, and 98 nonattainment days, respectively. Effective controls of O3 pollution are urgently needed for the highly populated zones, especially the Beijing-Tianjin Metro with a seasonal [O3]MDA8 of 140 ± 29 μg/m3 in summer. To the best of the authors’ knowledge, this study is the first statistical modeling work of ambient O3 for China at the national level. This timely and extensively validated [O3]MDA8 dataset is valuable for refining epidemiological analyses on O3 pollution in China. © 2017 Elsevier Ltd. All rights reserved.
Keywords: Ozone pollution; Spatiotemporal distributions; China; Human exposure; Machine learning
* Corresponding author. E-mail address: [email protected] (B. Di).
https://doi.org/10.1016/j.envpol.2017.10.029
0269-7491/© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Ambient ozone (O3) is a worldwide air pollutant harmful to human health (Brunekreef and Holgate, 2002; WHO, 2006). Ambient O3 is mainly formed through photochemical reactions between volatile organic compounds (VOCs) and nitrogen oxides (NOx) in the presence of sunlight (USEPA, 2013). Exposure to ambient O3 has been associated with human health problems such as respiratory and cardiovascular diseases (Kan et al., 2008; Shang et al., 2013; USEPA, 2013; WHO, 2006; Wong et al., 2008). To protect public health, the World Health Organization (WHO) proposes the O3 air quality guideline of a daily maximum 8-h mean ([O3]MDA8) of 100 μg/m3, which is also implemented in China (MEPC, 2012; WHO, 2006). The O3 pollution level in China is predicted to be comparable to the global average (Brauer et al., 2016). In contrast to the decreasing trend of “visible” fine particulate matter (PM2.5) pollution in China, “invisible” O3 pollution shows an increasing trend and has become the primary air pollutant in warm seasons (Anger et al., 2015; Brauer et al., 2016; Zhao et al., 2016). While site-based ambient O3 concentrations have been regularly monitored since 2013 via the national air quality monitoring network for China (MEPC, 2015), nationwide spatiotemporal distributions of ambient O3 levels are required for human exposure assessment. Chemical transport models (CTMs), such as the Community Multiscale Air Quality model (CMAQ) (Byun and Schere, 2006), have been employed to predict the spatiotemporal distributions of
ambient O3 worldwide (Bell, 2006; Brauer et al., 2016; Hu et al., 2016; Wang et al., 2011; Zhang et al., 2011). By utilizing meteorological conditions predicted by climate models, CTMs simulate environmental processes of O3 and its precursors, providing guidance on source control (Jacob, 1999). CTMs have been extensively validated against O3 monitoring data in the United States, indicating that CTMs are capable of capturing monthly or seasonal O3 concentrations at large spatial scales (USEPA, 2013; Zhang et al., 2011). However, fine-scale predictions by CTMs may deviate widely from observations due to imperfect input data and limited knowledge (USEPA, 2013). In addition, CTMs generally require substantial computing resources and a wealth of input data (Beelen et al., 2009). The predictive performance of CTMs for China tends to degrade due to the high uncertainty associated with the emission inventories, especially on a fine spatiotemporal scale (Streets et al., 2003; Zhang et al., 2008). While many efforts have been made to refine and update the emission inventories for China (Li et al., 2017; Wu and Xie, 2017), this process requires much extra labor and time. The publicly available emission inventories for China are only released for certain years, e.g., 2008 and 2010 (Li et al., 2017), which may also be used in CTM simulations for more recent years. The predictive performance of such simulations for more recent years tends to be lower, and post hoc adjustments to the emission inventories are generally required. Moreover, long-term regional or national monitoring data are critical for validating the CTM predictions (USEPA, 2009). However, validations of CTMs for China are quite limited due to the scarcity of ambient O3 monitoring data before 2013, and the exposure assessment results based on these CTM predictions may be compromised (Brauer et al., 2016; Guo et al., 2016; Lou et al., 2015; Madaniyazi et al., 2016).
In order to provide more recent and reliable O3 distributions for human exposure assessment, therefore, statistical models based on the extensive O3 monitoring network for China are more favored. Statistical models such as geostatistical models and land use regressions (LUR) are commonly used to predict spatiotemporal distributions of ambient O3 concentrations (Adam-Poupart et al., 2014; Beelen et al., 2009; Brauer et al., 2008; Wang et al., 2015). On the basis of O3 observations, these statistical models are parameterized with the spatiotemporal autocorrelations of O3 levels and/or their relationships with the predictor variables such as land use types and meteorological conditions. The statistical models that incorporate CTMs predictions as their predictor variables are also referred to as hybrid models, which tend to improve the predictive performance with increased model complexity (Di et al., 2017; Wang et al., 2016). Although statistical models do not explicitly simulate the environmental processes of O3, they generally exhibit higher predictive performance than CTMs on fine spatiotemporal scales in the presence of extensive monitoring data (Adam-Poupart et al., 2014; Hoek et al., 2008; Marshall et al., 2008). While statistically modeling ambient O3 distributions is common in the western countries (Adam-Poupart et al., 2014; Beelen et al., 2009; Brauer et al., 2008; Wang et al., 2015), the number of statistical studies for China is rather limited due to the paucity of monitoring data before 2013. Fortunately, the national air quality monitoring network for China has been established and expanded since 2013 (MEPC, 2015). In 2015, hourly ambient O3 concentrations were measured at more than 1500 sites located in more than 300 cities. However, such a large dataset is still not sufficient to assess the nationwide human exposure levels, since many other areas still lack monitoring data. 
Statistical models trained with the monitoring data are needed to predict the O3 concentrations in the unmonitored areas. Machine learning algorithms, which are statistical models from the algorithmic modelling culture (Breiman, 2001b), generally show predictive performance that is comparable or superior to
traditional statistical models such as general linear regression and kriging (Hastie et al., 2009; Hu et al., 2017; Kanevski, 2008; Li et al., 2011). Random forests, a popular machine learning algorithm, make statistical predictions by averaging an ensemble of decorrelated classification or regression trees, and they are capable of handling nonlinear relationships and interaction effects (Breiman, 2001a). An algorithm comparison study found that random forests and the gradient boosting machine (GBM) exhibit the best performance in predicting PM2.5 concentrations among ten machine learning algorithms (Reid et al., 2015). The geographically weighted GBM shows even higher predictive performance than GBM but is much more computationally expensive (Zhan et al., 2017). Random forests are thus a sound choice when striking a balance between prediction accuracy and computing cost. Moreover, unlike some other machine learning algorithms (e.g., neural networks), random forests can show the contribution of each predictor variable through the variable importance measure and the partial dependence plot (Hastie et al., 2009). A method for evaluating the variance of predictions made by random forests is also available (Wager et al., 2014). In a recent study, the random forest approach showed good performance in predicting the PM2.5 concentrations for the conterminous United States, but the prediction uncertainties were not evaluated (Hu et al., 2017). It is important to show the prediction uncertainties and demonstrate the applicability of random forests to different chemical species and geographical regions. Therefore, random forests are proposed to predict the nationwide spatiotemporal distributions of [O3]MDA8 in China for human exposure assessment by utilizing the valuable monitoring dataset. This study aims to predict the spatiotemporal distributions of daily [O3]MDA8 across China in 2015.
A 0.1° × 0.1° grid is used to be consistent with the previous global or national exposure assessments (Brauer et al., 2016; Guo et al., 2016). Random forest is implemented with variable selection to statistically model the spatiotemporal variations of [O3]MDA8 based on the publicly available datasets including the meteorology, elevation, and emission inventories. On the basis of the predicted [O3]MDA8, we assess the exposure intensities and durations for the populations living in different regions of China. To the best of the authors’ knowledge, this study is the first statistical modeling work of ambient O3 for China at the national level. This timely and extensively validated [O3]MDA8 dataset is valuable for refining epidemiological analyses on O3 pollution in China.
2. Materials and methods
2.1. Data preparation
The hourly O3 monitoring data for 2015 were obtained from the national air quality monitoring network for mainland China and Hainan Island (MEPC, 2015), the Environmental Protection Department of Hong Kong for Hong Kong (EPDHK, 2015), and the Environmental Protection Administration of Taiwan for Taiwan (EPAROC, 2015). The O3 concentrations were measured with the ultraviolet-spectrophotometry method. The highest 8-h moving average for each day was calculated as [O3]MDA8 after data cleaning at each monitoring site by following the procedure adopted by the California Air Resources Board (CARB, 2006). For a given day to be included, this procedure requires more than six hourly measurements in each third of the day (i.e., 0:00–7:00, 8:00–15:00, and 16:00–23:00; Beijing Standard Time, UTC+8), and no more than two consecutive hourly measurements missing within that day. Around half a million [O3]MDA8 records were obtained from 1608 monitoring sites across China in 2015 (Fig. 1), which were mainly located in East, Central, North, and South China.
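The completeness screening and the [O3]MDA8 calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CARB procedure itself: the function names (`day_is_valid`, `mda8`), the use of `None` for a missing hour, the restriction to 8-h windows that start and end within the same day, and the `min_valid` threshold of six valid hours per window are all choices made here and are not specified in the text.

```python
from statistics import mean


def longest_missing_run(hourly):
    """Length of the longest run of consecutive missing (None) hours."""
    longest = run = 0
    for value in hourly:
        run = run + 1 if value is None else 0
        longest = max(longest, run)
    return longest


def day_is_valid(hourly):
    """Apply the two completeness rules to 24 hourly values (hours 0..23)."""
    thirds = (hourly[0:8], hourly[8:16], hourly[16:24])
    # Rule 1: more than six measurements in each third of the day.
    enough = all(sum(v is not None for v in third) > 6 for third in thirds)
    # Rule 2: no more than two consecutive hours missing.
    return enough and longest_missing_run(hourly) <= 2


def mda8(hourly, min_valid=6):
    """Daily maximum 8-h average, or None if the day fails screening.

    Assumption: only windows lying entirely within the day are used
    (start hours 0..16), and a window needs `min_valid` valid hours.
    """
    if len(hourly) != 24 or not day_is_valid(hourly):
        return None
    best = None
    for start in range(17):
        window = [v for v in hourly[start:start + 8] if v is not None]
        if len(window) >= min_valid:
            average = mean(window)
            if best is None or average > best:
                best = average
    return best
```

For a day with 40 μg/m3 in the morning and evening and 100 μg/m3 from 10:00 to 17:00, `mda8` returns 100; a day with three consecutive missing hours is rejected and returns `None`.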
Fig. 1. Ambient O3 monitoring network for China in 2015. Hourly O3 concentrations were measured at 1608 sites. The highly populated zones are labelled, including the Beijing-Tianjin Metro (BTM), the Pearl River Delta (PRD), the Sichuan Basin (SCB), and the Yangtze River Delta (YRD). Green triangles are the locations of the cities shown in Figure S6.
The monitoring work was relatively sparse in Northeast, Northwest, and Southwest China. The complete set of input variables for modeling [O3]MDA8 included meteorology, elevation, anthropogenic emission inventory, land use, normalized difference vegetation index (NDVI), road density, and population density (Table S1). The daily weather conditions for 2015 were obtained from 839 meteorological stations (CMA, 2015). The daily planetary boundary layer height (PBLH) data were retrieved from the Modern-Era Retrospective Analysis for Research and Applications (MERRA-2; resolution: 0.625° × 0.5°) (GMAO, 2015). The elevation data were retrieved from the Shuttle Radar Topography Mission (SRTM; 30 m resolution) (Jarvis et al., 2008). As the emission inventory for China in 2015 was not publicly available at the time of this study, the monthly data for 2010 were retrieved from the gridded Asian anthropogenic emission inventory (named MIX; 0.25° resolution) (Li et al., 2017). Although the emission inventories were expected to differ between 2010 and 2015, the spatiotemporal patterns of emissions in 2010 were informative for predicting [O3]MDA8 in 2015. The rationale for using the recent-year emission inventories was inferred from the high correlation (R = 0.995) of the provincial gross domestic product (GDP) between 2010 and 2015 (NBSC, 2011, 2016), as well as the high correlation (R = 0.997) of gridded population densities (CIESIN, 2016). GDP and population densities are important information for developing emission inventories. In addition, the land use classifications were extracted from GlobeLand30 (30 m resolution) (Jun et al., 2014). The NDVI data for 2015 were obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS) satellite retrievals (250 m resolution and 8-day interval) (Didan et al., 2015). The road density data were obtained from OpenStreetMap (OpenStreetMap contributors, 2015).
The population densities for 2015 were obtained from the Gridded Population of the World (GPWv4; 30 arc-second resolution) (CIESIN, 2016).
The input data were resampled to the 0.1° × 0.1° grid cells (98341 cells in total) of China for model development and prediction. The weather conditions for each cell were interpolated from the meteorological stations through co-kriging with elevation (Deutsch and Journel, 1998). The road length for each cell was summed over the road segments within the cell through spatial clipping. The elevation, emission inventory, land use, NDVI, PBLH, and population density for each cell were calculated as area-weighted averages via spatial overlay. The daily NDVI data were derived from the 8-day-interval retrievals via spline interpolation (Forsythe et al., 1977). The daily emission intensity was obtained by uniformly allocating monthly emissions to each day of that month. The complete list of input variables is given in Fig. S1, a subset of which was kept in the model after variable selection for better predictive performance and simpler model structure (Fig. S2).
2.2. Algorithm description
The random forest algorithm was employed to predict the spatiotemporal distributions of [O3]MDA8 based on a subset of predictor variables (Breiman, 2001a), which was determined through variable selection (please see “Model Evaluation and Variable Selection” for details). Throughout the article, we use the term “algorithm” when referring to the computing rules, and the term “model” when referring to an instance of an algorithm. The random forest model developed in this study consisted of 500 regression trees. Each tree was grown on a bootstrap sample drawn from the training dataset that contained the [O3]MDA8 observations and the predictor variables. One third of the predictor variables (remaining at each step of variable selection) were randomly selected to grow the tree, which reduces the correlation between trees. The random forest model made predictions by averaging the predictions from all the individual trees. Please see Text S1 in the
Supporting Information (SI) for the detailed description of the algorithm. We used the RandomForestRegressor class of the Python package scikit-learn for high computing performance (Pedregosa et al., 2011). The standard deviations of model predictions were estimated by using the infinitesimal jackknife (IJ) method, in which the variation of predictions is examined by systematically down-weighting each observation to an infinitesimal extent (Wager et al., 2014). The IJ method tends to give more stable estimates than the ordinary jackknife resampling method, where one observation is removed at a time to create subsamples for variance estimation.
2.3. Model evaluation and variable selection
The performance of a random forest model in predicting [O3]MDA8 was evaluated using a site-based 10-fold cross-validation. All of the 1608 O3 monitoring sites were divided into 10 approximately even groups, and each group served exactly once as the source of validation data. A random forest model was trained with the monitoring data from nine of the groups, and then used to make predictions for the remaining group. The training and prediction processes were repeated 10 times in turn so that each [O3]MDA8 observation had a paired prediction. In each of the 10 rounds, the validation data were kept separate from the training data, and the model prediction was validated against the validation data only. By using the observation-prediction pairs, the predictive performance was measured with the coefficient of determination (R2), root-mean-square error (RMSE), and relative prediction error (RPE; ratio of RMSE to observation mean). In addition, we conducted a strict comparison with the spatiotemporal-regression-kriging (STRK) model, which shows high predictive performance in spatiotemporal interpolation (Hengl et al., 2012; Kilibarda et al., 2014).
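The site-based 10-fold cross-validation and the three performance measures described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the helper names (`assign_site_folds`, `site_based_cv`, `performance`), the record layout, and the random assignment of sites to folds are assumptions. R2 is computed here as 1 − SSE/SST, which is one common definition; the text does not state which definition was used.

```python
import math
import random


def assign_site_folds(site_ids, n_folds=10, seed=0):
    """Map each monitoring site to one of n_folds groups of near-equal size."""
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    return {site: i % n_folds for i, site in enumerate(sites)}


def site_based_cv(records, fit, n_folds=10, seed=0):
    """Cross-validate with whole sites held out.

    records: list of (site_id, features, observed) tuples (assumed layout).
    fit(train_records) must return a predict(features) callable.
    Returns paired observed/predicted lists covering every record once.
    """
    fold_of = assign_site_folds([r[0] for r in records], n_folds, seed)
    observed, predicted = [], []
    for k in range(n_folds):
        train = [r for r in records if fold_of[r[0]] != k]
        held_out = [r for r in records if fold_of[r[0]] == k]
        predict = fit(train)  # the model never sees the held-out sites
        for _, features, obs in held_out:
            observed.append(obs)
            predicted.append(predict(features))
    return observed, predicted


def performance(observed, predicted):
    """R2 (as 1 - SSE/SST), RMSE, and RPE (RMSE / observation mean)."""
    n = len(observed)
    obs_mean = sum(observed) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - obs_mean) ** 2 for o in observed)
    rmse = math.sqrt(sse / n)
    return {"R2": 1.0 - sse / sst, "RMSE": rmse, "RPE": rmse / obs_mean}
```

In practice `fit` would wrap the regressor named above, for example scikit-learn's `RandomForestRegressor` with `n_estimators=500` and `max_features=1/3` to mirror the settings described in Section 2.2. That parameter mapping is our reading of the text rather than a configuration reported by the authors. Splitting by site rather than by record matters because records from one site are strongly autocorrelated, so a record-level split would overstate performance in unmonitored areas.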
STRK is a composite geostatistical model, where spatiotemporal kriging is used to interpolate the residuals of linear regression (see the SI for the detailed algorithm description). The relative importance (Im) of a predictor variable (Xm) in a random forest model was evaluated according to the number of times that variable was used for splitting and the associated reduction of squared error (Friedman, 2001).
$$ I_m = \sqrt{\frac{1}{K} \sum_{k=1}^{K} I_m^2(T_k)} \qquad (1) $$
where $K$ is the total number of trees in the random forest model, and $I_m^2(T_k)$ is the squared variable importance of variable $X_m$ for tree $k$.
$$ I_m^2(T) = \sum_{t=1}^{J} \hat{i}_t^2 \, I(v_t = m) \qquad (2) $$
where $J$ is the total number of splits of the tree, $I(v_t = m)$ indicates a split by $X_m$, and $\hat{i}_t^2$ is the squared error reduction achieved by that split. The relative importance values are scaled so that they sum up to 100. A backward variable selection strategy was implemented to simplify the model structure and improve the predictive performance. The full model with all the predictor variables was evaluated with a 10-fold cross-validation, where the predictive performance (R2 and RMSE) and relative variable importance were recorded. The variable with the lowest importance was removed, and the reduced model with the remaining variables was evaluated with another 10-fold cross-validation. The above steps were repeated until only one predictor variable remained in the model. The reduced model with the highest predictive performance was chosen for prediction and is referred to as the final model hereafter.
2.4. Spatiotemporal prediction
The daily [O3]MDA8 was predicted on a 0.1° × 0.1° grid across China for 2015 by using the final model, and then summarized by seasons and regions (or highly-populated zones).
a. Seasons: spring, summer, fall, and winter;
b. Regions: North, Northwest, Northeast, East, Central, South, and Southwest China (Fig. 1);
c. Highly-populated zones: Beijing-Tianjin Metro, Yangtze River Delta, Pearl River Delta, and Sichuan Basin.
The annual average for each region (or zone), the national average for each season, and the overall average were also calculated from the daily predictions at cell level. The population-weighted means of [O3]MDA8 were estimated for a given area and period:
$$ C_{PW} = \sum_{i=1}^{N} (P_i C_i) \Big/ \sum_{i=1}^{N} P_i \qquad (3) $$
where $C_{PW}$ is the population-weighted [O3]MDA8 for the area enclosing $N$ cells; $P_i$ is the population density for cell $i$; and $C_i$ is the average [O3]MDA8 for cell $i$ during that period. The exposure intensities and durations were characterized by the cumulative percent of populations exposed to different levels of [O3]MDA8. The number of nonattainment days (defined as $C_{PW}$ > 100 μg/m3) was summarized by seasons and either regions or zones. The [O3]MDA8 of 100 μg/m3 is both the air quality guideline (AQG) set by the World Health Organization (WHO) and the first-level concentration limit set by the Ministry of Environmental Protection of China (MEPC, 2012; WHO, 2006).
3. Results and discussion
3.1. O3 observation summary
The daily [O3]MDA8 observations follow a heavy-tailed distribution, ranging from 1 to 718 μg/m3 with a mean ± standard deviation of 82 ± 47 μg/m3. The first, second, and third quartiles are 48 μg/m3, 73 μg/m3, and 111 μg/m3, respectively. The daily [O3]MDA8 usually occurs during 12:00–19:00 (Beijing Standard Time, UTC+8), with the highest hourly concentration around 15:00. The [O3]MDA8 values are highly correlated with the daily average O3 concentrations (Spearman rank correlation R = 0.94) and are approximately 1.5 times the daily averages.
3.2. Predictive performance
Through the variable selection, we chose the model with the predictor variables of evaporation, temperature, relative humidity, day of year, elevation, carbon monoxide (CO) emission, sunshine duration, organic carbon (OC) emission, ammonia (NH3) emission, atmospheric pressure, nonmethane VOC (NMVOC) emission, wind speed, and PBLH (Figure S2). NOx emission was included as a predictor variable in the full model but was removed from the final model after variable selection, as discussed in the “Variable Importance” subsection. The cross-validation results show that this final model exhibits high predictive performance on daily (R2 = 0.69, RMSE = 26 μg/m3, and RPE = 31.2%) and monthly
Fig. 2. Cross-validations for the predictive performance of the random forest model in predicting (a) daily maximum 8-h average O3 concentrations ([O3]MDA8) and (b) monthly average [O3]MDA8.
(R2 = 0.71, RMSE = 19 μg/m3, and RPE = 23.5%) bases (Fig. 2). The ratio of daily-level RMSE to standard deviation is 0.55, which is better than the ratio of 0.76 in a previous study (Adam-Poupart et al., 2014). Seasonally, the national R2 values range from 0.56 for winter to 0.72 for fall (Table S2). The high predictive performance for fall is attributed to the stronger effects of weather conditions on the surface ozone concentrations. For instance, the Spearman correlations between evaporation (the most important variable in the random forest model) and [O3]MDA8 are 0.58 for fall versus 0.32 to 0.43 for the other seasons. The regional R2 value is highest in South China (R2 = 0.75) and lowest in Northwest China (R2 = 0.59), mainly due to the relatively sparse monitoring sites in the latter (Fig. 1). Considering that this is a population-based study, the relatively lower predictive performance for the areas with lower population densities (e.g., Northwest China) has a limited effect on the exposure assessment across China. The high predictive performance in South China likely results from the stronger associations between [O3]MDA8 and the emission inventories than those in the other regions. For example, the Spearman correlations between the OC emissions and [O3]MDA8 are 0.49 for South China versus 0.36 to 0.16 for the other regions. The higher correlations may indicate better data quality of the emission inventories in South China. The annual R2 values are 0.87, 0.78, 0.75, and 0.74 for the Beijing-Tianjin Metro, Yangtze River Delta, Pearl River Delta, and Sichuan Basin, respectively (Table S3). The random forest model shows equivalent or better performance than the previous CTMs and statistical models in predicting ambient O3 concentrations (Table S4).
Note that none of the reviewed statistical models have been applied to nationwide data for China, and R2 (as a commonly used performance measure) depends on the spatial or spatiotemporal scales. For instance, the model for the Los Angeles Basin with R2 = 0.84 is not necessarily better than the model for the continental United States with R2 = 0.74–0.80 (Di et al., 2017; Wang et al., 2016). While the statistical models and CTMs were associated with different simulation areas and episodes, the statistical models generally demonstrated higher predictive performance than the CTMs. The validation R from a CTM study predicting [O3]MDA8 in the United States ranged from 0.21 to 0.57 (Zhang et al., 2011). In contrast, a study statistically modeling [O3]MDA8 in Quebec, Canada showed much higher performance, with a cross-validation R2 of 0.65 (Adam-Poupart et al., 2014). Moreover, Wang et al. reported better performance of the statistical model (R2 = 0.78) than the CTM (R2 = 0.60) for the same simulation area and episode. In the CTM studies for China (with fewer than five validation sites), the validation R ranged from 0.5 to 0.8 for predicting hourly O3 concentrations (Liu et al., 2010), and
ranged from 0.46 to 0.52 for the annual means (Lou et al., 2015). In addition to the higher R in the present study, the RMSE of the random forest model (26 μg/m3) also falls at the lower end of the RMSE range (21.6–49.8 μg/m3) in that CTM study for China (Liu et al., 2010). In addition to a high degree of uncertainty associated with the emission inventories for China, imperfect knowledge of the photochemical reactions also constrains the predictive performance of CTMs (Streets et al., 2003; USEPA, 2013; Zhang et al., 2008). Under the same cross-validation setting, STRK exhibits slightly lower predictive performance than the random forest model (daily R2: 0.68 vs. 0.69; monthly R2: 0.69 vs. 0.71; Figure S4b). It is interesting to see such similar performance achieved by two quite different models. For STRK, the linear regression captures a considerable amount of the [O3]MDA8 variation (R2 = 0.35; Figure S4b), and then the spatiotemporal kriging extracts more variation from the regression residuals. In contrast, the random forest model (also a regression model) captures even more variation based solely on the same predictor variables. This demonstrates that the spatiotemporal autocorrelations of [O3]MDA8 are closely related to the predictor variables, and that the effects of the predictor variables (e.g., nonlinearity or interaction) are not properly modeled by the linear regression. In previous model comparison studies, the amount of information carried in the predictor variables tended to determine the relative performance of random forests and geostatistical models (Li et al., 2011; Temesgen and Ver Hoef, 2015). When the predictor variables are inadequate, geostatistical models generally show higher predictive performance than random forests.
With adequate predictor variables, random forests are capable of handling spatial and spatiotemporal problems while exploring the effects of the predictor variables on the response variables.
3.3. Variable importance
In the final random forest model, the meteorological variables collectively account for 65% of the overall relative importance (Fig. 3). Evaporation, temperature, relative humidity, and sunshine duration are among the top five most important variables, with relative importance values of 17.7, 14.7, 10.5, and 8.6, respectively. The high importance of meteorology for ambient O3 levels has also been identified in a previous statistical modeling study (Wise and Comrie, 2005). The observed importance of meteorological variables is consistent with the facts that O3 formation via photochemical reactions is highly favored in hot environments with long sunshine duration, while O3 loss via reaction with water vapor tends to be accelerated in moist environments (Beck et al., 1999;
Y. Zhan et al. / Environmental Pollution 233 (2018) 464e473
469
levels. In addition, it is noteworthy that the predictor variables removed from the final model are more or less associated with [O3]MDA8. However, they are removed from the final model because their introduced noise tends to be higher than the gained information (Figure S2). In the light of the split search and variable selection, the random forest model utilizes the information contained in the emission inventories very well, while reducing the effect of their deficiencies on the model prediction. 3.4. Spatiotemporal distributions
Fig. 3. Variable importance plot for the random forest model predicting the daily maximum 8-h average O3 concentrations ([O3]MDA8) in China.
Bloomer et al., 2009; USEPA, 2013; Zhang and Wang, 2016). Evaporation, as an integrative measure of meteorological conditions such as temperature and humidity, is a valuable environmental indicator for O3 formation and stability. Moreover, water-stressed plants tend to emit more biogenic isoprene (an important O3 precursor) under warm and dry conditions, thus enhancing the O3 formation (Zhang and Wang, 2016). In contrast to the abovementioned meteorological variables, atmospheric pressure, PBLH, and wind speed show relatively lower importance, with importance values of 5.0, 4.3, and 4.3, respectively. Due to the existence of relative humidity in the model, precipitation is removed during variable selection (Figure S2). Although higher atmospheric pressure and thinner PBLH tend to result in higher O3 concentrations, these conditions are usually accompanied by cold temperatures that are unfavorable to O3 formation. Therefore, the importance of atmospheric pressure and PBLH are moderated. While advection and rain scavenging (associated with wind speed and precipitation, respectively) are important O3-loss paths, their importance may be ze et al., relevant only when the O3 concentrations are high (Che 2003). The anthropogenic emissions of NH3, CO, OC, and NMVOC, with importance values of 5.7, 5.0, 5.0, and 4.3, respectively, exhibit relatively lower importance than the meteorological variables (Fig. 3). The random forest model is more robust to the deficiencies in the emission inventories than CTMs. The importance or contribution of the emission of each chemical species is automatically adjusted in the model training process, particularly at the step of split search. More informative or accurate emission variables gain higher chance of getting picked, and vice versa. Although NH3 and OC are not O3 precursors, their emission intensities in space and time indicate the patterns of industrial and agricultural activities associated with ambient O3 formation. 
CTM simulations may provide more detailed explanations for the associations of NH3 and OC emissions with ambient O3 concentrations. In contrast, the importance of NOx and NMVOC emissions is lower than expected, which may be due to regional transport or titration effects (NO + O3 → NO2 + O2) (Seinfeld and Pandis, 2006; Tang et al., 2012). The emission of NOx is removed through variable selection due to its low importance. NOx and NMVOC may be transported from urban to rural areas via advection, causing O3 formation to occur some distance away from the emission source. NOx titration tends to counteract the importance of NOx emissions to O3
In 2015, the annual average of population-weighted [O3]MDA8 is predicted to be 84 ± 23 μg/m3 (mean ± standard deviation; n = 365) for China (Table 1). Seasonally, the population-weighted [O3]MDA8 is predicted to range from 58 ± 10 μg/m3 for winter to 103 ± 8 μg/m3 for summer. The annual/seasonal means and standard deviations are calculated from the daily values of population-weighted [O3]MDA8 nationwide. The daily variations of [O3]MDA8 indicated by the standard deviations provide important information for human exposure analyses, especially since WHO currently has only a daily-level guideline for O3 (WHO, 2006). The seasonal patterns from this study are generally consistent with those from previous CTM studies for China (Hu et al., 2016; Liu et al., 2010; Wang et al., 2011). The seasonal mean predicted in this study is comparable to that from a previous CTM study, where the highest seasonal mean of the population-weighted 1-h daily maximum O3 concentrations was predicted to be 65 ppb (approximately 129 μg/m3) for China in 2013 (Brauer et al., 2016). The difference between 103 μg/m3 in our study and 129 μg/m3 in the previous study can be attributed to the difference between 8-h and 1-h maximums. As the other studies do not focus on human exposure assessment, population-weighted concentrations are usually not reported, making it difficult to compare our predictions with theirs.
The spatiotemporal distributions of [O3]MDA8 show high spatial heterogeneity across the nation (Table 1; Fig. 4 and S3). The annual average of population-weighted [O3]MDA8 is predicted to be the highest in North China (90 ± 36 μg/m3) and the lowest in South China (79 ± 20 μg/m3). The highest [O3]MDA8 is predicted during summer in North China, with a seasonal average of 125 ± 17 μg/m3, which is much higher than in East China (104 ± 24 μg/m3) and South China (82 ± 16 μg/m3).
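The conversion quoted above for the comparison with Brauer et al. (2016), 65 ppb to approximately 129 μg/m3, can be illustrated with the ideal gas law. The sketch below is not taken from the paper: the function name is hypothetical, and the reference temperature (20 °C) and pressure (1 atm) are assumptions, since the text does not state the conditions used for the conversion.

```python
# Ideal-gas conversion of an O3 mixing ratio (ppb) to a mass concentration (ug/m3).
R = 8.314462618   # universal gas constant, J/(mol*K)
M_O3 = 47.998     # molar mass of ozone, g/mol


def ppb_to_ugm3(ppb, temp_c=20.0, pressure_pa=101325.0):
    """Convert ppb to ug/m3.

    temp_c and pressure_pa are assumed reference conditions,
    not values reported in the paper.
    """
    # Molar volume of an ideal gas in L/mol at the given conditions.
    molar_volume_l = R * (temp_c + 273.15) / pressure_pa * 1000.0
    return ppb * M_O3 / molar_volume_l


if __name__ == "__main__":
    print(f"65 ppb at 20 C: {ppb_to_ugm3(65.0):.1f} ug/m3")
    print(f"65 ppb at 25 C: {ppb_to_ugm3(65.0, temp_c=25.0):.1f} ug/m3")
```

Under these assumed conditions the result is close to 130 μg/m3, in line with the approximate value given in the text; a warmer reference temperature yields a slightly lower mass concentration.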
The spatial heterogeneity of summer [O3]MDA8 is associated with the spatial pattern of humidity conditions. While North, East, and South China are all hot in summer, it is relatively drier in North China than the other regions. Therefore, ambient O3 tends to form and accumulate to a higher level in North China, resulting in the highest regional
Table 1
Seasonal and annual averages ± standard deviations of population-weighted daily maximum 8-h average O3 concentrations ([O3]MDA8) for each region and the whole nation of China in 2015 (μg/m3).
Region(a)/Nation    Spring         Summer          Fall       Winter     Annual
Central China       89 ± 21(b)     101 ± 13*(c)    82 ± 31    56 ± 14    82 ± 27
East China          97 ± 24        104 ± 24*       93 ± 31    59 ± 13    88 ± 29
North China         104 ± 26*      125 ± 17*       79 ± 31    50 ± 11    90 ± 36
Northeast China     93 ± 17        103 ± 15*       72 ± 20    54 ± 10    81 ± 25
Northwest China     95 ± 14        111 ± 11*       72 ± 20    57 ± 10    84 ± 25
South China         76 ± 21        82 ± 16         88 ± 21    70 ± 18    79 ± 20
Southwest China     97 ± 15        93 ± 15         68 ± 16    58 ± 12    79 ± 22
Nation              93 ± 16        103 ± 8*        81 ± 22    58 ± 10    84 ± 23
(a) Regions are labelled on Fig. 1.
(b) The averages and standard deviations are calculated from the daily values of regional/national population-weighted [O3]MDA8 (e.g., n = 365 for annual standard deviations).
(c) The averages higher than 100 μg/m3 are marked with an asterisk (*).
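The calculation described in the table footnote (a population-weighted concentration for each day, then a mean and standard deviation over the daily values) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy grid of three cells over four days are hypothetical, and the use of the sample standard deviation is an assumption because the paper does not say which estimator was used.

```python
from statistics import mean, stdev


def population_weighted_mean(concentrations, populations):
    """Population-weighted average of one day's grid-cell concentrations."""
    total_population = sum(populations)
    weighted_sum = sum(c * p for c, p in zip(concentrations, populations))
    return weighted_sum / total_population


def summarize(daily_grids, populations):
    """Mean and sample standard deviation of the daily weighted values."""
    daily_values = [population_weighted_mean(day, populations) for day in daily_grids]
    return mean(daily_values), stdev(daily_values)


# Hypothetical toy data: 3 grid cells (population per cell) over 4 days (ug/m3).
populations = [1000, 3000, 6000]
daily_grids = [
    [80, 90, 100],
    [60, 70, 80],
    [100, 110, 120],
    [90, 95, 100],
]

avg, sd = summarize(daily_grids, populations)

if __name__ == "__main__":
    print(f"{avg:.1f} +/- {sd:.1f} ug/m3 (n = {len(daily_grids)})")
```

With a full year of daily grids, the same two steps would give annual values of the kind reported in the table (n = 365); restricting the days to one season or the cells to one region gives the seasonal and regional entries.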
Y. Zhan et al. / Environmental Pollution 233 (2018) 464–473
Fig. 4. Spatial distributions of the predicted seasonal averages of daily maximum 8-h average O3 concentrations ([O3]MDA8) for China in (a) spring, (b) summer, (c) fall, and (d) winter of 2015.
Table 2
Seasonal and annual averages ± standard deviations of population-weighted daily maximum 8-h average O3 concentrations ([O3]MDA8) for each highly-populated zone of China in 2015 (μg/m3).
Zone(a)   Region(b)          Spring             Summer       Fall          Winter     Annual
BTM       North China        108 ± 37*(c,d)     140 ± 29*    73 ± 41       42 ± 16    91 ± 49
PRD       South China        75 ± 33            94.2 ± 32    103 ± 35*     73 ± 27    86 ± 34
SCB       Southwest China    97 ± 23            102 ± 23*    65 ± 25       47 ± 14    78 ± 32
YRD       East China         105 ± 30*          112 ± 34*    99 ± 37       60 ± 16    94 ± 37
(a) Highly-populated zones are labelled on Fig. 1. BTM: Beijing-Tianjin Metro; PRD: Pearl River Delta; SCB: Sichuan Basin; YRD: Yangtze River Delta.
(b) Enclosing region for each zone.
(c) The averages and standard deviations are calculated from the daily values of regional/national population-weighted [O3]MDA8 (e.g., n = 365 for annual standard deviations).
(d) The averages higher than 100 μg/m3 are marked with an asterisk (*).
concentration. Furthermore, the highly populated zones show relatively higher [O3]MDA8 than the other areas in their enclosing regions (Tables 1 and 2). Industries and populations are highly concentrated in these zones, resulting in emission hotspots of O3 precursors (Li et al., 2017; Wu and Xie, 2017). The highest seasonal [O3]MDA8 values occur in summer for the Beijing-Tianjin Metro (140 ± 29 μg/m3), Yangtze River Delta (112 ± 34 μg/m3), and Sichuan Basin (102 ± 23 μg/m3). On a finer spatiotemporal scale, the daily [O3]MDA8 in summer can be higher than 250 μg/m3 for Beijing, Chengdu, Guangzhou, and Shanghai, which are the major cities in these highly-populated zones (Figure S6).
According to the results of the IJ method (Wager et al., 2014), the coefficients of variation (CV) are lower than 1 for more than 99% of the spatiotemporal nodes, indicating low uncertainties of predictions in general. Each node represents a specific day and grid cell. CV is the ratio of the standard deviation of prediction (SDP) to the point prediction. The overall CV and SDP are 0.28 ± 0.19 and 21 ± 17 μg/m3, respectively. SDP shows a moderate correlation with the point prediction (r = 0.32), suggesting that higher variances tend to be associated with high predictions of [O3]MDA8. The SDP is higher for North China, as well as the northern parts of Central and East China (Fig. 5), where intensive human activities complicate the predictions of [O3]MDA8. High-SDP areas are also scattered along the boundaries of Northeast China and in the desert of Northwest China, where the monitoring network is sparse. While we demonstrate the uncertainty estimation for the random forest predictions, it should be noted that the IJ method is not as mature as the traditional statistical methods. A limitation of this study is that the IJ method is currently unable to estimate the covariance among multi-node predictions, which makes it difficult to derive uncertainties of aggregated estimates (e.g., annual or national means). This limitation also hinders the uncertainty analyses in the human exposure assessment. More comprehensive uncertainty analyses will be conducted when appropriate methods become available.
3.5. Exposure assessment
On the basis of the [O3]MDA8 predictions, the exposure assessment results show that (Fig. 6a):
Fig. 5. Spatial distributions of the seasonal average standard deviations of the predicted daily maximum 8-h average O3 concentrations ([O3]MDA8) for China during (a) spring, (b) summer, (c) fall, and (d) winter of 2015.
Fig. 6. (a) Percent of population exposed to the daily maximum 8-h average O3 ([O3]MDA8) higher than daily mean levels for longer than specified days. The WHO air quality guideline (AQG: 100 μg/m3) and interim target 1 (IT-1: 160 μg/m3) are indicated (WHO, 2006). (b) Spatial distributions of the numbers of days exceeding 100 μg/m3.
a. 58% of the Chinese population lived in areas with more than 100 nonattainment days (i.e., [O3]MDA8 > 100 μg/m3) in 2015, and
b. 12% of the population were exposed to [O3]MDA8 > 160 μg/m3 (WHO Interim Target 1) for more than 30 days in 2015.
Such cumulative results, which include both exposure intensity and duration, provide more comprehensive information for human exposure assessments (de Vocht et al., 2015).
The numbers of nonattainment days vary widely across the nation (Fig. 6b). As expected, the spatiotemporal patterns of the nonattainment days are similar to the [O3]MDA8 patterns (Tables 1 and S5). According to the regional population-weighted [O3]MDA8, North and South China have the most (157 days) and fewest (59 days) nonattainment days, respectively (Table S5). East China, as one of the main O3 pollution regions, has more nonattainment days than any other region in fall. For the highly-populated zones in China, the Beijing-Tianjin Metro, Yangtze River Delta, Pearl River Delta, and Sichuan Basin have 154, 141, 124, and 98 nonattainment days, respectively.
This study estimates individual exposure solely based on the predicted [O3]MDA8, which may be overestimated in rural areas due to the sampling bias in the observation data (Figure S7). Approximately 7.2% of the observation data are from areas with <400 people/km2 (a criterion of rural area). As the prediction performance for the low-population-density samples (<400 people/km2) is comparable to the overall performance, the sampling bias towards urban areas may affect the population-weighted exposure assessment only to a small extent. In addition, although individual exposure is correlated with the ambient concentrations (USEPA, 2013), it is also related to other factors, such as air exchange rates and occupations, especially on fine spatial scales.
In the future, the spatiotemporal scale of [O3]MDA8 simulation should be refined or expanded to better meet the requirement of
epidemiologic exposure assessment. In the present study, the [O3]MDA8 variation within a 0.1° × 0.1° grid cell is non-negligible, and the average intra-cell standard deviation is 8 μg/m3 for the annual average [O3]MDA8. The spatial resolutions of modeling studies largely depend on their spatial extents, such as 0.1° × 0.1° for China (Guo et al., 2016) and 4 × 4 km2 for the Los Angeles Basin (Wang et al., 2016). While modeling at finer spatial resolutions is expected to reduce the exposure misclassification rates, data availability and computing cost hinder the use of fine resolutions in large-extent studies. Moreover, expanding the temporal scale of prediction is essential for chronic exposure assessment, where latency periods and historical exposure reconstruction play an important role. Similar to a previous back-extrapolation study using a generalized additive model (Levy et al., 2015), the random forest model can also be used to predict [O3]MDA8 in the past (or future) as long as the corresponding data of predictor variables are available. However, the temporal expansion is likely subject to greater uncertainty, due to the well-known caution regarding extrapolation in regression analyses.
4. Conclusions
With the support of the extensive air quality monitoring network, a random forest model is developed to predict daily [O3]MDA8 in 2015 on a 0.1° × 0.1° grid across China. By using a handful of predictor variables from readily available datasets, the random forest model shows comparable or even higher predictive performance than CTMs, and the computation cost is much lower. The predictive performance of this random forest model may be further improved by incorporating satellite retrievals for tropospheric O3 and its precursors as well as CTM predictions, but only at the expense of model simplicity.
The input data for the current random forest model are easier to update than those for CTMs, so this model is recommended for providing timely and fine-scale [O3]MDA8 for human exposure assessment. The assessment of this study shows that effective controls of O3 pollution are urgently needed for the highly-populated zones of China, especially the Beijing-Tianjin Metro with a summer [O3]MDA8 mean of 140 ± 29 μg/m3.
Conflicts of interest
None.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (21607127).
Appendix A. Supplementary data
Supplementary data related to this article can be found at https://doi.org/10.1016/j.envpol.2017.10.029.
References
Adam-Poupart, A., Brand, A., Fournier, M., Jerrett, M., Smargiassi, A., 2014. Spatiotemporal modeling of ozone levels in Quebec (Canada): a comparison of kriging, land-use regression (LUR), and combined Bayesian maximum entropy-LUR approaches. Environ. Health Perspect. 122, 970–976.
Anger, A., Dessens, O., Xi, F., Barker, T., Wu, R., 2015. China's air pollution reduction efforts may result in an increase in surface ozone levels in highly polluted areas. Ambio 45, 254–265.
Beck, J., Krzyzanowski, M., Koffi, B., 1999. Tropospheric Ozone in the European Union. Office for Official Publications of the European Communities (Bernan Associates [distributor]).
Beelen, R., Hoek, G., Pebesma, E., Vienneau, D., de Hoogh, K., Briggs, D.J., 2009. Mapping of background air pollution at a fine spatial scale across the European Union. Sci. Total. Environ. 407, 1852–1867.
Bell, M.L., 2006. The use of ambient air quality modeling to estimate individual and population exposure for human health research: a case study of ozone in the Northern Georgia Region of the United States. Environ. Int. 32, 586–593.
Bloomer, B.J., Stehr, J.W., Piety, C.A., Salawitch, R.J., Dickerson, R.R., 2009. Observed relationships of ozone air pollution with temperature and emissions. Geophys. Res. Lett. 36.
Brauer, M., Freedman, G., Frostad, J., van Donkelaar, A., Martin, R.V., Dentener, F., Dingenen, R.v., Estep, K., Amini, H., Apte, J.S., Balakrishnan, K., Barregard, L., Broday, D., Feigin, V., Ghosh, S., Hopke, P.K., Knibbs, L.D., Kokubo, Y., Liu, Y., Ma, S., Morawska, L., Sangrador, J.L.T., Shaddick, G., Anderson, H.R., Vos, T., Forouzanfar, M.H., Burnett, R.T., Cohen, A., 2016. Ambient air pollution exposure estimation for the Global Burden of Disease 2013. Environ. Sci. Technol. 50, 79–88.
Brauer, M., Lencar, C., Tamburic, L., Koehoorn, M., Demers, P., Karr, C., 2008. A cohort study of traffic-related air pollution impacts on birth outcomes. Environ. Health Perspect. 116, 680–686.
Breiman, L., 2001a. Random forests. Mach. Learn 45, 5–32.
Breiman, L., 2001b. Statistical modeling: the two cultures. Stat. Sci. 16, 199–215.
Brunekreef, B., Holgate, S.T., 2002. Air pollution and health. Lancet 360, 1233–1242.
Byun, D., Schere, K., 2006. Review of the governing equations, computational algorithms, and other components of the Models-3 community Multiscale air quality (CMAQ) modeling system. Appl. Mech. Rev. 59, 51–77.
CARB, 2006. In: Procedure for Calculating State 8-hour Ozone Concentrations. Board, C.A.R.
Chèze, N., Poggi, J.-M., Portier, B., 2003. Partial and recombined estimators for nonlinear additive models. Stat. Inference Stoch. Process 6, 155–197.
CIESIN, 2016. Gridded Population of the World, Version 4 (GPWv4): Population Count. NASA Socioeconomic Data and Applications Center (SEDAC). Palisades, NY.
CMA China meteorology data, 2015. China Meteorological Administration. http://data.cma.gov.cn/ (05/18/2016).
de Vocht, F., Burstyn, I., Sanguanchaiyakrit, N., 2015. Rethinking cumulative exposure in epidemiology, again. J. Expo. Sci. Environ. Epidemiol. 25, 467–473.
Deutsch, C.V., Journel, A.G., 1998. GSLIB Geostatistical Software Library and User's Guide, second ed. Oxford University Press, New York.
Di, Q., Rowland, S., Koutrakis, P., Schwartz, J., 2017. A hybrid model for spatially and temporally resolved ozone exposures in the continental United States. J. Air Waste Manage Assoc. 67, 39–52.
Didan, K., Munoz, A.B., Solano, R., Huete, A., 2015. MODIS Vegetation Index User's Guide (MOD13 Series). Version 3.00 (Collection 6).
EPAROC Taiwan Air Quality Monitoring Network., 2015. http://taqm.epa.gov.tw (09/20/2016).
EPDHK Air Quality Monitoring Data., 2015. http://epic.epd.gov.hk/EPICDI/air/station/ (09/18/2016).
Forsythe, G.E., Malcolm, M.A., Moler, C.B., 1977. Computer Methods for Mathematical Computations. Prentice Hall Professional Technical Reference.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232.
GMAO, 2015. MERRA-2 Tavg1_2d_flx_Nx: 2d,1-hourly,time-averaged,single-level,assimilation,surface Flux Diagnostics V5.12.4, Version 5.12.4. Goddard Earth Sciences Data and Information Services Center (GES DISC), Greenbelt, MD, USA.
Guo, Y., Zeng, H., Zheng, R., Li, S., Barnett, A.G., Zhang, S., Zou, X., Huxley, R., Chen, W., Williams, G., 2016. The association between lung cancer incidence and ambient air pollution in China: a spatiotemporal analysis. Environ. Res. 144 (Part A), 60–65.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed. Springer-Verlag, New York.
Hengl, T., Heuvelink, G.B.M., Tadic, M.P., Pebesma, E.J., 2012. Spatio-temporal prediction of daily temperatures using time-series of MODIS LST images. Theor. Appl. Climatol. 107, 265–277.
Hoek, G., Beelen, R., de Hoogh, K., Vienneau, D., Gulliver, J., Fischer, P., Briggs, D., 2008. A review of land-use regression models to assess spatial variation of outdoor air pollution. Atmos. Environ. 42, 7561–7578.
Hu, J., Chen, J., Ying, Q., Zhang, H., 2016. One-year simulation of ozone and particulate matter in China using WRF/CMAQ modeling system. Atmos. Chem. Phys. 16, 10333–10350.
Hu, X., Belle, J.H., Meng, X., Wildani, A., Waller, L.A., Strickland, M.J., Liu, Y., 2017. Estimating PM2.5 concentrations in the conterminous United States using the random forest approach. Environ. Sci. Technol. 51, 6936–6944.
Jacob, D., 1999. Introduction to Atmospheric Chemistry. Princeton University Press, Princeton, New Jersey.
Jarvis, A., Reuter, H.I., Nelson, A., Guevara, E., 2008. Hole-filled seamless SRTM data V4.1. International Centre for Tropical Agriculture (CIAT). http://srtm.csi.cgiar.org (09/26/2016).
Jun, C., Ban, Y., Li, S., 2014. China: open access to Earth land-cover map. Nature 514, 434–434.
Kan, H., London, S.J., Chen, G., Zhang, Y., Song, G., Zhao, N., Jiang, L., Chen, B., 2008. Season, sex, age, and education as modifiers of the effects of outdoor air pollution on daily mortality in Shanghai, China: the Public Health and Air Pollution in Asia (PAPA) Study. Environ. Health Perspect. 116, 1183–1188.
Kanevski, M., 2008. Advanced Mapping of Environmental Data. John Wiley & Sons, Inc, Hoboken, NJ, USA.
Kilibarda, M., Hengl, T., Heuvelink, G.B.M., Graler, B., Pebesma, E., Tadic, M.P., Bajat, B., 2014. Spatio-temporal interpolation of daily temperatures for global land areas at 1 km resolution. J. Geophys. Res. Atmos. 119, 2294–2313.
Levy, I., Levin, N., Yuval, Schwartz, J.D., Kark, J.D., 2015. Back-extrapolating a land use regression model for estimating past exposures to traffic-related air pollution. Environ. Sci. Technol. 49, 3603–3610.
Li, J., Heap, A.D., Potter, A., Daniell, J.J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environ. Modell. Softw. 26, 1647–1659.
Li, M., Zhang, Q., Kurokawa, J.I., Woo, J.H., He, K., Lu, Z., Ohara, T., Song, Y., Streets, D.G., Carmichael, G.R., Cheng, Y., Hong, C., Huo, H., Jiang, X., Kang, S., Liu, F., Su, H., Zheng, B., 2017. MIX: a mosaic Asian anthropogenic emission inventory under the international collaboration framework of the MICS-Asia and HTAP. Atmos. Chem. Phys. 17, 935–963.
Liu, X.-H., Zhang, Y., Cheng, S.-H., Xing, J., Zhang, Q., Streets, D.G., Jang, C., Wang, W.X., Hao, J.-M., 2010. Understanding of regional air pollution over China using CMAQ, part I performance evaluation and seasonal variation. Atmos. Environ. 44, 2415–2426.
Lou, S., Liao, H., Yang, Y., Mu, Q., 2015. Simulation of the interannual variations of tropospheric ozone over China: roles of variations in meteorological parameters and anthropogenic emissions. Atmos. Environ. 122, 839–851.
Madaniyazi, L., Nagashima, T., Guo, Y., Pan, X., Tong, S., 2016. Projecting ozone-related mortality in East China. Environ. Int. 92–93, 165–172.
Marshall, J.D., Nethery, E., Brauer, M., 2008. Within-urban variability in ambient air pollution: comparison of estimation methods. Atmos. Environ. 42, 1359–1369.
MEPC, 2012. Ambient Air Quality Standard (GB3095-2012).
MEPC Air Quality Daily Report, 2015. http://datacenter.mep.gov.cn/ (01/30/2016).
NBSC, 2011. China statistical yearbook of 2011. In: National Bureau of Statistics of the People's Republic of China. China Statistics Press.
NBSC, 2016. China statistical yearbook of 2016. In: National Bureau of Statistics of the People's Republic of China. China Statistics Press.
OpenStreetMap contributors Planet dump, 2015. http://planet.openstreetmap.org (07/09/2016).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Reid, C.E., Jerrett, M., Petersen, M.L., Pfister, G.G., Morefield, P.E., Tager, I.B., Raffuse, S.M., Balmes, J.R., 2015. Spatiotemporal prediction of fine particulate matter during the 2008 Northern California wildfires using machine learning. Environ. Sci. Technol. 49, 3887–3896.
Seinfeld, J.H., Pandis, S.N., 2006. Atmospheric Chemistry and Physics: from Air Pollution to Climate Change. John Wiley & Sons, Hoboken, New Jersey.
Shang, Y., Sun, Z., Cao, J., Wang, X., Zhong, L., Bi, X., Li, H., Liu, W., Zhu, T., Huang, W., 2013. Systematic review of Chinese studies of short-term exposure to air pollution and daily mortality. Environ. Int. 54, 100–111.
Streets, D.G., Bond, T.C., Carmichael, G.R., Fernandes, S.D., Fu, Q., He, D., Klimont, Z., Nelson, S.M., Tsai, N.Y., Wang, M.Q., Woo, J.H., Yarber, K.F., 2003. An inventory of gaseous and primary aerosol emissions in Asia in the year 2000. J. Geophys. Res. Atmos. 108.
Tang, G., Wang, Y., Li, X., Ji, D., Hsu, S., Gao, X., 2012. Spatial-temporal variations in surface ozone in Northern China as observed during 2009–2010 and possible implications for future air quality control strategies. Atmos. Chem. Phys. 12, 2757–2776.
Temesgen, H., Ver Hoef, J.M., 2015. Evaluation of the spatial linear model, random forest and gradient nearest-neighbour methods for imputing potential productivity and biomass of the Pacific Northwest forests. For. Int. J. For. Res. 88, 131–142.
USEPA, 2009. Guidance on the Development, Evaluation, and Application of Environmental Models.
USEPA, 2013. In: Final Report: Integrated Science Assessment of Ozone and Related Photochemical Oxidants. Agency, U.S.E.P., Washington, DC.
Wager, S., Hastie, T., Efron, B., 2014. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 15, 1625–1651.
Wang, M., Keller, J.P., Adar, S.D., Kim, S.-Y., Larson, T.V., Olives, C., Sampson, P.D., Sheppard, L., Szpiro, A.A., Vedal, S., Kaufman, J.D., 2015. Development of long-term spatiotemporal models for ambient ozone in six metropolitan regions of the United States: the MESA Air study. Atmos. Environ. 123 (Part A), 79–87.
Wang, M., Sampson, P.D., Hu, J., Kleeman, M., Keller, J.P., Olives, C., Szpiro, A.A., Vedal, S., Kaufman, J.D., 2016. Combining land-use regression and chemical transport modeling in a spatiotemporal geostatistical model for ozone and PM2.5. Environ. Sci. Technol. 50, 5111–5118.
Wang, Y., Zhang, Y., Hao, J., Luo, M., 2011. Seasonal and spatial variability of surface ozone over China: contributions from background and domestic pollution. Atmos. Chem. Phys. 11, 3511–3525.
WHO Air Quality Guidelines: Global Update 2005, 2006. Particulate matter, ozone, nitrogen dioxide and sulfur dioxide (08/15/2016). http://www.euro.who.int/__data/assets/pdf_file/0005/78638/E90038.pdf.
Wise, E.K., Comrie, A.C., 2005. Meteorologically adjusted urban air quality trends in the Southwestern United States. Atmos. Environ. 39, 2969–2980.
Wong, C.-M., Vichit-Vadakan, N., Kan, H., Qian, Z., 2008. Public health and air pollution in Asia (PAPA): a multicity study of short-term effects of air pollution on mortality. Environ. Health Perspect. 116, 1195–1202.
Wu, R., Xie, S., 2017. Spatial distribution of ozone formation in China derived from emissions of speciated volatile organic compounds. Environ. Sci. Technol. 51, 2574–2583.
Zhan, Y., Luo, Y., Deng, X., Chen, H., Grieneisen, M.L., Shen, X., Zhu, L., Zhang, M., 2017. Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmos. Environ. 155, 129–139.
Zhang, L., Jacob, D.J., Boersma, K.F., Jaffe, D.A., Olson, J.R., Bowman, K.W., Worden, J.R., Thompson, A.M., Avery, M.A., Cohen, R.C., Dibb, J.E., Flock, F.M., Fuelberg, H.E., Huey, L.G., McMillan, W.W., Singh, H.B., Weinheimer, A.J., 2008. Transpacific transport of ozone pollution and the effect of recent Asian emission increases on air quality in North America: an integrated analysis using satellite, aircraft, ozonesonde, and surface observations. Atmos. Chem. Phys. 8, 6117–6136.
Zhang, L., Jacob, D.J., Downey, N.V., Wood, D.A., Blewitt, D., Carouge, C.C., van Donkelaar, A., Jones, D.B.A., Murray, L.T., Wang, Y., 2011. Improved estimate of the policy-relevant background ozone in the United States using the GEOS-Chem global model with 1/2° × 2/3° horizontal resolution over North America. Atmos. Environ. 45, 6769–6776.
Zhang, Y., Wang, Y., 2016. Climate-driven ground-level ozone extreme in the fall over the Southeast United States. PNAS 113, 10025–10030.
Zhao, S., Yu, Y., Yin, D., He, J., Liu, N., Qu, J., Xiao, J., 2016. Annual and diurnal variations of gaseous and particulate pollutants in 31 provincial capital cities based on in situ air quality monitoring data from China National Environmental Monitoring Center. Environ. Int. 86, 92–106.