Atmospheric Environment 98 (2014) 98–104
Prediction of ozone concentration in tropospheric levels using artificial neural networks and support vector machine at Rio de Janeiro, Brazil
A.S. Luna a,*, M.L.L. Paredes a, G.C.G. de Oliveira a, S.M. Corrêa b
a Rio de Janeiro State University, Institute of Chemistry, Rua São Francisco Xavier, 524, Maracanã, Rio de Janeiro 20550-013, Brazil
b Rio de Janeiro State University, Faculty of Technology, Rodovia Presidente Dutra Km 298, Polo Industrial, Resende, Rio de Janeiro 27537-000, Brazil
Highlights
- The tropospheric ozone concentration was predicted using chemometric tools.
- ANN and SVM were used to predict O3 with R2 up to 0.95.
- The predictive model is linked with the interaction of local-level meteorological conditions.
Article info
Article history: Received 8 April 2014; Accepted 25 August 2014; Available online

Abstract
It is well known that air quality is a complex function of emissions, meteorology and topography, and statistical tools provide a sound framework for relating these variables. The observed data were the concentrations of nitrogen dioxide (NO2), nitrogen monoxide (NO), nitrogen oxides (NOx), carbon monoxide (CO) and ozone (O3), together with scalar wind speed (SWS), global solar radiation (GSR), temperature (TEM) and moisture content in the air (HUM), collected by a mobile automatic monitoring station at two sites in the metropolitan area of Rio de Janeiro City during 2011 and 2012. The aims of this study were: (1) to analyze the behavior of the variables, using PCA for exploratory data analysis; and (2) to forecast O3 levels from primary pollutants and meteorological factors, using the nonlinear regression methods ANN and SVM. PCA showed that, for the first dataset, the variables NO, NOx and SWS have a greater impact on the concentration of O3, whereas for the other dataset TEM and GSR were the most influential variables. The results obtained with the nonlinear regression techniques SVM and ANN were remarkably close and acceptable for one dataset, with coefficients of determination for validation of 0.9122 and 0.9152 and root mean square errors of 7.66 and 7.85, respectively. For these datasets, PCA, SVM and ANN demonstrated their robustness as useful tools for evaluating and forecasting air quality scenarios. © 2014 Elsevier Ltd. All rights reserved.
Keywords: Air pollution; Artificial neural networks; Support vector machine; Ozone; Troposphere
1. Introduction
Tropospheric ozone can have a negative impact on the environment and public health when present in the lower atmosphere in sufficient quantities. Regulations have been introduced to set limits on the emissions of pollutants so that they do not exceed prescribed maximum values (EPA, 1999). To help achieve these limits, attention has been given to mathematical and computer modeling of air quality.
* Corresponding author: A.S. Luna. http://dx.doi.org/10.1016/j.atmosenv.2014.08.060
To trace and predict ozone, one must know the conditions that contribute to its formation. In addition, ozone concentrations are strongly connected to meteorological conditions, and land–sea breezes also affect ozone concentrations at coastal sites. Therefore, predicting ozone concentrations requires a model that describes and explains the relationships between ozone concentrations and the many variables that cause or inhibit ozone production (Abdul-Wahab and Al-Alawi, 2002). Recently, a group of researchers (Wilson et al., 2012) analyzed trends in ozone levels in the European troposphere from 1996 to 2005. They indicated that average levels have been increasing despite reductions in the pollutants that affect ozone formation. However, they also identified year-by-year variations, caused by
climate and weather events, and suggested that these could be masking the impact of emission reductions on long-term ozone trends. This study confirmed that the relationship between ozone and its precursors is complicated, because meteorological and chemical processes can interact over a remarkably wide range of temporal and spatial scales (e.g., Adame et al., 2008). For this reason, statistical tools provide a sound framework for the analysis of such data. Multiple linear regression (MLR), partial least squares (PLS) and principal component regression (PCR) were used to assess ambient air quality in Miskolc, Hungary. Ozone concentration was modeled by MLR and PCR with the same efficiency provided that the meteorological conditions did not change (i.e., morning and afternoon). Without night data, PCR and PLS suggest that the main process is not a photochemical but a chemical one (Lengyel et al., 2004). Abdul-Wahab et al. (2005) used multiple linear regression to predict the ozone concentration from chemical and meteorological factors; because multicollinearity made the resulting models unreliable, the authors turned to principal component regression (PCR), applying stepwise regression to choose the principal components entering the regression equation, with ozone as the dependent variable. To overcome multicollinearity, Alvarez et al. (2000) used rotated principal component analysis (RPCA), one of the most powerful and commonly used techniques in regional studies of atmospheric pollution. The main advantage of RPCA over other techniques is that it allows the spatial and temporal variability of pollutants to be analyzed simultaneously at the regional level.
In a related line of work, artificial neural networks (ANNs) were compared with multiple linear regression models for predicting the next day's maximum hourly ozone concentration in the Athens basin. Results based on a wide array of forecast quality measures indicate that ANNs provide better estimates of ozone concentrations at the monitoring sites, whilst the more commonly used linear models are less effective at forecasting high ozone concentrations (Chaloulakou et al., 2003). The support vector machine (SVM) can also be used for time-series prediction and has been reported to perform well, with promising results. One group of researchers developed an online SVM model to predict air pollutant levels in an advancing time series, based on the monitored air pollutant database for the downtown area of Hong Kong (Wang et al., 2008). In another application, SVM was used to predict hourly ozone values in the Madrid urban area. Using a modified SVM-r, based on reductions of the SVM-r hyper-parameter search space, the authors explored different influences that may alter the ozone forecast, such as previous ozone measurements at a given station, measurements at neighboring stations, and the impact of meteorological variables. A comparison with results obtained using an ANN (multilayer perceptron) was also carried out. The prediction tool based on SVM-r was flexible enough to incorporate other prediction variables, such as city models or traffic patterns, which may enhance the prediction (Ortiz-García et al., 2010). The objective of this study was to obtain parsimonious prediction models (i.e., models that depend on as few variables as needed) for ozone as a function of other ambient air concentration data and meteorological parameters, using support vector machine regression (SVM-r) and artificial neural network regression (ANN-r).
2. Chemometric techniques concepts
In this section, we briefly describe the primary characteristics of the chemometric techniques used in this work. Readers interested in a deeper treatment should consult publications such as Vandeginste et al. (2003).
2.1. Principal Component Analysis
Principal Component Analysis (PCA) is an unsupervised method of exploratory analysis. The aim of PCA is dimension reduction, which may be used for:
- visualization of multivariate data by scatter plots;
- transformation of highly correlating x-variables into a smaller set of uncorrelated latent variables that can be used by other methods;
- separation of relevant information (carried by a few latent variables) from noise;
- combination of many variables that characterize a chemical-technological process into a single or a few "characteristic" variables.
PCA can be seen as a method to compute a new orthogonal coordinate system formed by latent variables (components), where only the most informative dimensions are used. Latent variables from PCA optimally represent the distances between objects in the high-dimensional variable space; the distance between objects is a measure of their similarity. PCA is successful for data sets with correlating variables, as is often the case with data from chemistry (Häggblom, K.-E.). In PCA, the matrix X is decomposed into two matrices, scores and loadings, as shown in Eq. (1):
$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E} \qquad (1)$$
where X is the samples-by-variables matrix (m × n), T is the matrix of scores (m × number of PCs), P^T is the transposed matrix of loadings (number of PCs × n), and E is the residual matrix. The scores matrix (m × number of PCs) describes the samples, and the loadings matrix (number of PCs × n) describes the variables (Otto, 2007; Jolliffe, 2002). The variance of the PC scores, preferably given in percent of the total variance of the original variables, is an important indicator of how many PCs to include. Because of the way the PCs are determined, the scores of each new PC have a lower variance than the scores of the previous PC. If the score variance of a certain PC is close to the noise variance of the data, that PC does not contain any useful information, so the PCs retained should stop short of that PC. In practice, however, this criterion may be difficult to apply. A simple method is to plot (not shown) the cumulative variance of the scores (the variance of each new PC added to the variance of the previous PCs) against the PC number. As a rule of thumb, the PCs should explain at least 80%, maybe 90%, of the total variance. The PCA result and the data can be analyzed and evaluated in many ways. In this work, a biplot was chosen because it combines a scores plot and a loadings plot; it gives information about:
- clustering of objects and variables;
- relationships between objects and variables.
Objects and variables tend to be positively correlated if they are close to each other and negatively correlated if they are far away from each other (Häggblom, K.-E.).
2.2. Support vector machine regression (SVM-R)
The support vector machine (SVM) is a machine learning technique developed by Vapnik that is increasingly gaining ground in many areas of knowledge. Originally, this technique was developed for pattern recognition problems. The model consists of a number of support vectors (essentially samples selected from the
calibration set) and non-linear model coefficients, which define the non-linear mapping of variables in the input x-block. The model allows prediction of the continuous y-block variable. SVM is implemented using the LIBSVM package, which provides both epsilon-support vector regression (epsilon-SVR) and nu-support vector regression (nu-SVR). The original SVM formulation for regression (SVR) used the parameters C ∈ [0, ∞) and epsilon ∈ [0, ∞) to penalize, in the optimization, points that were not accurately predicted. An alternative version of SVM regression was later developed in which the epsilon penalty parameter was replaced by a parameter nu ∈ [0, 1], which applies a slightly different penalty: nu represents an upper bound on the fraction of training samples that are errors (badly predicted) and a lower bound on the fraction of samples that are support vectors. Some users feel nu is more intuitive to use than C or epsilon. Epsilon and nu are just different versions of the penalty parameter (Eigenvector Documentation). Recently, with the inclusion of loss functions in its structure, the SVM has been extended to nonlinear regression and time-series forecasting (Lu and Wang, 2005). A group of researchers used the SVM technique to estimate highly nonlinear source–receptor relationships between precursor emissions and pollutant concentrations. They applied the SVM model to solve the multi-objective air quality control problem in the Avilés industrial urban area in Spain (Sánchez Suárez et al., 2011). This technique was used to predict the ozone concentration in this study.
2.3. Artificial neural networks regression (ANN-regression)
By analogy with the human nervous system, the ANN technique has nodes in one or more layers, linked by connections called synapses. According to Fiorin et al.
(2011), this technique is capable of storing knowledge, and its uses include function fitting, pattern recognition, predictive modeling and other applications in different areas. The method has a high capacity for self-organization and temporal processing, which enables it to solve a variety of highly complex problems. In this context, the multilayer perceptron, one type of artificial neural network, has been shown to be a valuable tool for prediction, function approximation and classification. The benefits of the multilayer perceptron approach are particularly evident in applications where a complete theoretical model cannot be built, and especially when dealing with non-linear systems (Gardner and Dorling, 1998). According to Abdul-Wahab and Al-Alawi (2002), ANN-based models have the potential to describe nonlinear relationships such as those that control the production of O3. Moreover, these authors successfully used the technique to predict the ozone concentration from meteorological and chemical data.
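To make the multilayer perceptron concrete, the following is a minimal sketch of a single feed-forward pass through a network shaped like the one described later in Section 4.3.2 (eight inputs, 20 sigmoidal hidden nodes, one linear output). It is not the Matlab model used in this work: the weights are random and untrained, tanh is assumed as the sigmoidal function, and the input values and all function names are made up for illustration.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, hidden, output):
    """Feed-forward pass: tanh hidden layer followed by one linear output node."""
    w_h, b_h = hidden
    w_o, b_o = output
    # Each hidden node applies a sigmoidal (here tanh) function to a weighted sum.
    h = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
         for row, b in zip(w_h, b_h)]
    # The output node is linear: a weighted sum of the hidden activations.
    return sum(w * v for w, v in zip(w_o[0], h)) + b_o[0]

rng = random.Random(0)
hidden_layer = init_layer(8, 20, rng)    # 8 inputs -> 20 hidden nodes
output_layer = init_layer(20, 1, rng)    # 20 hidden nodes -> 1 output (O3)

# Hypothetical autoscaled inputs: NO, NOx, NO2, CO, SWS, TEMP, HUM, GSR.
sample = [0.1, -0.3, 0.5, 0.0, 1.2, -0.7, 0.4, 0.9]
prediction = forward(sample, hidden_layer, output_layer)
```

Training would adjust the weights and biases so that the output matches measured O3 values; the sketch only shows how a prediction is computed once the weights are known.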
Table 1
Summary of the correlation between the input variables and ozone concentration for all databases.

Correlation of variables with O3 for the studied databases
Variable   PUC_2   UERJ_2   PUC_UERJ_2
NOx        0.65    0.57     0.58
NO         0.64    0.72     0.62
SWS        0.54    0.50     0.49
TEM        0.43    0.75     0.33
CO         0.50    0.49     0.45
GSR        0.34    0.49     0.36
HUM        0.27    0.66     0.36
NO2        0.30    0.12     0.07
3.2. Data collection
The original data were arranged in matrix form. The following variables were used as columns: content of nitrogen dioxide (NO2), nitrogen monoxide (NO), nitrogen oxides (NOx), carbon monoxide (CO), and ozone (O3) in the air, scalar wind speed (SWS), global solar radiation (GSR), temperature (TEMP), and moisture content in the air (HUM). The samples were arranged in rows in the order in which they were taken, as hourly averages. The databases were created under different conditions (pollution sources and weather) for the two localities studied, and in different seasons, so different results are expected for the databases PUC_2 and UERJ_2. This study proposes models for predicting ozone levels for PUC_2 and UERJ_2 separately, constructed from data obtained by the mobile station at PUC-Rio and UERJ, respectively. Additionally, to pose a greater challenge to the techniques and assess their robustness, the two databases were merged into a single database named PUC_UERJ_2, for which each technique should provide a single prediction model for the ozone content in the air. Data obtained on rainy days were removed from the databases, since rain lowers the levels of pollutants. Furthermore, data obtained on weekends and holidays need to be treated separately, because vehicular emissions are drastically reduced and, therefore, the concentrations of pollutants differ markedly from those on weekdays. Data obtained at night must also be removed, because the mixing layer of the atmosphere is extremely low, which increases the concentration of some primary pollutants, and photochemical reactions are virtually nonexistent.
After this first refinement of the databases, in accordance with the aforementioned considerations, three databases were obtained:
- PUC_2: 643 samples with 9 variables (643 × 9);
- UERJ_2: 453 samples with 9 variables (453 × 9);
- the combined database PUC_UERJ_2: 1096 samples with 9 variables (1096 × 9).
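The refinement rules above (drop rainy days, set aside weekends and holidays, drop night-time hours) can be sketched as a simple filter over hourly records. This is only an illustration under stated assumptions: the field names `time` and `rain_mm`, the 06:00 to 18:00 daytime window, and the holiday list are hypothetical and are not taken from the study.

```python
from datetime import datetime, date

# Hypothetical holiday list; 7 September is Brazil's Independence Day.
HOLIDAYS = {date(2011, 9, 7)}

def refine(records, day_start=6, day_end=18):
    """Keep dry, daytime, working-day records (field names are illustrative)."""
    # A whole day is discarded if any of its hourly records reports rain.
    rainy_days = {r["time"].date() for r in records if r["rain_mm"] > 0}
    kept = []
    for r in records:
        t = r["time"]
        if t.date() in rainy_days:
            continue
        if t.weekday() >= 5 or t.date() in HOLIDAYS:   # Saturday=5, Sunday=6
            continue
        if not day_start <= t.hour < day_end:          # assumed daytime window
            continue
        kept.append(r)
    return kept

# Made-up hourly records for a few days in September 2011.
records = [
    {"time": datetime(2011, 9, 5, 10), "rain_mm": 0.0},   # Monday, daytime
    {"time": datetime(2011, 9, 5, 23), "rain_mm": 0.0},   # Monday, night
    {"time": datetime(2011, 9, 6, 10), "rain_mm": 0.0},   # Tuesday, rainy day
    {"time": datetime(2011, 9, 6, 15), "rain_mm": 2.0},   # Tuesday, rain
    {"time": datetime(2011, 9, 7, 12), "rain_mm": 0.0},   # holiday
    {"time": datetime(2011, 9, 8, 14), "rain_mm": 0.0},   # Thursday, daytime
    {"time": datetime(2011, 9, 10, 12), "rain_mm": 0.0},  # Saturday
]
kept = refine(records)
```

In this sketch the weekend and holiday records are simply excluded; a separate treatment, as the text suggests, would collect them into their own dataset instead.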
3. Materials and methods
3.1. Area description
The air pollution data for Rio de Janeiro city used in this study were provided by an automated mobile station, which provides hourly O3 measurements, at two different points of the city. The databases were built from data collected by the mobile station of the Secretary of Environment of Rio de Janeiro city, first at the Pontifical Catholic University (PUC-Rio) (latitude 22.97° S, longitude 43.23° W) from July to October 2011, covering winter and spring. After collecting data at PUC-Rio, the mobile station was parked at Rio de Janeiro State University (UERJ) (latitude 22.91° S, longitude 43.23° W) from November 2011 to March 2012, covering spring and summer.
3.3. Software and the methodology of the chemometric analysis
The software Matlab R2008b version 7.7.0 (Mathworks, USA) was used to construct the artificial neural network (ANN) models, and PLS Toolbox 6.2 (Eigenvector Research, USA) was used to construct the principal component analysis (PCA) and support vector machine (SVM) models. The analysis was divided according to the chemometric techniques used. First, a pattern recognition study was carried out using principal component analysis (PCA). Second, the ozone concentration was modeled using artificial neural network regression (ANN) and support vector machine regression (SVM).
4. Results and discussion
4.1. Statistical analysis
As shown in Table 1, the correlation analysis of the data reveals that the variables GSR and TEMP in PUC_2 show small to moderate levels of correlation with ozone, while in the UERJ_2 dataset TEMP, NO and HUM are highly correlated with it. One explanation for the different correlations of GSR and TEMP with ozone in the two databases is that, during fall, winter and spring, the sun's rays reach the earth's surface at a lower angle than in summer. Thus, the intensity of solar radiation is lowest in winter, intermediate in fall and spring, and highest in summer. Ozone concentrations were negatively correlated with NO and NOx, as expected since these pollutants are known precursors of ozone, indicating that a rise in ozone concentration follows a drop in the levels of these variables. Similarly, the correlation between ozone and HUM is also negative, owing to the reaction between water and ozone that leads to the oxidizing species OH, according to Guicherit and Roemer (2000).
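As a reminder of how such correlations are obtained, here is a minimal sketch of the Pearson correlation coefficient between one input variable and ozone. The numbers are invented for illustration and are not taken from the PUC_2 or UERJ_2 databases; the function name is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Made-up hourly values: O3 falls as NO rises, so the coefficient is negative.
no = [12.0, 20.0, 35.0, 50.0, 64.0]
o3 = [70.0, 61.0, 48.0, 30.0, 21.0]
r_no_o3 = pearson(no, o3)
```

The sign of the coefficient carries the direction of the relationship, which is what the discussion of NO, NOx and HUM relies on; its magnitude measures the strength of the linear association.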
4.2. Pattern recognition using principal component analysis
In the present work, PCA was applied to investigate patterns of variability relating ozone concentration, air pollution, and meteorological variables. Autoscaling was used as pretreatment; that is, the data were centered on the mean and then divided by the standard deviation. Subsequently, cross-validation with contiguous blocks, which is sufficient for the size of the datasets, was used to validate the model. Table 2 summarizes the figures of merit for the models obtained from the different datasets. As one can see, remarkably similar cumulative variances were obtained for the three datasets, but only two PCs were required to reach this level for UERJ_2, while three were needed for PUC_2 and PUC_UERJ_2. This comparison suggests that the relationship between ozone content and the studied variables is slightly less complex for UERJ_2. For a given PC, the absolute magnitude of the loading of a variable defines how much that variable contributes to the PC. The results for the UERJ_2 (Fig. 1) and PUC_2 (Fig. 2) datasets show that PC1 is the component that accounts for the largest percentage of the variance explained by the model, captured from the variables containing the oxygen atom (NO, NOx, CO, O3) and the meteorological variables (SWS, GSR, TEMP and HUM). As can be seen in Figs. 1 and 2, both datasets showed that the primary pollutants NO, NOx and CO have significant loadings in PC1 and behave in the opposite way to O3. This can be explained by the fact that the formation of O3 consumes NOx and, therefore, NO; that is, when the levels of these primary pollutants are higher, the concentration of O3 is lower. At the same time, PC1 is closely associated with the meteorological variables. As can be seen in Figs.
1 and 2, the loading of the variable HUM is clearly associated with rainfall, because the UERJ_2 dataset was obtained from spring to summer and the other dataset from winter to spring. The temporal variation of HUM can be associated with the total rainfall. The average total rainfall for the two periods in Rio de Janeiro is 117 and 70 mm, respectively (http://www.worldweatheronline.com/Rio-De-Janeiro-weather-averages/Rio-De-Janeiro/BR.aspx). This shows that the variability patterns are usually associated with the interaction of regional-level meteorological conditions.
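The autoscaling step and the cumulative explained variance used to choose the number of PCs can be illustrated with a small, self-contained sketch. It is not the PLS Toolbox procedure used in this work: eigenvalues are obtained here by power iteration with deflation (an assumption made to avoid external libraries), the three variables are synthetic, and all names are illustrative.

```python
import math
import random

def autoscale(columns):
    """Center each variable on its mean and divide by its sample standard deviation."""
    scaled = []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        scaled.append([(v - mean) / std for v in col])
    return scaled

def covariance(columns):
    """Covariance matrix of centered variables (the correlation matrix after autoscaling)."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]

def leading_eigenpair(matrix, rng, iterations=1000):
    """Power iteration: largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    vec = [rng.uniform(0.1, 1.0) for _ in matrix]
    value = 0.0
    for _ in range(iterations):
        new = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in new))
        if norm < 1e-12:              # no variance left to extract
            return 0.0, vec
        vec = [v / norm for v in new]
        value = norm
    return value, vec

def explained_variance(columns, n_components):
    """Cumulative fraction of the total variance captured by the first PCs."""
    rng = random.Random(0)
    matrix = covariance(autoscale(columns))
    size = len(matrix)
    total = sum(matrix[i][i] for i in range(size))
    cumulative, running = [], 0.0
    for _ in range(n_components):
        value, vec = leading_eigenpair(matrix, rng)
        running += value
        cumulative.append(running / total)
        # Deflation: remove the extracted component before finding the next one.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(size)]
                  for i in range(size)]
    return cumulative

# Synthetic data: "o3" is anti-correlated with "no"; "sws" is unrelated noise.
data_rng = random.Random(42)
no = [data_rng.gauss(30.0, 10.0) for _ in range(200)]
o3 = [80.0 - 1.5 * v + data_rng.gauss(0.0, 5.0) for v in no]
sws = [data_rng.gauss(2.0, 0.5) for _ in range(200)]
cumulative = explained_variance([no, o3, sws], 3)
```

The list `cumulative` plays the role of the "cumulative explained variance" row of Table 2: the number of PCs is chosen as the smallest k for which the value reaches the desired level.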
4.3. Modeling of ozone concentration in tropospheric levels
4.3.1. Support vector machine regression
Autoscaling was used to preprocess the datasets before applying the regression technique. SVM-R was applied with 10-fold cross-validation. Epsilon support vector regression gave better results than nu support vector regression for the prediction of O3 content. Table 3 compares the figures of merit of the models obtained for the different datasets using support vector machines. RMSEC and RMSECV are the root mean square errors of calibration and cross-validation, respectively, and indicate the size of the calibration and prediction errors. These values range from zero to infinity, with zero being the best possible value for a model. The calculation of RMSECV is shown in Eq. (2):
$$\mathrm{RMSECV} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - M_i\right)^2} \qquad (2)$$
where P_i is the predicted value, M_i is the corresponding measured value, and N is the number of segments. The UERJ_2 dataset clearly showed smaller calibration and prediction errors than the other datasets. The coefficients of determination of calibration (R2Cal) and validation (R2CV) obtained with the SVM regression model for the UERJ_2 dataset were higher than those for the PUC_2 or PUC_UERJ_2 datasets. The SVM regression model performed well for UERJ_2, which seems to be a dataset with a simpler relationship between ozone concentration and the other observed variables. The predicted versus observed plot is useful for model and sample diagnostics. Ideally, the points fall on a line of slope 1 and intercept 0. A nonzero intercept indicates a constant bias between the known and forecast values, and a slope not equal to 1 indicates a proportional bias. Fig. 3 shows no bias and reasonable agreement between predicted and observed values of ozone concentration.

Table 2
Results from the PCA models.

Figures of merit                    PUC_2   UERJ_2   PUC_UERJ_2
Number of PCs (k)                   3       2        3
Cumulative explained variance, %    78.0    75.1     77.9

The White test is a statistical test that establishes whether the residual variance of a variable in a regression model is constant; it is therefore a test for homoscedasticity. This test and an estimator for heteroscedasticity-consistent standard errors were proposed by Halbert White in 1980. To test for constant variance, one performs an auxiliary regression analysis, which regresses the squared residuals from the original regression model onto a set of regressors containing the original regressors, their cross-products and their squares. One then inspects the R2 (coefficient of determination) of this auxiliary regression. The Lagrange multiplier (LM) test statistic is the product of the R2 value and the sample size (n):
$$\mathrm{LM} = nR^2 \qquad (3)$$
This follows a chi-squared distribution with degrees of freedom equal to the number of estimated parameters (in the auxiliary regression). For all regression models examined, the White test indicated that the residuals variance were constant (p-value <0.05) (White, 1980).
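The root mean square error of Eq. (2) and the White test of Eq. (3) can be sketched for the simplest case of a single regressor. This is an illustration under stated assumptions rather than the procedure applied to the ozone models: the data are synthetic and deterministic, the auxiliary regression uses a constant, x and x^2, the degrees of freedom are taken as 2 (the non-constant auxiliary regressors, the usual convention), and all function names are hypothetical. For 2 degrees of freedom the chi-squared upper-tail probability is exactly exp(-LM/2).

```python
import math

def rmse(predicted, measured):
    """Root mean square error between predicted and measured values, as in Eq. (2)."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x

def ols_fitted(design, y):
    """Fitted values of a least-squares regression via the normal equations."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    coef = solve(xtx, xty)
    return [sum(c * v for c, v in zip(coef, row)) for row in design]

def r_squared(y, fitted):
    """Coefficient of determination of a fitted regression."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

def white_test(x, y):
    """White test with one regressor: returns (LM statistic, p-value)."""
    main = [[1.0, xi] for xi in x]
    squared_resid = [(yi - fi) ** 2 for yi, fi in zip(y, ols_fitted(main, y))]
    # Auxiliary regression of squared residuals on a constant, x and x squared.
    aux = [[1.0, xi, xi * xi] for xi in x]
    lm = len(x) * r_squared(squared_resid, ols_fitted(aux, squared_resid))
    return lm, math.exp(-lm / 2.0)   # chi-squared upper tail, 2 degrees of freedom

# Deterministic synthetic data with alternating-sign errors.
n = 100
x = [1.0 + 9.0 * i / (n - 1) for i in range(n)]
sign = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
y_constant = [2.0 + 3.0 * xi + s for xi, s in zip(x, sign)]        # error size fixed
y_growing = [2.0 + 3.0 * xi + s * xi for xi, s in zip(x, sign)]    # error grows with x
lm_constant, p_constant = white_test(x, y_constant)
lm_growing, p_growing = white_test(x, y_growing)
```

Under the usual convention, a small LM statistic (large p-value) means the hypothesis of constant residual variance is not rejected, while a large LM statistic (small p-value) points to heteroscedasticity, as in the second synthetic series.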
Fig. 1. PCA results from UERJ_2 dataset.
Fig. 2. PCA results from PUC_2 dataset.
4.3.2. Artificial neural networks regression (ANN-regression)
To build an ANN, some parameters must be defined: the number of network layers, the number of nodes in each layer, the type of connection between the nodes, and the network topology. The number of layers to be chosen depends on the linearity of the data. A multilayer ANN has more than one node between the input and the output of the network. The connections between the nodes can be either acyclic (feed-forward) or cyclic (feedback). A connection is acyclic when the output of a node serves as input to a node of a subsequent layer. A connection is cyclic when the output of a node serves as input to a node of the same layer or a previous layer. When the output is fed back into the input layers, the network is said to be recurrent (Fiorin et al., 2011). For all three databases (PUC_2, UERJ_2 and PUC_UERJ_2), the networks were chosen to be acyclic (feed-forward), with eight input nodes representing the variables NO, NOx, NO2, CO, SWS, TEMP, HUM and GSR, and one output node for the target variable O3, arranged in two layers with sigmoidal hidden nodes and a linear output node. The hidden layer had 20 nodes, a value set by the Matlab software used. The
networks were trained by static supervised training with the Levenberg–Marquardt method, a standard technique for solving nonlinear least-squares curve-fitting problems. In the first stage, the networks were parameterized; in the second stage, each dataset was divided into three subsets: a training set with 70% of the data, and a validation set and a test set with 15% of the data each. The samples in each subset were chosen by the Kennard–Stone algorithm. Fig. 4 shows the plots of predicted versus measured data for the training, validation, test and complete sets of the UERJ_2 dataset. The dotted lines in Fig. 4 represent the line where the predicted values are equal to the measured ones (Y = T), and the solid lines
Table 3
Figures of merit obtained for the O3 prediction using epsilon-SVM regression.

Figures of merit   PUC_2    UERJ_2   PUC_UERJ_2
RMSEC              11.62    5.95     11.95
RMSECV             13.79    7.66     14.47
R2Cal              0.7834   0.9483   0.8045
R2CV               0.6915   0.9122   0.7099
Fig. 3. Predicted versus observed for ozone concentration model using SVM-R for UERJ_2 dataset.
represent the model. The slopes of the model lines are influenced by the distribution of the samples: the smaller the angle between the two lines, the closer the predicted values are to the measured ones. The higher the R value, the smaller the distance between the observed and predicted values. Again, as for the SVM
Fig. 4. Predicted versus measured data for the training, validation, test and complete sets using ANN regression for the UERJ_2 dataset.
Table 4
Figures of merit obtained from ANN regression model applied to three datasets.

Figures of merit   PUC_2    UERJ_2   PUC_UERJ_2
RMSEC              12.37    6.32     11.91
RMSECV             12.95    7.85     14.57
RMSEP              16.51    8.10     15.27
R2Training         0.7310   0.9429   0.8013
R2Validation       0.7543   0.9151   0.7205
R2Test             0.6527   0.8864   0.6583
technique, the ANN achieved the best results for the database UERJ_2. Table 4 shows the figures of merit, for all datasets, of the models obtained with the ANN regression technique. In agreement with SVM, ANN was most successful in predicting O3 for the UERJ_2 dataset, with low RMSEC and RMSECV and high coefficients of determination. The results for PUC_2 were again slightly worse; however, the results were acceptable for the PUC_UERJ_2 dataset, as already mentioned for the SVM technique. Both techniques produced less suitable models for the database PUC_2, which seems to involve a more complex relationship between ozone concentration and the other monitored variables. This is probably due to its geographical location, since regions of high atmospheric turbulence form between the sea and a lagoon and strongly influence the variation of pollutant concentrations. The dataset shows high variability because there are sudden changes in the concentrations of pollutants, which makes it difficult to predict O3 concentrations as a function of the monitored variables.
5. Conclusion
The artificial neural network and support vector machine techniques were used to estimate the complex relationship between ozone and other variables based on ambient air monitoring measurements. The results provide insight into the dependence of ozone concentrations on other key pollutant concentrations and on meteorological conditions. The models' predictions and the actual observations were found to be consistent. The relative importance of the various input variables was also examined.
In particular, ozone concentration was negatively correlated with CO, NO and NOx, as expected since these pollutants are known precursors of ozone, indicating that a rise in ozone concentration follows a drop in the levels of these variables. Similarly, the correlation between ozone and HUM is also negative, owing to the reaction between water and ozone that leads to the oxidizing species OH. This study suggests that both chemometric techniques can be used to model and predict ground-level ozone concentrations, with coefficients of determination (R2) up to about 0.95. Clearly, this study has indicated the potential of chemometric tools for capturing the non-linear
interactions between ozone and other factors and for identifying the relative importance of these factors. Artificial neural network and support vector machine modeling therefore provide an easy way of modeling and analyzing air pollutants and could be used in conjunction with other methods. The results also support the view that variability patterns are usually associated with the interaction of regional-level meteorological conditions.

Acknowledgments
Luna, A. S. and Corrêa, S. M. thank the support of UERJ (Programa Prociência), FAPERJ and CNPq. The authors thank SMA, the Secretary of Environment of Rio de Janeiro city, for providing data.

References
Abdul-Wahab, S.A., Al-Alawi, S.M., 2002. Assessment and prediction of tropospheric ozone concentration levels using artificial neural networks. Environ. Model. Softw. 17, 219–228.
Abdul-Wahab, S.A., Bakheit, C.S., Al-Alawi, S.M., 2005. Principal component and multiple regression analysis in modelling of ground-level ozone and factors affecting its concentrations. Environ. Model. Softw. 20, 1263–1271.
Adame, J.A., Lozano, A., Bolívar, J.P., de la Morena, B.A., Contreras, J., Godoy, F., 2008. Behavior, distribution and variability of surface ozone at an arid region in the south of Iberian Peninsula (Seville, Spain). Chemosphere 70, 841–849.
Alvarez, E., de Pablo, F., Tomás, C., Rivas, L., 2000. Spatial and temporal variability of ground-level ozone in Castilla-León (Spain). Int. J. Biometeorol. 44, 44–51.
Chaloulakou, A., Saisana, M., Spyrellis, N., 2003. Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens. Sci. Total Environ. 313, 1–13.
Eigenvector Documentation: SVM Support Vector Machine for Regression, see http://wiki.eigenvector.com/index.php?title=SVM_Function_Settings.
EPA, 1999. Guideline for Developing an Ozone Forecasting Program. Environmental Protection Agency. EPA-454/R-99-009.
Fiorin, D.V., Martins, F.R., Schuch, N.J., Pereira, E.B., 2011. Aplicações de redes neurais e previsões de disponibilidade de recursos energéticos solares. Rev. Bras. Ens. Fis. 33, 1309–1320.
Gardner, M.W., Dorling, S.R., 1998. Artificial neural networks (the multilayer perceptron) – a review of applications in the atmospheric sciences. Atmos. Environ. 32, 2627–2636.
Guicherit, R., Roemer, M., 2000. Tropospheric ozone trends. Chemosphere 2, 167–183.
Häggblom, K.-E. http://www.users.abo.fi/khaggblo/MMDA/MMDA6.pdf.
Jolliffe, I.T., 2002. Principal Component Analysis, second ed. Springer, New York, USA.
Lengyel, A., Héberger, K., Paksy, L., Bánhidi, O., Rajkó, R., 2004. Prediction of ozone concentration in ambient air using multivariate methods. Chemosphere 57, 889–896.
Lu, W., Wang, W., 2005. Potential assessment of the "support vector machine" method in forecasting ambient air pollutant trends. Chemosphere 59, 693–701.
Ortiz-García, E.G., Salcedo-Sanz, S., Pérez-Bellido, A.M., Portilla-Figueras, J.A., Prieto, L., 2010. Prediction of hourly O3 concentrations using support vector regression algorithms. Atmos. Environ. 44, 4481–4488.
Otto, M., 2007. Chemometrics: Statistics and Computer Application in Analytical Chemistry, second ed. Wiley-VCH, Weinheim, Germany.
Sánchez Suárez, A., García Nieto, P.J., Riesgo Fernández, P., del Coz Díaz, J.J., Iglesias-Rodríguez, F.J., 2011. Application of an SVM-based regression model to the air quality study at local scale in the Avilés urban area (Spain). Math. Comput. Model. 54, 1453–1466.
Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., de Jong, S., Lewi, P.J., Smeyers-Verbeke, J., 2003. Handbook of Chemometrics and Qualimetrics: Part B. Elsevier, Amsterdam, Netherlands.
Wang, W., Men, C., Lu, W., 2008. Online prediction model based on support vector machine. Neurocomputing 71, 550–558.
Wilson, R.C., Fleming, Z.L., Monks, P.S., Clain, G., Henne, S., Konovalov, I.B., Szopa, S., Menut, L., 2012. Have primary emission reduction measures reduced ozone across Europe? An analysis of European rural background ozone trends 1996–2005. Atmos. Chem. Phys. 12, 437–454.
White, H., 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48 (4), 817–838.