Atmospheric Environment 217 (2019) 116956
Atmospheric Environment journal homepage: www.elsevier.com/locate/atmosenv
Effects of meteorological factors to reduce large-scale PM10 emission estimation errors on unpaved roads
Tse-Huai Liu, Yoojung Yoon∗ Civil & Environmental Engineering, West Virginia University, Morgantown, WV, 26506, USA
ARTICLE INFO
Keywords: Unpaved roads; Meteorological factors; Wind speed; Temperature; Particulate matter
ABSTRACT
Unpaved roads are one of the major sources of particulate matter (PM), which is known to negatively affect human health, the environment, and vegetation. The control of PM10 and PM2.5 (i.e., aerodynamic diameter < 10 μm and < 2.5 μm, respectively) is an increasing concern among many local agencies. The U.S. Environmental Protection Agency (EPA) provided the first emission factor equation for unpaved roads in 1975, and several large-scale or site-specific PM10 estimation models also exist. These models are mostly linear regression models that relate PM10 emissions to multiple input variables representing the physical and operational characteristics of vehicles (e.g., vehicle speed, passes, and weights) as well as road surface material properties (e.g., silt and moisture content). Scientific knowledge is still lacking, however, on the potential effects of meteorological factors such as wind speed, temperature, and cloud cover, which could enhance the quality of PM10 estimation models for unpaved roads. Therefore, this paper presents the results of a study that used a hypothesis test to investigate whether meteorological factors reduce the performance errors of large-scale PM10 estimation models for unpaved roads. The PM10 models for the performance analysis were developed using Principal Component Analysis (PCA) and an Artificial Neural Network (ANN). The raw data were retrieved from the EPA report, “Emission Factor Documentation for AP-42,” published in 1998, and two weather-related input variables (i.e., mean wind speed and temperature) were identified along with other vehicle- and road-related input variables. The performance analysis results and the hypothesis test based on two common error measures (i.e., root mean square error and mean absolute error) indicated that including either one or both of the two weather-related variables offered no advantage in enhancing the predictive ability of large-scale PM10 estimation models.
∗ Corresponding author. E-mail addresses: [email protected] (T.-H. Liu), [email protected] (Y. Yoon).
https://doi.org/10.1016/j.atmosenv.2019.116956 Received 13 April 2019; Received in revised form 2 September 2019; Accepted 4 September 2019; Available online 5 September 2019. 1352-2310/© 2019 Elsevier Ltd. All rights reserved.
1. Introduction
Particulate matter (PM) with an aerodynamic diameter of less than 10 μm and 2.5 μm (PM10 and PM2.5, respectively) has been found to have significant impacts on human health, the environment, and vegetation (Dockery, 2001; Schwartz et al., 1993; Pope et al., 2002; Molina and Molina, 2004). Based on studies conducted by the World Health Organization (WHO) of approximately 3,000 cities in 103 countries, 98% of the cities with more than 100,000 inhabitants in low- and middle-income countries did not meet the WHO air-quality guidelines (AQGs) from 2008 to 2015 (WHO, 2016). PM is one of the key air pollutants included in the WHO guidelines, along with ozone (O3), nitrogen dioxide (NO2), and sulfur dioxide (SO2). Identifying emission sources is necessary to reduce the impact of PM efficiently. Unpaved roads are one of the main contributors to stationary source PM. According to the 2014 NEI Report, the stationary emissions for PM10 and PM2.5 in 2014 represented 89.6% and 66.9% of the total PM10 and PM2.5 emissions, respectively (U.S. Environmental Protection Agency (EPA), 2016). Unpaved roads account for 51.9% and 27.2% of the stationary emissions for PM10 and PM2.5, respectively. In Highway Statistics (2017), unpaved roads made up around 33% of the U.S. public road and street mileage (U.S. Department of Transportation, 2018). A recent survey found that many local road agencies are converting deteriorated paved roads that carry very low traffic volumes in rural areas to unpaved roads because keeping those roads paved is not economically feasible (Fay et al., 2016). As a result, there is increasing concern among many local agencies as to how the resuspension of PM10 and PM2.5 from unpaved roads can be controlled. PM emissions on unpaved roads result from the interactions between vehicle tires and loose, erodible road surfaces and depend on various factors drawn from unpaved road source conditions (e.g., vehicle operations, road properties, and weather). The emission factor is
the widely used representative value for quantifying the pollutant emissions generated by various activities. The emission factor for unpaved roads is expressed as the weight of PM per vehicle mile traveled (lb/VMT; 1 lb/VMT = 281.9 g/VKT). The EPA introduced the first emission factor equation for unpaved roads in 1975 in a document entitled “Compilation of Air Pollutant Emission Factors” (also called AP-42), alongside other emission factors (around 9,000 emission factors exist) (EPA, 1998). The first EPA emission factor equation has been modified and expanded over the years to address various source conditions. The EPA presented a new emission factor equation in the section report for unpaved roads published in 1998 after testing other potential input variables (e.g., mean vehicle weight (W), mean vehicle speed (S), mean number of wheels (w), surface material silt content (s), and surface material moisture content (M)). In the resulting EPA emission factor equation (see Eqs. (4) and (5) in Background Documentation, Emission Factor Documentation for AP-42 – Section 13.2.2, in 1998), three input variables (i.e., W, s, and M) were formulated in a logarithmic relationship with the emission factor (EPA, 1998). In 2006, the 1998 emission factor equation was separated into distinct applications for public unpaved roads with mostly light-duty vehicles (weighing less than 3 tons) and for industrial plant roads (EPA, 2006). The separate emission factors also were intended to eliminate the following undesirable features of the 1998 emission factor equation: 1) overpredicted PM estimates for vehicle speeds less than 15 mph (= 24.14 kph), 2) underestimated effects of watering unpaved roads for dust control, and 3) the inclusion of PM emissions from vehicle exhaust, brake wear, and tire wear (EPA, 2006).
Emission factors similar to those used by the EPA can be found in several other countries and agencies (e.g., Germany, Sweden, Norway, and the European Environment Agency), where they are applied to emission models for non-exhaust PM on paved or unpaved roads, considering vehicle- and/or road surface-related variables in a linear or log-linear relationship (Boulter, 2005). These large-scale emission factor equations were developed using very broad source conditions and were utilized where source-specific PM emission test data were not available, so their predictive capability was not high enough for PM emission estimation in site-specific areas. For example, the coefficient of determination (R2) values of the EPA equations in 1998 and 2006 were less than 0.4 (EPA, 1998; EPA, 2006). On the other hand, the PM predictive regression models, when used on a site-specific basis, had relatively higher prediction abilities. For example, Dyck and Stukel (1976) developed an emission factor in a linear relationship with vehicle speed, vehicle weight, and silt content using the data from a study site in Champaign, IL, and the multiple correlation coefficient of the equation was 0.7974. Using the ambient monitoring data at four gravel processing sites in Taiwan, Chang et al. (2010) modified the 2006 EPA equation for unpaved roads by adding a term for the number of trucks and increasing the powers of the input variables significantly. The R2 of the equation for the unpaved roads was 0.88. Although these equations provide excellent prediction quality for the PM emissions of the study areas considered, their inherent limitation lies in their limited applicability to other emission sources.
As the characteristics of the formation of PM emissions are complex and nonlinear (Zhang and Ding, 2017), the relevance of using large-scale or site-specific equations that estimate PM emissions from linear relationships between independent and dependent variables has been a concern. This concern has driven more recent studies, which demonstrated excellent results in analyzing the non-linearity of emission data using an artificial neural network (ANN) (Liu and Yoon, 2019). Another aspect that is not adequately addressed in the EPA-type equations and the recent non-linearity-based studies is the consideration of meteorological parameters (e.g., wind speed, temperature, and cloud cover), which are known to be correlated with moisture evaporation (EPA, 1998; EPA, 2006). Tsai and Chang (2002) presented an emission factor equation for unpaved roads that included a wind speed parameter along with silt loading, moisture content, number of vehicles, and vehicle speed. Their results indicated a very high R2-value of 0.96, but the applicability was limited to the three locations in Hsinchu City, Taiwan, where the field test data were collected.
The purpose of the study presented in this paper was to investigate the influence of meteorological parameters, in the context of other vehicle characteristics and road surface material properties, on the quality of large-scale predictive models of PM10 emissions for unpaved roads. Two meteorological parameters (wind speed and temperature) were identified from the archived data available in the EPA report, “Emission Factor Documentation for AP-42.” Two analytical procedures, Principal Component Analysis (PCA) and Artificial Neural Network (ANN), were applied to four different datasets that differed in whether meteorological parameters were included with the default set of vehicle- and road surface-related input variables. The results of the four cases were compared using statistical performance measures to determine whether the meteorological parameters improve the predictive ability of the PM10 estimation model.
2. Procedure to compare PM10 emission models
To investigate the influence of meteorological parameters on the quality of large-scale PM10 emission models, this study compares the performance of predictive models built from datasets that include none, some, or all of the weather-related input variables. The procedure to compare the performance consists of the following four steps: data cleaning and normalization (Step-1), data dimensionality reduction using PCA (Step-2), ANN model estimation of PM10 emission (Step-3), and performance analysis of the estimation models (Step-4). PCA is suitable for reducing the dimensionality of data, and ANN is suitable for dealing with non-linear data and identifying complex relationships between input and output variables (Tayfur et al., 2013). PCA is one of the most popular multivariate analysis techniques for handling high-dimensional correlated variables in almost all scientific disciplines (Abdi and Williams, 2010). PCA converts a large set of possibly correlated variables into a small set of orthogonal variables called principal components (PCs) that still retain the maximal amount of information of the original variables (Abdi and Williams, 2010; Lever et al., 2017). A small set of variables is much more efficient for data analysis and interpretation. However, as the applicability of PCA is limited by the assumption that the relations among variables are linear, the use of PCA for data on a non-linear manifold still would require a higher number of dimensions. The ANN model, which was initially inspired by the human nervous system and signal transmission, utilizes an iterative learning process through interconnected neurons for data analysis. ANN has been used widely to solve both simple linear and complex non-linear problems because of its powerful mathematical background and logical system (Chiarazzo et al., 2014). Fig. 1 shows the four-step procedure for this paper.
Fig. 1. Procedures for performance analysis of PM10 emission estimation model.
Steps 1 through 3 are repeated to generate PM10 emission estimation models for all the datasets considered. The details of each step are discussed in the following subsections.
Step-1: Data Cleaning and Normalization
The activities of Step-1 consist of data cleaning and data normalization to implement reliable data analysis and modeling. The main purpose of data cleaning is to remove outliers from the raw dataset. Real-world data often are contaminated by missing values and outliers, and low-quality data can decrease the accuracy and reliability of the data analysis and modeling process. In particular, the PCA analysis in Step-2 is susceptible to outliers as the analysis is based on the correlation/variance matrix (Berrani and Garcia, 2005). The performance of the ANN model in Step-3 is also affected by outliers as the model uses all the data points in a raw dataset to train itself (Zhou and Wu, 2011). Outliers in a dataset can be efficiently detected by distinguishing the values of the outlier data from the rest of the data values. Although the outliers might not be readily apparent, the following data normalization activity minimizes the adverse impacts of the outliers on model development (Patel and Mehta, 2011).
Data normalization is undertaken given that the variables in a raw dataset generally have different units of measurement. Normalizing the values of different scales to a standard scale is essential for conducting a proper statistical procedure to capture the variability of the data (Tan et al., 2006). That is, the PCA analysis is performed on either the covariance matrix or the correlation matrix, and the mathematical concepts of these two matrices are very similar, as shown in Eqs. (1) and (2), where $n$ is the number of observed values, $x_i$ and $y_i$ are the values of a random variable pair $(X, Y)$ for $i = 1, \ldots, n$, $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, and $s_X$ and $s_Y$ are the standard deviations of $X$ and $Y$. The PCA analysis is inherently affected by large scales of variables in a raw dataset because it assigns more weight to the variables with higher variances (Jolliffe and Cadima, 2016). For example, suppose that variables A and B are correlated but the raw values of variable A are significantly larger than those of variable B. PCA will submerge variable B in variable A regardless of the possible higher importance of variable B over variable A for making a predictive model. Without normalization, the consistency of the PCA analysis in selecting informative variables in a dataset will be compromised. The data normalization procedure also is useful for the ANN model in order to increase the speed of the learning process, treat independent variables in a fair manner, and enhance the learning ability with minimized data errors (Jayalakshmi and Santhakumaran, 2011; Atomi, 2012; Baghirli, 2015). The ANN model will converge slowly and provide prediction results with large errors when the raw values of data are used directly to train the model.
\[
\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n - 1} \tag{1}
\]
\[
\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{s_X s_Y} \tag{2}
\]
There are many normalization methods, of which min-max normalization and z-score normalization are widely used. The min-max normalization method performs a linear transformation to scale raw data values down to a range of 0–1. The method measures the relative distances of the data values within a specified limit determined by the minimum and maximum values of a dataset so that the relationship among the original data is preserved (Han et al., 2012). However, min-max normalization cannot properly accommodate future data values that fall outside the specified limits. The z-score normalization method considers the mean and standard deviation of the data values in a dataset as shown in Eq. (3), where $x_i'$ is the normalized value of $x_i$. Therefore, the z-score method is applicable where a dataset may be expanded to include more data points, so that its minimum and maximum data values are unknown, and where outliers are present (Han et al., 2012). Also, the z-score normalization method eliminates the need to check the similarities of the scales of the different variables, which otherwise is essential for PCA to select the covariance or correlation matrix, because the covariance of the data values normalized by the z-score method is the correlation of the raw data values.
\[
x_i' = \frac{x_i - \bar{X}}{s_X} \tag{3}
\]
Step-2: Data Dimensionality Reduction using PCA
The PCA analysis for the data dimensionality reduction process includes the following activities: calculate the covariance among variables, deduce the eigenvectors and eigenvalues, compute the proportions of variation of the eigenvectors to select PCs, and estimate the principal component scores. When there are d variables (d-dimensional variables) involved in the original dataset, the first activity in Step-2 generates the d-by-d covariance matrix S using Eq. (1), where the mean of each variable is 0 because the raw data values are standardized by Eq. (3). In general, PCA searches for one eigenvalue and eigenvector for each variable so that the d eigenvalues, $\lambda_1, \lambda_2, \ldots, \lambda_d$, and the eigenvectors, $e_1, e_2, \ldots, e_d$, are deduced. The eigenvalues are obtained by solving Eq. (4), where the left-hand side is the determinant of the resulting matrix, which is a d-degree polynomial of $\lambda$, and I is the identity matrix:
\[
|S - \lambda I| = 0 \tag{4}
\]
Once the eigenvalues are obtained, the corresponding eigenvectors are obtained by Eq. (5), where j = 1, 2, …, d:
\[
(S - \lambda_j I)\, e_j = 0 \tag{5}
\]
Each $e_j$ is a d-by-1 column vector, so that a d-by-d eigenvector matrix is generated for all $e_j$. Each column of the eigenvector matrix defines the direction of a new axis called a principal component. The d elements of each column are the coefficients on the original variables. The proportion of variation indicates how much of the variability in the data each PC explains. The size of the proportion of variation is utilized to determine the importance of the PCs. The proportion of variation of each PC is computed by dividing the eigenvalue of the PC by the sum of the eigenvalues. That is, the proportion of variation (PoV) of the j-th PC is:
\[
\mathrm{PoV}(\mathrm{PC} = j) = \frac{\lambda_j}{\sum_{k=1}^{d} \lambda_k} \tag{6}
\]
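The normalization and eigen-decomposition activities in Eqs. (1) through (6) can be illustrated with a minimal sketch. It assumes NumPy rather than the R environment used in the study, and the 134-by-8 input table is a synthetic stand-in, not the AP-42 data; the function names `zscore` and `pca_eigen` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def zscore(raw):
    """Eq. (3): subtract each column's mean and divide by its sample standard deviation."""
    return (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

def pca_eigen(Z):
    """Covariance matrix (Eq. 1), eigenpairs (Eqs. 4-5), and PoV (Eq. 6) of standardized data."""
    n = Z.shape[0]
    S = Z.T @ Z / (n - 1)                 # column means of Z are zero after z-scoring
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh is appropriate because S is symmetric
    order = np.argsort(eigvals)[::-1]     # sort from largest to smallest eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    pov = eigvals / eigvals.sum()         # proportion of variation per PC
    return S, eigvals, eigvecs, pov

# Synthetic stand-in for a 134-by-8 input table (assumption: NOT the AP-42 data).
rng = np.random.default_rng(0)
base = rng.normal(size=(134, 8))
base[:, 1] += 0.8 * base[:, 0]            # make two columns correlated
scales = np.array([13.0, 4.0, 80.0, 75.0, 1.6, 7.0, 5.0, 3.0])
raw = base * scales + 10.0 * scales       # columns on very different scales

Z = zscore(raw)
S, eigvals, V, pov = pca_eigen(Z)
print("PoV:", np.round(pov, 3))
print("Cumulative PoV:", np.round(np.cumsum(pov), 3))
```

Under these assumptions, the covariance matrix of the z-scored data coincides with the correlation matrix of the raw data, and the eigenvalues sum to the number of variables, which is why the scale of each raw variable no longer influences which components dominate.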
To determine which PCs are retained or dropped for dimensionality reduction, the PCs are ranked from highest to lowest PoV and the cumulative PoV is computed. The PoVs of PCs can be graphically represented by a scree plot. Selecting a larger cumulative PoV of the PCs avoids losing too much information about the distribution of the data. As a result, the top p PCs (p < d) from the rank are selected. The appropriate number of PCs to be considered can be determined by observing the changes in the slope of a scree plot, choosing the eigenvalues ≥1, or setting a minimum acceptable percentage of cumulative explained variance (Abdi and Williams, 2010; Lam et al., 2010). As the last activity, Step-2 estimates the principal component scores of the raw data values for the PCs selected. The calculation of principal component scores considers the linear combinations of the raw data values and the coefficient matrix. When Z is a matrix with n rows (data points) and d columns (variables) and V is a d-by-p coefficient matrix of the p PCs selected, the matrix for the linear combination Y is:
\[
Y = ZV =
\begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,d} \end{bmatrix}
\begin{bmatrix} e_{1,1} & \cdots & e_{1,p} \\ \vdots & \ddots & \vdots \\ e_{d,1} & \cdots & e_{d,p} \end{bmatrix}
\tag{7}
\]
Therefore, the linear combination function of a principal component score for data point i (= 1, …, n) and PC j (= 1, 2, …, p) is:
\[
y_{i,j} = x_{i,1} e_{1,j} + x_{i,2} e_{2,j} + \cdots + x_{i,d} e_{d,j} \tag{8}
\]
Step-3: ANN Model for PM10 Emission Estimation
The main activities of Step-3 involve training the ANN model to reach the desired output based on the output of Step-2. The fundamental structure of the ANN model consists of three different layers (i.e., input, hidden, and output). Each layer contains artificial neuron(s), and the input and hidden layers include bias terms to increase the flexibility of the model. The ANN model generally considers a number of input neurons equal to the number of incoming variables and one output neuron. However, there is no theory available yet to determine the optimal number of neurons in the hidden layer for the best performance of the ANN model. Higher numbers of hidden neurons enhance the learning capacity of the ANN model with low training errors but increase the risk of overfitting (Panchal et al., 2011). The most common approach for determining the optimum number of hidden neurons is the trial-and-error method, which repeats testing from one hidden neuron to a maximum number of hidden neurons. By a rule of thumb, the maximum number could be at most two times that of the input neurons or two-thirds of the total neurons in the input and output layers (Sheela and Deepa, 2013). The weights between the network layers (i.e., input to hidden layer and hidden to output layer) are determined through feedforward and backpropagation logic in the ANN model. A one-time ANN training process, called one epoch, computes the weight-input products for the neurons in the hidden layer and one weighted sum in the output layer for the feedforward logic. The backpropagation logic then calculates the error between the weighted sum and an actual output value. This error then is considered in adjusting the weights of the neuron layers and bias values for the next epoch. This process iterates several times until the ANN model finds the minimum error.
The recursive ANN training process passes the data through the activation function, which is linear or non-linear. A linear activation function does not enforce change in the weight-input products, so it does not confine the output values to any range. A non-linear activation function transforms the products non-linearly. The most popular types of non-linear activation functions for the ANN model currently are the logistic and hyperbolic tangent functions (Özkan and Erbek, 2003). The logistic activation function produces output values over the range of 0–1 in a sigmoid curve. The hyperbolic tangent activation function extends the output range of the logistic function to between −1 and 1. The non-linear activation functions are useful for accommodating complicated and non-linear problems, but they suffer from the vanishing gradient problem, so they are applied particularly to hidden layers. Trial-and-error is used to choose the best activation function (e.g., logistic or hyperbolic tangent) in hidden neurons. The linear activation function is usually implemented in the output neurons for the predictive model because the estimates of output values are not limited to any boundary.
The ANN model also employs cross-validation techniques in its training process to build a reliable model structure because they utilize all the data points in a dataset to train and test the prediction rule in order to obtain a less biased prediction model (Wong, 2015). There are two commonly used cross-validation techniques: k-fold and leave-one-out. The k-fold cross-validation (k-fold CV) technique divides the entire dataset into k disjoint subsets, called folds, which are approximately equal in size. That is, when each fold contains approximately m data points, the number of folds generated is k = n/m. Each fold is used to evaluate the prediction error of the model, and the rest of the folds (i.e., k-1 folds) are combined to train the model, implying that the ANN model training process is repeated k times, resulting in k different testing results that are averaged at the end of the process. The operation logic of the leave-one-out cross-validation (LOOCV) technique is very similar to k-fold CV but differs in the way the dataset is divided. The LOOCV technique leaves one data point out for testing and uses all the other data points for training. The technique thus generates n subsets, where n is the total number of data points in a dataset.
The diagrams of these two cross-validation techniques are shown in Fig. 2, where the shaded ones represent testing folds (or data points) at different rotations. In general, k-fold CV is preferred from a computational standpoint while LOOCV is suitable for a small amount of data (Fushiki, 2011; Yadav and Shukla, 2016) as the ANN model generally needs more data points to learn information. The ANN training process with either the k-fold CV or LOOCV technique, depending on the sample data size to be used, produces test errors on the data points not used during the training of the ANN model.
Fig. 2. k-fold CV (top) and LOOCV (bottom) Techniques.
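The training and cross-validation logic described in Step-3 can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the R implementation used in the study: the network has one logistic hidden layer and a linear output neuron, it is trained by plain full-batch gradient descent, the folds are contiguous rather than shuffled, and the held-out predictions are pooled before computing RMSE and MAE. The names `kfold_indices`, `train_ann`, and `cross_validate`, the learning rate, the epoch count, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np

def kfold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds of approximately equal size."""
    return np.array_split(np.arange(n), k)

def train_ann(X, y, hidden=3, epochs=500, lr=0.05, seed=0):
    """One logistic hidden layer, linear output, full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1, 1, size=(X.shape[1], hidden))  # input-to-hidden weights
    b1 = np.zeros(hidden)                                # hidden bias weights
    W2 = rng.uniform(-1, 1, size=hidden)                 # hidden-to-output weights
    b2 = 0.0                                             # output bias weight
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))  # feedforward: hidden activations
        err = H @ W2 + b2 - y                      # output minus observed value
        # Backpropagation of the mean squared error gradient.
        gW2 = H.T @ err / len(y)
        gb2 = err.mean()
        dH = np.outer(err, W2) * H * (1.0 - H)
        gW1 = X.T @ dH / len(y)
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return lambda Xn: (1.0 / (1.0 + np.exp(-(Xn @ W1 + b1)))) @ W2 + b2

def cross_validate(X, y, folds, **kw):
    """Train on all folds but one, predict the held-out fold, return (RMSE, MAE)."""
    pred = np.empty_like(y, dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = train_ann(X[train_idx], y[train_idx], **kw)
        pred[test_idx] = model(X[test_idx])
    resid = pred - y
    return np.sqrt(np.mean(resid ** 2)), np.mean(np.abs(resid))

# Synthetic stand-in data (assumption): 30 points with five inputs, e.g. PC scores.
rng = np.random.default_rng(1)
n = 30
X = rng.normal(size=(n, 5))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

kfold = kfold_indices(n, 5)
loocv = kfold_indices(n, n)  # LOOCV is k-fold CV with k = n
rmse_k, mae_k = cross_validate(X, y, kfold)
rmse_l, mae_l = cross_validate(X, y, loocv)
print(f"5-fold: RMSE={rmse_k:.3f}, MAE={mae_k:.3f}")
print(f"LOOCV : RMSE={rmse_l:.3f}, MAE={mae_l:.3f}")
```

Treating LOOCV as the special case k = n keeps a single code path for both techniques; the cost is that LOOCV retrains the network n times, which is why it is practical mainly for small datasets.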
Step-4: Performance Analysis of the PM10 Emission Estimation Models
The performance analysis in Step-4 includes two activities: 1) identification of the ANN PM10 emission estimation model that generates the best performance measure for each dataset and 2) comparison among the performance measures to determine the best choice. Step-3 develops the same number of ANN models as the maximum number of hidden neurons applied for each activation function. The best performance measure of an ANN model for each dataset is identified by the lowest average test error. The test errors are obtained from the difference between the estimated values, de-normalized by inverting Eq. (3), and the observed PM10 values. The lowest average test errors, taken as the best performance measures of the datasets, are then compared with each other considering the variability of the errors. The performance analysis is conducted by employing various error measures, such as scale-dependent measures (e.g., mean-square error (MSE), root mean square error (RMSE), and mean absolute error (MAE)) as well as scale-independent measures (e.g., mean absolute percentage error (MAPE), root mean square percentage error (RMSPE), relative MAE (RMAE), and relative RMSE (RRMSE)). These scale-dependent and -independent measures have variations, which include standardized RMSE (SRMSE) and mean square reduced error (MSRE), using a standard deviation and variance, respectively (Li, 2017). There are also other types of measures to evaluate the accuracy of predictive models, such as the coefficient of determination (R2), adjusted R2, predicted R2, the variance explained by predictive models based on cross-validation (VEcv), Legates and McCabe's (E1), and Nash-Sutcliffe efficiency (NSE) (Li, 2017; Reinhardt and Samimi, 2018). As each of these performance measures has its usefulness, the selection of an appropriate measure must consider various factors (e.g., type of model, existence of unexpected and noisy data, and need for unit-free measures) (Armstrong and Collopy, 1992).
3. PM10 emission estimation model development for each dataset
The procedures presented in Fig. 1 were applied to the data retrieved from the EPA report (https://www3.epa.gov/ttn/chief/ap42/ch13/bgdocs/b13s02-2.pdf) for unpaved roads. The EPA report presents the PM10 emissions data from the unpaved surfaces at industrial sites and publicly accessible roads. The data are listed in Tables 4-1 through 4-32 in the report. For this study, 134 data points from the unpaved roads for light- and heavy-duty vehicles, stone quarry haul trucks, and construction haul equipment (e.g., scrapers and trucks) were collected considering the availability of a full set of the following input variables: temperature (Temp.), mean wind speed (MWS), number of vehicle passes (NoVP), mean vehicle weight (MVW), mean number of vehicle wheels (MNoVW), mean vehicle speed (MVS), silt content of road surface material (Silt), and surface moisture content (Moisture). As the two weather-related variables (i.e., Temp. and MWS) were considered, four different datasets were generated: 1) without both Temp. and MWS; 2) with Temp.; 3) with MWS; and 4) with both Temp. and MWS. Duplicate data and data with missing values in any of the input variables also were removed. A part of the data presented in the report was collected under surface conditions controlled by chemical dust suppressants such as petroleum resin products, emulsified asphalt, and acrylic cement as well as calcium chloride solutions, which change the physical characteristics of the unpaved road surface. Therefore, the data affected by these chemical stabilizations were eliminated for this study.
The observed data of PM10 as the output variable are expressed in pounds per vehicle mile traveled (lb/VMT). The statistical properties of the data collected for the input and output variables are presented in Table 1. The raw data were normalized using Eq. (3). Based on the normalized data, all the calculations required for the subsequent activities of Steps 2 and 3 were conducted using R (free software for statistical computing and graphics). Due to the page limit of this paper, the discussion of the activities in both steps covers only the dataset that includes the two weather-related variables. Fig. 3 shows the 8-by-8 covariance matrix S for the input variables. The eigenvalues and eigenvectors of the eight PCs were then computed. The results are shown in Fig. 4, where the eigenvalues are returned in descending order from Dim. 1 to Dim. 8, and the columns (i.e., PC1 through PC8) of the eigenvector matrix V correspond to the eigenvalues. Each eigenvalue satisfies Eq. (4), and the sum of the eigenvalues is equal to the number of input variables. The eigenvectors are linearly independent as the corresponding eigenvalues are all different. The orthogonality of the columns of the eigenvector matrix in Fig. 4 is verified by $V^{T}V = I$. Fig. 4 also shows the variance percentages of the eigenvalues (PoVs), which were computed based on Eq. (6) and are graphically represented by the scree plot in Fig. 5. There are three PCs with eigenvalues ≥1. According to the guideline for a scree plot to determine an optimal number of PCs, the “elbow” point, where the slope changes from steep to flat, occurs at the fourth PC, so the first four PCs could be considered. With the minimum generally acceptable cumulative PoV being 80% in order to account for the information of the original input variables in PCA (Adler and Golany, 2001), the cumulative PoV of the first five PCs is 84.7% (see the last column of the eigenvalues in Fig. 4). This study consequently applied the five PCs selected by the minimum cumulative PoV for the case study. Fig. 6 is a variables factor map that visualizes the loadings of the variables on the first two components (named Dim 1 for horizontal and Dim 2 for vertical). The loadings are computed by multiplying the eigenvectors by the square roots of the eigenvalues. The vector length of each variable represents the contribution of the variable to the components. The variables factor map in Fig. 6 shows that 46.1% of the total variation of the data used in the case study can be explained by the first two principal components (i.e., Dim 1: 25.1% and Dim 2: 21.0%). The variables in positive correlation are grouped together in the same quadrants while the variables in negative correlation are positioned in opposite quadrants. For example, the variable pairs Moisture and MNoVW, and NoVP and MVS, are projected well onto the first component; Moisture and MNoVW are positively correlated, while NoVP and MVS are negatively correlated. Given the five PCs selected, the principal component scores of the 134 data points were computed based on Eq. (8).
For example, the principal component scores corresponding to the five PCs (j = 1, 2, …, 5) for the first data point i (= 1) that has normalized values 0.102, −0.128, 0.717, −1.012, −0.958, −1.245, −0.018, and −0.808 (rounded to 3 decimal places) for Temp., MWS, NoVP, MVW, MNoVW, MVS, Silt, and Moisture respectively are:
Table 1 Statistical properties of the input and output variables.
Variable   Unit              Min.              Max.           Mean          Std.
Temp.      °F                33.3              95.0           70.5          13.3
MWS        mph (kph)         0.8 (1.3)         22.7 (36.5)    6.6 (10.6)    4.4 (7.1)
NoVP       EA                8.0               381.0          80.9          80.4
MVW        Ton               1.5               286.0          77.0          74.6
MNoVW      EA                4.0               10.0           5.5           1.6
MVS        mph (kph)         10.0 (16.1)       43.1 (69.4)    23.8 (38.3)   7.1 (11.4)
Silt       %                 1.25              25.20          7.81          5.23
Moisture   %                 0.07              20.10          2.98          3.06
PM10       lb/VMT (kg/VKT)   0.0061 (0.0017)   33.00 (9.30)   4.73 (1.33)   6.44 (1.82)
Note: 1 mph = 1.609 km per hour (kph), 1 lb/VMT = 0.2819 kg/vehicle-kilometer-traveled (kg/VKT).
Fig. 3. Covariance matrix of the normalized data.
Eqs. (9)–(13). Principal component scores y1,1 through y1,5 of the first data point on PC1 through PC5. Each score is the sum, over the eight input variables, of the normalized value multiplied by the corresponding entry of the eigenvector column in Fig. 4. [Numerical terms omitted: the coefficients and their signs are not legible in this copy.]
The column vectors of the eigenvector matrix in Fig. 4 were also rounded to three decimal places for computational simplicity in Eqs. (9)–(13). As a result, the principal component scores were computed for the normalized values of the 134 data points on the five PCs. Fig. 7 lists only a portion of the scores due to the page limit of this paper. The principal component scores of the 134 data points were used as the dataset corresponding to a new set of the input variables for the ANN
model. The case study employed one hidden layer considering ANN learning efficiency (Macukow, 2016), performance quality (Ahmed, 2005), and sufficiency for most non-linear complex problems (Karsoliya, 2012). The logistic and hyperbolic tangent activation functions were each tested separately for the hidden layer. The maximum number of hidden neurons for the trial-and-error testing was set at 10, which is twice the number of input variables (i.e., PC1 through PC5). For each number of hidden neurons, the ANN training was implemented with the LOOCV technique so that the network could learn from as much of the relatively small dataset as possible within a reasonable computing time. Each LOOCV run, which comprises 134 ANN trainings for a given number of hidden neurons, randomly chose the initial weights between −1 and 1 and set the values of the bias neurons in the input and hidden layers at 1. The two most commonly used error measures, RMSE and MAE, were considered to evaluate the accuracy of the ANN test results at various numbers of hidden neurons and for the two activation functions. The whole ANN training and testing procedure was repeated 20 times for each number of hidden neurons to obtain more generalized estimation outputs. The procedure therefore produced 20 RMSEs and 20 MAEs for each of the 10 hidden-neuron settings. The error measures were then averaged to find the lowest average error measures. Fig. 8 shows the average RMSEs and MAEs for 1 to 10 hidden neurons, for both activation functions and all four datasets. The ANN test results show that the logistic activation function generated better performance outputs than the hyperbolic tangent activation function for all the datasets. Table 2 summarizes the lowest average RMSE and MAE values as well as their standard deviations for the datasets considered in this study. The four datasets in Table 2 are labeled as follows: 1) the dataset with no consideration of the weather inputs (DS-1), 2) the dataset adding Temp. to DS-1 (DS-2), 3) the dataset adding MWS to DS-1 (DS-3), and 4) the dataset adding both Temp. and MWS to DS-1 (DS-4).
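The LOOCV evaluation and the averaging over repeated runs described above can be sketched as follows. This is a hedged Python illustration, not the authors' R implementation: the function names are hypothetical, and a trivial mean predictor stands in for the ANN so that the example stays self-contained.

```python
import math


def loocv_errors(xs, ys, fit, predict):
    """Leave-one-out cross-validation: returns (RMSE, MAE) over the held-out points."""
    errors = []
    for i in range(len(ys)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)                 # train on n - 1 points
        errors.append(ys[i] - predict(model, xs[i]))  # test on the held-out point
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


def averaged_errors(xs, ys, fit, predict, repeats=20):
    """Repeat the LOOCV run and average the two error measures."""
    runs = [loocv_errors(xs, ys, fit, predict) for _ in range(repeats)]
    avg_rmse = sum(r for r, _ in runs) / repeats
    avg_mae = sum(m for _, m in runs) / repeats
    return avg_rmse, avg_mae


# Stand-in model (assumption): predicts the mean of the training outputs.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Hypothetical toy data, not taken from the study.
rmse, mae = loocv_errors([[0.0], [1.0], [2.0], [3.0]], [1.0, 2.0, 3.0, 4.0],
                         fit_mean, predict_mean)
```

With a randomly initialized network in place of the mean predictor, each repetition would give slightly different errors, which is why the averages and standard deviations over the 20 repetitions are reported.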
Fig. 5. Scree plot for proportions of variations.
Fig. 6. Factor map of the input variables.
Fig. 4. Eigenvalues and eigenvectors.
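The relationship between the normalized data, the eigenvalues, and the principal component scores can be illustrated with a short Python sketch. It assumes that the normalization is a z-score (the form of Eq. (3) is not shown in this section), and all numbers and function names below are hypothetical rather than taken from Fig. 3 or Fig. 4.

```python
from statistics import mean, stdev


def standardize(column):
    """Z-score one input variable (assumed form of the normalization step)."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]


def covariance(a, b):
    """Sample covariance of two equally long columns."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def pc_score(normalized_row, eigenvector):
    """Score of one data point on one PC: sum of normalized value times loading."""
    return sum(z * v for z, v in zip(normalized_row, eigenvector))


# Hypothetical raw data: three variables (columns), four observations each.
raw_columns = [[1.0, 2.0, 4.0, 7.0], [3.0, 1.0, 5.0, 2.0], [10.0, 30.0, 20.0, 60.0]]
normalized_columns = [standardize(col) for col in raw_columns]

# The covariance matrix of standardized data has ones on its diagonal, so its
# trace (which equals the sum of the eigenvalues) is the number of variables.
trace = sum(covariance(col, col) for col in normalized_columns)

# Score of the first observation on a hypothetical unit-length loading vector.
first_row = [col[0] for col in normalized_columns]
score = pc_score(first_row, [0.6, 0.0, 0.8])
```

Here `trace` comes out as the number of variables, which mirrors the statement that the eigenvalues sum to the number of input variables.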
Table 2 Performance errors and standard deviations.
Dataset   RMSE (Avg.)   RMSE (SD)   MAE (Avg.)   MAE (SD)
DS-1      5.394         0.0007      3.317        0.0003
DS-2      5.546         0.0007      3.422        0.0005
DS-3      5.354         0.4259      3.497        0.3101
DS-4      5.648         0.1743      3.703        0.1170
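The averages and standard deviations in Table 2 are the kind of summary statistics that a two-sample t-test compares. The sketch below, in Python, uses Welch's unequal-variance form and a sample size of 20 per dataset (the number of repetitions); both choices are assumptions, since the paper does not state which variant or sample size was used. Because the table values are rounded, the result need not reproduce the published P-values.

```python
import math


def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=20000):
    """Two-tailed P-value by midpoint integration of the upper tail.

    Substituting x = |t| / s maps the tail [|t|, inf) onto s in (0, 1].
    """
    t = abs(t)
    if t == 0.0:
        return 1.0
    h = 1.0 / steps
    tail = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        tail += t_pdf(t / s, df) * t / (s * s) * h
    return min(1.0, 2.0 * tail)


def welch_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's two-sample t-test from summary statistics: returns (t, df, p)."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df, two_tailed_p(t, df)


# Average RMSE of DS-1 versus DS-3 from Table 2, assuming n = 20 for each.
t, df, p = welch_t_test(5.394, 0.0007, 20, 5.354, 0.4259, 20)
```

A P-value above 0.05 from such a test would mean that the null hypothesis of equal performance errors cannot be rejected for that pair of datasets.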
Fig. 7. Principal component scores of the normalized data.
Table 3 Two-tailed P-values for average RMSEs and MAEs.
Comparison     P-value (RMSE Avg.)   P-value (MAE Avg.)
DS-1 to DS-2   1.209E-74             1.109E-63
DS-1 to DS-3   0.768                 0.028
DS-1 to DS-4   9.0812E-06            2.834E-11
Fig. 8. Average RMSEs and MAEs of four different datasets.
4. Model performance analysis results and discussions
The performance analysis compared the lowest average RMSEs and MAEs of the four ANN-based PM10 emission estimation models, considering their standard deviations for the reliability of the error measures. Table 2 shows that the weather inputs (i.e., DS-2, DS-3, and DS-4) had no influence on reducing PM10 estimation errors, which differed from the authors’ expectations. A hypothesis test employing a two-sample t-test was conducted to statistically interpret the performance errors of the weather-inclusive datasets compared to DS-1. The
hypothesis test considered the equal performance errors of two comparison datasets as the null hypothesis at a significance level (α) of 0.05. The two-tailed P-values that resulted from the hypothesis test are presented in Table 3, which shows that the performance errors of DS-1 differed significantly from those of the other datasets in every comparison except the average RMSE of DS-3. Because the DS-1 errors in Table 2 were the lower ones in each of the significant comparisons, this study concluded that there was no advantage in including either one or both of the two weather-related input variables over the ANN-based PM10 emission estimation model with no consideration of weather inputs.
One possible explanation for the model performance analysis results can be inferred from the studies conducted by Claiborn et al. (1995) and Csavina et al. (2014), who found no statistically significant relationship between the PM10 concentration and the wind speed for paved and unpaved roads. Several field experiments concluded that PM10 concentrations decreased as the wind speed increased up to a certain point, showing a very weak negative correlation, and then increased because PM resuspension occurs at higher wind speeds (Simpson, 1990; Harrison et al., 2001; Smith et al., 2001; Tsai and Chang, 2002; Aldrin and Haff, 2005; Jones et al., 2010). The wind speed at which this transition occurred varied depending on the experimental conditions (e.g., wind direction and locations); for example, > 4 m/s (Harrison et al., 2001; Csavina et al., 2014) or 9–12 m/s (Tsai and Chang, 2002). Also, Tong et al. (2014) found that EPA-like PM10 equations remained consistent for wind speeds ranging between 4 mph (1.79 m/s) and 20 mph (8.94 m/s). As shown in Table 1, the mean wind speed values for this study ranged from 0.8 mph (0.36 m/s) to 22.7 mph (10.15 m/s), a range in which these studies suggest only an insignificant negative correlation with the PM10 emissions. Fig. 9 shows all 134 data points in the relationship between PM10 and MWS, along with the fitted line, which has an R² of 0.05%. The PM10 values of some data points within a certain MWS range (e.g., about 4–7 mph (6.4–11.3 kph)) were far higher than others in the same MWS range. This study's investigation into the data sources identified that most of these data points had relatively higher ratios of silt to moisture content than the others collected from the same sampling locations, which indicates that silt and moisture in unpaved road surfaces are the prevailing variables in the generation of PM10 emissions.
Another explanation for the performance analysis results could lie in the variety of data sources used for this study. The data sources included publicly accessible unpaved roads as well as industrial and construction sites. Also, the potential for PM10 emissions generally decreases during the winter months as soil moisture is more likely to be retained at lower soil surface temperatures (Kuhns et al., 2003; Lakshmi et al., 2003). On the other hand, the 2006 EPA report indicated that PM10 emissions may be more correlated with heavy vehicle weights at industrial and construction sites and with moisture content on public unpaved roads (EPA, 2006). The argument in the EPA report was supported by another performance analysis conducted in this study. This analysis, which used two sub-datasets of DS-1 and DS-2 comprising 30 data points from the unpaved roads dominated by light-duty vehicles, followed Step-1 through Step-4. The lowest average RMSE/MAE values of the sub-datasets of DS-1 and DS-2 were 2.539/1.413 and 1.528/0.9222, respectively, as shown in Fig. 10, which also includes the standard deviations. The hypothesis test at a significance level of 0.05 yielded P-values of 1.742E-08 for RMSE and 3.283E-09 for MAE, so the null hypothesis was rejected. This result suggests that the prediction reliability of PM10 models for publicly accessible unpaved roads can be improved by adding temperature as an input variable.
Fig. 10. The lowest error measures of the sub-datasets of DS-1 and DS-2.
5. Conclusions
While site-specific PM10 estimation models can provide more reliable prediction quality within a specific limit, large-scale models can be more useful where site-specific PM10 data are unavailable. This study investigated the potential use of meteorological factors to improve the reliability of large-scale PM10 estimation models for unpaved roads using two data analysis methods, PCA and ANN, in R software. The data, which were collected from various sites (e.g., publicly accessible roads as well as construction and mining haul roads) in different regions (e.g., Nevada, North Carolina, Wyoming, Indiana, Michigan, North Dakota, New Mexico, and Missouri), were retrieved from the EPA report published in 1998. The performance analysis utilized 134 data points for eight input variables, which included two weather-related variables (i.e., mean wind speed and temperature). The hypothesis test for the performance analysis indicated that the weather-related input variables had no effect on improving the quality of large-scale PM10 estimation models for unpaved roads when RMSE and MAE were considered as error measures. Two possible reasons were identified for this result: the inherently insignificant relationship between PM concentration and wind speed, and the overly broad data sources, which spanned publicly accessible unpaved roads and industrial and construction sites.
However, this study found a statistically significant reduction in the estimation errors of the PM10 models for publicly accessible unpaved roads when temperature data were employed, compared to the models with no consideration of any weather-related variables. The contribution of this study lies in its assessment of the potential effects of mean wind speed and temperature on the quality of large-scale PM10 estimation models, which provides a scientific foundation for the currently available knowledge of the correlation between mean wind speed and/or temperature and PM10 emissions. Also, this study suggests the potential advantages of including meteorological factors in PM10 estimation models for large-scale but purpose-specific unpaved roads (e.g., unpaved roads dominated by lightweight or heavyweight vehicles) to enhance the quality of the models. This study also has limitations that require future work. First, there are other types of meteorological factors that may have impacts on PM10 emissions, such as solar radiation, humidity level, and sustained winds at a certain speed. Second, some input variables, such as mean wind speed, temperature, number of vehicle passes, and surface moisture content, vary over time, which requires data analysis approaches capable of dealing with time-dependent inputs.
Fig. 9. Regression plot between PM10 and MWS.
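A fitted line and R² of the kind reported for Fig. 9 can be obtained from an ordinary least-squares fit. The Python sketch below assumes that this is how the line was fitted, which the paper does not state; the function name and the sample values are hypothetical and are not the 134 data points of the study.

```python
def linear_fit_r2(xs, ys):
    """Ordinary least-squares line y = a + b*x and its coefficient of determination."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical values chosen so that the two variables are uncorrelated.
mws = [1.0, 2.0, 3.0, 4.0, 5.0]
pm10 = [2.0, 1.0, 4.0, 1.0, 2.0]
intercept, slope, r2 = linear_fit_r2(mws, pm10)
```

An R² close to zero, as in this toy example, means that the fitted line explains almost none of the variation in the output, which is the situation the regression plot describes for PM10 against MWS.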
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Appendix A. Supplementary data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.atmosenv.2019.116956.
References
Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdiscip. Rev: Comput. Stat. 2 (4), 433–459.
Adler, N., Golany, B., 2001. Evaluation of deregulated airline networks using data envelopment analysis combined with principal component analysis with an application to Western Europe. Eur. J. Oper. Res. 132 (2), 260–273.
Ahmed, F.E., 2005. Artificial neural networks for diagnosis and survival prediction in colon cancer. Mol. Cancer 4 (1), 29.
Aldrin, M., Haff, I.H., 2005. Generalised additive modelling of air pollution, traffic volume and meteorology. Atmos. Environ. 39 (11), 2145–2155.
Armstrong, J.S., Collopy, F., 1992. Error measures for generalizing about forecasting methods: empirical comparisons. Int. J. Forecast. 8, 69–80.
Atomi, W.H., 2012. The Effect of Data Preprocessing On the Performance Of Artificial Neural Networks Techniques For Classification Problems (Master’s thesis). Retrieved from https://core.ac.uk/download/pdf/20078092.pdf.
Baghirli, O., 2015. Comparison of lavenberg-marquardt, scaled conjugate Gradient and Bayesian regularization backpropagation Algorithms for multistep ahead wind speed forecasting using multilayer perceptron feedforward neural network (Doctoral dissertation). Retrieved from http://www.diva-portal.org/smash/get/diva2:828170/FULLTEXT01.pdf.
Berrani, S., Garcia, C., 2005. On the impact of outliers on high-dimensional data analysis methods for face recognition. In: Proceedings of the 2nd International Workshop on Computer Vision Meets Databases, June 17, Baltimore, MD.
Boulter, P., 2005. A Review of Emission Factors and Models for Road Vehicle Non-exhaust Particulate Matter. TRL Report PPR 065. TRL Limited, Wokingham, UK.
Chang, C.-T., Chang, Y.-M., Lin, W.-Y., Wu, M.-C., 2010. Fugitive dust emission source profiles and assessment of selected control strategies for particulate matter at gravel processing sites in Taiwan. J. Air Waste Manag. Assoc. 60 (10), 1262–1268.
Chiarazzo, V., Caggiani, L., Marinelli, M., Ottomanelli, M., 2014. A neural network based model for real estate price estimation considering environmental quality of property location. Transp. Res. Procedia 3 (2), 810–817.
Claiborn, C., Mitra, A., Adams, G., Bamesberger, L., Allwine, G., Kantamaneni, R., Lamb, B., Westberg, H., 1995. Evaluation of PM10 emission rates from paved and unpaved roads using tracer techniques. Atmos. Environ. 29 (10), 1075–1089.
Csavina, J., Field, J., Félix, O., Corral-Avitia, A.Y., Sáez, A.E., Betterton, E.A., 2014. Effect of wind speed and relative humidity on atmospheric dust concentrations in semi-arid climates. Sci. Total Environ. 487, 82–90.
Dockery, D.W., 2001. Epidemiologic evidence of cardiovascular effects of particulate air pollution. Environ. Health Perspect. 109, 483–486.
Dyck, R.I., Stukel, J., 1976. Fugitive dust emissions from trucks on unpaved roads. Environ. Sci. Technol. 10, 1046–1048.
Fay, L., Kroon, A., Skorseth, K., Reid, R., Jones, D., 2016. Converting Paved Roads to Unpaved. Transportation Research Board, Washington, D.C.
Fushiki, T., 2011. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 21 (2), 137–146.
Han, J., Kamber, M., Pei, J., 2012. Data Mining: Concepts and Techniques. Berkeley Press, Waltham, MA.
Harrison, R.M., Yin, J., Mark, D., Stedman, J., Appleby, R.S., Booker, J., Moorcroft, S., 2001. Studies of the coarse particle (2.5–10 μm) component in UK urban atmospheres. Atmos. Environ. 35 (21), 3667–3679.
Jayalakshmi, T., Santhakumaran, A., 2011. Statistical normalization and back propagation for classification. Int. J. Comput. Theory Eng. 3 (1), 89–93.
Jolliffe, I.T., Cadima, J., 2016. Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374, 20150202.
Jones, A.M., Harrison, R.M., Baker, J., 2010. The wind speed dependence of the concentrations of airborne particulate matter and NOx. Atmos. Environ. 44 (13), 1682–1690.
Karsoliya, S., 2012. Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture. Int. J. Eng. Trends Technol. 3 (6), 714–717.
Kuhns, H., Etyemezian, V., Green, M., Hendrickson, K., McGown, M., Barton, K., Pitchford, M., 2003. Vehicle-based road dust emission measurement—Part II: effect of precipitation, wintertime road sanding, and street sweepers on inferred PM10 emission potentials from paved and unpaved roads. Atmos. Environ. 37 (32), 4573–4582.
Lakshmi, V., Jackson, T.J., Zehrfuhs, D., 2003. Soil moisture–temperature relationships: results from two field experiments. Hydrol. Process. 17 (15), 3041–3057.
Lam, J.C., Wan, K.K., Wong, S.L., Lam, T.N., 2010. Principal component analysis and long-term building energy simulation correlation. Energy Convers. Manag. 51 (1), 135–139.
Lever, J., Krzywinski, M., Altman, N., 2017. Points of significance: principal component analysis. Nat. Methods 14, 641–642.
Li, J., 2017. Assessing the accuracy of predictive models for numerical data: not r nor r2, why not? Then what? PLoS One 12 (8), e0183250.
Liu, T.-H., Yoon, Y., 2019. Development of enhanced emission factor equations for paved and unpaved roads using artificial neural network. Transp. Res. D Transp. Environ. 69, 196–208.
Macukow, B., 2016. Neural networks – state of art, brief history, basic models and architecture. In: Proceedings in Computer Information Systems and Industrial Management, 15th IFIP TC8 International Conference, September 14-16, Vilnius, Lithuania.
Molina, M.J., Molina, L.T., 2004. Megacities and atmospheric pollution. J. Air Waste Manag. Assoc. 54 (6), 644–680.
Özkan, C., Erbek, F.S., 2003. The comparison of activation functions for multispectral Landsat TM image classification. Photogramm. Eng. Remote Sens. 69, 1225–1234.
Panchal, G., Ganatra, A., Shah, P., Panchal, D., 2011. Determination of over-learning and over-fitting problem in back propagation neural network. Int. J. Soft Comput. 2 (2), 40–51.
Patel, V.R., Mehta, R.G., 2011. Impact of outlier removal and normalization approach in modified k-means clustering algorithm. IJCSI Int. J. Comput. Sci. Issues 8 (5), 14–24.
Pope III, C.A., Burnett, R.T., Thun, M.J., Calle, E.E., Krewski, D., Ito, K., Thurston, G.D., 2002. Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. J. Am. Med. Assoc. 287 (9), 1132–1141.
Reinhardt, K., Samimi, C., 2018. Comparison of different wind data interpolation methods for a region with complex terrain in Central Asia. Clim. Dyn. 51, 3635–3652.
Schwartz, J., Slater, D., Larson, T.V., Pierson, W.E., Koenig, J.Q., 1993. Particulate air pollution and hospital emergency room visits for asthma in Seattle. Am. Rev. Respir. Dis. 147, 826–831.
Sheela, K.G., Deepa, S.N., 2013. Review on methods to fix number of hidden neurons in neural networks. Math. Probl. Eng. 2013, 1–11.
Simpson, R.W., 1990. A model to control emissions which avoid violations of PM10 health standards for both short and long term exposures. Atmos. Environ. Part A. General Topics 24 (4), 917–924.
Smith, S., Stribley, F.T., Milligan, P., Barratt, B., 2001. Factors influencing measurements of PM10 during 1995–1997 in London. Atmos. Environ. 35 (27), 4651–4662.
Tan, P.-N., Steinbach, M., Kumar, V., 2006. Introduction to Data Mining. Pearson Education, Inc, Boston, MA.
Tayfur, G., Karimi, Y., Singh, V.P., 2013. Principle Component Analysis in conjunction with data driven methods for sediment load prediction. Water Resour. Manag. 27 (7), 2541–2554.
Tsai, C.J., Chang, C.T., 2002. An Investigation of dust emissions from unpaved surfaces in Taiwan. Separ. Purif. Technol. 29, 181–188.
Tong, X., Luke, E.A., Smith, R., 2014. Numerical validation of a near-field fugitive dust model for vehicles moving on unpaved surfaces. Proc. Inst. Mech. Eng. - Part D J. Automob. Eng. 228 (7), 747–757.
U.S. Department of Transportation, 2018. Public road length – 2017 kilometers by type of surface and ownership/functional system nation summary. November 27. Retrieved from https://www.fhwa.dot.gov/policyinformation/statistics/2017/hm12m.cfm.
U.S. Environmental Protection Agency, 1998. Emission Factor Documentation for AP-42, Section 13.2.2. Unpaved Roads. Retrieved from https://www3.epa.gov/ttn/chief/ap42/ch13/bgdocs/b13s02-2.pdf.
U.S. Environmental Protection Agency, 2006. 13.2.2 Unpaved Roads. Retrieved from https://www3.epa.gov/ttn/chief/ap42/ch13/final/c13s0202.pdf.
U.S. Environmental Protection Agency, 2016. 2014 National emissions inventory, version 1: technical support document. Retrieved from https://www.epa.gov/sites/production/files/2016-12/documents/nei2014v1_tsd.pdf.
Wong, T.-T., 2015. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 48 (9), 2839–2846.
World Health Organization, 2016. Air pollution levels rising in many of the world's poorest cities. May 12. Retrieved from https://www.who.int/news-room/detail/12-05-2016-air-pollution-levels-rising-in-many-of-the-world-s-poorest-cities.
Yadav, S., Shukla, S., 2016. Analysis of k-Fold cross-validation over hold-out validation on colossal datasets for quality classification. In: 2016 IEEE 6th International Conference on Advanced Computing (IACC), February 27-28, Bhimavaram, India.
Zhang, J., Ding, W., 2017. Prediction of air pollutants concentration based on an extreme learning machine: the case of Hong Kong. Int. J. Environ. Res. Public Health 14 (2), 1–19.
Zhou, Y., Wu, Y., 2011. Analyses on influence of training data set to neural network supervised learning performance. Adv. Comput. Sci. Intell. Syst. Environ. 3 (106), 19–25.