Deep neural network and random forest classifier for source tracking of chemical leaks using fence monitoring data

Jaehoon Cho¹, Hyunseung Kim¹, Addis Lulu Gebreselassie, Dongil Shin∗

Department of Chemical Engineering, Myongji University, Yongin, Gyeonggido 17058, Republic of Korea

∗ Corresponding author. E-mail address: [email protected] (D. Shin).
¹ Both authors contributed equally to this work.
ARTICLE INFO

Keywords: Chemical accident; Chemical release; Leak source tracking; Deep-learning network; Random Forest; Artificial intelligence

ABSTRACT
Chemical plant leak accidents are among the major industrial accidents that can spread into secondary and tertiary disasters. To alleviate such secondary and tertiary damage and improve the effectiveness of emergency responses, it is very important to track and diagnose the leak source location(s) and notify the plant manager and emergency responders promptly. In this study, we propose an emergency response system that copes with leak accidents of a chemical plant by monitoring sensor data and tracking down the suspected leak source using machine learning: deep-learning and Random Forest classifiers. Because it is difficult to obtain enough chemical leak accident data or to perform actual leak experiments on real plants due to high risk and cost, Computational Fluid Dynamics (CFD) simulations are used to derive fence monitoring data for chemical leak accident scenarios. These data are used to train the machine learning models to predict leak source locations. Six time-series Deep Neural Network (DNN) structures and three Random Forest (RF) structures are trained using CFD dispersion simulation results for 640 leak accident scenarios of a real chemical plant, divided into training and test datasets. As a result, the DNN model using 25 hidden layers and the RF model using 100 decision trees achieve 75.43% and 86.33% prediction accuracy, respectively, classifying the most probable leak source out of 40 potential leak source locations. Analysis of the wrongly classified cases shows that the predicted leak sources are quite adjacent to the actual leak locations and can hardly be called misclassifications. Considering the strong performance of the DNN and RF classifiers for chemical leak tracking, the proposed method would be very useful for chemical emergency management and is highly recommended for real-time diagnosis of chemical leak sources.
1. Introduction

The Bhopal chemical leak accident of methyl isocyanate in 1984 is recorded as the largest disaster in chemical accident history: 8000 people were killed, 4000 were permanently disabled, and 520,000 were exposed to hazardous chemicals (Jaising and Sathyamala, 1992; Eckerman, 2005). Although the Bhopal disaster occurred more than 30 years ago, responsibility problems, lawsuits, and irreparable damage still remain in society. Systematic management and countermeasures are needed to reduce the risk of chemical accidents because chemical accidents cause unrecoverable human and material damage. Chemical leak accidents are classified as major industrial accidents that can cause immediate damage to workers in the workplace or damage to areas near the workplace. In particular, chemical plants in developing countries are crowded together without clear separation from residential areas, which is very likely to lead to secondary and tertiary disasters in the event of a chemical accident. In order to
reduce secondary and tertiary damage from chemical accidents, it is important to respond promptly in the initial stage of the accident, and precise information on the leak source is essential in this step. Table 1 compares the hydrofluoric acid leak accidents that occurred in Gumi, Korea and in Texas, USA. The main difference between the two accidents lies in the emergency response in the initial stage. In the case of Gumi, the hydrofluoric acid leak was stopped 8 h after the accident occurred. Secondary and tertiary damage occurred within a 2 km radius of the leak point, resulting in 5 fatalities and 18 injuries, and 13,000 people were treated for toxic chemical exposure. On the other hand, in the case of Texas, the hydrofluoric acid leak was stopped 8 min after the accident thanks to a systematic emergency response in the initial stage. Despite the fact that three times as much hydrofluoric acid leaked as in the Gumi accident, the damage was greatly reduced. This shows that a successful emergency response in the initial stage can greatly reduce accident damage.
Table 1
Comparison of the Gumi and Texas hydrofluoric acid leak accidents (Lee, 2013, 2014).

Item | Gumi hydrofluoric acid leak accident (2012) | Texas hydrofluoric acid leak accident (1987)
Leakage of hydrofluoric acid | 8 ton | 24 ton
Leak stopped | Attempt to shut off the leak valve: 1 h 30 min after the accident; gas leak stopped: 8 h after the accident | Gas leak stopped: 8 min after the accident
Evacuation time | Evacuation started: 27 min after the accident | Evacuation completed: within 20 min
Concentration of hydrofluoric acid | 1 ppm (next day at 09:30); 5–10 ppm (estimated) | 10 ppm (1 h after the accident)
Human injury | 5 people killed; 18 inpatients; 13,000 people screened | No deaths; 95 inpatients; 939 people screened

In this study, the purpose of developing a model that can find the leak source location, and further identify the type and size of the leak source with high accuracy, is to build an emergency system that increases the success rate of the emergency response. Such a system can alleviate the spread of damage by quickly notifying the plant manager and emergency responders of the suspected leak source location(s) and leakage information; finding the leak source location contributes greatly to the quality of the emergency response. There have been several studies on methodologies that can track or predict the leak source location. These studies can be grouped into inverse vector tracking methodologies based on fluid dynamics and data analysis methodologies such as data pattern recognition and statistical techniques. Marques et al. (2003) set the tracking vector by composing the average wind vector, the crosswind vector and the instantaneous wind vector; after dividing mobile-sensor plume tracking into four steps, the authors set the proportional constant applied at each step to make tracking more efficient. Ishida et al. (2004) proposed a methodology for deriving a tracking vector for the leak source location by combining the inverse wind vector and the concentration field vector of the leaking material. The mobile sensor moves and updates the measured physical quantities, generating the tracking vector from the measured data, and then moves along the derived tracking vector to find the leak source location. Pisano and Lawrence (2007) set the objective function as the error between the value calculated from the Gaussian dispersion equation and the value measured by the sensor; to find the leak source location, the mobile sensor was designed to move along the direction that minimizes this objective function (error). Zhang et al. (2012) proposed the Quasi-Reversibility (QR) model and the Lagrangian-Reversibility (LR) model, based on the Eulerian and Lagrangian approaches of fluid dynamics, respectively. These models require that information on the distribution of the leaking material can be obtained at a given time, for example from a sensor grid; the QR and LR models then find the source location(s) by back-calculating the spatially distributed data. The inverse vector tracking methodologies listed above require mobile sensor(s) or a dense sensor grid capable of sensing the internal material distribution, because the data needed for tracking the leak source must be continuously updated to renew the tracking vector along the tracking path. Therefore, it is difficult to apply inverse vector tracking in a confined space where mobile sensors are hard to use. In addition, if the geometry includes features that make the concentration gradient change rapidly, tracking can be trapped in local optima while the sensor moves along the tracking vector.

In contrast, a leak source prediction model applying machine learning algorithms can overcome the disadvantages of the inverse vector tracking methodologies mentioned above. Since it is based on pattern analysis, data from fixed sensors are enough to design a leak source location prediction system, which avoids the cost and difficulty of installing separate movable sensors. Therefore, it can be applied in a confined space, and the time required to track the leak source by moving sensors can be greatly reduced. In particular, machine learning models including the Artificial Neural Network (ANN) and RF have shown unparalleled performance in classification problems through data learning. For this reason, there have also been attempts to identify leak sources using machine learning. Although they do not address source tracking of chemical leak(s), there are studies using an ANN as a means of obtaining information on leak sources. Rege and Tock (1996) proposed a model for estimating the emission rate of hydrogen sulfide and ammonia through an ANN for a single leak source; the three-layered ANN was trained using a database containing seven variables: downwind concentration, wind speed, downwind distance, crosswind distance, ambient air temperature, relative humidity, and atmospheric stability. Reich et al. (1999) proposed a model that estimates the hourly emission rate and the corresponding effective height of a leak through a three-layered ANN to obtain information on the leak source. Although it is not related to the leak source tracking problem of chemical plants, Singh et al. (2004) proposed a model that predicts contaminant flow in groundwater; the ANN learns the results of CFD simulations for three leak sources to predict the leak rate at each leaking source. Li et al. (2006) installed 105 sensors in 21 pipes buried underground and conducted leak experiments; a counter-propagation network was trained using the experimental data and could predict to which of the 21 pipes the leak source belongs. Although there is no study that attempts to track the leak source location of a chemical plant with complex geometry, the promising results of machine learning applications on groundwater contaminant source tracking show the possibility of using an ANN and can be adopted for the chemical plant leak source tracking problem. In this study, we propose a method to construct a leak source tracking system based on fence monitoring data that detects chemical leaks at chemical plants, applying DNN and RF. CFD simulation data of leak accident scenarios are used to train the DNN and RF models, and the trained DNN and RF models are verified by analyzing the results on the test datasets.
2. Design of a leak source tracking system based on a fence monitoring sensor grid

The overall schematic diagrams of the existing inverse vector tracking method using a mobile sensor and of our proposed method are shown in Fig. 1. The proposed leak source tracking system works as follows: in the event of a chemical leak, the sensors on the plant fence detect the chemical leak and an alarm is initiated. At this time, the measured concentration data of the leaked material at each sensor location are collected in the integrated monitoring system. The measured leak concentration data, along with the wind speed and direction at the time of leakage, are provided to the leak source tracking system, which has already been trained. Plant managers and emergency responders then obtain the top five suspected leak source locations with their occurrence probabilities from the leak source tracking system. Since fixed-type sensors are sufficient to operate the tracking system, it is free from the process of tracking along the plume and all the problems that arise from it. The proposed methodology to develop a system that can track the location of a leak source using a sensor grid installed along the plant fence is shown in Fig. 2. It is difficult to obtain chemical leak accident scenario data by performing actual leak experiments on real plants due to high risk
and cost factors. Thereby, we propose an approach that uses CFD simulation data of chemical leak accident scenarios as the training and test datasets for the machine learning algorithms. This approach involves the assumption that the CFD simulation data are very similar to actual leaks. If there are subtle differences between the CFD simulation and the actual phenomenon, some performance degradation is expected; however, if the predictions derived in real time are considered statistically, such errors will be greatly reduced. Since the leak flow map varies greatly with the geometry of the target, environmental data of the target should be collected after the target has been selected. Next, potential chemical leak accident scenarios should be selected. After conducting CFD simulations based on the selected scenarios, the placement of the fixed-type sensors must be optimized. After generating the coordinates of the optimally placed sensor locations in the CFD model, the data for training the leak source tracking model are derived; this is the process of building the database necessary for training. Subsequently, the leak source tracking model is developed. This step includes the classifier design process, the training process, and the testing process. Among these procedures, this paper focuses on the development of the leak source tracking models.

Fig. 1. Comparison between the inverse tracking method and the machine learning based method: (a) inverse vector tracking method; (b) machine learning based method.
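To make the run-time behaviour of the proposed system more concrete, the following is a minimal sketch of how fence-sensor readings and wind data could be passed to an already trained classifier to obtain the top five suspected sources with their probabilities; the function and variable names are illustrative and not part of the original implementation.

import numpy as np

# Illustrative sketch (hypothetical names) of the run-time use described in Section 2:
# fence-sensor concentrations plus wind data are fed to a trained classifier and the
# five most probable leak source locations are reported with their probabilities.
def report_top5(model, sensor_ppm, wind_speed, wind_direction, class_labels):
    features = np.concatenate(([wind_speed, wind_direction], sensor_ppm)).reshape(1, -1)
    probs = model.predict_proba(features)[0]      # assumes a scikit-learn-style classifier
    order = np.argsort(probs)[::-1][:5]           # indices of the five highest probabilities
    return [(class_labels[i], float(probs[i])) for i in order]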
3. CFD simulations for scenarios of chemical leak accidents

The data used for training the DNN and RF classifiers were derived from the CFD simulation results for chemical leak accident scenarios jointly studied in Cho (2017). The CFD simulations were conducted for a target chemical plant in Yeosu, Korea (Fig. 3), considering scenarios that represent actual leak accidents. A total of 640 scenarios were simulated for 0–750 s of real time, composed of the 40 potential leak source locations shown in Fig. 4 and 16 wind directions. The leakage was set to start at 100 s in the simulations. Fig. 5 shows an example of the CFD simulations. The optimal placement of sensors was decided based on the simulation results: for all 640 scenarios, the sensor placement must be able to detect ERPG-2 concentrations. For this purpose, we sought a placement that satisfies this requirement with the minimum number of sensors, using the following simple iterative logic:
repeat {
    if the maximum concentration sensed by the current sensors over the 640 scenarios < ERPG-2 then
        move 'the last sensor' to the right by 0.5 m
        if 'the last sensor' has reached the far right then
            if there is a former sensor still on its left then
                call that former sensor 'the last sensor' and move it to the right by 0.5 m
            else
                add one new sensor and rearrange the sensors from the left at 0.5 m intervals
    else
        save the current locations of the sensors (placement found)
}
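A rough Python rendering of this placement loop is given below as a sketch only; detects_all_scenarios() is a hypothetical stand-in for querying the 640 CFD scenarios, and the fence length and step size are assumptions.

STEP = 0.5             # sensor relocation step (m), as in the logic above
FENCE_LENGTH = 400.0   # assumed length of the fence section being instrumented (m)

def place_sensors(detects_all_scenarios):
    """detects_all_scenarios(positions) -> True if, for every one of the 640 scenarios,
    the maximum concentration sensed at 'positions' reaches ERPG-2 (hypothetical helper)."""
    positions = [0.0]                 # start with a single sensor at the left end of the fence
    last = len(positions) - 1         # index of 'the last sensor' currently being moved
    while not detects_all_scenarios(positions):
        if positions[last] + STEP <= FENCE_LENGTH:
            positions[last] += STEP   # move 'the last sensor' to the right by 0.5 m
        elif last > 0:
            last -= 1                 # the former sensor becomes 'the last sensor'
            positions[last] += STEP
        else:
            # no sensor can be moved further: add one sensor and rearrange from the left
            positions = [i * STEP for i in range(len(positions) + 1)]
            last = len(positions) - 1
    return positions                  # a placement that detects ERPG-2 in all scenarios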
Fig. 2. Flowchart for developing a source tracking system.
The minimum number of sensors derived was eleven. When placing eleven sensors on the plant fence, there were several placements that detected concentrations of ERPG-2 or higher for all scenarios. These candidate placements were ranked by how well they satisfied three criteria over all scenarios: the average number of sensors detecting concentrations of ERPG-2 or higher is high; the initial detection time is short; and the detected concentration magnitude is large. The optimal sensor locations derived from these criteria are shown in Fig. 4 and Table 2.

4. Modeling for deep neural network

4.1. Training data

The concentration data at the eleven sensors derived from the simulations, together with the wind speed and wind direction data, were used for training. For supervised learning, each dataset must be labeled first.
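As a minimal illustration of the labeling rule described in this section (the names and the ERPG-2 value below are assumptions used only for the sketch):

import numpy as np

ERPG2_TOLUENE_PPM = 300.0   # assumed ERPG-2 value for toluene, for illustration only

def label_dataset(sensor_concentrations_ppm, source_id):
    """Label one dataset: the leak source digit (1-40) if the maximum sensed
    concentration reaches ERPG-2, otherwise 0 (no release / no alarm needed)."""
    if np.max(sensor_concentrations_ppm) >= ERPG2_TOLUENE_PPM:
        return source_id
    return 0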
Fig. 3. Yeosu chemical plant: target of chemical leak CFD simulations.
Fig. 4. Locations of labeled leak sources and optimal placement of sensors.
Fig. 5. An example of the CFD simulations.

Table 2
Coordinates of the optimally placed sensors.

Sensor | X (m) | Y (m)
#1 | 0.05 | 0.00
#2 | 242.50 | 0.00
#3 | 306.00 | 26.23
#4 | 325.22 | 63.57
#5 | 339.87 | 92.03
#6 | 336.18 | 143.15
#7 | 304.25 | 186.21
#8 | 260.89 | 172.43
#9 | 200.62 | 168.28
#10 | 158.83 | 172.44
#11 | 2.13 | 30.74

The parameters of the ANN, such as the weights and biases, are adjusted in the direction that reduces the error between the network output and the labeled value. As this error decreases, the classification accuracy increases; this process is called learning or training. The ERPG-2 concentration of toluene was adopted as the labeling threshold. Each dataset whose maximum concentration is above ERPG-2 was labeled with the digit of the source location shown in Fig. 4; if the maximum concentration of a dataset is below ERPG-2, the dataset is labeled 0. Label 0 means that there is no release, or that no alarm is needed because the reading may be due to sensor error or noise. Here, ERPG-2 is the maximum airborne concentration below which nearly all individuals could be exposed for up to 1 h without experiencing or developing irreversible or other serious health effects. The labeling threshold used in this study is only an assumption and can be set lower or higher depending on the case.

4.2. Modeling of layer structure

From a modeling viewpoint, the problem can be regarded as classifying datasets into 40 or 41 categories. Each dataset is composed of the concentration data, the wind speed and the wind direction. The 40 categories correspond to the number of potential leak source locations shown in Fig. 4; the 41-category case adds the separate label 0 for datasets that do not reach the ERPG-2 concentration. The ANN consists of an input layer that receives the data, hidden layers that extract features from the input data, and an output layer for classification. The number of neurons in the input layer must equal the number of variables in the input dataset; in this way, each variable recorded in the database is connected to one neuron in the input layer and its values are fed in one by one. The importance (weight) of each variable for classification is determined by the neurons in the hidden layers, and the optimal weight values are found automatically during training; in other words, assigning weights to the variables so that the leak location corresponding to each dataset is classified correctly is the role of the hidden layers. Therefore, the values fed into the input neurons must be normalized so that all variables share the same scale, giving every variable an equal opportunity to be judged important for correct classification. The output layer has a single neuron that is fully connected to the neurons of the last hidden layer. The Softmax function applied to the last hidden layer converts the derived values into the probabilities of belonging to each potential leak source location, and the suspected leak location with the highest probability is produced at the output neuron. Therefore, the number of neurons in the last hidden layer is designed to equal the number of classification categories; in this study it was set to 40 or 41 according to the DNN structure. Hinton and Salakhutdinov (2006) introduced the neural network as a dimensionality reduction model. For this reason, we used a hidden-layer structure that becomes narrower as the layers become deeper, expecting the data to pass through this narrowing path and yield enriched features for classification. The normalized data entering each input neuron are weighted by the connection weights between the neurons of the hidden layers; the weighted sum is then transformed by the activation function and forwarded to the next neuron. The Softmax function was used as the activation function of the last hidden layer to convert the derived values into the probability of each leak
location, and the ReLU function was used as the activation function of the other hidden layers. Fig. 6 shows the schematic structure of the neural network used in this paper.

Fig. 6. Schematic diagram of a Deep Neural Network.

4.3. Experiments for the network structure

The hidden layers play the role of finding features of the dataset so that the input dataset can be classified into the desired category, and the structure of the hidden layers plays a key role in determining the classification accuracy of the ANN. However, since the humanly perceivable input variables become humanly unperceivable data once they enter the hidden layers, no general method for optimizing the structure of the hidden layers has been established yet (Zeiler and Fergus, 2013). Therefore, the optimal structure has to be derived through experiments. For the same input and output sizes, we examined the change in accuracy as the number of hidden layers was increased. By using a structure that becomes narrower as the hidden layers become deeper, more valuable features can be found during data propagation through the model, as in a dimensionality reduction model. The training and the network structure experiments were conducted using the Python-based Keras library. A Batch Normalization (Ioffe and Szegedy, 2015) layer is known to significantly reduce training time and to mitigate overfitting by preventing parameter bias; therefore, the DNN structure experiments were conducted with fully connected layers and BN layers (batch size: 30,000). The simplest approach to the problem is to build a structure that uses only the values measured at the current time step: the wind speed, the wind direction and the concentration data measured at the eleven sensors. In this case, the input layer has 13 neurons. The number of neurons in the last hidden layer is 40 (Model 1) or 41 (Model 2); the other features of the structures of Model 1 and Model 2 are the same. However, Model 1 does not use the datasets that do not reach the ERPG-2 concentration (label 0), whereas Model 2 uses all datasets derived from the 0–750 s real-time simulations of the 640 scenarios. The datasets are divided into training sets for training and test sets for the validation procedure; the training and test datasets are randomly split in a 7 to 3 ratio. Fig. 7 shows the results of the training and verification of Model 1 and Model 2.

Fig. 7. Comparison of the training results of Model 1 and Model 2.

The maximum classification accuracy on the test datasets was 0.3715 for Model 1 and 0.4920 for Model 2 when the models were trained up to 10,000 epochs, so the accuracy of Model 2 is relatively higher. However, when we derived the classification accuracy for the test data of label 0, the accuracy was 0.9945, while the classification accuracy for the test data labeled with the other leak locations was 0.3271. The reason why the
classification accuracy of Model 2 is relatively higher is the label-0 datasets. The simulation data of all scenarios show no leakage during 0–100 s; the concentration data collected from the 11 sensors are measured as 0 in that period, which makes the pattern easy to recognize. This seems to be the main reason why the accuracy of Model 2 is relatively higher than that of Model 1. However, the test accuracy excluding the label-0 datasets was 0.3271, which is lower than the classification accuracy of Model 1 (0.3715). In other words, learning was disturbed by the specificity of the label-0 datasets. Judging whether a leak has occurred or not is the role of the sensors, and there is a high likelihood that a wrong alarm would be triggered due to the relatively lower accuracy when the model is applied to the wide real-time range. For these reasons, it seems appropriate to exclude the datasets labeled 0 from learning.

The classification accuracies of Model 1 and Model 2, 0.3715 and 0.4920, are considerably higher than the probability of random classification into 40 or 41 categories (1/40 or 1/41, respectively). However, because Model 1 and Model 2 have an unbalanced structure in which the number of neurons in the input layer is smaller than the number of classification categories, the accuracy is expected to rise when the amount of input information (the number of neurons in the input layer) is enlarged. The problem is how to provide more input data to the DNN in a reasonable way; to do this, it is necessary to analyze the features of the input data. Fig. 8 shows the concentration data detected by the eleven sensors over 0–750 s for source 17 when a south-east wind blows.

Fig. 8. Concentration data from the CFD simulation for leak source 17, wind direction SE.

In most of the 640 leak accident scenarios, periodic concentration detection patterns were found, because the wind field is deformed over time by the various structures in the chemical plant. Therefore, we tried to input the measured data of past time points into the DNN together with the measured data of the current time point. Model 3 uses the wind direction, the wind speed, the X and Y coordinates of the eleven sensors, and the concentrations measured at six time steps into the past (-60 s, -30 s, -20 s, -10 s, -5 s, 0 s) as one input dataset; in other words, 90 variables form one dataset. Model 3 has the same number of hidden layers as Model 1, but the number of neurons in the input layer is expanded from 13 to 90. When Model 3 was trained up to 10,000 epochs, as shown in Fig. 9, the maximum classification accuracy for the test datasets greatly improved to 0.6579.

Fig. 9. Comparison of the training results of Model 1 and Model 3.

Models 4–6 build up 13, 19 and 25 hidden layers based on Model 3. In general, as hidden layers are stacked more deeply, the nonlinearity increases; if the layers are stacked deeper and deeper, the training accuracy tends to keep increasing and the test accuracy also tends to increase. However, deeper layers may cause an overfitting problem on the test dataset because too many parameters have to be fitted. In addition, the deeper the layers, the more the gradient vanishing/exploding problem can arise: deep layers make it difficult to update the parameters by backpropagation because the error is hard to propagate back to the first layers. In this case, the accuracy on the test dataset can decrease.
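To make the layer structure concrete, the following is a minimal Keras sketch of a narrowing fully connected network with batch normalization, roughly corresponding to Model 3 in Table 4; it is not the authors' code, and the optimizer and loss function are assumptions since the paper does not state them. In standard Keras terms, the "40-neuron last hidden layer with Softmax" corresponds to a 40-unit softmax output layer.

from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_inputs=90, n_classes=40):
    """Narrowing fully connected classifier with batch normalization (Model 3-like)."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_inputs,)))           # 90 input variables (Section 4.3)
    for width in (70, 50, 40, 30, 20, 15, 25):           # hidden widths taken from Table 4
        model.add(layers.Dense(width, activation='relu'))
        model.add(layers.BatchNormalization())
    model.add(layers.Dense(n_classes, activation='softmax'))   # probability per leak source
    # Optimizer and loss are assumptions; the paper does not specify them.
    # Labels are assumed re-indexed to 0-39 for sparse categorical cross-entropy.
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# Example usage with hypothetical data arrays:
# model = build_dnn()
# model.fit(X_train, y_train, epochs=10000, batch_size=30000,
#           validation_data=(X_test, y_test))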
Fig. 10. Comparison of the training accuracy of Models 3–6.

Fig. 11. Comparison of the test accuracy of Models 3–6.

When Models 3–6 were trained up to 10,000 epochs, the maximum training accuracies were 0.6907, 0.7507, 0.8030 and 0.8070, respectively (Fig. 10), and the maximum test accuracies were 0.6579, 0.6940, 0.7250 and 0.7356, respectively (Fig. 11). Looking at the training results up to 10,000 epochs, both accuracies tended to increase as the number of hidden layers increased from 8 to 13, 19 and 25. However, considering the maximum classification accuracy, the accuracy did not increase greatly when the number of hidden layers was increased from 19 to 25. It was therefore concluded that, even if the hidden layers were stacked deeper, a large improvement in accuracy could not be expected.

5. Modeling for the Random Forest classifier

To compare the DNN results with another algorithm, an RF classifier was applied to the same problem. A schematic diagram of the RF classifier used in this paper is shown in Fig. 12. RF is an extension of bagging that uses maximum voting (maximum likelihood) from each tree to obtain a final prediction better than a single decision tree (Breiman, 2001). RF generates random samples of the data and recognizes a key set of attributes to grow decision trees. After the generation of key attributes, it builds multiple trees and calculates their Out of Bag (OOB) error rate to decide which tree will be used for a specific set of features. The OOB error is a measure of RF performance that calculates the classification error over all the trained classification trees in the RF model. The collections of decision trees with lower OOB error rates are compared to find the set of variables that can produce a more accurate classification model. RF is quite an effective tool in prediction and in general classification. Since a suitable network structure has already been generated for the DNN models, it simplifies the work to be done for the RF models; however, the main parameters of the algorithm need to be tuned to find the combination that gives the highest accuracy. Unfortunately, so far there is no defined (optimal) way of tuning the RF structure, as with the DNN node structure, except engineering judgment, so different tuning approaches were used in different models to reach the maximum achievable accuracy. Before getting to the RF models, three basic parameters of the RF algorithm in the Scikit-learn library for Python (Srivastava, 2015) are described below.

Fig. 12. Schematic diagram of the Random Forest classification method for predicting leak source locations.

Number of estimators: this parameter is the total number of trees used in the RF model. The RF model is based on the results determined by each tree and uses the maximum likelihood method to obtain the optimal result.
Bagging: bagging is random sampling of the dataset to be applied to each tree; K repetitions of bagging are applied to make K trees.
Confusion matrix: the confusion matrix (Table 3) is used by the RF model to determine how many mis-predictions it produces. The rate of mis-predicting the result is called the Out of Bag (OOB) error rate. Each tree is created using features that are less likely to mis-predict results. Using the confusion matrix, the misclassification rate can be calculated as follows:

Misclassification rate = (FP + FN) / (total number of predictions)    (1)

Table 3
Confusion matrix.

 | Predicted No | Predicted Yes | Row-wise total
Actual No | TN | FP | TN + FP
Actual Yes | FN | TP | FN + TP
Col-wise total | TN + FN | FP + TP |

Where TN: true negative; FP: false positive; FN: false negative; TP: true positive.
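As an illustration of how these options map to Scikit-learn's RandomForestClassifier, a sketch is given below (not the authors' exact code); the specific settings, i.e. Gini impurity for node splitting, random selection without replacement, the square root of the feature count per tree, and 50–100 estimators, are those reported in Section 5.1.

from sklearn.ensemble import RandomForestClassifier

# Sketch of the RF configuration reported in Section 5.1 (not the authors' exact code).
rf = RandomForestClassifier(n_estimators=100,   # 50 for Models 7 and 8, 50-100 for Model 9
                            criterion='gini',   # Gini impurity for node splitting
                            bootstrap=False,    # random selection without replacement
                            max_features='sqrt')  # sqrt of the total feature count per tree

# Example usage with hypothetical data arrays:
# rf.fit(X_train, y_train)                  # X: 90 features per sample, y: leak source labels
# test_accuracy = rf.score(X_test, y_test)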
5.1. Experiments for tuning the optimal parameters

The simplest RF model (Model 7) has as input the 11 concentration values measured at the optimally placed sensors, the wind velocity and the wind direction. These 13 features were also used as the input nodes of DNN Model 1; the data and network structure used in every DNN model are projected onto equivalent RF models, while the parameters unique to RF are kept and treated separately. The number of estimators for this model was set to 50, which is useful to see the effect on the other models as the number of estimators increases. For the splitting of nodes, the Gini impurity was used, since it is symmetric about 0.5 and therefore suitable for minimizing misclassification. Bootstrap was set to false to make the random selection without replacement, and the maximum number of features considered to build a single tree was set to the square root of the total number of features. Analyzing the prediction accuracy of Model 7, the small number of inputs (13 features) is not an appropriate amount of input for a 40-class problem. As shown in Fig. 13, the prediction accuracy of Model 7 is 0.4196, which is slightly above the prediction accuracy of DNN Model 1. To confirm that the low prediction accuracy is mainly due to the structure of the dataset, the number of estimators was increased up to 100, but the prediction accuracy remained stable at 0.4206.

Fig. 13. Comparison of the training results of Model 7 and Model 8.

For Model 8, the 41-classification case was compared with the 40-classification case of Model 7. The additional label is the no-release case, which is easy to predict since most of the time the data show no leak release. Even though the prediction accuracy of Model 8 increased to 0.5325, analysis of the prediction accuracy for the individual leak sources showed that, except for location 0, the rest of the leak source locations had lower accuracy than in Model 7. This observation confirms that, even with a high total prediction accuracy, the individual classification accuracies must be checked to make sure that no single class makes the total prediction accuracy unreasonably higher than the rest of the classes. Based on the information from Model 8, location 0 was excluded in the next model. Model 9 uses the modified time-series structure of DNN Model 3: the additional inputs are the six time steps into the past (-60 s, -30 s, -20 s, -10 s, -5 s, 0 s) and the X, Y coordinates of the 11 sensors. Since the number of inputs was increased from 13 to 90 to consider the past time data, the prediction accuracy of Model 9 increased dramatically (Fig. 14): even with 50 estimators, a prediction accuracy of 0.8599 was recorded. As the number of estimators was increased to reach a saturated prediction accuracy, it reached near the saturation point at 100 estimators with a maximum prediction accuracy of 0.8636 (Fig. 15).

Fig. 14. Comparison of the training results of Model 7 and Model 9.

Fig. 15. Training result of Model 9.

6. Comparison and discussion

6.1. Accuracy and Top-5 error

In this study, structure experiments on six DNN models and three RF classifiers were conducted. Table 4 and Table 5 show the results of the DNN structure experiments (Models 1–6) and the RF classifier structure experiments (Models 7–9), respectively. Here, the Top-5 error is the probability that the top five categories with the highest predicted probability do not include the actual leak point.

Table 4
Results of the deep neural network models. An asterisk (*) indicates that a batch normalization layer was added after the layer.

Model | Network structure | Hidden layers | Input/output neurons | Epochs | Accuracy (training/test) | Top-5 error (training/test)
Model 1 | 13-15*-20*-25*-20*-25*-30*-35*-40-1 | 8 | 13/40 | 10,000 | 0.3777/0.3715 | 0.2153/0.2234
Model 2 | 13-15*-20*-25*-20*-25*-30*-35*-41-1 | 8 | 13/41 | 10,000 | 0.4954/0.4920 | 0.1798/0.1848
Model 3 | 90-70*-50*-40*-30*-20*-15*-25*-40-1 | 8 | 90/40 | 10,000 | 0.6907/0.6579 | 0.0602/0.0710
Model 4 | 90-70*-60*-55*-50*-45*-40*-35*-30*-25*-20*-15*-25*-40-1 | 13 | 90/40 | 10,000 | 0.7507/0.6940 | 0.0408/0.0591
Model 5 | 90-80*-75*-70*-65*-60*-55*-50*-45*-40*-35*-30*-25*-20*-17*-15*-17*-20*-25*-40-1 | 19 | 90/40 | 10,000 | 0.8030/0.7250 | 0.0302/0.0526
Model 6 | 90-80*-75*-70*-65*-60*-55*-50*-45*-40*-35*-30*-25*-25*-30*-27*-25*-23*-20*-17*-13*-18*-22*-25*-30*-40-1 | 25 | 90/40 | 10,000 | 0.8070/0.7356 | 0.0268/0.0480
Model 6 | (same structure as above) | 25 | 90/40 | 100,000 | 0.8520/0.7543 | 0.0188/0.0450

Considering the results of the DNN, Model 6 shows the best performance on all accuracy and Top-5 error criteria, and in the experiments, stacking more layers on top of Model 6 did not show a large improvement in accuracy or Top-5 error. Model 6 was therefore selected as the optimal model and trained
to 100,000 epochs to obtain the parameter values with the highest accuracy for the test data. In general, the training accuracy increases steadily, while the test accuracy tends to increase up to a certain epoch and then decrease because of overfitting. However, the decrease in test accuracy caused by overfitting was not observed here, owing to the large amount of data and the use of batch normalization. In addition, there was no clear further improvement in accuracy as the amount of training increased. For these reasons, Model 6 was selected as the final model. When training Model 6 up to 100,000 epochs, the maximum test accuracy and Top-5 test error are 0.7543 and 0.0450, respectively (Fig. 16). Among the RF models, Model 9 using 100 estimators shows the best accuracy and error: its maximum test accuracy and Top-5 test error are 0.8636 and 0.0335, respectively. Given that this is a 40-class problem, the accuracy is fairly high; furthermore, the Top-5 test error of 0.0335 seems low enough for the system to be immediately applicable. The top-5 suspected source locations can be reported to the plant manager and emergency responders together with the respective probability of each source, which reduces the time to search for possible leak sources and in turn reduces the damage of the leak accident. Comparing the DNN and RF based on accuracy and error, RF showed better performance. However, we cannot conclude that RF always outperforms the ANN for all leak source tracking problems. Since the number of parameters that can be tuned in RF is very small, the results obtained in this study are likely very close to the optimum; the ANN, in contrast, has a very high degree of freedom in structural design, and RNN-, CNN- and DNN-based variant structures may show improved performance in some cases. The most important point is that both DNN and RF can be used to solve the leak source tracking problem of a chemical plant with high accuracy.
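For clarity, the Top-5 error used in Tables 4 and 5 can be computed from the class probability outputs (e.g., the softmax outputs of the DNN or predict_proba of the RF) roughly as follows; the helper below is illustrative and not taken from the paper.

import numpy as np

def top5_error(prob_matrix, class_labels, y_true):
    """Fraction of test samples whose true leak source is NOT among the five
    classes with the highest predicted probability (illustrative helper)."""
    class_labels = np.asarray(class_labels)
    top5 = np.argsort(prob_matrix, axis=1)[:, ::-1][:, :5]   # top-5 class indices per sample
    hits = [y in class_labels[idx] for y, idx in zip(y_true, top5)]
    return 1.0 - np.mean(hits)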
Table 5
Results of the Random Forest classifiers.

Model | Input/output neurons | Estimators | Test accuracy | Top-5 test error
Model 7 | 13/40 | 50 | 0.4196 | 0.2023
Model 8 | 13/41 | 50 | 0.5325 | 0.1679
Model 9 | 90/40 | 50 | 0.8599 | 0.0362
Model 9 | 90/40 | 100 | 0.8636 | 0.0335
Fig. 16. Training result up to 100,000 epochs of Model 6.
Fig. 17. Test accuracy of Deep Neural Network and Random Forest by labeled source location.
Fig. 18. Predicted leak sources for the datasets labeled as source 31.
6.2. Analysis of misclassification cases

Fig. 17 shows the test accuracy for each labeled leak source location. Among the 40 leak sources, the classification accuracy is low in the region where many other leak sources are crowded together, owing to the relative difficulty of classifying leak sources from data with similar patterns. Leak sources 31 and 32 have the lowest accuracies for the DNN and RF, 0.3528 and 0.3300, respectively (source 31 also shows the second lowest accuracy for RF). However, analyzing the locations predicted for the datasets labeled as source 31, which have the lowest accuracy, we can confirm that even when the model predicts the wrong leak source, the wrongly predicted location is physically close to the actual source location (Fig. 18). For this reason, these cases can hardly be called misclassifications.
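The adjacency argument in this subsection can be quantified with a simple check like the following sketch, where the source coordinate map and label arrays are hypothetical placeholders:

import numpy as np

def misclassification_distances(y_true, y_pred, source_xy):
    """For misclassified samples, distance (m) between the predicted and the actual
    leak source, given source_xy: {source label: (x, y) coordinate in metres}."""
    wrong = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    return np.array([np.hypot(source_xy[p][0] - source_xy[t][0],
                              source_xy[p][1] - source_xy[t][1])
                     for t, p in wrong])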
7. Conclusions

In this study, DNN and RF classifiers were used to solve the leak source tracking problem, a challenging inverse problem of the dispersion simulation of known chemical leaks. Since it is difficult to obtain chemical leakage and dispersion data by performing leak accident experiments on an actual plant, CFD simulations of leak accident scenarios were performed and the necessary data were collected in a database to train and test the DNN and RF models. The CFD simulation results for 640 chemical leak accident scenarios of a real chemical plant, composed of 40 potential leak sources and 16 wind directions for each source, were used for training the DNN and RF classifiers. Six DNN models and three RF classifier models were designed and tested to find the best performing model. After training the optimal DNN and RF models derived from the experiments, using 25 hidden layers in the DNN model and 100 trees in the RF classifier, maximum test accuracies of 75.43% and 86.33% were achieved, respectively. Analyzing the misclassified source locations, we confirmed that the mis-predicted leak source locations are physically adjacent to the actual leak source locations, i.e. close answers were obtained; for this reason, they can hardly be called wrong classifications. Furthermore, when the top five candidates are considered as the set of predicted leak sources, the Top-5 prediction errors of the DNN and RF classifiers are reduced to 4.50% and 3.35%, respectively. This means that plant managers and emergency responders can obtain reliable information on the top-5 suspected leak locations and their occurrence probabilities from the proposed leak source tracking system. Thereby, using this methodology, chemical leak accidents can be coped with great efficiency in real time.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2017R1E1A2A01079660).

References

Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Cho, J., 2017. Placement Optimization and Reliability Analysis of Stationary and Mobile Sensors for Chemical Plant Fence Monitoring. M.S. Thesis. Myongji University.
Eckerman, I., 2005. The Bhopal Saga: Causes and Consequences of the World's Largest Industrial Disaster, first ed. Universities Press, Hyderabad.
Hinton, G.E., Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science 313, 504–507.
Ioffe, S., Szegedy, C., 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167.
Ishida, H., Yoshikawa, K., Moriizumi, T., 2004. Three-dimensional gas-plume tracking using gas sensors and ultrasonic anemometer. Proc. IEEE 1175–1178.
Jaising, I., Sathyamala, C., 1992. Legal rights and wrongs: internationalizing Bhopal. J. Dev. Dialogue 1, 103–115.
Lee, S., 2013. White Paper of the Hube Global Inc. Hydrofluoric Acid Leak Accident, first ed. Safety and Disaster Division of Gumi City Hall, Gumi.
Lee, Y., 2014. Chemical leak problems and alternative solutions, presentation slides. In: 3rd Forum of Environmental Justice.
Li, Z., Rizzo, D.M., Hayden, N., 2006. Utilizing artificial neural networks to backtrack source location. In: Proceedings of the 3rd International Congress on Environmental Modeling and Software, pp. 1–6.
Marques, L., Almeida, N., De Almeida, A.T., 2003. Olfactory sensory system for odour-plume tracking and localization. Proc. IEEE 1, 418–423.
Pisano, W.J., Lawrence, D.A., 2007. Data dependant motion planning for UAV plume localization. In: AIAA Guidance, Navigation and Control Conference and Exhibit, vol. 20, pp. 6740–6753.
Rege, M.A., Tock, R.W., 1996. A simple neural network for estimating emission rates of hydrogen sulfide and ammonia from single point sources. J. Air Waste Manag. Assoc. 46, 953–962.
Reich, S.L., Gomez, D.R., Dawidowski, L.E., 1999. Artificial neural network for the identification of unknown air pollution sources. Atmos. Environ. 33, 3045–3052.
Singh, R.M., Datta, B., Jain, A., 2004. Identification of unknown groundwater pollution sources using artificial neural networks. J. Water Resour. Plann. Manag. 130, 506–514.
Srivastava, T., 2015. Tuning the Parameters of Your Random Forest Model. https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/ (accessed 17.06.14).
Zeiler, M.D., Fergus, R., 2013. Visualizing and Understanding Convolutional Networks. arXiv:1311.2901.
Zhang, T., Li, H., Wang, S., 2012. Inversely tracking indoor airborne particles to locate their release sources. Atmos. Environ. 55, 328–338.