Optik - International Journal for Light and Electron Optics 193 (2019) 162879
Original research article
Automatic Early Smoke Segmentation based on Conditional Generative Adversarial Networks
Yang Jia a,b,*, Hanrong Du a,b, Haijuan Wang a,b, Runyang Yu a, Lianghui Fan a,b, Gao Xu c, Qixing Zhang c

a School of Computer Science, Xi'an University of Posts and Telecommunications, 710121, China
b Shaanxi Key Laboratory of Network Data Intelligent Processing, Xi'an University of Posts and Telecommunications, 710121, China
c State Key Laboratory of Fire Science, University of Science and Technology of China, 230026, China
ARTICLE INFO

Keywords: video smoke detection; smoke segmentation; smoke image dataset; conditional generative adversarial networks (cGAN); cross-validation

ABSTRACT
In the design of video smoke detection (VSD) systems, suspected smoke regions are generally segmented from video frames first to lower the video-processing load, speed up smoke detection and reduce the false alarm rate. However, the environment adaptability of traditional smoke segmentation algorithms based on motion and color analysis still needs improvement. In this research, a model based on conditional generative adversarial networks (cGAN) is designed to automatically segment smoke regions from successive video frames. By learning from labeled smoke regions, a model mapping an original video frame to a segmentation result is built, and smoke regions in different scenarios can be segmented with this model. Experimental results on the test set show that, compared with traditional methods, segmentation accuracy is improved and the speed of smoke segmentation is approximately ten times faster than the previously proposed saliency detection based method. This work provides a new deep learning based method for smoke segmentation, which can be utilized for video smoke feature analysis and for VSD applications in the future.
1. Introduction

With the development of computer technology and electronic technology, intelligent monitoring systems and the IoT have been widely used in various places in our daily lives, such as shopping malls, hospitals, museums, factories, warehouses and other city buildings [1–3]. At present, most of the closed-circuit televisions there can capture and store video images quickly and accurately, and the application of video detection technology has attracted many researchers and entrepreneurs. Video fire detectors detect flame and smoke by analyzing image signals collected by video cameras. Once smoke or flame has been detected in the monitoring scope, a fire alarm signal can be sent to the fire linkage alarm system, and administrators can handle the emergency as quickly as possible. Compared to traditional temperature-sensing and smoke-sensing detectors, advantages such as non-contact operation, quick response, direct viewing, large detection scope and few restrictions on storey height make video fire detection (VFD) an attractive technology [4–9]. The cost of equipment replacement can be avoided by embedding the detection algorithm into already installed camera monitoring equipment.

Early research on VFD mainly focused on flame detection. However, during the process of fire development, fire develops slowly in the smoldering stage, and after the appearance of flame, fire develops exponentially and the fire risk increases.
* Corresponding author. E-mail address: [email protected] (Y. Jia).

https://doi.org/10.1016/j.ijleo.2019.05.085
Received 14 February 2019; Received in revised form 19 May 2019; Accepted 25 May 2019
0030-4026/ © 2019 Elsevier GmbH. All rights reserved.
Fig. 1. Flowchart of the proposed cGAN based smoke segmentation method.
If smoke can be detected as early as possible, the fire can be controlled in time and more time can be saved for fire alarming and firefighting. Therefore, video smoke detection (VSD) in the smoldering stage has become an important research topic.

At present, VSD is mainly based on rule-based reasoning using shallow features of smoke. Researchers try to describe the smoke image with visual features such as color, velocity, transparency, moving direction and so on [4,10–12]. However, due to the lack of a high-level feature mechanism describing smoke movement and external form, there are still problems with the current features and recognition algorithms. For example, if the dataset in the test experiment is changed, the classification result and the recognition rate vary greatly. The feature generalization ability is insufficient, and high-level smoke features still need to be studied.

In recent years, the effectiveness of AI methods based on deep learning has been proved by many experiments in image recognition [13–15]. Frizzi et al. [14] constructed a nine-layer convolutional neural network trained with 27,919 tagged images and obtained a three-way classification of smoke, flame and other images with 97.9% accuracy. The deep neural network is used to classify fire images, but compared to other classifiers, such as support vector machines (SVM) and random forest, the classification accuracy was not evidently improved, and the advantages of using a deep neural network for target recognition, such as the effectiveness of the features across different environments and test sets, were not explained. Fu [16] built a database of fire image samples captured during the day and night; traditional SVM classification reaches an accuracy of 92.14%, while a 6-layer convolutional neural network (CNN) reaches 95.71%. Xu et al. [17] proposed a deep domain adaptation based approach for video smoke detection to extract a powerful feature representation of smoke; their synthetic smoke samples strengthened the dataset and improved the performance of the deep CNN model. Deep neural networks are mostly used as classifiers to distinguish sample images; however, preprocessing, segmentation and sample making in the preliminary stage of detection are very important, and these operations can greatly affect the performance of the algorithm.

In this research, a deep learning based smoke segmentation method for VSD is proposed. Many well-known convolution-based segmentation networks have been proposed since 2014, such as CNN, FCN, and U-net [18–20]. Ronneberger et al. [20] proposed U-net, which segments an image through pixel classification and pixel localization. The algorithm needs only a small number of labeled images, and with the aid of image augmentation it shows excellent performance in medical image segmentation; it is widely used in segmentation tasks. The generative adversarial network (GAN) is also a powerful framework that works like a game between a counterfeiter producing fake currency and a detector spotting it. Competition in this game drives both the generator and the discriminator to improve until the counterfeits are indistinguishable [21]. If the GAN model is trained well, the generator can produce a precise segmentation result [22]. In traditional segmentation frameworks, the task is executed based on human cognition of the target, such as texture, color and shape [23,24].
However, abstract high-level features cannot be expressed well by humans with these low-level feature descriptors. Therefore, in this research, a conditional GAN (cGAN) [25] combined with U-net is used to find pixels of the foreground target (generative model) and to calculate the higher-dimensional differences between the generated smoke distribution and the real data distribution (discriminative model). The flowchart of our method is shown in Fig. 1. The smoke segmentation method can be divided into two parts: model training and smoke segmentation. The training part is shown in the green box in Fig. 1 and consists of four main steps: sample data preparation, generative model building, discriminative model building, and weight optimization. After model training is finished, a video frame containing smoke can be input into the model to obtain the segmentation result. This smoldering smoke segmentation algorithm based on cGAN can provide accurate smoke target images for smoke feature analysis and help to develop VFD systems based on deep features.

2. Material and Methods

2.1. Smoke Dataset and Pre-processing of Sample Images
Fig. 2. Frames clipped from the original images. (a) Smoke of forest fire. (b) Smoke of cotton outdoor. (c) Smoke with pedestrians. (d) Smoke of cotton indoor. (e) Smoke of cotton with pedestrians indoor.
The dataset used in this investigation consists of frames sampled from smoke videos. After computer-aided semi-automatic labeling of the frame images extracted from different videos, a dataset containing original images and corresponding segmentation results is constructed and divided into a training set and a test set. In addition, cross-validation is designed to test the generalization performance of the proposed method.

2.1.1. Smoke Image Dataset

Fire smoke videos used in this research were collected from Bilkent University (http://signal.ee.bilkent.edu.tr/VisiFire/Demo/SampleClips.html) [23], Ulsan University of Korea, and the State Key Laboratory of Fire Science (SKLFS) at the University of Science and Technology of China. Video sizes are not all the same, ranging from 320 × 240 pixels to 1920 × 1080 pixels. Several frames clipped from the original videos are shown in Fig. 2. Fig. 2(a) shows forest fire smoke. Fig. 2(b) shows cotton smoke outdoors. Fig. 2(c) shows cotton smoke in a standard combustion laboratory. Fig. 2(d) and Fig. 2(e) show smoke of cotton and leaves in an experiment chamber. The proposed segmentation pipeline first applies a preliminary threshold-based segmentation, then normalizes the segmented patches to black backgrounds and a uniform size (512 × 512 pixels).

2.1.2. Pre-processing of Smoke Sample Images

To obtain the mapping function from an original video frame to the segmentation result, labeled smoke images are needed. A semi-automatic segmentation method is used to label the original images as segmentation results. The main steps are as follows (a minimal sketch of these steps is given after this list):

Step 1: Read the first frame of a smoke video. With mouse interaction, a region of interest (ROI) is extracted and saved as the ROI template (pixels in the ROI area are set to 1, the others to 0).

Step 2: Read the next frame and record it as data_i (N × M), where i is the index of the sample image.

Step 3: Load the ROI and set all pixels outside it to zero. The smoke region inside the ROI is segmented with a threshold method (1 for smoke pixels, 0 for non-smoke pixels), yielding a binary smoke image. For smoldering smoke, the luminance of smoke in a gray (single-channel) image is higher than that of other parts, and a rule similar to [26] is used: L1 < I < L2, where I is the luminance (gray value) of a pixel and L1, L2 are the lower and upper boundaries of smoke color, which are adjusted for different videos (typical values are 150 and 220). Within one video sequence the distribution of smoke color does not change much, so a series of smoke labels can be obtained with a fixed pair of thresholds; for the next video the thresholds may need to be adjusted manually to get a well segmented smoke region. The product of the original image and the binary mask is treated as a label and recorded as label_i (N × M).

Step 4: Normalize the data into a square image (N × N). The specific operation is shown in Fig. 3. label_i is zero-padded into a square (N × N), and Bicubic Spline Interpolation is then used to convert the N × N image into a 512 × 512 image; the two images are saved as a data image and its corresponding label. Then go to Step 2.

This preliminary threshold-based segmentation in the ROI is not sufficient for sample making, so manual revision of the segmentation result is needed: for example, holes in the foreground are filled and small interferences are removed with the Win10 drawing tool. These manual operations ensure that the labels are reliable.
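The steps above can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the authors' exact implementation: it uses OpenCV and NumPy, the function names and default thresholds are ours, and for simplicity both frame and label are zero-padded here, whereas the paper fills the data frame with the cut-and-stitch operation of Fig. 3.

```python
# A minimal sketch of Steps 2-4 of the semi-automatic labeling, assuming
# OpenCV and NumPy; thresholds and names are illustrative.
import cv2
import numpy as np

def to_square_512(img, out_size=512):
    """Step 4: zero-pad an N x M image into a square, then bicubic-resize."""
    n, m = img.shape[:2]
    side = max(n, m)
    square = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    square[:n, :m] = img
    return cv2.resize(square, (out_size, out_size), interpolation=cv2.INTER_CUBIC)

def label_frame(frame_bgr, roi_mask, l1=150, l2=220):
    """Steps 2-3: threshold-label one frame inside the hand-drawn ROI.

    frame_bgr: original N x M video frame (BGR).
    roi_mask:  binary template of the ROI (1 inside, 0 outside).
    l1, l2:    lower/upper gray-value boundaries of smoke color (L1 < I < L2).
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Smoke pixels: luminance inside (l1, l2) and inside the ROI.
    smoke = ((gray > l1) & (gray < l2)).astype(np.uint8) * roi_mask
    label = frame_bgr * smoke[..., None]   # product of image and binary mask
    return to_square_512(frame_bgr), to_square_512(label)
```

As in the paper, the thresholds would be re-tuned per video, and the resulting labels manually revised before training.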
The input of the neural network in this study is designed to be a 512 × 512 image, but the videos in the original dataset are 1920 × 1080 and 320 × 240. To make the training samples consistent with the network input, video frame images are cut and stitched, and all sample images are converted to 512 × 512. As shown in the red rectangular box in Fig. 3(d), the size of the
Fig. 3. Schematic diagram of the conversion of video frames from the original N × M (N > M) to square N × N. (a) Original sample image, using the mouse to select the ROI. (b) ROI of the original smoke frame. (c) Segmentation result of the ROI with a threshold. (e) Normalized square image with cutting and stitching.
Fig. 4. Diagram of the training procedure of smoke segmentation network.
selected rectangle is ((N − M)/2) × (N/2), and four blocks of the same size are stitched together to fill the original image into an N × N image, as shown in Fig. 3(e). The corresponding label image is also zero-padded into a square N × N image. In previous work, some researchers used the shape of smoke as a feature in VSD, so we want to keep the original shape of the smoke: the cutting, splicing and scaling operations used to convert a rectangular frame into a square do not add any distortion to the smoke region.

2.2. Design of the Smoke Segmentation Neural Network

The main framework for smoke segmentation is based on cGAN combined with a U-net structure network. A diagram of the model training part is shown in Fig. 4. In the blue frame, a pairwise image combining an original smoke image and the corresponding label is used as the input; these pairs are used to train the U-net based generator. The green frame shows the output of the U-net segmentation network, which is also the output of the generative model in the conditional GAN. The discriminative model calculates the difference between the generated distribution and the labeled true data distribution. With the trained model, an image containing smoke can be input into the model and a generated segmentation result can be obtained. Details of model design and training are as follows.

2.2.1. cGAN based Smoke Segmentation Network

The GAN consists of a generative model G and a discriminative model D [21]. The two-player min-max game with value function V(D, G) of the standard GAN is given in equation (1):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
Here z is a random noise vector and x is the observed image. The generator G builds a mapping function from a prior noise distribution P_z(z) to the data space as G(z). The discriminator D(x) outputs the probability that x came from the training data rather than from the output of G. G and D are trained alternately: parameters of G are adjusted to minimize log(1 − D(G(z))), and parameters of D are adjusted to maximize log(D(x)). The most important and distinctive feature of our network is that the GAN is extended to a conditional GAN built on the hand-crafted label images of smoke. The objective function can be expressed as
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z|y)))] \tag{2}$$
The auxiliary information is represented by y (the labeled image) [25]. The training procedure is the optimization of the generator and the discriminator of the smoke segmentation cGAN, as shown in Fig. 5. The input is a frame of a smoke video, the ground truth is the corresponding labeled smoke image, and the output is the final segmented smoke image. Previous studies indicate that a cGAN benefits from mixing the traditional GAN objective with a traditional loss [27], and an L1 loss is used to promote the performance of the network [28]. Therefore, for the generator, the loss function is defined as:
Fig. 5. Training procedures of the generator and the discriminator. (a) Training of the Generator. (b) Training of the Discriminator.
$$L_G = \lambda_1 \mathbb{E}_{x \sim P_{data}(x)}[\log(D(x, G(x)) + e)] + \lambda_2 \mathbb{E}_{x, y \sim P_{data}(x, y)}[\lVert y - G(x) \rVert_1] \tag{3}$$
where x is the input (a smoke video frame), e is a small empirical number to prevent overflow, and y is the labeled ground truth of a smoke image. λ1 and λ2 balance the two losses; λ1 = 0.999 and λ2 = 0.001. The first term on the right side of equation (3) is the adversarial term, and the second term is the reconstruction error (the L1 distance), which produces less blur in the result. The calculation of LG is also illustrated in Fig. 5(a). For the discriminator, the loss function is defined as:
$$L_D = \mathbb{E}_{x, y \sim P_{data}(x, y)}[\log(D(x, y) + e)] + \mathbb{E}_{x \sim P_{data}(x)}[\log(1 - D(x, G(x)) + e)] \tag{4}$$
Definitions of the parameters are similar to those in equation (3). The calculation of LD is explained in Fig. 5(b). Minibatch stochastic gradient descent with the Adaptive Moment Estimation (Adam) solver is used as the optimizer. To optimize G and D, the standard approach from [21] is followed: one gradient descent step on D and then one step on G, performed alternately.

2.2.2. Architecture of the Smoke Generation Network

Segmentation techniques covered in our previous papers [29,30] include thresholding, color and motion based methods, a saliency-based method, SegNet, and U-Net. We compared all these methods and found that the saliency-based method and U-Net have the best segmentation performance, while the saliency-based method is time-consuming. Therefore, U-Net is chosen as the generator in this cGAN. As shown in Fig. 6, the architecture of the generative network can be seen as a large U-net type network mapping a smoke video frame to a segmented image; the weights of the network are optimized with the adversarial training between G and D shown in Fig. 5. The network combines down-sampling layers on the left side with up-sampling layers on the right side, connecting convolution with deconvolution. Each pixel is classified with the convolution network, and the final segmentation result is obtained by deconvolution and pixel localization [20].

The input image is a 512 × 512 three-channel image. The down-sampling stage is the architecture of a typical convolutional neural network. The convolution kernel window of each convolution layer is 3 × 3, which preserves the convolution results at the boundary. Because of the excellent performance of leaky ReLU, which helps keep the gradient flowing through the entire architecture without saturation, it is selected as the activation function of the convolution layers [31]. The He normal initializer is used [32]. After convolution, 2 × 2 pooling is applied; after pooling, the length and width of the feature map are halved and the feature-vector length is doubled. To prevent over-fitting, dropout with a rate of 0.5 is performed in the down-sampling stage [33].

In the up-sampling stage, features extracted in the down-sampling stage are expanded by 2 × 2, and the expansion results are fused with the corresponding lower-layer features from the down-sampling path; the high-level and low-level features are concatenated after convolution, as shown by the arrows on the right side of Fig. 6. The sigmoid function is chosen as the activation function of the final convolution layer, whose 1 × 1 kernel maps the feature vectors to the foreground and background classes. The final output image is the segmentation result containing the smoke region and the background region. The network structure combines low-level and high-level features, which makes full use of the features at all levels and describes the smoke area more accurately. In the model training stage, Adam is used as the optimizer [34,35], with a learning rate of 1e-4.
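To illustrate Sections 2.2.1-2.2.2, the sketch below shows how a small U-net style generator, a pair discriminator, and the alternating update of equations (3) and (4) could be wired together in Keras/TensorFlow (the framework named in Section 3). It is a reduced sketch under our own assumptions, not the authors' released code: the layer widths, the discriminator design, the epsilon value and the sign convention of the adversarial term are illustrative.

```python
# A minimal cGAN smoke-segmentation sketch, assuming TensorFlow 2 / Keras.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_generator(size=512):
    """U-net style generator: 512 x 512 x 3 frame -> 1-channel smoke mask."""
    inp = layers.Input((size, size, 3))
    skips, x = [], inp
    for f in (32, 64, 128):                          # down-sampling path
        x = layers.Conv2D(f, 3, padding="same",
                          kernel_initializer="he_normal")(x)
        x = layers.LeakyReLU(0.2)(x)
        skips.append(x)
        x = layers.MaxPool2D(2)(x)
        x = layers.Dropout(0.5)(x)                   # dropout while down-sampling
    for f, skip in zip((128, 64, 32), reversed(skips)):   # up-sampling path
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])          # skip connection (fusion)
        x = layers.Conv2D(f, 3, padding="same",
                          kernel_initializer="he_normal")(x)
        x = layers.LeakyReLU(0.2)(x)
    out = layers.Conv2D(1, 1, activation="sigmoid")(x)    # 1x1 conv, sigmoid
    return Model(inp, out)

def build_discriminator(size=512):
    """D scores a (frame, mask) pair as real (labeled) or fake (generated)."""
    frame = layers.Input((size, size, 3))
    mask = layers.Input((size, size, 1))
    x = layers.Concatenate()([frame, mask])
    for f in (32, 64, 128):
        x = layers.Conv2D(f, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.GlobalAveragePooling2D()(x)
    return Model([frame, mask], layers.Dense(1, activation="sigmoid")(x))

G, D = build_generator(), build_discriminator()
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
L1, L2, EPS = 0.999, 0.001, 1e-8                     # lambda_1, lambda_2, e

@tf.function
def train_step(x, y):
    """One alternating update: a gradient step on D, then a step on G."""
    with tf.GradientTape() as dt:                    # equation (4), maximized as -L_D
        d_loss = -tf.reduce_mean(tf.math.log(D([x, y]) + EPS)
                                 + tf.math.log(1.0 - D([x, G(x)]) + EPS))
    d_opt.apply_gradients(zip(dt.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    with tf.GradientTape() as gt:
        fake = G(x)
        # Equation (3): the adversarial term is negated so that descending
        # g_loss pushes D(x, G(x)) toward 1 (non-saturating heuristic).
        g_loss = (-L1 * tf.reduce_mean(tf.math.log(D([x, fake]) + EPS))
                  + L2 * tf.reduce_mean(tf.abs(y - fake)))
    g_opt.apply_gradients(zip(gt.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```

A full training run would stream the 512 × 512 (frame, label) pairs of Section 2.1 through train_step in minibatches; at inference time only G is kept.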
3. Results

The neural network in this paper is implemented with Keras [2] (using TensorFlow as the backend). The machine used for model training is an NVIDIA DGX-1 with 8 Tesla P100 GPUs (16 GB memory each), 2 Intel processors, a 7 TB solid-state disk, dual Gigabit Ethernet, and four 100 Gb/s InfiniBand interfaces.

Two different kinds of experiments are designed in this research, as shown in Fig. 7. In the experiment on the left side, there are 5477 pairwise images clipped from 137 videos, with 5194 training images and 283
Fig. 6. The architecture of cGAN based smoke segmentation network.
test images. There is no overlap between the two sets, but because of the strong correlation between consecutive frames, some test images are similar to images in the training set. One may wonder whether the training and test sets were generated under similar conditions, or whether the segmentation is strong enough (what about smoke segmentation in a totally new fire scenario?). Therefore, a second experiment schedule is designed, processed in a leave-one-video-out cross-validation way. There are nine different fire scenarios, as shown in Fig. 2 and Table 1, which gives a brief description of the videos. Eight of these videos are used for training and the reserved one is used for testing; this procedure is repeated nine times, each time reserving a different video for testing.

Firstly, the 5477 mixed samples collected from the 9 fire scenarios are used to test the segmentation model, and qualitative and quantitative analyses are done to evaluate the performance of the cGAN based method. Then, the cross-validation experiment is done to test the generalization ability of the model.

3.1. Experiments with mixed samples

There are 5194 training images and 283 test images randomly selected from the 5477 images. All images are different from each other; some consecutive frames are similar, but they are not identical images.

3.1.1. Qualitative Analysis of Segmentation Results

Because of the physical characteristics of smoke itself, the smoke boundary in a video is not distinct, and measuring the segmentation results with a quantization method is difficult. Therefore, a qualitative analysis of the segmentation results is also carried out in this research. Fig. 8 shows several examples of smoke region segmentation generated by the model proposed in this study. In each video, six frames along the timeline are extracted and segmented. For example, Fig. 8(a1) is the original color image clipped from a smoke video, Fig. 8(a2) is the gray image, and Fig. 8(a3) is the segmentation result. The raw output is a resized 512 × 512 image with stitched edges, as shown in Fig. 3; the image is then cut and resized back, so the segmentation result has a size similar to the original video frame.

Fig. 8(a) shows smoke of a forest fire. Smoke density near the root of the forest smoke is much higher, and the smoke diffuses as it rises into the sky. Because of the long distance from the camera to the forest smoke and the broad diffusion scope, the density of some parts of the smoke region is low, as the green arrows indicate. With this method, the smoke region above a certain density can be segmented; however, if the smoke density is too low to detect, segmentation leakage will occur. Since the boundary of smoke is difficult to define, smoke segmentation remains an open question, and we therefore discuss the experimental results here.

Fig. 8(b) and Fig. 8(d) show outdoor smoke. Fig. 8(b) is a widely used smoke detection test video downloaded from the
Fig. 7. Two different experimental designs. The left one uses all mixed sample images clipped from all videos as the training set, with the test set extracted from the whole dataset without overlap. The right one uses images from part of the videos as the training set and images from another video as the test set.

Table 1
Description of the original videos.

Video No.  Description of the video
1          forest smoke
2          cotton smoke in the standard lab
3          wood smoke in an experiment chamber
4          cotton smoke in an experiment chamber
5          smoke in the sky
6          smoke in a factory
7          outdoor cotton smoke
8          outdoor cotton smoke with a walking man
9          outdoor cotton smoke in the wind
website of Bilkent University. The original video has a low resolution (320 × 240), and the cotton smoke appears in an outdoor environment. The primary challenge of this segmentation is that a man walks through the video: how can the moving smoke be segmented while the moving man is excluded? In this cGAN based segmentation framework, only smoke samples have been fed to the network and all other parts are treated as background, so the segmentation is clean and the segmentation error rate is low.

Fig. 8(d) is an outdoor video shot in a place with strong wind. In the video, trees sway heavily and the central direction of the smoke changes quickly. As shown in Fig. 8(d), no matter how quickly the smoke shape changes, it can be segmented and the trees can be excluded completely. This performance exceeds many traditional smoke segmentation methods, such as color segmentation [26] and motion based segmentation [36,37].

Fig. 8(c) shows smoke in a factory. Smoke fills most of the frame, which is different from other common videos. In the optical flow field, because of changes in lighting conditions, almost every pixel is moving and segmentation is not easy. However, as the experiment shows, the smoke regions with high concentration can be segmented with the proposed method. The videos are distinct from each other; nevertheless, with the same segmentation model, the smoke can be segmented without any parameter adjustment. The result is favorable for later recognition and physical feature analysis.

3.1.2. Quantitative Analysis of Segmentation Results

The precision-recall (PR), F-measure, and mean absolute error (MAE) are used to evaluate the segmentation method quantitatively.

Precision and recall (PR): The precision is the number of correctly segmented smoke pixels divided by the number of all
Fig. 8. Examples of segmentation results of the algorithm. (a1)-(d1) Original images. (a2)-(d2) Gray images of the original frames. (a3)-(d3) Final segmentation results after post-processing.
returned segmented pixels. The recall is the number of correctly segmented smoke pixels divided by the number of pixels that should have been returned. Precision (P) and recall (R) are calculated with equations (5) and (6), where Seg is the segmentation result and GT is the ground truth made with the labeling method in Section 2.1.2.
P=
|Seg GT| |Seg|
(5)
R=
|Seg GT| |GT|
(6)
F-beta score: The F score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. F is calculated as in equation (7); β is set to 0.3 [38].
$$F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R} \tag{7}$$
MAE: The MAE is a measure of the difference between Seg and GT, given by:
$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} |Seg(x, y) - GT(x, y)| \tag{8}$$
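For reference, the following is a minimal sketch of how equations (5)-(8) could be computed for one binary result with NumPy; the helper name evaluate and its defaults are our own illustrative choices, not the authors' code.

```python
# Evaluation metrics of equations (5)-(8) for one binary segmentation result.
import numpy as np

def evaluate(seg: np.ndarray, gt: np.ndarray, beta: float = 0.3):
    """Precision, recall, F-beta and MAE.

    seg, gt: H x W arrays with 1 for smoke pixels and 0 for background;
    beta follows the value quoted in the text.
    """
    inter = np.logical_and(seg == 1, gt == 1).sum()            # |Seg ∩ GT|
    p = inter / max(seg.sum(), 1)                              # equation (5)
    r = inter / max(gt.sum(), 1)                               # equation (6)
    b2 = beta ** 2
    f = (1 + b2) * p * r / max(b2 * p + r, 1e-12)              # equation (7)
    mae = np.abs(seg.astype(float) - gt.astype(float)).mean()  # equation (8)
    return p, r, f, mae
```

Per-video numbers like those in Fig. 9 would then be the averages of these values over all frames extracted from one video.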
In Fig. 9, the evaluation metrics of images extracted from the 9 videos are reported. For each video, the four metrics are calculated for all extracted frames, and the data shown in Fig. 9 are the average values for one video. The MAE ranges from 0.01 to 0.07, the accuracy from 0.92 to 0.99, and the F-beta score from 0.90 to 0.98; these three metrics show outstanding performance. Compared to the saliency-based segmentation method, the three metrics of the proposed method are better, as shown in Table 2. The recall ranges from 0.59 to 0.94 with an average of 0.76 over all videos, which is a little lower than that of the saliency-based method and still needs to be considered in future research. In saliency map computation (for every single frame), sliding boxes at different scales are used to traverse the whole image, which is time-consuming [29]. With the proposed method, segmenting a 1080 × 1920 video frame needs only 0.15 s, at least ten times faster than the saliency based method.

3.1.3. Qualitative Comparison of Segmentation Results

To test the effect of the smoke segmentation algorithm based on a deep convolutional neural network, the proposed algorithm is compared with four classical smoke segmentation algorithms; the comparison results are shown in Fig. 10. Fig. 10(a) is the original image. Fig. 10(b) is based on the MHI (Motion History Image) method proposed by Han et al. [37]. Using MHI [40], the moving parts of a smoke video sequence can be engraved into a single image, from which one can predict the motion flow as
Fig. 9. Quantitative evaluation of the proposed segmentation method.
Table 2
Comparison of the average precision, recall, F-beta score and elapsed time.

Method                        Average precision  Average recall  Average F-beta score  Average elapsed time per image
Proposed method               0.96               0.76            0.94                  0.15 s
Saliency based method [29]    0.93               0.83            0.91                  15 s
well as the moving parts of the video action. However, when the moving velocity varies, the adaptation of the method with a single fixed threshold is not satisfactory. As shown in Fig. 10(b), the forest fire smoke moves much more slowly than smoke in the other scenarios; in the last row, the smoke has not been segmented with the fixed threshold. Fig. 10(c) shows the segmentation results of Yu's method based on motion and color analysis [39]. The motion characteristics are derived from optical flow, which is very sensitive to light, so interferences appear in many areas with high luminance; the result on each frame is just a mass of discrete small regions, which is difficult to use in smoke feature analysis. Fig. 10(d) is the result of GMM based segmentation [41]. Because smoke can be treated as a turbulent fluid, motion at different locations varies differently; the segmentation results are all small blobs, which implies that GMM alone is not sufficient for smoke segmentation. Fig. 10(e) is the segmentation result of the saliency detection based method previously proposed by our research group [29]. Because it traverses an image with windows of different sizes, its time complexity is high and it takes about 15 seconds to process a 528 × 384 frame. Fig. 10(f) is the result of the method proposed in this paper.

The results of Fig. 10(e) and Fig. 10(f) are both acceptable. In the first row, the accuracy of Fig. 10(f) is higher than that of Fig. 10(e); from the second row to the last row, the results also look better than those of the saliency-based method. In forest smoke segmentation (the last row), however, the smoke region in Fig. 10(f) is smaller than that in Fig. 10(e); the same occurred in Fig. 8(a). If the concentration of smoke is not high enough, it is difficult to segment; this phenomenon reflects the sensitivity of the method. Overall, the performance of the proposed cGAN based smoke segmentation method is satisfactory, and the calculation speed has been greatly improved. This shows that cGAN based segmentation is a promising method for smoke segmentation.

3.2. Experiments with cross-validation

Besides the experiment with all extracted images, cross-validation experiments have also been done to test the generalization performance of the model. According to Fig. 11, segmentation with the mixed training and test sets shows outstanding performance ((a2)-(c2)). Compared to Fig. 11(a2), Fig. 11(a3), segmented with a model trained on entirely different images from other videos, also shows a satisfying result; even the thin smoke in the top left corner is segmented. According to Fig. 11(b), with the cross-validation training set,
Fig. 10. Smoke regions obtained with different segmentation methods. (a) is the original image; (b)-(e) are the results of the methods of Han [37], Yu [39], GMM [36] and Jia [29], respectively; (f) is the result of the method proposed in this article.
Fig. 11. Comparison of segmentation results with different training sets. (a1)-(c1) Original gray images extracted from smoke videos. (a2)-(c2) Segmentation results with the mixed training and test sets. (a3)-(c3) Segmentation results of the cross-validation experiment.
the segmentation result is similar to the result with the mixed training set: the model can find most of the smoke region. Subjectively speaking, the segmentation results in Fig. 11(b2) and (b3) are both acceptable. In Fig. 11(c), the segmentation results in the first four images are all in good condition, while in the last two images of Fig. 11(c3) over-segmentation occurs: the sky in the top left corner and the cement pavement in the bottom right corner are also segmented. This is because in this model smoke regions are labeled based on gray color, and the gray sky and gray pavement did not occur as background regions in the training set, so the model cannot exclude them cleanly. This problem can be solved easily by adding movement analysis, such as optical flow or a Gaussian mixture model, to exclude the sky and pavement from the moving smoke regions [29].

The cGAN model can thus be used for smoke segmentation. With a training set clipped from 8 videos, the segmentation results on the remaining video are already good, even though the generalization performance still needs improvement. It can be expected that with a larger training set or some motion analysis operations, the segmentation results can be further optimized.

4. Discussions

These studies offer a preliminary smoke segmentation method that works in a VSD system to speed up smoke detection and help reduce false alarms. Object segmentation is a traditional topic, and many methods have been proposed in past decades. Color and
motion (GMM and optical flow) are the most popular features used to segment smoke from video frames [29,36,37,39], but combining this information to get a good result is not easy. Inspired by the saliency detection method, intensity and optical flow features were combined to segment smoke [29]; the results are satisfactory, but the time consumption is not acceptable for real-time video processing applications. This work is an attempt to use a deep neural network for smoke segmentation. The experimental results suggest that using cGAN to segment smoke is practical, and this segmentation method is much faster than the prior saliency-based method. However, in this framework another significant feature of smoke, motion, is not used. Using only color to segment smoke introduces a risk of wrong segmentation when some regions have a gray color similar to smoke (Fig. 11(c3)). Therefore, motion information should be added to the segmentation framework in the future.

5. Conclusions

In this paper, a deep neural network based on the cGAN architecture is applied to early fire smoke image segmentation. The main innovations of this paper are summarized as follows. Firstly, by cutting and stitching, input images extracted from videos with different aspect ratios are resized to squares of the same size, so the model can adapt to input videos with different aspect ratios. Secondly, as far as we know, this is the first time a cGAN based deep neural network is used in early fire smoke image segmentation, and the experimental results show that the method is more effective and robust than traditional methods. Moreover, it runs at least 10 times faster than the saliency-based method. The mixed-sample experiment and the cross-validation experiment show that this segmentation method is data-dependent: a larger and more varied training set would improve the segmentation results significantly.

In the future, more negative samples (such as data from the VOC2012 and Microsoft COCO datasets) will be added to the training set to improve segmentation performance, and a detection algorithm based on the analysis of the suspected segmented smoke regions will be proposed. By combining high-level features extracted with the convolutional network and traditionally crafted feature vectors, a prospective smoke detection method will be developed. A further goal, motivated by the strong power of GANs in image generation, is to reconstruct occluded smoke and fire with a GAN [42]. Such "guessing" seemed impossible with earlier machine learning algorithms, but it is now possible to use a generative neural network that has learned from real fire scenarios to speculate whether a fire is occurring.

Acknowledgments

This research is supported by ZTE's Industry-University-Research Cooperation Forum (HX2018-07), the Foundation of the Shaanxi Educational Committee (18JK0722), the Open Project Program of the State Key Laboratory of Fire Science (HZ2019-KF12) and the Key Research and Development Program of Shaanxi Province (2019GY-021). The SKLFS has also supported this work, and some of the experimental data were provided by the lab. The authors gratefully acknowledge all of these supports.

References
[1] Smart City: Key Technologies and Practices, ZTE Communications 13 (4) (2015) 1–2.
[2] Keras: The Python Deep Learning library, https://keras.io/, 2018.
[3] A. Karaadi, I.-H. Mkwawa, L. Sun, How to Manage Multimedia Traffic: Based on QoE or QoT?, ZTE Communications 16 (3) (2018) 7.
[4] P. Barmpoutis, K. Dimitropoulos, N. Grammalidis, Smoke detection using spatio-temporal analysis, motion modeling and dynamic texture recognition, in: Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), 2014.
[5] S. Calderara, P. Piccinini, R. Cucchiara, Vision based smoke detection system using image energy and color information, Machine Vision and Applications 22 (4) (2011) 705–719.
[6] A.E. Çetin, K. Dimitropoulos, B. Gouverneur, et al., Video fire detection – Review, Digital Signal Processing 23 (6) (2013) 1827–1843.
[7] F. Yuan, J. Shi, X. Xia, et al., High-order local ternary patterns with locality preserving projection for smoke detection and image classification, Information Sciences 372 (2016) 225–240.
[8] C.E. Prema, S.S. Vinsley, S. Suresh, Multi Feature Analysis of Smoke in YUV Color Space for Early Forest Fire Detection, Fire Technology 52 (5) (2016) 1319–1342.
[9] J. Park, B. Ko, J.-Y. Nam, et al., Wildfire smoke detection using spatiotemporal bag-of-features of smoke, in: Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV), 2013.
[10] S. Saponara, L. Pilato, L. Fanucci, Early video smoke detection system to improve fire protection in rolling stocks, in: Proceedings of SPIE Photonics Europe, International Society for Optics and Photonics, 2014.
[11] G. Miranda, A. Lisboa, D. Vieira, et al., Color feature selection for smoke detection in videos, in: Proceedings of the 12th IEEE International Conference on Industrial Informatics (INDIN), 2014.
[12] H. Kim, D. Ryu, J. Park, Smoke Detection Using GMM and Adaboost, International Journal of Computer and Communication Engineering 3 (2) (2014) 123–126.
[13] D. Zhang, R.W. Zhao, L. Shen, et al., Action Recognition in Surveillance Videos with Combined Deep Network Models, ZTE Communications 14 (S1) (2016) 7.
[14] S. Frizzi, R. Kaabi, M. Bouchouicha, et al., Convolutional neural network for video fire and smoke detection, in: Proceedings of IECON 2016 – Conference of the IEEE Industrial Electronics Society, 2016.
[15] K. Muhammad, J. Ahmad, Z. Lv, et al., Efficient Deep CNN-Based Fire Detection and Localization in Video Surveillance Applications, IEEE Transactions on Systems, Man, and Cybernetics: Systems PP (99) (2018) 1–16.
[16] T. Fu, Forest fire image recognition algorithm and realization based on deep learning, Dissertation, Beijing Forestry University, 2016.
[17] G. Xu, Y. Zhang, Q. Zhang, et al., Deep domain adaptation based video smoke detection using synthetic smoke images, Fire Safety Journal 93 (2017) 53–59.
[18] L.C. Chen, G. Papandreou, I. Kokkinos, et al., Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, Computer Science 4 (2014) 357–361.
[19] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis & Machine Intelligence 39 (4) (2014) 640–651.
[20] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, in: Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015.
[21] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., Generative adversarial nets, in: Proceedings of the International Conference on Neural Information Processing Systems, 2014.
[22] K. Yun, J. Bustos, T. Lu, Predicting Rapid Fire Growth (Flashover) Using Conditional Generative Adversarial Networks, 2018.
[23] B.U. Toreyin, Y. Dedeoglu, A.E. Cetin, Wavelet based real-time smoke detection in video, in: Proceedings of the 2005 European Signal Processing Conference, 2005.
[24] A. Filonenko, D.C. Hernández, K.H. Jo, Fast Smoke Detection for Video Surveillance using CUDA, IEEE Transactions on Industrial Informatics 14 (2) (2017) 725–733.
[25] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, Computer Science (2014) 2672–2680.
[26] T.H. Chen, C.L. Kao, S.M. Chang, An intelligent real-time fire-detection method based on video processing, in: Proceedings of the IEEE 2003 International Carnahan Conference on Security Technology, 2003.
[27] D. Pathak, P. Krahenbuhl, J. Donahue, et al., Context Encoders: Feature Learning by Inpainting, (2016) 2536–2544.
[28] P. Isola, J.Y. Zhu, T. Zhou, et al., Image-to-Image Translation with Conditional Adversarial Networks, (2016) 5967–5976.
[29] Y. Jia, J. Yuan, J. Wang, et al., A Saliency-Based Method for Early Smoke Detection in Video Sequences, Fire Technology 52 (5) (2016) 1271–1292.
[30] Y. Jia, R. Yu, L. Fan, Early smoke segmentation method based on U-net convolutional network, Fire Safety Science 28 (2) (2019) 6.
[31] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012.
[32] K. He, X. Zhang, S. Ren, et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, (2015) 1026–1034.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, et al., Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
[34] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747, 2016.
[35] D. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, Computer Science, 2014.
[36] Z. Xiong, R. Caballero, H. Wang, et al., Video-based smoke detection: possibilities, techniques, and challenges, Journal of Hubei Radio & Television University, 2007.
[37] D. Han, B. Lee, Flame and smoke detection method for early real-time detection of a tunnel fire, Fire Safety Journal 44 (7) (2009) 951–961.
[38] R. Achanta, S. Hemami, F. Estrada, et al., Frequency-tuned salient region detection, in: Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[39] C. Yu, J. Fang, J. Wang, et al., Video Fire Smoke Detection Using Motion and Color Features, Fire Technology 46 (3) (2010) 651–663.
[40] M.A.R. Ahad, Motion History Images for Action Recognition and Understanding, Springer Publishing Company, Incorporated, 2012.
[41] Z. Xiong, R. Caballero, H. Wang, et al., Video-based smoke detection: possibilities, techniques, and challenges, in: Proceedings of the IFPA Fire Suppression and Detection Research and Applications Technical Working Conference (SUPDET), Orlando, FL, 2007.
[42] K. Yun, T. Lu, E. Chow, Occluded object reconstruction for first responders with augmented reality glasses using conditional generative adversarial networks, in: Proceedings of Pattern Recognition and Tracking XXIX, International Society for Optics and Photonics, 2018.