Research on video classification method of key pollution sources based on deep learning


J. Vis. Commun. Image R. 59 (2019) 283–291


Kunrong Zhao (a), Tingting He (b), Shuang Wu (c), Songling Wang (a), Bilan Dai (a), Qifan Yang (b), Yutao Lei (a,*)

(a) South China Institute of Environmental Sciences, MEP, Guangdong, China
(b) Guangzhou Hexin Environmental Protection Technology Co., Ltd, Guangdong, China
(c) Guangzhou Huake Environmental Protection Engineering Co., Ltd, Guangdong, China

Article info

Article history: Received 29 November 2018; Revised 9 January 2019; Accepted 10 January 2019; Available online 11 January 2019.
Keywords: Pollution sources; Deep learning; Surveillance video classification; Convolution neural network

Abstract

China's environmental problems bear not only on the fundamental interests of the broad masses of the people, but also on China's national security and international image. At present, China's environmental protection work faces a complex situation. Pollution sources can be divided into natural and man-made pollution sources. Natural pollution sources are places where nature releases harmful substances or causes harmful effects on the environment, such as active volcanoes. Man-made pollution sources are formed by human activities and are the main object of environmental protection research and control. Among man-made pollution sources, air, water and soil pollution sources can be distinguished according to the main objects of pollution; of these, air and water pollution sources have the greatest impact on human life. It has therefore become an important subject worthy of in-depth study to take automatic, electronic measures against potential environmental pollution incidents, discover environmental pollution problems in time, reduce the probability of pollution incidents, and even stop some major pollution incidents in their infancy. In this paper, a deep learning method is used to classify existing surveillance video of key pollution sources. Water pollution experiments show that the accuracy of video counting reaches 93.1%, which is better than other video processing schemes. The running time of the system reaches an acceptable range, and a solution to meet the real-time requirement is put forward.

© 2019 Published by Elsevier Inc.

1. Introduction

Since the reform and opening up, China's economy has grown rapidly and great achievements have been made in various construction projects. However, a huge price has also been paid in resources and environment. The contradiction between economic development and resources and environment has become increasingly acute, and the public has reacted strongly to environmental pollution. This situation is directly related to the irrational economic structure and the extensive mode of growth.

Abbreviations: RGB, Red Green Blue; HSV, Hue Saturation Value; CNN, Convolutional Neural Network; SIANN, Shift Invariant Artificial Neural Network; BP, Back Propagation; VGG, Visual Geometry Group.
* Corresponding author. E-mail addresses: [email protected] (K. Zhao), [email protected] (T. He), [email protected] (S. Wu), [email protected] (S. Wang), [email protected] (B. Dai), [email protected] (Q. Yang), [email protected] (Y. Lei).
https://doi.org/10.1016/j.jvcir.2019.01.015
1047-3203/© 2019 Published by Elsevier Inc.

Without speeding up economic restructuring and changing the mode of growth, resources cannot be sustained, the environment cannot absorb the burden, society cannot bear it, and economic development cannot be sustained. Pollution sources can be classified in many ways. By attribute, they can be divided into natural and man-made pollution sources [1–3]. Natural pollution sources are places where the natural world releases harmful substances into the environment or causes harmful effects, such as volcanic eruptions. Man-made pollution sources are formed by human social activities, such as automobile exhaust, industrial pollutant discharge, and domestic waste and wastewater discharge; the latter are the main object of environmental protection research and control. According to the types of pollutants discharged, they can be divided into organic, inorganic, thermal, noise and radioactive pollution sources, as well as mixed pollution sources discharging multiple pollutants at the same time. According to the main targets of pollution, they


can be divided into air pollution sources, water pollution sources and soil pollution sources. According to the function of human society, they can be divided into industrial, agricultural, transportation and domestic pollution sources [4,5]. Controlling pollution sources is fundamental to preventing and controlling environmental pollution and improving environmental quality. In recent years, automatic pollution source monitoring systems have been used to monitor the environment. According to the content monitored, automatic pollution source monitoring systems can be divided into water quality online monitoring systems, flue gas emission continuous monitoring systems, surface water quality online monitoring systems and automatic noise online monitoring systems; they have different monitoring objects and different emphases [6]. Water quality online monitoring is a high-tech system that integrates geographic information system technology, modern space technology, computer technology and environmental monitoring data, and stores and processes comprehensive environmental protection information. It combines the geographic locations of polluting enterprises with enterprise attribute data, the geographic locations of monitoring points and the environmental monitoring data, and delivers them to users accurately, authentically, in real time and with pictures and text as needed, so as to support the management of enterprises and environmental protection facilities by environmental protection departments. With its powerful spatial analysis functions and visual presentation, it provides auxiliary decision making for various departments [7,8]. It mainly monitors chemical oxygen demand, ammonia nitrogen, total phosphorus, pH, flow rate and so on.
It has a wide range of applications, including municipal sewage treatment plants and the printing and dyeing, chemical fiber, paper, pesticide and fertilizer manufacturing, coking wastewater and electroplating industries. Continuous monitoring of flue gas emissions mainly tracks the amounts and emission rates of pollutants in flue gas, such as sulfur dioxide, nitrogen oxides and particulate matter; it is mainly applied to flue gas monitoring [9–13] of various boilers, industrial furnaces and refuse incineration. Online monitoring of surface water quality mainly covers drinking water sources, surface water quality and the water quality of key sections in major river basins. The monitored quantities include water temperature, conductivity, dissolved oxygen, turbidity, velocity and redox potential; it is applicable to rivers, lakes, water sources, reservoirs, groundwater and offshore waters. Automatic noise online monitoring mainly targets environmental noise, the main indicator being the noise level in decibels [14,15]; it is suitable for automatic monitoring of urban environmental noise, airport noise, traffic noise and so on. With the improvement of computer performance, pattern recognition, computer vision and other technologies have developed rapidly; in particular, the advent of high-performance graphics cards has promoted the rapid development of deep learning [16]. Deep learning outperforms traditional methods in tasks such as target detection and classification, and developing fast and efficient pollution source detection and classification technology based on deep learning has become a research trend. In recent years, the advantages of deep learning have become increasingly prominent in the field of machine learning, mainly because it has made great breakthroughs in many areas such as speech, image and text.
More importantly, it has set off an artificial intelligence revolution in the era of big data and the Internet. The emergence and popularity of deep learning has undergone many twists and turns; it is the transformation and evolution of traditional neural networks in the context of large data. In 1958, Frank Rosenblatt [17] proposed the

perceptron model, which is based on bio-neuroscience, and the study of artificial neural networks began. In 1969, M. Minsky [18–20] and S. Papert studied perceptron models for linear classification, but this work did not attract much attention because of the training methods and hardware computing ability of the time. It was not until the 1980s that Rumelhart, Hinton and Williams [21] put forward a complete and systematic neural network framework [22,23] based on Back Propagation (BP). This achievement aroused great enthusiasm in neural network research. Scholars focused on machine learning based on statistical models, and systems based on artificial rules were greatly improved. However, in practical use a BP neural network can only be equipped with one hidden layer: a shallow hierarchical structure easily makes the neural network fall into a local minimum or over-fit, especially when the number of network layers is increased [24]. In the 1990s, support vector machines (SVM) [25], Boosting, maximum entropy methods (such as logistic regression) and other shallow machine learning models were proposed one after another. These models are essentially neural networks with only one hidden layer (such as support vector machines and Boosting) [26,27], or neural networks without a hidden layer (such as logistic regression), but they achieved great success in application. It is precisely because of the rise of these machine learning models that artificial neural networks were neglected, mainly because the basic mathematical theory and training methods of neural networks had not seen a breakthrough. Since 2011, the field of speech recognition has made its biggest breakthrough in more than 10 years, with the recognition error rate decreasing by 21–31%. Deep learning made its first breakthrough in speech recognition, prompting people to apply the method in other fields [28,29].
Then breakthroughs were made in the field of image recognition. Deep convolution neural networks have performed amazingly on large-scale image recognition: in the large-scale ImageNet competition, the error rate was reduced from 26% to 15%, and a decrease of more than 10 percentage points made people full of hope for deep learning. Subsequently, deep learning surpassed traditional methods in target detection and made breakthroughs in video classification. On this basis, in view of the outstanding performance of deep learning, this paper uses deep learning to complete the task of pollution source video classification and processing. A classification scheme for pollution source surveillance video was designed, and features with low dimension and strong representation ability were extracted by a CNN network. The detection accuracy reached 92.24%, better than the comparative detection method. In the counting test, the counting accuracy on the 1538-frame test video reached 93.1%, better than the contrast scheme. The running time of the system reaches an acceptable range and satisfies the real-time requirement.

2. Proposed method

2.1. Color feature extraction

A suitable color and texture description model is established, and the fusion of color and texture features is realized from a non-linear point of view. The description model mainly includes three steps: color quantization, the color co-occurrence matrix, and color texture computation. Color is one of the most widely used visual features in image and video. In an image, color can reflect most of the information of objects or scenes. Moreover, compared with other visual features, color features depend less on the size, direction and perspective of the image itself, so they are more robust. In view of existing display equipment and other factors,


image and video mostly use the RGB color space, based on the principle of additive mixing, to describe their data. Although this color space has clear physical meaning and is suitable for imaging equipment, RGB is perceptually non-uniform: the distance between colors in RGB space can differ greatly from the difference perceived by the human eye. The HSV color space is just the opposite. Therefore, this paper chooses the HSV color space, which reflects the way people observe color, and converts from RGB to HSV according to the following formulas.

H = \begin{cases}
\arccos \dfrac{(r-g)+(r-b)}{2\sqrt{(r-g)^2+(r-b)(g-b)}}, & b \le g \\[1ex]
2\pi - \arccos \dfrac{(r-g)+(r-b)}{2\sqrt{(r-g)^2+(r-b)(g-b)}}, & b > g
\end{cases}   (1)

S = \frac{\max(r,g,b) - \min(r,g,b)}{\max(r,g,b)}   (2)

V = \frac{\max(r,g,b)}{255}   (3)

According to formulas (1)–(3), the ranges of the hue component H, saturation S and intensity V are [0, 2π], [0, 1] and [0, 1], respectively. It can be seen that the HSV color space can represent extremely rich colors. However, the human eye's color discrimination is limited, usually to a few dozen colors. On the other hand, in order to reduce the computational load of post-processing, this paper uses formulas (4)–(6) to quantize the H component into 8 levels and the S and V components into 2 levels each.

H = \begin{cases}
0, & h \in [316, 360] \cup [0, 20] \\
1, & h \in [21, 40] \\
2, & h \in [41, 75] \\
3, & h \in [76, 155] \\
4, & h \in [156, 190] \\
5, & h \in [191, 270] \\
6, & h \in [271, 295] \\
7, & h \in [296, 315]
\end{cases}   (4)

S = \begin{cases} 0, & s \in [0, 0.5) \\ 1, & s \in [0.5, 1] \end{cases}   (5)

V = \begin{cases} 0, & v \in [0, 0.5) \\ 1, & v \in [0.5, 1] \end{cases}   (6)

where h is expressed in degrees.

In order to facilitate the subsequent definition of color texture features, the following formula is used to index the quantized HSV color space:

Color(x, y) = H(x, y) + 8S(x, y) + 16V(x, y)   (7)

where (x, y) represents the spatial coordinates; if Color(x, y) = k, then the pixel at (x, y) in the image has the k-th color, and the possible values of k are 0, 1, 2, ..., 31. For a given color digital video frame, 8 feature values are extracted to form an 8-dimensional feature vector, called the color texture feature vector of the color digital image or video frame, as follows:
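The conversion (1)–(3), quantization (4)–(6) and indexing (7) steps can be sketched as follows. The paper's implementation is in MATLAB; this is a Python sketch under the assumption, implied but not stated in the text, that the hue intervals in formula (4) are in degrees and that hues outside the listed intervals fall in quantization level 0:

```python
import math

def rgb_to_hsv(r, g, b):
    # formulas (1)-(3): r, g, b are 8-bit channel values
    num = (r - g) + (r - b)
    den = 2.0 * math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    ratio = max(-1.0, min(1.0, num / den)) if den != 0 else 1.0
    theta = math.acos(ratio)
    h = theta if b <= g else 2 * math.pi - theta   # H in [0, 2*pi]
    mx, mn = max(r, g, b), min(r, g, b)
    s = (mx - mn) / mx if mx != 0 else 0.0         # S in [0, 1]
    v = mx / 255.0                                 # V in [0, 1]
    return h, s, v

def quantize(h, s, v):
    # formulas (4)-(6): hue (degrees) -> 8 levels, s and v -> 2 levels each
    deg = math.degrees(h)
    bins = [(21, 40, 1), (41, 75, 2), (76, 155, 3), (156, 190, 4),
            (191, 270, 5), (271, 295, 6), (296, 315, 7)]
    H = 0  # wrap-around bin: hues in [316, 360] U [0, 20]
    for lo, hi, level in bins:
        if lo <= deg <= hi:
            H = level
            break
    S = 1 if s >= 0.5 else 0
    V = 1 if v >= 0.5 else 0
    return H, S, V

def color_index(r, g, b):
    # formula (7): Color(x, y) = H + 8*S + 16*V, giving 32 colors (0..31)
    H, S, V = quantize(*rgb_to_hsv(r, g, b))
    return H + 8 * S + 16 * V

idx = color_index(200, 30, 30)  # a saturated, bright red
```

In practice the per-pixel conversion would be vectorized (e.g. a library RGB-to-HSV routine over the whole frame); the scalar form above is only meant to mirror the formulas one-to-one.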

CTCV_F = \left( \mu_{ASM}, \sigma_{ASM}, \mu_{COR}, \sigma_{COR}, \mu_{MOR}, \sigma_{MOR} \right)   (8)

For a given video clip l, in order to reduce the computational complexity, this paper selects some video frames of the clip as key frames according to a certain step size, denoted KF(1), KF(2), ..., KF(N), and computes the color texture feature vector of each key frame, denoted CTCV_{KF(1)}, CTCV_{KF(2)}, ..., CTCV_{KF(N)}. For any video clip l, the average of the color texture feature vectors of its key frames is defined as the color texture feature vector of the video clip, recorded as CTCV_V. The calculation formula is:

CTCV_V = \frac{1}{N} \sum_{i=1}^{N} CTCV_{KF(i)}   (9)
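Formula (9) amounts to sampling key frames at a fixed step and averaging their feature vectors. A minimal NumPy sketch (the step size and the per-frame feature extractor are placeholders, not values from the paper):

```python
import numpy as np

def video_ctcv(frames, extract_ctcv, step=10):
    # sample key frames KF(1), ..., KF(N) at a fixed step, then average
    # their color texture feature vectors, as in formula (9)
    key_frames = frames[::step]
    feats = np.stack([extract_ctcv(f) for f in key_frames])
    return feats.mean(axis=0)

# toy stand-in: 100 "frames", each mapped to a constant 8-dim vector
frames = [np.full((4, 4), i, dtype=float) for i in range(100)]
vec = video_ctcv(frames, lambda f: np.full(8, f.mean()), step=10)
```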

2.2. Morphological processing

Morphological processing is a distinctive method and theory of digital image analysis that has been widely used in digital image processing and machine vision. It is expressed as a form of neighborhood operation: the defined neighborhood is called a "structuring element" (structural unit). Each pixel of the image performs a specific logical operation with the structuring element, and the result is the corresponding output pixel. Erosion (corrosion) processing is a common morphological operation. It was first used to process binary images and was later extended to gray images. The erosion of a binary image can be described by formula (10):

A \ominus B = \{ z \in E \mid B_z \subseteq A \}   (10)

Among them, E represents the Euclidean space, A represents the binary image belonging to the Euclidean space, B represents the structuring element used for the erosion of image A, and B_z represents the structuring element B translated by the vector z, i.e. B_z = \{ b + z \mid b \in B \}, \forall z \in E. When the structuring element B is centrosymmetric (a disk or a square in shape) and belongs to the Euclidean space E, the erosion of the binary image A by B can be understood as sliding the structuring element over the image. Dilation (expansion) processing is also a common morphological operation. The dilation of a binary image can be expressed by formula (11):



A \oplus B = \{ z \in E \mid (B^s)_z \cap A \neq \varnothing \}   (11)

Among them, E represents the Euclidean space, A represents the binary image belonging to the Euclidean space, B represents the structuring element that dilates the image A, and B^s represents the symmetric (reflected) version of the structuring element B, that is B^s = \{ x \in E \mid -x \in B \}. Similarly, when the structuring element B is centrosymmetric (a disk or a square in shape) and belongs to the Euclidean space E, the dilation of the binary image A by B can be understood as sliding the structuring element over the image. In order to ensure the smoothness of the foreground region's outline and facilitate the processing of subsequent modules, the binary image obtained after foreground extraction is processed morphologically in this paper. The specific process is as follows. The first step is hole filling: holes are small sets of background pixels enclosed by a foreground region, usually only a few pixels in area. The second step is erosion: a flat disk structuring element with a radius of 3 is created and used to scan the binary image, performing an "and" operation to achieve the erosion effect. The third step is dilation: a cross-shaped structuring element [0 1 0; 1 1 1; 0 1 0] is created and used to scan the binary image six times, performing an "or" operation to achieve the dilation effect. Multiple dilations avoid morphological processing filtering


out too many foreground parts, affecting the accuracy of subsequent detection.
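The erosion and dilation steps above can be sketched with plain NumPy. This is a hedged illustration, not the paper's MATLAB code: hole filling is omitted (a real pipeline would use a library routine), and the radius-3 disk is replaced by the cross-shaped element from the third step for brevity:

```python
import numpy as np

def erode(img, se):
    # erosion, formula (10): a pixel stays foreground only if the
    # structuring element, centered there, fits entirely inside A
    kh, kw = se.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=0)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + kh, j:j + kw]
            out[i, j] = np.all(window[se == 1] == 1)  # the "and" operation
    return out

def dilate(img, se):
    # dilation, formula (11): a pixel becomes foreground if the (symmetric)
    # structuring element, centered there, overlaps A at all
    kh, kw = se.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=0)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + kh, j:j + kw]
            out[i, j] = np.any(window[se == 1] == 1)  # the "or" operation
    return out

# the cross-shaped structuring element [0 1 0; 1 1 1; 0 1 0] from step 3
cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=np.uint8)
mask = np.zeros((7, 7), dtype=np.uint8)
mask[2:5, 2:5] = 1            # a 3x3 foreground blob
eroded = erode(mask, cross)   # only the blob's center survives
dilated = dilate(eroded, cross)
```

The doubly nested loops keep the correspondence with formulas (10)–(11) explicit; a production implementation would call an optimized library routine instead.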


2.3. Convolution neural network

The Convolutional Neural Network (CNN) is a typical forward-structured neural network [39–42]. It has four key characteristics: local connections, weight sharing, pooling and multi-layer use [30–34]. The features extracted by a convolution neural network are translation invariant, so the convolution neural network is also called the Shift Invariant Artificial Neural Network (SIANN). Convolutional neural networks are mainly used to recognize image data with some degree of invariance to translation, scaling and distortion. Since most convolutional neural network structures are end-to-end systems, task-related features can be learned implicitly and directly from data, without much human intervention or a large amount of expert domain knowledge. Moreover, because of its special weight sharing, a deeper network can be designed, which enhances the expressive ability of the network and allows it to be applied to more complex visual tasks: weight sharing reduces the number of adjustable parameters, thereby reducing the risk of over-fitting and accelerating training. The feed forward neural network consists of three different kinds of layers, namely the input layer, hidden layer and output layer, each made up of multiple neurons, as shown in Fig. 1. A neuron is a computational unit with inputs x_1, x_2, x_3 and an intercept term of +1. Its output is h_{W,b}(x) = f(W^T x) = f(\sum_{i=1}^{3} W_i x_i + b): a dot product between the input and the weights, to which a bias is added, followed by a non-linear transformation. Here f(\cdot) denotes a non-linear function, also known as an activation function. In feed forward neural networks, the most commonly used non-linear function is the S-shaped sigmoid function, shown in formula (12). When we combine multiple neurons and arrange them in layers, we obtain a classical feed forward neural network. Fig. 2 shows a three-layer feed forward neural network, which consists of an input layer, a hidden layer and an output layer, i.e. a 3-3-1 neural network. Formula (13) represents the computation from the input layer to the output layer, usually referred to as forward propagation.

f(x) = \frac{1}{1 + \exp(-x)}   (12)

a_1^{(2)} = f\left(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\right)
a_2^{(2)} = f\left(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\right)
a_3^{(2)} = f\left(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\right)
h_{W,b}(x) = a_1^{(3)} = f(net_k) = f\left(\sum_{i=1}^{3} W_{1i}^{(2)} a_i^{(2)} + b_1^{(2)}\right)   (13)

Fig. 1. Schematic diagram of a single neuron.

Fig. 2. Three-level neural network model.

Fig. 2 shows only the three-level feed forward neural network model. We can construct deeper feed forward neural networks by adding hidden layers. It should be noted that the feed forward neural network contains no closed loops: the output of one layer is the input of the next, and the signal propagates layer by layer from the input layer to the output layer, which is one of the reasons it is called feed forward. As the signals propagate layer by layer, they undergo non-linear transformations, which increases the expressive ability of the network.

Back propagation, also known as the BP algorithm, is the most common and effective algorithm for training feed forward neural networks. The main idea is: starting from an untrained network, the output of the output layer, h_{W,b} in formula (13), is computed by forward propagation. The network output is compared with the target value, and the difference is the error between them; the error is then propagated back from the output layer to the input layer. This error is a scalar function of the network weights, so in the training process the weights are adjusted according to the error, iterating until convergence, i.e. until the error vanishes. Before presenting the calculation process of the BP algorithm in detail, we first define how the error is computed, namely the loss function. For convenience, we use the squared loss:

J(W) = \frac{1}{2} \sum_{k=1}^{N} \left(t_k - a_k^{(3)}\right)^2 = \frac{1}{2} \left\| t - a^{(3)} \right\|^2   (14)

W represents the weights of the network, including the input-to-hidden weights W^{(1)} and the hidden-to-output weights W^{(2)}; t and a^{(3)} represent the target value and the network output, respectively. The learning rule of the BP algorithm is based on gradient descent. First, the weights of the network are initialized randomly; generally a Gaussian distribution with zero mean is used. Then the weights are adjusted to move in the direction of error reduction, as shown in formula (15):

\Delta W = -\eta \frac{\partial J}{\partial W}   (15)

Here \eta is the learning rate, which controls the scale of the weight change. The update rule is W(m+1) = W(m) + \Delta W, where m indicates the number of iterations. Take the three-level neural network model shown in Fig. 2 as an example. First, consider the hidden-to-output weights W^{(2)}. From formula (15) we can see that the error is not an explicit function of these weights; that is, the weights W^{(2)} do not appear explicitly in the error function, so we use the chain rule to calculate the partial derivative of the error J with respect to W^{(2)}. Similarly, we can derive the update rule for the input-to-hidden weights W^{(1)}. Then:

\Delta W_{kj} = \eta \left(t_k - a_k^{(3)}\right) f'(net_k) \, a_j^{(2)}   (16)
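Formulas (12)–(16) can be checked numerically with a small 3-3-1 network. This is a sketch, not the paper's MATLAB code; for brevity it performs a single gradient-descent step on the hidden-to-output weights only, using the sigmoid derivative f'(net) = f(net)(1 - f(net)):

```python
import numpy as np

def sigmoid(x):
    # formula (12): f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# 3-3-1 network, zero-mean Gaussian initialization as described in the text
W1 = rng.normal(0, 0.1, (3, 3)); b1 = np.zeros(3)
W2 = rng.normal(0, 0.1, (1, 3)); b2 = np.zeros(1)

x = np.array([0.5, -0.2, 0.8])   # toy input
t = np.array([1.0])              # toy target
eta = 0.5                        # learning rate (illustrative value)

def forward(x):
    # formula (13): forward propagation through hidden and output layers
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)
    return a2, a3

def loss(a3):
    # formula (14): squared loss
    return 0.5 * np.sum((t - a3) ** 2)

a2, a3 = forward(x)
before = loss(a3)
# formula (16): delta rule for the hidden-to-output weights
delta_out = (t - a3) * a3 * (1 - a3)
W2 = W2 + eta * np.outer(delta_out, a2)  # formula (15): W <- W + eta*(-dJ/dW)
b2 = b2 + eta * delta_out
_, a3_new = forward(x)
after = loss(a3_new)  # the single step reduces the loss
```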

From the development trend of convolution neural networks, first, networks are becoming deeper and deeper; scholars believe that the deeper the network, the more robust the features that can be learned. The network depth has therefore grown from the 8 layers of AlexNet to the 152 layers of ResNet, or even deeper. Second, network structures are becoming more and more complex: no longer a linear stack of convolution and pooling layers as in the VGG network, but convolution combination modules as in Inception, and densely connected structures as in DenseNet. In addition, various innovations have appeared in the basic components of networks, such as cross-channel convolution and combined convolution. Feature extraction networks are becoming richer and more powerful, and the optimization of computing resources has gradually become a focus of attention.

2.4. Adaptive learning rate algorithm

The main advantage of the adaptive gradient descent algorithm is that it adjusts the learning rate automatically during training instead of requiring manual adjustment; only an initial learning rate needs to be set. The disadvantage is that the effective learning rate becomes smaller and smaller until the parameters can no longer be updated and training cannot continue. The adaptive learning rate algorithm is an improvement of the adaptive gradient descent algorithm, designed to solve the problem that its learning rate keeps decreasing until it vanishes. The original algorithm accumulates the squares of all gradients of each parameter from the past to the present, which is too aggressive. The adaptive learning rate algorithm instead computes the sum of squared gradients over a fixed window; however, naively computing this windowed sum is inefficient, since it requires storing the previous gradient values. The algorithm therefore defines the current accumulated squared gradient as a decaying average of past squared gradients, with the formula as follows:

E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2   (17)

A decay multiplier \gamma is used to control the degree of updating, usually set to about 0.9. The parameter update is shown in the following formula:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t   (18)

In order to solve this problem, the authors constructed a new update formula by simulating Newton's iteration method:

\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t   (19)
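The update rules (17)–(19) are those of an AdaDelta-style adaptive learning rate method. A NumPy sketch applied to the toy objective f(θ) = θ² (the decay rate γ = 0.9 and the ε value are common defaults, not values from the paper):

```python
import numpy as np

def adadelta_step(theta, grad, Eg2, Edx2, gamma=0.9, eps=1e-6):
    # formula (17): decaying average of squared gradients
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    # formula (19): step scaled by the ratio of the two RMS quantities
    dtheta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # same decaying average, applied to the squared updates
    Edx2 = gamma * Edx2 + (1 - gamma) * dtheta ** 2
    return theta + dtheta, Eg2, Edx2

# minimize f(theta) = theta^2, whose gradient is 2*theta
theta = np.array([5.0])
Eg2 = np.zeros(1)
Edx2 = np.zeros(1)
for _ in range(500):
    theta, Eg2, Edx2 = adadelta_step(theta, 2 * theta, Eg2, Edx2)
# theta has moved toward the minimum without any hand-set learning rate
```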

A 32 × 32 contaminated video image has 1024 pixels. From it one can extract 1024-dimensional gray features and 2981-dimensional gray difference features, forming a 3101-dimensional feature vector. From the standpoint of information sources, the gray values of all pixels of an image contain all of its information. The gray difference feature is only derived from the gray features and adds no new information content, but it may reveal more meaningful structure within that information. In the research and development practice of this paper, taking the combined features


of gray level and gray level difference as input data can achieve higher detection and classification accuracy than using gray level data alone. Combining artificial feature design with the automatic optimization ability of deep machine learning can achieve better performance than pure machine learning.

3. Experiments

The experimental code is a MATLAB implementation tested under Windows 10. Monitoring equipment such as cameras should be placed within the effective protection scope of lightning rods. For front-end equipment already within the protection range of other lightning rods or the original lightning protection system of a high-rise building, separate direct lightning protection need not be considered; for front-end equipment not within the protection range of any lightning protection system, direct lightning protection should be considered. High resolution and high frame rates require more network bandwidth and occupy more video storage space, and the retention time of recorded content is related to resolution and frame rate; a PC system can usually be upgraded to obtain more video storage space. The system uploads video surveillance data from monitoring sites to the environmental protection platform through optical fiber or mobile networks, and can archive and back up important videos. On the map, the real-time and historical video data of the environmental monitoring points of pollution source enterprises can be viewed by location, and monitoring parameters can be set. Video surveillance pictures of the sewage outlets and chimneys of wastewater enterprises can be viewed in real time.
Once a pollution situation is found, on-site command and evidence collection can be carried out, which provides an effective auxiliary means for pollution source supervision by the Environmental Protection Bureau and greatly improves the efficiency of law enforcement. In order to verify the ability of the designed method in target detection and classification, 3000 positive samples and 3000 negative samples are extracted from all samples; 5000 samples are randomly selected as the training data set, and the remaining 1000 samples are used as the test data set. The selected samples are then used to test the detection and classification performance of the system. At the same time, to make the verification more comprehensive, three scenes with serious pollution sources are selected, as shown in Fig. 3. We use the same preprocessing method on the different data sets. First, each RGB image is converted to a grayscale image and resized to 256 × 256. Next, we normalize by subtracting the mean and dividing by the variance. During training and testing, we did not use bounding boxes. When testing the performance of the model, we use classification accuracy as the evaluation criterion, that is, the average of the classification accuracies of all classes; more precisely, it is the mean of the diagonal elements of the row-normalized confusion matrix. In multi-class classification tasks, classification accuracy is a classic performance evaluation criterion. In the experiments, the pollution types and the locations of pollution were extracted by video processing of actual pollution sources.

4. Discussion

The monitoring and management work is embodied in a hierarchical early warning mechanism: three-level early warning is implemented according to the severity of enterprise sewage discharge, divided into white, yellow and red warnings.
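The classification-accuracy criterion used in the experiments — the mean of the per-class accuracies, i.e. the average of the diagonal of the row-normalized confusion matrix — can be sketched as:

```python
import numpy as np

def mean_class_accuracy(conf):
    # conf[i, j] = number of class-i samples predicted as class j;
    # per-class accuracy is the diagonal of the row-normalized matrix,
    # and the criterion is the average over classes
    per_class = np.diag(conf) / conf.sum(axis=1)
    return float(per_class.mean())

# illustrative 2-class confusion matrix: 45/50 and 40/50 correct
conf = np.array([[45, 5],
                 [10, 40]])
acc = mean_class_accuracy(conf)  # (0.9 + 0.8) / 2 = 0.85
```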


Fig. 3. 3 severe pollution scenes.

Table 1
Monitoring index of waste water.

Serial number   Control project                 First level standard        Two level standard   Three level standard
                                                A standard    B standard
1               Chemical oxygen demand (COD)    50            55            90                   110
2               Ammonia nitrogen                4 (7)         7 (14)        24 (30)              –
3               Total phosphorus                1 / 0.5       0.5 / 1.5     2 / 2                4 / 4
4               pH                              6–9
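As an illustrative sketch (the limit values are taken from the COD row of Table 1; the helper itself is hypothetical), grading a wastewater COD measurement against the discharge standards might look like:

```python
# COD discharge limits (mg/L) from Table 1: first-level A/B, second, third.
COD_LIMITS = [("first level A", 50), ("first level B", 55),
              ("second level", 90), ("third level", 110)]

def cod_grade(cod_mg_per_l):
    """Return the most stringent standard the measurement still meets,
    or None if it exceeds even the third-level limit."""
    for name, limit in COD_LIMITS:
        if cod_mg_per_l <= limit:
            return name
    return None

print(cod_grade(48))   # first level A
print(cod_grade(100))  # third level
print(cod_grade(130))  # None
```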

Fig. 4. Video image processing for the three scenes.
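For concreteness, the frame preprocessing used in the experiments (grayscale conversion, resizing to 256 × 256, and mean/variance normalization) can be sketched as follows; the nearest-neighbour resizing and the BT.601 grayscale weights are our assumptions, since the paper does not specify them:

```python
import numpy as np

def preprocess(rgb, size=256):
    """Grayscale, nearest-neighbour resize to size x size, then subtract
    the mean and divide by the standard deviation (a stand-in for the
    paper's unspecified resizing and normalization routines)."""
    rgb = np.asarray(rgb, dtype=float)
    gray = rgb @ np.array([0.299, 0.587, 0.114])  # ITU-R BT.601 weights
    h, w = gray.shape
    rows = np.arange(size) * h // size            # nearest-neighbour indices
    cols = np.arange(size) * w // size
    resized = gray[rows][:, cols]
    return (resized - resized.mean()) / (resized.std() + 1e-8)

frame = np.random.rand(480, 640, 3)  # one hypothetical video frame
out = preprocess(frame)
print(out.shape)                     # (256, 256)
```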

1. When waste water or waste gas monitoring data are missing, or data transmission is unstable, a white warning is started.
2. A yellow warning is started when enterprise wastewater continuously exceeds the standard for less than 6 h, or the average discharge concentration on the same day exceeds the standard by less than 0.5 times; a yellow warning is started when enterprise waste gas continuously exceeds the standard for less than 3 h, or the average discharge concentration on the same day exceeds the standard by less than 0.3 times.
3. A red warning is started when enterprise wastewater exceeds the standard for more than 6 h, or the average discharge concentration on the same day exceeds the standard by more than 0.5 times; a red warning is started when enterprise waste gas exceeds the standard for more than 3 h, or the average discharge concentration on the same day exceeds the standard by more than 0.3 times.

The online monitoring system for pollution source water quality is mainly used to monitor whether the chemical oxygen demand (COD) or total organic carbon (TOC), flow rate, pH value, ammonia nitrogen and total phosphorus in wastewater exceed the national discharge standards. The wastewater monitoring indices used in the video are shown in Table 1.

Chemical oxygen demand (COD), also called chemical oxygen consumption, uses chemical oxidants to decompose oxidizable substances such as organic matter, nitrite, ferrous salts and sulfides in wastewater; the oxygen consumption is then calculated from the amount of residual oxidant. COD is expressed in milligrams per liter (mg/L) and is one of the important comprehensive indicators of the degree of water pollution. Ammonia nitrogen here covers several forms of nitrogen in wastewater: nitrate nitrogen, nitrite nitrogen, ammonia nitrogen and organic nitrogen. Total phosphorus mainly refers to the phosphate present in wastewater. The pH value is an important parameter for evaluating water quality; it reflects the acidity or alkalinity of the water. Image processing of the video from the three monitored scenes is shown in Fig. 4.

To achieve fast convergence in training, an appropriate weight initialization method is very important, because training a deep network with a large number of adjustable parameters and a non-convex loss function is difficult. In our research, we use the most common initialization method, sampling from a Gaussian distribution. The weights of the five convolution layers and of the last fully connected layer are sampled from a Gaussian distribution with zero mean and a variance of 0.01, and the first two fully connected layers are initialized from a Gaussian distribution with zero mean and a variance of 0.005. The biases of the first and third convolution layers and of the last fully connected layer are set to 0, and the biases of the remaining layers are set to 1. In training, we randomly crop a fixed-size (227 × 227) sub-image from


Fig. 5. Video image after processing.

Table 2
Classification accuracy of different grades.

Hierarchy number    Number of classes    Accuracy rate (%)
1                   1                    –
2                   2                    93.4
3                   17                   98.19
4                   38                   98.72

the selected 256 × 256 pictures, and randomly perform horizontal flipping and RGB conversion, as in the contaminated video. The processed video image is shown in Fig. 5.

The performance of the multistage transfer learning model is compared with that of the benchmark model, whose accuracy is 91.18%. Table 2 shows the results on the test set. As can be seen from Table 2, the proposed framework improves classification accuracy by more than 7%, which proves the effectiveness of the proposed model. In addition to the accuracy for each pollution source type, we also show the classification performance for each pollution source scenic spot, such as the "Summer Palace". Table 2 shows the accuracy of the multi-stage transfer learning model; we find that multi-stage transfer learning performs best for most types.

For further comparison, Fig. 6 shows the recognition effect of four different methods. In Fig. 6, "NoAug" means that the model is trained without data augmentation or auxiliary tasks, and the first recognition strategy is used for prediction. "NoAugDS" and "NoAugCS" use the same prediction method; the difference is that one uses the 4019-dimensional features and the other uses hybrid features, namely the 4019-dimensional features combined with the 140-dimensional PAV features. "AugDS" means that data augmentation and auxiliary tasks are used to train the network. "Deep" denotes the 4019-dimensional feature, and "Com"

Fig. 6. Effect of data and size on performance.

represents the combination of the PAV features with the "Deep" (4019-dimensional) feature to form a new feature. From these results, we find that data augmentation improves recognition accuracy by 1.9%, which proves its effectiveness. Similarly, "Com" is only 0.16% higher than "Deep". However, when we use PAV to design the auxiliary task, the comparison between AugHS and AugDS shows that the auxiliary task improves accuracy by about 3.2 percentage points. In this scenario the softmax classifier behaves differently; for example, when the scores are [18, 17, 17] it still incurs a relatively large loss. We expect a classifier to focus on distinguishing "polluted" from "unpolluted", so the SVM classifier is, to some extent, better than softmax here. Our method clearly achieves good recognition performance, which proves the validity of the algorithm. At the same time, it verifies that using the pollution source video parameters as a cue-constrained


depth network can drive the network to learn features related to those parameters, which makes it more suitable for behavior recognition tasks.

5. Conclusions

In recent years, many scholars at home and abroad have studied behavior recognition in dynamic video images [35–38]. The methods are mainly divided into those based on deep learning and those based on hand-crafted features, and most of them can exploit high-level cues (pollutant generation) for classification. In this paper, we also use a convolutional neural network to study behavior recognition in dynamic images, and consider how a high-level cue constraint can drive the network to learn behavior-related features. We also studied the potential of convolutional neural networks in static image recognition. A new cue-enhancement algorithm for behavior recognition with convolutional neural networks is proposed. Specifically, we use the ImageNet database to train the initial deep model, and then design histogram features of the behavior (the distribution of different body parts) with the poselet method. We regard these histograms as soft labels, i.e. the ground truth of a regression task. In this way, the regression task serves as an auxiliary task and behavior recognition as the main task, which together constitute a multi-task learning framework: the two tasks share the same convolution-layer features and are optimized jointly. The experimental results show that high-level cues play an active role in behavior recognition in the video image processing of pollution sources, and can alleviate, to a certain extent, the impact of complex backgrounds, occlusion and other issues. Finally, we introduce a new database centered on pollution sources, named Pollution Sources.
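The joint optimization described above can be illustrated with a small numerical sketch; all names, dimensions and the equal loss weighting are illustrative assumptions, not the paper's actual configuration. The shared features feed a softmax classification head (main task) and a regression head trained against the pose-histogram soft labels (auxiliary task), and the two losses are summed into one objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared convolutional features for a mini-batch (illustrative sizes).
feats = rng.standard_normal((8, 64))          # 8 samples, 64-d shared feature
W_cls = rng.standard_normal((64, 3)) * 0.01   # classification head, 3 classes
W_reg = rng.standard_normal((64, 10)) * 0.01  # regression head, 10-bin histogram

labels = rng.integers(0, 3, size=8)           # behavior labels (main task)
soft = rng.random((8, 10))                    # pose-histogram soft labels
soft /= soft.sum(axis=1, keepdims=True)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

probs = softmax(feats @ W_cls)
ce = -np.log(probs[np.arange(8), labels]).mean()  # main-task cross-entropy
mse = ((feats @ W_reg - soft) ** 2).mean()        # auxiliary regression loss
total = ce + mse                                  # joint objective

print(total > 0)  # True
```

Both loss terms are functions of the same `feats`, so gradients from the auxiliary task also shape the shared representation, which is the point of the framework.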
In order to solve the pollution source classification problem, we propose a multi-stage transfer learning model based on the high-level cues of the classification hierarchy, with convolutional neural networks as the basic building blocks. To test the effectiveness of this model, we designed a benchmark model and a single-level transfer learning model. The experimental results show that our framework reaches a new level of performance.

6. Declarations

Ethical Approval and Consent to participate: Approved.
Consent for publication: Approved.
Availability of supporting data: We can provide the data.

7. Competing interests

There are no potential competing interests in our paper, and all authors have seen the manuscript and approved its submission to your journal. We confirm that the content of the manuscript has not been published or submitted for publication elsewhere.

8. Author's contributions

All authors took part in the discussion of the work described in this paper. Kunrong Zhao wrote the first version of the paper. Tingting He and Shuang Wu performed part of the experiments, and Songling Wang revised successive versions of the paper.

Acknowledgements

The authors thank the editor and the anonymous reviewers for their helpful comments and valuable suggestions.

References

[1] M.J. Birkner, Identification of sources of environmental pollution at the sites of production, storage and transportation of oil using the PAH indicator ratios, Ecol. Model. 155A (3) (2014) 459–465.
[2] W. Zhu, C. Gu, Z. Xie, et al., Continuous emission online monitoring technology of industrial stationary pollution source, Automat. Petro-Chem. Ind. (2016).
[3] Y. Hu, L. Li, W. Hu, et al., Component analysis and pollution sources of contamination on insulators in fog and haze areas, Insul. Surge Arresters (2016).
[4] Y. Zhao, W. Lu, C. Xiao, Mixed integer optimization approach to groundwater pollution source identification problems, Environ. Forensics 17 (4) (2016) 355–360.
[5] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, F. Wu, Background prior-based salient object detection via deep reconstruction residual, IEEE Trans. Circuits Syst. Video Technol. 25 (8) (2015) 1309–1321.
[6] P. Wycisk, R. Stollberg, C. Neumann, et al., Integrated methodology for assessing the HCH groundwater pollution at the multi-source contaminated mega-site Bitterfeld/Wolfen, Environ. Sci. Pollut. Res. Int. 20 (4) (2013) 1907–1917.
[7] N. Tahmassebipoor, O. Rahmati, F. Noormohamadi, et al., Spatial analysis of groundwater potential using weights-of-evidence and evidential belief function models and remote sensing, Arab. J. Geosci. 9 (1) (2016) 1–18.
[8] T. Sigler, G. Searle, K. Martinus, et al., Metropolitan land-use patterns by economic function: a spatial analysis of firm headquarters and branch office locations in Australian cities, Urban Geography 37 (2016) 416–435.
[9] J. Han, D. Zhang, G. Cheng, L. Guo, J. Ren, Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning, IEEE Trans. Geosci. Rem. Sens. 53 (6) (2015) 3325–3337.
[10] H. Tang, Y. Duan, C. Zhu, et al., Characteristics of a biomass-based sorbent trap and its application to coal-fired flue gas mercury emission monitoring, Int. J. Coal Geol. 170 (2017) 19–27.
[11] W. Blum, M. Szelagiewicz, Fast-cycle trace analysis of dioxin in flue gas. Monitoring of the incineration of dioxin-containing waste from Seveso, J. Sep. Sci. 11 (6) (2015) 480–486.
[12] L. Zhang, Y. Gao, R. Zimmermann, Q. Tian, X. Li, Fusion of multichannel local and global structural cues for photo aesthetics evaluation, IEEE Trans. Image Process. 23 (3) (2014) 1419–1429.
[13] Q. Wang, Retrofit of flue gas monitoring and denitration automatic control systems in a power plant, Electric Power (2015).
[14] J. Han, K.N. Ngan, M. Li, H.-J. Zhang, Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circuits Syst. Video Technol. 16 (1) (2006) 141–145.
[15] K.M. Fristrup, Beyond decibels: inspiring informed noise management in U.S. National Parks, Acoust. Soc. Am. J. 139 (4) (2016) 1981.
[16] L.P. Getto, D. Marco, M.A. Papas, et al., The effect of noise distraction on emergency medicine resident performance during intubation of a patient simulator, J. Emerg. Med. 50 (3) (2016) e115–e119.
[17] A.R. Pathak, M. Pandey, S. Rautaray, Deep learning approaches for detecting objects from images: a review, Prog. Comput. Anal. Netw. (2018).
[18] Y. Li, Deep learning on computing optimization on GPU, China Comput. Commun. (2018).
[19] Q. Hua, W. Jiang, H. Zhao, et al., Tibetan name entity recognition with perceptron model, Comput. Eng. Appl. (2014).
[20] D. Zhang, D. Meng, J. Han, Co-saliency detection via a self-paced multiple-instance learning framework, IEEE Trans. Pattern Anal. Mach. Intell. 39 (5) (2017) 865–878.
[21] Z. Zhu, P. Luo, X. Wang, et al., Multi-view perceptron: a deep model for learning face identity and view representations, in: International Conference on Neural Information Processing Systems, MIT Press, 2014, pp. 217–225.
[22] K. Gregor, I. Danihelka, A. Graves, et al., DRAW: a recurrent neural network for image generation, Comput. Sci. (2015) 1462–1471.
[23] J. Han, X. Ji, X. Hu, D. Zhu, K. Li, X. Jiang, G. Cui, L. Guo, T. Liu, Representing and retrieving video shots in human-centric brain imaging space, IEEE Trans. Image Process. 22 (7) (2013) 2723–2736.
[24] A. Hipni, A. El-shafie, A. Najah, et al., Erratum to: daily forecasting of dam water levels: comparing a support vector machine (SVM) model with adaptive neuro fuzzy inference system (ANFIS), Water Resour. Manage. 27 (11) (2013) 4113.
[25] L. Zhang, M. Song, Q. Zhao, X. Liu, J. Bu, C. Chen, Probabilistic graphlet transfer for photo cropping, IEEE Trans. Image Process. 22 (2) (2013) 802–815.
[26] W. Zhang, D. Zhao, Z. Chai, et al., Deep learning and SVM-based emotion recognition from Chinese speech for smart affective services, Softw. Pract. Exper. 47 (8) (2017) 1127–1138.
[27] A.B. Sargano, X. Wang, P. Angelov, et al., Human action recognition using transfer learning with deep representations, in: International Joint Conference on Neural Networks, IEEE, 2017, pp. 463–469.
[28] T. Zhang, L. Guo, K. Li, C. Jing, Y. Yin, D. Zhu, G. Cui, L. Li, T. Liu, Predicting functional cortical ROIs via DTI-derived fiber shape models, Cereb. Cortex 22 (4) (2012) 854–864.
[29] H. Chenying, W.C. Chen, P.T. Lai, et al., Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database, in: International Conference of the IEEE Engineering in Medicine & Biology Society, p. 3110.
[30] L. Zhang, M. Song, Y. Yang, Q. Zhao, C. Zhao, N. Sebe, Weakly supervised photo cropping, IEEE Trans. Multimedia 16 (1) (2014) 94–107.
[31] M. Turker, D. Koc-San, Building extraction from high-resolution optical spaceborne images using the integration of support vector machine (SVM) classification, Hough transformation and perceptual grouping, Int. J. Appl. Earth Obs. Geoinf. 34 (5) (2015) 58–69.
[32] H. Deng, H. Yi, X. Tang, et al., Interactive effect for simultaneous removal of SO2, NO, and CO2 in flue gas on ion exchanged zeolites, Ind. Eng. Chem. Res. 52 (20) (2013) 6778–6784.
[33] L. Zhang, Y. Xia, K. Mao, H. Ma, Z. Shan, An effective video summarization framework toward handheld devices, IEEE Trans. Ind. Electron. 62 (2) (2015) 1309–1316.
[34] J.M. Blais, M.R. Rosen, J.P. Smol, Using natural archives to track sources and long-term trends of pollution: an introduction, in: Environmental Contaminants, Springer, Netherlands, 2015, pp. 1–3.
[35] G. Saon, H. Soltau, D. Nahamoo, et al., Speaker adaptation of neural network acoustic models using i-vectors, in: Automatic Speech Recognition and Understanding, IEEE, 2014, pp. 55–59.
[36] D. Wang, Y. Zou, W. Wang, Learning soft mask with DNN and DNN-SVM for multi-speaker DOA estimation using an acoustic vector sensor, J. Franklin Inst. (2017).
[37] C. Reidlleuthner, A. Viernstein, K. Wieland, et al., Quasi-simultaneous in-line flue gas monitoring of NO and NO2 emissions at a caloric power plant employing mid-IR laser spectroscopy, Anal. Chem. 86 (18) (2014) 9058.
[38] Z. Chen, H. Xu, J. Luo, et al., Low-power perceptron model based ECG processor for premature ventricular contraction detection, Microprocess. Microsyst. 59 (2018).
[39] D. Zhang, J. Han, C. Li, J. Wang, X. Li, Detection of co-salient objects by looking deep and wide, Int. J. Comput. Vision 120 (2) (2016) 215–232.
[40] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images, IEEE Trans. Geosci. Rem. Sens. 54 (12) (2016) 7405–7415.
[41] B. Zhang, C.L. Xu, S.M. Wang, An inverse method for flue gas shielded metal surface temperature measurement based on infrared radiation, Meas. Sci. Technol. 27 (7) (2016) 074002.
[42] Y.C. Wang, G.S. Zhao, Assessment of potential non-point source pollution risks of high-yield farmland with life cycle assessment method, J. Ecol. Rural Environ. (2015).

Kunrong Zhao was born in Meizhou, Guangdong, P.R. China, in 1979. He received the Doctor's Degree from Sun Yat-sen University, P.R. China. Now, he works in the South China Institute of Environmental Sciences, MEP. His research interests include environmental engineering, computational intelligence and information security.

Tingting He was born in Guangzhou, Guangdong, P.R. China, in 1993. She received the bachelor's degree from Guangdong University of Finance & Economics, P.R. China. Now, she works in Guangzhou Hexin Environmental Protection Technology Co., Ltd. Her research interests include environmental assessment, big data analysis and information security.

Shuang Wu was born in Beitun, Xinjiang, P.R. China, in 1990. She received the Master's degree from Northwest Normal University, P.R. China. Now, she works in Guangzhou Huake Environmental Protection Engineering Co., Ltd. Her research interests include environmental planning and management.

Songling Wang was born in Ledong, Hainan, P.R. China, in 1993. He received the bachelor's degree from Qingdao University of Technology, P.R. China. Now, he works in the South China Institute of Environmental Sciences, MEP. His research interests include computational intelligence, information security and big data analysis.

Bilan Dai was born in Meizhou, Guangdong, China, in 1995. She received a bachelor's degree from Guangdong Ocean University. At present, she works in the South China Institute of Environmental Sciences, MEP. Her research direction is the comprehensive development and utilization of environmental information resources.

Qifan Yang was born in Maoming, Guangdong, China, in 1995. He received the bachelor's degree from Guangdong University of Finance & Economics. Now, he works in Guangzhou Hexin Environmental Protection Technology Co., Ltd. His research interests include cloud security, chaos encryption and information security.

Yutao Lei was born in Huizhou, Guangdong, P.R. China, in 1978. He received the Master's Degree from Guangdong University of Technology, P.R. China. Now, he works in the South China Institute of Environmental Sciences, MEP. His research interests include environmental engineering and environmental assessment.