Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 151 (2019) 675–682

www.elsevier.com/locate/procedia
The 2nd International Conference on Emerging Data and Industry 4.0 (EDI40), April 29 – May 2, 2019, Leuven, Belgium

Deep Neural Network Method of Recognizing the Critical Situations for Transport Systems by Video Images
F.F. Pashchenko a,b, O.S. Amosov a,*, S.G. Amosova a, Y.S. Ivanov c, S.V. Zhiganov c

a V.A. Trapeznikov Institute of Control Sciences of Russian Academy of Sciences, Profsoyuznaya st. 65, Moscow, 117997, Russia
b Moscow Institute of Physics and Technology, Kerchenskaya st. 1A, Moscow, 117303, Russia
c Komsomolsk-na-Amure State University, Lenin Avenue 27, Komsomolsk-on-Amur, 681013, Russia
Abstract

A deep neural network method of recognizing critical situations for transport systems from the video frames of intelligent vehicle cameras is proposed that is effective in terms of accuracy and speed. Unlike the known solutions for the detection and recognition of objects and of normal or critical situations, it uses classification with subsequent reinforcement on the basis of several video stream frames and an automatic annotation algorithm. Adapted neural network architectures are proposed: a dual network to identify drivers and passengers by the face image, and a network with independent recurrent layers to classify situations by the video fragment. A scheme of an intelligent distributed city transport safety system using cameras and on-board computers united in a single network is proposed. Software modules in Python are developed and natural experiments are carried out. The applicability of the proposed algorithms and programs in UGVs or in driver assistance systems is shown with illustrative real-time examples.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.

Keywords: critical situation; transport system; localization; recognition; deep neural network; computer vision system
* Corresponding author. E-mail address: [email protected]

1877-0509 © 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
10.1016/j.procs.2019.04.090
1. Introduction

Due to the increase in vehicle speeds and traffic intensity, improving safety is of particular importance for the modern transport system, and the early detection and recognition of critical situations becomes especially important. A critical (abnormal) situation is an event caused by the actions of technical and physical objects, or a combination of conditions involving a negative impact on these and other objects, the equipment or the environment. The critical situations for transport systems are: the probability of a road accident (RA); the existence of artificial and natural obstacles; poor visibility; inadequate behavior of a vehicle; unauthorized access to cars. The recognition of these situations is performed by a driver or by a transport system operator using visual data, so a computer vision system (CVS) can be applied to solve this task. The main objective of the computer vision algorithms is to find the patterns in the dynamic video stream frames and to extract their key features characterizing the object properties. The recognition objects for unmanned ground vehicles (UGV) are: the background objects (the road, markings, trees, houses, etc.), vehicles (and their orientation relative to the camera), people (and their biometric features), and also various actions and signs of critical situations (fire, smoke, a fight, etc.).

Many papers [1, 2] are devoted to UGVs or to driver assistance systems [3], and also to the problems of optical pattern recognition in them. In [4] it is proposed to use principal component analysis to detect and classify traffic lights. The authors of [5] offer a new HOG-like descriptor on the basis of motion contour images. Recently, one of the noticeable trends is to use deep neural networks (NN) in CVS when solving pattern recognition problems. Despite the high accuracy of such networks, they are quite demanding on resources and involve video stream transmission and computation on server graphics accelerators (GPU). The transmission delays do not allow this approach to be applied to UGVs that require real-time (RT) detection. This makes it necessary to use embedded GPU computing modules with lower performance but low power consumption and small size as on-board computers. Taking into account the hardware limitations of such modules, it is possible to run lightweight adaptations of deep NNs [6] which, with a small drop in accuracy, have a higher computational speed. Thus, in [7] NNs with deep convolutions were used for traffic sign recognition in RT.

A critical situation has to be distinguished according to the presence and actions of the objects detected by the NN in the video image. It is of interest to create a combined system joining artificial intelligence technologies, fuzzy logic and computer vision, on the basis of modern computing systems. Therefore the aim of this article is to develop a computational method of pattern recognition in a continuous video stream with the use of deep neural networks and fuzzy logic systems that is effective in terms of accuracy and speed.

A UGV or a car with a driver assistance system carries several cameras, many sensors and an on-board computing module, typically equipped with a GPU. The front camera monitors the road and the objects which are traffic participants. The all-round camera monitors the objects in the adjacent lanes and the situation behind the car.
The internal camera watches the interior of the car; it can recognize the driver or, in the case of an unmanned ground taxi, identify the passengers. The tasks of the on-board computer include the analysis of the sensor data and pattern recognition from the video frames. When a critical situation arises, the on-board computer should take actions to prevent its consequences or transfer the information to the city transport network server.

By a "pattern" we mean the technical and physical entities and the normal and critical situations occurring in the interaction of the objects. It is then necessary to find the pattern of an object or a situation by selecting the key features in the arriving video stream, to assign it to one of the classes, and to make a decision on the absence or existence of a critical situation in order to prevent it. It is necessary to develop a deep neural network computational method implementing such pattern recognition and to show the possibility of its application in intelligent transport systems. To solve this task, computer vision methods based on convolutional and recurrent neural networks are applied, software is developed in the Python programming language, and a natural experiment is carried out with the video stream arriving from the UGV surveillance cameras.
2. The statement and the solution of the problems

2.1. The statement of the pattern recognition task

Assume that we have: a set of patterns $\omega \in \Omega$ given by the features $x_i$, $i = \overline{1,n}$, whose collection for the pattern $\omega$ is represented by the vector description $F(\omega) = (x_1(\omega), x_2(\omega), \ldots, x_n(\omega)) = \mathbf{x}$; and a set of classes $B = \{\beta_1, \beta_2, \ldots, \beta_c\}$, where $c$ is the number of classes. A priori information is represented by the training set (dataset) $D = \{(\mathbf{x}_j, \beta_j)\}$, $j = \overline{1,L}$, given by a table in which each row $j$ contains the pattern vector description $F(\omega)$ and the class label $\beta_k$, $k = \overline{1,c}$. Note that the training set characterizes the unknown transformation $F^{*}\colon \Omega \to B$.

It is required to solve the pattern recognition problem from the available frames $I_t$ of the continuous video stream $V = (I_1, \ldots, I_t, \ldots, I_\tau)$ and the a priori information given by the training set $D = \{(\mathbf{x}_j, \beta_j)\}$, $j = \overline{1,L}$, used for supervised deep training of the NN: to detect the pattern $\omega$ in the form of the feature estimate $\hat{\mathbf{x}}$ using the transformation [8] $F_1\colon I_t \to \hat{\mathbf{x}}$, and to classify it using the transformation $F_2\colon \hat{\mathbf{x}} \to \beta_k$, $k = \overline{1,c}$, in accordance with the given criterion $P(\hat{\mathbf{x}})$ minimizing the probability of an error. Thus, it is necessary to find the transformation $F\colon I_t \to \beta_k$, $k = \overline{1,c}$, where $F$ is a set of functions and algorithms $f_i$, $i = \overline{1, N_f}$.
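To make the required mapping concrete, the following minimal Python sketch expresses the sought transformation $F\colon I_t \to \beta_k$ as the composition of the detection stage $F_1$ and the classification stage $F_2$. Both functions are hypothetical placeholders, not the neural networks described below.

```python
# A minimal sketch of F = F2 o F1 applied to a single frame.
# Both functions are hypothetical placeholders; the real F1 and F2
# are implemented by the deep neural networks described below.
import numpy as np

def f1_detect(frame: np.ndarray) -> np.ndarray:
    """F1: I_t -> x_hat, estimate the feature vector of the pattern (placeholder)."""
    return np.zeros(128)

def f2_classify(x_hat: np.ndarray) -> int:
    """F2: x_hat -> beta_k, assign the feature estimate to a class label (placeholder)."""
    return 0

def f_recognize(frame: np.ndarray) -> int:
    """F = F2 o F1: map a video frame to one of the c classes."""
    return f2_classify(f1_detect(frame))
```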
2.2. The solution of the problem of objects and situations detection and recognition

To solve the problem of object and situation detection and classification, a computational method of pattern recognition $F\colon I_t \to \beta_k$, $t = \overline{1,\tau}$, $k = \overline{1,c}$ is proposed, implemented as a composition of traditional image processing methods and deep neural networks (Figure 1). Note that a video stream frame may contain several objects; the information about the parameters of each object is written into an array $G_t$. The comparison element (CE) then checks the conditions of the knowledge base $D^{\mathrm{normal}}$ against the parameters in $G_t$ and distinguishes a normal or a critical situation according to the equation

$$ s = F_4(G_t) = \begin{cases} 1, & \text{if } G_t \in D^{\mathrm{normal}}, \\ 0, & \text{if } G_t \notin D^{\mathrm{normal}}, \end{cases} \qquad (1) $$

where $s \in \{0, 1\}$ denotes a normal or an abnormal situation. A deviation of any object parameter from the combinations of values given in $D^{\mathrm{normal}}$ puts the system into a critical situation.
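For illustration, a minimal sketch of the comparison-element check of equation (1) follows. The representation of the knowledge base $D^{\mathrm{normal}}$ as allowed parameter ranges and the parameter names are assumptions made for the example only.

```python
# A minimal sketch of the comparison element (CE) check of equation (1).
# The structure of D_normal (allowed value ranges per object parameter)
# and the parameter names are assumed for illustration.
D_NORMAL = {
    "speed_kmh":     (0.0, 90.0),    # hypothetical allowed range
    "lane_offset_m": (-0.5, 0.5),    # hypothetical allowed range
}

def situation_is_normal(objects) -> int:
    """Return s = 1 if every parameter of every object in G_t lies inside
    D_normal, otherwise s = 0 (critical situation)."""
    for obj in objects:
        for name, (lo, hi) in D_NORMAL.items():
            value = obj.get(name)
            if value is not None and not (lo <= value <= hi):
                return 0  # deviation from D_normal puts the system into a critical situation
    return 1
```

For example, calling situation_is_normal([{"speed_kmh": 120.0}]) returns 0, signalling a critical situation.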
Fig. 1. The steps of the deep neural network method for critical situation detection
3. The implementation of the method

Figure 2 shows the proposed scheme of the intelligent distributed urban transport security system, which contains a variety of CCTV cameras. It is proposed to use the NVIDIA Jetson TX2 computing module as an on-board computer with a GPU.
Fig. 2. The structural scheme of the intelligent distributed urban transport security system
Let us consider the stages of the deep neural network recognition method as applied to the task of object and situation detection and classification:

1. Extraction from the continuous video stream $V = (I_1, \ldots, I_t, \ldots, I_\tau)$ of an image $I_t$ of size $w_{I_t} \times h_{I_t}$, where $t$ is the number of the current frame.

2. Search for object patterns in the frame, $f_1\colon I_t \to G_t$, where $G_t$ is an array of elements containing the parameters of the $n$ objects in the video stream frame. If it is necessary to specify additional features of the desired object $o$, we move to the next step; otherwise the next frame is taken. The structure of the array $G_t$ is created. It is suggested to use the pre-trained deep neural network YOLO as the algorithm for searching for the objects "vehicles" and "people". For image segmentation it is suggested to use a re-trained deep NN with the architecture of the SegNet model [9]. This makes it possible to detect the background objects ("road", "house", "marking", "tree", etc.) as well as the technogenic objects ("smoke", "fire", "flashes"). As a result we get the array $G_t$ containing the detected objects, their coordinates and the class labels. If $\hat{o} \in \{1, 2, \ldots, 6\}$, i.e. the objects "person", "vehicle", "traffic sign", "smoke", "fire" or "flashes" are detected, the transition to the next step is made; otherwise the first step of the method is repeated to get the next frame. If there is an object "person", it is necessary to identify the person by his or her face. Although the YOLO model distinguishes between several classes of vehicles, it is also necessary to recognize a vehicle's orientation relative to the camera. The "traffic sign" class requires further specification to comply with the traffic rules, while the technogenic objects accompany critical situations and require analysis of the video fragment.

3. Extraction of the first-level interest area $R^1 = \mathrm{crop}(I_t, x^o, y^o, w^o, h^o)$, where $x^o, y^o$ are the coordinates of the center of the object $o$, $w^o, h^o$ are its dimensions, and $\mathrm{crop}$ is the operation of cutting out the submatrix of $I_t$ between the coordinates $(x^o - w^o/2,\; y^o - h^o/2)$ and $(x^o + w^o/2,\; y^o + h^o/2)$. For the all-round camera the video sequence must be analyzed as a whole, so the interest area is the whole image: $x^o = i/2$, $y^o = j/2$, $w^o = i$, $h^o = j$, i.e. $R^1_{\mathrm{seq}} = I_t$.

4. Refinement of the interest area to detail the information about the pattern, $f_2\colon R^1_t \to R^2$. When a face is found in the picture, the HOG algorithm [10] or SSD MultiBox may be used for localization. The result of the algorithm is then a matrix $R^2$ containing an image of the person's face.
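For illustration, a minimal sketch of steps 1–3 (taking a frame from the video stream, detecting objects, and cropping the first-level interest area) is given below. The video source path and the detector are placeholders: in the method a pre-trained YOLO network plays the role of the detector, which is only stubbed here.

```python
# A minimal sketch of steps 1-3. The detector is a hypothetical stub
# standing in for the pre-trained YOLO network; the video path is assumed.
import cv2  # OpenCV is used here for video capture; frames come as NumPy arrays

def detect_objects(frame):
    """Hypothetical stand-in for the YOLO detector: a real implementation would
    return, for each object, its centre (x, y), size (w, h) and class label."""
    return []  # placeholder

def crop_interest_area(frame, x_o, y_o, w_o, h_o):
    """R1 = crop(I_t, x_o, y_o, w_o, h_o): cut the submatrix of the frame
    between (x_o - w_o/2, y_o - h_o/2) and (x_o + w_o/2, y_o + h_o/2)."""
    x1, y1 = int(x_o - w_o / 2), int(y_o - h_o / 2)
    x2, y2 = int(x_o + w_o / 2), int(y_o + h_o / 2)
    return frame[max(y1, 0):y2, max(x1, 0):x2]

cap = cv2.VideoCapture("front_camera.mp4")       # assumed video source
ok, frame = cap.read()                           # step 1: the frame I_t
if ok:
    g_t = detect_objects(frame)                  # step 2: the array G_t
    regions = [crop_interest_area(frame, o["x"], o["y"], o["w"], o["h"])
               for o in g_t]                     # step 3: first-level interest areas R1
```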
When analyzing a situation it is necessary to analyze the frame sequence and to generalize the obtained information over a certain time interval; then $R^2 = \mathrm{concat}(I_t, \ldots, I_{t+i-1})$, where $i$ is the number of frames and $\mathrm{concat}$ is the operation of concatenating several consecutive frames into a multidimensional array. This problem is solved by sequentially passing through the video stream $V$ with a scanning window of size $i = 690$ frames and an offset step $d = 23$ frames. The frames form an array $R^2_k$, where $k$ is the number of the window.

5. Preprocessing of the interest area, $R^{2*} = f_3(R^2, \mathbf{M}, \gamma)$, where $\mathbf{M}$ is the matrix of geometrical linear and affine transformations of $R^2$, and $\gamma$ is the set of matrix functions and their parameters for brightness and contrast transformations of $R^2$. The result of the affine transformations is the matrix $R^{2*}$. In the case of a video fragment it is necessary to preprocess each frame of $R^2_k = [I_t, I_{t+1}, \ldots, I_{t+i-1}]$ by means of a noise elimination algorithm, including fuzzy logic [11]; the result is the preprocessed interest area $R^{2*}_k = [R^{2*}_{kt}, R^{2*}_{kt+1}, \ldots, R^{2*}_{kt+i-1}]$.

6. Extraction of the informative features of $R^{2*}$ by frame coding with a convolutional NN, $F_{\mathrm{CNN}}\colon R^{2*} \to \hat{\mathbf{x}}$, where $\hat{\mathbf{x}}$ is the interest area mapped to the CNN feature space. For this purpose, convolutional neural network architectures can be used, both pre-trained on the ImageNet dataset [12] and trained on the prepared datasets $D$. As the feature space for face images it is suggested to use the features obtained from the adapted and re-trained deep NN architecture MobileNet v2 without its two last layers, and the features from the pre-trained Inception v3 [13] to encode the video sequences. Unlike the known solutions, the adapted MobileNet v2 architecture with depthwise convolutions achieves a higher classification speed. This makes it possible to analyze a large number of frames in a time period and to reinforce the estimate using fuzzy logic algorithms. The features obtained from the last image convolution layer have the size $(4, 3, 1280)$ for $\hat{\mathbf{x}}$, and the video fragment features $\hat{\mathbf{x}}_{\mathrm{seq}}$ have the size $2048 \times 1$.

7. Assignment of the feature vector to one of the classes, $f_4\colon \hat{\mathbf{x}} \to \mathbf{p}_{\hat{x}}$, where $\mathbf{p}_{\hat{x}}$ is a vector of size $c \times 1$ containing the classification probabilities and $c$ is the number of classes. The set of classes $B = \{\beta_1, \beta_2, \ldots, \beta_c\}$ is a set of people's faces, vehicle orientations, traffic signs or situations. To solve the problem of vehicle and traffic sign classification it is suggested to add a final fully connected layer with the Softmax activation function to the adapted MobileNet v2. To solve the problem of face classification, a new deep NN architecture based on the dual approach was developed [14] (Figure 3a). Two inputs are provided in the proposed NN architecture: the reference in the form of a feature vector $\hat{\boldsymbol{\beta}}_k$ arrives at the first input and $h_{l1} = \hat{\mathbf{x}} - \hat{\boldsymbol{\beta}}_k$ is calculated; the object feature vector $\hat{\mathbf{x}}$ arrives at the second input and $h_{l2} = \hat{\mathbf{x}} \cdot \hat{\boldsymbol{\beta}}_k$ is calculated. It should be noted that the first two layers of the developed architecture form a dual network: in the left branch the classical approach with calculation of the Euclidean distance is used, and in the right branch the product of the feature descriptions is calculated.
Each of the outputs $h_{l1}$ and $h_{l2}$ passes through the developed filtration unit, which is a sequence of the following layers: a convolution layer with the linear activation function $\sigma_{\mathrm{linear}}$; a normalization layer [15]; an activation layer with the activation function $\sigma_{\mathrm{ReLU}}$. This sequence of layers is repeated 4 times, with the filter size of the convolution layer taking the values 32, 64, 128 and 256. After the filtration unit there follows a fully connected layer of 128 neurons with the activation function $\sigma_{\mathrm{ReLU}}$. The branches are then united by a concatenation layer, and a flattening layer is added to reduce the NN output to a one-dimensional vector. As the NN output, a fully connected layer with one neuron and the activation function $\sigma_{\mathrm{Sigmoid}}$ is used. The result of the proposed deep NN architecture is a vector $\mathbf{p}_{\hat{x}}$ for each example from the most similar cluster $B_v$; the remaining probability vectors are reset. In contrast to the classical Siamese network, the proposed dual approach makes it possible to use a larger number of features, which increased the accuracy of face identification (a simplified code sketch of this dual branch is given after the list of stages).

As the video fragment classifier $f_4$, a new deep NN architecture $f_4^{\mathrm{event}}$ was developed [16, 17], constructed from various combinations of convolution layers and independent recurrent layers (IndRNN) [18] (Figure 3b). The left branch of this NN architecture consists of independent recurrent layers with the activation function $\sigma_{\mathrm{ReLU}}$.
Fig. 3. Artificial neural network architecture (a) to identify faces; (b) to classify video fragments
The architecture consists of two parts with independent inputs. The first part is presented by a subsampling layer with the GlobalMaxPooling operation and a fully connected layer with the activation function $\sigma_{\mathrm{ReLU}}$. The outputs of both branches of the NN architecture are united by a concatenation layer and two consecutive fully connected layers with the activation functions $\sigma_{\mathrm{ReLU}}$ and $\sigma_{\mathrm{Softmax}}$, respectively. The result of the classifier $f_4^{\mathrm{event}}$ is the probability vector $\mathbf{p}_{\hat{x}}$ of size $5 \times 1$. Unlike the classical approaches with LSTM, the application of the IndRNN layers makes it possible to achieve a greater speed.

The classification criterion is determined as $J(f_4) = \max(\mathbf{p}_{\hat{x}})$. If $J(f_4) \geq \varepsilon$, where $\varepsilon$ is the given threshold, then $\beta_k = \arg\max_{k \in 1..c} p_{\hat{x}k}$; otherwise the classification is considered incorrect. In the image classification problem $\varepsilon = 0.99$; for video fragments $\varepsilon = 0.7$.

To improve the recognition accuracy, each classifier estimate is multiplied by a confidence factor. The coefficient is calculated with an S-shaped membership function $\mu(y_{R^2})$ of the object center coordinates. Then the average value of each $k$-th element of the vector $\mathbf{p}_{\hat{x}}$ over $n$ frames is calculated:

$$ \bar{p}_{\hat{x}k} = \left( \sum_{4 \le t \le 60} \mu\!\left(y_{R^2_t}\right) \cdot p^{t}_{\hat{x}k} \right) / n, \qquad (2) $$

where $t$ is the frame number (not less than 4 and not more than 60 frames are considered) and $p^{t}_{\hat{x}}$ is a $c \times 1$ vector containing the classification probabilities for the frame $I_t$.
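A minimal sketch of the confidence-weighted averaging of equation (2) follows. The exact form of the S-shaped membership function is not given in the text, so a logistic function of the normalized object-centre coordinate is assumed here.

```python
# A minimal sketch of equation (2): averaging the per-frame class
# probabilities weighted by an S-shaped membership of the object-centre
# coordinate. The logistic form and its parameters are assumptions.
import numpy as np

def s_membership(y, y0=0.5, k=10.0):
    """Assumed S-shaped (logistic) membership of the normalized centre coordinate y."""
    return 1.0 / (1.0 + np.exp(-k * (np.asarray(y, dtype=float) - y0)))

def averaged_probabilities(frame_probs, centre_ys):
    """frame_probs: (n, c) matrix of per-frame probabilities p_x^t, 4 <= n <= 60;
    centre_ys: (n,) normalized object-centre coordinates; returns the c-vector of eq. (2)."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    weights = s_membership(centre_ys)
    n = frame_probs.shape[0]
    return (weights[:, None] * frame_probs).sum(axis=0) / n
```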
8. According to the object parameters $G_t$, a normal or a critical situation is distinguished according to the equation

$$ s = F_4(G_t) = \begin{cases} 1, & \text{if } G_t \in D^{\mathrm{normal}}, \\ 0, & \text{if } G_t \notin D^{\mathrm{normal}}, \end{cases} \qquad (3) $$

where, as in (1), $s \in \{0, 1\}$ denotes a normal or an abnormal situation, and a deviation of any object parameter from the combinations of values given in $D^{\mathrm{normal}}$ puts the system into a critical situation.
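Before turning to the experiments, a highly simplified Keras sketch in the spirit of the dual face-identification branch of step 7 (Figure 3a) is given below. Only the branch operations (difference and product of the feature descriptions), the layer sequence of the filtration unit and the filter sizes 32, 64, 128 and 256 come from the text; the input shape, the kernel size and the exact placement of the flattening layer are assumptions.

```python
# A simplified sketch in the spirit of the dual network of Figure 3a.
# Input shape, kernel size and Flatten placement are assumptions.
from tensorflow.keras import layers, Model

FEATURE_SHAPE = (4, 3, 1280)  # assumed: MobileNet v2 features without the last two layers

def filtration_unit(t):
    """Conv (linear) -> BatchNorm -> ReLU repeated 4 times, then Dense(128, relu)."""
    for filters in (32, 64, 128, 256):
        t = layers.Conv2D(filters, 3, padding="same", activation=None)(t)
        t = layers.BatchNormalization()(t)
        t = layers.Activation("relu")(t)
    t = layers.Flatten()(t)
    return layers.Dense(128, activation="relu")(t)

x_in   = layers.Input(shape=FEATURE_SHAPE, name="object_features")     # x_hat
ref_in = layers.Input(shape=FEATURE_SHAPE, name="reference_features")  # beta_hat_k

h_l1 = layers.Subtract()([x_in, ref_in])   # left branch: distance-like difference
h_l2 = layers.Multiply()([x_in, ref_in])   # right branch: element-wise product

merged = layers.Concatenate()([filtration_unit(h_l1), filtration_unit(h_l2)])
output = layers.Dense(1, activation="sigmoid")(merged)  # similarity of x_hat to the reference

dual_model = Model(inputs=[x_in, ref_in], outputs=output)
```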
4. Illustrative examples

The proposed approach was implemented in Python using the TensorFlow and Keras libraries. The embedded NVIDIA Jetson TX2 GPU module was used as the computing platform. Figures 4–6 give examples of how the deep neural network method of critical situation recognition works on images taken from the vehicle cameras. Figure 4 shows the segmentation of the UGV front camera image, including difficult conditions (snow, rain).
Fig. 4. The examples of the front camera segmentation
Figure 5 gives an example of face recognition by the internal camera of the car using the original NN architecture based on the dual approach (Figure 3a). The following situations are shown:
• a normal situation – the authorized user is driving and there is an authorized passenger in the car (Fig. 5a);
• a critical situation, safety violation – the passenger attempts to take the driver's seat (Fig. 5b);
• a critical situation, safety violation – a black-list passenger is in the car (Fig. 5c).
The proposed NN architecture for face recognition was tested on the public LFW (Labeled Faces in the Wild) database; the accuracy of the algorithm was 98.2%. The processing time of a frame received from the camera was no more than 0.05 s.
Fig. 5. The examples of face recognition by the internal camera: (a) the authorized user and the authorized passenger; (b) a critical situation - the passenger takes a driver's seat; (c) a critical situation – a black list passenger
Figure 6 gives examples of critical situation recognition by the all-round camera (a fight, a fire, an animal on the road) using the architecture with the independent recurrent layers. This architecture was trained on our own dataset, based on UCF-Crime [19] and changed in the following way: the video clips were cut into 30-second parts with a 10-second scan window, and each video fragment obtained was checked and assigned by an expert to one of 5 classes: 1 – Assault (66/24), 2 – Fire/Explosion (46/16), 3 – Gun (43/16), 4 – Road Accident (41/14), 5 – Normal Event (230/77). The numbers in brackets give the split of each class into training and testing examples. When solving the abnormal situation recognition task with the proposed architecture, an accuracy of up to 80% [16] was achieved, and the processing speed was 1.43 s for a 30-second video fragment. Thus, it may be concluded that the developed computational method can be applied in difficult conditions in RT.
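As an illustration of the dataset preparation described above, the clips could be cut into 30-second fragments with a 10-second step roughly as follows; the frame rate and the representation of a clip as a list of frames are assumptions.

```python
# A minimal sketch of cutting a clip into 30-second fragments taken
# every 10 seconds. The frame rate (fps) is an assumed parameter.
def cut_into_fragments(frames, fps=30, clip_s=30, step_s=10):
    """Return overlapping fragments of clip_s seconds taken every step_s seconds."""
    clip_len, step = clip_s * fps, step_s * fps
    return [frames[start:start + clip_len]
            for start in range(0, max(len(frames) - clip_len + 1, 1), step)]
```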
Fig. 6. The examples of the situations recognition by means of the all-round camera: (a) a fight; (b) a fire; (c) an animal on the road
5. Conclusion

The statement of the problem of pattern recognition for transport systems from a continuous video stream is given. A deep neural network method combining the technologies of computer vision and fuzzy logic is developed to solve the problem of critical situation recognition from video frames. Unlike the known solutions for the detection and recognition of objects and of normal or critical situations, it uses classification with subsequent reinforcement on the basis of several video stream frames and an automatic annotation algorithm. Original deep neural network architectures are proposed that make it possible to recognize patterns in real time with high accuracy at low computing cost: a dual network to identify drivers and passengers by the face image, and a network with independent recurrent layers to classify situations by the video fragment. Software in Python is developed, and a natural experiment is carried out with the video stream arriving from the UGV surveillance cameras.
A scheme of the intelligent distributed city transport safety system is proposed, whose distinctive feature is the use of cameras and vehicle on-board computers united in a single network. The accuracy of the method when solving the face recognition problem was 98.2% on the public database, with a camera frame processing time of no more than 0.05 s. When solving the problem of abnormal situation recognition on the GPU computing module, the accuracy of the method was up to 80% and the processing speed was 1.43 s for a 30-second video fragment. The application of deep neural networks together with modern graphics accelerators makes it possible to achieve good results in real time when solving image recognition and classification problems.

Acknowledgements

The work was supported by the Russian Ministry of Education research project – state task No. 2.1898.2017/4.6.

References

[1] K. Kim, B. Kim, K. Lee, B. Ko and K. Yi. (2017) "Design of Integrated Risk Management-Based Dynamic Driving Control of Automated Vehicles." IEEE Intelligent Transportation Systems Magazine 9(1): 57-73.
[2] S. Kim and W. Liu. (2016) "Cooperative Autonomous Driving: A Mirror Neuron Inspired Intention Awareness and Cooperative Perception Approach." IEEE Intelligent Transportation Systems Magazine 8(3): 23-32.
[3] C. Braunagel, D. Geisler, W. Rosenstiel and E. Kasneci. (2017) "Online Recognition of Driver-Activity Based on Visual Scanpath Classification." IEEE Intelligent Transportation Systems Magazine 9(4): 23-36.
[4] Z. Chen and X. Huang. (2016) "Accurate and Reliable Detection of Traffic Lights Using Multiclass Learning and Multiobject Tracking." IEEE Intelligent Transportation Systems Magazine 8(4): 28-42.
[5] S. Koehler, M. Goldhammer, S. Bauer, S. Zecha, K. Doll, U. Brunsmann and K. Dietmayer. (2013) "Stationary Detection of the Pedestrian's Intention at Intersections." IEEE Intelligent Transportation Systems Magazine 5(4): 87-99.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam. (2017) "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision." [Online]. Available: https://arxiv.org/pdf/1704.04861.pdf [accessed 02.05.2017].
[7] W. Hu, Q. Zhuo, C. Zhang and J. Li. (2017) "Fast Branch Convolutional Neural Network for Traffic Sign Recognition." IEEE Intelligent Transportation Systems Magazine 9(3): 114-126.
[8] O. S. Amosov. (2004) "Markov sequence filtering on the basis of bayesian and neural network approaches and fuzzy logic systems in navigation data processing." Journal of Computer and Systems Sciences International 43(4): 551-559.
[9] A. Kendall, V. Badrinarayanan and R. Cipolla. (2015) "Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding." [Online]. Available: http://arxiv.org/abs/1511.02680 [accessed 01.02.2018].
[10] O. S. Amosov, Y. S. Ivanov and S. V. Zhiganov. (2017) "Human localization in the frame of the video stream using an algorithm based on growing neural gas and fuzzy inference." Computer Optics 41(1): 46-58.
[11] O. S. Amosov, S. G. Baena, Y. S. Ivanov and S. Htike. (2017) "Roadway Gate Automatic Control System with the Use of Fuzzy Inference and Computer Vision Technologies." 2017 12th IEEE Conference on Industrial Electronics and Applications (ICIEA), Siem Reap: 707-712.
[12] "ImageNet." [Online]. Available: http://www.image-net.org/ [accessed 02.06.2017].
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna. (2015) "Rethinking the Inception Architecture for Computer Vision." [Online]. Available: https://arxiv.org/abs/1512.00567 [accessed 01.04.2018].
[14] O. S. Amosov, S. G. Amosova and Y. S. Ivanov. (2018) "Intelligent Access Monitoring and Control System of Physical Person." 2018 IEEE XXI International Conference on Soft Computing and Measurements (SCM): 352-355.
[15] S. Ioffe and C. Szegedy. (2015) "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." [Online]. Available: https://arxiv.org/abs/1502.03167 [accessed 10.05.2018].
[16] O. S. Amosov, Y. S. Ivanov and S. V. Zhiganov. (2018) "Abnormal situations recognition in the continuous video stream of information and telecommunication systems." IOP Conference Series: Earth and Environmental Science, 2018 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon 2018).
[17] O. S. Amosov, Y. S. Ivanov and S. V. Zhiganov. (2018) "Semantic Video Segmentation with Using Ensemble of Particular Classifiers and a Deep Neural Network for Systems of Detecting Abnormal Situations." IT in Industry 6: 14-19.
[18] S. Li, W. Li, C. Cook, C. Zhu and Y. Gao. (2018) "Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN." [Online]. Available: https://arxiv.org/abs/1803.04831 [accessed 10.05.2018].
[19] W. Sultani, C. Chen and M. Shah. (2018) "Real-world Anomaly Detection in Surveillance Videos." [Online]. Available: https://arxiv.org/abs/1801.04264 [accessed 09.06.2018].