Reduce False Positive Alerts for Elderly Person Fall Video-Detection Algorithm by convolutional neural network model

Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 148 (2019) 2–11

www.elsevier.com/locate/procedia
Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018)

Amal EL KAID a, Karim BAÏNA a, Jamal BAÏNA b

a Alqualsadi, Rabat IT Center, ENSIAS, Mohammed V University, Rabat, Morocco
b Angel Assistance, Nancy, France

Abstract

Currently, image acquisition and understanding have become a necessity. In fact, they are what allow machines to become one of the most powerful tools. Nowadays, machines that replace humans and experts in making decisions in several areas owe their success to so-called deep learning, a powerful machine learning tool for processing, classification and object recognition tasks. The idea behind deep learning is training machines, adapting their skills and applying them to many tasks. In the same way that the human brain learns, where information entered through our senses (eyes, ...) goes through billions of neurons before being processed into an output, deep learning also takes information as input and then proceeds through several hidden layers before an output layer. For that, we choose to profit from this powerful learning to improve a video fall-detection algorithm which suffers from generating a huge amount of false alarms. We propose in our work to minimize these false alerts using a CNN model that can recognize a person sitting in a wheelchair and eliminate the corresponding alerts. We present in this paper, on the one hand, a survey of the most recent and powerful architectures of CNN; on the other, we propose to add a CNN model into the elderly person fall video-detection algorithm to improve its accuracy.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018).

Keywords: Deep learning; Convolutional neural networks (CNN); Classification; CNN architectures; Reduce false positives

∗ Corresponding author: EL KAID AMAL. Tel.: +2126 2633 3668.
E-mail address: [email protected]

1877-0509 © 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Second International Conference on Intelligent Computing in Data Sciences (ICDS 2018).
10.1016/j.procs.2019.01.004

Amal El Kaid et al. / Procedia Computer Science 148 (2019) 2–11

1. Introduction

Fall detection is a major challenge in the public healthcare domain, especially for the elderly [1] (over 60), who represented 18% of the population in 2015; it is expected that by 2060 this number will have tripled [2]. The physical consequences of a fall vary depending on the person. They may correspond to decreased mobility and increased dependence in activities of daily living. A fall also has psychological consequences for seniors, such as loss of self-confidence, which can accelerate the decline of functional abilities. It also generates a large number of hospitalizations, hip fracture being the main reason. Finally, falls are the leading cause of trauma death in this population [3]. Even when wearing a telealarm on the wrist or around the neck, with a button the senior can press to call for help if necessary, he remains unprotected in the event of a sudden fall followed by a loss of consciousness, in which he cannot trigger an alert to the helpdesk manually. For these reasons, an automatic fall-detection solution is strongly recommended for the elderly. There are some automatic fall-detection offers, such as sensors worn by the patient or environmental sensors that trigger the alert automatically, but these proposals do not allow the help-desk agent to know the severity level of the fall. This problem has been solved by using video sensors that allow visualization and analysis of the person's situation to provide the necessary rescue. This service is offered by Angel Assistance; it installs home protection devices in the riskiest living rooms that automatically send an alert to the assistance center, in the form of an image describing the situation, once a fall is detected [4]. The Angel Assistance algorithm works with a huge amount of data that is associated with the 4V concepts that describe the phenomenon of Big Data: Volume, Velocity, Veracity, and Value.
The first V refers to the vast amounts of data generated every second, and the second one refers to the speed at which new data is generated and moves around. These two Vs are confirmed by observing the amount of data that is handled, which equals HDCameraStream × 24h × 7d × NumberOfResidents. The Veracity concept refers to the messiness or trustworthiness of the data; in this setting, the video detector is coupled with a human operator. Finally, Value is the most important V of Big Data; this notion determines the profit that can be derived from the use of Big Data. For Angel Assistance, the value is the security of the elderly [2]. The core uses of Big Data are Prediction, Personalization, and Prevention, which emphasize in an original way the role played by Big Data in some particularly relevant use cases. Prediction represents how to exploit data to better anticipate and explain phenomena, while prevention uses Big Data to identify and define a potential risk or danger and, if possible, prevent it by proposing anticipatory measures and actions. Personalization is a deep knowledge of an environment that allows one to configure the entire system specifically for a group of people, or even an individual, to propose, suggest and recommend. Though it is described as part of the branch of computer science called artificial intelligence, and more specifically an area called machine learning, it is about applying math to huge quantities of data in order to infer probabilities. For image processing and video analysis, such as recognizing speech or recognizing objects, we instead use deep learning, which means, in simple words, large and deep artificial neural nets. In the third section, we give an overview of this algorithm [5]. The purpose of this article is to find a suitable model to minimize the error rate of the Angel Assistance algorithm for elderly person fall video-detection.
We propose to use a Convolutional Neural Network, a network that has shown its performance in real-world image classification applications. Researchers from many scientific disciplines are designing such networks to solve a variety of problems in pattern recognition, prediction, optimization, and control. Despite these successful applications, they are found mostly in certain well-constrained environments, and none of these networks is flexible enough to perform well outside its domain. The rest of the paper is organized as follows. The first part gives a survey of deep learning and convolutional neural networks: we introduce in section 2 an overview of deep learning, followed by the convolutional neural network in section 3, where we explain the main layers of a CNN model and present the architectures most used in classification today. In the second part, fall-detection systems are introduced, followed by our proposition to reduce false positive alerts for the elderly fall video-detection algorithm by building a CNN model in section 4. Finally, in the last section, we present the obtained results and discuss future directions of research.


2. Overview of deep learning

2.1. Paradigms of classification learning

Machine learning is about using the right features to build the right model to do the right things. With machine learning, we define the outcome and the program learns to get there. There are a lot of machine learning models out there, and one of them is the neural network, which is inspired by the brain to solve practical problems. When we use a neural network that is many layers deep to make a prediction, we call that deep learning. It is a subset of machine learning algorithms that have outperformed almost every other type of model, almost every time, on a huge range of tasks. Recently, numerous deep learning algorithms have been proposed to solve traditional artificial intelligence problems. A deep model has many layers of neural nets with many more nodes in each one, so we need enough data to learn the parameters efficiently, and we need a powerful computer so that learning is fast enough. Machine learning algorithms are classified into three groups: supervised, unsupervised and semi-supervised learning. The difference between them is based on the way they learn: supervised learning uses labeled data to train the algorithm, whereas unsupervised learning looks at inherent similarities between the images and separates them into groups accordingly, assigning its own new label to each group; semi-supervised learning works with some labeled data and a lot of unlabeled data, so it uses a mixture of supervised and unsupervised techniques to make predictions. In fact, in supervised learning, we give the model a labeled data set so it gets feedback on what is correct and what is not; it just has to learn the mapping between the labels and the data. It can then solve a given task, like classifying images into their categories.
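The contrast between the supervised and unsupervised paradigms can be made concrete with a minimal NumPy sketch (a hypothetical toy with 1-D "images" of brightness values, not our wheelchair data): the supervised learner uses the labels y, while the unsupervised one recovers the same two groups without ever seeing them.

```python
import numpy as np

# Toy 1-D "images": brightness values for two classes.
X = np.array([[0.1], [0.2], [0.15], [0.9], [0.8], [0.95]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised learner

# Supervised: learn one centroid per labeled class, then classify by distance.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Unsupervised: 2-means ignores y and discovers the same two groups itself.
c = X[[0, -1]].copy()                      # initial centroid guesses
for _ in range(10):
    assign = np.argmin(np.abs(X - c.T), axis=1)
    c = np.stack([X[assign == k].mean(axis=0) for k in (0, 1)])

print(predict(np.array([0.85])))  # → 1
```

Semi-supervised methods sit between the two: they would use the few labeled points to name the clusters that the unlabeled points reveal.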
In 1998, researchers introduced a model called the convolutional neural network, which learns features by itself and was capable of classifying characters with 99% accuracy, breaking every record. In 2012, it was used by a researcher named Alex Krizhevsky at the yearly ImageNet competition, and it was able to classify thousands of images with a new record accuracy for the time of 85%. Since then, CNNs have been adopted by Google to identify photos in search, and by Facebook for automatic tagging; basically, they are very hot right now. Generally, a CNN consists of three main neural layers: convolutional layers, pooling layers, and fully connected layers. The training of a CNN consists of optimizing the coefficients of the network, starting from a random initialization, to minimize the classification error at the output. The two parts of the CNN are trained simultaneously: the network learns both the coefficients of the convolution kernels, to extract relevant characteristics, and the right combination of these characteristics. In practice, the coefficients of the network are modified so as to correct the classification errors found, according to a gradient descent method. These gradients are back-propagated in the network from the output layer, hence the name back-propagation of the gradient given to the training algorithms of neural networks. Batch training consists of back-propagating the classification error for groups of images. This method is faster than calculating the error over the entire training set at each iteration, and it is more stable than working frame by frame because the error gradients have less variance. Note that too many images per batch can cause memory problems when running the code. A fine understanding of back-propagation algorithms is not necessary for an end user; they differ essentially in their policy on learning rate, that is, the amount of change in the coefficients at each iteration.
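Batch training as just described can be sketched with a minimal one-layer "network" (logistic regression) in plain NumPy — a hypothetical toy, not the Angel Assistance code: the gradient of the classification error is computed over a small group of examples at a time and used to update the coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points, label = 1 when the two features sum past zero.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.5   # coefficients and learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Mini-batch gradient descent: back-propagate the error for groups of
# examples instead of the whole training set (faster) or one frame at a
# time (noisier gradients).
for epoch in range(50):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), 32):           # batch size 32
        batch = idx[start:start + 32]
        p = sigmoid(X[batch] @ w + b)
        err = p - y[batch]                        # dLoss/dlogit for cross-entropy
        w -= lr * X[batch].T @ err / len(batch)   # gradient step on the weights
        b -= lr * err.mean()

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
```

The batch size (32 here) trades speed against the memory problems mentioned above; the learning rate lr is exactly the "policy on learning speed" that distinguishes the various optimizers.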
The Batch Normalization technique can greatly improve convergence during training. It consists of normalizing, in mean and variance, the outputs of the layers of the network [6].

2.2. Convolutional Neural Network

2.2.1. The layers of a CNN model

A Convolutional Neural Network is a special kind of multi-layer neural network, designed to recognize visual patterns directly from pixel images with minimal preprocessing. It uses supervised learning, and the network is trained by backpropagation. All we need to do in the model architecture is define the whole sequence of layers, each of which converts one activation volume to another through a differentiable function. In fact, there are three types of crucial layers that we must find in a CNN: the Convolutional Layer, the Pooling Layer, and the Fully-Connected Layer (figure 1).

Fig. 1. Typical CNN architecture

The first layer is the convolution layer, which contains a set of feature maps produced using different filters (such as blur, accent, edge-enhancement, border-detection and embossing filters) that allow extracting features from the image. The feature maps in this layer use different sets of weights and biases, thereby extracting different types of local features [7]. These feature maps carry the spatial structure information by keeping high values wherever the feature detectors applied to the input image respond; basically, the high numbers in a feature map represent a specific feature found in the input image. These features are then given to the next hidden layer to extract still more complex features. The convolutions are followed by non-linear detectors, the Rectified Linear Unit function (ReLU), an activation function defined by max(0, x), where x is an output value of the immediately preceding convolution layer; it sets all negative values to zero and keeps all other values unchanged. The rectifier activation function is used instead of a linear activation function to add non-linearity to the network; otherwise, the network would only ever be able to compute a linear function. ReLU is preferable to other functions because it trains the neural network several times faster without a significant penalty in generalization precision, and it converges much faster than Sigmoid or Tanh in practice. Indeed, our neural network should recognize all images that contain a person sitting in a wheelchair, even if the person occupies different parts of the image, or even if the images have a different texture. As LeCun et al. said in their 1998 paper, once a feature has been detected, its exact location becomes less important; a simple way to exploit this is to reduce the spatial resolution of the feature map [11]. So the characteristics that our CNN learns should not be based on exact location, shape or texture, because otherwise it could not detect other images with a person sitting in a chair.
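The convolution-plus-ReLU step can be made concrete with a small NumPy sketch. The kernel here is an illustrative hand-written vertical-edge detector, not one learned by training; in a real CNN the kernel coefficients are exactly what back-propagation optimizes.

```python
import numpy as np

# A 5×5 toy grayscale image with a bright vertical edge down the middle.
img = np.zeros((5, 5))
img[:, 2:] = 1.0

# A hypothetical 3×3 vertical-edge detection kernel (one convolution filter).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

def conv2d_valid(x, k):
    """Valid cross-correlation, as computed in CNN convolution layers."""
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + k.shape[0], j:j + k.shape[1]] * k).sum()
    return out

fmap = conv2d_valid(img, kernel)   # feature map: high values near the edge
relu = np.maximum(fmap, 0.0)       # ReLU keeps positives, zeroes negatives
```

The high numbers in `fmap` appear exactly where the detector's pattern (dark-to-bright transition) occurs in the input, which is the "special structure" the text refers to.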
This problem is solved by adding pooling layers to our model after the convolutional layers, to simplify the information in the feature maps. The most common non-linear function used to implement this layer is max-pooling, which takes the maximum value of each region. This grouping of features also allows us to prevent overfitting by reducing the number of parameters that go to the last layers of the neural network. After getting the output of this layer, it passes through a flattening step, which simply consists of putting all the high numbers of the feature maps into one single vector, keeping the spatial structure information of the object in this one huge vector. In the Fully Connected Layer, which is the last step, the neurons have full connections to all activations in the previous layer. By calculating a matrix multiplication of the last neurons with the corresponding weights, we arrive at a probability for each class, which allows the model to classify the input image.

2.2.2. CNN Architectures

In this subsection, we discuss some specific kinds of CNN architectures that are used today in cutting-edge applications and research. The first successful application of ConvNets was LeNet [7], which could be used to read handwritten numbers. Since then, advances in computing and powerful GPUs have allowed researchers to improve neural network models. In 2010 the Stanford Vision Lab released ImageNet, a data set of 14 million images with labels detailing their contents; it has become one of the research world's standards for comparing CNN models, so the winners of the ImageNet classification benchmarks have been the most used architectures today. In chronological order: AlexNet [8], mainly used for object detection, then ZFNet [9], VGGNet [10], GoogleNet [11] and ResNet, which were put forward based on AlexNet. LeNet was a convolutional neural network proposed by Yann LeCun et al.
in the 90s, and it was one of the first instantiations of a convolutional neural network successfully used in practice. The LeNet-5 architecture has six hidden layers between the image layer and the output layer. It takes an input image of size 32 × 32 pixels and filters it in the first convolution layer with 6 kernels to get 6 feature maps. Each unit in each feature map is connected to a 5 × 5 neighborhood at an identical location in the input image, at stride one, so the size of each feature map is 28 × 28. A max function is used to keep the highest number of each sub-table of 2 × 2 dimension for these feature maps, to obtain 6 pooled feature maps of size 14 × 14 in the first pooling layer. The third layer in LeNet-5 is the second convolution layer, with 16 feature maps of 10 × 10 obtained by applying kernels of size 5 × 5 on each pooled feature map. It is followed by the second pooling layer, which summarizes subregions in the same way as the first one. The next layer is a convolutional layer, instead of a fully-connected layer [7], with 120 feature maps; it applies kernels of size 5 × 5 to all 16 of the last pooled feature maps. As the size of these pooled feature maps is also 5 × 5, the size of the output of this convolution layer is 1 × 1. The last layer is the fully connected layer, which contains 84 units; then we move to the output layer, where we specify the number of classes of our problem. This fairly simple architecture was very successfully applied to digit recognition.

AlexNet, proposed by Alex Krizhevsky in 2012, was the first large-scale convolutional neural network that was able to do well on the ImageNet classification task. The network had a very similar architecture to LeNet but was deeper, with more filters per layer and with stacked convolutional layers. It contained 8 layers: the first 5 were convolutional layers followed by 3 fully connected layers, and it used ReLU (Rectified Linear Unit) for the non-linear part, instead of the Tanh or Sigmoid functions that were the earlier standard for traditional neural networks. The size of the input images is 224 × 224 × 3. In the first convolutional layer, 96 feature maps are produced using kernels of size 11 × 11 at stride 4, with the ReLU activation function, which is used by all of the convolutional layers. Then max pooling is used to summarize the feature maps over 3 × 3 regions at stride 2. The output of this pooling layer is the input of the second convolutional layer, which uses 256 filters, each of size 5 × 5 at stride 1.
Then we move to the pooling layer, keeping the same parameters as the first one. Similarly, in the third convolutional layer, we use 384 kernels of size 3 × 3 at stride 1, followed by another convolutional layer that also produces 384 feature maps using 3 × 3 kernels at stride 1. In the last convolution layer, 256 feature maps are produced, which are summarized in a third pooling layer using the same parameters as the previous ones. Its outputs are put into one huge vector in a flattening step before moving to the fully connected layers. The first dense layer takes the vector produced by the flatten layer and creates 4096 hidden units using the ReLU activation function, as does the second FC layer. To avoid overfitting, the authors introduced the dropout technique, which temporarily removes some units from the network, along with all their incoming and outgoing connections; the AlexNet architecture uses a dropout layer with a retain probability of 0.5 after every fully connected layer. The last dense layer is the output layer that gives us the final predictions; it creates N units corresponding to the number of classes (or labels) of the data.

In 2013 the ImageNet challenge was won by ZFNet (Zeiler & Fergus Net) [9], which mostly improved hyper-parameters over AlexNet. It maintains the same number of layers (8) and the same general structure, making a few changes such as the stride size and different numbers of filters; after tuning these parameters further, the authors were able to improve the error rate, but it is still basically the same idea. Zeiler et al. proposed reducing the first-layer filter size from 11 × 11 to 7 × 7 and making the stride of the convolution 2, rather than 4. This new architecture retains much more information in the first- and second-layer features.
Instead of 384, 384, 256 filters in the last three CONV layers, they use 512, 1024, 512 filters. In 2014, a couple of architectures appeared that were significantly different and made another jump in performance, and the main difference was much deeper networks: GoogleNet from Google with 22 layers, and VGGNet from Oxford with 11 to 19 layers. The VGG network is the idea of much deeper networks with much smaller filters. Simonyan and Zisserman [10] increased the number of layers from eight in AlexNet to 16 or 19 in VGGNet, but they used very small filters, only 3 × 3 convolutions all the way through, to reduce the number of parameters, and they kept this very simple structure of 3 × 3 CONV layers with periodic pooling throughout the network. During training, the input to the network is fixed to 224 × 224 × 3. The RGB image is passed through a stack of convolutional layers with a very small receptive field of 3 × 3, with the stride fixed to 1 pixel and spatial padding of 1 pixel. The first stack of VGG-16 contains two CONV layers of 64 filters each; the second also contains two CONV layers, this time with 128 filters. In the third stack they used three CONV layers with 256 filters, and likewise in the last two stacks with 512 filters. After each stack, they applied a max-pooling layer with 2 × 2 input fields and 2 × 2 strides. Then they used three fully-connected layers: the first two have 4096 channels, which gives a good feature representation, and the last one contains 1000 channels, representing the number of classes in the ImageNet database. So VGG-16 consists of 16 layers: 13 convolutional layers and 3 fully connected layers. VGG-19 has a very similar architecture with a few more CONV layers, which works slightly better in practice but uses more memory. VGGNet is a very simple, elegant network architecture, and it was able to get a 7.3% top-five error on the ImageNet challenge.

As for GoogleNet, it is a much deeper network with 22 layers plus 5 pooling layers, but one of the main insights and special things about this architecture is that it really looked at the problem of computational efficiency and tried to design a network architecture that was very efficient in the amount of computing, by using the inception module. The main idea of the Inception module [12] is to design a good local network topology and then stack these modules on top of each other. The inspiration for this architecture comes from the idea that you need to decide what type of convolution to apply at each layer, while all of them are actually beneficial to the modeling power of the network; so they apply several different kinds of filter operations in parallel on top of the same input coming into the same layer. Instead of a single convolution, they have a composition of filter operations: a 1 × 1 convolution, then 3 × 3, then 5 × 5, and a pooling operation of size 3 × 3. At the top, you simply concatenate the resulting feature maps of each of them before going to the next layer. The idea is to use all of the convolution types in each layer and let the model decide and pick what the best choice is. But this naive inception module causes very expensive computation. One of the key insights GoogleNet used to address this was bottleneck layers, projecting the feature maps to a lower dimension before the convolutional operations, i.e. before the expensive layers: the authors added 1 × 1 convolutional layers before the large convolutions (the 3 × 3 and 5 × 5 convolutions, which are considered large) and after a max pooling layer.
This architecture allows the model to recover both local features, via smaller convolutions, and highly abstracted features, with larger convolutions [13]. All the convolutions in the GoogleNet architecture, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in the network is 224 × 224, taking RGB color channels with mean subtraction. The authors started with a stem network of six sequences of layers, then used multiple stacked inception modules, and terminated their architecture with a classifier output. They removed the fully connected layers from their network, which saved a lot of parameters: in total there are only five million parameters, twelve times fewer than AlexNet, and it got a 6.7% top-five error. Moreover, GoogleNet used two little mini-networks, each with an average pooling layer, a 1 × 1 convolution layer, and a couple of fully connected layers going to a SoftMax; these present auxiliary classification outputs used to inject the gradient at lower layers, so the training classification loss is actually used in three separate places. That is what makes such a deep network trainable.
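The channel bookkeeping of an inception module can be sketched at shape level in NumPy (this is not a trainable implementation: spatial filtering is elided and every branch is modeled as a 1 × 1 projection; the branch widths follow the 64/128/32/32 split used early in GoogleNet). It shows the two key ideas above: 1 × 1 bottlenecks shrink the channel count before the expensive branches, and the branch outputs are concatenated along the channel axis.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """1 × 1 convolution: a per-pixel linear projection over channels, + ReLU."""
    w = rng.normal(size=(x.shape[-1], out_ch))
    return np.maximum(np.tensordot(x, w, axes=1), 0.0)

def inception_like(x, ch1, ch3, ch5, ch_pool):
    """Naive inception bookkeeping with 1×1 bottlenecks before the 'large'
    branches; only the channel arithmetic of the module is modeled."""
    b1 = conv1x1(x, ch1)                     # 1×1 branch
    b3 = conv1x1(conv1x1(x, ch3 // 2), ch3)  # bottleneck, then the '3×3' branch
    b5 = conv1x1(conv1x1(x, ch5 // 2), ch5)  # bottleneck, then the '5×5' branch
    bp = conv1x1(x, ch_pool)                 # projection after the pool branch
    return np.concatenate([b1, b3, b5, bp], axis=-1)

x = rng.normal(size=(28, 28, 192))           # an early GoogleNet-sized volume
y = inception_like(x, 64, 128, 32, 32)
print(y.shape)  # (28, 28, 256)
```

The bottlenecks mean the 3 × 3 and 5 × 5 branches operate on far fewer input channels than 192, which is where the computational savings come from.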

3. Reduce false positives of the Angel Assistance Algorithm

For the elderly, falls are the main risk at home; they can lead to critical head or hip injuries and cause loss of independence. The wait for first aid can be long and painful and can aggravate the consequences, whereas rapid detection allows immediate delivery of medical service to the wounded. This goal is achieved by installing a protection device at home that immediately detects the fall and automatically sends an alert to the assistance center to trigger the necessary humanitarian assistance, depending on the severity of the fall; it can mean the difference between life and death for those who need it. Although automatic fall detection is an excellent system, it still does not detect 100% of falls. It is based on ambient camera videos, and the corresponding methods generally require a large data set to train a classifier and are likely to be influenced by the quality of the image. The Angel Assistance system is one of these systems; although it can detect hundreds of falls and risky situations and assist them in less than 5 minutes, it launches false alarms to the center. To minimize these false alarms, we propose in this paper to add a box to the system that eliminates the false positives containing a person sitting in a wheelchair, using deep learning models. If an image has been detected as a fall by the Angel Assistance classifier while our model detects that the image shows a person sitting in his wheelchair, then we filter the image as a false alarm and do not send it to the back-office alert workflow management system. The diagram in figure 2 explains the main steps of our contribution. The goal of this article is to improve the fall detection performance of the Angel Assistance system by minimizing the false positives sent.
Indeed, by observing these false positive images, we found that most of them contain a person sitting in a wheelchair or using a walker, because people who move a lot and have reduced mobility are the ones who cause the most alerts and are most likely to fall. So, we propose to add a box that can filter images, as shown in figure 3.

Fig. 2. Workflow Diagram: Our contribution

Fig. 3. Principle of the Real-Time Algorithm

If the image contains a person in a wheelchair, the algorithm cancels the alert; otherwise, it sends it to the center. For this purpose, we chose to use a CNN model trained on a set of images annotated by hand, which manages to process a new image and classify it in a very short execution time.
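The filtering rule just described can be sketched as a small decision function. The names `filter_alert` and `wheelchair_score` are illustrative, not from the system's actual API; the 0.5 threshold corresponds to the sigmoid decision boundary used by our classifier.

```python
def filter_alert(fall_detected: bool, wheelchair_score: float,
                 threshold: float = 0.5) -> bool:
    """Return True if the alert should be forwarded to the assistance center.

    fall_detected    -- verdict of the Angel Assistance fall classifier
    wheelchair_score -- sigmoid output of the CNN wheelchair classifier
    """
    if not fall_detected:
        return False  # no fall detected: nothing to send
    # cancel the alert when the CNN sees a person sitting in a wheelchair
    return wheelchair_score < threshold

# A detected "fall" that the CNN scores as wheelchair (0.9) is filtered out,
# while one with a low wheelchair score (0.2) is forwarded to the center.
assert filter_alert(True, 0.9) is False
assert filter_alert(True, 0.2) is True
```

The box therefore sits strictly downstream of the existing fall detector: it can only cancel alerts, never create new ones, so it cannot lower the system's recall on true falls.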


Fig. 4. (a) ReLU graph; (b) Sigmoid graph.

We chose a simple model to create this classifier. Before applying the CNN model to our images, we resize them to 64 × 64 pixels. The model has two convolutional layers, each producing 32 feature maps using kernels of size 3 × 3; each convolutional layer is followed by a pooling layer that summarizes these feature maps by keeping only the highest value in each 2 × 2 sub-region. Then we add a classic fully connected neural network that classifies the images, with 128 nodes in the hidden layer, followed by the output layer where we obtain the prediction for the input image. To bring non-linearity to our neural network model, we used the ReLU function in the convolutional and dense layers. The function returns 0 for any negative input, but for any positive value x it returns that value back, so it can be written as: ϕ(x) = max(0, x). In the output layer, we applied a sigmoid activation function, because we have two classes (person sitting in a wheelchair or not), to obtain the final prediction. It is given by: ϕ(x) = 1 / (1 + e^(−x)). The sigmoid function has an S-shaped curve, as shown in figure 4. Therefore, if the output of our CNN model is greater than 0.5 we classify the result as 1, and if it is less than 0.5 we classify it as 0. This is how we predict the proper class of the image and decide whether to send the alert or cancel it.

In order to adjust the CNN model to the images and to extract more information from the training dataset, we apply an image augmentation process, which consists of preprocessing our images and expanding the dataset via a number of random transformations. This helps prevent overfitting and helps the model generalize better. Overfitting is the phenomenon of obtaining excellent accuracy on the training data but much lower accuracy on the test data; one of the situations that leads to overfitting is having too little data to train the CNN model. In this case, the model finds correlations in some observations of the training set but fails to generalize these correlations to new observations. Since we have only a small number of images to train our model, we would be far from an excellent performance result without this technique, so we increase the number of images by applying it. In this way, we created a classifier that can distinguish images of a "person sitting in a wheelchair", which helps the Angel Assistance System recognize some false positives and reduce its error rate.
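The architecture and augmentation steps above can be sketched in Keras. This is a sketch under the hyperparameters stated in the text; the optimizer, loss, and the specific augmentation transformations are common Keras choices and are assumptions, not the authors' exact configuration.

```python
from tensorflow.keras import Sequential, layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Two convolution + pooling stages, then a 128-node dense layer and a
# sigmoid output, for 64 x 64 RGB inputs.
model = Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # 32 feature maps, 3x3 kernels
    layers.MaxPooling2D((2, 2)),                    # keep the max of each 2x2 block
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),           # hidden fully connected layer
    layers.Dense(1, activation="sigmoid"),          # wheelchair vs. not, threshold 0.5
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Image augmentation: random transformations to stretch the small training set
augmenter = ImageDataGenerator(rescale=1.0 / 255, shear_range=0.2,
                               zoom_range=0.2, horizontal_flip=True)
```

Training would then feed the augmented stream to the model, e.g. with `model.fit(augmenter.flow_from_directory(...))` over a directory of hand-annotated images.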

4. Results and Discussion

We tested our classifier on four false positives (figure 5) that it had never seen before, and as expected it was able to correctly classify them and predict their true labels. This shows the ability of convolutional neural networks to improve Angel Assistance. 1

1 People's identities were hidden in the images to protect personal data, as required by the General Data Protection Regulation.


Fig. 5. Prediction of images that were not in our training or testing datasets

As observed, this model allows the classification of a new false positive from the Angel-Assistance System into two classes. We measured the accuracy on more than 200 images and obtained 98 percent. As discussed, the elderly-person fall video-detection algorithm considers some images of "people sitting in a wheelchair" as falls, which increases the error rate. Convolutional neural network models, however, have shown their capacity to classify images well in many fields, which is why they have attracted extensive attention and are widely applied to real-world applications. In our case, we benefit from them to improve the accuracy of this important algorithm, which addresses a significant social problem in video surveillance research, by eliminating the images predicted in the class cited above. In the same way, we could distinguish the false positives that contain people with walkers, or even images of empty rooms, in order to eliminate them as well.


5. Conclusion

It is true that we were able to minimize false positives using this CNN model, since we managed to eliminate 98% of the images of a person sitting in a wheelchair, a class that represents roughly 17% of false alerts. However, a large number of false positives remain in the form of empty-room images, and none of the CNN models we tested could distinguish them because of the complexity of these images. We should therefore look for another approach to further improve the accuracy of the fall detection algorithm.

Acknowledgements

The authors would like to acknowledge Angel Assistance for sharing with us the images used to evaluate our work.

References

[1] Muhammad Mubashir, Ling Shao, Luke Seed. (2011) "A survey on fall detection: Principles and approaches" Neurocomputing 100 (2): 144-152.
[2] Prof. Karim Baïna, "Seminar: Big Data Catalyst for Digital Transformation of Information Systems", Thursday 07 July 2016, at ISIMA, France.
[3] Philippe Douste-Blazy, Catherine Vautrin. "Prévention des chutes chez les personnes âgées à domicile", Secrétaire d'État aux Personnes âgées, p. 18.
[4] www.angel-assistance.fr
[5] Mayer-Schönberger, Viktor, and Kenneth Cukier. (2014) "Big Data: A Revolution That Will Transform How We Live, Work, and Think", pp. 11-12.
[6] Yu, Xinguo. (2008) "Approaches and principles of fall detection for elderly and patient." In e-health Networking, Applications and Services, 2008. HealthCom 2008. 10th International Conference on, pp. 42-47.
[7] LeCun, Yann, Patrick Haffner, Yoshua Bengio, and Léon Bottou. (1999) "Gradient-based learning for object detection, segmentation and recognition." Proc. IEEE, pp. 2278-2324.
[8] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. (2012) "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105.
[9] Zeiler, M.D., and R. Fergus.
(2014) "Visualizing and understanding convolutional networks." European Conference on Computer Vision, Zurich, pp. 818-833.
[10] Simonyan, Karen, and Andrew Zisserman. (2014) "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556.
[11] Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, and A. Rabinovich. (2015) "Going deeper with convolutions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9.
[12] Mulc, Tommy. (2016) "Inception modules: explained and implemented."
[13] Karthick Arya, Jayesh Kudase. (2017) "Convolutional Neural Networks based Sign Language Recognition", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 5, Issue
[14] Guo, Yanming, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, and Michael S. Lew. (2016) "Deep learning for visual understanding: A review." Neurocomputing 187: 27-48.
[15] Lu, Dengsheng, and Qihao Weng. (2007) "A survey of image classification methods and techniques for improving classification performance." International Journal of Remote Sensing 28, no. 5: 823-870.
[16] Dumoulin, Vincent, and Francesco Visin. (2016) "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285.
[17] Singh, Rajat Vikram. "ImageNet Winning CNN Architectures: A Review."
[18] Burel, Gilles. (1991) "Réseaux de neurones en traitement d'images: des modèles théoriques aux applications industrielles." PhD diss., Université de Bretagne Occidentale, Brest.
[19] Julien Krywyk, Pierre-Alain Jachiet. (2016) "Classification d'images: les réseaux de neurones convolutifs en toute simplicité".
[20] Chollet, François. (2016) "Building powerful image classification models using very little data." Retrieved December 13, 2016.