Neural networks based visual attention model for surveillance videos

Neurocomputing 149 (2015) 1348–1359


Fahad Fazal Elahi Guraya*, Faouzi Alaya Cheikh
Faculty of Computer Science and Media Technology, Gjøvik University College, P.O. Box 191, N-2802 Gjøvik, Norway

Article history: Received 19 September 2013; Received in revised form 16 July 2014; Accepted 25 August 2014; Communicated by Huiyu Zhou; Available online 23 September 2014

Abstract

In this paper we propose a novel Computational Attention Model (CAM) that fuses bottom-up, top-down and salient-motion visual cues to compute visual salience in surveillance videos. When dealing with a number of visual features/cues in a system, combining or fusing them is always challenging. As there is no commonly agreed natural way of combining the conspicuity maps obtained from different features (face and motion, for example), the challenge is to find the right mix of visual cues so that the resulting salience map is as close as possible to the corresponding gaze map. In the literature, many CAMs have used fixed weights for combining different visual cues. This is computationally attractive but is a very crude way of combining the cues; furthermore, the weights are typically set in an ad hoc fashion. Therefore, in this paper we propose a machine learning approach, using an Artificial Neural Network (ANN), to estimate these weights. The ANN is trained using gaze maps obtained by eye tracking in psycho-physical experiments. These weights are then used to combine the conspicuities of the different visual cues in our CAM, which is later applied to surveillance videos. The proposed model is designed to consider the important visual cues typically present in surveillance videos and to combine their conspicuities via the ANN. The obtained results are encouraging and show a clear improvement over state-of-the-art CAMs.
© 2014 Elsevier B.V. All rights reserved.

Keywords: Visual salience; Video surveillance; Neural network; Attention model; HVS

1. Introduction

The human visual system (HVS) is naturally attracted to salient objects or events in a visual scene. This happens automatically, unconsciously and effortlessly as light propagates from the retina cells to the complex cells of the primary visual cortex. Modeling such a complex mechanism of human vision is a rather challenging task. On the other hand, it is very tempting to do so, as a computational attention model (CAM) can be used in many image and video processing applications such as image and video compression [1–4], perceptual quality evaluation [5,6], and object tracking [7,8], to name a few. Saliency characterizes the capability of a region/object in an image/video scene to attract visual attention [9,10]. Visual attention models can be divided into different categories based on the algorithms used. In [11], the authors classified visual attention models into categories such as Bayesian models, decision-theoretic models, information-theoretic models, graphical models, spectral-analysis models and pattern-classification models. All visual attention models use different features to identify salient regions in a visual scene. These features are generally categorized into two groups: bottom-up and top-down features [12,13].

* Corresponding author.

http://dx.doi.org/10.1016/j.neucom.2014.08.062
0925-2312/© 2014 Elsevier B.V. All rights reserved.

The bottom-up stage of the HVS processes the input scene/image in a parallel and pre-attentive manner and forwards this information to a serial, attentive and computationally intensive top-down stage. In the bottom-up stage, the visual system computes salient regions from low-level features such as color, intensity, and orientation. It has been shown that the HVS combines low-level features in this early stage [10,14]. Saliency computation models based on information theory have successfully modeled human attention from such local features [15,16]. The very first visual attention models proposed were based only on bottom-up features [17,18]. The well-known computational model of bottom-up attention proposed by Itti et al. [17] uses low-level features such as color, intensity and orientation. It was later modified to include more complex features such as motion and flicker [13,19–21]. Top-down mechanisms implement our longer-term cognitive strategies, biasing our attention toward, for example, detecting people or recognizing faces in the surveillance context. It has been observed that the HVS diverts attention to faces 16.6 times more than to other similar regions [22]. Therefore, face detection can significantly improve the performance of any attention model if used in addition to the low-level features employed in salience models such as Itti's [17,23], GBVS [24], or GAFFE [25]. Thus, face conspicuity was added as a top-down visual cue by Sharma et al. [26] to the bottom-up salience computational model in [17], which gave better results. In most surveillance applications,


people represent the most important objects in the scene. Therefore, to obtain efficient visual attention models for surveillance videos, it is natural and intuitive to combine high-level features, such as face and motion, with low-level features into a single CAM. Combining bottom-up and top-down approaches efficiently guides the visual system towards the salient regions, or regions of interest, in a visual scene [27–29].

Motion is the feature that differentiates video from still images. The latter are fully characterized by spatial parameters and pixel color values, while video has time as an extra dimension that introduces a strong relation between the contents of consecutive frames. Differences between the contents of these frames are mainly due to motion, whether motion in the scene or of the camera. Motion has a great influence on identifying salient regions in complex dynamic visual scenes. Most computational models compute attention based only on low-level features and do not take motion into account. A recent comparative study of state-of-the-art salience models shows that only seven out of thirty-five models use video stimuli for attention computation; the rest use only still-image features [30]. In [31], Itti proposed an attention model for dynamic scenes. This model uses color, intensity, orientation, flicker, and motion (CIOFM) features from the video, and gave improved results over the CAM proposed in [17], which was based only on static features. More recently, Itti and Baldi [23] proposed another CAM based on Bayesian surprise. This model uses all the static and dynamic features of the previous model [31]. The Bayesian surprise model [23] is based on Bayes theory and computes the divergence between the posterior and prior probabilities of surprise (event occurrence) in a video to detect salience. Several other attention models have been proposed in the literature; a detailed review of dynamic salience models for videos was recently presented, with a comparative study, in [32].

In this paper, we propose a CAM that follows both top-down and bottom-up approaches, combining low-level as well as high-level features extracted from the visual content of the videos, and compare it to the state-of-the-art CAMs proposed in [17,31,23] in the context of video surveillance. The rest of the paper is organized as follows: in the next section we describe the proposed model in detail. Section 3 describes the experimental setup and test data. Section 4 discusses the obtained results. The last section concludes the paper and points to possible future research directions.

2. Proposed CAM: neural network based salience model (NNBSM)

The proposed CAM is shown in Fig. 1. The model has three components, each computing a specific conspicuity map. The first computes the static salience conspicuity map based on still-image low-level features. The second uses a top-down approach to compute a conspicuity map based on face features, while the last one computes the salient-motion conspicuity map. These three conspicuity maps are combined in a final step using a neural network (NN), as shown in Fig. 1. This NN is trained on gaze maps obtained from psycho-physical experiments. The three components of the CAM are explained in the next three subsections.

2.1. Bottom-up and top-down visual cues in the proposed CAM

Several static or stationary salience models have been proposed in the literature, as already described in Section 1. The most popular was proposed by Itti et al. [17].


Fig. 1. Proposed NNBSM visual salience model.

This salience model is based on bottom-up features and generates its salience map from a combination of color, orientation and intensity conspicuity maps; the final map is computed by averaging the three afore-mentioned conspicuity maps. The model relies on low-level features only and does not consider high-level ones such as faces, text or other familiar objects, although it was shown in [22,33] that such high-level features attract more attention than low-level ones. To overcome this problem, a model that incorporates faces as a high-level visual cue was proposed in [26]. It uses color, intensity, orientation, and face features extracted using the same approach as in [34]. The experimental results in [26] showed that faces should be given approximately four times larger weight than each of the low-level features during the combination step. This model provides an overall 33% performance improvement over other stationary models [26] when faces are present in the scene, which is very likely in the surveillance scenario. The weights in this combination approach are, however, still defined in an empirical way.

In this paper we propose to use both low-level features, such as color, texture and orientation, and high-level ones, namely the face and motion features, as shown in Fig. 1. The proposed model uses an improved salient motion detection algorithm that estimates the motion using an optical flow method and filters it to keep only salient motion. The next subsection discusses the use of salient motion in attention models and explains the adopted salient motion detection model.
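Before turning to the motion cue, here is a minimal sketch of the fixed-weight fusion baseline discussed above, in which normalized conspicuity maps are averaged and the face map is given roughly four times the weight of each low-level map, as reported in [26]. The function and array names are hypothetical and the normalization step is an assumption; this is not the exact implementation of the cited models.

```python
import numpy as np

def normalize_map(m, eps=1e-8):
    """Scale a conspicuity map to the [0, 1] range (assumed pre-processing)."""
    m = m.astype(np.float64)
    return (m - m.min()) / (m.max() - m.min() + eps)

def fixed_weight_fusion(color, intensity, orientation, face, face_weight=4.0):
    """Weighted average of conspicuity maps.

    Low-level maps get weight 1; the face map gets ~4x that weight,
    following the empirical finding reported in [26].
    """
    maps = [normalize_map(m) for m in (color, intensity, orientation, face)]
    weights = np.array([1.0, 1.0, 1.0, face_weight])
    stacked = np.stack(maps, axis=0)                       # shape (4, H, W)
    salience = np.tensordot(weights, stacked, axes=1) / weights.sum()
    return salience                                        # (H, W) fixed-weight salience map
```

The proposed NNBSM replaces such hand-set weights with a combination learned from gaze data, as described in Section 2.3.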



2.2. Motion cues in the proposed CAM

Salient motion is motion that stands out from the other motion in a dynamic scene and grabs the attention of viewers. Salient motion detection is a complex task that depends highly on the specific scene, environment, or scenario. It is also heavily dependent on the application and on the viewers' interests and interpretation. Therefore, plain motion detection methods such as the Lucas and Kanade method [35] are not appropriate for detecting salient motion. In our model we use only salient motion; non-salient motion is filtered out.

To compute the motion feature conspicuity, we propose a modified version of the Tian and Hampapur salient motion model [36]. This model has five main steps: temporal difference between adjacent frames, motion extraction, temporal filtering, region growing and multi-source fusion. In our proposed model, we use only the first three steps: temporal difference, motion extraction and temporal filtering. The region growing and multi-source fusion steps are not included, for two reasons. First, segmentation is computationally complex and gives suboptimal results. Second, the threshold used in the region growing is data dependent and varies between videos. Therefore, we propose to use Gaussian filtering instead of region growing to compute the salient regions around the salient pixels detected in the temporal filtering step. Gaussian filtering has two advantages over region growing: increased processing speed, and improved robustness, since segmentation is highly sensitive to the data-dependent region-growing threshold. The temporal filtering step identifies the pixels that have salient motion. If these pixels were used as seeds for a segmentation step, non-salient regions would more easily be wrongly detected as salient. For example, when a person wearing a dark gray shirt moves close to a wall, segmentation typically includes the shadow region as part of the moving-person region. To avoid this kind of misclassification, we perform Gaussian filtering instead of region growing on the detected salient pixels, which gives us the regions that are most likely salient.

The first step in computing the motion salience is the temporal difference, computed between two consecutive frames $I_t$ and $I_{t+1}$ at times $t$ and $t+1$, respectively. The difference is then thresholded by a value $T_d$ estimated from the image statistics; as in [36], $T_d = 15$. To detect slow-moving objects, [36] proposed to use a weighted mean of the accumulated frame difference and the current frame, with a fixed weight $w_d$, as described in Eqs. (2) and (3). This weight can be changed according to the application requirements:

$$I_{\mathrm{fr\text{-}diff}} = I_{t+1} - I_t \qquad (1)$$

$$I_{\mathrm{diff}}(t+1) = (1 - w_d)\, I_t + w_d\, I_{\mathrm{fr\text{-}diff}} \qquad (2)$$

$$I_{\mathrm{temp\text{-}diff}}(t+1) = \begin{cases} 1 & \text{if } I_{\mathrm{diff}}(t+1) > T_d, \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

The next step consists of computing the motion vectors using the Lucas–Kanade optical flow [35]. For a given set of pixels in one frame, the Lucas–Kanade method finds the corresponding pixels in the next frame based on optical flow theory. Mathematically, for a given point $I_t(x, y)$ in frame $I_t$, it finds the point $I_{t+1}(x + x_\delta, y + y_\delta)$ in frame $I_{t+1}$ that minimizes the error $\epsilon$ given in Eq. (4). In our case, we compute the motion vectors only for those pixels selected in the first step, i.e. in $I_{\mathrm{temp\text{-}diff}}$:

$$\epsilon = \sum_{x}\sum_{y} \left\lVert I_{t+1}(x + x_\delta,\, y + y_\delta) - I_t(x, y) \right\rVert \qquad (4)$$

where $x_\delta$ is the displacement in the $x$-direction and $y_\delta$ is the displacement in the $y$-direction. After computing the motion vectors, we multiply $I_{\mathrm{temp\text{-}diff}}$ with the motion vector's $x$ and $y$ components, $M_x$ and $M_y$ respectively, to filter out the non-salient motion vectors. Finally, temporal filtering is performed on the filtered motion vector components, and a Gaussian filter is applied to the detected pixels. In [37], a study quantified the center bias of observers in free-viewing conditions; the results show that center bias is correlated with photographer bias and is influenced by the viewing strategy at scene onset, orbital reserve, screen center, and motion bias. Thus, if there is more than one moving object in a scene, high priority should be given to the center of each moving object by applying Gaussian filtering to the salient motion points: the farther from the center of the object, the more the salience is reduced, following a Gaussian function. The motion salience maps (MSMs) are filtered by a spatial Gaussian filter with $\sigma = 37$, chosen to approximate the size of the viewing field corresponding to the fovea in the gaze map [38]; our model uses the same filter parameters as [38].
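As a rough illustration of Eqs. (1)–(4) and the Gaussian smoothing step, the sketch below computes a salient-motion conspicuity map for a pair of frames. It is assumption-laden: OpenCV's dense Farnebäck flow stands in for the per-pixel Lucas–Kanade flow of [35], the weighted accumulation of Eq. (2) is omitted, helper names are hypothetical, and parameter values other than T_d = 15 and σ = 37 are illustrative only; the second helper anticipates the directional-consistency check described in Section 2.2.1.

```python
import cv2
import numpy as np

def salient_motion_map(frame_prev, frame_next, T_d=15, sigma=37):
    """Rough salient-motion conspicuity for one frame pair (cf. Eqs. (1)-(4)).

    frame_prev, frame_next: consecutive grayscale frames I_t and I_{t+1} (uint8).
    """
    I_t = frame_prev.astype(np.float32)
    I_t1 = frame_next.astype(np.float32)

    # Eq. (1) and Eq. (3): frame difference thresholded with T_d = 15
    temp_diff = (np.abs(I_t1 - I_t) > T_d).astype(np.float32)

    # Motion extraction: dense Farneback flow as a stand-in for the
    # per-pixel Lucas-Kanade flow that minimizes Eq. (4)
    flow = cv2.calcOpticalFlowFarneback(frame_prev, frame_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    Mx = flow[..., 0] * temp_diff   # keep motion only where the mask fired
    My = flow[..., 1] * temp_diff

    # Gaussian smoothing with sigma = 37 replaces region growing and models
    # the foveal viewing field around each salient-motion pixel [38]
    magnitude = np.sqrt(Mx ** 2 + My ** 2)
    return cv2.GaussianBlur(magnitude, (0, 0), sigma)

def directional_consistency(Mx_history, My_history):
    """Count positive/negative displacements over [t, t+n] (see Section 2.2.1).

    Mx_history, My_history: arrays of shape (n, H, W) with per-frame flow components.
    Pixels moving consistently in one direction (P or N close to n) are treated
    as having non-periodic, hence salient, motion.
    """
    P = (np.sign(Mx_history) > 0).sum(axis=0) + (np.sign(My_history) > 0).sum(axis=0)
    N = (np.sign(Mx_history) < 0).sum(axis=0) + (np.sign(My_history) < 0).sum(axis=0)
    return P, N
```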

2.2.1. Salient motion in surveillance videos

In most surveillance scenarios, an object with salient motion moves in a consistent way, in the same direction, for a considerable period of time $[t, t+n]$, where $n$ is the number of frames. Hence, periodic motion is assumed to be non-salient in surveillance videos; this assumption may not hold for videos other than surveillance videos. A positive count $P$ and a negative count $N$ are computed by counting the number of times a pixel moves in the positive $x$ or $y$ direction, and in the negative $x$ or $y$ direction, over the period $[t, t+n]$. This identifies the pixels with salient motion, since we assume that pixels with non-periodic motion are salient. In the last step of salient motion detection, the salient motion pixels are filtered with a Gaussian filter to simulate the center-bias phenomenon of the HVS [37].

2.3. Neural network for combination of visual cues

CAMs such as [39,40] are inspired by feature integration theory [41] and by the late-fusion model of [17]. All of these models extract features and fuse their conspicuity maps to obtain salience maps. Methods that incorporate machine learning are called learning-based attention models; a few examples can be found in the literature, e.g. [22,42]. However, none of them uses the motion feature, which is of vital importance in surveillance applications. A detailed review of learning-based visual attention models can be found in [43]. The challenging question that all these models have to address is how to combine the different features in a way that mimics the HVS.

In our proposed model, we consider the human visual characteristics in two ways. First, we propose to use a neural network to combine the feature conspicuities obtained for color, intensity, orientation, faces, and salient motion. The reason for using a neural network is to build an attention model that adopts a machine learning technique to combine several features and find the salient regions in surveillance videos. The reason for using a back-propagation neural network (BPNN), which propagates the error backwards to adjust the weights, is that this resembles the feedback mechanism of the HVS. Second, we use gaze-map data obtained from psycho-physical experiments to train and validate the neural network. We use eye-tracking devices to capture the subjective foveated vision, which gives the positions of a subject's gaze on a 2-D plane for a given image or video frame [44,45]. Eye tracking may give different observation points depending on the observer. These observation points are used to create gaze maps, which are then used for training or compared to the computed salience maps for validation. In our CAM we train the neural network by feeding the different feature conspicuities (static and dynamic) as input and the gaze maps as the desired output. Once the neural network is trained, we use it to combine the feature conspicuities computed from test videos. This eliminates the need for fixed combination weights, which are typically defined in an ad hoc manner.

2.3.1. Back propagation neural network

Combining the rather different conspicuity maps of visual cues such as color, intensity, orientation, face and motion into one salience map is a challenging task. In addition, it is important to consider the perception characteristics of the HVS during this combination phase. The best way to do this is to use a machine learning algorithm to train an ANN based on gaze maps.


Neural networks can be trained using the visual feature conspicuities of the test videos as input and the corresponding gaze maps as output. In the training phase, the inputs are the low-level and high-level feature conspicuities together with the motion salience map, while the gaze maps obtained from the psycho-physical experiment, for a given set of surveillance videos, are used as the desired output. In this way, the neural network learns how to mix the different conspicuity maps and generate a salience map.

Every neural network consists of several layers of neurons. A neuron can be considered a summing device that receives inputs and produces an output. More precisely, a neuron also has a weighting mechanism: the inputs of the neuron are multiplied by the corresponding weights and the sum of these products is sent to the output. The neurons are organized in three types of layers: the input layer, the hidden layer(s), and the output layer. Each neural network has to be trained before being used for an actual task. During training, the neurons are presented with inputs and corresponding outputs so that the network can learn and configure the weights of each neuron. In the following, the neuron functionality is explained in greater detail. The basic neuron consists of an activation function $F(WS, T)$, where $WS$ is the weighted sum of the inputs, as shown in Eq. (6), and $T$ is the threshold used in Eq. (5). The weights are initialized to random values and are updated during the training phase:

$$\mathrm{value}_i = \begin{cases} 1 & \text{if } \mathrm{weight}_i \cdot \mathrm{input}_i > T, \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

$$WS = \sum_{i=1}^{n} \mathrm{value}_i \qquad (6)$$

Various functions can be used as the activation function $F$. The sigmoid activation function used in this paper is shown in the following equation:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (7)$$
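As a small illustration, the sketch below implements Eqs. (5)–(7) directly: per-input thresholding, the weighted-sum count, and the sigmoid activation. Variable names mirror the equations, and the default threshold value is an arbitrary placeholder rather than a value from the paper.

```python
import numpy as np

def sigmoid(x):
    """Eq. (7): logistic activation."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron_output(inputs, weights, T=0.5):
    """Basic neuron of Eqs. (5)-(6): threshold each weighted input, sum, activate."""
    values = (weights * inputs > T).astype(float)   # Eq. (5)
    WS = values.sum()                               # Eq. (6)
    return sigmoid(WS)                              # activation F(WS, T)
```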

The most commonly used neural networks are feed-forward neural networks (FFNN). An FFNN has no feedback, i.e. no connections that form loops. Its weights need to be tuned so that it produces the desired outputs for given inputs, which requires a training phase. There are many ways to train an FFNN; the most basic one is the back-propagation neural network (BPNN), shown in Fig. 2. BPNN was proposed in [46,47] and first used in [48]. The defining property of BPNN is that it propagates the output error backwards through the network to update the internal weights of each neuron, using gradient-descent learning: any errors made during the training phase are sent backwards through the network to correct the weights. This process is called backward propagation of errors. Detailed information about back-propagation neural networks can be found in [46,47].
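To make the fusion step concrete, here is a minimal back-propagation sketch in the spirit of Section 2.3: a small network with one hidden layer maps the five per-pixel conspicuity values (color, intensity, orientation, face, motion) to a salience value and is trained against gaze-map values by propagating the squared error backwards. This is an illustrative toy implementation; the layer size, learning rate and training loop are not the exact settings used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyBPNN:
    """Five conspicuity inputs -> one salience output, trained by back-propagation."""

    def __init__(self, n_in=5, n_hidden=8, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (n_in, n_hidden))    # input -> hidden weights
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.5, (n_hidden, 1))       # hidden -> output weights
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1)           # hidden activations
        self.y = sigmoid(self.h @ self.W2 + self.b2)      # predicted salience in [0, 1]
        return self.y

    def train_step(self, X, target):
        """One gradient-descent update on the mean squared error."""
        y = self.forward(X)
        err = y - target.reshape(-1, 1)                   # output error
        # Back-propagate: output-layer delta, then hidden-layer delta
        d_out = err * y * (1 - y)
        d_hid = (d_out @ self.W2.T) * self.h * (1 - self.h)
        self.W2 -= self.lr * self.h.T @ d_out / len(X)
        self.b2 -= self.lr * d_out.mean(axis=0)
        self.W1 -= self.lr * X.T @ d_hid / len(X)
        self.b1 -= self.lr * d_hid.mean(axis=0)
        return float((err ** 2).mean())                   # MSE for monitoring

# Usage sketch: X has one row per pixel with the five conspicuity values,
# gaze holds the corresponding gaze-map values in [0, 1].
# net = TinyBPNN()
# for epoch in range(200):
#     mse = net.train_step(X, gaze)
# salience = net.forward(X).reshape(H, W)   # reshape back to the frame grid
```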

3. Experiments

Two experiments were performed to train and validate the proposed CAM. The first acquires gaze-map data from different subjects using an eye-tracking device, and the second uses the obtained gaze maps to train the neural network and to validate the results. In the rest of this section, we first describe the test data used, then the gaze acquisition, and finally the neural network training and validation experiments. We compare the performance of our proposed model with two state-of-the-art models: the CIOFM model [31] and the model based on Bayesian surprise theory [23].

3.1. Test dataset

We have chosen five different surveillance videos to test the performance of our attention model. Some are indoor and some are outdoor surveillance videos. Videos # 1, 3 and 4 show a scene from a train station where different persons are standing, sitting and moving around. Video # 2 shows a scene from a store where a man enters, picks up a CD box and leaves. As we want to test the performance of our attention model in different kinds of scenarios, we have also chosen one video from Itti's video dataset [50], which is composed of different scenes (indoor, outdoor, news channel, sports match, etc.); it is referred to as video # 5 in this paper. Videos # 1 and 4 are from the iLIDS database of the AVSS 2007 conference. The frame size of videos # 1–4 is 608 × 800. For fast computation of the AUC values, we have downsampled the gaze maps and salience maps by a factor of four. The number of frames of each video is given in Table 1.

Table 1. Surveillance video properties.

Video #   Content type                                           # of frames
1         Train platform with medium motion activity             1251
2         Shop internal surveillance video with one customer     473
3         Train station and platform video with people           1248
4         Train platform with high motion activity               1500
5         Video of mixed scenes from Itti dataset                948

Fig. 2. Architecture of a standard back-propagation artificial neural network [49].



3.2. Experimental setup

The salience model needs to be validated with psycho-physical experiments. A subjective eye-tracking experiment was performed in which the videos were presented to a number of subjects on a CRT monitor. We presented four surveillance videos (videos # 1–4) to 30 subjects and recorded their eye movements using an SMI high-speed eye tracker [51] operating at 500 samples per second. The eye tracks were acquired and gaze maps were computed as an average of all subjects' eye fixations. In the last step of gaze-map production, the maps are filtered with a Gaussian filter to mimic the center-bias mechanism of the HVS [38]. The gaze data for video # 5 is taken from Itti's data repository [50]. For illustration purposes, the gaze map of frame 104 of video # 1 is shown in Fig. 3; gaze maps of other video frames are shown in Figs. 4 and 5.

3.3. Neural network setup

We use the BPNN to combine the conspicuity maps of intensity, orientation, color, face and motion. Since neural networks are computationally expensive, the input conspicuity maps are downsampled from 608 × 800 to 19 × 25 pixels, i.e. 475 pixels per frame. The inputs and outputs of the neural network are normalized to the range [0, 1] with a precision of 0.1. The BPNN has five inputs and one output; the five inputs are the color, intensity, orientation, face and motion conspicuity maps. In the training phase, gaze maps are used as ground truth at the output of the neural network. Once the network is trained, it generates salience maps at its output. The neural network is trained with a subset of the video data, and salience maps are computed for the remaining videos; the video details are given in Table 1.
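The preprocessing described above can be sketched as follows: each conspicuity map and gaze map is downsampled to 19 × 25, scaled to [0, 1] and quantized to a precision of 0.1 before being fed to the BPNN. The resizing routine and interpolation choice are assumptions; only the target size, range and precision come from the text.

```python
import cv2
import numpy as np

def prepare_map(conspicuity, size=(25, 19), precision=0.1):
    """Downsample a 608x800 map to 19x25, normalize to [0, 1], quantize to 0.1 steps."""
    small = cv2.resize(conspicuity.astype(np.float32), size,
                       interpolation=cv2.INTER_AREA)        # size is (width, height)
    rng = small.max() - small.min()
    norm = (small - small.min()) / rng if rng > 0 else np.zeros_like(small)
    return np.round(norm / precision) * precision            # 0.0, 0.1, ..., 1.0

def build_training_rows(color, intensity, orientation, face, motion, gaze):
    """Stack the five prepared conspicuity maps as per-pixel inputs, gaze as target."""
    inputs = np.stack([prepare_map(m).ravel()
                       for m in (color, intensity, orientation, face, motion)], axis=1)
    target = prepare_map(gaze).ravel()
    return inputs, target                                    # shapes (475, 5) and (475,)
```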

The training phase of the neural network requires data for three steps: training, generalization and validation. 60% of the video frames are used for training the neural network, 20% are used as the generalization set and the remaining 20% are used as the validation set. The mean square error (MSE) and the accuracy of the ANN are computed on the validation set. The number of hidden layers is decided experimentally, via trial and error, based on two criteria: (1) the neural network should converge fast, and (2) it should yield increased accuracy and decreased MSE.

Table 2. Neural network MSE and accuracy.

Video #   ACC (%)   MSE     # of valid data samples
1         70        0.033   125,587
2         80        0.028   47,231
3         75        0.039   129,824
4         71        0.035   157,037
5         78        0.018   111,219

Table 3. Mean AUC for salience maps with gaze maps.

CAM             Video # 1   Video # 2   Video # 3   Video # 4   Video # 5
SSM with face   0.51        0.42        0.57        0.58        0.43
CIOFM           0.63        0.52        0.65        0.60        0.61
Surprise        0.76        0.64        0.70        0.71        0.71
NNBSM           0.72        0.68        0.73        0.69        0.70

Note: Bold numbers represent the best-performing method for the corresponding video.

Fig. 3. Saliency maps and gaze map for video # 1, frame # 104.


Fig. 4. Saliency maps and gaze map for video # 2, frame # 150.

Fig. 5. Saliency maps and gaze map for video # 3, frame # 500.


The accuracy and MSE on the validation set are shown in Table 2. The accuracy value varies between 0 and 100: zero means no accuracy, and 100 means that the neural network computes correct results for the entire validation set. The BPNN is trained on all the videos in separate experiments, and the resulting accuracy values are shown in Table 2. Only the valid data samples are used in the training and validation phases of the BPNN.

Table 4. Mean correlation of salience maps with gaze maps.

CAM             Video # 1   Video # 2   Video # 3   Video # 4   Video # 5
SSM with face   0.11        0.02        0.12        0.15        0.05
CIOFM           0.13        0.06        0.18        0.11        0.10
Surprise        0.25        0.10        0.24        0.20        0.18
NNBSM           0.35        0.32        0.32        0.31        0.21

Note: Bold numbers represent the best-performing method for the corresponding video.

Fig. 6. ΔAUC of NNBSM and CIOFM salience maps for video # 2.

Table 5. Performance comparison between NNBSM, CIOFM and Surprise models.

Video #   Frames CIOFM   Frames NNBSM   % improvement NNBSM/CIOFM   Frames Surprise   Frames NNBSM   % improvement NNBSM/Surprise
1         247            958            288                         677               531            -22
2         74             350            373                         129               307            138
3         271            952            251                         418               785            88
4         268            1178           340                         727               730            0
5         238            667            180                         413               486            18

Fig. 8. ΔAUC of NNBSM and Surprise salience maps for video # 2.

Fig. 7. AUC based performance comparison chart of NNBSM and CIOFM for video # 2.


The valid data sets are those that have at least one non-zero input. For example, with five input conspicuity maps (color, intensity, orientation, motion, and face), at least one of them must be non-zero for the set to be considered valid: if all inputs are zero, there is no salient visual cue and hence no need to use this input set. Input values are quantized between 0 and 1 with a precision of 0.1 to increase the training speed of the BPNN. The lower the MSE, the better the results; the MSE is computed between gaze maps and the resulting salience maps, and the MSE after training for the different videos is shown in Table 2.

3.4. Performance assessment

There is no standard performance assessment metric for the validation of CAMs. The area under the curve (AUC) measure has been used to compare CAMs by many researchers [17,26], and cross correlation has also been used for performance measurement of different CAMs [6,52]. We use these two measures to validate our results. The proposed CAM generates a salience map for each video frame, which is compared to the corresponding gaze map to compute the AUC and cross-correlation scores. Both scores compare the two maps to find their similarity. An AUC value of 0.5 corresponds to chance level, so the AUC should be above 0.5 to show some similarity between the two maps, and as close as possible to 1 (the best case) to show that the salience maps obtained with our model mimic the human visual attention gaze maps. Cross correlation compares the two input maps and gives a similarity score between 0 and 1, where 0 means no similarity at all and 1 is a 100% match. The results obtained for both measures are presented in the next section.
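As a hedged illustration of this evaluation protocol, the snippet below scores one frame: the gaze map is binarized to act as ground truth for an ROC/AUC computation over per-pixel salience scores, and the Pearson correlation coefficient is computed between the two maps. The binarization threshold and the use of scikit-learn/scipy are assumptions; the paper does not specify its exact AUC procedure.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def frame_scores(salience_map, gaze_map, gaze_threshold=0.5):
    """AUC and correlation between a predicted salience map and a gaze map."""
    s = salience_map.ravel().astype(np.float64)
    g = gaze_map.ravel().astype(np.float64)

    # Treat strongly fixated pixels as positives for the ROC curve
    labels = (g >= gaze_threshold * g.max()).astype(int)
    auc = roc_auc_score(labels, s) if 0 < labels.sum() < labels.size else 0.5

    # Linear correlation between the two maps
    corr, _ = pearsonr(s, g)
    return auc, corr
```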


Fig. 10. ΔAUC between NNBSM and CIOFM salience maps for video # 5.

Fig. 9. AUC based performance comparison chart of NNBSM and Surprise for video # 2.


Fig. 11. AUC based performance comparison chart of NNBSM and CIOFM for video # 5.

Fig. 12. ΔAUC between NNBSM and Surprise salience maps for video # 5.

4. Experimental results and discussion

To validate the proposed model, we have computed AUC values for each video frame for the state-of-the-art CAMs [17,23,31]. The AUC values are computed for pairs consisting of a salience map obtained from a CAM and a gaze map obtained from the subjective experiments. The salience maps used for comparison are computed with the following CAMs: the static salience model augmented with face detection (SSM with face) [17], the CIOFM CAM (CIOFM) [31], the Surprise CAM (Surprise) [23] and the proposed NNBSM. Table 3 shows the mean AUC values for the five surveillance videos. The salience maps of two frames are shown in Fig. 4 (video # 2, frame # 150) and Fig. 5 (video # 3, frame # 500). Figs. 4 and 5 show that NNBSM combines the feature conspicuities while taking the importance of each feature into account. Table 4 shows the correlation coefficient computed between corresponding salience maps and gaze maps; the mean correlation coefficient is obtained by averaging the correlation coefficient values over the entire video sequence.

For almost all the videos, the AUC plots show that NNBSM overall performs better than the other state-of-the-art models, as shown in Figs. 6, 8, 10 and 12. Our model sometimes performs similarly to the MSM, which means that the neural network has learned that motion is the most significant feature for computing salience. The major issue for a CAM is how to combine the rather different visual feature conspicuities into a single salience map. Indeed, in surveillance videos motion and faces are most of the time more important than stationary low-level features, as shown in the gaze maps in Fig. 4. This is confirmed by the gaze maps acquired from the psycho-physical experiments, which show that motion and faces consistently attract the attention of the viewer. It can be seen in Fig. 3 that NNBSM succeeds in detecting the motion of the people as salient, while the board on the right is not considered salient. Furthermore, in Fig. 3 it is evident that NNBSM produces a saliency map almost identical to the gaze map, while the salience map of SSM (with face) is not similar to its corresponding gaze map. The NNBSM map for video # 1, frame # 104 shows good correlation with the gaze map in Fig. 3: NNBSM detects only the most salient regions, i.e. the person with salient motion, whereas SSM (with face) combines the visual cues using an average function, is not sensitive to variations in a single feature, and is thus not able to detect the salient regions very well. The SSM (with face) map for video # 1, frame # 104 marks many other areas as salient that are not salient according to the corresponding gaze map. The same holds for video # 2, frame # 150 and video # 3, frame # 500, presented in Figs. 4 and 5 respectively. In all of these frames NNBSM detects accurate salient regions, with high correlation to the gaze maps of the corresponding frames, while SSM (with face) highlights many more non-salient regions as salient.

For a deeper analysis, we have compared the AUC values of the CIOFM and NNBSM models frame by frame for videos # 2 and # 5, see Figs. 6 and 10 respectively. Figs. 8 and 12 show the corresponding comparison between the NNBSM and Surprise CAMs for videos # 2 and # 5. The ΔAUC is defined as

$$\Delta AUC = AUC_{\mathrm{CAM1}} - AUC_{\mathrm{CAM2}} \qquad (8)$$

ΔAUC values above 0 show that the proposed method, CAM1 (NNBSM), performs better than CAM2 (the CIOFM or Surprise model). Furthermore, we present comparison charts to better analyze the performance of each pair of attention models under consideration. These charts are shown in Figs. 7, 9, 11, and 13. In these plots, each frame is represented by a single point whose coordinates are the AUC values obtained by the two CAMs being compared.


Fig. 13. AUC based performance comparison chart of NNBSM and Surprise for video # 5.

For example, in Figs. 7 and 11 we plotted the AUC results for NNBSM and CIOFM on the x- and y-axes respectively, and in Figs. 9 and 13 the AUC results for NNBSM and Surprise on the x- and y-axes respectively. There are four quadrants in each of these plots:

- Q1. Bottom-left: AUC for CAM1 < 0.5 and AUC for CAM2 < 0.5.
- Q2. Bottom-right: AUC for CAM1 > 0.5 and AUC for CAM2 < 0.5.
- Q3. Top-left: AUC for CAM1 < 0.5 and AUC for CAM2 > 0.5.
- Q4. Top-right: AUC for CAM1 > 0.5 and AUC for CAM2 > 0.5. This quadrant is further divided into two triangles:
  - Q4.a. Lower triangle: AUC for CAM1 > AUC for CAM2.
  - Q4.b. Upper triangle: AUC for CAM1 < AUC for CAM2.

The number of points $P_r$ in $Q2 \cup Q4.a$ gives the number of frames for which the NNBSM performs better than random and better than the state-of-the-art CAM it is compared to (either CIOFM or Surprise), while the number of points $S_t$ in $Q3 \cup Q4.b$ gives the number of frames where our proposed model fails to outperform the state-of-the-art model in question:

$$P_r(\mathrm{CAM1}/\mathrm{CAM2}) = \text{number of points in } Q2 + \text{number of points in } Q4.a \qquad (9)$$

$$S_t(\mathrm{CAM1}/\mathrm{CAM2}) = \text{number of points in } Q3 + \text{number of points in } Q4.b \qquad (10)$$
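A compact sketch of this frame-wise analysis, under the assumption that the per-frame AUC arrays of the two models are already available, is given below; it computes ΔAUC (Eq. (8)), the quadrant counts P_r and S_t (Eqs. (9) and (10)), and a relative improvement of the form (P_r − S_t)/S_t, which appears consistent with the percentages in Table 5 but is inferred from the table rather than stated explicitly in the paper.

```python
import numpy as np

def compare_cams(auc_cam1, auc_cam2):
    """Frame-wise comparison of two CAMs from their per-frame AUC values."""
    a1 = np.asarray(auc_cam1, dtype=float)
    a2 = np.asarray(auc_cam2, dtype=float)

    delta_auc = a1 - a2                                   # Eq. (8), per frame

    q2 = (a1 > 0.5) & (a2 < 0.5)                          # bottom-right quadrant
    q3 = (a1 < 0.5) & (a2 > 0.5)                          # top-left quadrant
    q4a = (a1 > 0.5) & (a2 > 0.5) & (a1 > a2)             # lower triangle of Q4
    q4b = (a1 > 0.5) & (a2 > 0.5) & (a1 < a2)             # upper triangle of Q4

    P_r = int(q2.sum() + q4a.sum())                       # Eq. (9): CAM1 wins
    S_t = int(q3.sum() + q4b.sum())                       # Eq. (10): CAM1 fails to win

    improvement = 100.0 * (P_r - S_t) / S_t if S_t else float('inf')
    return delta_auc, P_r, S_t, improvement
```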

The performance of different CAMs can be analyzed by counting the number of points in the different quadrants, or by visual inspection of the plots. Table 5 reports, for each video, how often one CAM performs better than the other. We analyze in detail the results obtained for videos # 2 and # 5. Video # 2 is an indoor store surveillance video, where the main activity is carried out by a customer who enters the scene, picks up a CD box, puts it back and leaves. A close look at the performance comparison graphs for NNBSM and CIOFM for video # 2, shown in Figs. 6 and 7, reveals that our model did not perform well for the initial 50–60 frames. These are the frames in which the person is not yet in the video and there is no activity in the scene. NNBSM uses face and motion information, and thus performs better when there is a person moving in the scene. The same holds for the performance comparison graphs of NNBSM and Surprise shown in Figs. 8 and 9, and is confirmed by the results for video # 5 shown in Figs. 10–13. NNBSM performs better than the other CAMs for video # 2, as shown in Table 5: NNBSM performs up to 373% better than CIOFM and up to 138% better than Surprise for this video. The reason our method works better than the Surprise method on video # 2 is that Surprise tries to detect unusual events or sudden changes in the scene, whereas our method focuses on combining different visual cues in a way that mimics the HVS.

To test NNBSM on different kinds of scenes, we have chosen “mtvclip07” from Itti's video dataset [50]. This clip is video # 5 in our test data set and contains 20 different short scenes. Looking at the comparison of NNBSM with CIOFM presented in Figs. 10 and 11, it can clearly be seen that most of the points are above 0; thus NNBSM outperforms the CIOFM model for most of the scenes. CIOFM performs better than NNBSM for scenes 1 (frames 1–40, content: many people walking around), 7 (frames 299–328, content: rugby match), 10 (frames 395–442, content: a person sitting in the car driver's seat with no motion), 18 (frames 795–872, content: many people standing outside a building), and 20 (frames 910–948, content: static text for a news headline). The scenes where CIOFM performs better than NNBSM either have no motion, contain text, or contain very strong motion. In the video segment where an anchor person is reading the news with her face clearly visible, NNBSM performs better than all the other methods, thanks to the inclusion of the face model in NNBSM. Analyzing the Surprise attention model for video # 5, we see that Surprise performs better than CIOFM but is still outperformed by NNBSM; the video segments where Surprise outperforms NNBSM are the same as in the CIOFM case.


5. Conclusions and future directions

A novel CAM (NNBSM) is proposed in this paper for surveillance video analysis, combining different high-level and low-level visual cues using a back-propagation neural network as a machine learning technique to detect salience in dynamic scenes. The proposed model uses an improved salient motion detection algorithm combined with low-level features (color, intensity, orientation) and the face feature, fused via a back-propagation neural network. The performance of the proposed model is tested on a number of video datasets chosen to cover both indoor and outdoor surveillance scenarios. The results show good correlation of the NNBSM output with the gaze maps obtained from psycho-physical experiments. A detailed comparison of the AUC between NNBSM and the state-of-the-art CIOFM and Surprise CAMs was also presented and discussed. These comparisons show that NNBSM performs better than existing state-of-the-art CAMs for surveillance videos: for all the test videos NNBSM performs better than the CIOFM model, and for many of them it also performs better than the Surprise attention model. In future work, we propose to enhance the proposed CAM with familiar-object detection and to test it on other applications such as video compression and event detection. The existing model could also be improved by using the findings of the benchmark in [53] to predict human fixations. Furthermore, other classification methodologies, such as linear Support Vector Machines (SVM) and AdaBoost, could be used to examine which learning method performs best for surveillance applications.

References

[1] N. Ouerhani, J. Bracamonte, H. Hugli, M. Ansorge, F. Pellandini, Adaptive color image compression based on visual attention, in: Proceedings of the International Conference of Image Analysis and Processing (ICIAP), Palermo, Italy, September 2001, pp. 416–421.
[2] M. Hrarti, H. Saadane, M. Larabi, A. Tamtaoui, D. Aboutajdine, Adaptive quantization based on saliency map at frame level of h.264/avc rate control scheme, in: 3rd European Workshop on Visual Information Processing (EUVIP), July 2011, pp. 61–66, doi: http://dx.doi.org/10.1109/EuVIP.2011.6045539.
[3] F.F.E. Guraya, V. Medina, F. Alaya Cheikh, Visual attention based surveillance videos compression, in: Proceedings of the Color and Imaging Conference, Los Angeles, USA, 1 November 2012, pp. 2–8.
[4] S.A. Amirshahi, M.-C. Larabi, Spatial-temporal video quality metric based on an estimation of QoE, in: IEEE Third International Workshop on Quality of Multimedia Experience, Belgique, 2011, pp. 84–89, URL: 〈http://hal.archives-ouvertes.fr/hal-00628605〉.
[5] F.F.E. Guraya, A.S. Imran, Y. Tong, F. Alaya Cheikh, A non-reference perceptual quality metric based on visual attention model for videos, in: Proceedings of the 10th International Conference on Information Sciences Signal Processing and their Applications (ISSPA), Kualalampur, Malaysia, May 2010, pp. 361–364, doi: http://dx.doi.org/10.1109/ISSPA.2010.5605523.
[6] F.F.E. Guraya, F. Alaya Cheikh, A. Tremeau, Y. Tong, H. Konik, Predictive saliency maps for surveillance videos, in: Proceedings of the 9th International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES), Hong Kong, China, IEEE Computer Society, August 2010, pp. 508–513, ISBN 978-0-7695-4110-5, URL: 〈http://www.dx.doi.org/10.1109/DCABES.2010.160〉.
[7] N. Ouerhani, H.
Hugli, A model of dynamic visual attention for object tracking in natural image sequences, Lect. Notes Comput. Sci. 1 (2003) 702–709, URL 〈http://www.springerlink.com/index/00032QE53351A4TV.pdf〉. [8] F. Fraundorfer, H. Bischof, Utilizing saliency operators for image matching, in: Proceedings of the International Workshop on Attention and Performance in Computer Vision (WAPCV), 2003, pp. 17–24. [9] E.B. Titchener, Elementary Psychology of Feeling and Attention, Ayer Co Pub, ISBN 0405051662, 1973 (Original 1908). [10] A.R. Koene, L. Zhaoping, Feature-specific interactions in salience from combined feature contrasts: evidence for a bottom-up saliency map in v1, J. Vis. 7 (7) (2007), pp. 6.1–14, URL: 〈http://discovery.ucl.ac.uk/79037/〉. [11] A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 185–207, ISSN 0162-8828, doi: 〈http://doi.ieeecomputersociety.org/10.1109/TPAMI.2012.89〉. [12] L. Itti, Models of bottom-up and top-down visual attention (Ph.D. thesis), California Institute of Technology, January 2000, URL: 〈http://resolver.caltech. edu/CaltechETD:etd-12022005-103530〉. [13] L. Itti, C. Koch, Computational modelling of visual attention, Nat. Rev. Neurosci. 2 (March (3)) (2001) 194–203.

[14] J. Krummenacher, H.J. Muller, D. Heller, Visual search for dimensionally redundant pop-out targets: evidence for parallel-coactive processing of dimensions, Percept. Psychophys 63 (5) (2001) 901–917, ISSN 0031-5117. [15] T. Kadir, A. Zisserman, M. Brady, An affine invariant salient region detector, Image Rochester NY 3021 (6) (2004) 228–241, URL 〈http://www.springerlink. com/index/AHJRHQDX3UQRVDXU.pdf〉. [16] M. Mancas, D. Unay, B. Gosselin, B. Macq, Computational attention for defect localisation, in: Proceedings of the 5th International Conference on Computer Vision Systems, 2007. [17] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (November (11)) (1998) 1254–1259. [18] C. Kock, S. Ullman, Shifts in selective visual attention: towards the underlying neural circuitry, Hum. Neurobiol. 4 (4) (1985) 219–227. [19] L. Itti, C. Koch, Feature combination strategies for saliency-based visual attention systems, J. Electron. Imaging 10 (March (1)) (2001) 161–169. [20] F. Miau, C. Papageorgiou, L. Itti, Neuromorphic algorithms for computer vision and attention, in: Proceedings of the 46th Annual International Symposium on Optical Science and Technology San Diego, 4479, USA, July 2001, pp. 12–23. [21] V. Navalpakkam, J. Rebesco, L. Itti, Modeling the influence of task on attention, Vis. Res. 45 (2) (2005) 205–231. [22] M. Cerf, E.P. Frady, C. Koch, Faces and text attract gaze independent of the task: experimental data and computer model, J. Vis. 9 (12) (2009) 1–15, ISSN 1534-7362, URL: 〈http://dx.doi.org/10.1167/9.12.10〉. [23] L. Itti, P. Baldi, Bayesian surprise attracts human attention, Vis. Res. 49 (10) (2009) 1295–1306. [24] J. Harel, C. Koch, P. Perona, Graph-based visual saliency. in: Advances in Neural Information Processing Systems, 19, MIT Press, California Institute of Technology, Pasadena, CA 91125, 2007, pp. 545–552 〈http://citeseerx. ist.psu.edu/viewdoc/summary?doi=10.1.1.70.2254〉. [25] U. Rajashekar, I. van der Linde, A.C. Bovik, L.K. Cormack, Gaffe: a gaze-attentive fixation finding engine, IEEE Trans. Image Process. 17 (April (4)) (2008) 564–573. http://dx.doi.org/10.1109/TIP.2008.917218, ISSN 1057-7149. [26] P. Sharma, F. Alaya Cheikh, J.Y. Hardeberg, Face saliency in various human visual saliency models, in: Proceedings of the 6th International Symposium on Image and Signal Processing and Analysis, Salzburg, Austria, September 2009, pp. 327–332. [27] J.M. Wolfe, K.R. Cave, S.L. Franzel, Guided search: an alternative to the feature integration model for visual search, J. Exp. Psychol.: Hum. Percept. Perform. 15 (3) (1989) 419–433. [28] M.W. Jeremy, Visual search in continuous, naturalistic stimuli, Vis. Res. 34 (9) (1994) 1187–1195. http://dx.doi.org/10.1016/0042-6989(94)90300-X, ISSN 0042-6989. [29] M.W. Jeremy, Visual memory: what do you know about what you saw? Curr. Biol. 8 (9) (1998) R303–R304. http://dx.doi.org/10.1016/S0960-9822(98)70192-7, ISSN 0960-9822. [30] A. Borji, D.N. Sihite, L. Itti, Quantitative analysis of human-model agreement in visual saliency modeling: a comparative study, IEEE Trans. Image Process. 22 (1) (2013) 55–69. http://dx.doi.org/10.1109/TIP.2012.2210727, ISSN 10577149. [31] L. Itti, Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes, Vis. Cognit. 12 (2005) 1093–1123. [32] N. Riche, M. Mancas, D. Culibrk, V. Crnojevic, B. Gosselin, T. 
Dutoit, Dynamic saliency models and human attention: a comparative study on videos, in: K.M. Lee, Y. Matsushita, J.M. Rehg, Z. Hu (Eds.), Computer Vision ACCV 2012, Lecture Notes in Computer Science, vol. 7726, Springer, Berlin, Heidelberg, 2013, pp. 586–598, ISBN 978-3-642-37430-2, URL: 〈http://www.dx.doi.org/10.1007/ 978-3-642-37431-9_45〉. [33] R. Desimone, T.D. Albright, C.G. Gross, C. Bruce, Stimulus selective properties of inferior temporal neurons in the macaque, J. Neurosci. 4 (8) (1984) 2051–2062. [34] D. Walther, C. Koch, Modeling attention to salient proto-objects, Neural Netw. 19 (9) (2006) 1395–1407, URL 〈http://www.ncbi.nlm.nih.gov/pubmed/17098563〉. [35] B.D. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI'81), vol. 2, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1981, pp. 674–679, URL: 〈http://dl.acm.org/citation.cfm? id=1623264.1623280〉. [36] Y.L. Tian, A. Hampapur, Robust salient motion detection with complex background for real-time video surveillance, in: Proceedings of the IEEE Workshop on Motion and Video Computing (WACV/MOTION'05) (WACV-MOTION'05), vol. 2, IEEE Computer Society, Washington, DC, USA, 2005, pp. 30–35, ISBN 0-76952271-8-2, URL: 〈http://www.dx.doi.org/10.1109/ACVMOT.2005.106〉. [37] P.-H. Tseng, R. Carmi, I.G.M. Cameron, D.P. Munoz, L. Itti, Quantifying center bias of observers in free viewing of dynamic natural scenes, J. Vis. 9 (7) (2009) 1–16. [38] T. Jost, N. Ouerhani, R. von Wartburg, R. Müri, H. Hügli, Assessing the contribution of color in visual attention, Comput. Vis. Image Understand. 100 (October (1–2)) (2005) 107–123, ISSN 1077-3142 URL 〈http://dx.doi.org/10.1016/j.cviu.2004.10.009〉. [39] J.K. Tsotsos, S.M. Culhane, W.Y. Kei Wai, Y. Lai, N. Davis, F. Nuflo, Modeling visual attention via selective tuning, Artif. Intell., Spec. Vol. Comput. Vis. 78 (1–2) (1995) 507–545. http://dx.doi.org/10.1016/0004-3702(95)00025-9, ISSN 0004-3702 URL 〈http://www.sciencedirect.com/science/article/pii/0004370295000259〉. [40] O. Le Meur, P. Le Callet, D. Barba, D. Thoreau, A coherent computational approach to model bottom-up visual attention, IEEE Trans. Pattern Anal. Mach. Intell. 28 (May (5)) (2006) 802–817. http://dx.doi.org/10.1109/TPAMI.2006.86, ISSN 0162-8828.


[41] A.M. Treisman, G. Gelade, A feature-integration theory of attention, Cognit. Psychol. 12 (January (1)) (1980) 97–136. http://dx.doi.org/10.1016/0010-0285(80)90005-5, ISSN 00100285, URL: 〈http://homepage.psy.utexas.edu/homepage/class/Psy355/Gil den/treisman.pdf〉. [42] T. Judd, K. Ehinger, F. Durand, A. Torralba, Learning to predict where humans look, in: IEEE 12th International Conference on Computer Vision, October 2009, pp. 2106–2113, doi: http://dx.doi.org/10.1109/ICCV.2009.5459462. [43] Qi Zhao, Christof Koch, Learning saliency-based visual attention: a review, Signal. Process. 93 (June (6)) (2013) 1401–1407. http://dx.doi.org/10.1016/j. sigpro.2012.06.014, ISSN 0165-1684.. [44] B.C. Motter, E.J. Belky, The guidance of eye movements during active visual search, Vis. Res. 38 (12) (1998) 1805–1815. http://dx.doi.org/10.1016/S00426989(97)00349-0, ISSN 0042-6989, URL: 〈http://www.sciencedirect.com/ science/article/pii/S0042698997003490〉. [45] J. Shen, E.M. Reingold, M. Pomplun, Distractor ratio influences patterns of eye movements during visual search, Perception 29 (2) (2000) 241–250. [46] A.E. Bryson, Y.C. Ho, Applied Optimal Control: Optimization, Estimation, and Control, Blaisdell Publishing Company Xerox College Publishing, USA, 1969 〈http://books.google.no/books/about/Applied_Optimal_Control.html?id= P4TKxn7qW5kC&redir_esc=y〉. [47] P. Werbos, Beyond Regression: New tools for prediction and analysis in the behavioral sciences (Ph.D. thesis), Harvard University, Cambridge, MA, 1974. [48] S. Russell, P. Norvig, Back-propagation neural network. In: Artificial Intelligence – A Modern Approach. [49] Basic Neural Network Tutorial – theory, 〈http://takinginitiative.net/2008/04/ 03/basic-neural-network-tutorial-theory/〉 (accessed 02.01.12). [50] L. Itti, ilab Neuromorphic Vision c þ þ Toolkit (Invt), URL 〈http://ilab.usc.edu/ toolkit/〉(accessed: 06.11.09). [51] Smi Gaze and Eye Tracking Systems, URL 〈http://www.smivision.com/en/ gaze-and-eye-tracking-systems/home.html〉 (accessed: 10.10.13). [52] F.F.E. Guraya, F. Alaya Cheikh, Predictive visual saliency model for surveillance video, in: proceedings of the 19th European Signal Processing Conference (EUSIPCO), Barcelona, Spain, August 2011, pp. 554–558, URL 〈http://www. eurasip.org/Proceedings/Eusipco/Eusipco2011/papers/1569427583.pdf〉. [53] T. Judd, F. Durand, A. Torralba, A Benchmark of Computational Models of Saliency to Predict Human Fixations, Technical Report, Computer Science and Artificial Intelligence Lab (CSAIL), January 2012, URL 〈http://hdl.handle.net/ 1721.1/68590〉.

Fahad Fazal Elahi Guraya received his doctorate degree in Information Technology from the University of Oslo, Oslo, Norway, in June 2014. His research during his doctoral studies focused on visual saliency models for video surveillance applications. Before his Ph.D., he was granted a European Commission scholarship for a Master's program at three prestigious European universities in the UK, France and Spain, where he completed his Master's in computer vision and robotics; his master's thesis was on depth estimation in underwater environments. He also holds an M.Sc. in Digital Communication and Digital Signal Processing from the Center for Advanced Studies in Engineering, U.E.T.,


Taxila, Pakistan. Before that, he completed a B.S. degree in Computer Science from PICS, UCP, Lahore, Pakistan. He has taught courses on Matlab and computer programming. His research interests include image and video processing and analysis, computer vision, biometrics, pattern recognition and robotics. In these areas, he has published several peer-reviewed journal and conference papers and supervised a couple of M.Sc. thesis projects.

Faouzi Alaya Cheikh received his Dr. Sc. in Information Technology from Tampere University of Technology, Tampere, Finland, in April 2004, where he had worked as a researcher in the Signal Processing Algorithm Group since 1994. Since 2006, he has been affiliated with the Department of Computer Science and Media Technology at Gjøvik University College in Norway, at the rank of Associate Professor. He teaches courses on image and video processing and analysis and media security. His research interests include e-learning, 3D imaging, image and video processing and analysis, video-based navigation, biometrics, pattern recognition and content-based image retrieval. In these areas, he has published over 80 peer-reviewed journal and conference papers, and supervised two Ph.D. and several M.Sc. thesis projects. He is currently the co-supervisor of four Ph.D. students. He has been involved in several European and national projects, among them ESPRIT, NOBLESS, COST 211Quat, and HyPerCept. He is on the editorial boards of the IET Image Processing Journal and the Journal of Advanced Robotics and Automation, the Steering Committee of EUVIP, and the technical committees of several international conferences. He is an expert reviewer for a number of scientific journals and conferences in his field of research. He is a senior member of IEEE and a member of NOBIM and Forskerforbundet (The Norwegian Association of Researchers – NAR).