Pattern Recognition Letters 34 (2013) 713–722
Classification of multiscale spatiotemporal energy features for video segmentation and dynamic objects prioritisation

Anna Belardinelli a,*, Andrea Carbone b, Werner X. Schneider c,d

a Computer Science Department, University of Tübingen, Germany
b ISIR – Institut des Systèmes Intelligents et de Robotique, UPMC, Paris, France
c CITEC – Cognitive Interaction Technology Excellence Center, Bielefeld University, Germany
d Neurocognitive Psychology, Department of Psychology, Bielefeld University, Germany

* Corresponding author. E-mail addresses: [email protected] (A. Belardinelli), [email protected] (A. Carbone), [email protected] (W.X. Schneider).
Article history: Available online 14 September 2012
Keywords: Video segmentation; Spatiotemporal features; Visual attention; Object-based saliency
Abstract

High level visual cognitive abilities such as scene understanding and behavioural analysis are modulated by attentive selective processes. These in turn rely on pre-attentive operations delivering perceptual organisation of the visual input and enabling the extraction of meaningful ''chunks'' of information. Specifically, the extraction and prioritisation of moving objects is a crucial step in the processing of dynamic scenes. Motion is of course a powerful cue for grouping regions and segregating objects, but not all kinds of motion are equally meaningful or should be equally attended. On a coarse level, most interesting moving objects are associated with coherent motion, reflecting our sensitivity to biological motion. On the other hand, attention operates on a higher level, prioritising what moves differently with respect to both its surroundings and the global scene. In this paper, we propose how a qualitative segmentation of multiscale spatiotemporal energy features according to their frequency spectrum distribution can be used to pre-attentively extract regions of interest. We also show that discrimination boundaries between classes in the segmentation phase can be learned in an automatic and efficient way by a Support Vector Machine classifier in a multi-class implementation. Motion-related features are shown to best predict human fixations on an extensive dataset. The model generalises well to datasets other than that used for training, if scale is taken into account in the feature extraction step. Regions labelled as coherently moving are clustered into moving object files, described by the magnitude and phase of the pooled motion energy. The method succeeds in extracting meaningful moving objects from the background and identifying other less interesting motion patterns. A saliency function is finally computed on an object basis, instead of on a pixel basis as in most current approaches. The same features are thus used for segmentation and selective attention and can be further used for recognition and scene interpretation.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

The interpretation of dynamic scenes poses the same challenges (image segmentation and categorisation, object extraction and recognition) and partly relies on the same principles (feature filtering, edge extraction, and perceptual organisation) as the understanding of still images, but it adds a further dimension to the analysis, namely the temporal one. For biological beings, there is actually no aspect of perception, from sensing to conscious recognition, that does not unfold in time. Motion perception is particularly critical in any case, since it acts as a warning system and helps in recovering 3D structure (structure from motion), segmenting and grouping objects and, of course, tracking targets. We can cope with a wide range of motion phenomena: translation motion, multiple
motions, apparent motion, optical flow, stabilisation and so on. This ability is deeply rooted in the neurophysiology of the brain of living beings; more specifically, it is known that cells in different areas of the primate cortex are devoted to the detection of different motion patterns and velocities (Qian et al., 1994; Schrater et al., 2000; see Burr and Thompson, 2011, for an extensive review on motion psychophysics). Cells in the V1 cortex are mostly direction selective, albeit responding to bidirectional motion too, while cells in area MT are sensitive to motion opponency, inhibiting the non-preferred direction. Going up in the visual hierarchy, receptive fields become larger, more complex motion features can be detected and global motion is better estimated (Bruce et al., 2003; Orban, 2008). In humans, motion features undergo a different, separate processing along the dorsal pathway, which is mostly used for spatial cognition and navigation tasks, to quickly localise potentially interesting objects and possibly react to them, even when object recognition is not fully accomplished (Kravitz et al., 2011).
Such an ability to distinguish and categorise motion and process it to gain further knowledge of the environment would of course be desirable for artificial systems as well. As to the detection and computation of local visual motion, current approaches can be ascribed to two different strategies, correlation methods and gradient methods, the first being more popular in the neurobiology community, the second in computer vision (Dellen and Wessel, 2009; Borst, 2009). Correlation methods are more adherent to the way motion is computed biologically. A wealth of literature has shown how human motion sensitivity can be characterised experimentally and computationally in terms of numerous properties, such as spatial frequency specificity, contrast sensitivity, speed discrimination or inhibition between directions (Movshon et al., 1978; Watson and Ahumada, 1985; Burr, 1986). Computationally, a solution to the aperture problem, arising from the small extent of receptive fields with respect to the stimulus size, was proposed by Yuille and Grzywacz (1988) within the Motion Coherence Theory. Motion is perceived through an array of parallel local motion detectors. Two neighbouring detectors are delayed one with respect to the other and their outputs are combined to obtain direction selectivity (the so-called Reichardt detector); that is, the outputs of the two units are correlated in space and time. During the eighties a substantial body of research on motion sensing modelling was developed in computational biology, starting from psychophysical premises and making use of spatiotemporal filtering, in the effort to remain adherent to biological models. These models can be seen as a gradient solution or as another example of a differential method (Simoncelli, 1993). Adelson and Bergen (1985) showed how such models enable motion detection and direction selection in terms of spatiotemporal oriented energy, for both continuous and sampled motion. Extending this idea, Wildes and Bergen (2000) introduced a qualitative taxonomy of motion according to the orientation bands that collect most energy when filtering in the frequency domain. Indeed, the frequency spectrum of the spatiotemporal volume described by a video sequence can provide interesting information as to the content of the occurring motion patterns. In Wildes and Bergen (2000), the authors envisaged six types of motion patterns and their relative frequency signatures, suggesting that these could be further used to discriminate interesting motion regions and guide the focus of attention. In this paper, we build on this representation to further improve our computational model of object-based attention to visual motion (Belardinelli et al., 2010; Wischnewski et al., 2010). Motion segmentation according to several energy features is used to select pixels displaying consistently coherent motion. Our goal is to extract dynamic objects moving coherently with respect to the (almost) static background or to other incoherent, scintillating or flickering dynamic textures present in the scene. Object candidates are the result of a first pre-attentive segregation of motion features into discrete units with fairly homogeneous direction. Only at this point is a measure of saliency assigned, in an object-based account of attention (Scholl, 2001). Segmentation provided by no other cue but motion was recently shown to suffice for instantiating object files, namely mid-level representations underlying object spatiotemporal persistence (Gao and Scholl, 2010).
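As an aside, the correlation scheme recalled above (the Reichardt detector) can be illustrated with a few lines of code. The following is a minimal sketch, not part of the model proposed here: the one-frame delay, the one-pixel offset and the toy drifting-bar stimulus are illustrative assumptions.

```python
import numpy as np

def reichardt_response(signal, delay=1, offset=1):
    """Minimal 1D Reichardt correlator.

    signal: 2D array of luminance values, indexed as [t, x].
    Each unit correlates its own delayed input with the current input of a
    neighbour at distance `offset`; subtracting the mirror-symmetric term
    yields a direction-selective (opponent) output.
    """
    s = signal
    s_del = np.roll(s, delay, axis=0)              # inputs delayed by `delay` frames
    # rightward-preferring half-detector: delayed left input x current right input
    right = s_del[:, :-offset] * s[:, offset:]
    # leftward-preferring half-detector (mirror image)
    left = s[:, :-offset] * s_del[:, offset:]
    return right - left                            # > 0 rightward, < 0 leftward

# toy stimulus: a bright bar drifting rightwards at 1 pixel/frame
t, x = np.meshgrid(np.arange(64), np.arange(128), indexing="ij")
stimulus = np.exp(-0.5 * ((x - 20 - t) / 2.0) ** 2)
print(reichardt_response(stimulus)[10:-10].mean())  # positive, i.e. rightward
```

The sign of the opponent output encodes direction, the same property that the oriented-energy models discussed below capture with spatiotemporal filters.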
Attention models, vision architectures and applications have often taken motion into account as an informative feature to detect and segment interesting objects or targets, by means of optical flow computation, block matching or other motion computation techniques (Milanese et al., 1995; Le Meur et al., 2007; Marat et al., 2009). Motion indeed represents a very robust bottom-up feature, notably accounting in a causal way, rather than a merely correlational way, for most of the fixations in free viewing of video sequences (Carmi and Itti, 2006; Mital et al., 2011). Some approaches concentrated on the notion of 'novelty' or 'surprise' in the sense of centre-surround discrimination in Bayesian information-theoretic
frameworks (Itti and Baldi, 2009; Bruce and Tsotsos, 2009; Mahadevan and Vasconcelos, 2009). Still, these methods in some cases consider only differences between subsequent frames and do not account for a broader analysis of motion, usually providing low-level, pixel-based saliency. Proto-objects, intended as ''visual chunks'' or ''object tokens'' emerging from segmentation (Schneider, 1995), should instead be the real argument of saliency. This is also consistent with the coherence theory proposed by Rensink (2000), which suggests a sketch of low-level vision based on three steps: a transduction stage, concerned with photo-reception at pixel level; a primary processing stage, computing image properties via linear filtering; and a secondary processing stage, extracting proto-objects directly accessible to attention, which in turn provides coherence to selected objects in the form of spatiotemporal unity. In the following section we show how we extract motion energy features by means of spatiotemporal filtering and how these can be used to segment the scene into six motion classes. In Section 3 coherent motion is used to segment interesting objects, for which saliency is computed according to motion energy and direction. Finally, we present some results and discuss them along with future directions.
2. Motion feature extraction and segmentation

The temporal structure of a video sequence can be observed by looking at the x–t or y–t planes slicing the frame buffer volume parallel to rows or columns. These are characterised by oriented textures produced by the object edges over time. That is, as in the domain of static pictures, edges and bars (and corners) represent perceptually significant features, easily detected in the frequency domain by looking for local energy maxima, which occur where the phase of the different harmonic components is aligned (Morrone and Burr, 1988). Extraction of oriented bars and edges can be done in low-level processing by taking the responses to linear oriented filters with a 90° phase offset (i.e., forming a quadrature pair), squared and summed to produce a phase-independent measure of motion energy. Analysis of motion through spatiotemporal filtering and energy models has been widely investigated in works on the perception of motion (Adelson and Bergen, 1985; Watson and Ahumada, 1985; Heeger, 1987) and more recently further developed by Wildes and Bergen (2000). Since velocity is represented by the slope of the slanted tracks left by moving objects, Wildes and Bergen (2000) proposed a categorisation of motion types according to the responses of x–t or y–t planes to oriented Gaussian derivative filters (0°, 45°, 90°, 135°): unstructured areas basically offer no significant response; static patterns concentrate most of their spectrum in horizontal bands (response to vertical filters); flickering patterns in vertical bands; coherently moving objects respond to one dominant diagonal band (enhanced by motion opponency), while incoherently moving objects respond to both leftward and rightward diagonal bands, and scintillating areas respond quite uniformly across orientations. In Belardinelli et al. (2010), we presented a framework for computing motion energy from a bank of 12 Gabor filters coarsely tuned at 2 velocities (30°, 60°), 2 directions (rightwards/leftwards or upwards/downwards, respectively) and centred at 3 different frequencies. Gabor filters, indeed, represent a suitable combination of uncertainty optimality and biological plausibility. That architecture was designed to extract just coherent motion, but a more complete representation of the spatiotemporal textures entailed in a video sequence can be obtained by evenly tiling the 2D spatiotemporal frequency domain with filters capturing not only different directions and velocities of motion, but also static and flickering patterns at different scales. For a frequency bandwidth of 1 octave and an orientation bandwidth of 30°, we obtain 18 filters (each one in odd and even form) at 6
orientations and 3 frequencies (0.0938, 0.1875, 0.3750), spanning up to the maximal frequency of 0.5 cycles/pixel, as illustrated in Fig. 1, left. This Gabor wavelet representation refines the one previously presented, allowing each pixel to be assigned to the best fitting category. For each orientation and frequency band, energy is computed as follows. Each plane S(x, t) and S(y, t) is filtered by the 2D Gabor filters tuned at frequency f_i and oriented at \theta_j:

E_{f_i,\theta_j}(x, y_c, t) = \sqrt{ (G^o_{f_i,\theta_j}(x,t) * S_{y_c}(x,t))^2 + (G^e_{f_i,\theta_j}(x,t) * S_{y_c}(x,t))^2 }    (1)

Here G^o and G^e denote the odd and even phase version of a given Gabor filter, * denotes convolution, and y_c the row corresponding to the filtered plane.
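As an illustration of Eq. (1), the following sketch builds one quadrature Gabor pair and computes the phase-independent energy of a single x–t plane. The kernel size, the envelope width and the isotropic Gaussian envelope are illustrative assumptions; the actual filter bank uses a 1-octave frequency bandwidth and a 30° orientation bandwidth.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_pair(freq, theta, sigma=3.0, size=15):
    """Even (cosine) and odd (sine) 2D Gabor kernels forming a quadrature pair.

    freq: frequency in cycles/pixel, theta: orientation in radians,
    sigma: std of the (here isotropic) Gaussian envelope, size: kernel side.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)        # coordinate along the carrier
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return (envelope * np.cos(2 * np.pi * freq * xr),
            envelope * np.sin(2 * np.pi * freq * xr))

def oriented_energy(plane, freq, theta):
    """Phase-independent oriented energy of one x-t (or y-t) plane, cf. Eq. (1)."""
    even, odd = gabor_pair(freq, theta)
    e = fftconvolve(plane, even, mode="same")
    o = fftconvolve(plane, odd, mode="same")
    return np.sqrt(e ** 2 + o ** 2)

# toy example: energy of a 20-frame x-t slice for one frequency/orientation band
rng = np.random.default_rng(0)
plane = rng.standard_normal((20, 360))                # rows = frames, columns = x
E = oriented_energy(plane, freq=0.1875, theta=np.deg2rad(60))
```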
2.1. Segmentation with pre-defined thresholds

From these features we can obtain the four energy components defined by Wildes and Bergen (2000), needed for the analysis of the motion pattern occurring at each location:

S_x = \sum_{f_i} E_{f_i, \theta=0^\circ}    (2)

|R - L| = \sum_{f_i} |E_{f_i, \theta=60^\circ} - E_{f_i, \theta=120^\circ}| + |E_{f_i, \theta=30^\circ} - E_{f_i, \theta=150^\circ}|    (3)

R + L = \sum_{f_i} E_{f_i, \theta=60^\circ} + E_{f_i, \theta=120^\circ} + E_{f_i, \theta=30^\circ} + E_{f_i, \theta=150^\circ}    (4)

F_x = \sum_{f_i} E_{f_i, \theta=90^\circ}    (5)
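A sketch of the pooling in Eqs. (2)–(5) is given below, assuming the per-band energies of one x–t (or y–t) slice are stored in a dictionary keyed by (frequency, orientation in degrees), e.g. as produced with `oriented_energy` above; the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np

FREQS = (0.0938, 0.1875, 0.3750)

def pooled_components(energies):
    """Pool oriented energies into the four components of Eqs. (2)-(5).

    energies: dict mapping (freq, theta_deg) -> 2D energy map for one x-t
    (or y-t) slice.
    """
    def band(theta):
        return sum(energies[(f, theta)] for f in FREQS)

    S_x = band(0)                                                     # static, Eq. (2)
    R_L = sum(np.abs(energies[(f, 60)] - energies[(f, 120)]) +
              np.abs(energies[(f, 30)] - energies[(f, 150)])
              for f in FREQS)                                         # |R - L|, Eq. (3)
    R_plus_L = band(60) + band(120) + band(30) + band(150)            # R + L, Eq. (4)
    F_x = band(90)                                                    # flicker, Eq. (5)
    return S_x, R_L, R_plus_L, F_x
```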
Rightward and leftward motion are obtained by collapsing together the responses of the faster and slower units. The same is done in the vertical spatial dimension. All components are normalised between 0 and 1 so as to be directly comparable. The components obtained at the different frequencies for one sequence can be seen in Fig. 2. The sequence (from the dataset presented in Mahadevan and Vasconcelos (2009)) is particularly suited for demonstration because it contains different motion patterns, from static areas to the flickering of perturbed water, from the coherently moving boats to the splashing of waves. Each scale and each motion component capture different details. A temporal span of 10 frames (about 33 ms) was considered. Qualitative segmentation is subsequently obtained by assigning each pixel to one of the following classes according to the specified conditions:

1. Static if S_x obtained the largest response;
2. Coherent if |R - L| obtained the largest response;
3. Flicker if F_x obtained the largest response;
4. Incoherent/Scintillating if |R - L| < 0.3 and |S - F| < 0.3;
5. Unstructured if no component exceeded 30% of its maximum range.

Static labels are mostly assigned to contrasted objects whose scale falls within the support of one of the vertical filters; too large static areas or large untextured moving regions fall within the Unstructured category. Incoherent and scintillating patterns were collapsed together since they are quite difficult to discriminate, given similar signatures presenting no dominant response in the coherent channels nor in the flickering and static channels. The 0.3 threshold in the last two classes was tuned experimentally as giving the most reasonable labelling (a sketch of this labelling rule is given at the end of this subsection). Results of this labelling for both the horizontal and the vertical case can be seen in Fig. 3. The bigger boat in the foreground is riding over a wave, hence moving downwards, and coherent pixels in that area are mostly displayed in the segmentation of the vertical planes (in light blue). The boat in the background, moving to the left, is labelled as coherent in the horizontal segmentation. Most of the water region is either flickering (in green) or unstructured (red), except for some waves moving coherently within the short time span. Border regions between bigger flickering or static areas are often deemed incoherent/scintillating (orange). The conditions presented above ensure that pixels are assigned to the best fitting class even if they give some weak response to other types of motion. By selecting the pixels labelled as coherent in the segmentations obtained via vertical or horizontal spatiotemporal filtering, we can compare the coherent motion segmentation results with the selection of human subjects on the same scenes (Fig. 3, bottom).
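A minimal sketch of this rule-based labelling, operating on the normalised S_x, |R - L| and F_x maps, follows; the precedence of the tests (arg-max first, then the incoherent/scintillating and unstructured overrides) is an implementative assumption, since the text does not fully specify how overlapping conditions are resolved.

```python
import numpy as np

# class codes: 0 static, 1 coherent, 2 flicker, 3 incoherent/scintillating, 4 unstructured
STATIC, COHERENT, FLICKER, INCOHERENT, UNSTRUCTURED = range(5)

def label_pixels(S, RL, F, thr=0.3):
    """Assign each pixel one of the five motion classes of Section 2.1.

    S, RL, F: static, |R - L| and flicker components, each normalised to [0, 1].
    Pixels are first labelled with the arg-max over (S, |R - L|, F); the
    incoherent/scintillating and unstructured tests then override that label.
    """
    comp = np.stack([S, RL, F])              # shape (3, H, W)
    label = comp.argmax(axis=0)              # 0/1/2 -> static/coherent/flicker
    # class 3: no clear coherent response and static/flicker nearly balanced
    label[(RL < thr) & (np.abs(S - F) < thr)] = INCOHERENT
    # class 4: all components weak
    label[comp.max(axis=0) < thr] = UNSTRUCTURED
    return label
```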
Fig. 1. On the left, the spectrum coverage of the filter bank used. On the right, a sketch of the motion pattern assigned to each band in the spatiotemporal domain. Unstructured patterns are confined to the shadowed circle in the centre (to signify that their spectrum spreads weakly and in no particular direction, but it is not confined to low frequencies).
Fig. 2. First row, a frame of the sequence boats. The following rows show the 4 components computed for motion segmentation, at each frequency (f_1 = 0.0938, f_2 = 0.1875, f_3 = 0.3750), for the x–t planes of the sequence.
2.2. Learning to segment dynamic scenes

The presented framework relies on a segmentation step characterised by a hard-wired, pre-defined discrimination between classes, derived by arbitrarily setting some thresholds. Moreover, the procedure handles vertical and horizontal features separately, producing some ambiguity (pixels labelled differently in the vertical and horizontal segmentation). And yet, the segregative power is actually inherent in the filter responses, taken as oriented edge/bar detectors. This means that, without the need to sum across frequencies and to artificially set discrimination rules, a classifier can be trained on all the features (18 vertical and 18 horizontal filter responses) of selected, manually labelled points and automatically retrieve the discriminative signatures pertaining to each class. To this end, we considered a new video dataset of 18 sequences, shot with an almost fixed camera (Dorr et al., 2010) and depicting several urban scenes with different motion patterns (people and cars moving, clouds and water scintillating or flickering, static objects and unstructured background). Due to the ambiguous frequency signature of the incoherent and scintillating patterns, which makes them not clearly assignable to any portion of the frequency plane on the right in Fig. 1, we defined just 4 classes: static, coherently moving, unstructured and flickering. The sequences were resized to 360 × 640 for computational convenience and the first 20 frames were used for spatiotemporal filtering. We extracted the 10th frame of each sequence and on these frames we labelled a
total of 36 points for each class. Fig. 4 displays, as an example, the points labelled for the sequence bridge_2. We then collected the filter responses at those points and in a 5 × 5 neighbourhood (the pixels in the neighbourhood were given the same label as the centre), so as to allow for some variance in the responses associated with each class. Each of the 36 features was scaled to the same range [0, 1]. The whole dataset thus consisted of 3600 points. We split the dataset by using 2/3 of the points for training and 1/3 for testing.
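A sketch of how such a training set can be assembled from the filter responses is given below; the array layout, the border handling and the function name are illustrative assumptions.

```python
import numpy as np

def build_training_set(feature_stack, seeds, half=2):
    """Collect filter responses in a 5x5 neighbourhood around labelled points.

    feature_stack: array of shape (36, H, W) with the 18 vertical + 18
    horizontal filter responses; seeds: list of ((row, col), label) pairs.
    Every pixel in the neighbourhood inherits the label of its centre.
    """
    X, y = [], []
    for (r, c), label in seeds:
        patch = feature_stack[:, r - half:r + half + 1, c - half:c + half + 1]
        X.append(patch.reshape(patch.shape[0], -1).T)      # 25 samples x 36 features
        y.append(np.full(patch.shape[1] * patch.shape[2], label))
    X = np.concatenate(X)
    y = np.concatenate(y)
    # scale each feature to [0, 1] over the whole set
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    return X, y
```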
Fig. 3. Top row: on the left, motion segmentation for the horizontal energy components, as obtained for the sequence in Fig. 2. On the right, motion segmentation for the vertical energy components. Colours are coded as in the classification given in the text (blue for static, azure-blue for coherent motion, green for flickering, orange for incoherent/scintillating, red for unstructured). Bottom row: on the left, coherent motion mask as extracted from horizontal and vertical segmentation; on the right, foreground motion mask as annotated by human subjects. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 4. Manually labelled points for the training set fed to the SVM on one sequence of the used video dataset. Colour codes the class label (blue for static, cyan for coherently moving, yellow for unstructured and red for flickering). Not all the sequences contained every pattern of motion, hence the scenes were unevenly sampled. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
The Support Vector Machine implementation of Chang and Lin (2011) was used to train a model on a multi-class task by means of the one-against-one procedure with a Gaussian kernel. Model selection by grid search was used to tune the parameters C and γ and yielded an accuracy of 89.7%. Some results obtained by applying the model to whole frames are displayed in Fig. 5. Interestingly, the very same model performed quite well on the dataset used in the previous section, even though it was not trained on samples from it and the sequences have a different and variable resolution (between 152 × 232 and 348 × 468), as shown in Fig. 6 on the same sequences as Figs. 3 and 8, second row.
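A sketch of this training procedure with scikit-learn, whose SVC wraps LIBSVM and uses the one-against-one scheme for multi-class problems, is given below; the random placeholder data, the C/γ grid and the cross-validation setting are illustrative assumptions standing in for the manually labelled points and the grid actually used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# X: (n_points, 36) filter responses scaled to [0, 1]; y: class labels 0-3
# (static, coherent, unstructured, flickering). Random data stands in for the
# manually labelled points described in the text.
rng = np.random.default_rng(0)
X = rng.random((3600, 36))
y = rng.integers(0, 4, size=3600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

# RBF-kernel SVC; grid search over C and gamma (illustrative grid, 5-fold CV)
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": 2.0 ** np.arange(-1, 7, 2), "gamma": 2.0 ** np.arange(-7, 1, 2)},
    cv=5,
)
grid.fit(X_tr, y_tr)
print("test accuracy:", grid.score(X_te, y_te))
```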
3. Object extraction and saliency computation

Coherent motion locations are selected to be further processed in the attentive stage. Our visual system is known to be particularly sensitive to this kind of motion, since it is related to biological
motion and hence connected to possible approaching dangers or to harmless but still salient objects. The responses to diagonal filters of the regions labelled as coherent are used to compute a measure of horizontal and vertical energy, E_H and E_V, as in Belardinelli et al. (2010), by computing motion opponency, summing responses related to consistent directions (rightwards, leftwards, upwards, downwards) and summing across frequencies. From these measures of vertical and horizontal energy, considered as the components of the motion energy projected on the spatial axes, motion magnitude and direction can be computed as the module and phase of the total energy at every pixel, |E(x, y, t)| and ∠E(x, y, t). A representation encoding these quantities for the boats sequence is shown at the top of Fig. 7. After this step, we extracted proto-object patches, defined as blobs of consistent motion in terms of module and direction. As the Gestalt law of common fate states, points moving with similar velocity and direction are perceptually grouped together in a single object (Palmer, 1999).
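A sketch of the magnitude/phase representation described above is given below, assuming the direction-specific energies (rightward, leftward, upward, downward), already summed across frequencies, are available as 2D maps; reading the opponency as a signed difference between opposite directions is an assumption about E_H and E_V, which follows Belardinelli et al. (2010) only schematically.

```python
import numpy as np

def motion_vector(E_right, E_left, E_up, E_down, coherent_mask):
    """Per-pixel motion energy magnitude and phase on coherent locations.

    Each input is a 2D map of direction-specific energy (already summed across
    frequencies). Opponency is taken here as the signed difference between
    opposite directions; phase is the angle of the resulting 2D energy vector.
    """
    E_H = (E_right - E_left) * coherent_mask
    E_V = (E_up - E_down) * coherent_mask
    magnitude = np.hypot(E_H, E_V)        # |E(x, y)|
    phase = np.arctan2(E_V, E_H)          # angle of E(x, y), in radians
    return magnitude, phase
```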
Fig. 5. Multi-class classification results for 2 sequences (beach and holsten_gate). Colour is coded as for the points in Fig. 4. Moving people and cars are correctly depicted in cyan, while static structures are depicted in blue. Scintillating clouds are labelled in red while the untextured regions are always in yellow. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 6. Multi-class classification results for 2 sequences from the dataset by Mahadevan and Vasconcelos (2009). The classifier was not trained on this dataset; still, it correctly classifies most pixels. Colour is coded as for the points in Fig. 4. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
A simple segmentation on the energy module map would not be sufficient, since adjacent objects moving in different directions would be merged. The module map was pre-processed by means of morphological operations to achieve more compact regions. We then applied the mean shift algorithm to the phase of the coherent points, weighted according to their magnitude. The mean shift algorithm is a kernel-based mode-seeking technique, broadly used for data clustering and segmentation (Comaniciu and Meer, 2002). Being non-parametric, it has the advantage that neither the number of clusters nor the distribution needs to be specified beforehand, even though some bandwidth parameters have to be set. We took here a spatial bandwidth of 25 pixels and an orientation bandwidth of π/4. We thereby clustered pixel regions with a certain amount of energy according to their motion direction. Once we have segmented these blobs, we can extract the proto-object convex hulls and compute their saliency. We define object motion saliency according to two factors, one local and one global. The local factor is computed in a centre-surround fashion upon the energy magnitude of each object o. The more the mean energy module of the object differs from the mean magnitude of a local surround N(o) (twice as large as the object itself), the higher its magnitude salience:
S_{mag}(o) = \left| \langle |E(x, y)| \rangle_{(x,y) \in o} - \langle |E(x, y)| \rangle_{(x,y) \in N(o)} \right|    (6)
where the \langle \cdot \rangle operator computes the mean over the points in the subscript set. The global factor is given by motion direction saliency. If multiple objects move in a scene, the one(s) moving in a direction deviating from the distribution of the other object directions should be enhanced. Since some non-rigid objects can display more than one direction but still a dominating general direction, we compute the histogram of the orientations of the object o, weighted according to the energy module. In so doing, the more likely orientations are the ones relative to high-energy points. Orientation saliency is hence based on a comparison between the orientation distribution of the object and that of the other objects in the entire scene; the comparison relies on the Bhattacharyya coefficient:
S_{or}(o) = 1 - \sum_i \sqrt{ h_o(i) \cdot h_{S \setminus o}(i) }    (7)
where S \setminus o denotes the set of all clustered objects except o and h_o(i) the i-th histogram bin (we consider bins of 20°). Hence, the more the orientation distribution of the object differs from that of the global surround, the greater the orientation saliency.
Fig. 7. Top row: on the left, the colour code used to encode motion energy magnitude and phase on the coherently moving locations of the boats sequence (right). Bottom row: on the left, the results of the mean shift clustering into proto-objects. Right, the convex hull covering each segmented object is superimposed on a frame of the sequence; colour is proportional to the evaluated saliency, with less salient objects in green and the most salient in red. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Finally, the overall saliency of the object is calculated as a linear combination of the two components:
S(o) = \alpha S_{mag}(o) + \beta S_{or}(o)    (8)
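A sketch of the object-level saliency of Eqs. (6)–(8) is given below, assuming an integer label map of the clustered proto-objects (e.g. obtained with a mean shift implementation such as sklearn.cluster.MeanShift, which however only approximates the separate spatial/orientation bandwidths and magnitude weighting used here) together with the per-pixel energy magnitude and phase. The dilation used to approximate the surround N(o) and the equal weights α = β are assumptions; the 20° bins follow the text.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def orientation_hist(phase, weights, n_bins=18):
    """Magnitude-weighted orientation histogram with 20-degree bins (Eq. (7))."""
    h, _ = np.histogram(phase, bins=n_bins, range=(-np.pi, np.pi), weights=weights)
    return h / max(h.sum(), 1e-12)

def object_saliency(labels, magnitude, phase, alpha=0.5, beta=0.5):
    """Saliency of each proto-object as in Eqs. (6)-(8).

    labels: integer label map (0 = background, 1..K = clustered objects),
    magnitude/phase: per-pixel motion energy magnitude and direction.
    """
    saliency = {}
    for o in np.unique(labels):
        if o == 0:
            continue
        obj = labels == o
        # surround N(o): a dilated ring around the object, approximating the
        # "twice as large" neighbourhood mentioned in the text
        surround = binary_dilation(obj, iterations=10) & ~obj
        s_mag = abs(magnitude[obj].mean() - magnitude[surround].mean())  # Eq. (6)
        h_o = orientation_hist(phase[obj], magnitude[obj])
        others = (labels > 0) & ~obj
        h_rest = orientation_hist(phase[others], magnitude[others])
        s_or = 1.0 - np.sum(np.sqrt(h_o * h_rest))                       # Eq. (7)
        saliency[o] = alpha * s_mag + beta * s_or                        # Eq. (8)
    return saliency
```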
In the considered sequence (Fig. 7), the boat in the foreground achieves the highest saliency, since it moves more consistently than its local neighbourhood and it is the only object moving rightwards.

4. Results and discussion

Both the segmentation and the saliency evaluation were tested on two publicly available datasets, each containing 18 sequences and a measure of human selection (in the form of manual annotation or eye fixations). The dataset presented in Mahadevan and Vasconcelos (2009) for background subtraction consists of short grey-level sequences along with ground-truth masks annotated by human subjects. The subjects were given the instruction to label foreground moving objects. The dataset (http://www.svcl.ucsd.edu/projects/background_subtraction/demo.htm) contains some particularly challenging sequences displaying multiple coherently moving objects and scintillating or flickering backgrounds. Some results of coherent motion segmentation and object prioritisation are displayed in Fig. 8. In the first sequence, showing traffic on a highway, the only cars driving downwards are the most salient objects. In the second sequence, the pedestrians in the foreground gain more salience because of their high energy magnitude and relative isolation. The third sequence tracks some cyclists against a textured background. In this case the camera is not fixed, hence slanted tracks are also produced by patterns on the background. Nevertheless, the system succeeds in recovering the cyclists' shapes and segmenting them as well. The final saliency map is object-based, providing a more structured and perceptually organised input to be fed into a recognition system. Analogously, in the fourth sequence a bottle floats on the water surface and gets segmented out of the consistently moving background. The SVM classification performance was quantitatively tested on the dataset by Dorr et al. (2010).
The dataset (www.inb.uni-luebeck.de/tools-demo/gaze) contains several categories of dynamic stimuli: natural, trailers, stop motion and still images. In this study we focused on the natural movies set. Video sequences are provided with the corresponding raw eye-tracking data of 54 subjects, who were given the task of ''watching the sequences attentively''. Gaze samples were acquired with an SR Research EyeLink II running at 250 Hz and come in raw format, each gaze sample being described by the corresponding estimated screen coordinates and a progressive timestamp. No information is given as to whether a sample belongs to a saccade, a fixation or another category of eye movement, thus requiring a pre-filtering stage to discriminate between eye movement classes. Saccades are identified based on their angular velocity profile following the iterative approach described in Nyström and Holmqvist (2010). After eliminating saccades and noise, the remaining samples are considered fixations (if longer than the minimum fixation duration). All the fixation samples (including smooth pursuit samples) were used to build a three-dimensional fixation map collecting the fixations of all the subjects on a specific sequence over time. Gaussian functions were centred on each fixation as in Dorr et al. (2010). This map was iteratively thresholded and the resulting binary mask was compared, by means of a ROC curve, with each of the maps containing one of the classes labelled by the SVM. The results in Fig. 9 show the Area Under the Curve (AUC) performance for each class and each threshold. As the threshold rises and just the locations of high coherence across subjects are retained, the moving class consistently has a higher predictive power with respect to the other classes. This is in accordance with findings showing that motion best explains fixation distributions with respect to other features such as orientation, intensity or colour contrast (see Mital et al., 2011; Carmi and Itti, 2006). A comparison between the object-based saliency map and the attention map for some sequences of the second dataset can be seen in Fig. 10.
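A sketch of this evaluation is given below: the accumulated fixation samples are blurred, iteratively thresholded, and each binary mask is compared with the binary map of every class by means of a ROC analysis. The Gaussian width, the threshold grid and the use of roc_auc_score on binary class maps are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.metrics import roc_auc_score

def auc_per_class(fixation_counts, class_maps, sigma=25,
                  thresholds=np.linspace(0.1, 0.9, 9)):
    """AUC of each class map against increasingly strict fixation masks.

    fixation_counts: 2D map of fixation samples accumulated over subjects,
    class_maps: dict name -> binary map of pixels labelled with that class.
    """
    fix_map = gaussian_filter(fixation_counts.astype(float), sigma)
    fix_map /= fix_map.max()
    results = {name: [] for name in class_maps}
    for thr in thresholds:
        mask = (fix_map > thr).ravel()        # locations fixated consistently
        if mask.all() or not mask.any():      # skip degenerate thresholds
            continue
        for name, cmap in class_maps.items():
            results[name].append(roc_auc_score(mask, cmap.ravel().astype(float)))
    return results
```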
Fig. 8. Results of the segmentation and prioritisation process for some sequences of the first tested dataset. The first column shows a frame of the sequence, the second the ground truth mask, the third the selected coherently moving locations with corresponding motion energy and direction (see colour code in Fig. 7), the fourth the clustered objects with colour-ranked salience. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. AUC results for the different classes (static, moving, unstructured and flickering, colour-coded as in previous sections). Locations labelled as moving have the greatest overlap with the locations fixated by most of the subjects. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
At the moment, all the computation is done in Matlab and is not in real time. We are currently working on an implementation using 3D Gabor filters, which would allow us to directly capture the orientation of the moving pattern in space and time; 3D Gabor filters were already successfully used for optical flow computation by Heeger (1987) and for motion detection by Petkov and Subramanian (2008). Since the convolution with multiple filters is the major bottleneck, even in the frequency domain, an implementation on a CUDA architecture would allow the whole filtering phase to be parallelised, hence providing much faster, possibly on-line processing. Moreover, it must not be forgotten that motion features are extracted and saliency is computed in a purely bottom-up, instantaneous way, and hence cannot capture the possible unfolding of the ongoing narrative. A major improvement of the model, also in terms of concordance with human fixations, would result from a combination with top-down knowledge. Indeed, many fixations are not driven just by motion features but are directed to semantically relevant parts of the moving object, such as faces, which are known to attract attention almost pre-attentively. Introducing this kind of top-down knowledge, as well as a probabilistic description of the gist of the scene as in Torralba et al. (2006) (also crucially relying on spectral characteristics) and of the expected patterns of motion, would enrich and speed up the selection of meaningful objects. Further, we relied on an SVM as a simple and effective way of training a classifier on spatiotemporal features; still, coherently moving patterns determine spatiotemporal textures that can be probabilistically modelled, for example by means of Markov Random Fields (as in Zhu et al. (1998)), allowing a more sophisticated and precise inference. In conclusion, the system delivers reasonable results for both the segmentation and the prioritisation of meaningful objects in a scene. It should be noted that the whole process relies on the same features computed at the beginning and that these are only motion features. It is to be expected that the combination of these results with static features will refine the object extraction and priority assignment (as done, for example, in a top-down framework in Wischnewski et al. (2010)). A further improvement would consist in the discrimination between object motion and self-motion.
Fig. 10. Results of the segmentation and prioritisation process for some sequences of the second dataset. The left column shows a frame of the sequence, with the superimposed object saliency map. The right column displays the fixation maps as obtained from 54 subjects looking at that frame.
Indeed, if the camera is moving, parallax effects produce the impression that the background is moving, and not just with one velocity (as in the sequence with the cyclists). This would require a probabilistic description of the oriented spatiotemporal textures in the scene. The delivered proto-objects can anyway serve as the basis for object file instantiation and indexing, preserving spatiotemporal continuity over time and motion (Gao and Scholl, 2010). Labelled objects can indeed be tracked across subsequent frames and possible occlusions to check whether they have kept their direction, velocity and size (to a certain extent). Moreover, all the features underlying the objects can be passed along with the object units and also help gesture or event classification.

References

Adelson, E.H., Bergen, J.R., 1985. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Amer. A 2 (2), 284–299.
Belardinelli, A., Schneider, W., Steil, J., 2010. OOP: object-oriented-priority for motion saliency maps. In: Brain-inspired Cognitive Systems (BICS 2010), pp. 370–381.
Borst, A., 2009. Visual motion models. In: Squire, L. (Ed.), Encyclopedia of Neuroscience. Oxford Academic Press, pp. 297–305.
Bruce, N.D.B., Tsotsos, J.K., 2009. Saliency, attention, and visual search: an information theoretic approach. J. Vision 9 (3), 1–24.
Bruce, V., Green, P.R., Georgeson, M.A., 2003. Visual Perception: Physiology, Psychology and Ecology, fourth ed. Psychology Press.
Burr, D., 1986. Visual processing of motion. Trends Neurosci. 9 (7).
Burr, D., Thompson, P., 2011. Motion psychophysics: 1985–2010. Vision Res. 51 (13), 1431–1456.
Carmi, R., Itti, L., 2006. Visual causes versus correlates of attentional selection in dynamic scenes. Vision Res. 46 (26), 4333–4345.
Chang, C.-C., Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Systems Technol. 2, 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Comaniciu, D., Meer, P., 2002. Mean shift: a robust approach toward feature space analysis. IEEE PAMI 24 (5), 603–619.
Dellen, B., Wessel, R., 2009. Visual motion detection. In: Squire, L. (Ed.), Encyclopedia of Neuroscience. Oxford Academic Press, pp. 291–295.
Dorr, M., Martinetz, T., Gegenfurtner, K.R., Barth, E., 2010. Variability of eye movements when viewing dynamic natural scenes. J. Vision 10 (10).
Gao, T., Scholl, B.J., 2010. Are objects required for object-files? Roles of segmentation and spatiotemporal continuity in computing object persistence. Visual Cognit. 18 (1), 82–109.
Heeger, D.J., 1987. Model for the extraction of image flow. J. Opt. Soc. Amer. A: Opt. Image Sci. Vision 4 (8), 1455–1471.
Itti, L., Baldi, P., 2009. Bayesian surprise attracts human attention. Vision Res. 49 (10), 1295–1306.
Kravitz, D., Saleem, K., Baker, C., Mishkin, M., 2011. A new neural framework for visuospatial processing. Nat. Rev. Neurosci. 12 (4), 217–230.
Le Meur, O., Le Callet, P., Barba, D., 2007. Predicting visual fixations on video based on low-level visual features. Vision Res. 47 (19), 2483–2498.
Mahadevan, V., Vasconcelos, N., 2009. Spatiotemporal saliency in dynamic scenes. IEEE Trans. Pattern Anal. Machine Intell. 32, 171–177.
Marat, S., Ho Phuoc, T., Granjon, L., Guyader, N., Pellerin, D., Guérin-Dugué, A., 2009. Modelling spatio-temporal saliency to predict gaze direction for short videos. Internat. J. Comput. Vision 82 (3), 231–243.
Milanese, R., Gil, S., Pun, T., 1995. Attentive mechanisms for dynamic and static scene analysis. Opt. Eng. 34 (8), 2428–2434.
Mital, P.K., Smith, T.J., Hill, R.L., Henderson, J.M., 2011. Clustering of gaze during dynamic scene viewing is predicted by motion. Cognit. Comput. 3 (1), 5–24.
Morrone, M.C., Burr, D.C., 1988. Feature detection in human vision: a phase-dependent energy model. Proc. Roy. Soc. Lond. Ser. B Biological Sci. 235 (1280), 221–245.
Movshon, J.A., Thompson, I.D., Tolhurst, D., 1978. Receptive field organization of complex cells in the cat's striate cortex. J. Physiol. 283.
Nyström, M., Holmqvist, K., 2010. An adaptive algorithm for fixation, saccade, and glissade detection in eyetracking data. Behav. Res. Methods 42 (1), 188–204.
Orban, G.A., 2008. Higher order visual processing in macaque extrastriate cortex. Physiol. Rev. 88 (1), 59–89.
Palmer, S.E., 1999. Vision Science: Photons to Phenomenology. The MIT Press.
Petkov, N., Subramanian, E., 2008. Motion detection, noise reduction, texture suppression, and contour enhancement by spatiotemporal Gabor filters with surround inhibition. Biological Cybernet. 97 (5), 423–439.
Qian, N., Andersen, R.A., Adelson, E.H., 1994. Transparent motion perception as detection of unbalanced motion signals. I. Psychophysics. J. Neurosci. 14 (12), 7357–7366.
Rensink, R., 2000. The dynamic representation of scenes. Visual Cognit. 7 (1), 17–42.
Schneider, W.X., 1995. VAM: a neuro-cognitive model for visual attention control of segmentation, object recognition, and space-based motor action. Visual Cognit. 2 (2–3), 331–376.
Scholl, B.J., 2001. Objects and attention: the state of the art. Cognition 80 (1–2), 1–46.
Schrater, P.R., Knill, D.C., Simoncelli, E.P., 2000. Mechanisms of visual motion detection. Nature Neurosci. 3 (1), 64–68.
Simoncelli, E., 1993. Distributed representation and analysis of visual motion. Ph.D. thesis, MIT Media Laboratory.
Torralba, A., Oliva, A., Castelhano, M.S., Henderson, J.M., 2006. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychol. Rev. 113 (4), 766–786.
Watson, A.B., Ahumada, A.J.J., 1985. Model of human visual-motion sensing. J. Opt. Soc. Amer. A: Opt. Image Sci. Vision 2 (2), 322–342.
Wildes, R.P., Bergen, J.R., 2000. Qualitative spatiotemporal analysis using an oriented energy representation. In: ECCV '00: Proc. 6th European Conf. on Computer Vision, Part II, pp. 768–784.
Wischnewski, M., Belardinelli, A., Schneider, W.X., Steil, J.J., 2010. Where to look next? Combining static and dynamic proto-objects in a TVA-based model of visual attention. Cognit. Comput. 2 (4), 326–343.
Yuille, A., Grzywacz, N.M., 1988. A computational theory for the perception of coherent visual motion. Nature 333, 71–74.
Zhu, S., Wu, Y., Mumford, D., 1998. Filters, random fields and maximum entropy (FRAME): towards a unified theory for texture modeling. Int. J. Comput. Vision 27 (2), 107–126.