Accepted manuscript, to appear in Neural Networks (2014). doi: 10.1016/j.neunet.2014.08.010
Received 28 February 2014; revised 30 June 2014; accepted 24 August 2014.

A Computer Vision System for Rapid Search Inspired by Surface-Based Attention Mechanisms from Human Perception

Johannes Mohr*, Jong-Han Park, Klaus Obermayer
Department of Electrical Engineering and Computer Science, Technische Universität Berlin, MAR 5-6, Marchstr. 23, D-10587 Berlin, Germany

Abstract

Humans are highly efficient at visual search tasks because they focus selective attention on a small but relevant region of a visual scene. Recent results from biological vision suggest that surfaces of distinct physical objects form the basic units of this attentional process. The aim of this paper is to demonstrate how such surface-based attention mechanisms can speed up a computer vision system for visual search. The system uses fast perceptual grouping of depth cues to represent the visual world at the level of surfaces. This representation is stored in short-term memory and updated over time. A top-down guided attention mechanism sequentially selects one of the surfaces for detailed inspection by a recognition module. We show that the proposed attention framework requires little computational overhead (about 11 ms), but enables the system to operate in real-time and leads to a substantial increase in search efficiency.

Keywords: computer vision, biological vision, attention, search, object recognition

*Corresponding author. Email address: [email protected] (Johannes Mohr)


1. Introduction

One reason why humans are so efficient at visual search, even in cluttered environments, is the use of selective visual attention. This mechanism allows the brain to concentrate its computational capacity on the part of the visual input that is most relevant at a given time. When searching for an object, attention is continuously shifted from region to region (Johnson and Proctor, 2003). This sequential process is often accompanied by eye-movements (Buswell, 1935), in which particular regions are fixated by high-resolution foveal vision. Insights into attentional processes in human perception have inspired the use of attention mechanisms within computer vision systems. Before we briefly review these approaches, we need to introduce some underlying concepts and terms from human perception. Visual features such as color, orientation, and luminance are extracted from the light reaching the eye by photoreceptor and ganglion cells in the retina. This visual information is transmitted over the primary visual pathway to early stages of the visual cortex for further processing. These features are therefore referred to as low-level features. Attention models that are purely feature-driven and do not require feedback connections from later stages of the visual processing stream are called bottom-up guided models. It has also been found that attention is task-dependent and influenced by various cognitive factors (Henderson et al., 2009). Since this involves the flow of information from higher to lower brain areas, it is called top-down processing (Corbetta and Shulman, 2002). Computational saliency models give a quantitative and biologically plausible

explanation of how separate low-level features can be integrated to guide focused attention. We will now briefly describe a particular saliency model (Itti et al., 1998) that has become a gold standard of bottom-up saliency and is often applied in computer vision. As a first step, scale-space pyramids are constructed for different features belonging to the color, the luminance, and the orientation domain. These features are then used to model the behavior of receptive fields by applying center-surround filters to calculate local feature contrasts (on/off intensity, red/green, blue/yellow, and four orientation contrasts) at different scales. By normalizing the resulting feature maps, those map locations that particularly stand out from their local surroundings are assigned high values. Across-scale combination and further normalization result in a single "conspicuity map" for each feature domain. Finally, a master "saliency map" is obtained by a linear combination of the three conspicuity maps. Attention is then focused on a circular region of fixed size at the maximum of the saliency map, the most salient point. The model inhibits previously attended locations and switches attention to the next salient point, which accounts for the experimentally observed inhibition-of-return effect (Posner, 1980).
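This saliency model is not part of the system proposed in this paper, but to make the center-surround principle concrete, the following deliberately simplified sketch computes a saliency map from the intensity channel alone (OpenCV and NumPy assumed; the function name, scale choices, and normalization are illustrative, and the color, orientation, and inhibition-of-return components of the original model are omitted).

```python
import cv2
import numpy as np

def intensity_saliency(image_bgr, levels=5):
    """Deliberately simplified, intensity-only center-surround saliency map
    (illustrative; the full Itti et al. model also uses color opponency,
    orientation channels, and a more elaborate normalization)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    # Gaussian pyramid as a crude scale space
    pyramid = [gray]
    for _ in range(levels):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    h, w = gray.shape
    saliency = np.zeros((h, w), np.float32)
    for center in (0, 1):                 # fine "center" scales
        for delta in (2, 3):              # coarser "surround" scales
            surround = center + delta
            if surround >= len(pyramid):
                continue
            c = cv2.resize(pyramid[center], (w, h))
            s = cv2.resize(pyramid[surround], (w, h))
            fmap = np.abs(c - s)          # center-surround contrast
            # stretch each feature map to [0, 1] before combining
            fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-6)
            saliency += fmap
    return saliency / (saliency.max() + 1e-6)
```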

Such saliency maps have been used in several attention systems for computer vision (Walther and Koch, 2006; Frintrop, 2006; Lee et al., 2010; Rudinac et al., 2012). Some approaches integrate attention maps based on top-down information (Gould et al., 2007; Lee et al., 2010), or use depth information to find more likely object locations (Meger et al., 2008; García et al., 2013). All of these systems focus attention on regions around the maximum of some underlying attention map. The shape and size of this attentional focus are usually either fixed or defined as a region with similar image features. One drawback of this kind of approach is that the attended region will often be sub-optimal for recognizing the target object. It could have the wrong shape, or miss parts of the object that have different visual features than the most salient point. If the attended region is too small, several features might not be available to the recognition module. If it is too large, foreground or background features could confound the recognition process. The above map-based attention systems were motivated by the success of biological saliency models based on low-level features, which were able to predict human eye-movements better than chance. However, recent eye-tracking studies on realistic scenes (Einhäuser et al., 2008; Nuthmann and Henderson, 2010) suggest that in human perception attention is directed at higher-level features that result from a bottom-up grouping process (Yanulevskaya et al., 2013). Neural recordings showed that attention spreads along Gestalt criteria (Wannig et al., 2011) and is surface- or object-based, rather than spatial or feature-based (Fallah et al., 2007). Functional magnetic resonance imaging studies found that brain activity in early visual cortex is modulated by attending to surfaces (Ciaramitaro et al., 2011; Hou and Liu, 2012). Thus there is increasing evidence that in human perception the visual world is represented at the level of surfaces, which form the basic units of attention (Nakayama et al., 1995; He and Nakayama, 1995; Scholl, 2001; Nakayama et al., 2009). Using surfaces rather than map locations as units of attention also offers advantages for computer vision systems. In dynamic scenes, where either the camera or some objects are moving, the image regions corresponding to a particular object change over time; inhibiting or tagging fixed locations in an attention map therefore does not work. In a surface-based representation, however, the surfaces are tracked over time, allowing the implementation of object-based inhibition of

return that is also observed in humans (Tipper et al., 1991). Moreover, by restricting the object recognition process to the attended surface, background features are automatically eliminated, and the features are extracted from a region that corresponds to the surface of a physical object. The main challenge in the development of surface-based attention systems is that all surfaces in the image need to be segmented and tracked within a time-span of a few milliseconds. In this work we propose such a surface-based attention framework for a computer vision system that searches for known objects. The system uses fast grouping of depth cues to segment all surfaces within a visual scene. The surface-based representation is maintained and updated over time, even in dynamic environments and under camera movement. This allows the inhibition of surfaces that have already been investigated. An attention module then selects one surface at a time based on prior knowledge about the target object. The attended surface is then analyzed in detail by a recognition module at high resolution (SXGA). The attention framework is very fast and allows the system to work in real-time by restricting the computationally intensive recognition process to a particular surface.

2. Methods

The task of the proposed visual search system is to locate all instances of a particular target object within a dynamic environment, where both objects and camera might be moving. It should also keep track of identified target objects once they are found. An overview of the system is given in Fig. 1. In the following, the individual components of the system are described in detail.

Figure 1: Overview of the proposed system for visual search.

2.1. Sensor Data

The system receives rgb video data at SXGA (1280 × 1024 pixels) resolution at 15 Hz, and depth video data at QVGA (320 × 240 pixels) resolution at 30 Hz, acquired from a Microsoft Kinect device. The SXGA rgb video mode allows the use of detailed textural information for the object recognition module, whereas the QVGA depth mode is sufficient for a rapid segmentation of surfaces. The rgb image is also down-sampled to QVGA resolution to be used in the tracking and attentional selection steps.

2.2. Segmentation of Surfaces

In order to be suitable for the suggested attention framework, the procedure for the segmentation into surfaces has to fulfill several requirements. As in human perception, the pre-attentive process of obtaining the surface representation should be based on simple grouping operations and not require any top-down knowledge.

Most importantly, it needs to be extremely fast, on the order of milliseconds. The surfaces of two physically distinct objects should be represented by separate regions. In addition, the process should be able to separate objects from surfaces that physically support them, such as tables, chairs, or the floor. Finally, the segmented surfaces should be large enough to allow recognition. We propose a surface segmentation procedure meeting all these requirements that makes use of two grouping cues, depth and depth gradient. The algorithm is based on the principles of cohesion and boundedness (Spelke, 1990). According to the cohesion principle, two surface points lie on the same object only if they are linked by a path of connected surface points that are continuous in depth. The boundedness principle states that two surface points lie on distinct objects if there is no path of connected surface points that links them. However, this principle cannot be used to separate objects from their supporting surfaces, such as tables or the floor. In order to do this, our algorithm detects the strong discontinuities in the vertical component of the depth gradient that occur at the points where an object touches its supporting surface. Given the depth image D(x, y) (Fig. 2b), we first calculate the depth gradient ∇D(x, y). Multiplying it by a distance-dependent factor yields a weighted depth gradient that accounts for the decrease of depth resolution with distance:

W(x, y) = \frac{\nabla D(x, y)}{|D(x, y)| + 0.5}.    (1)

This weighting allows a constant threshold to be applied over a large depth range; the factor was chosen empirically. After denoising the vertical component W_y(x, y) (Fig. 2c) through median filtering, we obtain an image M(x, y) (Fig. 2d). All pixels of M(x, y) for which at least one component of W(x, y) exceeds τ_d = 0.03 in magnitude, or for which no depth information is available, are set to undefined.
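A minimal sketch of this weighting and thresholding step, together with the row-wise label propagation described in the next paragraph, might look as follows (NumPy and SciPy assumed; function and constant names are illustrative, and the threshold is interpreted here as a magnitude test on the weighted-gradient components):

```python
import numpy as np
from scipy.ndimage import median_filter

TAU_D = 0.03    # depth-discontinuity threshold on the weighted gradient
TAU_G = 0.005   # merge threshold on the denoised vertical component

def segment_surfaces(depth):
    """Sketch of the depth-based surface segmentation of section 2.2.
    `depth` is a float image in meters; missing measurements are NaN."""
    gy, gx = np.gradient(depth)                       # depth gradient
    weight = 1.0 / (np.abs(depth) + 0.5)              # distance-dependent factor
    wx, wy = gx * weight, gy * weight                 # weighted gradient, eq. (1)
    m = median_filter(wy, size=3)                     # denoised vertical component
    # undefined pixels: missing depth or a depth discontinuity
    # (here interpreted as a large weighted-gradient component)
    undefined = (np.isnan(depth) | np.isnan(m)
                 | (np.abs(wx) > TAU_D) | (np.abs(wy) > TAU_D))
    g = np.where(undefined, np.nan, m)

    labels = np.zeros(depth.shape, dtype=np.int32)    # label 0 = undefined
    parent = {}                                       # union-find over labels
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    next_label = 1
    h, w = depth.shape
    for y in range(h):                                # row-wise label propagation
        for x in range(w):
            if undefined[y, x]:
                continue
            neighbours = []
            for ny, nx in ((y, x - 1), (y - 1, x)):   # preceding neighbours
                if (ny >= 0 and nx >= 0 and labels[ny, nx] > 0
                        and abs(g[y, x] - g[ny, nx]) < TAU_G):
                    neighbours.append(labels[ny, nx])
            if not neighbours:                        # start a new segment
                labels[y, x] = next_label
                parent[next_label] = next_label
                next_label += 1
            else:                                     # merge with neighbour(s)
                labels[y, x] = neighbours[0]
                for n in neighbours[1:]:
                    parent[find(n)] = find(neighbours[0])
    for y in range(h):                                # resolve merged labels
        for x in range(w):
            if labels[y, x] > 0:
                labels[y, x] = find(labels[y, x])
    return labels                                     # small-segment cleanup omitted
```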

In the resulting image G(x, y), boundaries between objects that are caused by depth discontinuities are clearly visible (Fig. 2e). The algorithm walks through G(x, y) row-wise from the top left to the bottom right, and compares each pixel to its preceding horizontal and vertical neighbors. If neither the pixel nor its neighbor has an undefined value of G, and if the difference in G is less than τ_g = 0.005, then the pixel is assigned the same label as its neighbor, i.e. the two pixels are merged in the label map. If both of the preceding neighbors are merged with the pixel, then the segments those neighbors belong to in the label map are also merged. If a pixel cannot be merged with any of its preceding neighbors, a new label is assigned. This happens at pixels where the vertical component of the gradient suddenly changes, e.g. at the point where an object touches a table. Pixels with undefined values of G (because their depth information is missing or a depth discontinuity was detected) receive the special label 0. The resulting label image is cleaned up by merging small segments with their largest neighbors, yielding a final set of surfaces (Fig. 2f). The computational cost of this segmentation procedure is linear in the number of pixels.

2.3. Tracking of Surfaces

According to Ballard et al. (1995), human perception maintains a limited internal representation of physical objects in memory, and accesses the environment by eye-movements when necessary for a task. In our computer vision system, the surface-based representation of the scene is stored in short-term memory. This representation consists of the coordinates and the corresponding depth and rgb values of all pixels belonging to each surface. During tracking, the surfaces in the current frame are matched to the surfaces in memory. Humans

Figure 2: Rapid segmentation of surfaces based on depth cues. (a) LR rgb image; (b) D(x, y): LR depth image; (c) W_y(x, y): vertical component of the weighted depth gradient; (d) M(x, y): denoised vertical component of the weighted depth gradient; (e) G(x, y): all pixels of M(x, y) for which at least one component of W(x, y) exceeds τ_d in magnitude, or for which depth information is missing, are set to undefined and marked in black; (f) final segmentation into surfaces.

track moving objects by using surface features in addition to motion trajectories (Makovski and Jiang, 2009). Similarly, our computer vision approach uses differences in position, size, and color distribution for tracking. We define the position p(s) of a surface s as the sample mean over the 3D coordinates of all pixels assigned to s, measured in meters within a coordinate system fixed to the Kinect sensor. Its size σ(s) is defined as the maximum diagonal of its axis-aligned 3D bounding box (in meters). For calculating color histograms, colors are transformed into the normalized rgb color space, which is insensitive to surface orientation, illumination direction, and intensity (Gevers and Smeulders, 1999). The color values of the pixels in s are then assigned to 4 bins for each of the three channels, yielding a joint 64-bin histogram. This histogram is normalized to sum to 1 in order to ensure scale invariance and stored as a vector h(s).
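A small sketch of this histogram computation, assuming the surface's pixel colors are given as an N × 3 array of rgb values (function and argument names are illustrative):

```python
import numpy as np

def surface_color_histogram(rgb_pixels, bins_per_channel=4):
    """Joint 64-bin histogram over normalized rgb, summing to 1
    (sketch of the descriptor h(s) used for tracking)."""
    rgb = np.asarray(rgb_pixels, dtype=np.float64)
    total = rgb.sum(axis=1, keepdims=True)
    total[total == 0] = 1.0                       # avoid division by zero
    chroma = rgb / total                          # normalized rgb, each row sums to 1
    # quantize each channel into 4 bins -> joint 4 x 4 x 4 = 64-bin histogram
    idx = np.minimum((chroma * bins_per_channel).astype(int), bins_per_channel - 1)
    flat = idx[:, 0] * 16 + idx[:, 1] * 4 + idx[:, 2]
    hist = np.bincount(flat, minlength=64).astype(np.float64)
    return hist / hist.sum()                      # scale invariance
```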

surfaces o_j ∈ O in memory. For this, we evaluate for all pairs (s_i, o_j) the following assignment costs:

c_{ij} = 0.1 \, \lVert p(s_i) - p(o_j) \rVert + 0.5 \, \frac{|\sigma(s_i) - \sigma(o_j)|}{\max(\sigma(s_i), \sigma(o_j))} + 0.4 \, \sum_{k=1}^{64} |h_k(s_i) - h_k(o_j)|.    (2)

The weights in this cost function sum to 1 and were chosen empirically. This gives rise to the N × N cost matrix (c_{ij}), where N = max(|S|, |O|). If |S| ≠ |O|, we introduce dummy surfaces, since the matching algorithm requires the same number of entities in each set; the corresponding rows or columns of the cost matrix are set to 100. For all pairs (s_i, o_j) for which

\lVert p(s_i) - p(o_j) \rVert > 0.3,    (3)

or

\frac{|\sigma(s_i) - \sigma(o_j)|}{\max(\sigma(s_i), \sigma(o_j))} > 0.3,    (4)

or

\sum_{k=1}^{64} |h_k(s_i) - h_k(o_j)| > 0.3,    (5)

we set c_{ij} = 100. This ensures that matched surfaces have a certain similarity in position, size, and color. Let (a_{ij}) be a binary N × N matrix. The matching problem then amounts to minimizing the cost function

C(\{a_{ij}\}) = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij} c_{ij}    (6)

under the constraints \sum_{i=1}^{N} a_{ij} = 1 and \sum_{j=1}^{N} a_{ij} = 1. We solve this assignment problem using the Hungarian algorithm (Munkres, 1957).
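A sketch of this matching step, using SciPy's implementation of the Hungarian algorithm; the surface container and its fields are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 100.0  # cost assigned to dummy surfaces and to pairs violating eqs. (3)-(5)

def match_surfaces(current, memory):
    """current, memory: lists of surfaces with .pos (3-vector, m),
    .size (max bounding-box diagonal, m) and .hist (64-bin histogram)."""
    n = max(len(current), len(memory))
    cost = np.full((n, n), BIG)             # dummy rows/columns keep the cost 100
    for i, s in enumerate(current):
        for j, o in enumerate(memory):
            dp = np.linalg.norm(s.pos - o.pos)
            ds = abs(s.size - o.size) / max(s.size, o.size)
            dh = np.abs(s.hist - o.hist).sum()
            if dp > 0.3 or ds > 0.3 or dh > 0.3:      # threshold conditions
                continue
            cost[i, j] = 0.1 * dp + 0.5 * ds + 0.4 * dh   # eq. (2)
    rows, cols = linear_sum_assignment(cost)           # Hungarian algorithm
    # discard dummy assignments and pairs that violated a threshold
    return [(i, j) for i, j in zip(rows, cols)
            if i < len(current) and j < len(memory) and cost[i, j] < BIG]
```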

Matches for which the assignment cost equals 100 are eliminated, as they are matches to dummy surfaces or violate at least one of the threshold conditions. The procedure results in a set of unique matches between surfaces in memory and surfaces in the current frame. Finally, the short-term memory is updated, i.e. for each match, the surface representation in memory is replaced by the representation of the surface in the current frame. A surface in memory that could not be matched within 15 frames is deleted from memory. A surface that has been tracked for at least 2 consecutive frames is considered stable and can be targeted by selective attention. In addition, the memory contains a status variable indicating whether a surface has previously been selected by the attention module (section 2.4) or recognized by the recognition module (section 2.5). While humans are only able to keep track of up to 5 objects at once (Scholl, 2001), there is no limit to the number of surfaces that can be

simultaneously tracked by our system. The tracking procedure is robust against short occlusions, the appearance of new objects, and changes in movement direction.

2.4. Selective Attention

For each frame, one of the surfaces is selected by the attention module and undergoes detailed inspection by the recognition process. Similar to humans, the attention module of our computer vision system uses top-down information about the target object to narrow down the search. Only surfaces that are not larger than the target object can be targeted by attention. The search is prioritized based on similarity to the target object in terms of color distribution. Both cues can be evaluated efficiently on the LR depth and rgb images. The size constraint is implemented by comparing the maximum diagonal of the 3D bounding box of each surface to that of the target object. For all surfaces fulfilling the size criterion, the Manhattan distance between the normalized color histograms of the surface s and the target object t is calculated as

D_h(s, t) = \sum_{k=1}^{64} |h_k(s) - h_k(t)|.

The surface with the smallest value becomes the

focus of attention, i.e. it is inspected by the recognition module and compared to the target. In biological vision, object-based inhibition of return (Tipper et al., 1991) speeds up visual search by directing attention away from an object that has previously been selected. In our attention system, this is implemented by checking the status of each surface in short-term memory. Surfaces that have been inspected before are neglected as long as other surfaces are present. When all new surfaces have been screened, the already observed surfaces are inspected once again, including surfaces that had previously been recognized as the target object. This helps to recover from any tracking or recognition errors.
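A compact sketch of this selection rule, assuming each tracked surface carries the illustrative fields size, hist, stable, and inspected:

```python
import numpy as np

def select_surface(surfaces, target_size, target_hist):
    """Pick the next surface to inspect: not larger than the target,
    not yet inspected (object-based inhibition of return), and most
    similar to the target's normalized color histogram."""
    candidates = [s for s in surfaces
                  if s.stable and not s.inspected and s.size <= target_size]
    if not candidates:
        # all admissible surfaces were screened: re-inspect old ones
        candidates = [s for s in surfaces if s.stable and s.size <= target_size]
    if not candidates:
        return None
    return min(candidates,
               key=lambda s: np.abs(s.hist - target_hist).sum())  # Manhattan distance
```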

2.5. Recognition

Unlike systems that conduct recognition within a fixed-size spatial fovea, our system restricts the recognition process to the image area corresponding to the extracted surface. We use both texture and color features for recognition. The procedure for building the target object models is described in section 2.6. For texture-based matching, SIFT key-points and descriptors (Lowe, 2004) are calculated on the HR image within the attended surface and matched to the set of descriptors of the target object model using fast approximate nearest neighbor search (FLANN) (Muja and Lowe, 2009). If the same object is viewed at different distances from the camera, the scales of two SIFT key-points detected on the object should change by approximately the same factor. Therefore, for all key-points belonging to the same object, the ratio of the scale of a key-point in the current frame to the scale of the corresponding key-point in the target object model should be similar. If the object is rotated with respect to the target model, the change in orientation of the SIFT key-points in the current frame compared to the key-points in the target model should likewise be similar. Thus, the scale and orientation of the key-points can be used to filter out false positive matches of SIFT key-points. In the first step, we construct a histogram of orientation differences with 12 bins of width π/6, where each pair of matched key-points is assigned to the two nearest bins to avoid discretization artifacts, and ignore all matches that are farther than one bin width away from the mode of the histogram. In the second step, we consider the scale ratios between matched key-points, and ignore all matches whose ratio is farther than θ_s = 0.1 from the median. If the number of remaining matches is larger than a target-specific threshold θ_m, the texture of the surface matches the target.
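A sketch of this two-stage filtering of key-point matches (OpenCV key-points and NumPy assumed; the function name, the representation of the model features, and the soft bin assignment are illustrative simplifications):

```python
import numpy as np

def filter_sift_matches(matches, frame_kps, model_orients, model_scales,
                        theta_s=0.1, n_bins=12):
    """Filter FLANN matches by consistency of orientation change and scale ratio
    (sketch; `matches` is a list of (frame_index, model_index) pairs,
    `frame_kps` are OpenCV KeyPoints, model orientations are in radians)."""
    if not matches:
        return []
    width = 2.0 * np.pi / n_bins                       # bin width pi/6
    d_theta = np.array([np.deg2rad(frame_kps[i].angle) - model_orients[j]
                        for i, j in matches]) % (2.0 * np.pi)
    # histogram of orientation differences; each match votes for its own bin
    # and the following one (a crude soft assignment against discretization)
    hist = np.zeros(n_bins)
    for d in d_theta:
        b = int(d // width) % n_bins
        hist[b] += 1
        hist[(b + 1) % n_bins] += 1
    mode = (np.argmax(hist) + 0.5) * width             # center of the strongest bin
    circ = np.abs(d_theta - mode)
    circ = np.minimum(circ, 2.0 * np.pi - circ)        # circular distance to mode
    keep = circ <= width                               # within one bin width

    # scale ratios (key-point size used as scale); keep values near the median
    ratios = np.array([frame_kps[i].size / model_scales[j] for i, j in matches])
    median = np.median(ratios[keep]) if keep.any() else np.median(ratios)
    keep &= np.abs(ratios - median) <= theta_s
    return [m for m, k in zip(matches, keep) if k]

# the surface's texture matches the target if
# len(filter_sift_matches(...)) > theta_m (a target-specific threshold)
```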

For color-based matching, we compare the normalized color histograms of the inspected surface and the target object. If the Manhattan distance between the two histograms is smaller than θ_c = 0.4, the color distribution of the surface matches the target. If both texture and color distribution match, the object is recognized as an instance of the target object. Depending on the result of the recognition process, the status variable of the object in short-term memory is set to "recognized" or "selected". Note that the system is able to identify multiple instances of a target object in the visual scene.

2.6. Target Model Generation

Target objects were placed on a black turntable in front of a black background. Different views of the object were recorded with the Kinect at SXGA (1280 × 1024 pixels) resolution by rotating the table in steps of 9° until a full rotation was completed. The color histogram of the target model was obtained from the first view. The texture model was based on local invariant feature trajectories. Such models have successfully been used for describing 3D objects (Noceti et al., 2009). First, SIFT key-points were detected for all frames during a full rotation of the object, and SIFT descriptors were calculated. After pairwise matching of the SIFT descriptors in neighboring frames, mismatches were filtered out by applying spatial constraints to the key-points. Feature trajectories were constructed by connecting consecutive matches. SIFT features that could not be matched to the next frame were compared to the frame after next, and so on, allowing for gaps of up to 9 frames in the trajectories. The trajectories of the first and the last frame were stitched together, as they correspond to neighboring views after a full rotation. Finally, trajectories shorter than 3 frames were removed from the model. Each trajectory was represented by the descriptor, the scale, and the orientation of the SIFT feature that lies in the middle of the trajectory. These values were used by the recognition module for filtering out false positive key-point matches, as described in section 2.5. By using feature trajectories, the dimensionality of the model is much reduced, since the same local feature is usually present in several neighboring views. Moreover, only stable SIFT features that can be reliably matched over several neighboring views are included in the model.
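A simplified sketch of how such feature trajectories can be linked from pairwise matches between neighboring views; the helper match_descriptors stands for the SIFT matching with spatial filtering described above, and gap handling as well as the stitching of the first and last views are omitted:

```python
def build_trajectories(views, match_descriptors, min_length=3):
    """Link per-view SIFT features into trajectories (simplified sketch:
    gap handling of up to 9 frames and the stitching of first and last view
    are omitted).  `views[f]` is the feature list of view f, and
    `match_descriptors(a, b)` returns index pairs (idx_in_a, idx_in_b)."""
    prev_traj = {k: k for k in range(len(views[0]))}   # feature index -> trajectory id
    trajectories = {k: [views[0][k]] for k in range(len(views[0]))}
    next_id = len(views[0])
    for f in range(1, len(views)):
        matches = match_descriptors(views[f - 1], views[f])
        cur_traj = {}
        for prev_idx, cur_idx in matches:              # extend existing trajectories
            t = prev_traj.get(prev_idx)
            if t is not None:
                cur_traj[cur_idx] = t
                trajectories[t].append(views[f][cur_idx])
        for cur_idx in range(len(views[f])):           # unmatched features start new ones
            if cur_idx not in cur_traj:
                cur_traj[cur_idx] = next_id
                trajectories[next_id] = [views[f][cur_idx]]
                next_id += 1
        prev_traj = cur_traj
    # drop short trajectories; represent each remaining one by its middle feature
    return [feats[len(feats) // 2]
            for feats in trajectories.values() if len(feats) >= min_length]
```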

3. Evaluation

In this paper we present a systems approach that aims to show the benefit of an attention mechanism targeting object surfaces. As a systems approach, it targets a particular application setting and requires specific sensor information. Therefore, a direct comparison to the performance of the previous attention systems discussed in section 1 was not possible. We analyzed the performance gain obtained by using surface-based attention in comparison to using no attention, as well as the impact the use of top-down information had on the search performance of our system. In addition, we compared, for different scenarios, the results of the proposed surface segmentation to a graph-based approach, both with respect to time performance and with respect to the capability to segment physically distinct objects. We also investigated the robustness of the proposed surface tracking algorithm in different scenarios. Moreover, we evaluated the search and recognition performance, and measured the processing time of all system components and of the system as a whole. The evaluation was done on an Intel Core i7-980X 3.33 GHz CPU. All time measurements refer to wall-clock time, not CPU time.

3.1. Segmentation of Surfaces

For our application, the segmented regions should ideally correspond to the surfaces of distinct physical objects. We applied the proposed surface segmentation method to different scenes, shown in Fig. 3a (rgb image) and 3b (depth image). Black pixels in the depth images depict areas for which no depth information could be measured by the Kinect sensor. The segmentation results obtained with our method on the LR depth images (Fig. 3b) are shown in Fig. 3c. We compare our method to a graph-based segmentation method (Felzenszwalb and Huttenlocher, 2004) that forms the basis of a recently proposed technique for grouping an image into proto-objects (Yanulevskaya et al., 2013). Fig. 3 shows the segmentation results obtained by this method on the LR color images in Fig. 3a for two settings of the smoothing parameter, s = 0.3 (d) and s = 1.0 (e). In order to compare the time performance of our method to the graph-based method (Felzenszwalb and Huttenlocher, 2004), we averaged the computation time on the LR images over 1000 frames. Our method took 7.6 ± 2.4 ms on average,

while the graph-based method required 43.5 ± 2.4 ms, about six times as long as the proposed segmentation method.

We also performed a quantitative comparison of the segmentation results. For this purpose, we counted for each scene the number of cases in which regions belonging to physically distinct objects were merged into a single segment. The results for the scenes in Fig. 3 are shown in Table 1. A figure in which all counted segmentation errors are marked in the images is provided as supplementary material. For each scene, the graph-based method yielded, for both parameter settings, at least four times as many segmentation errors as our proposed method. Moreover, it can be seen from Fig. 3 that many object surfaces are over-segmented by the graph-based method.

Scene                         1    2    3    4    5    6    7    8    9
Our method                    2    1    2    2    1    2    2    0    0
Graph-based method (s=0.3)    10   8    13   9    11   11   8    8    4
Graph-based method (s=1.0)    10   7    16   13   11   16   14   7    9

Table 1: Number of segmentation errors where two physically distinct objects are assigned to the same segment. Comparison of the proposed segmentation method to the graph-based method (Felzenszwalb and Huttenlocher, 2004) on the scenes from Fig. 3 (ordered from top to bottom).

For the higher smoothing parameter (Fig. 3e), this over-segmentation is somewhat reduced, but a larger number of regions belonging to physically distinct objects are merged into a single segment (Table 1).

3.2. Tracking of Surfaces

An example of the surface tracking is shown in Fig. 4 for a subset of frames from a sequence in which the Kinect camera was rotated in a cluttered scene. The top image always shows the LR rgb image, while the bottom image shows the tracked surfaces. If a surface has been successfully tracked between two frames, its color is identical in both frames. If the tracking is interrupted at some point between two frames, the surface changes its color. Examples where this happens are the back of the chair, which changes its color from purple to green between frames 385 and 406, and the teddy bear, which changes its color from pink to green between frames 575 and 607. The larger surfaces of all objects were tracked over many frames. For some very small surfaces, such as the lids of boxes, the segmentation was not always stable; they were therefore sometimes initialized as new objects. We quantitatively evaluated the robustness of the tracking by analyzing short

Figure 3: Comparison of the segmentation performance of the proposed surface segmentation method to the graph-based method (Felzenszwalb and Huttenlocher, 2004) on different scenes. (a) rgb image; (b) depth image; (c) result of the proposed segmentation method; (d) result of the graph-based method with smoothing parameter 0.3; (e) result of the graph-based method with smoothing parameter 1.0.

Figure 4: Result of the tracking process under rotation of the camera. For each frame, the figure shows the LR rgb image (top) and the tracked surfaces (bottom). Shown are frames 1, 28, 57, 103, 187, 252, 291, 304, 322, 345, 385, 406, 447, 467, 480, 502, 537, 575, 607, and 627.

sequences under rotation of the Kinect camera in three different scenarios: laboratory, kitchen, and office. For each scenario we indexed a number of physically distinct objects (shown in Table 2). Each of these objects was manually inspected over 40 successive frames in which it appeared. In Table 2 we report the percentage of correct tracking operations within the inspected sequence over all objects. The videos used for this analysis are provided as supplementary material.

3.3. Target Model Generation

Target object models of eight objects (Fig. 5) were built using the procedure described in section 2.6. We used these models in different experiments to investigate the impact of the selective attention module on the system. In Table 3 we report for each target object the number of SIFT features that were obtained for one particular view of the object, as well as the number of feature trajectories representing the whole 3D object. The number of feature trajectories describing the whole object is less than three times the number of features describing a single view, even though 40 views enter the model, leading to a very efficient 3D representation of the target.

3.4. Selective Attention

The proposed surface-based attentional selection mechanism is illustrated in Fig. 6 on a dynamic scene. At the beginning, the target object is not yet in the camera's view. The target is first tracked in frame 15, where it is immediately targeted by selective attention and recognized by the recognition module. Note that the target object already appears in frame 14, but since a surface needs to be tracked in two consecutive frames to be considered stable, the surface of the target only becomes available to the attention system in frame 15. One can see that in each frame a different surface is selected. An object is re-inspected only if all other

Scenario         No. of inspected object surfaces    Successful tracking operations
1: Laboratory    40                                   87%
2: Kitchen       36                                   87%
3: Office        28                                   88%

Table 2: Robustness of Tracking.

Figure 5: The eight target objects used for training the models.

Object                      1     2     3     4     5     6     7     8
# features (single view)    427   735   638   529   797   402   504   245
# feature trajectories      704   1667  1674  1434  1097  869   1056  435

Table 3: Number of SIFT features needed to represent the target objects shown in Fig. 5. The first row shows the number of features detected for the frontal view of an object, the second row shows the number of features necessary to simultaneously represent all views of the object via feature trajectories.

Figure 6: Example of the surface-based attention selection for a dynamic scene, where object 6 from Fig. 5 was used as the search target. The figure shows the last 15 frames before the target object is recognized. For each frame, the figure shows the LR rgb image (top) and the tracked surfaces (bottom). The currently attended surface is marked in red in the rgb image, and a yellow box marks an identified target object.

admissible objects have been checked. For example, the object first inspected in frame 1 is later re-inspected in frame 14, and the object first inspected in frame 2 is later re-inspected in frame 13. Time measurements were conducted on the static scene shown in Fig. 2 with a fixed Kinect camera. This allowed us to assess the influence of the different system components under well-defined conditions. Since the time measurements had to be done on the live sensor stream rather than on a recording, the fixed camera and static set-up ensured that the sensor stream was exactly reproducible, except for camera noise. Each of the objects in Fig. 5 was used twice as the search target, resulting in 16 experiments. We investigated the run-time of the different components of our system, the run-time when using surface-based attentional selection in comparison to using no attention, and the influence of including top-down knowledge on recognition time. In each of the 16 experiments, timings were averaged over 500 frames. The results of these experiments are summarized in Table 4. All steps of the attention system, including pre-processing, surface segmentation, tracking, and selective attention, together took only 11 ms. The run-time of the whole algorithm is therefore dominated by the recognition process. Calculation of SIFT features within the selected surface on the rgb image at SXGA (1280 × 1024) resolution and FLANN matching took 25 ms on average. We compared this to the time it would take if SIFT feature extraction and FLANN matching were done on the whole SXGA image (Table 5). Together, these two steps then took 1046 ms on average. The proposed attention framework adds 11 ms of processing time but requires only 25 ms for SIFT and FLANN, so that these steps together take about 36 ms instead of 1046 ms, an effective speed-up by a factor of 29. Including any over-

Component      Step             Average time
Overhead       Total            8.4 ± 0.3 ms
Attention      Pre-processing   4.3 ± 0.1 ms
               Segmentation     3.2 ± 0.2 ms
               Tracking         3.2 ± 0.3 ms
               Selection        0.7 ± 0.4 ms
               Total            11.4 ± 0.5 ms
Recognition    SIFT             20.8 ± 0.2 ms
               FLANN            3.5 ± 0.4 ms
               Other            0.9 ± 0.1 ms
               Total            25.2 ± 0.5 ms
Total                           44.9 ± 0.8 ms (22.3 Hz)

Table 4: Average time performance of the proposed visual search system using surface-based attention. The standard deviation is taken with respect to the average times obtained in the 16 experiments of 500 frames each.

Step     Average Processing Time
SIFT     957.6 ± 15.6 ms
FLANN    88.4 ± 1.9 ms
Total    1046.0 ± 17.0 ms

Table 5: For comparison: Time for SIFT and FLANN steps if conducted on the whole SXGA image (1280 × 1024 pixels), averaged over 500 frames.

head such as memory operations, the total processing time of our system was 45 ms on average. This corresponds to an average frame rate of 22 Hz. This value is above the frame rate of 15 Hz provided by the Kinect camera at SXGA resolution, so the system is able to operate in real-time. In order to assess the effect of top-down guidance on search efficiency, we compared the average search time obtained by the proposed method (priority ranking and size constraints) to the case where size constraints are used but random priority is assigned to all surfaces, and to the case without any top-down guidance. Sixteen experiments were conducted (2 experiments for each of the eight target objects). In each experiment, the time until the target object was found was measured repeatedly, and the average of this search time was calculated over 500 frames. Table 6 reports the average and standard deviation of the results for the 16 runs. Depending on the target object, the size constraints yielded a reduced set of 9-10 candidate surfaces. The use of size constraints alone reduced the search time by a factor of 3, and using both size constraints and priority ranking reduced it by a factor of 33. In general, however, this factor will depend on the complexity of the particular scene.

Top-down guidance                      Average Search Time
none                                   861.8 ± 374.9 ms
only size constraints                  267.8 ± 23.3 ms
priority ranking & size constraints    26.3 ± 12.3 ms

Table 6: The effect of top-down guidance on the average search time needed to find an object that is present in the scene. Shown are average and standard deviation of the average search time, measured over 16 experiments of 500 frames each.

3.5. Visual Search

For evaluating the search performance of our system, we used both a moving camera and a dynamic scene. Several objects (including the current target object) were placed on a rotating turntable. The camera first observed a cluttered scene not containing the target object, and was then rotated until the moving objects on the turntable were visible. This setup is illustrated in Fig. 7. The turntable was observed during a full rotation, such that the target object was visible from all views. At some points during the rotation, the target object was partly or fully occluded by other objects. For each of the eight target objects, two such scenes were evaluated, resulting in 16 experiments. An excerpt from such an experiment is shown in Fig. 8. In this example, the computer vision system is searching for object 5 from Fig. 5. At the beginning (e.g. frames 1 and 10), the target object is not visible. In frame 17, the surface of the target object is tracked for the first time, immediately selected by the attention module (indicated by the red surface in the rgb image), and recognized (indicated by the yellow box). Once the object has been recognized, it is marked in the following frames, but the object is re-inspected when all other candidate surfaces

Figure 7 (see supplementary video 1.mp4): Our setup for evaluating the visual search performance of our system. The sub-windows on the left side show, from top to bottom, (i) the target object (here object 2 from Fig. 5), (ii) the currently attended candidate surface, (iii) the tracked surfaces, and (iv) the low-resolution image, in which the currently attended surface is marked in red. The large sub-window on the right side shows the scene at high resolution (SXGA). The candidate surface is marked by a red box, and an identified target object by a yellow box. SIFT features detected on the candidate surface are marked by white squares, and matches to the target object model by blue squares.

Figure 8: Example where our system for visual search is applied to a dynamic scene. Here, object 5 from Fig. 5 was used as the search target. For each frame, the figure shows the LR rgb image (top) and the tracked surfaces (bottom). The currently attended surface is marked in red in the rgb image, and a yellow box marks an identified target object. Shown are frames 1, 10, 17, 25, 32, 70, 94, 102, 135, 140, 200, 214, 215, 226, and 227.

Object     1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8
Nsearch    4  1  1  1  1  1  2  4  2  1  1  1  1  1  1  2

Table 7: Evaluation of the search performance of our system on a dynamic scene as shown in Fig. 8. For each of the 16 experiments, the table shows the index of the target object and the number of frames Nsearch between the time the target object was first visible and the time it was first recognized.

have been visited (e.g. in frame 32). In frame 70, the target is partly occluded by another object, but still recognized. Between frames 90 and 101 the target is almost fully occluded and no longer recognized (see frame 94 for an example). In frame 102, the target is again selected by the attention system and recognized once more. In frames 135 and 140, one can again see that the target is recognized even if partially occluded. In frame 214 the target is re-inspected but not recognized, because there is almost no texture information available in that particular view. In frame 226 the surface of the target object is re-inspected, and now there is again sufficient texture information to allow recognition (e.g. frame 227). For each of the 16 experiments, we measured the number of frames Nsearch between the time the target object was first visible and the time it was first recognized. This value reflects the search efficiency of the system. The results are reported in Table 7. The target object was always found within 4 frames after it came into view. Note that a surface can only be attended if it was stably tracked; therefore a target object can never be recognized before the second frame in which it appears. Most target objects were recognized immediately when they were tracked for the first time (Nsearch = 1). Since previous attention frameworks for computer vision have different application settings, use different input data, and are integrated with

particular object recognition frameworks, it is not possible to directly compare the search efficiency to these systems. To give an impression of the overall recognition performance of our system, we measured precision and recall for the described experimental setup. We defined precision as the number of frames with correct recognitions divided by the total number of recognitions, and recall as the ratio of the number of frames in which the target object was correctly identified to the number of frames in which at least a tiny part of the target object was visible. This included frames in which the target object was almost fully occluded by another object (see frame 94 in Fig. 8 as an example). The average precision over the 16 experiments was 100 ± 0%

and the average recall was 90 ± 7%. This indicates that the used trajectory model

was able to reliably represent the different views of the target object, and that the system was able to handle partial occlusions quite well, as long as enough of the object remained visible to allow recognition.

4. Conclusions

In human vision, selective attention is employed to restrict the computational resources of the brain to the most relevant region of the visual field. There is evidence that, besides certain locations or features, attention targets the surfaces of discrete visual objects. These surfaces are obtained by fast perceptual grouping processes at the early stages of the visual processing stream. Selective attention is focused on one of these surfaces to enable object recognition. In the present paper, this concept has been adopted in a real-time computer vision system. We showed that it is possible to obtain a surface-based representation of the visual world by a rapid grouping of depth cues. The quantitative analysis showed

that the proposed segmentation method was not only much faster than graph-based segmentation of color images, but was also better suited to identify image regions belonging to distinct physical objects. We proposed a fast algorithm for tracking all surfaces in a scene that required only 3 ms on average. Object surfaces were successfully tracked in 87% of cases on average. Since losing track of a surface only means that it has to be re-inspected at a later point, this is acceptable for our application, where the highest priority is on speed. The tracked surfaces form the basic units of an attention process that is guided by prior knowledge about the target. In each frame, a particular object is inspected at high resolution by a recognition module. By using surfaces rather than regions in a saliency map as units of attention, the system is able to keep track of already inspected image regions even in non-stationary scenes and under camera movements. Restricting the costly recognition process to the region of the attended surface sped up recognition by a factor of 29. The attention framework, including surface segmentation, surface tracking, and top-down guided selective attention, required only 11 ms on average. This enabled the visual search system to work at an average frame rate of 22 Hz. In our experiments, the target objects were always found within 4 frames from the time they were first visible. This high search efficiency can to a large part be attributed to the use of top-down guidance, which in our experiments reduced search times by about a factor of 33. The proposed framework for surface-based attention can be combined with any object recognition framework. For our visual search system, we used an object recognition module that combined 3D object recognition based on trajectories of SIFT features with a comparison of normalized color histograms.

Such trajectory models of local invariant features are a state-of-the-art technique in computer vision for efficiently representing 3D objects. The system was able to recognize the target objects in a cluttered environment from most viewpoints, even when larger parts of the target objects were occluded. Apart from heavy occlusions, the system only failed when a particular view of the object did not provide sufficient texture features. In order to achieve good recognition performance on objects without any texture, different methods that also include shape information would need to be employed. The surfaces are obtained very rapidly in a purely bottom-up process. However, a single object is sometimes divided into several parts; e.g. a chair leg may be assigned to a different surface than the rest of the chair. At the recognition stage, top-down information could be used to combine these different surfaces into a single object. This would improve recognition performance for complex objects consisting of many parts. While our system focused on the particular task of searching for known objects in dynamic environments, the proposed surface-based representation could also prove useful for many other computer vision and robotics applications, such as robot navigation or indoor surveillance.

5. Acknowledgments

This research was funded by the BMBF as part of the Bernstein Focus Neurotechnology (grant 01GQ0850), and partially supported by the German Research Foundation (GRK 1589/1). We thank Sahil Narang, Fritjof Wolf, and Konrad Döring for their help with the implementation and testing of the system.

References

Ballard, D. H., Hayhoe, M. M., Pelz, J. B., 1995. Memory representations in natural tasks. Journal of Cognitive Neuroscience 7 (1), 66–80.
Buswell, G., 1935. How People Look at Pictures: A Study of the Psychology of Perception in Art. The University of Chicago Press, Chicago, IL.
Ciaramitaro, V. M., Mitchell, J. F., Stoner, G. R., Reynolds, J. H., Boynton, G. M., 2011. Object-based attention to one of two superimposed surfaces alters responses in human early visual cortex. Journal of Neurophysiology 105 (3), 1258–1265.
Corbetta, M., Shulman, G. L., 2002. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience 3 (3), 201–215.
Einhäuser, W., Spain, M., Perona, P., 2008. Objects predict fixations better than early saliency. Journal of Vision 8 (14), 1–26.
Fallah, M., Stoner, G. R., Reynolds, J. H., 2007. Stimulus-specific competitive selection in macaque extrastriate visual area V4. PNAS 104 (10), 4165–4169.
Felzenszwalb, P. F., Huttenlocher, D. P., 2004. Efficient graph-based image segmentation. International Journal of Computer Vision 59 (2).
Frintrop, S., 2006. VOCUS: A Visual Attention System for Object Detection and Goal-directed Search. Lecture Notes in Artificial Intelligence (LNAI). Springer, Berlin/Heidelberg.
García, G. M., Frintrop, S., Cremers, A. B., 2013. Attention-based detection of unknown objects in a situated vision framework. KI - Künstliche Intelligenz 27 (3), 267–272.
Gevers, T., Smeulders, A. W. M., 1999. Color-based object recognition. Pattern Recognition 32, 453–464.
Gould, S., Arfvidsson, J., Kaehler, A., Sapp, B., Messner, M., Bradski, G., Baumstarck, P., Chung, S., Ng, A. Y., 2007. Peripheral-foveal vision for real-time object recognition and tracking in video. In: Proc. IJCAI-07, pp. 2115–2121.
He, Z. J., Nakayama, K., 1995. Visual attention to surfaces in 3D space. Proceedings of the National Academy of Sciences 92, 11155–11159.
Henderson, J. M., Malcolm, G. L., Schandl, C., 2009. Searching in the dark: cognitive relevance drives attention in real-world scenes. Psychonomic Bulletin & Review 16 (5), 850–856.
Hou, Y., Liu, T., 2012. Neural correlates of object-based attentional selection in human cortex. Neuropsychologia 50 (12), 2916–2925.
Itti, L., Koch, C., Niebur, E., 1998. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), 1254–1259.
Johnson, A., Proctor, R. W., 2003. Attention: Theory and Practice. Sage Publications, Inc.
Lee, S., Kim, K., Kim, J.-Y., Kim, M., Yoo, H.-J., 2010. Familiarity based unified visual attention model for fast and robust object recognition. Pattern Recognition 43 (3), 1116–1128.
Lowe, D. G., 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), 91–110.
Makovski, T., Jiang, Y. V., 2009. Feature binding in attentive tracking of distinct objects. Visual Cognition 17, 180–194.
Meger, D., Forssén, P. E., Lai, K., Helmer, S., McCann, S., Southey, T., Baumann, M., Little, J. J., Lowe, D. G., Dow, B., 2008. Curious George: An attentive semantic robot. Robotics and Autonomous Systems 56 (6).
Muja, M., Lowe, D. G., 2009. Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Applications (VISSAPP'09). INSTICC Press, pp. 331–340.
Munkres, J., 1957. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5 (1), 32–38.
Nakayama, K., He, Z. J., Shimojo, S., 1995. Visual surface representation: a critical link between lower-level and higher-level vision. In: Kosslyn, S. M., Osherson, D. N. (Eds.), Visual Cognition. MIT Press, Cambridge, MA, pp. 1–70.
Nakayama, K., Shimojo, S., Ramachandran, V. S., 2009. Authors' update: Surfaces revisited. Perception 38, 859–877.
Noceti, N., Delponte, E., Odone, F., 2009. Spatio-temporal constraints for on-line 3D object recognition in videos. Computer Vision and Image Understanding 113 (12), 1198–1209.
Nuthmann, A., Henderson, J. M., 2010. Object-based attentional selection in scene viewing. Journal of Vision 10 (8), 20, 1–19.
Posner, M. I., 1980. Orienting of attention. Quarterly Journal of Experimental Psychology 32 (1), 3–15.
Rudinac, M., Kootstra, G., Kragic, D., Jonker, P. P., 2012. Learning and recognition of objects inspired by early cognition. In: Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on. IEEE, pp. 4177–4184.
Scholl, B. J., 2001. Objects and attention: the state of the art. Cognition 80, 1–46.
Spelke, E. S., 1990. Principles of object perception. Cognitive Science 14, 29–56.
Tipper, S. P., Driver, J., Weaver, B., 1991. Object-centred inhibition of return of visual attention. Quarterly Journal of Experimental Psychology A, Human Experimental Psychology 43 (2), 289–298.
Walther, D., Koch, C., 2006. Modeling attention to salient proto-objects. Neural Networks 19, 1395–1407.
Wannig, A., Stanisor, L., Roelfsema, P. R., 2011. Automatic spread of attentional response modulation along Gestalt criteria in primary visual cortex. Nature Neuroscience 14, 1243–1244.
Yanulevskaya, V., Uijlings, J., Geusebroek, J. M., Sebe, N., Smeulders, A., 2013. A proto-object-based computational model for visual saliency. Journal of Vision 13 (27), 1–19.

Captions
=====

Search for object 2.mp4: In this video, our computer vision system searches for a particular target object (number 2 in Figure 5 of the paper) within a dynamic scene, where both the camera and some of the objects are moving. The right panel shows the LR rgb image, the left panel shows the segmented and tracked surfaces. The currently attended surface is overlaid in red on the right panel. Once the target object is identified by the system, it is surrounded by a yellow box.

Search for object 3.mp4: In this video, our computer vision system searches for a particular target object (number 3 in Figure 5 of the paper) within a dynamic scene, where both the camera and some of the objects are moving. The right panel shows the LR rgb image, the left panel shows the segmented and tracked surfaces. The currently attended surface is overlaid in red on the right panel. Once the target object is identified by the system, it is surrounded by a yellow box.

Search for object 4.mp4: In this video, our computer vision system searches for a particular target object (number 4 in Figure 5 of the paper) within a dynamic scene, where both the camera and some of the objects are moving. The right panel shows the LR rgb image, the left panel shows the segmented and tracked surfaces. The currently attended surface is overlaid in red on the right panel. Once the target object is identified by the system, it is surrounded by a yellow box.

Search for object 5.mp4: In this video, our computer vision system searches for a particular target object (number 5 in Figure 5 of the paper) within a dynamic scene, where both the camera and some of the objects are moving. The right panel shows the LR rgb image, the left panel shows the segmented and tracked surfaces. The currently attended surface is overlaid in red on the right panel. Once the target object is identified by the system, it is surrounded by a yellow box.

Search for object 7.mp4: In this video, our computer vision system searches for a particular target object (number 7 in Figure 5 of the paper) within a dynamic scene, where both the camera and some of the objects are moving. The right panel shows the LR rgb image, the left panel shows the segmented and tracked surfaces. The currently attended surface is overlaid in red on the right panel. Once the target object is identified by the system, it is surrounded by a yellow box.

Tracking_video_office.mp4: This video illustrates the surface tracking within an office scene. The right panel shows the LR rgb video, the left panel shows the segmented and tracked surfaces.

Tracking_video_kitchen.mp4: This video illustrates the surface tracking within a kitchen scene. The right panel shows the LR rgb video, the left panel shows the segmented and tracked surfaces.

Tracking_video_lab.mp4: This video illustrates the surface tracking within a lab scene. The right panel shows the LR rgb video, the left panel shows the segmented and tracked surfaces.

segmentation_errors.pdf: This document supplements the analysis in section 3.1 of the paper. It outlines the segmentation errors that were made by the different algorithms on the scenes in Fig. 3 of the paper.

video 1.mp4: Our setup for evaluating the visual search performance of our system. The sub-windows on the left side show, from top to bottom, (i) the target object (here object 2 from Fig. 5), (ii) the currently attended candidate surface, (iii) the tracked surfaces, and (iv) the low-resolution image, in which the currently attended surface is marked in red. The large sub-window on the right side shows the scene at high resolution (SXGA). The candidate surface is marked by a red box, and an identified target object by a yellow box. SIFT features detected on the candidate surface are marked by white squares, and matches to the target object model by blue squares.