CHAPTER 95
Saliency in Computer Vision
Gérard Medioni and Philippos Mordohai
Neurobiology of Attention
Copyright 2005, Elsevier, Inc. All rights reserved.

ABSTRACT
The goal of computer vision is to develop algorithms for image understanding for computers and not necessarily to emulate the human vision system in biologically plausible ways. Nevertheless, research in computer vision has understandably looked to the human visual system for inspiration and intuition. One key aspect of human perception is saliency, the property of certain arrangements conspicuously standing out from a cluttered background. Over the past several years, we have developed a computational framework to detect salient perceptual structures in 2D, 3D, or N-D data sets, even under severe noise corruption. In the framework, data tokens are represented by tensors and the saliency of each token is computed based on information propagated among neighboring tokens via tensor voting. The Tensor Voting Framework enables us to cast computer vision problems as perceptual organization ones whose solution is the most salient perceptual structures.

I. INTRODUCTION
As pointed out by Max Wertheimer in a classic article (Wertheimer, 1923), the perception of wholes comprising many parts is not a property of the parts themselves, but a property of the arrangement of these parts. Gestalt psychologists investigated the principles that guide the arrangements and divisions of stimuli. These include proximity, similarity, good continuation, uniform destiny, closure, and simplicity. They operate at the pre-attentive bottom-up stages of perception. Psychological experiments documented the contribution of the gestalt principles to a universal perception of most stimuli by human observers. In many cases, they reinforce one another, but, in other cases, one factor dominates and produces a certain arrangement instead of another competing possibility. For instance, for a given set of stimuli, similarity may prevail over proximity to produce a more salient organization. The conflicts that often occur between gestalt principles are obstacles to the development of artificial systems for perceptual organization. Most researchers limit the number of principles they consider and develop ways to adjust their relative weights to achieve the desired organization under different circumstances.

Grossberg and Mingolla (1985) and Grossberg and Todorovic (1988) have developed the Boundary Contour System and the Feature Contour System that can group fragmented and even illusory edges to form closed boundaries and regions by feature cooperation. Shashua and Ullman (1988) detect globally salient curves in cluttered environments using locally connected networks. The measure of saliency they selected increases with respect to the length and smoothness of a curve and attenuates with gaps and curvature. Heitger and von der Heydt (1993) group elementary curves into contours via convolution with a set of orientation selective kernels whose responses decay with distance and difference in orientation. Junctions, corners, line ends, and illusory contours can also be explicitly detected.

The motivation behind the Tensor Voting Framework was the development of a general computational framework for computer vision problems. They are addressed within a gestalt framework, where the primitives are grouped according to principles to give rise to salient structures. This is compatible with the most generally applicable constraint in vision, that the “matter is cohesive,” proposed by Marr (1982). We chose a local, model-free approach over a global, model-based alternative. With the latter, we cannot discriminate between local model misfits and noise. Furthermore, a hierarchy of more abstract descriptions can be derived from the low-level local one, if that is desired, whereas the opposite is not always feasible. The framework is noniterative, very robust to noise, has no initialization requirements, and has only one critical parameter, the scale of voting.
II. THE TENSOR VOTING FRAMEWORK
In this section we briefly review the tensor voting framework that was originally presented in Medioni et al. (2000). The two main aspects of the framework are the representation by second-order symmetric tensors and the information propagation mechanism by tensor voting. The framework is described in 2D here, but it is suitable for perceptual organization in any dimension.

The novelty of our approach is that no objective function is explicitly defined and optimized according to global criteria. Instead, tensor voting is performed locally and the saliency of perceptual structures is estimated as a function of the support tokens receive from their neighbors. Tokens with compatible orientations that can form salient structures reinforce one another. A key property of the framework that sets it apart from all of the methods developed to date is that we define saliency as a tensor, as opposed to a scalar. The tensor conveys considerably richer information, such as the type of the perceptual structure and its most likely orientation.

The representation of a token consists of a symmetric second-order tensor that encodes perceptual saliency. The tensor essentially indicates the saliency of each type of perceptual structure the token belongs to and its preferred orientation. Tensors were first used as a signal processing tool for computer vision applications by Granlund and Knutsson (1995). In 2D, a symmetric second-order tensor can be viewed as an ellipse, or a 2 × 2 matrix, that can be decomposed as in the following equation:

$$T = (\lambda_1 - \lambda_2)\,\hat{e}_1\hat{e}_1^T + \lambda_2\,(\hat{e}_1\hat{e}_1^T + \hat{e}_2\hat{e}_2^T) \tag{1}$$

where $\lambda_i$ are the eigenvalues in descending order and $\hat{e}_i$ are the corresponding eigenvectors. The first term corresponds to an elongated stick tensor that encodes certainty of orientation normal to $\hat{e}_1$ with saliency $\lambda_1 - \lambda_2$, and the second term corresponds to an isotropic ball tensor that encodes uncertainty of orientation with saliency $\lambda_2$.

The inputs consist of tokens, which can be points or points with an associated orientation (curvels). The former are encoded as ball tensors and the latter as stick tensors. We propose to combine the information contained by the arrangement of these tokens by tensor voting, a method of information propagation in which tokens convey their orientation preferences to their neighbors in the form of votes, which are also tensors that are cast from token to token. Each vote has the orientation the receiver would have if the voter and receiver were part of the same smooth perceptual structure, which we choose to be the arc of the osculating circle at the voter that goes through the receiver. The saliency decay function we have chosen is the following (see also Fig. 95.1):

$$DF(s, \kappa, \sigma) = e^{-\left(\frac{s^2 + c\kappa^2}{\sigma^2}\right)} \tag{2}$$

where $s$ is the arc length OP, $\kappa$ is the curvature, $c$ is a constant, and $\sigma$ is the scale of voting, which determines the effective neighborhood size. According to the gestalt principle of proximity, the strength of the votes attenuates with distance. The strength of the vote also decreases with increased curvature of the hypothesized structure, making straight continuations preferable to curved ones, following the principles of smooth continuation and simplicity. Votes are cast by the stick and ball components of each tensor using the precomputed voting fields shown in Fig. 95.1.

FIGURE 95.1 (A) Vote generation from a unit stick voter and (B) the stick and (C) ball fields.

Each token casts votes to its neighbors and collects votes from them. Votes are accumulated by tensor addition and result in a generic tensor. An analysis of the results is performed by decomposing the tensors according to Eq. (1) and generating stick and ball saliency maps. Junctions can be detected as distinct local maxima of ball saliency, region inliers are characterized by high ball saliency, and curve inliers are characterized by high curve saliency. Results are illustrated in Fig. 95.2. We also perform first-order voting (Tong et al., 2001) to detect curve end points and region boundaries. Figure 95.3 demonstrates the simultaneous detection of a region and a curve, as well as their terminations, from a set of unoriented points. Note that virtually the same results can be obtained using a wide range of voting scales.

FIGURE 95.2 (A) Input, (B) Curves and junctions. Curves and junctions from a noisy point set. Junctions have been enlarged and marked as squares.

FIGURE 95.3 (A) Input, (B) Regions and curves. Regions and curves from a noisy point set. Region boundaries and curve end points are marked in black.

We have briefly presented a generic framework that provides a novel definition and implementation of saliency. It can be applied to curves, regions, and also problems in higher dimensions. It is efficient, is robust to noise, and has only one critical parameter.
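As an illustration of the tensor representation and of the decomposition in Eq. (1), the following is a minimal sketch in Python with NumPy. The function names (`stick_tensor`, `ball_tensor`, `decompose`, `recompose`) are hypothetical and chosen for this example only; they are not taken from the original implementation. The sketch assumes unit-saliency input tokens and relies on the eigendecomposition of a symmetric 2 × 2 matrix.

```python
import numpy as np

def stick_tensor(e1):
    """Unit-saliency stick tensor whose dominant eigenvector is e1 (oriented token)."""
    e1 = np.asarray(e1, dtype=float)
    e1 = e1 / np.linalg.norm(e1)
    return np.outer(e1, e1)

def ball_tensor():
    """Unit-saliency ball tensor: no preferred orientation (unoriented point)."""
    return np.eye(2)

def decompose(T):
    """Split a symmetric 2x2 tensor into the stick and ball terms of Eq. (1).

    Returns (stick_saliency, ball_saliency, e1, e2), where
    stick_saliency = lambda_1 - lambda_2 and ball_saliency = lambda_2.
    """
    # eigh returns eigenvalues in ascending order, so index 1 is lambda_1.
    w, v = np.linalg.eigh(np.asarray(T, dtype=float))
    lam1, lam2 = w[1], w[0]
    e1, e2 = v[:, 1], v[:, 0]
    return lam1 - lam2, lam2, e1, e2

def recompose(stick_saliency, ball_saliency, e1, e2):
    """Rebuild the tensor from its stick and ball components, as written in Eq. (1)."""
    stick = np.outer(e1, e1)
    ball = np.outer(e1, e1) + np.outer(e2, e2)
    return stick_saliency * stick + ball_saliency * ball
```

For example, a tensor built as twice a stick along the x-axis plus one ball decomposes into stick saliency 2 and ball saliency 1, and `recompose` returns the original matrix.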
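The arc length and curvature that enter the decay function of Eq. (2) follow from elementary circle geometry. The short derivation below is a sketch under assumed notation that does not appear in the text above: $l$ denotes the straight-line distance from the voter O to the receiver P, $\theta$ the angle between the voter's tangent and the segment OP, and $R$ the radius of the osculating circle.

$$R = \frac{l}{2\sin\theta}, \qquad \kappa = \frac{1}{R} = \frac{2\sin\theta}{l}, \qquad s = 2\theta R = \frac{\theta\, l}{\sin\theta}$$

The chord OP subtends an angle of $2\theta$ at the center of the circle, which gives both the radius and the arc length. In the limit $\theta \to 0$ the receiver lies on the voter's tangent, so $s \to l$ and $\kappa \to 0$, and Eq. (2) reduces to $e^{-l^2/\sigma^2}$: a straight continuation is penalized by distance alone.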
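The voting step for oriented tokens can be sketched as follows. This is a simplified illustration, not the authors' implementation: it covers only stick voters (ball votes and the precomputed voting fields are omitted), it assumes that the dominant eigenvector of a stick tensor is the curve normal, it applies no angular cutoff, and the default value `c=1.0` is an arbitrary placeholder because the text does not specify the constant. All function names are hypothetical.

```python
import numpy as np

def decay(s, kappa, sigma, c):
    """Saliency decay function of Eq. (2)."""
    return np.exp(-(s**2 + c * kappa**2) / sigma**2)

def stick_vote(voter_pos, voter_normal, receiver_pos, sigma, c=1.0):
    """2x2 vote cast by a unit stick voter (assumed: stick direction = curve normal)."""
    n = np.asarray(voter_normal, dtype=float)
    n = n / np.linalg.norm(n)
    t = np.array([n[1], -n[0]])             # voter tangent, perpendicular to n
    d = np.asarray(receiver_pos, dtype=float) - np.asarray(voter_pos, dtype=float)
    x, y = d @ t, d @ n                     # receiver in the voter's (tangent, normal) frame
    l2 = x * x + y * y
    if l2 == 0.0:
        return np.zeros((2, 2))             # a token casts no vote to itself
    l = np.sqrt(l2)
    kappa = 2.0 * y / l2                    # signed curvature of the circle through O and P
    theta = np.arcsin(min(abs(y) / l, 1.0)) # angle between the tangent and OP
    s = l / np.sinc(theta / np.pi)          # arc length l * theta / sin(theta), safe at theta = 0
    # Normal of the osculating circle at the receiver, in world coordinates.
    m = (-kappa * x) * t + (1.0 - kappa * y) * n
    return decay(s, kappa, sigma, c) * np.outer(m, m)

def accumulate_votes(positions, normals, sigma, c=1.0):
    """Each token votes for every other token; votes are summed by tensor addition."""
    positions = np.asarray(positions, dtype=float)
    tensors = np.zeros((len(positions), 2, 2))
    for i, (p, n) in enumerate(zip(positions, normals)):
        for j, q in enumerate(positions):
            if i != j:
                tensors[j] += stick_vote(p, n, q, sigma, c)
    return tensors

def stick_saliency(T):
    """lambda_1 - lambda_2 of the accumulated tensor, as in Eq. (1)."""
    w = np.linalg.eigvalsh(T)               # ascending eigenvalues
    return w[1] - w[0]

# Five collinear curvels along the x-axis plus one isolated token off the line.
positions = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (2, 3)]
normals = [(0, 1)] * 5 + [(1, 0)]
saliency = [stick_saliency(T) for T in accumulate_votes(positions, normals, sigma=2.0)]
```

In this toy configuration the middle token of the line collects mutually consistent votes from its collinear neighbors and ends up with a much higher stick saliency than the isolated token, which is the behavior the framework relies on to separate curve inliers from noise.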
References
Granlund, G., and Knutsson, H. (1995). “Signal Processing for Computer Vision.” Kluwer, Dordrecht.
Grossberg, S., and Mingolla, E. (1985). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychol. Rev. 92, 173–211.
Grossberg, S., and Todorovic, D. (1988). Neural dynamics of 1-d and 2-d brightness perception: A unified model of classical and recent phenomena. Perception Psychophys. 43, 723–742.
Heitger, F., and von der Heydt, R. (1993). A computational model of neural contour processing: Figure-ground segregation and illusory contours. In Proceedings of the International Conference on Computer Vision (H. Nagel, T. Huang, Y. Shirai, Eds.), IEEE, pp. 32–40, Berlin.
Marr, D. (1982). “Vision.” Freeman Press.
Medioni, G., Lee, M., and Tang, C. (2000). “A Computational Framework for Segmentation and Grouping.” Elsevier, Amsterdam.
Shashua, A., and Ullman, S. (1988). Structural saliency: The detection of globally salient structures using a locally connected network. In Proceedings of the International Conference on Computer Vision (R. Bajcsy, S. Ullman, Eds.), IEEE, pp. 321–327, Tampa.
Tong, W., Tang, C., and Medioni, G. (2001). First order tensor voting and application to 3-d scale analysis. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (P. Flynn, Ed.), IEEE, pp. I:175–182, Kauai, Hawaii.
Wertheimer, M. (1923). Routledge and Kegan Paul translated by W. Ellis, Vol. 4, 301–350, London.
SECTION IV. SYSTEMS