CHAPTER 11
Open problems: an argument for new vision models rather than new algorithms
11.1 The linear receptive field is the foundation of vision models

We recall from Chapter 2 the definition of Receptive Field (RF): the RF of a neuron is the extent of the visual field where light influences the neuron’s response. The “standard model” of vision is grounded on the concept of a linear RF.

From Carandini et al. [1]: “At the basis of most current models of neurons in the early visual system is the concept of linear receptive field. The receptive field is commonly used to describe the properties of an image that modulates the responses of a visual neuron. More formally, the concept of a receptive field is captured in a model that includes a linear filter as its first stage. Filtering involves multiplying the intensities at each local region of an image (the value of each pixel) by the values of a filter and summing the weighted image intensities.”

And from Olshausen and Field [2]: “And there has even emerged a fairly well-agreed-on ‘standard model’ for V1 in which simple cells compute a linearly weighted sum of the input over space and time (usually a Gabor-like function), which is then normalised by the responses of neighboring neurons and passed through a pointwise nonlinearity. Complex cells are similarly explained in terms of a summation over the outputs of a local pool of simple cells with similar tuning properties but different positions or phases. [...] The net result is often to think of V1 as a kind of ‘Gabor filter bank.’ There are numerous papers showing that this basic model fits much of the existing data well, and many scientists have come to accept this as a working model of V1 function.”

While there have been considerable improvements on and extensions to the standard model, the linear RF remains the foundation of most vision models:

• In neuroscience, where models of single-neuron neurophysiological activity in the retina, the LGN and the cortex begin with a linear RF stage [1].
• In visual perception, where models of perceptual phenomena are based on convolving the visual input with a bank of linear filters [3].
• In computational and mathematical neuroscience, very diverse approaches also assume a linear filtering of the signal [4–6].

Artificial Neural Networks (ANNs) are inspired by classical models of biological neural networks, and for this reason they are also based on the linear RF, which is their essential building block. For instance, in his classical treatise on ANNs [7], Haykin states that one of the three basic elements of the neural model is “an adder for summing the input signals, weighted by the respective synaptic strengths of the neuron; the operations described here constitute a linear combiner.” The other two basic elements are the weights of the linear summation and a nonlinear activation function. Therefore, ANNs can be seen as a cascade of linear and nonlinear (L+NL) modules.

L+NL representations are very popular models in vision science [8]. In visual neuroscience, most modelling techniques for analysing spike trains consist of a cascade of L+NL stages [1], while in visual perception the most successful models are also in L+NL form [3].

But we saw in Chapter 3 how, despite the enormous advances in the field, the most relevant questions about colour vision and its cortical representation remain open [9,10]: which neurons encode colour, how V1 transforms the cone signals, how shape and form are perceptually bound, and how these neural signals correspond to colour perception. A key message is that the parameters of L+NL models depend on the image stimulus, as seen in Chapters 3 and 4, a topic we shall discuss in the next section. Probably not coincidentally, the effectiveness of these models decays considerably when they are tested on natural images.

This has grave implications for our purposes, since in colour imaging most essential methodologies also assume an L+NL form: brightness perception models (a nonlinearity, possibly shifted by the output of a linear filter), contrast sensitivity functions (a linear model, possibly applied after a nonlinearity), colour spaces and colour appearance models (with linear filters implemented as matrix multiplications and signal differences, and nonlinearities that may take different forms such as power laws or Naka-Rushton equations) and image quality metrics (based on L+NL vision models or based on ANNs, as seen in Chapters 8 and 9).

We have seen in the book that there are many things which L+NL models can’t satisfactorily explain. HDR/WCG imaging has further put to the test the capabilities of vision models and highlighted their limits. Many essential questions remain open, and solutions that were good enough for SDR and standard colour gamuts do not cut it now. To mention just a couple of key examples: there are no good models of brightness perception for HDR images, so for instance there can’t be fully automated methods to re-master SDR content in HDR as the optimal level of diffuse white changes from shot to shot [11,12]; inter-observer differences could be ignored to define the colour matching functions (CMFs) for standard colour gamut monitors where the primaries were not too saturated, but with WCG displays inter-observer differences can be substantial, to the point where it may make sense to define CMFs for individuals [13].
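To make the L+NL structure described above concrete, the following is a minimal sketch, in Python with NumPy, of a single L+NL stage in the spirit of the “standard model” quoted in this section: a small bank of Gabor-like linear receptive fields, divisive normalisation by the pooled activity of the neighbouring orientation channels, and a pointwise Naka-Rushton-type nonlinearity. The filter parameters, the choice of normalisation pool and the nonlinearity constants are illustrative assumptions, not values taken from any particular model or fitted to data.

import numpy as np   # requires NumPy >= 1.20 for sliding_window_view

def gabor_kernel(size=21, sigma=3.0, freq=0.25, theta=0.0):
    """Gabor-like linear receptive field (the 'L' stage); parameters are arbitrary."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * freq * xr)

def filter_image(img, kernel):
    """'Valid' 2D filtering (cross-correlation) via sliding windows; no external deps."""
    kh, kw = kernel.shape
    windows = np.lib.stride_tricks.sliding_window_view(img, (kh, kw))
    return np.einsum("ijkl,kl->ij", windows, kernel)

def l_nl_stage(img, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4),
               sigma_nl=1.0, n=2.0):
    """One L+NL stage: Gabor filter bank -> rectification -> divisive normalisation
    by the pooled activity across orientations -> pointwise Naka-Rushton-type
    nonlinearity. A schematic of the 'standard model', not a fit to any data."""
    linear = np.stack([filter_image(img, gabor_kernel(theta=t)) for t in thetas])
    rectified = np.maximum(linear, 0.0)
    pool = np.sqrt(np.mean(rectified**2, axis=0, keepdims=True))  # neighbouring channels
    normalised = rectified / (pool + 1e-6)
    return normalised**n / (normalised**n + sigma_nl**n)

# Tiny usage example on a random "image":
rng = np.random.default_rng(0)
responses = l_nl_stage(rng.random((64, 64)))
print(responses.shape)   # (4, 44, 44): one response map per orientation

Cascading several such stages, or replacing the fixed kernels with learned weights, yields the general family of L+NL models, and indeed the basic structure of ANNs, discussed in this section.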
As a result, both tone and gamut mapping remain open, challenging problems for which there are neither fully effective automatic solutions nor accurate vision models. But we must remark that there are solutions to these problems: they are manual solutions, performed by cinema professionals, who have the ability to modify images so that their appearance on screen matches what the real-world scene would look like to an observer in it [14]. Remarkably, artists and technicians with this ability are capable of achieving what neither state-of-the-art automated methods nor up-to-date vision models can. In other words, the manual techniques of cinema professionals seem to have a “built-in” vision model.
11.2 Inherent problems with using a linear filter as the basis of a vision model

The fundamental issue is the following: the responses of visual neurons, as well as visual perception phenomena in general, are highly nonlinear functions of the visual input. The RF of a neuron is characterised by finding the linear filter that provides the best correlation between visual input (often, white noise) and neuron response. The problem is that model performance degrades quickly if any aspect of the stimulus, like the spatial frequency or the contrast, is changed, because the resulting RF depends on the stimulus, since the visual system is nonlinear [15,1,16] (a toy numerical illustration of this stimulus dependence is sketched at the end of this section). This is an essential, inherent limitation of these models, and it’s such a key point that we think it’s best to let top vision scientists explain it themselves.

From Olshausen and Field [2]: “Everyone knows that neurons are nonlinear, but few have acknowledged the implications for studying cortical function. Unlike linear systems, where there exist mathematically tractable textbook methods for system identification, nonlinear systems cannot be teased apart using some straightforward, structuralist approach. That is, there is no unique ‘basis set’ with which one can probe the system to characterize its behavior in general. Nevertheless, the structuralist approach has formed the bedrock of V1 physiology for the past four decades. Researchers have probed neurons with spots, edges, gratings, and a variety of mathematically elegant functions in the hope that the true behavior of neurons can be explained in terms of some simple function of these components. However, the evidence that this approach has been successful is lacking. We simply have no reason to believe that a population of interacting neurons can be reduced in this way.”

From Wandell [15] (Chapter 7): “Multiresolution theory [uses] a neural representation that consists of a collection of component-images, each sensitive to a narrow band of spatial frequencies and orientations. This separation of the visual image information can be achieved by using a variety of convolution kernels, each of which emphasizes a different spatial frequency range in the image. This calculation might be implemented in the nervous system by creating neurons with a variety of receptive field properties. [...] There is a bewildering array of experimental methods – ranging from detection to pattern adaptation to masking – whose results are inconsistent with the central notions of multiresolution representations.”

From Olshausen and Field [2]: “The Gabor function has been argued to provide a good model of cortical receptive fields. However, the methods used to measure the receptive field in the first place generally search for the best-fitting linear model. They are not tests of how well the receptive field model actually describes the response of the neuron. [...] The results demonstrate that these models often fail to adequately capture the actual behavior of neurons. [...] There is only one way to map a nonlinear system with complete confidence: present the neuron with all possible stimuli. The scope of this task is truly breathtaking. Even an 8 × 8 pixel patch with 6 bits of gray level requires searching 2³⁸⁴ > 10¹⁰⁰ possible combinations (a googol of combinations). If we allow for temporal sensitivity and include a sequence of 10 such patches, we are exceeding 10¹⁰⁰⁰. With the estimated number of particles in the universe estimated to be in the range of 10⁸⁰, it should be clear that this is far beyond what any experimental method could explore. The deeper question is whether one can predict the responses of neurons from some combinatorial rule of the responses derived from a reduced set of stimuli. The response of the system to any reduced set of stimuli cannot be guaranteed to provide the information needed to predict the response to an arbitrary combination of those stimuli. Of course, we will never know this until it is tested, and that is precisely the problem: the central assumption of the elementwise, reductionist approach has yet to be thoroughly tested.”

From Carandini et al. [1]: “In the past few years, a number of laboratories have begun using natural scenes as stimuli when recording from neurons in the visual pathway. For example, David et al. (2004) have explored two different types of models [...]. These models can typically explain between 30 and 40 per cent of the response variance of V1 neurons. One could possibly obtain a better fit to the data by including additional terms [...] but it is still sobering to realize that the receptive field component per se, which is the bread and butter of the standard model, accounts for so little of the response variance. Moreover, the way in which these models fail does not leave one optimistic that the addition of modulatory terms or pointwise nonlinearities will fix matters. [...] Thus, there appears to be a qualitative mismatch in predicting the responses of cortical neurons to time-varying natural images that will require more than tweaking to resolve. What seems to be suggested by the data is that a more complex, network nonlinearity is at work here and that describing the behavior of any one neuron will require one to include the influence of other simultaneously recorded neurons.”

From Olshausen [17]: “At the end of the day we are faced with this simple truth: No one has yet spelled out a detailed model of V1 that incorporates its true biophysical complexity and exploits this complexity to process visual information in a meaningful or useful way. The problem is not just that we lack the proper data, but that we don’t even have the right conceptual framework for thinking about what is happening.
In light of the strong nonlinearities and other complexities of neocortical circuits, one should view the existing evidence for filters or other simple forms of feature extraction in V1 with great skepticism. The vast majority of experiments that claim to measure and characterize ‘receptive fields’ were conducted assuming a linear systems identification framework. We are now discovering that for many V1 neurons these receptive field models perform poorly in predicting responses to complex, time-varying natural images. Some argue that with the right amount of tweaking and by including proper gain control mechanisms and other forms of contextual modulation that you can get these models to work. My own view is that the standard model is not just in need of revision, it is the wrong starting point and needs to be discarded altogether. What is needed in its place is a model that embraces the true biophysical complexity and structure of cortical micro-circuits, especially dendritic nonlinearities.”
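As a toy numerical illustration of the stimulus dependence discussed at the start of this section, the following sketch (Python with NumPy) simulates a neuron whose response is a rectified, saturating function of a fixed linear drive, estimates the best purely linear “receptive field” model from white-noise stimuli at one contrast, and then tests that model at another contrast. The neuron model, the stimulus statistics and all parameter values are assumptions chosen only for illustration; they are not taken from any of the studies quoted above.

import numpy as np

rng = np.random.default_rng(0)
n_pix, n_stim = 64, 50_000

# "True" neuron: a fixed linear weighting followed by rectification and a
# saturating (Naka-Rushton-like) nonlinearity. All values are illustrative.
x = np.linspace(-3, 3, n_pix)
w_true = np.exp(-x**2 / 2) * np.cos(3 * x)
w_true /= np.linalg.norm(w_true)

def respond(stimuli):
    drive = np.maximum(stimuli @ w_true, 0.0)   # rectified linear drive
    return drive**2 / (drive**2 + 0.3**2)       # saturating nonlinearity

def white_noise(contrast):
    return contrast * rng.standard_normal((n_stim, n_pix))

def fit_linear_model(stimuli, responses):
    # Least-squares linear "receptive field" plus offset: the best purely
    # linear description of the neuron for this particular stimulus ensemble.
    X = np.hstack([stimuli, np.ones((len(stimuli), 1))])
    coef, *_ = np.linalg.lstsq(X, responses, rcond=None)
    return coef

def variance_explained(coef, stimuli, responses):
    X = np.hstack([stimuli, np.ones((len(stimuli), 1))])
    return 1.0 - np.var(responses - X @ coef) / np.var(responses)

low, high = white_noise(0.2), white_noise(2.0)
coef_low = fit_linear_model(low, respond(low))

# The same linear model that describes the neuron reasonably well at the
# contrast it was estimated at predicts poorly when the contrast changes.
print("fit at low contrast, tested at low contrast :",
      round(variance_explained(coef_low, low, respond(low)), 3))
print("fit at low contrast, tested at high contrast:",
      round(variance_explained(coef_low, high, respond(high)), 3))

On the ensemble it was estimated from, the linear model accounts for a reasonable fraction of the response variance; at the higher contrast the variance explained becomes negative, i.e. the linear prediction is worse than simply predicting the mean response. This mirrors, in a deliberately simple setting, the observation quoted above that the performance of a linear RF model degrades when the stimulus ensemble changes.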
11.3 Conclusion: vision-based methods are best, but we need new vision models

Now that we reach the very end of the book we hope that the reader may share our conclusions, based on the exposition and reported results in this and all previous chapters:

• Imaging techniques based on vision models are the ones that perform best for tone and gamut mapping and a number of other applications.
• The performance of these methods is still far below what cinema professionals can achieve.
• Vision models are lacking: most key problems in visual perception remain open.
• Rather than being improved or revisited, vision models seem to need a change of paradigm.

Our proposal is to explore models based on local histogram equalisation, with fine-tuning by movie professionals. We have shown that this approach yields very promising outcomes in tone and gamut mapping, but results can still be improved, and new vision models developed. Local histogram equalisation is intrinsically nonlinear, and it is closely related to theories that advocate that spatial summation by neurons is nonlinear [18,19], and to those using nonlinear time series analysis of oscillations in brain activity [20] (a minimal sketch of plain local histogram equalisation is given at the end of this section).

A change of paradigm in vision models, with intrinsically nonlinear frameworks developed by mimicking the techniques of cinema professionals, could have a very wide impact, much wider than the HDR/WCG domain, given that, as mentioned above, the L+NL formulation is prevalent not only in vision science and imaging applications; it is the basis of ANNs as well.
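For concreteness, the following is a minimal sketch, in Python with NumPy, of plain local histogram equalisation on a grayscale image: each pixel is mapped through the cumulative histogram of its own neighbourhood, so the output depends nonlinearly on the whole neighbourhood rather than on a weighted sum of it. This naive, rank-based implementation is only meant to show the intrinsically nonlinear, local character of the operation; it is not the specific formulation used in the earlier chapters, nor does it include the fine-tuning by cinema professionals advocated above.

import numpy as np

def local_histogram_equalisation(img, window=31):
    """Naive local histogram equalisation of a grayscale image in [0, 1]:
    each pixel is replaced by its rank within a surrounding window, i.e. the
    value of the local cumulative histogram at that pixel.
    O(N * window^2) looping, for illustration only."""
    h, w = img.shape
    half = window // 2
    padded = np.pad(img, half, mode="reflect")
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + window, j:j + window]
            out[i, j] = np.mean(patch <= img[i, j])  # fraction of darker-or-equal neighbours
    return out

# Tiny usage example on a random "image":
rng = np.random.default_rng(0)
print(local_histogram_equalisation(rng.random((64, 64)), window=15).shape)  # (64, 64)

Note that, unlike a linear RF, this operation cannot be written as a convolution with any fixed kernel: the mapping applied at each pixel is itself a function of the local image content.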
References

[1] Carandini M, Demb JB, Mante V, Tolhurst DJ, Dan Y, Olshausen BA, et al. Do we know what the early visual system does? Journal of Neuroscience 2005;25(46):10577–97.
[2] Olshausen BA, Field DJ. How close are we to understanding V1? Neural Computation 2005;17(8):1665–99.
[3] Graham NV. Beyond multiple pattern analyzers modeled as linear filters (as classical V1 simple cells): useful additions of the last 25 years. Vision Research 2011;51(13):1397–430.
[4] Wilson HR, Cowan JD. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal 1972;12(1):1–24.
[5] Atick JJ, Redlich AN. What does the retina know about natural scenes? Neural Computation 1992;4(2):196–210.
[6] Lindeberg T. A computational theory of visual receptive fields. Biological Cybernetics 2013;107(6):589–635.
[7] Haykin SS. Neural networks and learning machines, vol. 3. Upper Saddle River: Pearson Education; 2009.
[8] Martinez-Garcia M, Cyriac P, Batard T, Bertalmio M, Malo J. Derivatives and inverse of cascaded linear+nonlinear neural models. PLoS ONE 2018;13(10):e0201326.
[9] Solomon SG, Lennie P. The machinery of colour vision. Nature Reviews Neuroscience 2007;8(4):276.
[10] Conway BR, Chatterjee S, Field GD, Horwitz GD, Johnson EN, Koida K, et al. Advances in color science: from retina to behavior. Journal of Neuroscience 2010;30(45):14955–63.
[11] Boitard R, Smith M, Zink M, Damberg G, Ballestad A. Using high dynamic range home master statistics to predict dynamic range requirement for cinema. In: SMPTE 2018; 2018. p. 1–28.
[12] Ploumis S, Boitard R, Jacquemin J, Damberg G, Ballestad A, Nasiopoulos P. Quantitative evaluation and attribute of overall brightness in a high dynamic range world. In: SMPTE 2018; 2018. p. 1–16.
[13] Fairchild MD, Heckaman RL. Metameric observers: a Monte Carlo approach. In: Color and Imaging Conference, vol. 2013. Society for Imaging Science and Technology; 2013. p. 185–90.
[14] Van Hurkman A. Color correction handbook: professional techniques for video and cinema. Pearson Education; 2013.
[15] Wandell BA. Foundations of vision, vol. 8. Sunderland, MA: Sinauer Associates; 1995.
[16] DeAngelis G, Anzai A. A modern view of the classical receptive field: linear and non-linear spatiotemporal processing by V1 neurons. The Visual Neurosciences 2004;1:704–19.
[17] Olshausen BA. 20 years of learning about vision: questions answered, questions unanswered, and questions not yet asked. In: 20 years of computational neuroscience. Springer; 2013. p. 243–70.
[18] Poirazi P, Brannon T, Mel BW. Pyramidal neuron as two-layer neural network. Neuron 2003;37(6):989–99.
[19] Polsky A, Mel BW, Schiller J. Computational subunits in thin dendrites of pyramidal cells. Nature Neuroscience 2004;7(6):621.
[20] Andrzejak RG, Rummel C, Mormann F, Schindler K. All together now: analogies between chimera state collapses and epileptic seizures. Scientific Reports 2016;6:23000.