Fast and robust estimation of spectro-temporal receptive fields using stochastic approximations

Accepted Manuscript

Title: Fast and robust estimation of spectro-temporal receptive fields using stochastic approximations
Authors: Arne F. Meyer, Jan-Philipp Diepenbrock, Frank W. Ohl, Jörn Anemüller

PII: S0165-0270(15)00061-8
DOI: http://dx.doi.org/10.1016/j.jneumeth.2015.02.009
Reference: NSM 7151

To appear in: Journal of Neuroscience Methods

Received date: 5-6-2014
Revised date: 22-1-2015
Accepted date: 11-2-2015

Please cite this article as: Arne F. Meyer, Jan-Philipp Diepenbrock, Frank W. Ohl, Jörn Anemüller, Fast and robust estimation of spectro-temporal receptive fields using stochastic approximations, Journal of Neuroscience Methods (2015), http://dx.doi.org/10.1016/j.jneumeth.2015.02.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Highlights

• We apply stochastic approximations to receptive field (RF) estimation
• Reliable reconstruction of RFs from responses to highly non-Gaussian stimuli
• Estimation time is reduced by about 90% compared to the full solution
• Stochastic approximations preserve predictive power and tuning properties
• On-line monitoring of RF parameters for more than 30 neurons on a single computer

Fast and robust estimation of spectro-temporal receptive fields using stochastic approximations

Arne F. Meyer (a,1,*), Jan-Philipp Diepenbrock (b), Frank W. Ohl (b,c), Jörn Anemüller (a)

a Medizinische Physik and Cluster of Excellence Hearing4all, Carl von Ossietzky University, D-26111 Oldenburg, Germany
b Leibniz Institute for Neurobiology, D-39118 Magdeburg, Germany
c Institute of Biology, Otto-von-Guericke University, D-39118 Magdeburg, Germany

Abstract

Background: The receptive field (RF) represents the signal preferences of sensory neurons and is the primary analysis method for understanding sensory coding. While it is essential to estimate a neuron's RF, finding numerical solutions to increasingly complex RF models can become computationally intensive, in particular for high-dimensional stimuli or if many neurons are involved.

New Method: Here we propose an optimization scheme based on stochastic approximations that facilitates this task. The basic idea is to derive solutions on a random subset rather than computing the full solution on the available data set. To test this, we applied different optimization schemes based on stochastic gradient descent (SGD) to both the generalized linear model (GLM) and a recently developed classification-based RF estimation approach.

Results and Comparison with Existing Method: Using simulated and recorded responses, we demonstrate that RF parameter optimization based on state-of-the-art SGD algorithms produces robust estimates of the spectro-temporal receptive field (STRF). Results on recordings from the auditory midbrain demonstrate that stochastic approximations preserve both predictive power and tuning properties of STRFs. A correlation of 0.93 with the STRF derived from the full solution may be obtained in less than 10% of the full solution's estimation time. We also present an on-line algorithm that allows simultaneous monitoring of STRF properties of more than 30 neurons on a single computer.

Conclusions: The proposed approach may not only prove helpful for large-scale recordings but also provides a more comprehensive characterization of neural tuning in experiments than standard tuning curves.

Keywords: sensory coding, receptive field, auditory system, generalized linear model, stochastic gradient descent, on-line learning

* Corresponding author. Email addresses: [email protected] (Arne F. Meyer), [email protected] (Jan-Philipp Diepenbrock), [email protected] (Frank W. Ohl), [email protected] (Jörn Anemüller)
1 Present address: Gatsby Computational Neuroscience Unit, University College London, London, WC1N 3AR, United Kingdom

Preprint submitted to J Neurosci Methods, January 22, 2015

1. Introduction
Understanding what makes a neuron fire is a central question in neuroscience [1, 2, 3]. In sensory systems, this implies creating a model that can predict the neural responses not only to learned but also to novel stimuli. The primary model used to study neural encoding of sensory stimuli is the receptive field (RF; for a review see [4, 5, 6]). The RF describes how a sensory neuron integrates stimulus features and is defined as the linear part in a linear-nonlinear (LN) cascade model [7, 8]. The LN model conceptually separates how the neuron processes stimuli (filtering through the RF) from the details of how its response is generated (nonlinearity and subsequent spike generation). The linear RF stage maps the high-dimensional stimulus to a scalar value by correlating the stimulus with the stimulus pattern maximally driving that neuron, namely the RF (Fig. 1(a)). The higher the correlation between stimulus and RF, the more likely a spike is elicited. The relation between correlation and spiking probability is described by the neural nonlinearity, which transforms the filtered stimuli into a spike rate (Fig. 1(b)).

The RF has successfully been used to study processing in sensory neurons in different brain areas, e.g., the visual cortex [9, 10], the auditory midbrain [11, 12], and the auditory cortex [13, 14, 15]. However, much of our current understanding of the function of sensory areas has been obtained using parametric stimuli, such as random dots and sinusoidal gratings in vision [16, 3], and random chords and ripple combinations in audition [13, 17, 18]. The use of parametric stimuli has its own advantage, as it may significantly simplify estimation of the parameters of the LN model. For example, the spike-triggered average (STA), a linear estimator of the LN model, provides for unbiased estimation of RF parameters if the stimulus ensemble exhibits a Gaussian distribution [8, 19, 20]. Recent evidence, however, suggests that sensory cortical areas are sensitive to structural features of natural stimuli rather than to simple stimuli [21, 22, 23]. In particular, natural stimuli are encoded more efficiently than artificial stimuli [10, 11, 24], and model-based predictions based on responses to simple artificial stimuli are not sufficient to predict neural responses to natural stimuli [9, 25]. Natural stimuli, however, exhibit a statistically more complex structure than parametric stimuli [20, 26]. Thus, to characterize feature selectivity with natural stimuli and other stimuli with non-Gaussian correlations, the presence of higher-order stimulus correlations in such stimulus ensembles must be taken into account [19, 20, 27, 28, 29].

Estimators that allow estimation of the parameters of the LN model under natural stimulation typically come at the expense of elaborate optimization schemes, e.g., maximally informative dimensions (MID), which generalizes the STA to non-Gaussian stimulus ensembles and arbitrary neural nonlinearities [10, 20]. However, the need for computationally efficient methods is twofold. First, progress in neural recording techniques has motivated the development of models that include a large number of neurons [30]. While these models provide a more accurate description of neural responses, computational complexity scales linearly (no interactions) or quadratically (pair-wise interactions) with the number of neurons. Second, characterization of neural responses during experiments becomes increasingly important, e.g., for adjusting experimental parameters based on extracted frequency profiles [15], stimulation of neurons with preferred stimulus patterns [31], and (on-line) closed-loop experiments [32, 33, 34].

Our goal in this study is to apply stochastic approximations to the task of RF estimation to reduce computational complexity. The approximation scheme proposed here, stochastic gradient descent (SGD), has received much attention in the field of machine learning (see, e.g., [35, 36], for a recent overview). However, while SGD has been successfully applied in engineering applications, it remains unclear whether SGD allows robust and computationally efficient estimation of RF parameters. In particular, the complex statistical structure of natural stimulus ensembles may limit its use. To test this, we apply SGD-based optimization schemes to both the generalized linear model (GLM, [37, 27, 28]), a flexible framework that allows the inclusion of interactions between multiple neurons [30], and a previously developed classification-based RF estimation method (CbRF, [29]). Both methods have successfully been used to characterize the responses of the auditory midbrain neurons used in this study [29].

We present three key results. First, using simulated responses, we demonstrate that optimization of RF parameters using SGD produces extremely robust RF estimates, even for highly non-Gaussian stimulus ensembles such as human speech and natural images. The same stimuli lead to biased RF estimates when derived using standard linear estimators [10, 20, 38]. Second, we apply the method to recordings from the auditory midbrain in anesthetized Mongolian gerbils. We show that stochastic approximations preserve the tuning properties and predictive power of the spectro-temporal receptive field (STRF). Furthermore, obtaining an approximate STRF estimate takes only a fraction of the experimental time. Third, motivated by this, we present an extension that allows STRF estimation while recording neural responses to non-Gaussian acoustic stimuli.



2. Materials and methods


2.1. Model-based receptive field estimation

In a typical recording situation, stimulus examples s_i, i = 1, 2, ..., N, are presented while recording the evoked responses r_i. Let us assume the response r_i has already been quantized in time bins of width Δ and that the bins are properly aligned, i.e., the ith stimulus example precedes the ith time bin. In this setting, r_i may represent the number of spikes in the ith bin or, if we choose the bin width small enough, an essentially binary response with r = 1 and r = 0 indicating the presence or absence of a spike, respectively. s ∈ R^D is a D-dimensional vector representing stimulus features derived from the physical stimulus, e.g., s may contain intensity values of an image presented to a visual neuron or a spectrogram patch preceding the response of an auditory neuron as illustrated in Fig. 1(a). The model of response generation used here, a linear-nonlinear (LN) cascade model, assumes that stimulus examples s are filtered through the RF k (Fig. 1(a)). The filtered stimulus is transformed into a spiking probability using a static (memoryless) nonlinear function f,

$$p(\mathrm{spike} \mid s) \propto f(k^{T} s) \qquad (1)$$

A non-deterministic spike train is produced by an inhomogeneous Poisson process that is modulated by the time-varying spiking probability (Fig. 1(b)). Such a model is also known as the linear-nonlinear Poisson (LNP) model [8] and has successfully been applied in different modalities to characterize response properties of single neurons [8, 11, 10, 12, 15], and also populations of neurons [30].
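For illustration, response generation in the LNP model of Eq. (1) can be sketched in a few lines of Python. This is a minimal sketch: the dimensions, the Gaussian stimulus, and the exponential nonlinearity are placeholders rather than the actual experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: N stimulus examples, D stimulus features,
# bin width Delta in seconds (names and values are placeholders).
N, D, Delta = 10000, 850, 0.002

S = rng.standard_normal((N, D))      # stimulus examples, one per row
k = rng.standard_normal(D)           # "true" receptive field
k /= np.linalg.norm(k)

def f(z):
    """Expanding output nonlinearity (here: exponential)."""
    return np.exp(z)

rate = f(S @ k)                      # Eq. (1): filter, then nonlinearity
r = rng.poisson(rate * Delta)        # inhomogeneous Poisson spike counts
```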

2.1.1. Estimation of receptive field parameters
A number of techniques have been proposed to estimate the parameters of an LN model (for a review see [5, 6, 39]). Classical examples are linear estimators based on the spike-triggered average (STA) that allow estimation of the linear part of the LN model if the stimulus ensemble exhibits a Gaussian distribution [7, 40]. Information-theoretic estimators, e.g., maximally informative dimensions (MID, [20]), allow RF estimation in a linear-nonlinear model under much more general conditions at the expense of elaborate optimization schemes that may lead to suboptimal estimates [19, 29]. Here, we focus on two classes of estimators: the generalized linear model (GLM, [27, 28, 37]), which assumes a specific functional form of the neural nonlinearity and the distribution of response values, and a previously presented classification-based RF (CbRF) estimation method [29]. Both approaches can be formulated as a convex optimization problem with a unique solution and have been shown to be adequate models for describing auditory midbrain responses [29, 41].

The generalized linear model (GLM). The GLM used in this study models the response by a Poisson distribution, e.g., the number of spikes in each time bin. To keep the notation simple, we assume that each stimulus vector s contains a trailing 1, s ≡ (s_1, s_2, ..., s_D, 1)^T, and summarize RF weights k and optional bias term b in the parameter vector Θ ≡ (k_1, k_2, ..., k_D, b)^T. The filtered stimulus is thus given by z_i ≡ Θ^T s_i. The log of the probability of observing r_i spikes in the ith time interval, denoted by the log-likelihood of the model parameters L(Θ) ≡ log p({r_i} | Θ, f, {s_i}), is given by

$$L(\Theta) = \sum_i r_i \log f(z_i) - \Delta \sum_i f(z_i) \qquad (2)$$

where f is a strictly monotonic nonlinear function [28]. We found that an expanding nonlinearity, e.g., the exponential function, provides an adequate description of the spiking nonlinearity in the IC recordings used in this study [29]. Here we used a slight modification of the parametrization proposed in [42]:

$$f(z) = \begin{cases} e^{z} & \text{if } z \le 0 \\ 1 + z + \frac{z^{2}}{2} & \text{if } z > 0 \end{cases} \qquad (3)$$

This form of nonlinearity provides for a milder increase for large values of z and tends to be numerically more stable than the exponential, particularly with regard to stochastic updates. With this choice of the nonlinearity, the log-likelihood of the model parameters given r is

$$L(\Theta) = L_{-}(\Theta) + L_{+}(\Theta) \qquad (4)$$

with

$$L_{-}(\Theta) = \sum_{\forall i:\, z_i \le 0} \left( r_i z_i - \Delta e^{z_i} \right), \qquad L_{+}(\Theta) = \sum_{\forall i:\, z_i > 0} \left( r_i \log\!\left(1 + z_i + \frac{z_i^{2}}{2}\right) - \Delta \left(1 + z_i + \frac{z_i^{2}}{2}\right) \right).$$

The parameters Θ that maximize the log-likelihood in Eq. (4) correspond to the GLM's estimate of the LNP model parameters.
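A minimal Python sketch of the resulting GLM objective may look as follows. The helper `glm_negloglik_grad` is illustrative, not part of the published code, and returns the negative log-likelihood so that it can be passed directly to a minimizer.

```python
import numpy as np

def f(z):
    """Nonlinearity of Eq. (3): exp for z <= 0, quadratic expansion for z > 0."""
    return np.where(z <= 0, np.exp(np.minimum(z, 0)), 1 + z + 0.5 * z**2)

def df(z):
    """Derivative of the Eq. (3) nonlinearity."""
    return np.where(z <= 0, np.exp(np.minimum(z, 0)), 1 + z)

def glm_negloglik_grad(theta, S, r, Delta):
    """Negative log-likelihood of Eqs. (2)/(4) and its gradient w.r.t. theta.

    S is (N, D+1) with a trailing column of ones (bias term), r are the
    per-bin spike counts. A sketch; the published solver is written in C++.
    """
    z = S @ theta
    fz, dfz = f(z), df(z)
    nll = -(r @ np.log(fz) - Delta * fz.sum())
    # d/dtheta: sum_i (r_i * f'(z_i)/f(z_i) - Delta * f'(z_i)) * s_i
    g = -S.T @ (r * dfz / fz - Delta * dfz)
    return nll, g
```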

Classification-based receptive field (CbRF) estimation. The CbRF method casts the problem of RF estimation as a linear binary classification problem [29]; the classifier parameters that optimally separate stimulus examples that elicited a spike from those that did not elicit a spike determine the RF. The underlying model of response generation assumes that spikes are generated from the filtered stimulus by a noisy threshold operation,

$$y_i = \mathrm{sgn}(z_i - \eta + \epsilon) \qquad (5)$$

where sgn is the signum function and y = 1 and y = −1 denote the presence or absence of a spike, respectively. Note that we used y ∈ {−1, 1} instead of r to symmetrize the problem and to better distinguish the binary response nature from the Poisson assumption in the GLM. The neural nonlinearity is absorbed by the distribution of the noise, denoted by ε, around the threshold η (cf. dashed line in Fig. 1(c)). Note that the threshold noise is a conceptual component of the response generation model and is not used in the actual analysis.

Finding the solution to the binary encoding problem in Eq. (5) results in a non-convex optimization problem. The CbRF method optimizes a convex upper bound to the empirical error induced by the binary encoding model. In contrast to the GLM, the CbRF's loss function was not derived from a likelihood. However, it can be shown that the employed loss function

$$\ell(y, z) = \frac{\left[(1 - yz)_{+}\right]^{2}}{2\, p(y)} \qquad (6)$$

directly aims at p(spike|s) [43, 44]. p(y) is the marginal probability of observing a spike or no spike, respectively, and (v)_+ is a function such that (v)_+ is v, if v > 0, and zero, otherwise. In a previous study we found that the CbRF's loss function tends to be more robust to the actual shape of the neural nonlinearity than the GLM's nonlinear function, in particular for natural stimuli [29].
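Assuming the reconstruction of Eq. (6) above, a prior-weighted squared hinge loss and its subgradient can be sketched as follows. The function name and calling convention are illustrative; the published implementation is in C++.

```python
import numpy as np

def cbrf_loss_grad(theta, S, y, p_y):
    """Prior-weighted squared hinge loss of Eq. (6) and its subgradient.

    y in {-1, +1}; p_y holds the marginal class probability p(y_i) for
    each example. A sketch following the reconstruction of Eq. (6) above.
    """
    z = S @ theta
    margin = np.maximum(0.0, 1.0 - y * z)      # (1 - yz)_+
    loss = np.sum(margin**2 / (2.0 * p_y))
    # Subgradient: only margin-violating examples contribute.
    g = -S.T @ (y * margin / p_y)
    return loss, g
```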

2.1.2. Full batch gradient solution
Both methods can be formulated in the regularization framework

$$\Theta^{*} = \arg\min_{\Theta} \sum_i \ell(r_i, z_i) + \frac{\lambda}{2} \|\Theta_k\|_2^2 \qquad (7)$$

where ℓ is a loss function, λ determines the strength of the L2 norm regularization term, and Θ_k indicates that regularization acts only on the linear filter k of the LN model [45]. For the GLM, the loss is given by the negative log-likelihood in Eq. (4). Here, we used r to denote the observed response (y in the CbRF method above) and z is the model's prediction. The problem in Eq. (7) is convex and can be solved using batch gradient descent (BGD) on the objective. The gradient g is given by

$$g = \sum_i \nabla_{\Theta}\, \ell(r_i, z_i) + \lambda \Theta_k \qquad (8)$$

with the nabla operator ∇_Θ = (∂/∂Θ_1, ..., ∂/∂Θ_{D+1}), and Θ is updated according to

$$\Theta_{t+1} = \Theta_t - \gamma_t\, g \qquad (9)$$

with update step size γ_t, which may be adaptively chosen at each update step t. Usually, the procedure is iterated until the change in the objective is below a specific threshold.
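A plain first-order BGD loop implementing Eqs. (7)-(9) might be sketched as follows. The helper names are hypothetical, and this is a sketch only; the paper's full solution instead uses the trust region Newton method described next.

```python
import numpy as np

def batch_gradient_descent(loss_grad, theta0, S, r, lam, gamma=1e-3,
                           tol=1e-6, max_iter=10000):
    """Batch gradient descent on the regularized objective of Eq. (7).

    loss_grad(theta, S, r) must return (loss, gradient); lam is the L2
    penalty on the filter part theta[:-1] (the trailing entry is the bias).
    """
    theta = theta0.copy()
    prev = np.inf
    for _ in range(max_iter):
        loss, g = loss_grad(theta, S, r)
        loss += 0.5 * lam * np.dot(theta[:-1], theta[:-1])
        g[:-1] += lam * theta[:-1]          # regularize filter only, Eq. (8)
        theta -= gamma * g                  # update step, Eq. (9)
        if abs(prev - loss) < tol:          # stop on small objective change
            break
        prev = loss
    return theta
```

For the GLM, `loss_grad` could be, e.g., a closure over the `glm_negloglik_grad` sketch above with the bin width Δ fixed.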

A powerful modification to Eq. (9) that also uses second-order information is Newton's method:

$$\Theta_{t+1} = \Theta_t - H_t^{-1} g \qquad (10)$$

where H_t^{-1} is the inverse of the Hessian matrix, i.e., the matrix of second derivatives with respect to the parameters Θ_t computed at iteration t. This determines the optimal step size to take for quadratic optimization problems, but is expensive to compute in practice for large problems, e.g., the usually high-dimensional RF. We used a trust region Newton conjugate gradient algorithm that uses approximate Newton steps in the beginning and has been shown to outperform other state-of-the-art algorithms like L-BFGS in terms of accuracy and estimation time [46, 47]. An optimal solution can be obtained in O(log(1/ε)) iterations. In all experiments, we used ε = 0.01. Expressions for the loss function, gradient, and Hessian matrix are derived in the appendix for GLM and CbRF, and summarized in Table 1.

2.1.3. Stochastic approximations
The goal is to find approximate solutions to the convex optimization problem in Eq. (7). We propose an approach based on stochastic gradient descent (SGD): instead of calculating the average gradient using all stimulus examples (cf. Eq. (8)), an approximation based on a randomly selected subset is used [48, 35]. The principle is illustrated in Fig. 2(a). Here, we consider updates based on a single stimulus-response pair {s_t, r_t}, giving rise to the subgradient

$$g_t = \nabla_{\Theta}\, \ell(r_t, z_t) + \lambda \Theta_k \qquad (11)$$

with update equation Θ_{t+1} = Θ_t − γ_t g_t. Setting the learning rate γ in the resulting first-order optimization problem is not trivial: setting it too high may lead to divergence, while setting it too small may result in slow convergence during optimization. A common heuristic for estimating a good learning rate at each iteration of gradient descent is based on annealing [35, 49, 50]. Here, we use a 1/t-like learning rate schedule

$$\gamma_t = \frac{1}{1 + \lambda t} \qquad (12)$$

which is guaranteed to reach asymptotic convergence for t → ∞ for meaningful values of λ [50]. However, as detailed in Section 2.1.5, λ is found using cross-validation and invalid parameters are automatically excluded due to their low predictive power.

Polynomial decay-averaging SGD (ASGD). A typical problem with a plain SGD algorithm as described above is the noisy and slow convergence to the optimal model parameters in terms of training examples, even if the optimal learning rate is used. However, by employing an averaging procedure the convergence rate can be significantly improved [35, 51]. Here, we use polynomial decay-averaging [51]

$$\bar{\Theta}^{t+1} = \left(1 - \frac{\eta+1}{t+\eta}\right) \bar{\Theta}^{t} + \frac{\eta+1}{t+\eta}\, \Theta^{t+1} \qquad (13)$$

where η is a constant that reduces the weight of earlier iterates compared to recent ones. Θ^t is the parameter vector obtained using the simple SGD. For η = 0, Eq. (13) yields an incrementally averaged SGD. However, by choosing η > 0 the "memory" about preceding examples decays in a polynomial manner and we theoretically obtain an O(1/T) convergence rate. In all experiments, we used η = 2 and refer to this variant as averaged stochastic gradient descent (ASGD).

Example. Fig. 2(b) shows the evolution of the misclassification error as a function of the number of iterations for BGD, SGD, and ASGD. A single iteration corresponds to the processing of a single stimulus-response pair. BGD updates model parameters after a complete cycle over all stimulus-response pairs. SGD and ASGD perform model updates for each stimulus-response pair and tend to be noisy. However, the employed polynomial decay-averaging of the ASGD significantly reduces fluctuations and converges even faster than the BGD solution. Thus, ASGD optimizes the objective to a good precision quite fast and, in case the resulting solution is not accurate enough, a small number of subsequent BGD update steps will quickly reach the optimal solution [52].
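The complete stochastic update of Eqs. (11)-(13) fits in a short routine. The following Python sketch combines the 1/t-like learning rate schedule with polynomial decay-averaging; the function names and data layout are assumptions, while the update rules follow the equations above.

```python
import numpy as np

def asgd(loss_grad, theta0, S, r, lam, eta=2, n_epochs=1, rng=None):
    """SGD with polynomial decay-averaging, following Eqs. (11)-(13).

    loss_grad(theta, s, r) returns the per-example subgradient of the loss;
    lam is the regularization strength, eta the averaging constant (eta = 2
    in the paper). Returns the averaged parameter vector. A sketch only.
    """
    rng = rng or np.random.default_rng()
    theta = theta0.copy()        # plain SGD iterate, Theta^t
    theta_bar = theta0.copy()    # averaged iterate, Theta-bar^t
    t = 0
    N = len(r)
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            t += 1
            gamma = 1.0 / (1.0 + lam * t)             # Eq. (12)
            g = loss_grad(theta, S[i], r[i])
            g[:-1] += lam * theta[:-1]                # Eq. (11), filter only
            theta -= gamma * g                        # SGD update
            w = (eta + 1.0) / (t + eta)               # Eq. (13)
            theta_bar = (1.0 - w) * theta_bar + w * theta
    return theta_bar
```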


2.1.4. Implementation details
While SGD as described above is straightforward to apply, some implementation details may nevertheless improve performance significantly [50]. First, the "truncated" negative part of the CbRF's loss function simplifies the gradient update in Eq. (11) to regularization of the RF parameters if the stimulus example does not violate the margin constraint (Eq. 6). If we represent the RF k as c_k k_0, the gradient update step in this case reduces to rescaling c_k. Second, instead of weighting stimulus examples by their inverse prior as for the full solution, we alternate examples of both classes and repeat examples of the minor class. Thus, we avoid amplification of noisy update steps, which may result from multiplications with inverted low spike probabilities.

The BGD algorithm has been adapted from the Liblinear library [47]. SGD and ASGD algorithms were custom written. All algorithms were implemented in C++ and interfaced to Python using Cython [53]. Data handling, processing, and visualization were done in Python using the packages NumPy [54], scikit-learn [55], and Matplotlib [56], respectively. Code and example data sets will be made publicly available at https://github.com/arnefmeyer/online_strf.

2.1.5. Model selection
The solution of Eq. (7) depends on the choice of the hyperparameter λ. Thus, we trained models for 25 different values of λ and automatically selected the hyperparameter in a 5-fold cross-validation scheme. The parameter that yields the highest average predictive power was then used to estimate the final RF using all data. For the GLM, we used the hyperparameter that maximizes the log-likelihood in Eq. (4). As a result of prior-based weighting of errors, the area under the receiver operating characteristics curve (AUC) is a natural metric for parameter selection for the CbRF method [29, 41].
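The selection scheme can be sketched with standard tooling. Below, `fit` and `score` stand in for the estimator and the respective predictive-power metric (log-likelihood for the GLM, AUC for the CbRF method) and are assumptions of this sketch, as is the range of the λ grid.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_lambda(fit, score, S, r, lambdas, n_folds=5):
    """Hyperparameter search sketch following the scheme of Sec. 2.1.5.

    fit(S, r, lam) returns a parameter vector; score(theta, S, r) returns
    the predictive power on held-out data. Both callables are assumptions.
    """
    mean_scores = []
    for lam in lambdas:
        scores = []
        for train, test in KFold(n_splits=n_folds).split(S):
            theta = fit(S[train], r[train], lam)
            scores.append(score(theta, S[test], r[test]))
        mean_scores.append(np.mean(scores))
    best = lambdas[int(np.argmax(mean_scores))]
    return fit(S, r, best)   # final RF estimated on all data

# e.g., 25 logarithmically spaced values (an illustrative grid)
lambdas = np.logspace(-6, 2, 25)
```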

2.2. Normalized subspace projection
We quantified similarity between RF estimates k̂ and the true RF (simulations) or the full RF estimate (recordings), denoted by k_true, in terms of the normalized subspace projection [20]

$$\beta = \frac{\hat{k}^{T} k_{\mathrm{true}}}{\|\hat{k}\|_2\, \|k_{\mathrm{true}}\|_2} \qquad (14)$$

where ‖·‖_2 denotes the L2-norm. β may also be regarded as a correlation, where a value close to 1 indicates high similarity and a value close to 0 indicates that estimate and reference are effectively uncorrelated. Thus, this measure summarizes similarity in a single value and, therefore, is suitable to describe performance in neural subpopulations.
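In code, Eq. (14) is a one-liner; the following sketch uses plain NumPy:

```python
import numpy as np

def subspace_projection(k_hat, k_true):
    """Normalized subspace projection of Eq. (14)."""
    return (k_hat @ k_true) / (np.linalg.norm(k_hat) * np.linalg.norm(k_true))
```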

2.3. Electrophysiological experiment

2.3.1. Ethics statement
All experiments were conducted in accordance with the international National Institutes of Health Guidelines for Animals in Research and with ethical standards for the care and use of animals in research defined by the German Law for the protection of experimental animals. Experiments were approved by an ethics committee of the state Saxony-Anhalt, Germany.

2.3.2. Stimulus generation and pre-processing
We used two different stimulus ensembles composed of frequency-modulated (FM) tones. In the first ensemble, FM tones with randomly chosen starting and ending frequencies were arranged in 100 ms blocks. The second ensemble comprises continuously starting FM tones. The total length of the experiment was 500 s. Fig. 1(a) shows a 1 s segment of an FM tone complex stimulus. Signals were presented to the animal free field through a calibrated Canton Plus XS.2 speaker in a double-walled sound booth.

For the analysis, stimuli were transformed into their time-frequency representation using a gammatone filter bank with a temporal resolution of 2 ms [57]. To mimic cochlear processing, log-like compression was applied to the envelope of the filter bank output. The spectro-temporal patch covered 50 ms preceding the response and spanned the frequency range between 500 Hz and 8000 Hz. Stimulus examples were created by recasting the spectrogram patches preceding the response as vectors. The extent of the spectro-temporal patch is 25 temporal steps x 34 frequency channels, resulting in 850-dimensional stimulus vectors. The stimulus ensemble is highly non-Gaussian, with a long tail towards negative amplitude values and higher-order correlations across time and frequency.

2.3.3. Gerbil recordings
Extracellular recordings were made from neurons in the inferior colliculus (IC) in anesthetized Mongolian gerbils (Meriones unguiculatus). For a detailed description of the surgical procedure, electrophysiological recordings, and stimulus generation please see [58] and [29]. Briefly, single-unit recordings in the IC were made in 31 ketamine/xylazine-anesthetized adult male Mongolian gerbils with tungsten electrodes (3-4 MΩ) using a Plexon Multichannel Acquisition Processor (Plexon Inc). Single spiking units were assigned on-line using the software Rasputin (Plexon Inc) and verified off-line using the software Offline Sorter (Plexon Inc). All units were well separated. Only units that on average produced at least one spike per second were considered for the analysis.
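The patch-vectorization step of Section 2.3.2 can be sketched as follows. This is a simplified illustration; the gammatone filtering and log-like compression are assumed to have been applied to `spec` beforehand, and the function name is hypothetical.

```python
import numpy as np

def make_stimulus_matrix(spec, n_lags=25):
    """Recast spectrogram patches preceding each time bin as vectors.

    spec: (n_bins, n_channels) log-compressed spectrogram, e.g., 34 channels
    at 2 ms resolution, so that n_lags = 25 covers 50 ms of history.
    Returns S of shape (n_bins - n_lags, n_lags * n_channels).
    """
    n_bins, n_channels = spec.shape
    S = np.stack([spec[i:i + n_lags].ravel()
                  for i in range(n_bins - n_lags)])
    return S
```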

3. Results

3.1. Receptive field estimation from simulated responses

3.1.1. Natural stimulus ensembles
To test and compare the performance of stochastic approximations and the full BGD solution, we simulated responses using two different model cells and compared RF estimates to the respective true RF. The first model cell (Fig. 3(a)) represents a neuron sensitive to narrowband stimulus onsets found at different stages of the auditory system, e.g., in the inferior colliculus (IC) [17, 12] and the primary auditory cortex (A1) [15, 59, 60]. As an example of a non-Gaussian stimulus we used human speech from the TIMIT corpus [61]. 100 s of human speech was transformed into its time-frequency representation with subsequent log-like compression of the envelope. Stimulus examples were created by recasting spectro-temporal patches preceding the response in a 40 ms time window as vectors. Spike trains were generated by applying a static nonlinearity to the output of the linear filter, which drives an inhomogeneous Poisson process. For the GLM, we used the nonlinearity used for the analysis (see Materials and Methods) and for the CbRF method a sigmoid function.

Fig. 3(b) shows STRF estimates obtained with BGD and SGD solutions. For both GLM and CbRF, SGD-based estimators accurately recovered the true STRF, indicated by the high correlation between estimates and true STRF in the lower left corners. For comparison, we also estimated an STRF using Ridge regression, a linear estimator that allows RF estimation for Gaussian stimulus ensembles even in the case of second-order correlations across stimulus dimensions [62]. However, the Ridge estimator does not yield a proper reconstruction of the true STRF due to violations of the Gaussian assumption (Fig. 3(c)).

The second model cell, with a Gabor-like RF, mimics response properties of neurons in the primary visual cortex (Fig. 3(d)). The stimulus ensemble was created by randomly sampling 20 x 20 patches from the van Hateren natural image database [63]. As in the previous experiment, spike trains were generated by applying static nonlinearities to the filtered stimulus with subsequent Poisson spike generation. The results are shown in Fig. 3(e). For both GLM and CbRF, the full BGD solution and SGD approximations produced estimates that are highly correlated with the true RF. Ridge regression (cf. Fig. 3(f)) is biased due to the non-Gaussian nature of natural images [20, 26, 64]. Thus, SGD-based optimization schemes allow reliable reconstruction of LN model parameters, even for highly non-Gaussian natural stimulus ensembles.

3.1.2. Beyond the static receptive field: adaptive tracking of changes in receptive field structure
The traditional RF as defined in Eq. (1) assumes that neural response properties are time-invariant. However, neurons in the auditory cortex have been found to be highly adaptive, e.g., adapting to stimulus statistics [65, 11, 15] or to performance of a specific task [14]. The recurrent nature of stochastic gradient learning facilitates an effective way of characterizing changes in neural responses. Similar approaches based on recursive least mean squares filtering have previously been applied to study time-varying properties of neural responses, e.g., temporal RFs in the visual system [66] and spatial RFs in hippocampal neurons [67]. However, as demonstrated in the previous experiment, both the GLM and the CbRF method are also applicable to stimulus ensembles for which linear estimators produce biased estimates (cf. Fig. 3).

Here, we present a simple toy example to demonstrate the adaptive filtering capabilities of the CbRF approach. Assume a neuron abruptly changes its response from RF k_1 to RF k_2 as illustrated in Fig. 4(a). A similar neural behavior has been observed in the early visual pathway [66]. Characterizing the system using a stationary estimator based on the whole stimulus sequence produces the RF estimate shown in Fig. 4(b). While the resulting RF may be considered meaningful, it does not reflect the response characteristics of the underlying system. Learning RF parameters by recurrent SGD-based updates yields high correlations with the respective true RF (cf. Fig. 4(c)). The correlations asymptotically approach the correlation between (optimal) RFs estimated for each part separately using the batch method. For the batch method, however, this requires detailed knowledge of when the system changes, whereas SGD-based filtering automatically adapts to such changes. Note that it is necessary to use a lower bound on the learning rate in Eq. (12) to allow tracking of such abrupt changes for long stimulus sequences; otherwise, the learning rate would go to zero, effectively preventing updates of RF parameters. We stopped decrementing the learning rate after 3000 stimulus examples. In real experiments, the minimum learning rate may be found by cross-validation on training data.

3.2. Application to neural recordings from the auditory midbrain
Next, we compared SGD approximations and the full BGD solution using neural recordings from the IC in anesthetized Mongolian gerbils. Stimuli consisted of FM tone complexes with non-Gaussian statistics (see Materials and Methods). Fig. 5(a) shows STRF estimates for three different IC units obtained using the GLM and the CbRF method. For the SGD-based approaches, different numbers of iterations over the stimulus set are shown (0.1 and 2.5). Even for a seemingly small number of iterations, STRFs derived using SGD resemble the structure of BGD-based estimates. In particular, STRFs derived using ASGD also exhibit details of the full solutions, while plain SGD provides a less detailed characterization. Increasing the number of iterations to 2.5 produces STRF estimates virtually indistinguishable from the full BGD solution.

We also extracted frequency profiles from two STRFs (units B and C in Fig. 5(a)) and compared them to pure tone tuning curves measured for the same units (Fig. 5(b)). Pure tone tuning curves were created by presenting pure tones of length 0.1 s at the average sound pressure level of the STRF stimuli and counting the number of spikes evoked during the presentation. STRF frequency profiles were extracted at the temporal position of the excitatory peak indicated by the rectangles in Fig. 5(a). For both units, STRF frequency profiles reveal substantially finer tuning than pure tone tuning curves. Apparently, the FM tone complex stimuli activate sideband inhibition, in particular above the best frequency, that helps to sharpen frequency tuning of neural responses [68]. Such an effect requires at least a tone pair and cannot be produced by single pure tones. Thus, rapid STRF-based characterization of neural response properties may provide complementary information to standard pure tone tuning curves during experiments.

3.2.1. How close can we get to the full solution?
Here, we performed a quantitative analysis of the approximation performance of SGD-based algorithms as compared to the full BGD solution. We measured similarity between the different optimization schemes in terms of the correlation between STRF estimates (cf. Eq. 14). Results for 70 IC units are shown in Fig. 6(a). For the GLM, 0.5 SGD (ASGD) iterations over the training set are sufficient to obtain a mean correlation of 0.82 (0.89), and 2.5 iterations yield a mean correlation of 0.84 (0.90). The CbRF method yields slightly higher mean correlation values of 0.87 (0.92) for 0.5 iterations and 0.89 (0.93) for 2.5 iterations, which denotes an almost perfect approximation. For increasing numbers of iterations, there is no noticeable increase in correlation. Note that we did not smooth the STRFs prior to the analysis.

We also analyzed the prediction performance of the different optimization schemes. Predictions were quantified using the mutual information (MI) between stimulus and response in a 4-fold cross-validation setting. MI provides a model-independent measure of the amount of information transmitted by a single spike [20, 69]. Since MI depends on the total amount of information in the response, we normalized the mean MI across all four folds for each unit by the mean MI predicted by the BGD-based STRF estimate. Thus, by construction the BGD approach yields a normalized mean MI of 1 for each unit. A value higher (lower) than 1 indicates that SGD-based estimates yield a better (worse) prediction of the neural response than BGD-derived estimates. Fig. 6(b) summarizes the mean and standard deviation of normalized predicted MI as a function of the number of SGD iterations over the data set. For both GLM and CbRF, the ASGD method yields noticeably better predictions than plain SGD. In many cases, ASGD-based STRF estimates even predicted higher MI values than BGD-derived STRFs, and normalized MI averaged across all units is virtually one. Thus, ASGD closely resembles the prediction performance of the batch method for both GLM and CbRF, even for a single iteration over all stimulus-response pairs.

Having shown that approximations are capable of producing STRF estimates that closely resemble properties of BGD-based STRFs, the time required to obtain these estimates is another relevant criterion, particularly in view of application during experiments. Therefore, we measured the estimation times of SGD, ASGD, and BGD and compared them across all 70 stimulus-response sets (Fig. 6(c)). 100% corresponds to approximately 62 minutes on a single core of an Intel i7 processor. In general, the ASGD approach takes more time than plain SGD due to the additional averaging step (cf. Eq. 13). However, one ASGD iteration over the training set takes only about half the recording time, or 8% of the time required for the BGD solution, for both GLM and CbRF. On average, updates for the CbRF method can be computed slightly more efficiently (see Materials and Methods), but computation of the AUC during hyperparameter selection is computationally more expensive than computation of the log-likelihood for the GLM. Both methods therefore show comparable performance in terms of training time, allowing extremely rapid approximation of model parameters.

3.2.2. Stochastic approximations preserve tuning properties
The correlation used in the previous analysis measures the average similarity between STRF estimates. Characterizing response properties using the STRF in experiments may involve exact determination of tuning parameters like the best frequency (BF) and spectral bandwidth (BW). These parameters are only coarsely captured by the correlation. Next, we extracted BF and BW from STRFs derived using the BGD method and tested whether SGD-based STRFs reliably reveal the same parameters. Following [68], we calculated tuning parameters by setting negative STRF values to zero and projecting the remaining values along the temporal direction onto the frequency axis. Both BF and excitatory BW (eBW) were estimated from the resulting frequency vector after smoothing with a symmetric 4-point Hann window. The BF was given by the frequency at the maximum projection, and the eBW was determined by the full frequency width at half maximum projection. As indicated in Fig. 5, STRFs derived from IC data reveal only marginal inhibitory components; therefore, inhibitory BW analysis has been omitted.

Fig. 7(a) compares BF and eBW extracted from STRFs produced by the GLM for BGD, SGD, and ASGD. While BFs extracted from STRFs produced by a BGD-optimized GLM are well approximated by STRFs derived using both SGD and ASGD (as indicated by the high correlation), plain SGD tends to reveal a significantly broader eBW compared to BGD (p < 3 · 10^-3; paired Wilcoxon test). There was no significant difference between ASGD and BGD (paired Wilcoxon test; α = 0.05). Results for the CbRF method are qualitatively and quantitatively similar, except for a slightly better performance of the ASGD scheme for extraction of eBW parameters. In summary, ASGD-based STRFs preserved both BF and eBW as extracted from the full solution and are thus able to capture details of the RF encoding model.
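The extraction recipe of Section 3.2.2 translates directly into code. The following sketch mirrors the steps above; the exact smoothing and half-maximum interpolation details of the original analysis are assumptions.

```python
import numpy as np

def tuning_parameters(strf, freqs):
    """BF and excitatory BW from an STRF, following the recipe of Sec. 3.2.2.

    strf: (n_freqs, n_lags) estimate; freqs: channel center frequencies in Hz.
    """
    profile = np.maximum(strf, 0).sum(axis=1)   # rectify, project onto freq axis
    profile = np.convolve(profile, np.hanning(4), mode="same")  # 4-point Hann
    bf = freqs[np.argmax(profile)]              # best frequency: peak position
    above = profile >= 0.5 * profile.max()      # full width at half maximum
    ebw = freqs[np.where(above)[0][-1]] - freqs[np.where(above)[0][0]]
    return bf, ebw
```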

3.2.3. Convergence properties
The above analysis compared the different optimization strategies in the event of a large number of collected stimulus-response pairs. An important property of an estimator, however, is how its performance scales with sample size, particularly if data are expensive or time is a crucial factor. To test the performance in such scenarios, we split the data into a training set (70%) and a validation set (30%), estimated STRFs using 5%, 10%, ..., 100% of the training data, and calculated the predicted MI between stimulus and response on the validation set. Note that MI may become highly biased for small sample sizes [19, 41]. To reduce the bias, we subtracted a first-order approximation of the bias term from the estimated MI value [70]. Moreover, for every unit we normalized the MI predicted by partial STRF estimates by the MI predicted by the 100% BGD-based STRF estimate. Thus, the obtained convergence curves represent relative convergence in terms of prediction performance relative to the BGD solution.

Fig. 8(a) and Fig. 8(b) show convergence curves averaged across all IC units for GLM and CbRF, respectively. For SGD and ASGD, we used 2.5 iterations over the data set. In both cases, optimization based on BGD reveals slightly faster convergence than ASGD. The performance of plain SGD is noticeably lower than for both BGD and ASGD. Note that BGD revealed slightly better prediction performance on all data sets (cf. Fig. 6(b)), which may explain the almost constant small gap between BGD and ASGD performance. The full BGD solutions for both the GLM and the CbRF method show very similar convergence properties, with the CbRF method exhibiting slightly better performance for small sample sizes. In the large-data regime, GLM and CbRF have been found to perform almost identically on the tested data set [29].

3.2.4. Monitoring spectro-temporal receptive field properties during experiments
An implication of the results in Fig. 6 is that finding approximate solutions highly correlated with the full solution requires only a small fraction of the experimental time. Next, we describe how stochastic approximations can be used to estimate STRFs while recording neural responses. We refer to this scenario as on-line STRF estimation, whereas the above experiments represent an off-line analysis.

Essentially, both approximation methods are directly applicable to on-line scenarios: incoming stimulus-response pairs are used to update the models (one model for each of the 25 regularization hyperparameters). In the above approach, we selected the optimal regularization parameter in a 5-fold cross-validation setting and estimated the final STRF using all data. Thus, the only modification is that we do not have a part of the data available as a validation set to find the regularization hyperparameter. One possibility is to record the response to a short sequence and use this sequence as a fixed validation set. However, the neural response may change over time, e.g., through adaptation to stimulus statistics [10, 65] or anesthesia effects [71, 72], and this should be reflected in the validation set. Using simulated responses, we have demonstrated that the recurrent nature of SGD allows such changes to be tracked automatically (see Section 3.1.2). Therefore, we used the following approach: data are collected in 2 s intervals and used to update the models, and every 12 seconds updating is interrupted and the collected data are put aside as a test set. Thus, the models are evaluated on the accumulated test data. The model with the highest predictive power represents the best instantaneous estimate of the STRF. We limited the total size of the test set to 30 s and updated it with new data using the first in, first out method.

We used the ASGD-optimized CbRF method because ASGD is less prone to fluctuations resulting from stochastic gradient update steps (cf. Fig. 2(b)) and showed better performance than plain SGD, in particular for the CbRF method. In each update step, the models are updated using one iteration over the training set. Only units that produced a mean spike rate of at least two spikes per second were considered in the analysis, resulting in 51 data sets. To mimic a real recording situation, we also modeled data acquisition and intermediate processing steps.

Fig. 9(a) shows the temporal STRF evolution for two example units and the corresponding BGD STRFs obtained using all data. In both cases, the on-line algorithm was able to capture the detailed STRF structure after 160 s. The relative convergence in terms of correlation with the full BGD solution averaged across the 51 data sets is shown in Fig. 9(b). For comparison, we also calculated the relative convergence curve for the off-line ASGD method, which can be considered an upper bound. Note that the on-line algorithm uses approx. 20% of the data for validation, whereas the off-line method uses all data. However, the on-line approach yields a mean correlation of 0.80 after 300 s, which is comparable to the 0.9 achieved by the off-line method.

Extension to simultaneously recorded neurons. In the above analysis, we focused on fast and robust characterization of single neurons' responses. Another potential application is modeling responses of simultaneously recorded neurons. We parallelized the algorithm to allow concurrent estimation of multiple independent STRFs on a single computer (two Intel XEON 2620 processors, 6 cores each). Fig. 9(c) shows the algorithm run time per update step for different numbers of simultaneously estimated STRFs. For numbers of units larger than the total number of CPU cores, we used multiple threads on each core per update step. We also estimated the maximum number of units that can be estimated simultaneously (cf. Fig. 9(d)). We found that the proposed algorithm allows estimation of 850-dimensional STRFs for up to 33 units on a single computer. Repeating the same experiment with reduced STRF dimensionality (440 dimensions) allowed simultaneous STRF estimation for up to 55 units (data not shown). A video showing simultaneously estimated STRFs for 16 IC units is provided in the supplementary material.
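The structure of the on-line procedure can be summarized in pseudocode-like Python. Everything below (the chunked `stream`, the `update` and `evaluate` callables, and the model representation) is illustrative scaffolding around the timing scheme described above: 2 s updates, 12 s test-set accrual, and a 30 s first-in-first-out cap.

```python
import numpy as np

def online_strf_loop(stream, lambdas, update, evaluate,
                     test_interval=12.0, max_test=30.0):
    """On-line estimation loop sketched from Sec. 3.2.4 (all names assumed).

    stream yields (duration_s, S, r) chunks, e.g., every 2 s; update performs
    one ASGD pass for a given lambda; evaluate returns predictive power.
    """
    models = {lam: np.zeros(851) for lam in lambdas}  # 850 features + bias
    test_set, test_len, since_test = [], 0.0, 0.0
    for duration, S, r in stream:
        since_test += duration
        if since_test >= test_interval:           # every 12 s: hold data out
            test_set.append((duration, S, r))
            test_len += duration
            while test_len > max_test:            # cap test set at 30 s, FIFO
                old = test_set.pop(0)
                test_len -= old[0]
            since_test = 0.0
        else:                                     # otherwise: update all models
            for lam in lambdas:
                update(models[lam], S, r, lam)
        if test_set:                              # best instantaneous STRF
            scores = {lam: evaluate(models[lam], test_set) for lam in lambdas}
            best = max(scores, key=scores.get)
        else:
            best = lambdas[0]                     # no test data accrued yet
        yield models[best]
```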

3.2.5. Tracking changes in the spectro-temporal receptive field
As demonstrated in Section 3.1.2 using simulated data, the recurrent nature of SGD provides a means to track changes in RF properties. Next, we demonstrate that the above on-line algorithm also allows tracking of changes in the STRF for IC recordings. In a previous study we found that short-term STRF estimates derived from our IC recordings reveal virtually no temporal variability for FM sweeps [73]. Thus, we recorded responses to both FM sweeps and dynamic moving ripple (DMR) stimuli for some IC units (Fig. 10(a,c)). DMRs exhibit a different stimulus structure than FM sweep stimuli, and STRFs derived from DMRs reveal different characteristics, specifically stronger inhibitory components surrounding the neuron's best frequency (Fig. 10(b,d), top row). The STRF estimates produced by the ASGD-based on-line algorithm successfully recovered the stimulus-dependent structures of the full BGD-based STRF estimates for the two example units (Fig. 10(b,d), bottom row).

We also simulated changes in the system by switching the response between randomly chosen IC units after 240 s while keeping the stimulus ensemble constant (FM sweeps, Fig. 10(e)). Fig. 10(e) shows the correlation between on-line and BGD STRF estimates for 30 runs with randomly chosen IC units. Similarly to Fig. 9(b), the on-line algorithm yields a mean correlation of about 0.8 with BGD-derived STRFs in the first part of the experiment. After switching the responses, the correlation drops to about zero but recovers to 0.72 by the end of the second part of the experiment. In all experiments, we ceased to decrease the learning rate in the SGD algorithm (Eq. 12) after 100 s.


4. Discussion


The goal of this study was to determine whether stochastic approximations provide a robust and computationally efficient way to estimate parameters in neural encoding models. In many cases, encoding models can be formulated as a convex optimization problem, which can be fitted using batch gradient descent (BGD). Here we thoroughly tested different optimization algorithms based on stochastic gradient descent (SGD) as an alternative way to find numerical solutions for the receptive field (RF), which is defined in a linear-nonlinear cascade model. In particular, we applied SGD to the generalized linear model (GLM; [27, 28, 30]) and a classification-based RF (CbRF) estimation method [29]. While SGD-based approaches permitted efficient computation of spectro-temporal receptive field (STRF) parameters, they also preserved many of the full solution's desirable properties, namely robustness to non-Gaussian stimulus ensembles, predictive power, and tuning properties, and also exhibited comparable convergence properties.

We applied BGD and SGD-based algorithms to both simulated responses to natural stimuli and recordings from the auditory midbrain, the inferior colliculus (IC), in anesthetized Mongolian gerbils. Stimuli used in the recordings mimicked properties of natural acoustic signals, namely a non-Gaussian distribution of amplitude values and higher-order correlations across time and frequency (see [29] for a detailed description). While linear estimators have been found to result in biased STRF estimates for these data, SGD-based estimation, in particular ASGD, reliably produced STRFs closely resembling those produced by BGD-based estimators. Consequently, the proposed stochastic approximations also enable rapid and robust estimation of LN model parameters in such situations.

We found that a simple averaging scheme yields a significant gain in performance compared to plain SGD at the expense of only a slight increase in computational complexity. On average, plain SGD required only about 5% of the BGD solution's time, and averaging increased training time to about 8% (corresponding to approximately 50% of the experimental recording time). We used polynomial decay-averaging SGD (ASGD), a running average scheme which converts SGD iterates to a solution with theoretically optimal optimization accuracy [51]. In particular, ASGD also captured details of the full BGD solution, while plain SGD provided a rather coarse approximation. This may become important in situations in which we are interested in detailed neural characteristics, e.g., extraction of tuning parameters of IC neurons from STRFs as described in this study. Note that even in cases in which ASGD would not produce parameter estimates sufficiently close to the optimal solution, one may still reduce computation time by using a single SGD or ASGD iteration over the data set to initialize a BGD algorithm [52].

From a technical perspective, the SGD approach is both very simple to implement and highly efficient in terms of computational cost. ASGD-based estimation of an 850-dimensional STRF from approximately 50000 stimulus examples took about 300 s on a single processor core and about 55 s on multiple processor cores (Intel i7-3770 processor with four cores). Motivated by this, we extended the approach to on-line scenarios in which STRF parameters were successfully learned while recording the response. The main limitation was imposed by the time required to find the optimal regularization parameter, which increases with increasing validation set size. Using an upper limit on the total validation set size provided a good trade-off between algorithm run time and estimation accuracy. In all experiments, we used a rather large number of stimulus features, resulting from the high temporal resolution of 2 ms used to account for the high temporal precision of IC neurons [17, 12]. In many situations a considerably lower spectral and temporal resolution may be sufficient for RF characterization.

Using the STRF in experiments to characterize neural encoding properties, e.g., neural tuning parameters, may become important as experimental paradigms shift towards the use of natural stimuli [9, 10, 74, 75]. The auditory system is adapted to the use of structural regularities in natural stimuli, e.g., co-modulations across frequency [21] and logarithmic scaling of spectro-temporal modulations [11]. Stimuli mimicking properties of natural stimuli may reveal different tuning properties than simple artificial stimuli. We have found sideband inhibition in neurons in the gerbil IC manifesting a noticeably finer tuning compared to pure tones when probed with FM tone complexes. This has also been observed in neurons in the avian auditory midbrain that exhibit extra-classical response field characteristics consisting of sideband excitation or sideband inhibition [68]. Moreover, such refined tuning presumably occurs for a wide range of complex stimuli. Thus, if the experimental stimulus class exhibits a rather complex structure and we would like to adapt stimulus parameters to the neural response, RF-based response characterization may provide a more reliable description. The proposed stochastic approximations allow rapid and robust RF estimation in such situations.

4.1. Relation to previous studies
In this study we used SGD to efficiently compute the parameters of an LN model. Previous studies used least mean squares (LMS) filtering to study adaptive responses in the visual system [66] and in the hippocampus [76]. Although these studies primarily analyzed adaptive processing, LMS may also be considered an SGD optimization scheme with a least squares loss function (cf. Eq. 11). Thus, just as LMS generalizes linear regression to time-varying processes, the proposed recurrent SGD scheme allows dynamic inference of LN model parameters. Under Gaussian stimulation, LMS and the SGD-based approaches presented in this study should produce almost identical results [20]. However, sensory neurons, in particular in cortical areas, have been found to strongly adapt to structural features in natural stimuli [10]. Thus, LMS-based approaches are restricted to regions that respond well to Gaussian stimulus ensembles.

Ramirez and Paninski proposed a fast computation scheme for the GLM using an approximate "expected log-likelihood" (EL) [77]. It relies on the canonical form of the GLM, i.e., the exponential nonlinearity for the Poisson GLM used in this study. While the linear term has to be evaluated only once for each parameter vector, the sum over the nonlinear term in Eq. (2), which represents the main burden in the computation of the log-likelihood, is replaced by its expectation. The integral occurring in the expectation can in some cases be calculated analytically, or an approximation can be computed off-line and used via a lookup table. This allows efficient computation of the log-likelihood for many common GLMs. For GLMs with non-canonical nonlinearities, e.g., the nonlinearity used in this study (cf. Eq. 3), the proposed stochastic approximation scheme represents a computationally efficient alternative.

A possible application of the proposed approach emerging from its on-line capability is optimal experimental design [32, 33]. In this setting, stimulus parameters are automatically adapted to neural response properties, e.g., by specifying an RF-based prior on stimulus parameters. Such an approach has previously been applied to visual responses to Gaussian stimulus ensembles [34]. However, the underlying linear coding model prevents application to non-Gaussian stimulus ensembles. For example, the IC responses to FM tone complexes used in this study result in biased STRF estimates when analyzed with linear estimators [29]. The proposed SGD-based algorithm may allow the inclusion of more complex stimulus ensembles, e.g., FM tone complexes and potentially even natural stimuli.

488

4.3. Extensions

478 479 480 481 482 483 484 485 486

489 490 491 492 493 494 495

Ac ce

477

pt

487

Recent advances in neural recording techniques allow the simultaneous recording of signals from hundreds of neurons. If all neurons are treated independently, computational complexity scales linearly with the number of neurons. We found that a parallelized version of the proposed on-line algorithm allows simultaneous STRF estimation for up to 33 units on a single computer (cf. Fig. 9(d); up to 55 units for reduced STRF dimensionality). According to Moore’s law [78], the processing power of computers will double approximately every two years. Thus, we expect that the approach will allow simultaneous STRF estimation for more than 100 units in the next years. Furthermore, including interaction between neurons may significantly increase prediction accuracy and lead to a better understanding of cortical processing ([30]; for an overview see [79]). Modeling neurons with pair-wise interactions leads to a quadratic growth of model parameters (if the dimensionality of interaction parameters is the same as the dimensionality of single model parameters). Computationally efficient and robust optimization schemes, as the SGD-based approach proposed in this study, may be particularly useful in such cases, enabling a broader community to use these models without the necessity of supercomputers.


Note that stochastic approximations of neural encoding model parameters are not restricted to RF analysis. In machine learning, stochastic approximations have successfully been applied to a wide range of models, including convex problems such as the GLM and CbRF method used in this study, e.g., support vector machines and logistic regression [35], and even non-convex problems such as deep neural networks [80]. In the non-convex case, the decaying learning rate effectively acts as an annealing parameter, thus reducing the problem of local extrema compared to batch gradient approaches. An example is MID analysis, which uses a combination of gradient ascent and simulated annealing to find RF parameters [20]. Recent studies proposed powerful extensions to the stochastic gradient, e.g., variants that make use of second-order information [49] or the history of past gradients [81] to find the optimal step size, parallelization of SGD updates [82], and distributed hybrid SGD/BGD systems [52]. Such approaches may be particularly helpful when modeling neural responses from large-scale recordings and may provide a substantial gain in performance. As we have demonstrated in this study, approximate solutions provide a very efficient way of estimating model parameters and allow computationally expensive optimization schemes to be replaced at only a very small cost in accuracy.
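
As one example, the per-coordinate step-size adaptation of [81] (AdaGrad) fits into the same update loop with a few extra lines; this is a generic sketch, not a scheme evaluated in this study.

    import numpy as np

    def adagrad_step(theta, grad, accum, eta=0.1, eps=1e-8):
        # Scale each coordinate's step by the inverse root of its
        # accumulated squared gradients, so infrequently updated
        # coordinates retain larger effective learning rates.
        accum = accum + grad ** 2
        return theta - eta * grad / (np.sqrt(accum) + eps), accum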


Acknowledgements


This study was supported by the German Research Foundation (DFG) within the Collaborative Research Centre SFB/TRR31 “The Active Auditory System”.


References

[1] W. Bialek, F. Rieke, R. R. de Ruyter van Steveninck, D. Warland, Reading a neural code, Science 252 (5014) (1991) 1854–1857.
[2] F. Rieke, D. Warland, R. de Ruyter van Steveninck, W. Bialek, Spikes: Exploring the Neural Code, 1st Edition, MIT Press, Cambridge, MA, 1997.
[3] P. Dayan, L. F. Abbott, Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems, MIT Press, 2005.
[4] E. P. Simoncelli, L. Paninski, J. Pillow, O. Schwartz, Characterization of neural responses with stochastic stimuli, in: The Cognitive Neurosciences.
[5] M. C. Wu, S. V. David, J. L. Gallant, Complete functional characterization of sensory neurons by system identification, Annu Rev Neurosci 29 (2006) 477–505. doi:10.1146/annurev.neuro.29.051605.113024.
[6] T. O. Sharpee, Computational identification of receptive fields, Annu Rev Neurosci 36 (2013) 103–120. doi:10.1146/annurev-neuro-062012-170253.
[7] E. deBoer, P. Kuyper, Triggered correlation, IEEE Transactions on Biomedical Engineering BME-15 (3) (1968) 169–179.
[8] E. J. Chichilnisky, A simple white noise analysis of neuronal light responses, Network 12 (2) (2001) 199–213.
[9] S. V. David, W. E. Vinje, J. L. Gallant, Natural stimulus statistics alter the receptive field structure of V1 neurons, J Neurosci 24 (31) (2004) 6991–7006. doi:10.1523/JNEUROSCI.1422-04.2004.
[10] T. O. Sharpee, H. Sugihara, A. V. Kurgansky, S. P. Rebrik, M. P. Stryker, K. D. Miller, Adaptive filtering enhances information transmission in visual cortex, Nature 439 (7079) (2006) 936–942. doi:10.1038/nature04519.
[11] M. A. Escabi, L. M. Miller, H. L. Read, C. E. Schreiner, Naturalistic auditory contrast improves spectrotemporal coding in the cat inferior colliculus, J Neurosci 23 (37) (2003) 11489–11504.
[12] N. A. Lesica, B. Grothe, Dynamic spectrotemporal feature selectivity in the auditory midbrain, J Neurosci 28 (21) (2008) 5412–5421. doi:10.1523/JNEUROSCI.0073-08.2008.
[13] R. C. deCharms, D. T. Blake, M. M. Merzenich, Optimizing sound features for cortical neurons, Science 280 (5368) (1998) 1439–1443.
[14] J. Fritz, S. Shamma, M. Elhilali, D. Klein, Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex, Nat Neurosci 6 (11) (2003) 1216–1223. doi:10.1038/nn1141.
[15] N. C. Rabinowitz, B. D. B. Willmore, J. W. H. Schnupp, A. J. King, Contrast gain control in auditory cortex, Neuron 70 (6) (2011) 1178–1191. doi:10.1016/j.neuron.2011.04.030.
[16] D. L. Ringach, G. Sapiro, R. Shapley, A subspace reverse-correlation technique for the study of visual neurons, Vision Res 37 (17) (1997) 2455–2464.
[17] M. A. Escabi, C. E. Schreiner, Nonlinear spectrotemporal sound analysis by neurons in the auditory midbrain, J Neurosci 22 (10) (2002) 4114–4131.
[18] D. J. Klein, J. Z. Simon, D. A. Depireux, S. A. Shamma, Stimulus-invariant processing and spectrotemporal reverse correlation in primary auditory cortex, J Comput Neurosci. doi:10.1007/s10827-005-3589-4.
[19] L. Paninski, Convergence properties of three spike-triggered analysis techniques, Network 14 (3) (2003) 437–464.
[20] T. Sharpee, N. C. Rust, W. Bialek, Analyzing neural responses to natural signals: maximally informative dimensions, Neural Comput 16 (2) (2004) 223–250. doi:10.1162/089976604322742010.
[21] I. Nelken, Y. Rotman, O. B. Yosef, Responses of auditory-cortex neurons to structural features of natural sounds, Nature 397 (6715) (1999) 154–157. doi:10.1038/16456.
[22] G. Chechik, I. Nelken, Auditory abstraction from spectro-temporal features to coding auditory entities, Proc Natl Acad Sci U S A 109 (46) (2012) 18968–18973. doi:10.1073/pnas.1111242109.
[23] J. Freeman, C. M. Ziemba, D. J. Heeger, E. P. Simoncelli, J. A. Movshon, A functional and perceptual signature of the second visual area in primates, Nat Neurosci 16 (7) (2013) 974–981. doi:10.1038/nn.3402.
[24] W. E. Vinje, J. L. Gallant, Sparse coding and decorrelation in primary visual cortex during natural vision, Science 287 (5456) (2000) 1273–1276.
[25] F. E. Theunissen, K. Sen, A. J. Doupe, Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds, J Neurosci 20 (6) (2000) 2315–2331.
[26] D. L. Ruderman, W. Bialek, Statistics of natural images: Scaling in the woods, Phys Rev Lett 73 (1994) 814–817. doi:10.1103/PhysRevLett.73.814.
[27] L. Paninski, Maximum likelihood estimation of cascade point-process neural encoding models, Network 15 (4) (2004) 243–262.
[28] W. Truccolo, U. T. Eden, M. R. Fellows, J. P. Donoghue, E. N. Brown, A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects, J Neurophysiol 93 (2) (2005) 1074–1089. doi:10.1152/jn.00697.2004.
[29] A. F. Meyer, J.-P. Diepenbrock, M. F. Happel, F. W. Ohl, J. Anemüller, Discriminative learning of receptive fields from responses to non-Gaussian stimulus ensembles, PLOS ONE 9 (4) (2014) e93062. doi:10.1371/journal.pone.0093062.
[30] J. W. Pillow, J. Shlens, L. Paninski, A. Sher, A. M. Litke, E. J. Chichilnisky, E. P. Simoncelli, Spatio-temporal correlations and visual signalling in a complete neuronal population, Nature 454 (7207) (2008) 995–999. doi:10.1038/nature07140.
[31] J. Touryan, B. Lau, Y. Dan, Isolation of relevant visual features from random stimuli for cortical complex cells, J Neurosci 22 (24) (2002) 10811–10818.
[32] J. Benda, T. Gollisch, C. K. Machens, A. V. M. Herz, From response to stimulus: adaptive sampling in sensory physiology, Curr Opin Neurobiol 17 (4) (2007) 430–436. doi:10.1016/j.conb.2007.07.009.
[33] J. Lewi, R. Butera, L. Paninski, Sequential optimal design of neurophysiology experiments, Neural Comput 21 (3) (2009) 619–687. doi:10.1162/neco.2008.08-07-594.
[34] M. Park, J. W. Pillow, Bayesian active learning with localized priors for fast receptive field characterization, in: NIPS, 2012, pp. 2357–2365.
[35] L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Y. Lechevallier, G. Saporta (Eds.), Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), Springer, Paris, France, 2010, pp. 177–187. URL http://leon.bottou.org/papers/bottou-2010
[36] L. Bottou, O. Bousquet, The tradeoffs of large scale learning, in: S. Sra, S. Nowozin, S. J. Wright (Eds.), Optimization for Machine Learning, MIT Press, 2011, pp. 351–368. URL http://leon.bottou.org/papers/bottou-bousquet-2011
[37] J. A. Nelder, R. W. M. Wedderburn, Generalized linear models, Journal of the Royal Statistical Society, Series A 135 (1972) 370–384.
[38] G. B. Christianson, M. Sahani, J. F. Linden, The consequences of response nonlinearities for interpretation of spectrotemporal receptive fields, J Neurosci 28 (2) (2008) 446–455. doi:10.1523/JNEUROSCI.1775-07.2007.
[39] O. Schwartz, J. W. Pillow, N. C. Rust, E. P. Simoncelli, Spike-triggered neural characterization, J Vis 6 (4) (2006) 484–507. doi:10.1167/6.4.13.
[40] F. E. Theunissen, S. V. David, N. C. Singh, A. Hsu, W. E. Vinje, J. L. Gallant, Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli, Network 12 (3) (2001) 289–316.
[41] A. F. Meyer, J.-P. Diepenbrock, F. W. Ohl, J. Anemüller, Quantifying neural coding noise in linear threshold models, in: 6th International IEEE/EMBS Conference on Neural Engineering (NER), 2013, pp. 1127–1130. doi:10.1109/NER.2013.6696136.
[42] A. Calabrese, J. W. Schumacher, D. M. Schneider, L. Paninski, S. M. N. Woolley, A generalized linear model for estimating spectrotemporal receptive fields from responses to natural sounds, PLoS One 6 (1) (2011) e16104. doi:10.1371/journal.pone.0016104.
[43] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Annals of Statistics 32 (2004) 56–134.
[44] P. L. Bartlett, M. I. Jordan, J. D. McAuliffe, Convexity, classification, and risk bounds, Journal of the American Statistical Association 101 (473) (2006) 138–156.
[45] T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning, Springer, 2001.
[46] C.-J. Lin, R. C. Weng, S. S. Keerthi, Trust region Newton method for logistic regression, Journal of Machine Learning Research 9 (2008) 627–650.
[47] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874.
[48] L. Bottou, Online algorithms and stochastic approximations, in: D. Saad (Ed.), Online Learning and Neural Networks, Cambridge University Press, Cambridge, UK, 1998 (revised Oct 2012). URL http://leon.bottou.org/papers/bottou-98x
[49] A. Bordes, L. Bottou, P. Gallinari, SGD-QN: Careful quasi-Newton stochastic gradient descent, Journal of Machine Learning Research 10 (2009) 1737–1754. URL http://leon.bottou.org/papers/bordes-bottou-gallinari-2009
[50] L. Bottou, Stochastic gradient tricks, in: G. Montavon, G. B. Orr, K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Reloaded, Lecture Notes in Computer Science (LNCS 7700), Springer, 2012, pp. 430–445. URL http://leon.bottou.org/papers/bottou-tricks-2012
[51] O. Shamir, T. Zhang, Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes, in: Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 2013, pp. 71–79. URL http://jmlr.org/proceedings/papers/v28/shamir13.html
[52] A. Agarwal, O. Chapelle, M. Dudík, J. Langford, A reliable effective terascale linear learning system, Journal of Machine Learning Research 15 (2014) 1111–1133. URL http://jmlr.org/papers/v15/agarwal14a.html
[53] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. Seljebotn, K. Smith, Cython: The best of both worlds, Computing in Science & Engineering 13 (2) (2011) 31–39. doi:10.1109/MCSE.2010.118.
[54] T. E. Oliphant, Python for scientific computing, Computing in Science & Engineering 9 (3) (2007) 10–20. doi:10.1109/MCSE.2007.58.
[55] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[56] J. D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering 9 (3) (2007) 90–95.
[57] B. Englitz, M. Ahrens, S. Tolnai, R. Rübsamen, M. Sahani, J. Jost, Multilinear models of single cell responses in the medial nucleus of the trapezoid body, Network 21 (1-2) (2010) 91–124. doi:10.3109/09548981003801996.
[58] M. F. K. Happel, M. Jeschke, F. W. Ohl, Spectral integration in primary auditory cortex attributable to temporally precise convergence of thalamocortical and intracortical input, J Neurosci 30 (33) (2010) 11114–11127. doi:10.1523/JNEUROSCI.0689-10.2010.
[59] D. J. Klein, D. A. Depireux, J. Z. Simon, S. A. Shamma, Robust spectrotemporal reverse correlation for the auditory system: Optimizing stimulus design, Journal of Computational Neuroscience 9 (1) (2000) 85–111.
[60] S. V. David, N. Mesgarani, S. A. Shamma, Estimating sparse spectro-temporal receptive fields with natural stimuli, Network 18 (3) (2007) 191–212. doi:10.1080/09548980701609235.
[61] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus, Linguistic Data Consortium, Philadelphia, 1993.
[62] C. K. Machens, M. S. Wehr, A. M. Zador, Linearity of cortical receptive fields measured with natural sounds, J Neurosci 24 (5) (2004) 1089–1100. doi:10.1523/JNEUROSCI.4445-03.2004.
[63] J. H. van Hateren, A. van der Schaaf, Independent component filters of natural images compared with simple cells in primary visual cortex, Proceedings: Biological Sciences 265 (1394) (1998) 359–366.
[64] E. P. Simoncelli, B. A. Olshausen, Natural image statistics and neural representation, Annu Rev Neurosci 24 (2001) 1193–1216. doi:10.1146/annurev.neuro.24.1.1193.
[65] N. Ulanovsky, L. Las, I. Nelken, Processing of low-probability sounds by cortical neurons, Nat Neurosci 6 (4) (2003) 391–398.
[66] G. B. Stanley, Adaptive spatiotemporal receptive field estimation in the visual pathway, Neural Comput 14 (12) (2002) 2925–2946. doi:10.1162/089976602760805340.
[67] U. T. Eden, L. M. Frank, R. Barbieri, V. Solo, E. N. Brown, Dynamic analysis of neural encoding by point process adaptive filtering, Neural Comput 16 (5) (2004) 971–998. doi:10.1162/089976604773135069.
[68] D. M. Schneider, S. M. N. Woolley, Extra-classical tuning predicts stimulus-dependent receptive fields in auditory neurons, J Neurosci 31 (33) (2011) 11867–11878. doi:10.1523/JNEUROSCI.5790-10.2011.
[69] T. L. Adelman, W. Bialek, R. M. Olberg, The information content of receptive fields, Neuron 40 (4) (2003) 823–833.
[70] A. Treves, S. Panzeri, The upward bias in measures of information derived from limited data samples, Neural Comput 7 (2) (1995) 399–407. doi:10.1162/neco.1995.7.2.399.
[71] B. H. Gaese, J. Ostwald, Anesthesia changes frequency tuning of neurons in the rat primary auditory cortex, J Neurophysiol 86 (2) (2001) 1062–1066.
[72] J. W. Schumacher, D. M. Schneider, S. M. N. Woolley, Anesthetic state modulates excitability but not spectral tuning or neural discrimination in single auditory midbrain neurons, J Neurophysiol 106 (2) (2011) 500–514. doi:10.1152/jn.01072.2010.
[73] A. F. Meyer, J.-P. Diepenbrock, F. W. Ohl, J. Anemüller, Temporal variability of spectro-temporal receptive fields in the anesthetized auditory cortex, Frontiers in Computational Neuroscience 8 (165) (2014). doi:10.3389/fncom.2014.00165.
[74] S. Netser, Y. Zahar, Y. Gutfreund, Stimulus-specific adaptation: can it be a neural correlate of behavioral habituation?, J Neurosci 31 (49) (2011) 17811–17820. doi:10.1523/JNEUROSCI.4790-11.2011.
[75] B. N. Pasley, S. V. David, N. Mesgarani, A. Flinker, S. A. Shamma, N. E. Crone, R. T. Knight, E. F. Chang, Reconstructing speech from human auditory cortex, PLoS Biol 10 (1) (2012) e1001251. doi:10.1371/journal.pbio.1001251.
[76] E. N. Brown, D. P. Nguyen, L. M. Frank, M. A. Wilson, V. Solo, An analysis of neural receptive field plasticity by point process adaptive filtering, Proc Natl Acad Sci U S A 98 (21) (2001) 12261–12266. doi:10.1073/pnas.201409398.
[77] A. D. Ramirez, L. Paninski, Fast inference in generalized linear models via expected log-likelihoods, J Comput Neurosci 36 (2) (2014) 215–234. doi:10.1007/s10827-013-0466-4.
[78] G. E. Moore, Cramming more components onto integrated circuits, Electronics 38 (8) (1965).
[79] I. H. Stevenson, K. P. Kording, How advances in neural recording affect data analysis, Nat Neurosci 14 (2) (2011) 139–142. doi:10.1038/nn.2731.
[80] H. Larochelle, Y. Bengio, J. Louradour, P. Lamblin, Exploring strategies for training deep neural networks, Journal of Machine Learning Research 10 (2009) 1–40.
[81] J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011) 2121–2159.
[82] M. A. Zinkevich, A. Smola, M. Weimer, L. Li, Parallelized stochastic gradient descent, in: Advances in Neural Information Processing Systems 23, 2010, pp. 2595–2603.

Figures


Figure 1: Model of neural response generation. (a) The linear stage describes integration of stimulus features by a linear filter, the receptive field (RF). In the auditory system, stimulus examples are sampled from the stimulus spectrogram by considering the spectro-temporal density preceding the response. The extracted instantaneous spectro-temporal patch is filtered using the spectro-temporal receptive field (STRF) of the neuron. (b) The linear-nonlinear Poisson (LNP) model assumes that spikes are generated from the output of the linear stage by a static neural nonlinearity with subsequent Poisson spike generation. Parameters of the LNP are typically estimated using a generalized linear model (GLM). (c) An alternative approach, classification-based RF (CbRF) estimation, is based on the assumption of a noisy threshold operation that produces a probabilistic binary spike train. The average nonlinear response can be described by the cumulative distribution function of the noise around the threshold as indicated by the dashed gray line.


Figure 2: Estimation of model parameters using the full batch gradient descent (BGD) solution and approximations based on stochastic gradient descent (SGD). The concept is illustrated for the CbRF method. (a) Sensory stimulus examples and the evoked binary responses form stimulus–response pairs. If a stimulus example elicited a spike, it is labeled as a spike example; otherwise it is assigned to the non-spike class. Solutions to the binary model are found by searching for the linear hyperplane that optimally separates spike (triangles) and non-spike stimulus examples (circles) in a space with the same dimensionality as the stimulus examples. For visualization purposes, only two dimensions are shown. The BGD solution uses all examples at every update step, whereas the approximation performs updates based on a randomly chosen example, indicated by the single gray triangle. (b) Convergence of the BGD solution and SGD approximations as a function of the number of processed samples. BGD (red squares) updates the model after an iteration over all stimulus–response pairs. SGD (gray line) updates the model after each iteration with a single pair and tends to be noisy. These fluctuations can be reduced using the averaged SGD (ASGD) based on single updates (blue line). In the large-data regime, all methods approach the minimum error. However, SGD-based methods are computationally much more efficient than BGD.


Figure 3: RF estimation from simulated responses to natural stimuli. (a) An auditory model neuron with a STRF that represents a narrow-band onset detector as found throughout the auditory pathway. The stimulus ensemble consists of 100 s of human speech taken from the TIMIT corpus. (b) STRF estimates for GLM (upper row) and CbRF (lower row). Despite the highly non-Gaussian nature of speech, the BGD solution and the stochastic approximations reveal the true STRF, as indicated by the high correlation with the true STRF shown in the lower left corner of each STRF estimate. A value close to 1 indicates almost perfect reconstruction of the true STRF. (c) For comparison, we also estimated the STRF using ridge regression, a linear estimator that is biased due to violation of the Gaussian assumption. (d) RF estimation from simulated responses to natural images. The true RF resembles a Gabor pattern often found in the visual system. A saturating nonlinearity has been used for the simulation (4000 spikes). (e) As in the auditory example in (a), the true RF was accurately recovered by BGD and SGD-based optimization schemes for both GLM and CbRF. (f) The linear estimator does not reliably reveal the true RF due to the non-Gaussian statistics of natural images. For visualization purposes, RF values have been rescaled to [−1, 1].

Figure 4: Adaptive tracking of changes in RF structure for simulated responses. (a) The RF of the neuron changes from k1 to k2 in the middle of the stimulus sequence (N = 10000 stimulus examples). (b) The RF estimate produced by the stationary LN model for the whole sequence. (c) Correlation between RF estimates and the respective true RF at each time instant. The static RF estimate yields a rather low correlation due to violation of the stationarity assumption (solid red line). Estimating a static RF for each part of the sequence separately achieves a high correlation with the true RFs (dashed red line). However, in real experiments we do not know when the response changes, and these changes usually occur more gradually than in this example. The RF estimates based on recurrent SGD (gray line) and ASGD (blue line) updates inherently follow changes in response properties and show asymptotic behavior towards the (optimal) RF estimated for each part.


Figure 5: STRFs estimated from inferior colliculus (IC) responses to non-Gaussian FM tone complexes. (a) Shown are STRF estimates obtained using SGD and ASGD for different numbers of iterations, together with the full BGD solution. The first two rows demonstrate approximation performance for the same unit (Unit A) for a GLM and the CbRF method. For 0.1 iterations over the data, the SGD-based methods overestimate the extent of the excitatory regions compared to BGD. Increasing the number of iterations to 2.5 reduces this extent and reveals details of the full STRF estimate, particularly for ASGD. STRFs for Unit B and Unit C have been estimated using the CbRF method. (b) Frequency profiles extracted from STRFs (Units B and C) compared to the pure tone tuning curves measured at the same sound pressure level as the mean presentation level of the FM tone complex stimuli. For both units the pure tone tuning curve reveals broader frequency selectivity than the frequency profile extracted from the STRF. The sharpening in STRF frequency tuning is likely a result of sideband inhibition, which does not affect responses to single pure tones.


Figure 6: Approximation performance on IC recordings. (a) Correlation (denoted by cc) in terms of normalized subspace projection between STRFs estimated using stochastic approximations and full BGD solution for GLM (left) and CbRF (right) for 70 units. Numbers on the abscissa denote the number of iterations over the data set. For both methods, ASGD-based STRFs reveal a noticeable higher correlation with the BGD solution than STRFs derived using plain SGD, converging at approximately 0.9 (GLM) and 0.93 (CbRF) after a single iteration over all stimulus–response pairs. Note that the ordinate starts at a correlation of 0.4. (b) Prediction performance in terms of relative mutual information (MI) between stimulus and response as described in the text. The results confirm the correlation experiment in (a) with the CbRF method showing slightly better performance. (c) Training time for the different SGD algorithms relative to the time required by the full BGD solution. A single iteration over the whole data set using SGD and ASGD requires approximately 5 % and 8 % of the time, respectively. The mean time required by the BGD solution corresponds to approx. 63 minutes on a single processor. The dotted and dashed lines indicate neural recording time (8.33 minutes) and training time of Ridge regression (approx. 20 minutes). For visualization purposes, bars in each plot were slightly shifted.

an

us

Figure 7: Tuning properties extracted from STRFs for 70 IC units. (a) Best frequencies (BF, top row) and excitatory bandwidths (eBW, bottom row) of STRFs estimated using stochastic approximations to the GLM (2.5 iterations over the data set) compared to STRFs derived using the full solution. (b) BF and eBW extracted from CbRF-based STRF estimates. For both GLM and CbRF, ASGD reliably recovered the same BF and eBW as the BGD solution, as indicated by the p-values close to one in the upper left corner of each plot. eBW parameters extracted from SGD-based STRF estimates exhibited a systematic shift towards higher values compared to BGD. P-values are from a paired Wilcoxon test on the equality of tuning parameters extracted from SGD-based and BGD-based STRF estimates.


Figure 8: Convergence curves for the different methods. For each unit, the data were split into a training set (70 %) and a validation set (30 %). Predictions were computed on the validation set using STRF estimates obtained for 5 %, 10 %, ..., 100 % of the training data. For each unit, we normalized the predicted MI between stimulus and response by the MI predicted by the 100 % BGD-based estimate. Shown are mean and standard deviation averaged across all IC units for (a) GLM and (b) CbRF. The standard deviation at large sample sizes indicates that for some units SGD-based solutions, in particular ASGD, account for a higher amount of MI than STRFs derived using BGD. For visualization purposes, bars have been slightly shifted, with ASGD at the correct position.


Figure 9: Monitoring RF properties while recording responses to acoustic stimuli. (a) Each row displays the temporal evolution of the STRF estimate produced by the ASGD-based on-line algorithm for an example unit. The full BGD STRFs estimated using the whole 500 s stimulus sequence are shown in the right column. All STRFs have been estimated using the CbRF method. (b) Relative convergence of the on-line algorithm towards the full BGD solution, averaged across N = 51 units that produced at least two spikes per second (black line). The shaded gray area indicates one standard deviation. For comparison, the relative convergence of the off-line ASGD method is shown (blue line). Note that the on-line algorithm used approx. 20 % of the data for validation, whereas the off-line method used all data. (c) Processing time of the on-line algorithm on a single computer with 12 processor cores. Colors encode the number of simultaneously estimated units. Solid line and shaded area indicate mean and standard deviation, respectively, across all units. The update interval was 2 s. 15 % of the incoming data were used as a validation set, with the total size limited to 30 s. Model validation amounts to about 75 % of the computation time per update step. (d) Estimated maximum number of simultaneously feasible units. The expected run time was estimated for each number of units from the time interval in (c) in which the maximum validation set size was reached. Interpolation using a quadratic fit (solid gray line) indicates that the proposed approach allows simultaneous estimation of STRFs for up to 33 units on a single computer.

Figure 10: Tracking of changes in the STRF for IC recordings. (a) In the first experiment, the stimulus ensemble changed after 240 s from dynamic moving ripples (DMR) to frequency-modulated (FM) sweep complexes. (b) STRFs derived using the BGD-based CbRF method and the whole data set for both stimulus classes (top row). STRFs estimated using the on-line algorithm 10 s before the end of the presentation of each stimulus ensemble (bottom row). (c) The same experiment for another IC unit, but with the stimulus ensemble changing from FM sweeps to DMRs. (d) The resulting BGD and ASGD-based STRF estimates. In both examples, the on-line algorithm was able to capture the stimulus-specific structure of the BGD solution, specifically the increase in inhibitory STRF components observed for DMR stimuli. (e) The same experiment as in (a) but with a fixed stimulus ensemble (FM sweeps); in each part, the response was randomly chosen from the N = 51 units shown in Fig. 9. Mean and standard deviation of correlations between on-line and full BGD STRFs were computed across 30 runs. Even though the correlation dropped to approximately zero after switching the unit, the on-line algorithm recovered to a mean correlation of greater than 0.7 in the second part. In all experiments, we used a lower bound on the learning rate in the ASGD algorithm as described in the text.


Tables


Table 1: Loss, gradient, and Hessian matrix for the GLM and CbRF method. z is the stimulus s filtered with the linear RF coefficients Θ, r the observed number of spikes in the time bin of width ∆, and f the neural nonlinearity as given in the text. For the CbRF method, y = ±1 indicates the presence or absence of a spike, and p(y) denotes the marginal probability of observing y. Both methods use the same L2 regularizer with regularization parameter α. I denotes the identity matrix. Total loss, gradient, and Hessian matrix are computed by summation over all stimulus–response pairs and adding the L2 regularization term. Note that the GLM's loss term corresponds to the negative log-likelihood. A detailed derivation is given in the appendix.


Appendix A. Optimization of RF parameters

In this section, we derive expressions for the loss function, gradient, and Hessian matrix for the CbRF and GLM methods. The notation follows in large parts that of Section 2.

Appendix A.1. Poisson GLM

The maximum a posteriori estimate for an L2-penalized Poisson GLM for a complete set of observations and a general form of the nonlinearity f is given by

\[ \Theta_{\mathrm{MAP}} = \arg\max_{\Theta} \; \sum_t \left[ r_t \log f(z_t) - \Delta\, f(z_t) \right] - \frac{\alpha}{2} \lVert \Theta \rVert_k^2 \tag{A.1} \]

with spike bin width ∆, regularization parameter α, and z_t ≡ k^T s_t. ‖·‖_k denotes the L2 norm of the RF coefficients k (thus excluding the bias term). Due to the linearity of the derivative, we can derive expressions for the log-likelihood and the regularization term separately. The first and second derivatives for a single observation of the Poisson log-likelihood in Eq. (2), L = r log f − ∆f, are given by

\[ L' = \frac{r}{f} f' - \Delta f' \tag{A.2} \]

\[ L'' = r \left[ \frac{f'' f - f' f'^T}{f^2} \right] - \Delta f'' \tag{A.3} \]

where we have omitted the arguments of L and f. f' and f'' denote the first and second derivatives, respectively, with respect to the parameter vector Θ. Plugging in the neural nonlinearity in Eq. (3) yields the expressions summarized in Table 1, together with the expressions for the regularization term. Note that the outer vector product in the Hessian provides an estimate of the covariance matrix. Thus, the Hessian of the MAP estimate can be written in matrix form as

\[ H = -\Delta\, S^T D S - \alpha I \tag{A.4} \]

where S is a matrix that contains all stimulus examples as row vectors and D is a diagonal matrix with f(z_t), t = 1, 2, 3, ..., on its diagonal. The trust region conjugate gradient Newton algorithm used to derive the full solution requires frequent computation of the product between the Hessian matrix and a vector v. This product can be written as

\[ H v = S^T \left( D \left( S v \right) \right) + \alpha v \tag{A.5} \]

and thus avoids storing the Hessian matrix. The optimal parameters are found by maximizing the log-likelihood in Eq. (A.1), which is equivalent to minimizing the negative log-likelihood as given in Table 1.

Appendix A.2. Classification-based RF estimation method

The objective function minimized by the CbRF method is given by Eq. (7) with the loss in Eq. (6),

\[ L = \sum_t \omega_t \left[ (1 - y_t z_t)_+ \right]^2 + \frac{\alpha}{2} \lVert \Theta \rVert_k^2 . \tag{A.6} \]

The gradient is given by

\[ \frac{\partial L}{\partial k} = -2 \sum_t \omega_t\, y_t\, s_t\, (1 - y_t z_t)_+ + \alpha \Theta \tag{A.7} \]

with Hessian matrix

\[ H = 2 \sum_{t'} \omega_{t'}\, s_{t'} s_{t'}^T + \alpha I , \tag{A.8} \]

where t' indicates that the sum is taken over all examples for which 1 − y_t z_t > 0. Analogous to the GLM, the product of the Hessian matrix and a vector v can be written in matrix notation as

\[ H v = 2\, S_{t'}^T W_{t'} S_{t'} v + \alpha v . \tag{A.9} \]

The matrix W_{t'} contains the inverse class probabilities on its diagonal.
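
The matrix-free Hessian-vector product in Eq. (A.5) translates directly into code; the following is a minimal sketch with assumed variable names, where f_z holds the diagonal of D:

    import numpy as np

    def hessian_vector_product(S, f_z, v, alpha):
        # Computes S^T (D (S v)) + alpha * v without ever forming the
        # Hessian; D = diag(f(z_t)) is represented by the 1-D array f_z.
        return S.T @ (f_z * (S @ v)) + alpha * v

The CbRF product in Eq. (A.9) has the same structure, with S restricted to the active examples and the diagonal weights given by the inverse class probabilities.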

Table 1 (reconstructed layout; entries are per-example contributions, with derivatives taken of the loss, i.e., the negative log-likelihood for the GLM; the L2 terms in the last row are added to the totals):

Method              Loss                    Gradient                    Hessian
GLM (z ≤ 0)         ∆e^z − rz               (∆e^z − r) s                ∆e^z s s^T
GLM (z > 0)         ∆f(z) − r log f(z)      (1 + z) [∆ − r/f(z)] s      [∆ − (r/f(z)) (1 − (1 + z)²/f(z))] s s^T
CbRF (1 − yz > 0)   (1 − yz)² / p(y)        −2y (1 − yz) s / p(y)       2 s s^T / p(y)
CbRF (1 − yz ≤ 0)   0                       0                           0
L2 regularizer      (α/2) ‖k‖²₂             αk                          αI
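
Read row-wise, the CbRF entries of Table 1 define a per-example SGD update; the following sketch assumes ω = 1/p(y) and uses illustrative names:

    import numpy as np

    def cbrf_sgd_step(theta, s, y, p_y, eta, alpha=1e-4):
        # Weighted squared hinge loss (cf. Eqs. A.6-A.7): only examples
        # with positive margin deficit 1 - y*z contribute a data term.
        z = theta @ s
        margin = 1.0 - y * z
        grad = alpha * theta
        if margin > 0:
            grad = grad - 2.0 * (y / p_y) * margin * s
        return theta - eta * grad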

[Figure 1: model of neural response generation — (a) linear integration of stimulus features (stimulus spectrogram, STRF), (b) spike generation via the LNP model, (c) noisy-threshold spike generation (CbRF)]
[Figure 2: (a) stimuli and BGD vs. SGD updates in stimulus space, (b) error vs. number of processed samples]
[Figure 3: true and estimated RFs for auditory and visual model neurons]
[Figure 4: tracking of an RF change from k1 to k2 over time]
[Figure 5: STRF estimates for Units A–C (GLM and CbRF); frequency profiles vs. pure tone tuning curves]
[Figure 6: cc with BGD STRF and relative MI vs. number of iterations; relative training time (%), with experiment length (500 s) and ridge regression (1206 s) reference lines]
[Figure 7: BF and eBW, SGD/ASGD vs. BGD, for GLM and CbRF]
[Figure 8: relative MI vs. training size (%) for (a) GLM and (b) CbRF]
[Figure 9: on-line STRF estimates vs. time in experiment (s); update step size, maximum validation set size (30 s), and estimated maximum number of simultaneously feasible units]
[Figure 10: STRF tracking across stimulus-ensemble changes (DMR/FM, at 240 s of 480 s) and transitions between random STRFs (N = 30)]