An FPGA Multiprocessor Architecture for Bayesian Online Change Point Detection Using Hybrid Stochastic/Deterministic Computation
Tomas Figliolia (a), Andreas G. Andreou (b,*)

(a) was with Johns Hopkins University, now at Xilinx Inc.
(b) Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, Maryland 21218 USA
Abstract

In this paper an event-based hybrid stochastic/deterministic architecture for the Adams/MacKay Bayesian Online Change Point Detection (BOCPD) algorithm [1] is reported. In the stochastic computational structures, probabilities are represented as stochastic events and computation is carried out natively with probabilities rather than with probability density functions. A fully programmable multicore BOCPD processor is synthesized in VHDL. The BOCPD algorithm, with on-line learning, performs foreground-background detection. Running on a single Kintex-7 FPGA (Opal Kelly XEM7350-K410T), the architecture is capable of processing a 160 x 120 pixel image in real time at 10 frames per second.

Keywords: changepoint analysis, changepoint detection, image segmentation, Bayesian inference, on-line algorithm, stochastic processing, precision on demand, ASIC, VHDL, probabilistic event representation.
✩ This work was supported by the DARPA UPSIDE project HR0011-13-C-0051 through BAE Systems; we are grateful to Dr. Louise Sengupta, who managed the project during its first two years while at BAE Systems, for her support and encouragement.
* Corresponding author
Email addresses:
[email protected] (Tomas Figliolia),
[email protected] (Andreas G. Andreou)
1. Introduction

Change point analysis (CPA), also known as change point detection (CPD), is the characterization of changes in the parameters of a stochastic process from sequential data. CPA is often employed in the segmentation of signals, i.e., to facilitate the process of tracking, identification or recognition. The "Bayesian" [2, 3] version of change point analysis was originally described in [4], with online versions of the algorithm only recently formulated [5, 1]. BOCPD is a Bayesian Online Change Point Detection algorithm introduced by Adams and MacKay [1], and further advanced in [6, 7, 8], that allows online inference with the causal predictive filtering necessary in real-time systems that interact with dynamic physical environments. One of the key challenges and critiques of Bayesian approaches is their high computational requirements, which often necessitate high-precision floating point computational units.

In this paper a reconfigurable computing architecture for the Adams/MacKay BOCPD algorithm is presented. The architecture is designed for video processing, specifically computing whether a pixel in a video sequence belongs to the background or to an object (foreground). Each pixel is processed using only its intensity and its time history. The computing architecture employs a unary fixed point representation for numbers in time, where values are encoded as Bernoulli random processes and the Bernoulli probability p encodes a scaled version of the value to be represented. Each processor incorporates on-line machine learning that enables a continuous update of the processing parameters of the statistical model. In Section 2, the mathematical framework is presented, followed by an introduction to stochastic computation and a data representation that is well suited for problems involving probability computations (Section 3). A detailed description of the architecture and computational structures is found in Section 4, with results and discussion in Section 5.
2. The Bayesian On-line Change Point Detection (BOCPD)

Consider a sequence of independent random variables (RVs) X1, X2, ..., Xt. The parameters of the distribution of these RVs can change over time. If that change of parameters could be characterized, then a point of change could be identified. As an example of such a signal, consider a simple stream of zeros and ones on a serial line, where the received voltage values at the endpoint fluctuate due to channel noise. Considering the noise Gaussian with zero mean, the sampled signal will be distributed ∼ N(µ, σ²), where σ² is the fixed variance of the channel noise and µ is the actual value being transmitted, i.e., 0 V or 5 V. In this case µ is the only parameter that changes over time, but a model in which σ² changes can be considered as well. Samples are distributed normally with parameters µ and σ², and those parameters can themselves be considered drawn from another distribution P(µ, σ²). For the example above, one can assume that σ² is fixed but µ is drawn from a Bernoulli distribution, with probability p of sending 5 V and probability (1 − p) of sending 0 V.

The run-length is now introduced: the number of consecutive samples that have been drawn from the same distribution (the parameter values did not change for any of the samples in that run). At time t, rt is the number of past samples corresponding to that run. If rt = k, then the samples considered to be part of that run are x_{t−k}, x_{t−(k−1)}, ..., x_{t−1}. A distinction has to be made between the RV X and a sample x from that random variable; the x's from a run rt = k will be denoted x_t^{(r=k)}. Figure 1 shows a graph that captures these ideas. The nodes labeled P contain the parameters of the distribution from which the x's are sampled. A change of the parameter values P_{t−k} triggers a reset of the count r_{t−k}, setting it to 0. On the other hand, if P_{t−k} = P_{t−(k+1)} then r_{t−k} = r_{t−(k+1)} + 1.
Figure 1: Graphical generative model for the BOCPD algorithm. Parameters P can change or remain the same. The current parameter value Pt can depend on the whole history of parameter values. Run length rt will be 0 or rt−1 + 1, depending on the values of Pt . Finally xt is just a sample from a distribution with parameters equal to Pt .
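To make the generative model of Figure 1 concrete, the following is a minimal Python sketch, offered under stated assumptions: the hazard probability 1/λ, the Gaussian emission, and the uniform re-draw of the parameter are illustrative choices, not values from the paper.

```python
# Hedged sketch of the generative model in Figure 1: at each step the
# parameter either stays (run length grows) or resets (run length -> 0) with
# hazard probability 1/lam, and the observation is Gaussian around the current
# parameter. Parameter values here are illustrative only.
import numpy as np

def sample_sequence(t_max=200, lam=50.0, sigma=0.1, rng=np.random.default_rng(0)):
    mu = rng.uniform(0.0, 1.0)              # initial parameter P_0
    r = 0                                   # run length r_0
    xs, runs = [], []
    for _ in range(t_max):
        if rng.random() < 1.0 / lam:        # changepoint: new parameter, reset run length
            mu = rng.uniform(0.0, 1.0)
            r = 0
        else:
            r += 1                          # same parameter: r_t = r_{t-1} + 1
        xs.append(rng.normal(mu, sigma))    # x_t ~ N(P_t, sigma^2)
        runs.append(r)
    return np.array(xs), np.array(runs)
```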
The objective of this algorithm is to predict, based on history, the probability density function P(r_t | X_{1:t} = x_{1:t}). Note that for this algorithm, at every time step, there is a distribution P(r_t | X_{1:t} = x_{1:t}), where r_t can take values from 0 to t. At the bottom of Figures 2, 3 and 4 of Adams and MacKay's paper "Bayesian Online Changepoint Detection" [1] one can see, in greyscale, the distribution P(r_t | X_{1:t} = x_{1:t}) at each point in time. It is through the use of this distribution that, with a certain level of confidence, a point of change can be declared. To find changes in the underlying distribution from which samples are obtained over time, the change in the run-length distribution is analyzed with every new sample received. In the original paper by Adams and MacKay, the run-length distribution is shown to be

$$
\begin{aligned}
P(r_t \mid X_{1:t}) &\propto P(r_t, X_{1:t}) \\
&= \sum_{r_{t-1}} P(r_t \mid r_{t-1})\, P(X_t \mid r_{t-1}, X_{t-1}^{(r_{t-1})})\, P(r_{t-1}, X_{1:t-1}) \\
&\propto \sum_{r_{t-1}} P(r_t \mid r_{t-1})\, P(X_t \mid r_{t-1}, X_{t-1}^{(r_{t-1})})\, P(r_{t-1} \mid X_{1:t-1})
\end{aligned} \tag{1}
$$

with

$$
P(r_t \mid r_{t-1}) =
\begin{cases}
H(r_{t-1}+1) & \text{if } r_t = 0 \\
1 - H(r_{t-1}+1) & \text{if } r_t = r_{t-1} + 1 \\
0 & \text{otherwise}
\end{cases} \tag{2}
$$

In Equation 1, at time t, P(r_t | X_{1:t}) depends on the same distribution for the previous sample at time t − 1. This dependency allows the algorithm to be run online. In the work presented here H(r_{t−1} + 1) = 1/λ. For the second term in Equation 1, for every r_{t−1} = k, one can show:
$$
\begin{aligned}
P(X_t \mid r_{t-1} = k,\, X_{t-1}^{(r_{t-1}=k)} = x_{t-k:t-1}) &= P(X_t \mid X_{t-k:t-1} = x_{t-k:t-1}) = \int_{\theta} P(X_t, \theta \mid X_{t-k:t-1} = x_{t-k:t-1}) \\
&= \int_{\theta} P(X_t \mid \theta, X_{t-k:t-1} = x_{t-k:t-1})\, P(\theta \mid X_{t-k:t-1} = x_{t-k:t-1}) \\
&= \int_{\theta} P(X_t \mid \theta)\, P(\theta \mid \phi_k) = P(X_t \mid \phi_k)
\end{aligned} \tag{3}
$$

The dependence on the past samples x_{t−1}^{(r_{t−1}=k)} = x_{t−k:t−1} is assumed through φ_k(X_{t−k:t−1} = x_{t−k:t−1}), the estimator of the parameters θ. It can be seen that for every k chosen, one has a set of parameters φ_k(X_{t−k:t−1} = x_{t−k:t−1}). The θ are the random variables of the distribution P(θ | φ_k). It can be shown that

$$
P(\theta \mid X_t, \phi_k) \propto P(X_t \mid \theta)\, P(\theta \mid \phi_k) \tag{4}
$$
2 2 prior ∼ N (µok , σok ), where φ~k = (µok (xt−k:t−1 ), σok (xt−k:t−1 )). The Gaussian
~ is the conjugate distribution of the prior and distribution chosen for P (Xt |θ),
~ t , φ~k ) is Gaussian. These posterior in Equation 4, and hence the posterior P (θ|X 2 are the expressions of the updated parameters φ~k = (σok , µok ) due to a new re-
ceived sample: µok 0 = 2 σok 0=
65
2 σo(k−1) xt + µo(k−1) σ 2
2 σo(k−1) + σ2 −1 1 1 + 2 σ2 σo(k−1)
(5)
When obtaining P(X_t | φ_k) (the last piece missing in Equation 1) through integration of the θ variables in Equation 3, this distribution turns out to be a Gaussian with parameters µ_k = µ_{ok} and σ²_k = σ² + σ²_{ok}. Finally, all the required pieces for turning CPD into its online version, BOCPD, have been obtained, with updates performed as each new sample is received.
2.1. BOCPD Algorithm Description

Having developed the mathematical framework, one can now outline the implemented algorithm step by step (a brief floating-point reference sketch follows the list).

1. Initialize. r_{(t=0)} can only take one value, 0, and then P(r_{t=0} = 0) = 1. If for some reason one knew anything about the process before time t = 0, one could start with another distribution P(r_{t=0}) = S(r). The case used here is the one in which nothing is known about the time before t = 0. At every time step, r can take different values, and since the previous subsection was developed for a particular value r = k, for every value of r one has a different set of parameters θ. For the case of a normal prior, the parameters are those in Equation 5. The parameters θ are set to the initial values θ_o.

2. Observe new datum X_t.

3. Evaluate the predictive probability.

$$
P(X_t \mid r_{t-1}, X_{t-1}^{(r)}) = \pi_t^{(r)} \tag{6}
$$
4. Calculate the growth probabilities.

$$
P(r_t = r_{t-1} + 1, X_{1:t}) = P(r_{t-1}, X_{1:t-1})\, \pi_t^{(r)}\, (1 - H(r_{t-1})) \tag{7}
$$
5. Calculate the changepoint probability. In this step, if the probability is high enough, one could be in the presence of a changepoint.

$$
P(r_t = 0, X_{1:t}) = \sum_{r_{t-1}} P(r_{t-1}, X_{1:t-1})\, \pi_t^{(r)}\, H(r_{t-1}) \tag{8}
$$
6. Calculate the evidence.

$$
P(X_{1:t}) = \sum_{r_t} P(r_t, X_{1:t}) \tag{9}
$$
7. Determine the run-length distribution.

$$
P(r_t \mid X_{1:t}) = P(r_t, X_{1:t}) / P(X_{1:t}) \tag{10}
$$
8. Update the parameters. To illustrate this step, consider time t = 1: r can be either 0 or 1, so two sets of parameters are available, and the second set, for which r = 1, is updated using the parameters corresponding to the previous time step for r = 0. This happens for all values of r except r = 0, for which the parameters are re-initialized as in step 1.

$$
(\theta)_t^{r+1} = f\big(X_t, (\theta)_{(t-1)}^{r}\big) \tag{11}
$$
9. Return to step 2.
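The following is a minimal floating-point reference sketch of steps 1-9 for the Gaussian known-variance model above. It is not the stochastic hardware implementation of Section 4; the function names and numeric defaults (σ², λ, the 0.5 prior mean) are illustrative assumptions.

```python
# Hedged floating-point reference of the BOCPD recursion (Equations 5-10),
# with a constant hazard H = 1/lam, following Adams and MacKay [1].
import numpy as np

def bocpd_step(x_t, prev, mu_o, var_o, sigma2, lam, mu0, var0):
    """One update. `prev` is P(r_{t-1} | x_{1:t-1}) over run lengths 0..len-1."""
    pred_var = sigma2 + var_o                 # Step 3: pi ~ N(mu_ok, sigma^2 + sigma_ok^2)
    pi = np.exp(-0.5 * (x_t - mu_o) ** 2 / pred_var) / np.sqrt(2.0 * np.pi * pred_var)
    h = 1.0 / lam                             # constant hazard H = 1/lambda
    growth = prev * pi * (1.0 - h)            # Step 4 (Equation 7)
    cp = np.sum(prev * pi * h)                # Step 5 (Equation 8)
    joint = np.concatenate(([cp], growth))
    posterior = joint / joint.sum()           # Steps 6-7 (Equations 9-10)
    # Step 8 (Equation 5): conjugate update of (mu_ok, sigma_ok^2); run length 0 keeps the prior
    mu_new = np.concatenate(([mu0], (var_o * x_t + mu_o * sigma2) / (var_o + sigma2)))
    var_new = np.concatenate(([var0], 1.0 / (1.0 / sigma2 + 1.0 / var_o)))
    return posterior, mu_new, var_new

def run_bocpd(xs, sigma2=0.01, lam=50.0, mu0=0.5, var0=0.1):
    posterior = np.array([1.0])               # Step 1: P(r_0 = 0) = 1
    mu_o, var_o = np.array([mu0]), np.array([var0])
    history = []
    for x in xs:                              # Steps 2-9, one pass per new datum
        posterior, mu_o, var_o = bocpd_step(x, posterior, mu_o, var_o, sigma2, lam, mu0, var0)
        history.append(posterior)
    return history                            # list of run-length distributions P(r_t | x_{1:t})
```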
3. Probabilistic Computation

Effective and efficient processing of data using probabilistic algorithms necessitates architectures that operate natively with probability data structures. Probabilistic computation was first introduced by the mathematician John von Neumann in the 1950s, and later conceptualized under the name of "stochastic computation" by Brian Gaines [9] and Wolfgang Poppelbaum [10] in the late 1960s. Stochastic computation employs a unary representation, in contrast to the binary representation employed in modern computer systems. Primarily known for mapping complex computations onto simple bit-wise logic operations [11], stochastic computation failed to gain much popularity during its early years because of its relatively lower accuracy and longer computation time compared to conventional methodologies.
Figure 2: (Top) Time-multiplexed Address Event Representation (AER) bus, (adapted from [26]). (Bottom) Mapping of a two-layer network into an Probabilistic Address Event (PAE) look-up table with transmission probabilities. In this example, sender 1 sends an PAE. An inhibitory PAE is transmitted to receiver 0 with 100% probability, and then an excitatory PAE is transmitted to receiver 2 with 50% probability. The synaptic connection table is stored in a random-access memory (RAM). The first two columns comprise the memory address, and the remaining columns comprise the memory data.
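The example in the Figure 2 caption can be summarized in a short sketch of probabilistic address-event (PAE) routing through a look-up table. The table layout and the `deliver` helper are illustrative assumptions; the cited neuromorphic systems use dedicated AER hardware rather than software dictionaries.

```python
# Hedged sketch of probabilistic address-event (PAE) routing, mirroring the
# example of Figure 2: sender 1 projects to receiver 0 (inhibitory, p = 1.0)
# and to receiver 2 (excitatory, p = 0.5).
import random

# (sender address, synapse index) -> (receiver address, polarity, probability)
synapse_table = {
    (1, 0): (0, -1, 1.0),
    (1, 1): (2, +1, 0.5),
}

def deliver(sender, rng=random):
    """Look up every synapse of `sender` and forward the event probabilistically."""
    events, syn = [], 0
    while (sender, syn) in synapse_table:
        receiver, polarity, prob = synapse_table[(sender, syn)]
        if rng.random() < prob:            # Bernoulli trial gates the transmission
            events.append((receiver, polarity))
        syn += 1
    return events

print(deliver(1))   # e.g. [(0, -1)] or [(0, -1), (2, 1)]
```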
Capitalizing on its strengths, stochastic computation has been successfully applied to neural computation [12, 13], image processing [14, 15, 16], and low-density parity-check (LDPC) decoding [17]. For an excellent overview of the field, please refer to the review paper by Alaghi and Hayes [18]. More recently, probabilistic computation and an address-event based representation (Figure 2) were introduced in neuromorphic architectures [19], [20], [21], in Deep Belief Networks [22], and in a general purpose image processing pipeline for motion imagery [23], [24]. Stochastic computation has also been employed to implement neural-dynamics based sampling in recurrent neural networks [25].

Computation in a unary stochastic representation of data maps onto simple logic functions. For example, with uncorrelated probabilistic unary streams, multiplication can be implemented with an AND gate in a unipolar coding format. A few stochastic computational elements are shown in Figure 3, along with an example of the AND gate operation.
Figure 3: Stochastic computational elements. Example of four stochastic computational elements: the OR gate, the AND gate, the two-input multiplexer gate and the JK flip-flop. At the inputs and output of these gates the probability conversion is presented. An example of the AND gate in use is shown: two streams encoding a probability of 0.5 are multiplied to produce a stream encoding 0.25.
These streams can be decoded back to their initial encoding by integrating the stream and normalizing to its range. For a digital design with binary inputs, this can be done with a counter. The sum of the Bernoulli trials in the stream, normalized to the probability space [0, 1], gives rise to a Binomially distributed random variable,

$$
y = \frac{1}{N} \sum_{i=1}^{N} I(x_i = 1) \tag{12}
$$

where I(∗) is the indicator function, N is the sample size, and x_i is a Bernoulli-distributed sample in the stream. Analyzing this distribution reveals an inversely proportional relationship between the variance σ² and the sample size N. Mathematically,

$$
\sigma^2 = \frac{p(1-p)}{N} \tag{13}
$$

where p is the probability of getting a 1 in each sample, that is, the mean of the Bernoulli trial.
With the variance σ² directly related to the precision of the computation, there exists a trade-off between computation precision and latency. Computing over longer stochastic streams yields lower variance at the outputs and thus more precise results. In the computing architecture presented in this paper, state values are stored in registers in binary form due to the compactness of that representation. The distinguishing feature of this architecture with respect to others is that, when performing any kind of computation, these state values are translated into a probabilistic representation through the use of uniform random number generators (URNGs). Computing in this domain provides robustness against EM-noise-induced bit flips, reduces silicon area in comparison to conventional computational units such as adders, multipliers and dividers, and allows power consumption to be controlled through accuracy management. The domain translation of a state variable is performed using a digital comparator, the state variable's value encoded in binary, and a URNG (Figure 4). The key to processing stochastically is to perform an accurate domain transformation, meaning that reliable, independent URNGs are required. Even though LFSRs are deterministic sources, their pseudo-random characteristics allow them to be used as URNGs in this design. One has to keep in mind, however, that one of the disadvantages of LFSRs is that, if a large number of random sources is required in a given design, avoiding correlation artifacts becomes an issue. Moreover, if the LFSRs are k ∈ N bits wide, only 2^k − 1 random sources will have low correlation, making it impossible for architectures requiring more than 2^k − 1 random sources with k bits of accuracy to work. It is for this reason that replacing LFSRs with arrays of true URNGs like the ones presented in [27] would ensure proper scaling of stochastic architectures.

The second constraint that the probabilistic representation exhibits is that all of the encoded values need to represent a probability, which is bounded to [0, 1]. This requires some ingenuity in shaping the formulation of problems so that they can be solved stochastically. The same numeric value can be represented using different probabilities, depending on the maximum number of bits used to represent binary numbers. For instance, 64 can be the Bernoulli probability p = 64/256 for n = 8, or p = 64/512 for n = 9, and so on.
Figure 4: Example of numbers encoded stochastically. This example shows the change of the encoded number every 16 time slots. A 4-bit uniform random number generator is used in the encoding. At the output of the comparator, for the encoded values 13, 2 and 8, the resulting number of ones on each number period is 13, 2 and 9. Number 8 does not translate into eight ‘1’s because the numbers used for the encoding are random. As one can see, the mean at the output resembles the encoded value, and as more time is used to decode it, less variance is achieved in the estimation. It is evident in this example that encoded values using the same random number source display high levels of correlation, and that is the reason why the majority of stochastic architectures would require independent random number sources for every encoded value.
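As a concrete illustration of the encode/compute/decode flow described above, the following hedged Python sketch encodes two binary state values as Bernoulli bitstreams with a comparator against pseudo-random numbers, multiplies them with a bit-wise AND (as in Figure 3), and decodes the product with a counter (Equation 12). The helper names and the use of NumPy's generator in place of an LFSR are illustrative assumptions, not part of the VHDL design.

```python
# Hedged sketch of unary stochastic encode / AND-multiply / decode.
# A NumPy PRNG stands in for the hardware LFSR/URNG; names are illustrative.
import numpy as np

def encode(value, n_bits, length, rng):
    """Comparator encoding: emit 1 when a uniform random number < value.
    `value` is an integer state in [0, 2**n_bits); the stream encodes
    p = value / 2**n_bits."""
    rand = rng.integers(0, 2 ** n_bits, size=length)
    return (rand < value).astype(np.uint8)

def decode(stream, n_bits):
    """Counter decoding (Equation 12), scaled back to the integer range."""
    return stream.mean() * 2 ** n_bits

rng = np.random.default_rng(0)
n_bits, length = 8, 4096                  # longer streams -> lower variance (Equation 13)
a = encode(128, n_bits, length, rng)      # p_A = 128/256 = 0.5
b = encode(64, n_bits, length, rng)       # p_B = 64/256  = 0.25
prod = a & b                              # AND gate: p_A * p_B for uncorrelated streams
print(decode(prod, n_bits))               # close to 0.5 * 0.25 * 256 = 32
```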
4. Stochastic Architecture for BOCPD Image Segmentation

The change point detection algorithm is employed in a brain-inspired system architecture [23] for real-time, high-velocity BIGDATA processing originating in the large format tiled imaging arrays used in wide area motion imagery for ubiquitous surveillance. High performance and high throughput are achieved through approximate computing and fixed point arithmetic in a variable precision (6 bit to 18 bit) architecture. The architecture implements a variety of processing algorithm classes, ranging from convolutional networks (ConvNets) to linear and non-linear morphological processing, probabilistic inference using exact and approximate Bayesian methods, and ConvNet based classification. The processing pipeline is implemented entirely using event-based neuromorphic and stochastic computational primitives.

The BOCPD architecture presented here is designed for video processing, specifically computing whether a pixel in a video sequence belongs to the background or to an object (image segmentation). Each pixel is processed using only information from its intensity and its time history (Figure 5).
Figure 5: Change point detection applied to image segmentation at the pixel level.
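The per-pixel application of Figure 5 can be sketched as follows. This sketch reuses the hypothetical `run_bocpd` reference from Section 2.1 and is a floating-point illustration only; the 0.5 detection threshold is an assumption, not a parameter of the paper.

```python
# Hedged sketch of per-pixel change detection, as in Figure 5: each pixel's
# intensity series I(x, y, t) is processed independently and a change event is
# flagged when the probability of run length zero is high. Requires the
# `run_bocpd` sketch defined earlier in this document.
import numpy as np

def segment_video(frames, threshold=0.5):
    """frames: array of shape (T, H, W) with pixel values scaled to [0, 1].
    Returns a boolean (T, H, W) array of detected change points."""
    t, h, w = frames.shape
    detections = np.zeros((t, h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            dists = run_bocpd(frames[:, y, x])           # per-pixel run-length distributions
            detections[:, y, x] = [d[0] > threshold for d in dists]
    return detections
```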
4.1. Core architecture

In contrast with the original CPD formulation, the proposed CPD implementation trims the run-length distribution, meaning that run-lengths higher than Nwin − 1 are not considered in the run-length distribution P(r_t | X_{1:t}). If no information about the time before the algorithm starts is provided, the default distribution for P(r_t | X_{1:t}) has all of its weight in P(r_0 = 0 | X_{1:t}) = 1. At time t = Nwin − 1, the distribution P(r_t | X_{1:t}) will have been populated with non-zero values for run-lengths up to Nwin − 1, and the parameters φ_k, k ∈ {0, 1, 2, ..., Nwin − 1}, corresponding to run-lengths r = k, will already have been generated. At this point the distribution P(r_t | X_{1:t}) can be evaluated for Nwin different values (0 to Nwin − 1). Since up to Nwin values are stored for the run-length distribution P(r_t | X_{1:t}), the moment the following sample X_{t=Nwin} arrives, the parameters φ_k = (σ²_{ok}, µ_{ok}) for k = Nwin are not generated, and the probability value assigned to r_{Nwin} = Nwin is added to the probability calculated for r_{Nwin} = Nwin − 1. This way r_t is always limited to a constant number of Nwin values over time. In other words, the value P(r_t = Nwin − 1 | φ_{(Nwin−1)}) + P(r_t = Nwin | φ_{(Nwin)}) is assigned to P(r_t = Nwin − 1 | φ_{(Nwin−1)}). By doing this, P(r_t = Nwin − 1 | φ_{(Nwin−1)}) can be interpreted as the probability of a change point not only for r_t = Nwin − 1 but for all r_t ≥ Nwin − 1.

A diagram of the designed architecture for the CPD algorithm is presented in Figure 6. At time t, registers Ro(0) to Ro(Nwin − 1), found in Figure 6 (1), hold the probability distribution values for P(r_{t−1} | X_{1:t−1}). These registers simply contain count values (integers), and they do not necessarily represent the normalized version of P(r_{t−1} | X_{1:t−1}). Register Wo contains the sum of all the values of the Ro registers, so it can be thought of as the normalizing constant.

For this proposed architecture, two frequencies are defined: the frequency at which samples are received (F_samp) and the frequency corresponding to the processing of every new sample (F_proc). For instance, let us assume a new sample is received every millisecond; 1 ms is then available to perform processing on that sample. Depending on the frequency used, 1 ms can represent a different number of clock cycles that can be employed in the processing of that sample. If the frequency is 100 kHz, then 100 clock cycles are employed; if the frequency is 1 MHz, then 1000 clock cycles are employed. In this architecture the number of clock cycles employed is programmable and ranges from 2^8 to 2^16.
Figure 6: General stochastic architecture for the CPD algorithm.
If both frequencies F_samp and F_proc are set, but additional clock cycles are required to process a sample, an alternative would be to diversify in space, meaning that instead of having stochastic streams of N clock cycles, double the number of stochastic streams can be employed (with the additional need for URNGs), cutting in half the number of clock cycles assigned to each of those streams. One can clearly see the resemblance here with nature and neural networks. If all computations had to be performed on a single stochastic line, then very high F_proc frequencies would be needed; instead, what the brain does is diversify in space, making F_proc a very low frequency, between 100 Hz and 2000 Hz, with a massive connectivity of dendrites onto a single neuron, on the order of thousands. The architecture presented here does not diversify in space, with the objective of keeping the silicon area as small as possible, but it would not be difficult to introduce this change. Hence, the more clock cycles employed, the more accurate the computation becomes.

When a new sample value arrives, all of the values in the Ro registers need to be encoded stochastically, so several comparators and random number generators are necessary to generate the stochastic streams (see Figure 6 (2)). In order to perform a normalization on the stochastically encoded Ro values, the statistical characteristics at the output of a JK flip-flop are exploited (see Figure 4). By using stochastic adders/subtractors, binary comparators and JK flip-flops, the sum of all the Bernoulli probabilities at the outputs of the JK flip-flops adds to one, making the normalization of P(r_{t−1} | X_{1:t−1}) possible without involving complicated division algorithms. The stochastic adder/subtractor mentioned is the one represented by a circle in Figure 4 (2). This unit contains a counter that increases its count by the difference of its inputs, where that difference can be −1, 0 or 1. After adding that step to the local counter, if the local counter holds a value greater than 0, a '1' is forwarded to the output and the counter is decreased by one count.

The length of these stochastic streams can be changed depending on the accuracy required from the algorithm. If higher accuracy is required, more computational time has to be provided. On the right side of the Ro registers, registers Ri(0) to Ri(Nwin − 1) and Wi will contain the updated probability distribution P(r_t | X_{1:t}) by the end of the chosen computational time. Every time a new sample x_t arrives, registers Ro and Wo are loaded with P(r_{t−1} | X_{1:t−1}) held by registers Ri and Wi. After this transfer, the Ri and Wi registers are set to zero. Registers Ri and Wi integrate the ones arriving on their stochastic inputs, acting as stochastic decoders. After the chosen computation time, the resulting count values in the Ri registers represent the not-necessarily-normalized version of the P(r_t | X_{1:t}) distribution. The resulting distribution will not necessarily be normalized because of the nature of this type of stochastic computation: the decoding process has a mean and a variance, and it is because of that variance that the distribution might not add to one.

The total number of uncorrelated uniform random numbers used in this design is 3n + Nwin + 1. Many of these numbers are used more than once because the generated stochastic streams do not cross paths. If they do cross paths, it is only at the inputs of the Ri and Wi registers, which is not a problem because no further stochastic computations for which correlation could be an issue are performed there.

Figure 7: Stochastic architecture for the update of the µ_ok parameters.

With a similar structure, the parameters from Equation 5 are updated. In fact, notice that the expressions for σ²_{ok}' for all k do not depend on the value of the current sample X_t, and as a result they are constants. That makes it possible to avoid memory states for these parameters. In Figure 7 the Coefficients W provide the σ²_{o(k−1)}/(σ²_{o(k−1)} + σ²) constant values which, after being stochastically encoded, are used in the update of the mean parameters µ_{ok} for k ∈ [1 : Nwin − 1]. The default mean for the prior does not change, and it is added as one of the inputs in Figure 7. The value for this default mean will generally be set to 0.5 for the most uninformed guess; alternatively, one can estimate the mean of the analyzed random process over a long period of time and use that estimate as the default mean. The input X_t is encoded stochastically and is used with two-input multiplexers to perform the necessary weighted sum in the update of the mean parameters (see Equation 5). The way in which the muo and mui registers are reset and loaded is similar to that of the Ro and Ri registers. It can also be observed in Figure 7 how the arrows from the multiplexers to the mui registers go down one level, meaning that µ_{ok}' depends on µ_{o(k−1)}, as expected. Each of the muo register values is encoded in n − 1 different independent streams. One of those streams is used for the updates of the µ_{ok} parameters, and at the same time all of those streams are sent out to the blocks that compute the Gaussian distributions P(X_t | r_{t−1}, X_{t−1}^{(r_{t−1})}) for the different values of r_{t−1}.

Going back to Figure 6, three different types of stochastic streams can be seen: the ones for the encoded X_t value (4), the ones corresponding to the mean parameters µ_{ok} coming out of the MU_UPDATES block, and the Bernstein coefficient streams (3) that are used to generate the required Gaussian function P(X_t | φ_k). The Bernstein polynomials [28] can be used to approximate any function f(x) = y with x ∈ [0, 1] and y ∈ [0, 1], and they rely on a weighted sum of Bernoulli probabilities. Considering n independent stochastic streams representing the number p, if at every time step all of the ones from the different stochastic streams are added, the probability of obtaining q as the output sum is the Binomial distribution

$$
P(q) = \binom{n}{q} p^q (1-p)^{n-q} \tag{14}
$$

By computing a weighted sum of these probabilities one can approximate the desired function:

$$
f(p) \approx \sum_{i=0}^{n} w_i \binom{n}{i} p^i (1-p)^{n-i} = \hat{f}(p) \tag{15}
$$
Figure 8: Stochastic architecture for the Bernstein polynomials function approximation.
In order to make this approximation stochastically feasible, the weights w_i need to be values in [0, 1]. The architecture for this Bernstein polynomial function approximation can be found in Figure 8. By choosing among the different Bernstein coefficient streams with a multiplexer, the desired weighting effect is obtained. The distribution P(X_t | φ_k) is Gaussian, and its parameters are (µ_k = µ_{ok}, σ²_k = σ² + σ²_{ok}). The ABS block computes the stochastic absolute value of the difference between its inputs, so that only half of the Gaussian bell needs to be approximated by the Bernstein polynomials.

The problem that now arises is that Nwin different Gaussians, with Nwin possibly different variances, need to be approximated. This would mean that Nwin different sets of Bernstein coefficients are needed, significantly increasing the number of coefficients supplied to the architecture that need to be stochastically encoded. As a solution to this problem, the concept of a bursting neuron was applied. If a Gaussian distribution with standard deviation 0.2 is approximated with certain coefficients, how can these same coefficients be used to approximate a Gaussian with standard deviation 0.2 j/i, with i, j ∈ N? Some neurons, when excited, generate a train of impulses at their output instead of a single spike. This same idea is applied to the BURST UP blocks in Figure 8: for every spike at the input, NU(∗) spikes are generated at the output. If the original Gaussian has a standard deviation of 0.2, and i spikes are generated at the output of these blocks, then the approximated half-Gaussian decreases its standard deviation to 0.2/i. Conversely, for the BURST DOWN blocks, if one spike is generated at the output only every j input spikes, the effect is the opposite. By concatenating these two blocks, a more accurate approximation of the standard deviations in Equation 5 can be obtained, as two different degrees of freedom can be used in the approximation.

There is a caveat in the scaling of the streams going to the Bernstein approximation blocks. Even if it seems that the use of two consecutive blocks that multiply and divide by constants stochastically can help achieve better accuracy for the required standard deviations in Equation 5, there is a problem involved. For the Burst-down block, a spike is sent to the output every time j spikes have been received at the input, and for the Burst-up block, i spikes are sent to the output every time a spike is received at the input. When processing a new sample X_t during the processing time N, K ones may be received at the input of the Bernstein blocks, meaning that the signal value is approximately K/N. When the division is performed on this value, a remainder may be left in the calculation. This remainder is < j/N, and when scaling by i, a maximum accuracy error of (j − 1)i/N is incurred. This error would not exist if the Burst-down block were disabled by setting j = 1, but a lower accuracy would then be obtained for the standard deviations in Equation 5. Additionally, when encoding small probability values, the relative error could become high. Overall, simulations showed that even in this case the behavior of the system improved compared to disabling the Burst-down blocks.

The combinational circuit placed after the Bernstein blocks can be easily explained considering that the AND gate performs the multiplication operation and the NOT gate computes 1 − p, where p is the value encoded at its input. To understand all of the connections that go to the inputs of the counters Ri and Wi, one can go back to Equations 7 and 8.
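The following hedged numerical sketch illustrates the Bernstein approximation of the half-Gaussian (Equations 14-15) and the effect of the burst-up/burst-down rescaling on the standard deviation. The least-squares fit, the degree n = 5, and the clipping of the weights are illustrative assumptions; the hardware uses precomputed coefficient streams rather than an on-line fit.

```python
# Hedged sketch: Bernstein-polynomial approximation of the half-Gaussian bell
# and reuse of the same coefficients for sigma scaled by j/i (burst blocks).
import numpy as np
from math import comb

def bernstein_approx(p, w):
    """Evaluate f_hat(p) = sum_i w_i * C(n, i) * p**i * (1-p)**(n-i)."""
    n = len(w) - 1
    return sum(w[i] * comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1))

def half_gaussian(d, sigma):
    """Target: half Gaussian bell over the absolute difference d = |x - mu|."""
    return np.exp(-0.5 * (d / sigma) ** 2)

# Fit degree-5 Bernstein weights for a reference sigma = 0.2 (as in the text).
n = 5
d = np.linspace(0.0, 1.0, 201)
basis = np.stack([comb(n, i) * d**i * (1 - d)**(n - i) for i in range(n + 1)], axis=1)
w, *_ = np.linalg.lstsq(basis, half_gaussian(d, 0.2), rcond=None)
w = np.clip(w, 0.0, 1.0)   # weights must stay in [0, 1] to be stochastically encodable

# Burst blocks: scaling the input stream by i/j reuses the same coefficients
# for a Gaussian of standard deviation 0.2 * j / i.
i_up, j_down = 2, 1
scaled = bernstein_approx(np.clip(d * i_up / j_down, 0, 1), w)
print(np.abs(scaled - half_gaussian(d, 0.2 * j_down / i_up)).max())  # approximation error
```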
4.2. Preprocessing unit

As mentioned before, the objective of the change point detection algorithm is to find changes in the parameters of the underlying distribution from which the samples x_t come, by interpreting the run-length probability distribution P(r_t | X_{1:t} = x_{1:t}). Due to the choice of the Gaussian distribution for the right-hand expression in Equation 4, Equation 1 is also Gaussian. Now, if an incoming sample x_t is very different from the range of values received in the past, this could mean that there is an outlier or that a change in the parameters has actually taken place. In either case, when the sample x_t is very far from the mean values µ_{ok} and falls where the Gaussian distribution has a very low probability value, the precision of this stochastic implementation is impaired. This problem can be seen in Figure 9. If one were dealing with floating point computations, this would not be a problem, because even if P(X_t | φ_k) were a very small value for all k ∈ [0 : Nwin − 1], when normalizing the distribution P(r_t | X_{1:t}) one would still obtain meaningful probability values. For the stochastic implementation, if, for instance, one were using 1024 time slots to compute P(r_t | X_{1:t}) and the values obtained for P(X_t | φ_k) were all lower than 1/1024, then the Ri registers would most likely contain all zeros at the end of the processing time, and no meaningful distribution for P(r_t | X_{1:t}) would be obtained. A possible way of addressing this problem is to normalize the samples received over time, tracking a slowly varying mean of these samples and amplifying the difference between this mean and the current sample value x_t.
Figure 9: Floating point vs stochastic precision problem.
This can be thought of as a high-pass filter with a cut-off frequency very close to 0 Hz and a gain greater than one. The stochastic stream sent to the Bernstein block approximator never encodes Bernoulli probabilities greater than 1, because of the nature of stochastic processing. Consequently, when approximating half of the Gaussian bell with the Bernstein polynomials [28] and filtering and amplifying the values of x_t, if the approximation of the Gaussian bell has a minimum value in the [0 : 1] domain that is greater than 1/N_comp, where N_comp is the number of clock cycles chosen for processing each sample, then the values in the Ri registers at the end of the computation time will not contain all zeros. By not containing all zeros, the architecture can recover easily from this situation. The stochastic pre-processing unit is shown in Figure 10.

Figure 10 presents the architecture of the preprocessing block that adjusts the signal fed to the CPD block. This adjustment is also done in the probabilistic domain. The first stage of the preprocessing unit is a one-pole lowpass filter. The iterative equation for this filter is well known: y[n] = δ·x[n] + (1 − δ)·y[n − 1]. The parameter δ determines the cut-off frequency of the filter; smaller values mean a lower cut-off frequency. When computing this equation, the value of y[n] has to be stored in binary form, meaning that a stochastic stream has been decoded for time n − 1. The decoding of a stochastic stream brings up problems that one would not encounter by staying in the stochastic domain: decoding is not a loss-less transformation, due to the variance of the estimator used, and the decoding error is inversely proportional to √N_comp. At some point the value of the filtered input signal is subtracted from the input signal, and if the error in the filtered signal is of the order of the input signal's standard deviation, a problem arises. Two complementary solutions were adopted. First, the filter was changed to output a new sample every M1 input samples. This means that the counter block in Figure 10 counts for M1·N_proc clock cycles; after this time, the value in this counter is sent to the current y[n−1] block and the counter register is reset. The red dotted line marks the three terms of the one-pole IIR filter mentioned. This decreases the decoding error by √M1, and it is a viable solution because the input signal's DC component is assumed to change very slowly. Second, additional parallel streams were incorporated for the values y[n − 1]·(1 − δ) and x[n]·δ. If M2 independent stochastic number streams are used to generate these M2 parallel streams, it is the same as having N_comp·M2 time slots instead of N_comp. Considering the combined effect of M1 and M2, the error in the decoded value of the filtered signal is effectively reduced by √(M1·M2).

The difference between the filtered signal and the original signal is computed in the stochastic domain. Depending on which of the two signals encodes a higher number, one of the output streams of the subtraction will be zero and the other one will carry the difference. The subtraction signals are then amplified by mult_i with the previously mentioned bursting blocks; one wants to amplify the difference signals in order to distinguish small changes in the mean of the input signal. Finally, the two amplified streams are multiplied by a number lower than 1 (input div_i), so that the overall scaling factor can be any rational number, and they are combined using a counter that starts at a count representing the number 0.5 and goes up or down depending on the input stochastic values. This way one always obtains a signal that is centered at 0.5 at the preprocessing block output.
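A hedged behavioral model of this signal flow is sketched below. It is a floating-point approximation of Figure 10, not the bit-level stochastic implementation; δ, M1, the gain and the step example are assumptions chosen for illustration.

```python
# Hedged behavioral model of the pre-processing unit: a slow, decimated
# one-pole IIR estimate of the mean, difference amplification, and
# re-centering of the output at 0.5.
import numpy as np

def preprocess(x, delta=0.02, m1=4, mult=8, div=0.75):
    """x: samples in [0, 1]. Returns the adjusted signal centered at 0.5."""
    y_prev = x[0]                 # slowly varying estimate of the DC component
    acc = 0.0
    out = np.empty_like(x)
    for n, xn in enumerate(x):
        acc += delta * xn + (1.0 - delta) * y_prev     # y[n] = d*x[n] + (1-d)*y[n-1]
        if (n + 1) % m1 == 0:                          # update the stored mean every M1 samples
            y_prev = acc / m1
            acc = 0.0
        diff = xn - y_prev                             # subtraction (sign-split streams in hardware)
        out[n] = np.clip(0.5 + diff * mult * div, 0.0, 1.0)   # amplify (mult_i), scale (div_i), recenter
    return out

# Example: a noisy signal whose mean steps from 0.30 to 0.35 is mapped to a
# signal centered at 0.5 with the small mean change amplified.
rng = np.random.default_rng(1)
sig = np.concatenate([0.30 + 0.01 * rng.standard_normal(200),
                      0.35 + 0.01 * rng.standard_normal(200)])
adj = preprocess(sig)
```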
Figure 10: Architecture for the stochastic pre-processing unit.
5. Results and Discussion

The architecture was implemented in a Xilinx Kintex-7 FPGA communicating with a host system through a USB3 interface and Matlab, featuring 14 BOCPD cores, each with its own pre-signal adjustment, allowing the processing of 160 x 120 pixel images at 10 fps. All of the parameters were fixed, the most important ones being the number of Bernstein coefficients n = 5, the processing time of 8192 clock cycles, and the time window Nwin = 8. The design was written completely parametrically in VHDL, allowing these parameters to be changed and the design resynthesized. The proposed stochastic BOCPD was implemented as part of an image processing pipeline performing background-foreground segmentation. Points of change were identified by analyzing the probability distribution of Equation 1 obtained for each new sample x_t. This system architecture is capable of processing 160 x 120 raw pixel data in real time, running on a reconfigurable computing platform (5 Xilinx Kintex-7 FPGAs). The image data is generated by standard cameras, and hence the value of the pixels is encoded using pulse density modulation (PDM) so that it is compatible with the computational infrastructure. A screen shot of results from the image processing pipeline up to the segmentation and feature extraction stage is shown in Figure 11. Figure 12 shows results obtained in the simulation of the architecture processing a single pixel (scalar) using ISim. An ASIC implementation of the system would represent an architecture for hardware inference for the second wave of AI that goes beyond deep neural networks [29]. Furthermore, by judiciously employing a hybrid stochastic/deterministic data representation, the many pitfalls associated with stochastic computation pointed out in [30] are avoided.
Acknowledgment

This work was supported by the DARPA UPSIDE project HR0011-13-C-0051 through BAE Systems; we are grateful to Dr. Louise Sengupta, who managed the project during its first two years while at BAE Systems, for her support and encouragement. This work was also partially supported by NSF grant INSPIRE SMA 1248056 through the Telluride Workshop on Neuromorphic Cognition Engineering and by ONR MURI N000141010278.
Figure 11: Screen capture of the laptop displaying the results from the image processing pipeline. The results of the change point analysis and segmentation algorithm are shown in the two bottom frames of the image processing pipeline, using the resources of one Xilinx Kintex-7 FPGA.
References

[1] R. P. Adams, D. J. C. MacKay, Bayesian Online Changepoint Detection, arXiv:0710.3742v1.

[2] D. Lunn, C. Jackson, N. Best, A. Thomas, D. Spiegelhalter, The BUGS Book: A Practical Introduction to Bayesian Analysis, 1st Edition, Texts in Statistical Science, Chapman & Hall/CRC, 2012.

[3] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis, 2nd Edition, Texts in Statistical Science, Chapman & Hall/CRC, 2006.

[4] A. Smith, A Bayesian approach to inference about a change-point in a sequence of random variables, Biometrika 62 (2) (1975) 407–416.

[5] J. J. O Ruanaidh, W. J. Fitzgerald, K. J. Pope, Recursive Bayesian location of a discontinuity in time series, in: Proceedings of the 1994 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 1994, pp. 513–516.
Figure 12: Results from the CPD FPGA architecture. The bottom panel shows a noisy input signal with step changes in the mean value. Signal send-event shows the output of the architecture indicating an event whenever the mean value of the signal changes (red arrows pointing upwards).
[6] R. Turner, Y. Saatçi, C. E. Rasmussen, Adaptive sequential Bayesian change point detection, Tech. rep. (2009).

[7] J. Mellor, J. Shapiro, Thompson Sampling in Switching Environments with Bayesian Online Change Detection, in: Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2013, pp. 442–450.

[8] Y. Saatçi, R. D. Turner, C. E. Rasmussen, Gaussian process change point models, in: Proceedings of the 27th Annual International Conference on Machine Learning (ICML-10), 2010, pp. 927–934.

[9] B. R. Gaines, Stochastic Computing Systems, in: J. F. Tou (Ed.), Advances in Information Systems Science, Plenum, New York, 1969, pp. 38–172.

[10] W. Poppelbaum, C. Afuso, J. Esch, Stochastic computing elements and systems, in: Proceedings of the 1967 Fall Joint Computer Conference (FJCC), ACM, 1967, pp. 635–644.

[11] N. R. Shanbhag, R. A. Abdallah, R. Kumar, D. L. Jones, Stochastic computation, in: Proceedings of the 47th ACM/EDAC/IEEE Design Automation Conference (DAC), 2010, pp. 859–864.

[12] B. D. Brown, H. C. Card, Stochastic neural computation I: Computational elements, IEEE Transactions on Computers 50 (9) (2001) 891–905.

[13] B. D. Brown, H. C. Card, Stochastic neural computation II: Soft competitive learning, IEEE Transactions on Computers 50 (9) (2001) 906–920.

[14] T. Hammadou, M. Nilson, A. Bermak, P. Ogunbona, A 96 x 64 intelligent digital pixel array with extended binary stochastic arithmetic, in: 2003 IEEE International Symposium on Circuits and Systems (ISCAS), 2003, pp. 772–775.

[15] W. Qian, X. Li, M. Riedel, K. Bazargan, D. Lilja, An Architecture for Fault-Tolerant Computation with Stochastic Logic, IEEE Transactions on Computers 60 (1) (2011) 93–105.

[16] A. Alaghi, C. Li, J. P. Hayes, Stochastic circuits for real-time image-processing applications, in: Proceedings of the 50th ACM/EDAC/IEEE Design Automation Conference (DAC'13), ACM, 2013, p. 136.

[17] S. Sharifi Tehrani, S. Mannor, W. J. Gross, Fully Parallel Stochastic LDPC Decoders, IEEE Transactions on Signal Processing 56 (11) (2008) 5692–5703.

[18] A. Alaghi, J. P. Hayes, Survey of Stochastic Computing, ACM Transactions on Embedded Computing Systems (TECS) (2012) 1–25.

[19] D. H. Goldberg, G. Cauwenberghs, A. G. Andreou, Probabilistic synaptic weighting in a reconfigurable network of VLSI integrate-and-fire neurons, Neural Networks 14 (2001) 781–793.

[20] R. J. Vogelstein, U. Mallik, E. Culurciello, G. Cauwenberghs, R. Etienne-Cummings, A multichip neuromorphic system for spike-based visual information processing, Neural Computation 19 (9) (2007) 2281–2300.

[21] A. S. Cassidy, J. Georgiou, A. G. Andreou, Design of silicon brains in the nano-CMOS era: spiking neurons, learning synapses and neural architecture optimization, Neural Networks 45 (2013) 4–26.

[22] K. A. Sanni, G. Garreau, J. L. Molin, A. G. Andreou, FPGA implementation of a Deep Belief Network architecture for character recognition using stochastic computation, in: 49th Annual Conference on Information Sciences and Systems (CISS), 2015, pp. 1–5.

[23] A. G. Andreou, T. Figliolia, K. Sanni, T. S. Murray, G. Tognetti, D. R. Mendat, J. L. Molin, M. Villemur, P. O. Pouliquen, P. M. Julián, R. Etienne-Cummings, I. Doxas, Bio-inspired System Architecture for Energy Efficient, BIGDATA Computing With Application to Wide Area Motion Imagery, in: 2016 IEEE 7th Latin American Symposium on Circuits and Systems (LASCAS), 2016, pp. 1–6.

[24] J. L. Molin, K. Sanni, I. Doxas, A. G. Andreou, R. Etienne-Cummings, T. Figliolia, FPGA emulation of a spike-based, stochastic system for real-time image dewarping, in: 55th Midwest Symposium on Circuits and Systems (MWSCAS), IEEE, 2015, pp. 1–4.

[25] L. Buesing, J. Bill, B. Nessler, W. Maass, Neural Dynamics as Sampling: A Model for Stochastic Computation in Recurrent Networks of Spiking Neurons, PLOS Computational Biology 7 (11) (2011) e1002211.

[26] K. A. Boahen, Point-to-point connectivity between neuromorphic chips using address events, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 47 (5) (2000) 416–434.

[27] T. Figliolia, P. M. Julián, G. Tognetti, A. G. Andreou, A True Random Number Generator using RTN noise and a sigma delta converter, in: 2016 IEEE International Symposium on Circuits and Systems (ISCAS), 2016, pp. 17–20.

[28] S. Bernstein, Demonstration du theoreme de Weierstrass basee sur le calcul des probabilites, Communications of the Kharkov Mathematical Society 13 (1913) 1–2.

[29] K. Sanni, A. G. Andreou, A Historical Perspective on Hardware AI Inference, Charge-Based Computational Circuits and an 8bit Charge-Based Multiply-Add Core in 16nm FinFET CMOS, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (3) (2019) 532–543.

[30] R. Manohar, Comparing Stochastic and Deterministic Computing, IEEE Computer Architecture Letters 14 (2) (2015) 119–122.
Biography
Tomas Figliolia received his diploma in Electronic Engineering from the Universidad de Buenos Aires, Buenos Aires, Argentina in 2009, his M.S. in Electrical and Computer Engineering from Johns Hopkins University in 2011 and his Ph.D. in 2018. His Ph.D. dissertation research was focused on a hardware architecture for embedded processing of high velocity streaming image data and probabilistic inference. He is now with Xilinx Incorporated.
Andreas G. Andreou is a professor of electrical and computer engineering, computer science and the Whitaker Biomedical Engineering Institute at Johns Hopkins University. Andreou is the co-founder of the Johns Hopkins University Center for Language and Speech Processing. Research in the Andreou lab is aimed at brain inspired microsystems for sensory information and human language processing. Notable microsystems achievements over the last 25 years include a contrast sensitive silicon retina, the first CMOS polarization sensitive imager, silicon rods in standard foundry CMOS for single photon detection, and a large scale mixed analog-digital associative processor for character recognition. Significant algorithmic research contributions for speech recognition include the vocal tract normalization technique and heteroscedastic linear discriminant analysis, a derivation and generalization of Fisher discriminants in the maximum likelihood framework. In 1996 Andreou was elected an IEEE Fellow for his contribution to energy efficient sensory microsystems.