Accepted Manuscript

A Scalable and Adaptable Hardware NoC-Based Self Organizing Map

Mehdi Abadi, Slavisa Jovanovic, Khaled Ben Khalifa, Serge Weber, Mohammed Hédi Bedoui

PII: S0141-9331(17)30191-6
DOI: 10.1016/j.micpro.2017.12.007
Reference: MICPRO 2646

To appear in: Microprocessors and Microsystems

Received date: 31 March 2017
Revised date: 26 October 2017
Accepted date: 14 December 2017

Please cite this article as: Mehdi Abadi, Slavisa Jovanovic, Khaled Ben Khalifa, Serge Weber, Mohammed Hédi Bedoui, A Scalable and Adaptable Hardware NoC-Based Self Organizing Map, Microprocessors and Microsystems (2017), doi: 10.1016/j.micpro.2017.12.007
A Scalable and Adaptable Hardware NoC-Based Self Organizing Map

Mehdi Abadi (a,b,1), Slavisa Jovanovic (a,*), Khaled Ben Khalifa (b), Serge Weber (a), Mohammed Hédi Bedoui (b)

a UMR 7198, Institut Jean Lamour, Université de Lorraine, Nancy, France
b Laboratoire de Technologie et Imagerie Médicale, Université de Monastir, Monastir, Tunisia
Abstract
Due to their ability to reduce the size of high-dimensional input data, Self-organizing maps (SOMs) can be employed as data quantizers. The widely used software implementations of SOMs enjoy flexibility and adaptability, usually to the detriment of performance, which limits their use in real-time applications. On the contrary, the hardware counterparts of SOMs exploit the inherent parallelism of hardware to boost overall performance, but generally lack adaptability without considerable design effort. To benefit from both the flexibility of software and the performance of hardware SOM implementations, unconventional design approaches of SOMs should be used. In this work, a scalable and adaptable hardware implementation of a SOM network is presented. The proposed architecture allows the SOM operation to be dynamically extended from a smaller to a larger map only by (re-)configuring the parameters of each neuron. The gained scalability is obtained by decoupling the computation layer, composed of neurons, from the communication layer, used to provide data exchange mechanisms between neurons. The proposed SOM architecture is also validated through simulation on variable-sized SOM networks applied to image compression.

✩ This work was supported by the PHC-UTIQUE 17G1423 Research program.
∗ Corresponding author
Email addresses: [email protected] (Mehdi Abadi), [email protected] (Slavisa Jovanovic), [email protected] (Serge Weber), [email protected] (Mohammed Hédi Bedoui)
1 École Nationale d'Ingénieur de Sousse, Université de Sousse, Sousse, Tunisia

Preprint submitted to Microprocessors and Microsystems, December 15, 2017
Keywords: Self-Organizing Map, Network-on-chip, FPGA, Image compression
1. Introduction
A Self-Organizing Map (SOM) is an unsupervised learning neural network that finds its use in many applications. High-dimensional data reduction and classification are commonly done with SOMs, facilitating in that way their interpretation and processing. Several SOM implementations have already been proposed in the literature [1–6]. Software (SW) implementations are the most common and bring more flexibility, whereas hardware (HW) implementations exploit the inherent parallelism of SOMs and may be preferred in real-time applications characterized by tight temporal constraints.
The state-of-the-art hardware SOM implementations are application-specific and have parameters such as the input and output layer sizes, timing constraints and memory requirements fixed in the design phase. Thus, the obtained hardware SOM implementation fits perfectly the needs of the specific application but is hardly adaptable to other applications without considerable design effort. The main reason for this lack of flexibility of hardware SOMs lies in the way their processing units, often called neurons, are connected and exchange data. The point-to-point links between neurons provide fast connections allowing them to exchange and compare the computed data, often within a single cycle. However, the complexity of these fully connected SOMs, which grows quadratically with the number of neurons, makes this type of connection impractical for large SOM networks. Scalability of hardware SOMs can be achieved by decoupling the computation layer, composed only of neurons, from the communication layer, which provides the data exchange mechanisms. In addition, more flexibility can be gained by making the neurons of the computation layer customizable and configurable at runtime, during normal SOM operation. Many real-time processing applications can fully benefit from such scalable and flexible hardware SOMs. Indeed, fast context switching from one task to another during the normal operation of a system can bring more flexibility to the latter and may enlarge its basic functionalities and fields of application. In this paper, a scalable and adaptable hardware SOM implementation is proposed.

This paper is organized as follows: Section 2 presents the theoretical background of this work and the state-of-the-art works in the domain of hardware SOM implementations. The proposed hardware architecture is detailed in Section 3. Section 4 presents the obtained results, whereas conclusions and perspectives are drawn in Section 5.

2. Background

2.1. Self-Organizing Map (SOM)

A Self-Organizing Map can be represented as a two-dimensional arrangement of L × K neurons. Each neuron has a weight vector m of dimension D, where D is the size of an input vector X:

X = {ξ_1, ξ_2, ..., ξ_D}    (1)
The SOM operation requires two phases: learning and recall. During the learning phase, the map generates its outputs by changing the weights of its neurons as a function of the input vectors used for training. After the learning phase, the trained map can be used for decision purposes in the recall phase, where each input vector is assigned to a neuron or a group of neurons in the map, often called a winner neuron or best matching unit (BMU). Each neuron calculates the distance between its weights m_{l,k} (0 ≤ l ≤ L − 1, 0 ≤ k ≤ K − 1) and the input vector X. In general, the calculated distance is the Euclidean distance (L2), as presented by:
D_{L2} = ‖X − m‖ = sqrt( Σ_{k=1}^{D} (ξ_k − µ_k)² )    (2)
Therefore, the winner neuron, which has the weight vector m_c closest to the input vector X, is identified. This phase is called competition and is expressed as follows:

c = argmin_{(l,k)} ‖X − m_{l,k}‖    (3)
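As a minimal software sketch of Equations 2 and 3, the competition phase can be modelled as a brute-force search for the best matching unit over the weight grid (all names and values below are illustrative, not part of the proposed hardware):

```python
import math

def euclidean_distance(x, m):
    """L2 distance between an input vector x and a weight vector m (Eq. 2)."""
    return math.sqrt(sum((xi - mi) ** 2 for xi, mi in zip(x, m)))

def find_bmu(x, weights):
    """Competition phase (Eq. 3): return the (l, k) index of the neuron
    whose weight vector is closest to the input vector x.
    `weights` is an L x K grid of weight vectors."""
    best, best_d = None, float("inf")
    for l, row in enumerate(weights):
        for k, m in enumerate(row):
            d = euclidean_distance(x, m)
            if d < best_d:
                best, best_d = (l, k), d
    return best

# Toy 2x2 map with 2-element weight vectors (illustrative values)
weights = [[[0.0, 0.0], [1.0, 0.0]],
           [[0.0, 1.0], [1.0, 1.0]]]
print(find_bmu([0.9, 0.1], weights))  # → (0, 1)
```

This exhaustive scan is what the hardware implementations discussed later accelerate or distribute.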
The recall and learning phases both carry out the calculation of the Euclidean distance and the competition phase to find the winner neuron. Moreover, during the learning phase, the weights of the winning neuron and of its closest neighbours are updated. This phase is called adaptation and is expressed as follows:

m(t) = m(t − 1) + h_{c,l,k}(t) [X(t) − m(t − 1)]    (4)
where h_{c,l,k}(t) is the neighbourhood function used to define the degree of learning of a neuron, which is higher in the vicinity of the winner neuron; moreover, it depends on the position of the neuron with respect to the winner's one and on the epoch number, representing the number of learning iterations. The neighbourhood function is defined by the following equation:

h_{c,l,k}(t) = α(t) · exp(−‖r_c − r_{l,k}‖² / (2σ²(t)))    (5)

with α(t) the learning rate, σ(t) the neighbourhood rate, r_c the position of the winning neuron, and r_{l,k} the position of the neuron with index (l, k).
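The Gaussian adaptation step of Equations 4 and 5 can be sketched as follows (a software model with illustrative names; fixed α and σ are used instead of time-decaying schedules for brevity):

```python
import math

def neighbourhood(alpha, sigma, r_c, r):
    """Gaussian neighbourhood function of Eq. 5."""
    d2 = (r_c[0] - r[0]) ** 2 + (r_c[1] - r[1]) ** 2
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))

def adapt(weights, x, c, alpha, sigma):
    """Adaptation phase of Eq. 4: move every weight vector towards x,
    scaled by its neighbourhood relation to the winner c."""
    for l, row in enumerate(weights):
        for k, m in enumerate(row):
            h = neighbourhood(alpha, sigma, c, (l, k))
            for i in range(len(m)):
                m[i] += h * (x[i] - m[i])

weights = [[[0.0, 0.0], [1.0, 0.0]]]
adapt(weights, [1.0, 1.0], c=(0, 1), alpha=0.5, sigma=1.0)
# winner (0, 1) moves the most (h = 0.5); (0, 0) uses h = 0.5*exp(-0.5)
```

The exponential and the multiplications in this step are exactly the operators the hardware simplification of Section 3.1 replaces with shifts.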
2.2. Literature survey
Due to their inherent parallelism and feature extraction property, SOMs have naturally found their place as vector quantizers. The use of hardware SOMs as vector quantizers has already been reported in many works [1–6].

Hikawa et al. in [2] proposed a massively parallel hardware SOM implementation adapted for vector quantization applications using a novel neighbouring function. The proposed "hardware friendly" neighbouring function exploits the inherent hardware parallelism, allowing the overall vector quantization performance to be improved. The presented architecture is massively parallel and high-performance: both the learning and recall phases are done within a clock cycle. The performance of the proposed architecture was tested in a colour quantization experiment, where 128×128 images were used as inputs and the number of neurons varied from 8×8 to 32×32. In the proposed architecture, the winner search operation is done within a clock cycle, whose duration is closely related to the size of the SOM, essentially due to the high number of point-to-point links. The proposed architecture achieves the best reported performance in the literature, but is inflexible and lacks scalability.

Figure 1: Illustration of the scalability of common hardware SOM architectures
Ramirez-Agundis et al. in [6] also proposed a massively parallel hardware SOM implementation for vector quantization. 16-element input vectors were used for all experiments on different SOM networks comprising from 16 to 256 neurons. The overall time needed for the learning and recall phases for all map sizes is in the ranges of 41 to 45 and 22 to 26 working clock cycles, respectively. The relatively constant learning and recall time regardless of the map size is explained by the massively parallel hardware implementation. The proposed architecture meets real-time video coding timing constraints for greyscale and colour images up to 640×480, but is poorly flexible and not scalable.

In [5], Kurdthongmee proposed an approach to accelerate the learning phase of a hardware SOM quantizer by evaluating the mean square error of the quantization process and comparing it with a threshold fixed in advance. A 16×16
map was used for all experiments, carried out on images with sizes varying from 32×32 to 512×512 pixels. The proposed approach was validated on a Xilinx Virtex-2 FPGA, providing real-time performance for image sizes up to 640×480 pixels.

The same author proposed in [4] a hardware SOM quantizer using a fast best matching unit (BMU) search. The BMU search has a unique goal: to find, for a given input vector, as fast as possible the identity of the winner neuron, the one whose weights are the closest to the input vector. The proposed approach allows the winner neuron to be found within 4 clock cycles. This outstanding result is to the detriment of the hardware resources necessary for the algorithm implementation, which necessitates multi-port memory blocks. The proposed approach was validated on a Xilinx Virtex-4 FPGA in the case of a 16×16 map processing up to 512×512 images at a maximal working frequency of 19.6 MHz. Both architectures reported in [5] and [4] are high-performance and meet real-time video processing constraints, but are also inflexible and lack scalability.

Kurdthongmee recently proposed in [3] an approach similar to the one in [4], focusing on the winner search operation. This operation is performed using 2K 1-bit memory blocks, where K is the value of the maximal distance that can be encountered in a SOM network. The 1-bit word is used to indicate the state of the address: 1 means that the corresponding address and distance were already found in the SOM, 0 otherwise. The main advantage of the proposed winner neuron search scheme is the total time, which is within a clock cycle. The proposed approach was tested on a 16×16 SOM using images with a 512×512 resolution, achieving real-time performance on a Xilinx Virtex-4 FPGA. As with other reported architectures, scalability cannot be achieved without considerable design effort.
Recently, Abadi et al. in [7] proposed a flexible and scalable hardware SOM architecture, without a particular application in mind. The flexibility and scalability are provided to a hardware SOM by means of a Network on a Chip (NoC) used for communication purposes. Therefore, a small hardware SOM can easily be extended to a larger one without decreasing the overall operating frequency, as is often the case in conventional hardware SOM architectures, essentially due to the important number of neurons connected in a point-to-point manner. However, this scalability is to the detriment of the winner search speed, which is greatly influenced by the communication latency of the used communication approach. The proposed architecture was tested on a Xilinx Virtex-6 FPGA, showing an estimated operating frequency of 200 MHz. However, the scalability and the flexibility of the presented approach have been demonstrated neither formally nor through an experiment.
Figure 2: (a) Structure of a 2D Mesh NoC. (b) NoC router architecture. (c) Packet structure
From the proposed literature survey, it can be concluded that the state-of-the-art HW SOM implementations accelerate either the winner search operation or the overall SOM computation by massively using the inherent parallelism of hardware. The winner search is often the most critical operation of the SOM, and the considerable attention that has been drawn to this aspect is fully justified. However, as illustrated in Figure 1, the SOM architectures reported in the literature cannot be used as a basis to build up larger SOM networks without considerable design effort. They are mainly application-specific, high-performance and poorly flexible. Moreover, the delivery of input vector data is often overlooked in the state-of-the-art architectures: it is often assumed that input vectors are available at the same time at the inputs of all neurons for distance calculation, without detailing the way of their delivery. Hence, point-to-point links can be, and often are, used to synchronously deliver these input vectors to neurons, but they do not present a viable solution for scalable SOM architectures. Consequently, to the best of our knowledge, the scalability and adaptability of hardware SOMs, except in a general manner in the work [7], have never been addressed in SOM architectures.
The main contributions of this work are the following:

• the classical SOM operation is formally described at the algorithmic level and decomposed to form a scalable and easily configurable SOM algorithm depending only on the map dimensions,

• an architecture based on the use of a Network-on-a-chip approach is proposed to implement the scalable SOM operation formally described in the first phase, and

• the proposed scalable SOM architecture is validated through simulation on an image compression application.
2.3. Network on Chip (NoC)
A Network-on-a-chip (NoC) is presented as an alternative communication approach to the commonly used shared bus, allowing several processing units integrated on a single chip to communicate [8]. Systems with hundreds of processors are not uncommon, and traditional interconnects such as the shared bus struggle to meet the required performance. Interconnects designed for dozens of components cannot easily scale to support the hundreds or even more components required by today's systems. With the NoC interconnect, which does not lack scalability as the traditional shared bus does, small networks can easily be grouped into larger ones via pipeline stages, bridges or other means as required. Therefore, the NoC interconnect could easily support thousands of processing nodes, and could even provide a transport network spanning multiple chips. Additionally, the NoC interconnect is characterized by explicit parallelism, high bandwidth and a high degree of modularity, which makes it very suitable for distributed architectures [9].

The structure of a 2D mesh NoC is shown in Figure 2(a). It is composed of processing elements (PEs) and routers. Each router is associated to one or
175
more PEs via a network interface (NI), whose primary function is to pack (before sending) and unpack (after receiving) the data exchanged between PEs. Figure 2(c) illustrates the structure of the packets circulating in the network using the wormhole switching technique. Each packet is composed of flits (the smallest indivisible entities): the header flit, which opens the communication and "shows" the route to the other flits; the body flit, which contains the data to be transmitted between PEs; and the tail flit, which closes the borrowed communication links and frees the router for other packets.
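The header/body/tail decomposition can be sketched as below. The 2-bit flit type codes and 8-bit address/data fields follow the packet structure of Figure 2(c); the function name and payload handling are hypothetical:

```python
# Flit type codes as shown in Figure 2(c); helper names are illustrative.
HEADER, BODY, TAIL = 0b01, 0b00, 0b11

def make_packet(dest, src, payload):
    """Split a payload into wormhole flits: one header flit carrying the
    destination address, body flits carrying 8-bit data words, and a tail
    flit carrying the source address that releases the path."""
    flits = [(HEADER, dest)]
    flits += [(BODY, byte & 0xFF) for byte in payload]
    flits.append((TAIL, src))
    return flits

pkt = make_packet(dest=0x12, src=0x01, payload=[7, 42])
print(pkt)  # → [(1, 18), (0, 7), (0, 42), (3, 1)]
```

In wormhole switching, only the header flit performs routing; the body and tail flits simply follow the path the header has reserved.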
The main module of a NoC is the router (see Figure 2(b)), whose main function is to forward ingoing packets to neighbouring routers until they reach their final destination. The router is composed of a crossbar, which establishes multiple wire connections between its input and output links according to the predefined routing algorithm and scheduling policy. The crossbar and the arbitration of packets in the router are handled by a control logic block. For instance, if multiple ingoing packets from different input links simultaneously try to reserve the same output link, the control block has to arbitrate this situation and prioritize the forwarding of packets from the different input links. The waiting ingoing packets must be stored in buffers until their forwarding is scheduled. That is why each router has input and/or output buffers, whose role is to temporarily accept flits before their transmission to either the local PE or to one of the neighbouring routers.

2.4. Systolic SOM architecture

A way of decreasing the hardware complexity of SOMs, especially their point-
to-point links, is to use the systolic approach of data exchange. The synoptic scheme of a systolic SOM architecture is shown in Figure 3 [10, 11]. In this architecture, data between neurons are exchanged through levels, from one level

Figure 3: Systolic architecture of SOM showing an example of data propagation between systolic levels
to the next adjacent one, as shown in Figure 3. In a L × K SOM architecture, there are N_sl systolic levels, where N_sl = L + K − 1. The learning phase of the systolic SOM architecture requires two phases, as the conventional SOM: the competition phase, during which the search for the winner neuron is performed; and the adaptation phase, where the weights of the neurons are updated according to the used neighbouring function and the relative position to the winner neuron.

During the competition phase, the minimum distance propagates from one level to another, starting from the top-left neuron located at level 1. A neuron belonging to level i first receives the comparison results of the adjacent neighbours belonging to level i − 1, then performs a comparison between the received data and its local distance, and finally sends the minimal local distance to its adjacent neighbours belonging to level i + 1. Figure 3 shows an example of data exchange between the levels of the presented systolic architecture. It can be shown that the global minimal distance is obtained after N_sl − 1 level propagation cycles. At the end of the competition phase, the winner neuron is identified at the bottom-right neuron located at level L + K − 1, from which the adaptation phase starts by broadcasting the identity of the winner to all neurons. It should also be noticed that all neurons belonging to level i can propagate their computed distances simultaneously. Therefore, the time needed to finish the competition phase depends on the SOM topology: the bigger the number of levels N_sl, the greater the time needed for the competition phase. It should also be noted that all reported systolic SOM architectures lack scalability, essentially due to the impossibility to (re-)configure at run-time the neuron-to-neuron systolic connections without design efforts.
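The level-wise minimum propagation described above can be modelled in a few lines (an illustrative sketch; precomputed distances stand in for the neurons' local computations):

```python
def systolic_winner(distances):
    """Propagate local minima through systolic levels (diagonals) of an
    L x K map. Neuron (i, j) sits on level i + j + 1; the global winner
    emerges at the bottom-right neuron after Nsl - 1 propagation steps."""
    L, K = len(distances), len(distances[0])
    best = {}  # (i, j) -> (distance, coords) of best candidate seen so far
    for level in range(L + K - 1):        # Nsl = L + K - 1 levels
        for i in range(L):
            j = level - i
            if 0 <= j < K:
                cands = [(distances[i][j], (i, j))]
                if i > 0: cands.append(best[(i - 1, j)])
                if j > 0: cands.append(best[(i, j - 1)])
                best[(i, j)] = min(cands)
    return best[(L - 1, K - 1)][1]        # coordinates of the winner

d = [[9, 4, 7],
     [5, 2, 8]]
print(systolic_winner(d))  # → (1, 1)
```

Note that all neurons on one level update in the same step, which is why the total competition time grows with N_sl rather than with L × K.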
3. The proposed hardware SOM architecture
Before presenting the hardware implementation details of the proposed SOM architecture, the SOM operation which forms the basis for this hardware architecture is described first.

3.1. Proposed SOM operation
At the heart of the SOM operation, executed on a map of L × K neurons, is the calculation of the Euclidean distance according to Equation 2. Obviously, each calculated distance in the SOM is positive, D_{L2} ≥ 0. Therefore, for any two neurons n_1 and n_2, if D_{L2,n_1} < D_{L2,n_2} then D²_{L2,n_1} < D²_{L2,n_2}. For this reason, both D_{L2} and D²_{L2} lead to the same result in the process of identifying the winning neuron. Moreover, the measure D²_{L2} is often favored over the measure D_{L2}, because it allows the square-root operation to be omitted and thus decreases the computational complexity of the SOM algorithm [10, 12]. The measure D²_{L2} is calculated throughout this work; for a neuron at position (i, j) it is given by:

D²_{L2,(i,j)} = ‖X − m_{i,j}‖² = Σ_{k=1}^{D} (ξ_k − µ_{(i,j)k})²    (6)

where D is the size of the input vector X.
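The equivalence of the rooted and squared measures for the winner search can be checked directly (an illustrative sketch with toy values):

```python
import math

def d2(x, m):
    """Squared Euclidean distance of Eq. 6 -- no square root needed."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

x = [3.0, 1.0]
neurons = {(0, 0): [0.0, 0.0], (0, 1): [2.0, 2.0], (1, 0): [3.0, 0.0]}
win_sq   = min(neurons, key=lambda n: d2(x, neurons[n]))
win_root = min(neurons, key=lambda n: math.sqrt(d2(x, neurons[n])))
print(win_sq == win_root, win_sq)  # → True (1, 0)
```

Since the square root is monotonically increasing on non-negative values, dropping it never changes the argmin, which is exactly why the hardware can skip it.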
The direct hardware implementation of the neighbourhood function presented by Equation 5 necessitates the use of several arithmetic operators. Moreover, this direct implementation also implies the need for an additional multiplier in the adaptation phase to update the weights of the neurons according to Equation 4. In order to limit the hardware requirements, the neighbourhood equation is often simplified in hardware implementations [2, 13–18]. A widely used solution is to replace the Gaussian function of Equation 5 with a restricted set of values corresponding to negative powers of two. Thus, the arithmetic operators that are required for the direct implementation of Equations 5 and 4 in the adaptation phase are replaced with a simple shift function [2, 3, 18]. This simplification was investigated by simulation in [16] and [14], where it was shown that results comparable to the original Kohonen's SOM algorithm were obtained. Moreover, Porrmann et al. demonstrated in [14] that some applications may need additional learning steps to obtain results comparable with the original Kohonen's algorithm. Therefore, for a neuron at position (i, j), its neighbourhood function can be written as [18]:

h_{i,j} = 1 / 2^{S_{i,j}}    (7)

where S_{i,j} is the number of shifts, determined according to the relative position R_{i,j} of the neuron (i, j) with respect to the winner and to the learning phase's neighbourhood rate β, which evolves with the number of learning iterations (epoch number), as:

S_{i,j} = R_{i,j} + β    (8)
With this simplification, the update of the weights of a neuron at position (i, j) given by Equation 4 becomes:

m_{i,j}[n+1] = m_{i,j}[n] + (X[n] − m_{i,j}[n]) / 2^{S_{i,j}},  if R_{i,j} < R_v
m_{i,j}[n+1] = m_{i,j}[n],  otherwise    (9)

where R_{i,j} is the distance between the neuron (i, j) and the winner. If a neuron belongs to the neighbourhood radius R_v, its weights are updated during this phase; otherwise they are not. The R_v radius is initialized with L + K − 1, where L and K are the map's dimensions, and decreases with the progression of the learning phase.

The winner neuron is searched for in a systolic manner, where a neuron at position (i, j), 0 ≤ i < L, 0 ≤ j < K, carries out a local winner search operation in a limited set of neighbouring neurons as follows:

c_{i,j} = argmin_{(l,k)∈W_0} D²_{L2,(l,k)},  i = 0, j = 0,
c_{i,j} = argmin_{(l,k)∈W_1} D²_{L2,(l,k)},  i = 0, 0 < j < K,
c_{i,j} = argmin_{(l,k)∈W_2} D²_{L2,(l,k)},  0 < i < L, j = 0,
c_{i,j} = argmin_{(l,k)∈W_3} D²_{L2,(l,k)},  0 < i < L, 0 < j < K,    (10)

where the sets W_i (0 ≤ i ≤ 3) are:

W_0 = {(i, j)},  i = 0, j = 0
W_1 = {(i, j), (i, j − 1)},  i = 0, 0 < j < K
W_2 = {(i − 1, j), (i, j)},  0 < i < L, j = 0
W_3 = {(i − 1, j), (i, j), (i, j − 1)},  0 < i < L, 0 < j < K.    (11)

From Equations 10 and 11, it can be noticed that the neuron at position (0, 0) does not carry out a local winner search, because it has no neighbouring neurons on its left and top sides; the result of its local winner search operation is its own coordinates. Moreover, the neurons in the first row (j = 0) and the first column (i = 0) search for a local winner among two neurons, whereas all other neurons (0 < i < L, 0 < j < K) search for a local winner in a set of 3 neurons. The identity of the winner is known at position (L − 1, K − 1). Therefore, the global winner search can be written as:

c = argmin_{(l,k)∈W} D²_{L2,(l,k)} = c_{L−1,K−1}    (12)

where W = {(i, j) | 0 ≤ i < L, 0 ≤ j < K} is the set of all available neurons in the L × K SOM.
f_{src→dest}(data)

  neuron at (i, j)          | src    | dest     | data
  --------------------------|--------|----------|--------------------------
  0 ≤ i ≤ L−2, 0 ≤ j ≤ K−2  | (i, j) | (i+1, j) | D²_{L2,c_{i,j}}, c_{i,j}
                            |        | (i, j+1) | D²_{L2,c_{i,j}}, c_{i,j}
  0 ≤ i ≤ L−2, j = K−1      | (i, j) | (i+1, j) | D²_{L2,c_{i,j}}, c_{i,j}
  i = L−1, 0 ≤ j ≤ K−2      | (i, j) | (i, j+1) | D²_{L2,c_{i,j}}, c_{i,j}

Table 1: Data exchange between neurons in the competition phase
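The destination pattern of Table 1, which the configuration function g(i, j) of the text sets up in each neuron, can be sketched as follows (a simplified model: only the competition-phase destinations are computed, and the payloads of Table 1 are omitted):

```python
def g(i, j, L, K):
    """Competition-phase destinations of neuron (i, j) on an L x K map,
    following the rows of Table 1 (illustrative sketch)."""
    dests = []
    if i < L - 1: dests.append((i + 1, j))   # forward down, unless last row
    if j < K - 1: dests.append((i, j + 1))   # forward right, unless last column
    return dests

# Extending a 2x2 map to 3x3 only requires re-running g for every neuron:
small = {(i, j): g(i, j, 2, 2) for i in range(2) for j in range(2)}
large = {(i, j): g(i, j, 3, 3) for i in range(3) for j in range(3)}
print(small[(1, 1)], large[(1, 1)])  # → [] [(2, 1), (1, 2)]
```

The corner neuron (L−1, K−1) naturally gets an empty destination list, matching the "no operation" case of the winner search; after a map extension it acquires new destinations, which is the core of the claimed scalability.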
The result of each local winner search operation at the neuron (i, j) is the coordinate pair of the local winner c_{i,j}, which has to be propagated, together with the corresponding squared Euclidean distance D²_{L2,c_{i,j}}, also in a systolic manner. If we define a data transport function f_{(l,k)→(m,n)}(d) between two neurons at positions (l, k) and (m, n) (0 ≤ l, m ≤ L − 1, 0 ≤ k, n ≤ K − 1), where the neuron (l, k) is the source (src), the neuron (m, n) is the destination (dest), and d is the data to propagate, the data exchange between all neurons during the competition phase can be described as presented in Table 1. It should be noted that each neuron sends to its neighbours the squared Euclidean distance D²_{L2,c_{i,j}} corresponding to the neuron elected during the local winner search operation, together with its coordinates c_{i,j} (0 ≤ i ≤ L − 1, 0 ≤ j ≤ K − 1).
  neuron at (i, j)          | src         | dest     | data
  --------------------------|-------------|----------|------
  0 ≤ j ≤ K−2               | (L−1, K−1)  | (L−1, j) | c
  0 ≤ i ≤ L−2, 0 ≤ j ≤ K−1  | (L−1, j)    | (i, j)   | c

Table 2: Data exchange between neurons in the adaptation phase

On the other hand, during
the adaptation phase, the identified winner position c = c_{L−1,K−1} is sent to all neurons. The neuron at position (L − 1, K − 1) initiates this phase, because the global winner search ends at this position. First, it sends the global winner identity to all neurons located in column L − 1. Thereafter, these neurons send the global winner identity c to all other neurons by row.

If for a neuron at position (i, j) we also define a data transport configuration function g(i, j), whose main role is to configure the destination neurons of the data transport function f according to Tables 1 and 2, the call of this function by the neuron (i, j) will specifically configure it to send data in the systolic manner described earlier. Moreover, to include added neurons in the SOM operation, this function should be called every time the dimensions of the SOM change. Consequently, the proposed SOM operation initially started on a L1 × K1 map can be extended to a L2 × K2 map, L1 < L2, K1 < K2, only by updating the destination neurons of the data transport function, that is, by calling the function g(i, j) for each neuron (i, j) of the new map, 0 ≤ i < L2, 0 ≤ j < K2.

3.2. SOM Operation
Algorithm 1: SOM Operation

if new configuration then
    L* ← L; K* ← K; D* ← D;
    for i ← 0; i < L*; i ← i + 1 do
        for j ← 0; j < K*; j ← j + 1 do
            g(i, j);
        end for;
    end for;
else
    if competition phase then
        D²_{L2,(0,0)};                         – Equation (6)
        f_{(0,0)→(1,0)}(D²_{L2,(0,0)});        – transfer function (0,0)→(1,0)
        f_{(0,0)→(0,1)}(D²_{L2,(0,0)});        – transfer function (0,0)→(0,1)
        for i ← 1; i < L*; i ← i + 1 do
            for j ← 1; j < K*; j ← j + 1 do    – done in parallel in all neurons
                D²_{L2,(i,j)};                 – Equation (6)
                c_{(i,j)};                     – Equation (10)
                if i = L* − 1 and j = K* − 1 then
                    no operation
                else if j = K* − 1 then
                    f_{(i,j)→(i+1,j)}(D²_{L2,c_{i,j}});
                    f_{(i,j)→(i+1,j)}(c_{i,j});
                else if i = L* − 1 then
                    f_{(i,j)→(i,j+1)}(D²_{L2,c_{i,j}});
                    f_{(i,j)→(i,j+1)}(c_{i,j});
                else
                    f_{(i,j)→(i+1,j)}(D²_{L2,c_{i,j}});
                    f_{(i,j)→(i+1,j)}(c_{i,j});
                    f_{(i,j)→(i,j+1)}(D²_{L2,c_{i,j}});
                    f_{(i,j)→(i,j+1)}(c_{i,j});
                end if
            end for;
        end for;
        c ← c_{(L*−1,K*−1)};                   – The winner identified
    end if;
    if adaptation phase then
        for j ← K* − 2; j ≥ 0; j ← j − 1 do
            f_{(L*−1,K*−1)→(L*−1,j)}(c);
        end for;
        for j ← K* − 1; j ≥ 0; j ← j − 1 do
            for i ← L* − 2; i ≥ 0; i ← i − 1 do
                f_{(L*−1,j)→(i,j)}(c);
            end for;
        end for;
        for i ← 0; i < L*; i ← i + 1 do
            for j ← 0; j < K*; j ← j + 1 do    – done in parallel in all neurons
                S_{(i,j)};                     – Equation (8)
                h_{(i,j)};                     – Equation (7)
                m_{(i,j)};                     – Equation (9)
            end for;
        end for;
    end if;
end if;
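As a software cross-check, one learning step of Algorithm 1 can be modelled compactly as below. This is an illustrative sketch, not the hardware: the transport function f is folded into plain array accesses, and the Manhattan distance used for R_{i,j} is an assumption of this sketch.

```python
def som_step(weights, x, beta, Rv):
    """One learning step: systolic competition (Eqs. 6, 10-12) followed by
    shift-based adaptation (Eqs. 7-9). Names and structure are illustrative."""
    L, K = len(weights), len(weights[0])
    d2 = [[sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in row]
          for row in weights]
    # competition: each neuron picks the best of itself and its top/left results
    c = [[None] * K for _ in range(L)]
    for i in range(L):
        for j in range(K):
            cands = [(i, j)]
            if i > 0: cands.append(c[i - 1][j])
            if j > 0: cands.append(c[i][j - 1])
            c[i][j] = min(cands, key=lambda lk: d2[lk[0]][lk[1]])
    winner = c[L - 1][K - 1]
    # adaptation: shift-based neighbourhood
    for i in range(L):
        for j in range(K):
            R = abs(i - winner[0]) + abs(j - winner[1])  # assumed Manhattan radius
            if R < Rv:
                S = R + beta                              # Eq. 8
                h = 1.0 / (1 << S)                        # Eq. 7: h = 2^-S
                m = weights[i][j]
                for n in range(len(m)):
                    m[n] += h * (x[n] - m[n])             # Eq. 9

    return winner

w = [[[0.0, 0.0], [1.0, 0.0]],
     [[0.0, 1.0], [0.9, 0.9]]]
print(som_step(w, [1.0, 1.0], beta=0, Rv=2))  # → (1, 1)
```

Note how the division by 2^{S_{i,j}} is the only "multiplier" left in the adaptation, which is what makes the hardware version shift-only.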
The proposed SOM operation is summarized in Algorithm 1 for a L* × K* SOM. First, the dimensions of the map (L and K) and of the input vector (D) are configured, and thereafter the function g(i, j) is called for each neuron (i, j). This operation can be considered as the initialization phase at the first run of the SOM operation, or as a new configuration for all later changes of the SOM size during operation. Once the initialization or the configuration is done, the SOM operation is carried out in the systolic manner described earlier. If the SOM is in the learning phase, both the competition and adaptation phases are executed, whereas in the recall phase only the competition is considered.

The competition phase starts with the calculation of the squared Euclidean distance by the neuron (0, 0), which is thereafter propagated to the neurons (1, 0) and (0, 1), respectively. Then, each other neuron (i, j) in its turn calculates the squared Euclidean distance, compares it with the ones received from its neighbours (i − 1, j) and (i, j − 1), if they exist, and sends the final result of the local winner search operation to its neighbours (i + 1, j) and (i, j + 1), if they exist too. The competition phase ends at the neuron (L* − 1, K* − 1), where the identity of the global winner, for a given input vector, is finally known. Afterwards, the adaptation phase can start. Before updating the weights of all neurons, the identity of the global winner must be known by all neurons in the map. The neuron (L* − 1, K* − 1) first sends this data to all neurons located in column L − 1. Thereafter, the neurons located in column L − 1 forward the global winner identity (c) to all other neurons by row. Upon reception of these data, the update of the neurons' weights starts, first by calculating the shift parameter S_{i,j}, then the neighbourhood function h_{i,j} and finally the new weights m_{i,j}.

3.3. Proposed SOM architecture
weights mi,j . 3.3. Proposed SOM architecture The starting point of the proposed SOM architecture is the 2D mesh NoC topol-
395
ogy presented in Figure 2(a). Two layers can be distinguished: the communication layer consisting of NoC routers; and the processing layer composed of processing elements both arranged in a 2D manner. The NoC routers using 17
the wormhole switching technique (Section 2.3), are used for data transfer. On the other hand, the processing elements, called SOM neurons throughout this work, carry out the SOM operation described in the previous section. The block diagram of a neuron is shown in Figure 4. It is composed of five modules: a Vector Element Processor (VEP), a Local Winner Search (LWS), an Update Signal Generator (USG), a Network Interface (NI) and a Local Configuration Module (LCM).

Figure 4: SOM-NoC neuron: each neuron is connected to other neurons through the network interface and the NoC
The VEP module is the unit calculating the squared Euclidean distance in the competition phase according to Equation 6, and updating the neuron's weights in the adaptation phase according to Equations 7 and 9. Its block diagram is presented in Figure 5. It has two memory blocks, one for storing the neuron's weights and the other one for the delta values (the differences between the weights µi and the input vector elements ξi), which are necessary for the adaptation phase. The squared Euclidean distance is calculated in a sequential manner for each received input vector in the competition phase, as is the update of the neuron's weights in the adaptation phase. It should be noted that this sequential manner of calculation in the VEP is not mandatory for the proposed SOM operation. If high performances are targeted, these calculations should be done in a massively parallel manner, as already reported in the literature [2].

Figure 5: Vector Element Processor
Figure 6: Local Winner Search
Figure 7: Update Signal Generator
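The sequential VEP datapath can be mimicked in software as follows. This is an illustrative sketch, not the hardware: the shift-based step used in `vep_update` is an assumed form of the update of Equations 7 and 9, and the helper names are hypothetical.

```python
def vep_distance(weights, x, delta):
    """Sequential squared-Euclidean distance: one vector element per
    'cycle'.  Also records the deltas (mu_i - xi_i) in the second
    memory block, needed later for the adaptation phase."""
    acc = 0.0
    for i, (mu, xi) in enumerate(zip(weights, x)):
        delta[i] = mu - xi           # stored in the delta memory
        acc += delta[i] * delta[i]   # multiply-accumulate
    return acc

def vep_update(weights, delta, shift):
    """Sequential weight update: each weight moves towards the input
    by delta / 2^shift (an assumed shift-based neighbourhood step)."""
    for i in range(len(weights)):
        weights[i] -= delta[i] / (1 << shift)
```

Both loops consume one element per iteration, which matches the D-proportional cycle counts given later for the distance and update phases.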
The block diagram of the LWS module is shown in Figure 6. This module carries out the local winner search according to Equation 10. With this module, up to three squared Euclidean distances can be compared. The neuron (0, 0) does not carry out the local winner search operation like the other neurons; in this case, the LWS simply sets the local squared Euclidean distance and coordinates at its outputs D²_{L2,c} and c_{i,j} respectively. In the case of the neurons located at the first column or first row, the local distance D²_{L2,(i,j)} is compared only to D²_{L2,(i,j−1)} or D²_{L2,(i−1,j)} respectively. The different operating modes of the LWS module are set up during the configuration phase performed by the LCM, described later in detail.
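A behavioural model of this three-way comparison is straightforward; the (distance, coordinates) pairing used below is an assumed representation of the LWS inputs and outputs.

```python
def local_winner_search(local, upper=None, left=None):
    """Compare the local (distance, coordinates) pair with up to two
    pairs received from neighbours (i-1, j) and (i, j-1); the smallest
    distance wins.  For neuron (0, 0) both inputs are absent and the
    local pair is forwarded unchanged."""
    candidates = [local]
    if upper is not None:
        candidates.append(upper)
    if left is not None:
        candidates.append(left)
    return min(candidates, key=lambda dc: dc[0])
```

Neurons of the first row or first column call it with a single neighbour input, which reproduces the restricted comparison modes configured by the LCM.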
The USG module generates the shift number S according to Equation 8. Its block diagram is presented in Figure 7. On receipt of the global winner neuron coordinates c, the USG calculates the relative position R_{i,j} of its parent neuron (i, j) with respect to the global winner. This value is then combined with the overall learning progress phase β to give the number of shifts for the neighbourhood function h_{i,j}.
The LCM unit is the module implementing the data transport configuration function g_{i,j} introduced in the previous section. Therefore, for each new configuration, the LCM configures, according to the map size, its parent neuron's neighbours (i + 1, j) and (i, j + 1), to which it will send the result of the local winner search D²_{L2,c_{i,j}} in the presented systolic manner during the competition phase. In addition, for the data transfers during the adaptation phase, the LCM also configures the broadcasting role of its parent neuron as a function of the map's size (see Table 2); the neurons located at the column L − 1 also have to forward the global winner identity c during the adaptation phase. Moreover, the LCM generates control signals for the LWS module, allowing it to choose the right LWS operation according to the neuron's position (see Figure 6). It also configures the size of the input vector D for the VEP, which is mandatory for the sequential distance calculation.

Figure 8: Global Configuration Module
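The configuration derived by the LCM from the map size can be sketched as follows. This is an illustrative model: the field names are hypothetical and do not correspond to actual LCM registers.

```python
def lcm_configure(i, j, L, K):
    """Sketch of the configuration function g(i, j): from the map size,
    derive where neuron (i, j) forwards its LWS result during the
    competition phase, and its broadcasting role during adaptation."""
    return {
        # systolic successors for the local-winner result
        "lws_targets": [(a, b) for a, b in ((i + 1, j), (i, j + 1))
                        if a < L and b < K],
        # neurons at column L-1 forward the winner ID along their row
        "broadcast_row": i == L - 1,
        # the last neuron initiates the winner-ID diffusion
        "diffusion_source": (i, j) == (L - 1, K - 1),
    }
```

Because the targets are recomputed from (L, K) alone, resizing the map only requires calling this function again in every neuron, which is the essence of the claimed reconfigurability.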
The NI unit manages the data received/sent by its parent neuron from/to other neurons. Along with the router to which it is physically connected, and indirectly the other routers of the NoC, it ensures the implementation of the data transport function f_{src→dst}(data) introduced in the previous section. The received data are either data related to the SOM operation (D²_{L2,c_{i,j}}, c_{i,j}, D²_{L2,c}, c), input vector elements (ξi) or neuron configuration data (L, K and D). The NI also ensures that all received data are correctly identified and dispatched to the corresponding modules, as presented in Figure 4. On the other hand, the locally identified winner and its squared Euclidean distance are prepared by the NI in a form understandable by the NoC before being sent (see Figure 2(c)). These prepared data include the destination addresses of the neighbouring neurons, which are configured and supplied by the LCM unit. In this way, the systolic SOM operation is ensured.

Figure 9: Illustration of the scalability of the proposed hardware SOM-NoC architecture
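The packing role of the NI can be pictured with the following sketch. Only the two-flit structure (a header flit followed by one payload flit) follows the message format described later in the text; the flit encoding itself is an assumption made for illustration.

```python
def pack_lws_message(dest, winner_coord, winner_dist):
    """Build a wormhole message for a local-winner result: a header
    flit carrying the destination router coordinates, then one payload
    flit with the winner ID and its squared distance (encoding assumed)."""
    header = ("HDR", dest)
    payload = ("PAYLOAD", winner_coord, winner_dist)
    return [header, payload]

def dispatch(flits):
    """NI receive side: strip the header and hand the payload to the
    module concerned (here, the LWS)."""
    assert flits[0][0] == "HDR"
    return flits[1][1:]
```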
The proposed hardware SOM architecture has at the system level a Global Configuration Module (GCM), as presented in Figure 8. This module sends the configuration data to all neurons as well as the input vector elements to be processed. As configuration data, the dimensions of the map L and K, and the input vector dimension D are sent to all neurons. Upon reception of these configuration data, each neuron, via its LCM unit, has to (re-)configure its own parameters, especially the addresses of the neurons to which the LWS data will be sent during the competition phase, and the addresses of the neurons to which the global winner identity will be broadcast in the adaptation phase. The time needed to transport data between neurons via the NoC is not zero and must be taken into account during the configuration phase: not all neurons will receive the new configuration data at the same time. To shorten the time needed to (re-)configure the entire SOM network, a solution is to send the same configuration data in parallel by columns (dashed lines from the GCM to the routers), thus supplying all neurons belonging to the same row at the same time.
The major originality of the proposed hardware SOM architecture is its scalability. It is illustrated in Figure 9 on an example of extending an initial 2 × 2 to a 4 × 2 SOM-NoC architecture. This operation goes through three phases: a physical linking, a configuration of the routers and a configuration of the neurons. We assume that the new router/neuron pairs added to the initial architecture must be of the same type as the initial ones. For this reason, the physical linking between the added router/neuron pairs is straightforward. In the second phase, the added routers must follow the initial structure in terms of coordinates. As illustrated in Figure 9, the coordinates of the added routers (in red) are ordered in accordance with the row and column coordinates of the initial routers. This configuration can be done either in the design phase (in the case of an inter-chip communication) or at runtime, where the right row and column coordinates are sent to the routers, which are presumably configurable. It should be mentioned that this phase is crucial for the correct operation of the network, because the message routing between routers is based on these coordinates. Finally, in the last phase, all neurons of the new larger network must be updated with these changes. These new configuration data are sent to all neurons by the GCM unit, as explained previously.
4. Results and discussion
4.1. Performance evaluation

The proposed architecture was described in VHDL and synthesized on a Xilinx VC707 Virtex-7 FPGA board using the Xilinx ISE Design Suite 14.7. It was also compared to the state-of-the-art SOM architectures presented in [19] and [2], implemented on the same technology. The architecture presented in [2] is a massively parallel SOM having the best performances reported in the literature. On the other hand, the architecture presented in [19] is a sequential hardware SOM architecture, which is highly flexible and configurable. The comparison results for 16-element input vectors are presented in Figures 10, 11 and 12 in terms of maximal operating frequency and number of cycles needed for the learning and recall phases respectively, for different map sizes.
Figure 10 shows that the proposed architecture has the highest operating frequency for map sizes up to 8 × 8. This result does not imply that the proposed architecture gives the best results in terms of performances among the three tested architectures. It only confirms that the proposed architecture is scalable in terms of working frequency, which is essentially due to the use of the NoC for communication: no matter the map size, the operating frequency remains stable. The maximal operating frequency of the sequential architecture is also stable, because its hardware structure changes only slightly between two different map sizes. On the other hand, the massively parallel hardware architecture is the most influenced by the map size: its maximum working frequency decreases as the map size increases.
Figures 11 and 12 show the comparison results in terms of the number of cycles needed for the learning and recall phases respectively, for different map sizes (up to 128 × 128). These results are the true measure of the performances of the three hardware SOM architectures. The massively parallel architecture is unbeatable in terms of performances: no matter the map size, the learning and recall phases are both finished within one clock cycle. On the other hand, the sequential architecture has the poorest results, which is expected given that all operations are done sequentially. The proposed SOM-NoC architecture is much faster than the sequential one, but much slower than the parallel one. It should also be stated that the main objective of our proposed architecture is to show how the SOM operation can be made scalable, not to propose the best performing SOM architecture.

Figure 10: Maximal operating frequency as a function of SOM size
Figure 11: Duration of the learning phase as a function of SOM size
Figure 12: Duration of the recall phase as a function of SOM size

Figures 13 and 14 show the time distribution of the proposed SOM architecture for both learning and recall phases as a function of the map size. It can be seen that, for a given input vector size (here 16-element vectors), the duration of the squared Euclidean distance calculation is constant with respect to the map size. Moreover, these calculations are done in parallel in all neurons, and the presented times for all map sizes correspond to the most optimistic case, in which we assume that all neurons start and finish computing the Euclidean distances at the same time. This assumption also implies that the input vectors are delivered synchronously to all neurons, which is rarely the case. On the other hand, the winner search operation, which is distributed over the network in the systolic manner described earlier, is the most time consuming operation. Table 3 details the time distribution of all operations of the proposed architecture as a function of the map size and the input vector dimension. From Table 3, it can be seen that the input vector dimension only influences the squared Euclidean distance calculation and the weight update, T_{DL2} and T_U respectively, not the global winner search part. The global winner search takes T_{GWS} cycles to be carried out, and is proportional to the number of systolic stages, which is equal to L + K − 2 for an L × K network. The factor 9 is the time in clock cycles needed for a message to cross one systolic level.

Figure 13: Time distribution of the learning phase for the proposed SOM architecture
Figure 14: Time distribution of the recall phase for the proposed SOM architecture

Table 3: Time distribution of all operations as a function of the map size L × K and the input vector dimension D

Step                         Time                       Time [Clk]
D²_{L2,(i,j)} calculation    T_{DL2}                    D + 4
Global winner search         T_{GWS}                    9(L + K − 2)
Winner ID broadcasting       T_{BC}                     4(L + K − 1)
Update phase                 T_U                        D + 1
Competition                  T_C = T_{DL2} + T_{GWS}    D + 4 + 9(L + K − 2)
Adaptation                   T_A = T_{BC} + T_U         D + 1 + 4(L + K − 1)
Learning                     T_L = T_C + T_A            2D − 4 + 13(L + K − 1)

The global winner search operation is time consuming for several reasons. First, for a neuron (i, j), the calculated squared Euclidean distance is sent to its neighbours (i + 1, j) and (i, j + 1). The calculated distance, before arriving at its destination, leaves the source neuron, crosses two NoC routers (the corresponding router and the neighbour's one), leaves the router and finally accesses the destination neuron. Each of these steps takes some time and lengthens the total amount of time T_{GWS} needed to finish the global winner search. Second, it should be stated that the employed NoC routers are commonly used NoC routers based on the wormhole switching technique, without any modification or adaptation to the presented systolic SOM operation. In addition, the winner ID broadcasting phase is less time consuming than the global winner search part (T_{BC} < T_{GWS}). The main reason for this is that a part of the winner ID broadcasting is done in parallel by all neurons belonging to the column (L − 1, j), where 0 ≤ j < K.
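The cycle counts of Table 3 can be written down directly; the following helper (illustrative only) reproduces them and makes the linear dependence on L + K explicit.

```python
def som_timing(L, K, D):
    """Cycle counts from Table 3 for an L x K map, D-element vectors."""
    t_dl2 = D + 4                  # squared Euclidean distance
    t_gws = 9 * (L + K - 2)        # global winner search (9 cycles/stage)
    t_bc = 4 * (L + K - 1)         # winner ID broadcasting
    t_u = D + 1                    # update phase
    t_c = t_dl2 + t_gws            # competition
    t_a = t_bc + t_u               # adaptation
    return {"recall": t_c, "learning": t_c + t_a}
```

For example, a 16 × 16 map with D = 16 yields 290 cycles for recall and 431 cycles for learning, in accordance with the closed form 2D − 4 + 13(L + K − 1).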
In the proposed approach, all neurons are supplied with input vectors through the columns, as described earlier. Thus, all neurons belonging to the same row receive the input vectors synchronously, and start the distance calculations at the same time. The arrival time of the input vectors to the neurons belonging to the same row also depends on the traffic load of the network, which may perturb this synchronous start of computation. In an L × K SOM-NoC architecture, for D-element input vectors, where each input vector element is sent as a flit, the total number of messages and flits sent to supply all neurons with input vectors is equal to L × K and (L × K) × (D + 1) (the plus one is for the header flit) respectively. In addition, during the competition phase, where the data exchange is done in the systolic manner (see Section 2.4), each neuron (i, j) (except the ones at the (i, K − 1) and (L − 1, j) positions) sends 2 messages of 2 flits (a header flit plus one flit containing the local winner neuron ID with its distance) to its closest neighbours. Therefore, the total number of messages and flits during the competition phase is equal to N_{m,c} = (L − 1) × (K − 1) × 2 + L + K − 2 and N_{f,c} = 2 × N_{m,c} respectively. Finally, in the diffusion phase, the neuron (L − 1, K − 1), which is the first informed about the identity of the global winner neuron, diffuses it to all neurons belonging to the column L − 1. The column L − 1 neurons, in their turn, diffuse the global winner identity to all neurons by row. The total number of sent messages and flits in this phase amounts to N_{m,d} = (K − 1) + (L − 1) × K and N_{f,d} = 2 × N_{m,d} respectively. These results are summarized in Table 4.

Table 4: Total number of messages and flits sent per iteration as a function of the map size L × K and the input vector dimension D for different SOM operations

Step                     Number of messages                            Number of flits
Input vector supply      N_{m,i} = L × K                               N_{f,i} = (D + 1) × N_{m,i}
Global winner search     N_{m,c} = (L − 1) × (K − 1) × 2 + L + K − 2   N_{f,c} = 2 × N_{m,c}
Winner ID broadcasting   N_{m,d} = (K − 1) + (L − 1) × K               N_{f,d} = 2 × N_{m,d}
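These message and flit counts can be checked with a short helper based on the formulas above (illustrative only; the totals simply sum the three traffic phases).

```python
def som_traffic(L, K, D):
    """Messages and flits per iteration, following Table 4."""
    n_m_i = L * K                               # input vector supply
    n_f_i = (D + 1) * n_m_i                     # D payload flits + 1 header
    n_m_c = (L - 1) * (K - 1) * 2 + L + K - 2   # competition (winner search)
    n_f_c = 2 * n_m_c                           # 2-flit messages
    n_m_d = (K - 1) + (L - 1) * K               # winner ID broadcasting
    n_f_d = 2 * n_m_d
    return {"messages": n_m_i + n_m_c + n_m_d,
            "flits": n_f_i + n_f_c + n_f_d}
```

For a 2 × 2 map with D = 3, this gives 11 messages and 30 flits per iteration.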
From the presented results, it can be concluded that the traffic patterns used in the presented SOM-NoC architecture per iteration are straightforward, application independent and not prone to congestion: input vectors are sent to all neurons by columns by the GCM module; each neuron sends a local winner ID and the corresponding distance to its closest neighbours; and the neuron (L − 1, K − 1) initiates the global winner diffusion phase by sending the global winner ID to the neurons (L − 1, j) (where 0 ≤ j < K − 1), which send it to all neurons belonging to the same row. All these traffic patterns occur sequentially and not at the same time, thus avoiding the congestion situations which may occur in a NoC with high traffic loads. Therefore, adaptive routing schemes such as the ones presented in [20–22] or reconfigurable NoC approaches [23, 24], which are often used to offload congested NoC routers and to balance the overall traffic over the network by using adaptive routing policies or by changing the NoC architecture respectively, would not help much to improve the overall performances of the SOM-NoC architecture, due to these deterministic traffic patterns. However, network coding communication protocols allowing the same data to be sent to many processing nodes, so-called multicast communication, such as the approaches presented in [25, 26], may be a way of improving the existing SOM-NoC approach, which uses a multiple unicast approach. In fact, the traffic patterns used in the proposed architecture are suited to multicast communication in the input vector supply and diffusion phases: the same input vector must be delivered to all neurons from the GCM module, and the global winner identity must be known by all neurons starting from the neuron (L − 1, K − 1). Another way of improving the presented SOM-NoC architecture, directly derived from the presented results, is at the router micro-architectural level. Indeed, the time distribution results presented in Table 3 point out explicitly the latency needed to exchange data between two neighbouring neurons. A low-latency router should be preferred, or even some hybrid neuron-router approach may come as a solution for performance improvements. From the presented discussion, it can be concluded that several improvements of the proposed architecture are possible and should be done in the future on this basis, keeping in mind that the main objective of the proposed study is to show how the SOM operation can be made scalable, and not to propose the best performing SOM architecture.

4.2. Validation on image compression application
The scalability and adaptability of the proposed SOM-NoC architecture are also tested and validated in an image compression application. The use of hardware SOMs in image compression applications has already been reported in the literature [1–6]. Image compression using SOMs goes through two phases: the colour quantization, which results in a colour palette comprising the representative colours of the images to compress; and the generation of the compressed binary data by using the obtained colour palette. In the colour quantization phase, an L × K SOM network is used as a colour quantizer, whose neurons have weights of the same size as the pixels used to train the network (3 elements, corresponding to an RGB pixel). At the end of the training phase of the SOM quantizer, the randomly initialized SOM weights will have converged to the most representative colours of the images used to train the network. Thus, the total number of colours of an image is reduced to the size of the SOM network used for quantization purposes (here L × K). Moreover, the colour palette is obtained by taking the weights of the SOM neurons at the end of the training phase and is often called a codebook. In the second phase, the obtained colour palette is used to compress the image: instead of using the true colour code to code a pixel, the position of the neuron having the weights (colour) closest to the colour of the observed pixel is used, thus reducing the pixel size. A P × Q original image can also be divided into blocks of M × M pixels. Hence, the total number of obtained blocks is:

N_blk = (P × Q) / (M × M)    (13)
with P × Q representing the image resolution. The binary size of the compressed image S_c, the compression ratio CR and the space savings SS are obtained respectively with:

S_c = N_blk × (⌈log2(L)⌉ + ⌈log2(K)⌉)    (14)

CR = (R × P × Q) / S_c    (15)

SS = 1 − S_c / (R × P × Q) = 1 − 1/CR    (16)

where R is the number of bits used to code a pixel.

Figure 15: The timeline of the used configurations for image compression
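Equations 13 to 16 can be evaluated with the following helper. The default R = 24 (24-bit RGB pixels) is an assumption made for illustration; R is simply the number of bits per pixel of the original image.

```python
import math

def compression_metrics(P, Q, M, L, K, R=24):
    """Block count, compressed size (bits), compression ratio and space
    savings (Equations 13-16) for an L x K SOM colour palette."""
    n_blk = (P * Q) // (M * M)                       # Eq. 13
    s_c = n_blk * (math.ceil(math.log2(L)) +
                   math.ceil(math.log2(K)))          # Eq. 14
    cr = (R * P * Q) / s_c                           # Eq. 15
    ss = 1 - s_c / (R * P * Q)                       # Eq. 16 (= 1 - 1/CR)
    return s_c, cr, ss
```

For a 128 × 128 image with 1 × 1 blocks and a 10 × 10 SOM, this gives S_c = 131072 bits, CR = 3 and SS ≈ 0.67, since each pixel is replaced by a ⌈log2(10)⌉ + ⌈log2(10)⌉ = 8-bit neuron index.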
In the proposed validation, three different configurations were simulated: the configuration C1, representing a 7 × 7 SOM network using input vectors of 1 × 1 pixels (3 elements); C2, a 10 × 10 SOM network using input vectors of 1 × 1 pixels (3 elements); and C3, a 10 × 10 SOM network using input vectors of 2 × 2 pixels (12 elements). The timeline of the used configurations with different network and input vector sizes is presented in Figure 15. At the startup time T1, the image compression system receives the C1 configuration data. These data configure the system for image compression with 49 colours and input vectors of 3 elements corresponding to one RGB pixel. The time needed to configure the whole system is Tconfig and is equal to the time needed for the Global Configuration Module to send all relevant data to all neurons. In the proposed architecture, Tconfig is 35 clock cycles. At T1 + Tconfig, the system is ready to compress all input images in accordance with the selected parameters. The image compression system remains in the C1 configuration during the time needed to extract the most relevant colours from the input images (the learning phase), to reduce the overall size of the input images based on this colour extraction (the compression phase), and to reconstruct the compressed images. This time is image-size dependent and was determined through simulation. Each input image is accessed twice: a first time during the learning phase, where pixels are chosen randomly, and a second time during the compression phase, where all pixels are scanned one by one. At T2, the image compression system receives the C2 configuration data, where the size of the network (and thus the number of colours) is increased while keeping the same dimension of the input vectors. It should be mentioned that, at T2, new router/neuron pairs are added to the initial SOM to form the final 10 × 10 map. At T2 + Tconfig, the system is again operational and ready to process input images. The same scenario is repeated with the C3 configuration data, where at T3 + Tconfig the network size (and the number of colours) is kept unchanged while the size of the input vectors is increased to 12, corresponding to blocks of 4 RGB pixels (2 × 2). The three presented configurations were applied to several images (Lenna, Airplane, Pepper and Parrot) of different resolutions (from 128×128 to 640×480) and different types (greyscale and RGB), in the order presented by the timeline in Figure 15. The obtained results are shown in a 3×3 matrix form in Figures 16 and 17. Each line of the matrix presented for each image in Figures 16 and 17 represents one of the three tested configurations. Moreover, for each configuration, for information purposes only, three additional metrics are presented: the compression ratio (see Equation 15), the Mean Square Error (MSE) and the Peak Signal to Noise Ratio (PSNR). For each tested image (shown in the top left corner of each quadrant), the first column of the matrix presents the obtained colour palettes or codebooks for the different configurations. In addition, the second and third columns present the reconstructed images and the dissimilarities between the original and reconstructed images respectively. The codebooks (colour palettes) and the reconstructed images are generated by the image compression system, whereas the dissimilarities and the calculation of MSE and PSNR are done offline with Matlab.
5. Conclusion

In this work, a scalable and adaptable hardware implementation of a SOM network is presented. The scalability of the SOM operation is obtained by using the Network-on-Chip communication approach and by distributing the global winner search operation in a systolic manner all over the network. Indeed, the global winner search operation is dispatched to the local searching units belonging to the neurons, thus making the neuron connections more relaxed and easier to configure. Consequently, the proposed architecture allows the SOM operation to be extended dynamically from a smaller to a larger map only by (re-)configuring the parameters of each neuron. On the other hand, the gained scalability is not without cost, and can be to the detriment of the overall performances. Indeed, the performance bottleneck of the proposed architecture lies in the data exchange through the NoC, which can be very time consuming, especially for large SOM networks. A solution to this problem may be, instead of using common NoC approaches, to design SOM-specific NoCs, which will take into account all specificities of the SOM operation and thus allow these time consuming tasks to be reduced.

References
[1] H. Hikawa, K. Doumoto, S. Miyoshi, Y. Maeda, Image compression with hardware self-organizing map, in: Neural Networks (IJCNN), The 2010 International Joint Conference on, 2010, pp. 1–8.
[2] H. Hikawa, Y. Maeda, Improved learning performance of hardware self-organizing map using a novel neighborhood function, IEEE Transactions on Neural Networks and Learning Systems 26 (2015) 2861–2873.
[3] W. Kurdthongmee, A low latency minimum distance searching unit of the SOM based hardware quantizer, Microprocessors and Microsystems 39 (2015) 135–143.
[4] W. Kurdthongmee, A hardware centric algorithm for the best matching unit searching stage of the SOM-based quantizer and its FPGA implementation, Journal of Real-Time Image Processing (2013) 1–10.
[5] W. Kurdthongmee, Utilization of a fast MSE calculation approach to improve the image quality and accelerate the operation of a hardware K-SOM quantizer, Microprocessors and Microsystems 34 (2010) 174–181.
[6] A. Ramirez-Agundis, R. Gadea-Girones, R. Colom-Palero, A hardware design of a massive-parallel, modular NN-based vector quantizer for real-time video coding, Microprocessors and Microsystems 32 (2008) 33–44.
[7] M. Abadi, S. Jovanovic, K. Ben Khalifa, S. Weber, M. H. Bedoui, A scalable flexible SOM NoC-based hardware architecture, in: Advances in Self-Organizing Maps and Learning Vector Quantization: Proceedings of the 11th International Workshop WSOM 2016, Houston, Texas, USA, January 6-8, 2016, 2016, pp. 165–175.
[8] D. Wiklund, D. Liu, SoCBUS: switched network on chip for hard real time embedded systems, in: Parallel and Distributed Processing Symposium, 2003. Proceedings. International, IEEE, 2003.
[9] P. P. Pande, C. Grecu, A. Ivanov, R. Saleh, Design of a switch for network on chip applications, in: Circuits and Systems, ISCAS '03. Proceedings of the 2003 International Symposium on, Vol. 5, IEEE, 2003, pp. 217–220.
[10] T. Kohonen, Self-Organizing Maps, third edition, Vol. 29, Springer, 2001.
[11] I. Manolakos, E. Logaras, High throughput systolic SOM IP core for FPGAs, in: Acoustics, Speech and Signal Processing, ICASSP 2007, 2007, pp. 61–64.
[12] T. Talaska, M. Kolasa, R. Dlugosz, W. Pedrycz, Analog programmable distance calculation circuit for winner takes all neural network realized in the CMOS technology, IEEE Transactions on Neural Networks and Learning Systems 27 (3) (2016) 661–673. doi:10.1109/TNNLS.2015.2434847.
[13] M. Porrmann, M. Franzmeier, H. Kalte, U. Witkowski, U. Rückert, A reconfigurable SOM hardware accelerator, in: 10th European Symposium on Artificial Neural Networks, 2002, pp. 337–342.
[14] S. Rüping, M. Porrmann, U. Rückert, SOM accelerator system, Neurocomputing 21 (1) (1998) 31–50.
[15] M. Porrmann, U. Witkowski, U. Rückert, A massively parallel architecture for self-organizing feature maps, IEEE Transactions on Neural Networks 14 (5) (2003) 1110–1121.
[16] N. Lightowler, C. Spracklen, A. Allen, A modular approach to implementation of the self-organising map, in: Proceedings of WSOM'97, 1997, pp. 130–135.
[17] H. Tamukoh, T. Aso, K. Horio, T. Yamakawa, Self-organizing map hardware accelerator system and its application to realtime image enlargement, in: Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, Vol. 4, IEEE, 2004, pp. 2683–2687.
[18] J. Pena, M. Vanegas, A. Valencia, Digital hardware architectures of Kohonen's self organizing feature maps with exponential neighboring function, in: Reconfigurable Computing and FPGA's, 2006. ReConFig 2006. IEEE International Conference on, IEEE, 2006, pp. 1–8.
[19] H. Hikawa, K. Kaida, Novel FPGA implementation of hand sign recognition system with SOM-Hebb classifier, IEEE Transactions on Circuits and Systems for Video Technology 25 (1) (2015) 153–166. doi:10.1109/TCSVT.2014.2335831.
[20] Z. Qian, P. Bogdan, G. Wei, C.-Y. Tsui, R. Marculescu, A traffic-aware adaptive routing algorithm on a highly reconfigurable network-on-chip architecture, in: Proceedings of the Eighth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '12, ACM, New York, NY, USA, 2012, pp. 161–170.
[21] S. J. Hollis, C. Jackson, P. Bogdan, R. Marculescu, Exploiting emergence in on-chip interconnects, IEEE Transactions on Computers 63 (3) (2014) 570–582.
[22] C. Jackson, S. J. Hollis, A deadlock-free routing algorithm for dynamically reconfigurable networks-on-chip, Microprocessors and Microsystems 35 (2) (2011) 139–151, special issue on Network-on-Chip Architectures and Design Methodologies.
[23] D. Cozzi, C. Farè, A. Meroni, V. Rana, M. D. Santambrogio, D. Sciuto, Reconfigurable NoC design flow for multiple applications run-time mapping on FPGA devices, in: Proceedings of the 19th ACM Great Lakes Symposium on VLSI, GLSVLSI '09, ACM, New York, NY, USA, 2009, pp. 421–424.
[24] I. Beretta, V. Rana, D. Atienza, M. D. Santambrogio, D. Sciuto, Run-time mapping for dynamically-added applications in reconfigurable embedded systems, in: 2009 International Conference on Microelectronics - ICM, 2009, pp. 157–160.
[25] Y. Xue, P. Bogdan, User cooperation network coding approach for NoC performance improvement, in: Proceedings of the 9th International Symposium on Networks-on-Chip, NOCS '15, ACM, New York, NY, USA, 2015, pp. 17:1–17:8.
[26] S. Yan, B. Lin, Custom networks-on-chip architectures with multicast routing, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (3) (2009) 342–355.
Figure 16: Obtained quantization results for different configurations on 128×128 images: Lenna and Airplane
[Figure 17 about here. Each panel shows the Original Image, the SOM Pallet, the Reconstructed Image and the Dissimilarity map for three configurations (SOM 7×7 with 1×1 block size, SOM 10×10 with 1×1 block size, SOM 10×10 with 2×2 block size) at compression ratios CR = 66%, 75% and 91%, together with the corresponding PSNR and MSE values.]

Figure 17: Obtained quantization results for different configurations on 128×128 images: Pepper and Parrot
Authors' biography

Mehdi Abadi received the Engineering degree in real-time computing and the master's degree in embedded systems from the University of Sousse, Tunisia, in 2011 and 2013, respectively. He is currently pursuing the Ph.D. degree as an exchange scholar at the National Engineering School of Sousse, University of Sousse, Tunisia, and the University of Lorraine, Nancy, France. He is a member of the Jean Lamour Institute (UMR 7198), University of Lorraine, Nancy, France, and of the Technologie et Imagerie Médicale Laboratory at the Faculty of Medicine, University of Monastir, Tunisia. His main research interests include neural network architectures, reconfigurable and adaptable embedded systems, and real-time signal processing.
Slavisa Jovanovic received the B.S. degree in electrical engineering from the University of Belgrade, Serbia, in 2004, and the M.S. and Ph.D. degrees in electrical engineering from the University of Lorraine, France, in 2006 and 2009, respectively. From 2009 to 2012, he was with the Diagnosis and Interventional Adaptive Imaging laboratory (IADI), Nancy, France, as a research engineer working on MRI-compatible sensing embedded systems. He then joined the Faculty of Sciences and Technologies and the Jean Lamour Institute (UMR 7198), University of Lorraine, Nancy, where he is currently an assistant professor. His main research interests include reconfigurable Networks-on-Chip, energy harvesting circuits, neuromorphic architectures, and algorithm-architecture matching for real-time signal processing. He is the author or co-author of more than 50 papers in conference proceedings and international peer-reviewed journals, and he holds one patent.
Khaled Ben Khalifa received the M.Sc. degree in Physics (Microelectronics), the DEA in Matériaux et Dispositifs pour l'Électronique, and the Ph.D. degree in Physics-Electronics from the University of Monastir, Tunisia, in 1999, 2001 and 2006, respectively. He is currently an assistant professor of electrical engineering at the High Institute of Applied Sciences and Technology, University of Sousse, Tunisia, and a senior researcher at the Laboratory of Technology and Medical Imaging (LR12ES06), Faculty of Medicine, University of Monastir, Tunisia. His research interests include real-time embedded systems, FPGA-based systems, systems-on-chip, neural networks, and heterogeneous multiprocessor architectures.

Serge Weber was born in 1961. He received the M.S. degree in electrical, electronic and control engineering in 1983 and the Ph.D. degree in electronics in 1986, both from the Henri Poincaré University of Nancy, France. In 1988, he joined the Electronics Laboratory of Nancy (LIEN) as an Associate Professor. Since September 1997, he has been a Professor and the manager of the Electronic Architecture group at LIEN (Henri Poincaré University). His research interests focus on reconfigurable and parallel architectures for image and signal processing and for intelligent sensors. From 2006 to 2013, he was the director of the Electronics Laboratory of Nancy (LIEN); in 2013, he joined the Jean Lamour Institute. He has co-authored more than 100 papers in peer-reviewed international journals and conferences and holds two patents.
Mohammed Hédi Bedoui received the Ph.D. degree in biomedical engineering from Lille University, Villeneuve-d'Ascq, France, in 1992. He is currently a Professor of Biophysics at the Faculty of Medicine, University of Monastir, Monastir, Tunisia, and the Director of the Technologie et Imagerie Médicale Laboratory at the Faculty of Medicine. He has published many papers in international journals. His current research interests include biophysics, medical image processing, embedded systems, and HW/SW codesign.