A Scalable and Adaptable Hardware NoC-Based Self Organizing Map

Mehdi Abadi (a,b,1), Slavisa Jovanovic (a,*), Khaled Ben Khalifa (b), Serge Weber (a), Mohammed Hédi Bedoui (b)

(a) UMR 7198, Institut Jean Lamour, Université de Lorraine, Nancy, France
(b) Laboratoire de Technologie et Imagerie Médicale, Université de Monastir, Monastir, Tunisia

Accepted manuscript. To appear in: Microprocessors and Microsystems. PII: S0141-9331(17)30191-6. DOI: 10.1016/j.micpro.2017.12.007. Reference: MICPRO 2646. Received 31 March 2017; revised 26 October 2017; accepted 14 December 2017.

Please cite this article as: Mehdi Abadi, Slavisa Jovanovic, Khaled Ben Khalifa, Serge Weber, Mohammed Hédi Bedoui, A Scalable and Adaptable Hardware NoC-Based Self Organizing Map, Microprocessors and Microsystems (2017), doi: 10.1016/j.micpro.2017.12.007

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Abstract

Due to their ability to reduce the size of high-dimensional input data, self-organizing maps (SOMs) can be employed as data quantizers. The widely used software implementations of SOMs enjoy flexibility and adaptability, usually to the detriment of performance, which limits their use in real-time applications. On the contrary, the hardware counterparts of SOMs exploit the inherent parallelism of hardware to boost the overall performance, but generally lack adaptability without considerable design effort. To benefit from both the flexibility of software and the performance of hardware SOM implementations, unconventional design approaches of SOMs should be used. In this work, a scalable and adaptable hardware implementation of a SOM network is presented. The proposed architecture allows the SOM operation to be dynamically extended from a smaller to a larger map only by (re-)configuring the parameters of each neuron. The gained scalability is obtained by decoupling the computation layer, composed of neurons, from the communication one, used to provide data exchange mechanisms between neurons. The proposed SOM architecture is also validated through simulation on variable-sized SOM networks applied to image compression.

Keywords: Self-Organizing Map, Network-on-chip, FPGA, Image compression

✩ This work was supported by the PHC-UTIQUE 17G1423 Research program.
∗ Corresponding author. Email addresses: [email protected] (Mehdi Abadi), [email protected] (Slavisa Jovanovic), [email protected] (Serge Weber), [email protected] (Mohammed Hédi Bedoui)
1 École Nationale d'Ingénieur de Sousse, Université de Sousse, Sousse, Tunisia

Preprint submitted to Microprocessors and Microsystems, December 15, 2017


1. Introduction

A Self-Organizing Map (SOM) is an unsupervised learning neural network that finds its use in many applications. High-dimensional data reduction and classification are commonly done with SOMs, thereby facilitating the interpretation and processing of such data. Several SOM implementations have already been proposed in the literature [1–6]. The software (SW) implementations are the most common and bring more flexibility, whereas the hardware (HW) implementations exploit the inherent parallelism of SOMs and may be preferred in real-time applications characterized by tight temporal constraints.

The state-of-the-art hardware SOM implementations are application-specific and have parameters such as the input and output layer size, timing constraints and memory requirements fixed in the design phase. Thus, the obtained hardware SOM implementation fits perfectly the needs of the specific application, but is hardly adaptable to other applications without considerable design effort. The main reason for this lack of flexibility of hardware SOMs lies in the way that their processing units, often called neurons, are connected and exchange data. The point-to-point links between neurons bring fast connections, allowing them to exchange and compare the computed data often within a cycle. However, the complexity of these fully connected SOMs, which grows quadratically with the number of neurons, makes this type of connection impractical for large SOM networks. Scalability of hardware SOMs can be achieved by decoupling the computation layer, composed only of neurons, from the communication one, providing the data exchange mechanisms. In addition, more flexibility can also be gained by modifying the computation layer, by making the neurons customizable and configurable at runtime, during the normal SOM operation. Many real-time processing applications can fully benefit from such scalable and flexible hardware SOMs. Indeed, fast context switching from one task to another during the normal operation of a system can bring more flexibility to the latter and may enlarge its basic functionalities and fields of application. In this paper, a scalable and adaptable hardware SOM implementation is proposed.

This paper is organized as follows: Section 2 presents the theoretical background of this work and the state-of-the-art works in the domain of hardware SOM implementations. The proposed hardware architecture is detailed in Section 3. Section 4 presents the obtained results, whereas conclusions and perspectives are drawn in Section 5.

2. Background

2.1. Self-Organizing Map (SOM)

A Self-Organizing Map can be represented as a two-dimensional arrangement of L × K neurons. Each neuron has a weight vector $\vec{m}$ of dimension D, where D is the size of an input vector $\vec{X}$:

$$\vec{X} = \{\xi_1, \xi_2, \ldots, \xi_D\} \quad (1)$$


The SOM operation requires two phases: learning and recall. During the learning phase, the map generates its outputs by changing the weights of its neurons as a function of the input vectors used for training. After the learning phase, the trained map can be used for decision purposes in the recall phase, where each input vector is assigned to a neuron or a group of neurons in the map, often called the winner neuron or best matching unit (BMU). Each neuron calculates the distance between its weights $\vec{m}_{l,k}$ ($0 \le l \le L-1$, $0 \le k \le K-1$) and the input vector $\vec{X}$. In general, the calculated distance is the Euclidean distance (L2), as given by:

$$D_{L2} = \left\|\vec{X} - \vec{m}\right\| = \sqrt{\sum_{k=1}^{D} (\xi_k - \mu_k)^2} \quad (2)$$

Therefore, the winner neuron, which has the weight vector $\vec{m}_c$ closest to the input vector $\vec{X}$, is identified. This phase is called competition and is expressed as follows:

$$c = \operatorname*{argmin}_{l,k} \left\|\vec{X} - \vec{m}_{l,k}\right\| \quad (3)$$

The recall and learning phases both carry out the calculation of the Euclidean distance and the competition phase to find the winner neuron. Moreover, during the learning phase, the weights of the winning neuron and of its closest neighbours are updated. This phase is called adaptation and is expressed as follows:

$$\vec{m}(t) = \vec{m}(t-1) + h_{c,l,k}(t)\left[\vec{X}(t) - \vec{m}(t-1)\right] \quad (4)$$

where $h_{c,l,k}(t)$ is the neighbourhood function used to define the degree of learning of a neuron, which is higher in the vicinity of the winner neuron; moreover, it depends on the position of the neuron with respect to the winner's one and on the epoch number representing the number of learning iterations. The neighbourhood function is defined by the following equation:

$$h_{c,l,k}(t) = \alpha(t) \times \exp\left(-\frac{\left\|\vec{r}_c - \vec{r}_{l,k}\right\|^2}{2\sigma^2(t)}\right) \quad (5)$$

with $\alpha(t)$ the learning rate, $\sigma(t)$ the neighbourhood rate, $\vec{r}_c$ the position of the winning neuron, and $\vec{r}_{l,k}$ the position of the neuron with index (l, k).
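To make Equations (2)-(5) concrete, the following is a minimal NumPy sketch of one SOM learning iteration (competition followed by adaptation); the function name and array layout are illustrative assumptions, not part of the original design:

    import numpy as np

    def som_learning_step(weights, x, alpha, sigma):
        # weights: (L, K, D) array of neuron weight vectors; x: (D,) input vector
        L, K, _ = weights.shape
        # Competition: Euclidean distances (Eq. 2) and winner c (Eq. 3)
        dists = np.linalg.norm(weights - x, axis=2)
        c = np.unravel_index(np.argmin(dists), (L, K))
        # Neighbourhood h_{c,l,k} (Eq. 5) over the grid positions r_{l,k}
        rows, cols = np.indices((L, K))
        sq_dist_to_c = (rows - c[0]) ** 2 + (cols - c[1]) ** 2
        h = alpha * np.exp(-sq_dist_to_c / (2.0 * sigma ** 2))
        # Adaptation: weight update (Eq. 4)
        weights += h[..., None] * (x - weights)
        return c

In a full training loop, alpha(t) and sigma(t) would decay with the epoch number, shrinking the neighbourhood as learning progresses.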

2.2. Literature survey

Due to their inherent parallelism and feature extraction property, SOMs have naturally found their place as vector quantizers. The use of hardware SOMs as vector quantizers has already been reported in many works [1–6].

Hikawa et al. in [2] proposed a massively parallel hardware SOM implementation adapted for vector quantization applications using a novel neighbouring function. The proposed "hardware friendly" neighbouring function exploits the inherent hardware parallelism, making it possible to improve the overall vector quantization performance. The presented architecture is massively parallel and high-performance: both the learning and recall phases are done within a clock cycle. The performance of the proposed architecture was tested in a colour quantization experiment, where 128×128 images were used as inputs and the

number of neurons varied from 8×8 to 32×32. In the proposed architecture, the winner search operation is done within a clock cycle, whose period is closely related to the size of the SOM, essentially due to the high number of point-to-point links. The proposed architecture achieves the best reported performance in the literature, but is inflexible and lacks scalability.


Figure 1: Illustration of the scalability of common hardware SOM architectures

Ramirez-Agundis et al. in [6] also proposed a massively parallel hardware SOM implementation for vector quantization. 16-element input vectors were used for all experiments on different SOM networks comprising from 16 to 256 neurons. The overall time needed for the learning and recall phases for all map sizes is in the range of 41 to 45 and 22 to 26 working clock cycles respectively. The relatively constant learning and recall time regardless of the map size is explained by the massively parallel hardware implementation. The proposed architecture meets real-time video coding timing constraints for greyscale and colour images up to 640×480, but is poorly flexible and not scalable.

In [5], Kurdthongmee proposed an approach to accelerate the learning phase of a hardware SOM quantizer by evaluating the mean square error of the quantization process and comparing it with a threshold fixed in advance. A 16×16

map was used for all experiments, carried out on images with sizes varying from 32×32 to 512×512 pixels. The proposed approach was validated on a Xilinx Virtex-2 FPGA, providing real-time performance for image sizes up to 640×480 pixels. The same author proposed in [4] a hardware SOM quantizer using a fast best matching unit (BMU). The BMU has the unique goal of finding, for a

whose weights are the closest to the input vector. The proposed approach allows

to find the winner neuron within 4 clock cycles. This outstanding result is to the detriment of the necessary hardware resources for the algorithm implementa-

105

AN US

tion, which necessitates multi-port memory blocks. The proposed approach was validated using a Xilinx Virtex-4 FPGA in the case of a 16×16 map processing

up to 512×512 images at the maximal working frequency of 19.6 MHz. Both reported architectures in [5] and [4] are high-performance, meet real-time video processing constraints, but are also inflexible and lack scalability. Kurdthongmee recently proposed in [3] a similar approach to the one proposed in [4], focusing on the winner search operation. This operation is per-

M

110

formed using 2K 1-bit memory blocks, where K is the value of the maximal

ED

distance that can be encountered in a SOM network. The 1-bit word is used to indicate the state of the address: 1 means that the corresponding address, and the distance were already found in a SOM, 0 otherwise. The main advantage of the proposed winner neuron search scheme is the total time which is

PT

115

within a clock cycle. The proposed approach was tested on a 16×16 SOM using

CE

images with a 512×512 resolution, achieving the real-time performances on a Xilinx Virtex-4 FPGA. As other reported architectures, the scalability cannot be achieved without considerable design efforts.

AC

120

Recently, Abadi et al. in [7] proposed a flexible and scalable hardware SOM

architecture, without particular application in mind. The flexibility and scalability are provided to a hardware SOM by the means of a Network on a Chip (NoC) used for communication purposes. Therefore, a small hardware SOM can be easily extended to a larger one without decreasing the overall operat-

125

ing frequency, as it is often the case in the conventional hardware SOM ar6

ACCEPTED MANUSCRIPT

chitectures essentially due to the important number of neurons connected in a point-to-point manner. Moreover, this scalability is to the detriment of the winner search speed, which is greatly influenced by the communication latency

130

CR IP T

of the used communication approach. The proposed architecture was tested on a Xilinx Virtex-6 FPGA showing an estimated operating frequency of 200 MHz.

However, the scalability and the flexibility of the presented approach have not been demonstrated neither formally nor through an experiment. IN

R(1,2)

R(1,K)

IN

WEST

EAST CROSSBAR

OUT

PE

PE

E_d_in E_req_in E_ack_in

IN

AN US

R(1,1)

OUT

NORTH

to logic

OUT

E_d_out E_req_out E_ack_out

to logic

PE

LOCAL

R(2,1)

R(2,2)

R(2,K)

IN

PE

PE

PE

ROUTING/ OUTPUT / CONTROL LOGIC

SOUTH

OUT

Header « 01 »

Body « 00 »

Type

PE

R(L,K)

PE

2 bits Type

data

8 bits dest Address

Tail « 11 »

8bits data

Source address

N bits

ED

PE

R(L,2)

M

R(L,1)

body

Figure 2: (a) Structure of a 2D Mesh NoC. (b) NoC router architecture. (c) Packet structure

PT

From the proposed literature survey, it can be concluded that the stateof-the-art HW SOM implementations propose to accelerate either the winner search operation or the overall SOM computation by using massively the in-

CE

135

herent parallelism of hardware. The winner search is often the most critical operation of the SOM and the considerable attention that has been drawn

AC

to this aspect is fully justified. However, as illustrated in Figure 1, the reported SOM architectures in the literature cannot be used as a basis to build

140

up larger SOM networks without considerable design efforts. They are mainly application-specific, high-performance and poorly flexible. Moreover, the input vector data sending is often overlooked in the state-of-the-art architectures: it is often assumed that input vectors are available at the same time at the inputs of 7

ACCEPTED MANUSCRIPT

all neurons for distance calculation without detailing the way of their delivery. 145

Hence, point-to-point links can be and are often used to synchronously deliver these input vectors to neurons but do not present a viable solution for scalable

CR IP T

SOM architectures. Consequently, to the best of our knowledge, the scalability and adaptability of hardware SOMs, except in a general manner in the work [7], have never been addressed in SOM architectures.

The main contributions of this work are the following:

150

• the classical SOM operation is formally described at the algorithmic level and decomposed to form a scalable and easily configurable SOM algorithm

AN US

depending only on the map dimensions,

• an architecture based on the use of a Network-on-a-chip approach is proposed to implement the scalable SOM operation, formally described in the

155

first phase and,

M

• the proposed scalable SOM architecture is validated through simulation on an image compression application.

ED

2.3. Network on Chip (NoC)

A Network-on-a-chip (NoC) is presented as an alternative communication

160

approach to the commonly used shared bus, allowing several integrated pro-

PT

cessing units on a single chip to communicate [8]. Systems with hundreds of processors are not uncommon, and traditional interconnects such as shared bus

CE

struggle to meet the required performance. Interconnects designed for dozens of 165

components cannot easily scale to support hundreds or even more components required by systems today. With NoC interconnect, which does not lack scal-

AC

ability as the traditional shared bus, small networks can easily be grouped in larger ones via pipeline stages, bridges or other as required. Therefore, the NoC interconnect could easily support thousands of processing nodes, and could even

170

provide a transport network spanning multiple chips. Additionally, the NoC interconnect is characterized with an explicit parallelism, a high bandwidth and

8

ACCEPTED MANUSCRIPT

a high degree of modularity, which makes it very suitable for distributed architectures [9]. The structure of a 2D mesh NoC is shown in Figure 2(a). It is composed of processing elements (PEs) and routers. Each router is associated to one or

CR IP T

175

more PEs via a network interface (NI), whose primary function is to pack (before

sending) and unpack (after receiving) data exchanged between PEs. Figure 2(c)

illustrates the structure of packets circulating in the network using the wormhole switching technique. Each packet is composed of flits (the smallest indivisible 180

entities): the header flit, which opens the communication and ”shows” the route

AN US

to other flits; the body flit, which contains the data to be transmitted between PEs; and the tail flit, which closes the borrowed communication links and frees the router for other packets.

The main module of a NoC is the router (see Figure 2(b)), whose main 185

function is to forward ingoing packets to neighbouring routers until they reach the final destination. The router is composed of a crossbar which establishes

M

multiple wire connections between its input and output links according to the predefined routing algorithm and scheduling policy. The crossbar and the ar-

190

ED

bitration of packets in the router are handled with a control logic block. For instance, if multiple ingoing packets from different input links simultaneously try to reserve the same output link, the control block has to arbitrate this sit-

PT

uation and prioritize the forwarding of packets from different input links. The waiting ingoing packets must be stored in buffers until their forwarding is sched-

CE

uled. That is why each router has input and/or output buffers, whose role is 195

to temporarily accept flits before their transmission to either the local PE or to

AC

one of the neighbouring routers. 2.4. Systolic SOM architecture A way of decreasing the hardware complexity of SOMs, especially their point-

to-point links is to use the systolic approach of data exchange. The synoptic

200

scheme of a systolic SOM architecture is shown in Figure 3 [10, 11]. In this architecture, data between neurons are exchanged through levels, from one level 9


Figure 3: Systolic architecture of SOM showing an example of data propagation between systolic levels

to the next adjacent one, as shown in Figure 3. In an L × K SOM architecture, there are Nsl systolic levels, where Nsl = L + K − 1. The learning phase of the systolic SOM architecture requires two phases, as in the conventional SOM: the competition phase, during which the search for the winner neuron is performed; and the adaptation phase, where the weights of the neurons are updated according to the used neighbouring function and the relative position to the winner neuron.

During the competition phase, the minimum distance propagates from one level to another, starting from the top-left neuron located at level 1. A neuron belonging to level i first receives the comparison results of the adjacent neighbours belonging to level i − 1, then performs a comparison between the received data and its local distance, and finally sends the minimal local distance to its adjacent neighbours belonging to level i + 1. Figure 3 shows an example of data exchange between the levels of the presented systolic architecture.


It can be shown that the global minimal distance is obtained after Nsl − 1 level-propagation cycles. At the end of the competition phase, the winner neuron is identified at the bottom-right neuron located at level L + K − 1, from which the adaptation phase starts by broadcasting the identity of the winner to all neurons. It should also be noticed that all neurons belonging to the same level i can propagate their computed distances simultaneously. Therefore, the time needed to finish the competition phase depends on the SOM topology: the bigger the number of levels Nsl, the greater the time needed for the competition phase. It should also be noted that all reported systolic SOM architectures lack scalability, essentially due to the impossibility of (re-)configuring the neuron-to-neuron systolic connections at run-time without design effort.

Before presenting the hardware implementation details of the proposed SOM

230

tecture is described first.

M

architecture, the SOM operation which forms the basis for this hardware archi-

ED

3.1. Proposed SOM operation

At the heart of the SOM operation, executed on a map of L × K neurons, is the calculation of the Euclidean distance according to Equation 2. Obviously,

235

PT

each calculated distance in the SOM is positive DL2 ≥ 0. Therefore, for any two 2 2 neurons, n1 and n2 , if DL2,n1 < DL2,n2 then DL2,n . For this reason, < DL2,n 1 2

CE

2 both DL2 and DL2 lead to the same result in the process of identifying the 2 winning neuron. Moreover, the measure DL2 is often favored over the measure

DL2 , because it allows to omit the rooting operation and thus decreases the

AC

2 computational complexity of the SOM algorithm [10, 12]. The measure DL2 is

240

calculated throughout this work, and for a neuron at the position (i, j) is given by:

D



2 X



2 DL2,(i,j) = X − − m−→ (ξk − µ(i,j)k )2 i,j = k=1

where D is the size of the input vector X. 11

(6)

ACCEPTED MANUSCRIPT

The direct hardware implementation of the neighbourhood function presented by Equation 5 necessitates the use of several arithmetic operators. More245

over, this direct implementation also implies the need to use an additional mul-

CR IP T

tiplier in the adaptation phase to update the weights of neurons according to Equation 4. In order to limit the hardware requirements for its implementation, the neighbourhood equation is often simplified in hardware implementations [2, 13–18]. A widely used solution is to replace the Gaussian function from 250

Equation 4 with a restricted set of values corresponding to negative powers of

two. Thus, the arithmetic operators that are required for the direct implemen-

AN US

tation of Equations 5 and 4 in the adaptation phase are replaced with a simple shift function [2, 3, 18]. This simplification was investigated by simulation in

[16] and [14], where it was shown that the comparable results to the original Ko255

honen’s SOM algorithm were obtained. Moreover, Porrman et al. demonstrated in [14] that some applications may need some additional learning steps to obtain the results comparable with the original Kohonen’s algorithm. Therefore, for a

M

neuron at the position (i, j), its neighbourhood function can be written as [18]:

260

ED

hi,j =

1

(7)

2Si,j

where Si,j is the number of shifts determined according to the relative neuron’s position (i, j) compared to the winner’s one R(i,j) and the learning phase’s

PT

neighbourhood rate β evolving with the number of learning iterations (epoch

CE

number), as:

Si,j = Ri,j + β

(8)

With this simplification, the update of weights of a neuron at the position (i, j) given by Equation 4 becomes:

AC

265

− m−→ i,j [n + 1] =

  − m−→ i,j [n] +

→ − → ( X [n]−− m− i,j [n])

 − m−→ i,j [n],

S 2 i,j

,

Ri,j < Rv

(9)

otherwise

where Ri,j is the distance between the neuron (i, j) and the winner. If a neuron belongs to the neighbourhood radius Rv , its weights will be updated during this phase, otherwise it will not. The Rv radius is initialized with L + K − 1, where 12

ACCEPTED MANUSCRIPT

L and K are the map’s dimensions, and decreases with the progression of the 270

learning phase. The winner neuron is searched in a systolic manner, where a neuron at the

in a limited set of neighbouring neurons as follows:

  2  argmin DL2,(l,k) , i = 0, j = 0,    (l,k)∈W0     2   argmin DL2,(l,k) , i = 0, 0 < j < K, (l,k)∈W1

 2   argmin DL2,(l,k) , 0 < i < L, j = 0,    (l,k)∈W2     2  , 0 < i < L, 0 < j < K. argmin DL2,(l,k) (l,k)∈W3

where sets Wi (0 ≤ i ≤ 3) are:

= {(i, j)}, i = 0, j = 0

W1

= {(i, j), (i, j − 1)}, i = 0, 0 < j < K

W2

= {(i − 1, j), (i, j)}, 0 < i < L, j = 0

W3

= {(i − 1, j), (i, j), (i, j − 1)}, 0 < i < L, 0 < j < K.

M

W0

(11)

ED

275

(10)

AN US

ci,j =

CR IP T

position (i, j), 0 ≤ i < L, 0 ≤ j < K carries out a local winner search operation

From Equations 10 and 11, it can be noticed that the neuron at the position (0, 0) does not carry out a local winner search, because it does not have neighbouring

PT

neurons on its left and top side. The result of its local winner search operation are its own coordinates. Moreover, the neurons in the first row (j = 0) and the

CE

first column (i = 0), search a local winner among two neurons, whereas all other

280

neurons (0 < i < L, 0 < j < K) search a local winner in a set of 3 neurons. The

identity of the winner is known at the position (L − 1, K − 1). Therefore, the

AC

global winner search can be written as: 2 c = argmin DL2,(l,k) = cL−1,K−1 .

(12)

(l,k)∈W

where W = {(i, j)|0 ≤ i < L, 0 ≤ j < K} is the set of all available neurons in the L × K SOM. 13

ACCEPTED MANUSCRIPT

fsrc→dest (data) src

0 ≤ i ≤ L − 2,0 ≤ j ≤ K − 2

(i, j)

dest

data

(i + 1, j)

2 DL2,c i,j

(i, j + 1)

ci,j

CR IP T

neuron at (i, j)

0 ≤ i ≤ L − 2, j = K − 1

(i, j)

(i + 1, j)

i = K − 1, 0 ≤ j ≤ L − 2

(i, j)

(i, j + 1)

2 DL2,c i,j

ci,j

2 DL2,c i,j

ci,j

AN US

Table 1: Data exchange between neurons in the competition phase

The result of each local winner search operation at the neuron (i, j) are

285

the coordinates of the local winner ci,j which have to be propagated with the 2 corresponding squared Euclidean distance DL2,c also in a systolic manner. If i,j

we define a data transport function f(l,k)→(m,n) (d) between two neurons at the

290

M

positions (l, k) and (m, n) (0 ≤ l, m ≤ L − 1, 0 ≤ k, n ≤ K − 1) where the neuron (l, k) is source (src), the neuron (m, n) is destination (dest), and d is data to propagate, the data exchange between all neurons during the competition phase

ED

can be described as presented in Table 1. It should be noted that each neuron 2 sends to its neighbours the squared Euclidean distance DL2,c corresponding i,j

PT

to the neuron elected during the local winner search operation and its own

AC

CE

coordinates (ci,j , 0 ≤ i ≤ L − 1, 0 ≤ j ≤ K − 1). neuron at (i, j) 0≤j ≤K −2 0 ≤ i ≤ L − 2, 0≤j ≤K −1

On the other hand, during

fsrc→dest (data) src

dest

data

(L − 1, K − 1)

(L-1,j)

c

(L − 1, j)

(i, j)

c

Table 2: Data exchange between neurons in the adaptation phase

295

the adaptation phase, the identified winner position c = cL−1,K−1 is sent to all

14

ACCEPTED MANUSCRIPT

neurons. The neuron at the position (L − 1, K − 1) initiates this phase, because the global winner search ends at this position. First, it sends the global winner identity to all neurons located at the column L − 1. Thereafter, these neurons (located at the column L − 1) send the global winner identity c to all other

CR IP T

300

neurons by row.

If for a neuron at the position (i, j) we also define a data transport configuration function g(i, j), whose main role is to configure the destination neurons

of the data transport function f according to Tables 1 and 2, the call of this 305

function by the neuron (i, j) will specifically configure it to sent data in the sys-

AN US

tolic manner described earlier. Moreover, to include the added neurons in the SOM operation, this function should be called every time the dimensions of the SOM change. Consequently, the proposed SOM operation initially started on a

L1 ×K1 map can be extended to a L2 ×K2 map, L1 < L2 , K1 < K2 , only by up310

dating the destination neurons of the data transport function and this by calling

M

the function g(i, j) for each neuron (i, j) of the new map 0 ≤ i < L2 , 0 ≤ j < K2 . 3.2. SOM Operation

ED

Algorithm 1: SOM Operation if new configuration then 315

L∗ ← L; K ∗ ← K; D∗ ← D;

for i ← 0; i < L∗ ; i ← i + 1 do

PT

for j ← 0; j < K ∗ ; j ← j + 1 do g(i, j);

end for;

end for;

CE

320

else

AC

if competition phase then

325

2 DL2,(0,0) ; – Equation (6) 2 f(0,0)→(1,0) (DL2,(0,0) ); – transfer function (0,0)→(1,0) 2 f(0,0)→(0,1) (DL2,(0,0) ); – transfer function (0,0)→(0,1)

for i ← 1; i < L∗ ; i ← i + 1 do

for j ← 1; j < K ∗ ; j ← j + 1 do – done in parallel in all neurons 2 DL2,(i,j) ;– Equation (6)

15

ACCEPTED MANUSCRIPT

c(i,j) ;– Equation (10)

330

if i = L∗ − 1 and j = K ∗ − 1 then no operation else if j = K ∗ − 1 then

CR IP T

2 f(i,j)→(i+1,j) (DL2,c ); i,j

f(i,j)→(i+1,j) (ci,j );

335

else if i = L∗ − 1 then

2 f(i,j)→(i+1,j) (DL2,c ); i,j

f(i,j)→(i+1,j) (ci,j ); else 2 f(i,j)→(i+1,j) (DL2,c ); i,j

340

2 f(i,j)→(i,j+1) (DL2,c ); i,j

f(i,j)→(i,j+1) (ci,j ); end if end for;

345

end for;

AN US

f(i,j)→(i+1,j) (ci,j );

c ← c(L∗ −1,K ∗ −1) ;– The winner identified if adaptation phase then 350

M

end if ;

for j ← K ∗ − 2; j ≥ 0; j ← j − 1 do end for;

ED

f(L∗ −1,K ∗ −1)→(L∗ −1,j) (c);

for j ← K ∗ − 1; j ≥ 0; j ← j − 1 do

for i ← L∗ − 2; i ≥ 0; i ← i − 1 do f(L∗ −1,j)→(i,j) (c);

PT

355

end for;

end for;

CE

for i ← 0; i < L∗ ; i ← i + 1 do

for j ← 0; j < K ∗ ; j ← j + 1 do – done in parallel in all neurons

360

AC

S(i,j) ;– Equation (8)

365

h(i,j) ;– Equation (7)

m(i,j) ;– Equation (9) end for; end for; end if ; end if ;

16

ACCEPTED MANUSCRIPT

The proposed SOM operation is summarized with Algorithm 1 for a L∗ × K ∗ SOM. First, the dimensions of the map (L and K) and the input vector (D) 370

are configured, and thereafter the function g(i, j) is called for each neuron (i, j).

CR IP T

This operation can be considered as the initialization phase at the first run of the SOM operation, or as a new configuration for all later changes of the SOM size during the operation. Once the initialization or the configuration is done,

the SOM operation is carried out in the systolic manner as described earlier. If 375

the SOM is in the learning phase, both the competition and adaptation phases

are executed, whereas in the recall phase only the competition is considered.

AN US

The competition phase starts with the calculation of the squared Euclidean

distance by the neuron (0, 0), which is thereafter propagated to the neurons (1, 0) and (0, 1) respectively. Then, each other neuron (i, j) in its turn calculates 380

the squared Euclidean distance, compares it with the ones received from its neighbours (i − 1, j) and (i, j − 1), if they exist, sends the final result of the local winner search operation to its neighbours (i + 1, j) and (i, j + 1), if they

M

exist too. The competition phase ends at the neuron (L∗ − 1, K ∗ − 1), where the identity of the global winner, for a given input vector, is finally known. Afterwards, the adaptation phase can start. Before updating the weights of all

ED

385

neurons, the identity of the global winner must be known by all neurons in the map. The neuron (L∗ − 1, K ∗ − 1) first sends this data to all neurons located at

PT

the column L − 1. Thereafter, the neurons located at the column L − 1 forward the global winner identity (c) to all other neurons by row. Upon the reception of these data, the update of the neurons’ weights starts, first by calculating the

CE

390

shift parameter Si,j , then the neighbourhood function hi,j and finally the new

AC

weights mi,j . 3.3. Proposed SOM architecture The starting point of the proposed SOM architecture is the 2D mesh NoC topol-

395

ogy presented in Figure 2(a). Two layers can be distinguished: the communication layer consisting of NoC routers; and the processing layer composed of processing elements both arranged in a 2D manner. The NoC routers using 17

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 4: SOM-NoC neuron: each neuron is connected to other neurons through the network interface and the NoC

the wormhole switching technique (Section 2.3), are used for data transfer. On

400

M

the other hand, the processing elements called SOM neurons throughout this work, calculate the SOM operation as described in the previous section. The

ED

block diagram of a neuron is shown in Figure 4. It is composed of 5 modules: a Vector Element Processor (VEP), a Local Winner Search (LWS), an Update Signal Generator (USG), a Network Interface (NI) and a Local Configuration

405

PT

Module (LCM).

The VEP module is the unit calculating the squared Euclidean distance in the

CE

competition phase according to Equation 6, and updating the neuron’s weights in the adaptation phase according to Equations 7 and 9. Its block diagram is presented in Figure 5. It has two memory blocks, one for storing the neuron’s

AC

weights and the other one for delta values (the differences between the weights

410

µi and the input vector elements ξi ), necessary for the adaptation phase. The squared Euclidean distance is calculated in a sequential manner for each received input vector in the competition phase, as well as the update of neuron’s weights in the adaptation phase. It should be noted that the sequential manner of

18

CR IP T

ACCEPTED MANUSCRIPT

AN US

+/-

M

Figure 5: Vector Element Processor

calculation in the VEP is not mandatory for the proposed SOM operation.

ED

If the high performances are targeted, these calculations should be done in a

AC

CE

PT

415

Figure 6: Local Winner Search

19

Figure 7: Update Signal Generator

CR IP T

ACCEPTED MANUSCRIPT

massively parallel manner, as already reported in the literature [2].

The block diagram of the LWS module is shown in Figure 6. This module

AN US

carries out the local winner search according to Equation 10. With this module, up to three squared Euclidean distances could be compared. The neuron (0,0) 420

does not carry out the local winner search operation like other neurons. In this case, the LWS sets the local squared Euclidean distance and coordinates 2 at the outputs DL2,c and ci,j respectively. In the case of the neurons located i,j 2 at the first column or first row, the local distance DL2,(i,j) is compared only to

425

M

2 2 DL2,(i,j−1) or DL2,(i−1,j) respectively. The different modes of operating of the

LWS module are set up during the configuration phase performed by the LCM,

ED

described later in details.

The USG module performs the generation of the shift number S according to Equation 8. Its block diagram is presented in Figure 7. On receipt of the global

430

PT

winner neuron coordinates c, the USG calculates the relative position Ri,j of its parent neuron (i, j) to the global winner’s one. This value is then combined

CE

with the overall learning progress phase β to give the number of shifts for the neighbourhood function hi,j . The LCM unit is the module implementing the data transport configuration

AC

function gi,j introduced in the previous section. Therefore, for each new con-

435

figuration, the LCM configures, according to the map size, its parent neuron’s neighbours (i + 1, j) and (i, j + 1), to which it will send the result of the local

winner search DL2,c2i,j in the presented systolic manner during the competition phase. In addition, for the data transfers during the adaptation phase, the LCM also configures the role of broadcasting for its parent neuron as a function of the 20

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 8: Global Configuration Module

map’s size (see Table 2). For the neurons located at the column L − 1, during

M

440

the adaptation phase, they also have to forward the global winner identity c. Moreover, the LCM module also generates control signals for the LWS module

ED

allowing it to choose the right LWS operation according to the neuron’s position (see Figure 6). It also configures the size of the input vector D for the VEP 445

which is mandatory for the sequential distance calculation.

PT

The NI unit manages data received/sent by its parent neuron from/to other neurons. Along with the router to which it is physically connected, and indirectly

CE

other routers of the NoC, it ensures the implementation of the data transport function fsrc→dst (data) introduced in the previous section. The received data

450

2 2 are either data related to the SOM operation (DL2,c , ci,j , DL2,c , c), either i,j

AC

input vector elements (ξi ) or the neuron configuration data (L, K and D). The NI also ensures that all received data are correctly identified and dispatched to the corresponding modules, as presented in Figure 4. On the other hand, the locally identified winner and its squared Euclidean distance are prepared by

455

the NI in the form understandable by the NoC before their sending (see Figure

21

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 9: Illustration of the scalability of the proposed hardware SOM-NoC architecture

M

2(c)). These prepared data include destination addresses of the neighbouring neurons which are configured and supplied by the LCM unit. In this way, the

ED

systolic SOM operation is ensured.

The proposed hardware SOM architecture has at the system level a Global 460

Configuration Module (GCM), as it is presented in Figure 8. This module

PT

sends the configuration data to all neurons as well as the input vector elements to be processed. As configuration data, the dimensions of the map L and K, and the input vector dimension D are sent to all neurons. Upon reception of

CE

these configuration data, each neuron via its LCM unit has to (re-)configure its

465

own parameters, specially the addresses of the neurons to which the LWS data

AC

will be sent during the competition phase, and the addresses of the neurons to broadcast the global winner identity in the adaptation phase. The time needed to transport data between neurons via the NoC is not zero and must be taken into account during the configuration phase. All neurons will not receive

470

the new configuration data at the same time. To shorten the time needed to (re-)configure the entire SOM network, a solution is to send in parallel same 22

ACCEPTED MANUSCRIPT

configuration data by columns (dashed lines from the GCM to routers), thus supplying at the same time all neurons belonging to the same row. The major originality of the proposed hardware SOM architecture is its scalability. It is illustrated in Figure 9 on an example of extending an initial 2 × 2

CR IP T

475

to a 4 × 2 SOM-NoC architecture. This operation goes through three phases: a

physical linking, a configuration of routers and a configuration of neurons. We assume that the new pairs of router/neuron to add to the initial architecture

must be of the same type as the initial ones. For this reason, the physical link480

ing between the added pairs of router/neuron is straightforward. In the second

AN US

phase, the added routers must follow the initial structure in terms of coordinates. As it has been illustrated in Figure 9, the coordinates of added routers (in red) are ordered in accordance with the row and column coordinates of the

initial routers. This configuration can be done either in the design phase (in 485

the case of an inter-chip communication) or at runtime where right row and column coordinates are sent to the routers, which are presumably configurable.

M

It should be mentioned that this phase is crucial for the correct operation of the network, because the message routing between routers is based on these

490

ED

coordinates. Finally, in the last phase, all neurons of the new larger network must be updated with these changes. These new configuration data are sent to

PT

all neurons by the GCM unit, as explained previously.

4. Results and discussion

CE

4.1. Performance evaluation The proposed architecture was described in VHDL and synthesized on a Xilinx VC707 Virtex-7 FPGA board by using the Xilinx ISE Design Suite 14.7. It was

AC

495

also compared to the state-of-the-art SOM architectures presented in [19] and [2], also implemented on the same technology. The architecture presented in [2] is a massively parallel SOM having the best performances that have been reported in the literature. On the other hand, the architecture presented in [19]

500

is a hardware sequential SOM architecture, which is highly flexible and config-

23

ACCEPTED MANUSCRIPT

urable. These comparison results for 16-element input vectors are presented in Figures 10, 11 and 12 in terms of maximal operating frequency and number of cycles needed for both learning and recall phases respectively for different map

505

CR IP T

sizes. Figures 10 shows that the proposed architecture has the highest operating fre-

quency for the map sizes up to 8 × 8. This result does not imply that the

proposed architecture gives the best results in terms of performances among three tested architectures. It only confirms that the proposed architecture is scalable in terms of working frequency, which is essentially due to the use of

NoC for communication. No matter the map size, the operating frequency re-

AN US

510

mains stable. The maximal operating frequency of the sequential architecture is also stable because its hardware structure changes slightly between two different map sizes. On the other hand, the massively parallel hardware architecture is the most influenced by the map sizes. Its maximum working frequency decreases 515

with map size increase.

M

Figures 11 and 12 show the comparison results in terms of the number of cycles needed for the learning and recall phases respectively for different map sizes (up

ED

to 128 × 128). These results are the true measure of performances of the three hardware SOM architectures. The massively parallel architecture is unbeatable 520

in terms of performances. No matter the map size, the learning or recall phases

PT

are both finished within one clock cycle. On the other hand, the sequential architecture has the poorest results, which is expected given that all operations

CE

are done sequentially. The proposed SOM-NoC architecture is much faster than the sequential one, but much slower than the parallel one. It should also be

525

stated that the main objective of our proposed architecture is to show how the

AC

SOM operation can be made scalable, not to propose the best performing SOM architecture. Figures 13 and 14 show the time distribution of the proposed SOM architecture for both learning and recall phases as a function of map size. It can be seen that,

530

for a given input vector size (here 16-element vectors), the calculation of the squared Euclidean distance is constant along with the map size. Moreover, these 24

ACCEPTED MANUSCRIPT

calculations are done in parallel in all neurons and the presented times for all map sizes is the most optimistic case where we assume that all neurons start and finish to compute the Euclidean distances at the same time. The last assumption implies also that the input vectors are delivered synchronously to all neurons,

CR IP T

535

which is rarely the case. On the other hand, the winner search operation, which is distributed all over the network in the systolic manner described earlier, is

the most time consuming operation. Table 3 shows the time distribution of all operations of the proposed architecture in details as a function of the map 540

size and the input vector dimension. From Table 3, it can be seen that the

AN US

input vector dimension only influences the squared Euclidean calculation and

the weights update, TDL2 and TU respectively, not the global winner search part. The global winner search takes TGW S cycles to be carried out, and is proportional to the number of systolic stages which is equal to L + K − 2 for a 545

L × K network. The factor 9 is the time in clock cycles which is needed to a

AC

CE

PT

ED

M

message to cross one systolic level. The global winner search operation is time

Figure 10: Maximal operating frequency as a function of SOM size

25

AN US

CR IP T

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

Figure 11: Duration of the learning phase as a function of SOM size

Figure 12: Duration of the recall phase as a function of SOM size

26

AN US

CR IP T

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

Figure 13: Time distribution of the learning phase for the proposed SOM architecture

Figure 14: Time distribution of the recall phase for the proposed SOM architecture

27

ACCEPTED MANUSCRIPT

Table 3: Time distribution of all operations as a function of the map size L × K and the input vector dimension D Time

Time [Clk]

2 DL2,(i,j)

TDL2

D+4

TGW S

9(L + K − 2)

TBC

4(L + K − 1)

Global winner search Winner ID broadcasting

CR IP T

Step

TU

D+1

Competition

TC = TDL2 + TGW S

D + 4 + 9(L + K − 2)

Adaptation

TA = TBC + TU

D + 1 + 4(L + K − 1)

Learning

TL = TC + TA

AN US

Update phase

2D − 4 + 13(L + K − 1)

M

consuming for several reasons. First, for a neuron (i, j) the calculated squared Euclidean distance is sent to its neighbours (i+1, j) and (i, j+1). The calculated distance, before arriving at the destination, leaves the source neuron, crosses two NoC routers (the corresponding router and the neighbour’s one), leaves the

ED

550

router and finally accesses to the destination neuron. Each of these steps takes some time and lengthens the total amount of time TGW S needed to finish the

PT

global winner search. Second, it should also be stated that the employed NoC routers are the commonly used NoC routers based on the wormhole switching technique, and are without any modification or adaptation to the presented

CE

555

systolic SOM operation. In addition, the winner ID broadcasting phase is less time consuming than the global winner search part TBC < TGW S . The main

AC

reason for this is that a part of the winner ID broadcasting is done in parallel by all neurons belonging to the column (L − 1, j) where 0 ≤ j < K,

560

In the proposed approach, all neurons are supplied with input vectors through columns, as described earlier. Thus, all neurons belonging to the same row receive synchronously input vectors, and start the distance calculations at the

28

ACCEPTED MANUSCRIPT

Table 4: Total number of messages and flits sent per iteration as a function of the map size L × K and the input vector dimension D for different SOM operations

Input vector supply Global winner

Number of messages

Number of flits

Nm,i = L × K

Nf,i = (D + 1) × Nm,i

Nm,c = (L − 1) × (K − 1) × 2 +L + K − 2

search

broadcasting

Nm,d = (K − 1) + (L − 1) × K

Nf,c = 2 × Nm,c

Nf,d = 2 × Nm,d

AN US

Winner ID

CR IP T

Step

same time. The arrival time of the input vectors to the neurons belonging to the same row depends also on the traffic load of the network which may perturb 565

this synchronous start of computation. In a L × K SOM-NoC architecture, for a

M

D-element input vectors, where each input vector’s element is sent as a flit, the total number of messages and flits that are sent to supply all neurons with input vectors is equal to L × K and (L × K) × (D + 1) (plus one is for the header flit) 570

ED

respectively. In addition, during the competition phase where data exchange is done in the systolic manner (see Section 2.4), each neuron (i, j) (except the ones at the (i, K − 1) and (L − 1, j) positions) sends 2 messages of 2 flits (header

PT

flit plus one flit containing local winner neuron ID with its distance) to its closest neighbours. Therefore, the total number of messages and flits during

CE

the competition phase is equal to Nm,c = (L − 1) × (K − 1) × 2 + L + K − 2

575

and Nf lits,c = 2 × Nm,c respectively. Finally, in the diffusion phase, the neuron (L − 1, K − 1) which is the first informed about the identity of the global winner

AC

neuron, diffuses it to all neurons belonging to the column L − 1. The L − 1 column neurons, in their turn, diffuse the global winner identity to all neurons by row. The total number of sent messages and flits in this phase amounts to

580

Nm,d = (K − 1) + (L − 1) × K and Nf,d = 2 × Nm,d respectively. These results are summarized in Table 4.

29

ACCEPTED MANUSCRIPT

From the presented results, it can be concluded that the traffic patterns used in the presented SOM-NoC architecture per iteration are straightforward, application independent and not prone to congestion: input vectors are sent to all neurons by columns by the GCM module; each neuron sends a local winner ID

CR IP T

585

and the corresponding distance to its closest neighbours; the neuron (L−1, K−1) initiates the global winner diffusion phase by sending the global winner ID to

the neurons (L − 1, j) (where 0 ≤ j < K − 1), which send it (the global winner ID) to all neurons belonging to the same row. All these traffic patterns are done 590

sequentially and not at the same time, thus avoiding the congestion situations

AN US

which may occur in a NoC with high traffic loads. Therefore, adaptive routing schemes such as ones presented in [20–22] or reconfigurable NoC approaches [23, 24], which are often used to offload congested NoC routers and to balance

the overall traffic all over the network by using adaptive routing policies or by 595

changing NoC architecture respectively, would not help much to improve the overall performances of the SOM-NoC architecture, due to these deterministic

M

traffic patterns. However, network coding communication protocols allowing to send the same data to many processing nodes, so called multicast communica-

600

ED

tion as the approaches presented in [25, 26], may be a way of improvement of the existing SOM-NoC approach, which is using multiple unicast approach. In fact, the traffic patterns used in the proposed architecture are suited for the

PT

multicast communication in the input vector supply and diffusion phases: the same input vector must be delivered to all neurons from the GCM module, as

CE

well as the global winner identity must be known by all neurons starting from 605

the neuron (L − 1, K − 1). Another way of improvement of the presented SOMNoC architecture, which is directly derived from the presented results, is at the

AC

router micro-architectural level. Indeed, the time distribution results presented in Table 3 point out explicitly the latency needed to exchange data between two neighbouring neurons. A low-latency router should be preferred or even

610

some hybrid neuron-router approaches may come as a solution for performance improvements. From the presented discussion, it can be concluded that several possible improvements of the proposed architecture are possible and should be 30

ACCEPTED MANUSCRIPT

done in the future on this basis, keeping in mind the fact that the main objective of the proposed study is to show how the SOM operation can be made scalable, and not to propose the best performing SOM architecture. 4.2. Validation on image compression application

CR IP T

615

The scalability and adaptability of the proposed SOM-NoC architecture is also

tested and validated in an image compression application. The use of hardware SOMs in image compression applications have already been reported in 620

the literature [1–6]. The image compression using SOMs goes through two

AN US

phases: the colour quantization which results in a colour pallet comprising the

representative colours of the images to compress; and the phase of generating the compressed binary data by using the obtained colour pallet. In the colour quantization phase, a L × K SOM network is used as a colour quantizer, whose 625

neurons have the weights of the same size as the pixels used to train the network (3 elements, corresponding to a RGB pixel). At the end of the training phase of

M

the SOM quantizer, the randomly initialized SOM weights will converge to the most representative colours of the images used to train the network. Thus, the

630

ED

total number of colours of an image is reduced to the size of the SOM network used for quantization purposes (here L × K). Moreover, the colour pallet is obtained by taking the weights of the SOM neurons at the end of the training

PT

phase and is often called codebook. In the second phase, the obtained colour pallet is used to compress the image: instead of using the true colour code to code a pixel, the position of the neuron having the weights (colour) closest to the colour of the observed pixel is used, thus reducing the pixel size. A P × Q

CE 635

original image can also be divided into blocks of M × M pixels. Hence, the total

AC

number of obtained blocks is: Nblk =

P ×Q M ×M

(13)

with P ×Q representing the image resolution. The binary size of the compressed image Sc , the compression ratio CR and the space savings SS are obtained

31

AC

CE

PT

ED

M

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 15: The timeline of used configuration for image compression

32

ACCEPTED MANUSCRIPT

640

respectively with:

CR =

SS = 1 −

R×P ×Q Sc

Sc 1 =1− R×P ×Q CR

where R is the number of bits used to code a pixel.

(14)

CR IP T

Sc = Nblk × {dlog2 (L)e + dlog2 (K)e}

(15)

(16)

645

AN US

In the proposed validation, three different configurations were simulated: the configuration C1 representing a 7 × 7 SOM network using input vectors of 1 × 1 pixels (3 elements); C2 - a 10 × 10 SOM network using input vectors of 1 × 1 pixels (3 elements); and C3 - a 10 × 10 SOM network using input vectors of 2 × 2 pixels (12 elements). The timeline of the used configurations with different network and input vector sizes is presented in Figure 15. At the startup time T1 , the image compression system receives the C1 configuration data. These data

M

650

configure the system for image compression with 49 colours and input vectors

ED

of 3 elements corresponding to one RGB pixel. The time needed to configure the whole system is Tconf ig and is equal to the time needed for the Global Configuration Module to send all relevant data to all neurons. In the proposed architecture, Tconf ig is 35 clock cycles. At T1 + Tconf ig , the system is ready

PT

655

to compress all input images in accordance with the selected parameters. The image compression system remains in the C1 configuration, during the time

CE

needed to extract the most relevant colours from input images (the learning phase), to reduce the overall size of input images based on this colour extraction (compression phase) and to reconstruct the compressed images. This time

AC

660

is image size dependent and was determined through simulation. Each input image is accessed twice: the first time during the learning phase where pixels are chosen randomly, and the second time during the compression phase where all pixels are scanned one by one. At T2 , the image compression system receives

665

the C2 configuration data where the size of the network (and thus the number of 33

ACCEPTED MANUSCRIPT

colours) is increased while keeping the same dimension of the input vectors. It should be mentioned that, at T2 , new pairs of router/neurons are added to the initial SOM to form the final 10 × 10 map. At T2 + Tconf ig , the system is again 670

CR IP T

operational and ready to process input images. The same scenario is repeated with the C3 configuration data, where at T3 + Tconf ig the network size (and the

number of colours) is kept unchanged while the size of input vectors is increased to 12, corresponding to the blocks of 4 RGB pixels (2 × 2). The three presented

The three presented configurations were applied to several images (Lenna, Airplane, Pepper and Parrot) of different resolutions (from 128 × 128 to 640 × 480) and different types (greyscale and RGB), in the order given by the timeline in Figure 15. The obtained results are shown in 3 × 3 matrix form in Figures 16 and 17. For each image in Figures 16 and 17, each line of the matrix corresponds to one of the three tested configurations. Moreover, for each configuration, three additional metrics are given for information purposes: the compression ratio (see Equation 15), the Mean Square Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR). For each tested image (shown in the top left corner of each quadrant), the first column of the matrix presents the obtained colour palettes, or codebooks, for the different configurations; the second and third columns present the reconstructed images and the dissimilarities between the original and reconstructed images, respectively. The codebooks (colour palettes) and the reconstructed images are generated by the image compression system, whereas the dissimilarities and the calculation of MSE and PSNR are done offline with Matlab.
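For reproducibility, a minimal sketch of the offline quality metrics is given below; it assumes 8-bit colour channels (peak value 255) and uses the standard MSE and PSNR definitions, since the authors' exact Matlab script is not reproduced here.

```python
import numpy as np

def mse_psnr(original, reconstructed, peak=255.0):
    """Mean Square Error and Peak Signal-to-Noise Ratio between two images."""
    err = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = float(np.mean(err ** 2))
    psnr = 10.0 * np.log10(peak ** 2 / mse) if mse > 0 else float("inf")
    return mse, psnr
```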

5. Conclusion

In this work, a scalable and adaptable hardware implementation of a SOM network is presented. The scalability of the SOM operation is obtained by using the Network-on-Chip communication approach and by distributing the global winner search operation in a systolic manner over the whole network. Indeed, the global winner search operation is dispatched to the local searching units belonging to the neurons, thus making the neuron connections more relaxed and easier to configure. Consequently, the proposed architecture allows the SOM operation to be dynamically extended from a smaller to a larger map only by (re-)configuring the parameters of each neuron. On the other hand, the gained scalability is not without cost, and can be to the detriment of overall performance. Indeed, the performance bottleneck of the proposed architecture lies in the data exchange through the NoC, which can be very time consuming, especially for large SOM networks. A solution to this problem may be, instead of using common NoC approaches, to design SOM-specific NoCs that take into account all specificities of the SOM operation and thus reduce these time-consuming tasks.
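As a behavioural illustration of the distributed winner search (a sketch under simplifying assumptions, not the RTL of the proposed architecture), the Python fragment below shows how repeated local-minimum exchanges between 4-connected neighbours propagate the global minimum distance to every node of an L × K mesh in at most L + K − 2 exchange steps; the function name and the NumPy formulation are choices made for this example.

```python
import numpy as np

def systolic_global_min(dist):
    """Propagate the minimum of 'dist' (one distance value per neuron on an
    L x K mesh) by repeated min-exchanges with the 4-connected neighbours."""
    best = dist.astype(np.float64).copy()
    L, K = best.shape
    for _ in range(L + K - 2):  # worst-case Manhattan distance across the mesh
        nxt = best.copy()
        nxt[1:, :] = np.minimum(nxt[1:, :], best[:-1, :])    # from the north
        nxt[:-1, :] = np.minimum(nxt[:-1, :], best[1:, :])   # from the south
        nxt[:, 1:] = np.minimum(nxt[:, 1:], best[:, :-1])    # from the west
        nxt[:, :-1] = np.minimum(nxt[:, :-1], best[:, 1:])   # from the east
        best = nxt
    return best  # every entry now holds the global minimum distance

# Example on a 10 x 10 map with random distances
d = np.random.rand(10, 10)
assert np.allclose(systolic_global_min(d), d.min())
```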


[Graphical results for Lenna and Airplane: for each image, the original image, the SOM palette, the reconstructed image and the dissimilarity map are shown for the SOM 7×7 (1×1 block), SOM 10×10 (1×1 block) and SOM 10×10 (2×2 block) configurations, each annotated with its CR, SS, PSNR and MSE values.]

Figure 16: Obtained quantization results for different configurations on 128×128 images: Lenna and Airplane

[Graphical results for Pepper and Parrot: for each image, the original image, the SOM palette, the reconstructed image and the dissimilarity map are shown for the SOM 7×7 (1×1 block), SOM 10×10 (1×1 block) and SOM 10×10 (2×2 block) configurations, each annotated with its CR, SS, PSNR and MSE values.]

Figure 17: Obtained quantization results for different configurations on 128×128 images: Pepper and Parrot

Authors' biography

Mehdi Abadi received the Engineering degree in real-time computing and the master's degree in embedded systems from the University of Sousse, Tunisia, in 2011 and 2013, respectively. He is currently pursuing the Ph.D. degree as an exchange scholar at the National Engineering School of Sousse, University of Sousse, Tunisia, and the University of Lorraine, Nancy, France. He is a member of the Jean Lamour Institute (UMR 7198), University of Lorraine, Nancy, France, and of the Technologie et Imagerie Médicale Laboratory at the Faculty of Medicine, University of Monastir, Tunisia. His main research interests include neural network architectures, reconfigurable and adaptable embedded systems and real-time signal processing.

Slavisa Jovanovic received the B.S. degree in electrical engineering from the University of Belgrade, Serbia, in 2004, and the M.S. and Ph.D. degrees in electrical engineering from the University of Lorraine, France, in 2006 and 2009, respectively. From 2009 to 2012, he was with the Diagnosis and Interventional Adaptive Imaging laboratory (IADI), Nancy, France, as a research engineer working on MRI-compatible sensing embedded systems. He then joined the Faculty of Sciences and Technologies and the Jean Lamour Institute (UMR 7198), University of Lorraine, Nancy, where he is currently an assistant professor. His main research interests include reconfigurable Networks-on-Chip, energy harvesting circuits, neuromorphic architectures and algorithm-architecture matching for real-time signal processing. He is the author and co-author of more than 50 papers in conference proceedings and international peer-reviewed journals, and he holds one patent.

Khaled Ben Khalifa received his M.Sc. in Physics (Microelectronics), his DEA in Materials and Devices for Electronics, and a Ph.D. in Physics-Electronics from the University of Monastir, Tunisia, in 1999, 2001 and 2006, respectively. He is currently an assistant professor (electrical engineering) at the Higher Institute of Applied Sciences and Technology, University of Sousse, Tunisia, and a senior researcher at the Laboratory of Technology and Medical Imaging (LR12ES06) at the Faculty of Medicine, University of Monastir, Tunisia. His research interests are related to real-time embedded systems, FPGA-based systems, systems-on-chip, neural networks and heterogeneous multiprocessor architectures.

Serge Weber was born in 1961. He received the M.S. degree in electrical, electronic and control engineering in 1983 and the Ph.D. degree in electronics in 1986, both from the Henri Poincaré University of Nancy, France. In 1988, he joined the Electronics Laboratory of Nancy (LIEN) as an Associate Professor. Since September 1997, he has been a Professor and the manager of the Electronic Architecture group at LIEN (University Henri Poincaré). His research interests focus on reconfigurable and parallel architectures for image and signal processing or for intelligent sensors. From 2006 to 2013, he was the director of the Electronics Laboratory of Nancy (LIEN). In 2013, he joined the Jean Lamour Institute. He has co-authored more than 100 papers in peer-reviewed international journals and conferences and holds two patents.

Mohammed Hédi Bedoui received the Ph.D. degree in biomedical engineering from Lille University, Villeneuve-d'Ascq, France, in 1992. He is currently a Professor of Biophysics with the Faculty of Medicine, Monastir University, Monastir, Tunisia, and the Director of the Technologie et Imagerie Médicale Laboratory at the Faculty of Medicine, Monastir University. He has published many papers in international journals. His current research interests include biophysics, medical image processing, embedded systems, and HW/SW codesign.
