Integrated core selection and mapping for mesh based Network-on-Chip design with irregular core sizes


Accepted Manuscript

Integrated Core Selection and Mapping for Mesh based Network-on-Chip Design with Irregular Core Sizes

J. Soumya, K. Naveen Kumar, Santanu Chattopadhyay

PII: S1383-7621(15)00084-3
DOI: http://dx.doi.org/10.1016/j.sysarc.2015.07.014
Reference: SYSARC 1298

To appear in: Journal of Systems Architecture

Received Date: 5 September 2014
Revised Date: 17 July 2015
Accepted Date: 29 July 2015

Please cite this article as: J. Soumya, K. Naveen Kumar, S. Chattopadhyay, Integrated Core Selection and Mapping for Mesh based Network-on-Chip Design with Irregular Core Sizes, Journal of Systems Architecture (2015), doi: http://dx.doi.org/10.1016/j.sysarc.2015.07.014

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Integrated Core Selection and Mapping for Mesh based Network-on-Chip Design with Irregular Core Sizes

Soumya J., K. Naveen Kumar, Santanu Chattopadhyay
Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India

Abstract– Network-on-Chip (NoC) has been proposed to replace traditional bus based System-on-Chip (SoC) architecture to address the global communication challenges in nanoscale technologies. A major challenge in NoC based system design is to select Intellectual Property (IP) cores for implementing tasks and associate the selected cores to the routers to optimize cost and performance. These are commonly known as the process of core selection and application mapping respectively. In this paper, the integrated core selection and mapping problem has been addressed. Mesh architecture has been considered for experimentation. The integrated core selection and mapping problem takes as input the application task graph, topology graph and a core library. It outputs the selected cores for the tasks and their mapping onto the topology graph, such that all communication requirements of the application are satisfied. The cores present in a core library may perform more than one task and have non-uniform sizes. For this, a technique based on Particle Swarm Optimization (PSO) has been proposed to select cores from the given core library and map the resultant core graph onto mesh based architectures. An efficient heuristic for mapping has also been proposed, which maps the selected cores onto mesh based architectures, considering non-uniform core sizes. Comparisons have been carried out with the step-by-step core selection and mapping approach and also with mapping algorithms that exist in the literature. Significant reductions have been observed in terms of communication cost over all the cases. Area comparisons have also been made. On average, improvements of 13.05% in communication cost and 2.07% in area have been observed. The proposed approach has also been compared in dynamic environment and significant reductions in the average network latency could be observed. On average, improvements of 5.48% in average network latency and 15.68% in network throughput have been observed. Comparison of energy consumption has also been done in both the cases.

Keywords– Application Task Graph, Communication cost, Core selection, Mapping, Particle Swarm Optimization

I. INTRODUCTION

Network-on-Chip (NoC) has emerged as a viable approach to implement Intellectual Property (IP) core-based designs, popularly known as System-on-Chip (SoC). It can handle efficiently the high bandwidth communication requirements between system components with the help of an underlying on-chip network consisting of routers and interconnects. The cores are attached to the local ports of routers, while the global router ports are interconnected between themselves in different topologies, such as, mesh, tree, etc [1, 2]. Such system design methodology is largely dependent on the reuse policy and third-party cores. Preexisting modules are often integrated to achieve the desired functionality of the system. The cores integrated may be homogeneous in nature, each one being capable of implementing same set of tasks. This gives rise to the Multi-Processor System-on-Chip (MPSoC) architecture. The more general one is the heterogeneous structure in which each core used is capable of performing only a subset of tasks. The sizes of the cores may also vary. The NoC design problem can be visualized as a sequence of steps as noted in Fig. 1. The input to the synthesis problem is the application task graph. The graph consists of nodes representing the tasks of the application. The tasks need to exchange messages between themselves, thus requiring varying communication bandwidths. The first step in the synthesis process is to select cores from a given core library. In a heterogeneous core library, each core is capable of performing a subset of application tasks. Assignment of tasks to cores gives rise to the core graph. Each node in the core graph corresponds to a selected core. One core may be assigned the job of realizing many task functionalities. Thus, the communication between task nodes in the application graph gets translated into bandwidth requirements of intercore communication. The core graph is next mapped onto the specific topology graph. 
This stage is commonly known as Application Mapping. As will be noted in Section II, many application mapping strategies have been reported in the literature. However, the first part, Core Selection, has not received that much attention. The core selection stage can significantly affect the mapping decisions as well. Since different selections of cores for the same application graph may lead to different core graphs, the mapping process will also result in different mapping solutions. Core size variations add another dimension to the NoC synthesis problem, as the mapping algorithm needs to take care of this aspect as well. Most of the mapping algorithms target mesh topology due to its regular structure with small interconnects. In this work also, it is proposed to target the mesh topology. However, there are very few works [33-37] that can handle non-uniform core sizes and irregular mesh structures. These techniques also lack the core selection feature.

[Fig. 1: Application-specific NoC design flow. Stages: Application Task Graph and Core Library, Core Selection, Core Graph and Topology Graph, Mapping, Mapped Topology Graph, Routing, Routes for Communication, Scheduling, Synthesized NoC]

In this background, the present paper attempts to solve the integrated core selection and mapping problem. In the process, it also develops a new heuristic application mapping strategy for non-uniform core sizes, resulting in irregular mesh structure. To define the problem, a few definitions are presented first.

Definition 1: Application Task Graph (ATG) is a directed weighted graph A(T, E). Each node in T represents a task in the application. An edge corresponds to the communication requirement between the tasks. The weight of the edge represents the volume of communication. Each task t_i ∈ T has a type, type(t_i) ∈ F, F being the set of all functionalities.

Definition 2: Core Library (L) is a library of IP cores. Each core c in it has an associated dimension dim(c) and can realize a subset f (subset of F) of functions. A task t can be realized by c only if type(t) ∈ f.

Definition 3: Topology Graph (TG) is an undirected graph H(R, L) with the set of routers R, and a set of bidirectional links L between the routers. Each link has a capacity in terms of the maximum communication bandwidth that it can support.

A number of tasks in the ATG may get merged onto a single core. This creates the core graph for the application. Cores in the core graph are mapped onto routers in the topology graph. The cost of such a mapping is evaluated in terms of communication cost [27], which has direct correspondence with performance of the NoC and the network power consumption.

$$\text{Communication cost} = \sum_{(c_i, c_j) \,\in\, \text{Edges of core graph}} \left(\text{No. of hops} \times \text{Bandwidth between each pair of cores } c_i, c_j \text{ in core graph}\right) \qquad (1)$$

Problem Definition: Given an application task graph A, a core library L and a topology graph H, select cores from L to realize tasks in A and map the resulting core graph onto H to minimize the communication cost, area, and power consumption of the NoC.

The salient contributions of the work are as follows.
1. Integrates core selection and mapping into a single problem. The core selection problem is an instance of the set covering problem [3], whereas the mapping problem is an instance of the constrained quadratic assignment problem (QAP) [4], both of which are NP-hard. Hence, a formulation based on Particle Swarm Optimization (PSO) has been developed to solve the integrated problem.
2. Another major contribution of the work is the development of a heuristic application mapping procedure that takes care of non-uniform core sizes and can produce solutions in terms of irregular mesh architecture.
3. The heuristic mapping algorithm has been compared with well established techniques available in the literature.
4. Communication cost, floorplan area, throughput, latency and power consumption of the resulting networks have been determined for integrated and step-by-step solutions to the core selection and mapping problems. The results demonstrate the superiority of the integrated approach.

[Fig. 2: Core selection phase in Application-specific NoC design flow. Blocks: Application Task Graph, Core Library, Topology Graph, Set of available cores for each task, Get cores for the tasks, Integrated Core Selection and Mapping Optimizer, Form core graph, Mapping, Find communication cost, Selected cores with mapping]

It may be noted that task scheduling is not under the purview of the current work. Thus, the proposed approach, starting with an application task graph, selects the cores from a core library and maps them to irregular mesh topology. The overall scope of the work has been depicted in Fig. 2. The rest of the paper is organized as follows. Section II gives a literature survey. A brief discussion on PSO and the PSO formulation for integrated core selection and mapping has been presented in Section III. Section IV describes the proposed predictive heuristic for the mapping problem. Section V enumerates the experimental results. Section VI concludes the paper.

II. RELATED WORK

Application mapping onto NoC architectures is a well known NP-hard problem [4]. A detailed survey of application mapping strategies and their classification has been done in [5]. The works [6-10] are based on dynamic mapping techniques. In [6], authors have proposed a compiler based application mapping onto mesh based NoC to reduce energy consumption. In [7], authors have addressed run-time task allocation to homogeneous NoC architectures by considering user’s behavior. In [8], the real-time applications are dynamically mapped onto embedded MPSoCs, where communication is performed via NoC; the resources connected to the NoC have multiple voltage levels. Mapping heuristics have been proposed in [9, 10] to map the communication tasks of the application close to each other so as to minimize the communication overhead and thus improve performance. Some works reported in the literature provide mathematical programming based solutions for the mapping problem onto mesh based NoC architectures. A Mixed Integer Linear Programming (MILP) based task mapping for heterogeneous multiprocessor systems has been reported in [11]. In this heterogeneous multiprocessor, some processors are programmable, while others are application specific. A two stage Integer Linear Programming (ILP) formulation has been

presented in [12] for process allocation, data mapping on symmetric multi processing (SMP) and block multi-threading based network processor. A unified approach has been proposed in [13] to solve the application mapping problem on a heterogeneous NoC platform for energy minimization by utilizing MILP formulations. Many search algorithms and heuristics have been proposed to solve the problem of application mapping onto NoC based architectures. An algorithm based on branch-and-bound technique has been proposed in [14] for solving the mapping and routing path allocation problems in regular tile-based NoC architectures to satisfy design constraints through bandwidth reservations. In [15, 16], authors have proposed a strategy for topological mapping of IP cores in a mesh-based NoC architecture. The approach uses heuristics based on multi-objective genetic algorithms to explore the mapping space and find the Pareto mappings that optimize performance and power consumption. A multi-objective genetic algorithm based application mapping for NoC has been presented in [17], which targets mapping with Network Assignment (NA) for heterogeneous distributed embedded systems to improve the performance and reduce the power consumption and area. A multi-objective Genetic Algorithm (MOGA) based application mapping technique has been proposed in [18], where one–one as well as many–many mapping between switches and tiles have been taken into consideration to minimize energy consumption and required link bandwidth. A genetic algorithm based mapping and routing approach called GAMR has been proposed in [19] for low energy design of 2D mesh based NoC under communication bandwidth constraint. A Multi-objective Adaptive Immune Algorithm (MAIA) based on evolutionary approach has been proposed in [20] to solve the NoC mapping problem. MAIA explores the mapping space, targeting latency and power consumption optimization. 
An improved version of MAIA has been proposed in [21] to solve the multi-application NoC problem. It produces a set of mapping alternatives by exploring the mapping space. PSO based load balance mapping and routing (PLBMR), a Particle Swarm Optimization (PSO) based two-phase application mapping algorithm has been proposed in [22], which minimizes the NoC communication energy and allocates the routing path for balancing the link load. PSO based Mapping (PSMAP), a meta-heuristic strategy using PSO technique has been proposed in [23, 24] to reduce both static and dynamic cost (static cost is the communication cost as computed by Equation (1), Section I, while the dynamic cost is the latency faced by packets computed from a simulation of traffic flow through the network) of NoC for 2-D mesh based application mapping. A unified algorithm, called unified mapping, routing and slot allocation (UMARS) has been proposed in [25], which couples mapping, path selection and time-slot allocation, using a single consistent objective. A low complexity heuristic algorithm, CastNet, has been presented in [26] for the application mapping and bandwidth constrained routing algorithm for mesh-based NoC architectures aiming to minimize the energy consumption. NMAP, a communication aware mapping technique has been proposed in [27] with minimum path routing in the mesh architecture which satisfies the bandwidth constraint and minimizes the average communication delay. A polynomial time heuristic technique called mesh based on-chip interconnection architectures (MOCA) has been presented in [28] for automated design of low energy mesh based NoC architectures. To capture both timing of application communication and communication volume, communication dependence and computation model (CDCM) have been proposed in [29, 30]. It maps applications on regular NoC under bandwidth constraint and minimizes average communication delay. 
A power-aware template-based efficient mapping (TEM) algorithm for NoC has been proposed in [31] to generate good mapping solutions with low run time under bandwidth and latency constraints. A strategy that simultaneously refines the mapping and routing function has been presented in [32] to determine the Pareto optimal configurations which optimize average delay and routing robustness. A unified communication-aware NoC-based MPSoC mapping and scheduling algorithm has been proposed in [33] in which a list-scheduling method is used to map prioritized tasks to the best-fit processor, based on a transmission route-aware cost function. Most of the works in the literature do not consider non-uniform core sizes in the mapping phase and also lack in the core selection phase. There are very few works in the literature which consider the irregularity of cores while mapping. In [34], area utilization based mapping (AUBM) has been proposed in which authors have considered non-uniform sized cores while mapping onto 2D-mesh network. The mapping algorithm starts with the core having highest bandwidth requirement among all

the cores. Then the merging of cores is done by considering different positions and orientations of the cores and the algorithm terminates with the mapping which gives minimum value of the objective function. They have considered the product of area and communication volume as the objective function. However, they have not considered the core selection phase and assume that each core in the core graph can perform one or multiple tasks. They have not reported the results of their approach in dynamic environment. In [35], authors have addressed the mapping algorithm of irregular mesh based NoC and established a mathematical model. They have proposed the constraints to avoid grid overlapping and the method to calculate exact communication distance of irregular mesh NoC. A branch-and-bound mapping algorithm and routing path allocation technique has been proposed in [36] for regular tile-based NoC architectures. They have advised to split the irregular size IP to avoid mapping problem on regular mesh topology as part of the possible extensions of their work. However, they have not given the details of their extensions in terms of routing policy and irregular sized core to router mapping. Also, mapping results have not been reported for irregular tiles. In [37], an energy aware mapping algorithm has been proposed for irregular IP cores onto regular tile based 2D-mesh NoC architectures. It decomposes a large IP into several dummy IPs or integrates several small IPs into one dummy IP, such that each dummy IP can fit into a single tile. The algorithm computes the average area of all IP cores and generates the initial mapping solution. The authors have shown buffer space allocation scheme according to the input/output degree of cores to avoid connection congestion and reduce communication energy. However, they have not reported the results in dynamic environment and lacks in core selection. 
Architecture-Aware Analytic Mapping algorithm (A3MAP) has been proposed in [38] for NoC with homogeneous and heterogeneous cores on regular and irregular mesh or custom architecture. They have formulated mapping problem as a Mixed Integer Quadratic Programming (MIQP), and proposed a successive relaxation and genetic algorithm based mapping. However, the proposed algorithm takes more CPU time compared to the existing mapping techniques and does not consider the core selection problem. In [47], a two-step genetic algorithm has been presented which finds a mapping of the vertices of the task graph to available cores so that the overall execution time of the task graph can be minimized. They have developed a delay model to estimate the execution time. However, the irregularity of core sizes has not been considered while mapping. In [39], authors have proposed mapping in two phases, assigning tasks to the suitable IP cores and mapping the selected IP cores to the appropriate tiles in NoC platform to minimize delay and energy consumption. They have utilized chaotic discrete PSO in both the phases. However, in the first phase, as the tasks have not been assigned to physical tiles, the delay and energy consumption estimation are obtained based on average path length only which is not a correct measure. They have not considered the irregularity of the core sizes as well, while mapping. As the IP core selection is not integrated with the mapping phase, there is very little flexibility in selecting the IP cores for the second phase. The selected IP cores from the first phase only are to be used in the second phase for mapping onto NoC. In contrast, the proposed approach integrates the core selection and mapping phases together. The actual delay incurred in the communication between the tasks has been considered in the proposed technique, since the communication cost is calculated after the mapping phase. 
The irregularity of the core sizes is also considered at the mapping phase.

III. PSO FORMULATION FOR INTEGRATED CORE SELECTION AND MAPPING

Particle Swarm Optimization (PSO) [40] is a population based stochastic technique developed by Eberhart and Kennedy in 1995, inspired by social behavior of bird flocking or fish schooling. In a PSO system, multiple candidate solutions coexist and collaborate simultaneously. Each solution, called a particle, flies in the problem space according to its own experience as well as the experience of other particles in the population. It has been successfully applied in many problem areas. Each particle has a fitness value. The quality of a particle is evaluated by its fitness. Inspired by its success in solving problems in continuous domain, several researchers have attempted to apply it in discrete domain as well [41]. This motivates the present work to look for a Discrete Particle Swarm Optimization (DPSO) formulation of selecting suitable cores from the core library as part of the application-specific NoC design problem. Let the position of the ith particle (in an n-dimensional space) at kth iteration be

$$p_k^i = \langle p_{k,1}^i, p_{k,2}^i, \ldots, p_{k,n}^i \rangle.$$

Let pbest^i be the local best solution that particle i has seen so far over the generations, and gbest_k be the global best particle of generation k. The new position of particle i is calculated as follows:

$$p_{k+1}^i = \left( ct_1 * I \oplus ct_2 * \left(p_k^i \rightarrow pbest^i\right) \oplus ct_3 * \left(p_k^i \rightarrow gbest_k\right) \right) \cdot p_k^i \qquad (2)$$

In these expressions, a → b represents the minimum length sequence of swaps to be applied on components of a to transform it to b. For example, if a = <1, 3, 4, 2> and b = <2, 1, 3, 4>, a → b = <swap(1, 4), swap(2, 4), swap(3, 4)>. Here, swap(1, 4) indicates that the first and fourth entries of a are to be interchanged. The operator ⊕ is the fusion operator. Applied on two swap sequences, a ⊕ b is equal to the sequence in which the sequence of swaps in a is followed by the sequence of swaps in b. The constants ct1, ct2, ct3 are the inertia, self-confidence and swarm confidence values. The quantity ct_i * (a → b) means that the swaps in the sequence a → b will be applied with a probability ct_i. I is the sequence of identity swaps, such as <swap(1, 1), swap(2, 2), ···, swap(n, n)>. It corresponds to the inertia of the particle to maintain its current configuration. The final swap sequence corresponding to ct_1 * I ⊕ ct_2 * (p_k^i → pbest^i) ⊕ ct_3 * (p_k^i → gbest_k) is applied on particle p_k^i to generate p_{k+1}^i. From [42], it can be found that the convergence condition for this DPSO is given by,

$$(1 - ct_1)^2 \le ct_2 + ct_3 \le (1 + ct_1)^2 \qquad (3)$$

Accordingly, different values of ct1, ct2 and ct3 that follow the relationship noted in [42] to ensure convergence have been tried in experimentation. The results reported in the paper are for ct1 = 1, ct2 = 0.5, ct3 = 0.5. Experimentation with other values has shown differences mainly in terms of convergence rate; the final solution remains unaltered in most of the cases. Next, the particle formulation is presented to select cores from the core library and solve the integrated core selection and mapping problem, such that the communication cost is minimized. Inputs to the proposed formulation are the application task graph, the core library and the topology graph.
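The swap-based operators of Equation (2) can be illustrated with a small Python sketch. It is not the authors' implementation: the function names are hypothetical, indices are 1-based to match the swap(1, 4) example, and the sketch assumes that the target is a permutation of the source with distinct entries. The identity swaps of the inertia term leave a particle unchanged, so they are omitted here.

```python
import random


def swap_sequence(a, b):
    """Return 1-based swaps that turn a into b (b must be a permutation of a).

    Fixing one position per swap, left to right, gives a minimum-length
    sequence when all entries are distinct.
    """
    work = list(a)
    swaps = []
    for i in range(len(work)):
        if work[i] != b[i]:
            j = work.index(b[i], i + 1)  # where the wanted entry currently sits
            work[i], work[j] = work[j], work[i]
            swaps.append((i + 1, j + 1))
    return swaps


def apply_swaps(particle, swaps, prob=1.0, rng=random):
    """Apply each swap independently with probability prob."""
    out = list(particle)
    for i, j in swaps:
        if rng.random() < prob:
            out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
    return out


def update_position(p, pbest, gbest, ct2=0.5, ct3=0.5, rng=random):
    """One position update in the spirit of Equation (2).

    Both swap sequences are derived from the current position p and then
    fused: swaps toward pbest first, swaps toward gbest second.
    """
    toward_pbest = swap_sequence(p, pbest)
    toward_gbest = swap_sequence(p, gbest)
    moved = apply_swaps(p, toward_pbest, ct2, rng)
    return apply_swaps(moved, toward_gbest, ct3, rng)


a, b = [1, 3, 4, 2], [2, 1, 3, 4]
print(swap_sequence(a, b))                  # [(1, 4), (2, 4), (3, 4)]
print(apply_swaps(a, swap_sequence(a, b)))  # [2, 1, 3, 4]
```

With prob = 1.0 every swap is applied, so the example from the text is reproduced exactly; with smaller probabilities the particle moves only part of the way toward its target.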

A. Particle Formulation and Fitness Function

First, the structure of the particle is described. Let n be the number of tasks in the application task graph. For each of these n tasks, there is a set of available cores capable of performing it, out of which one core per task is selected. A particle is a sequence of n real numbers between 0 and 1, so its length is the same as the number of tasks in the application task graph. The entries drive the selection of cores from the set of available cores for each task. If for the ith task there are k different cores that can perform it and the ith entry of the particle is x_i, the core selected is the ⌈x_i × k⌉-th core in the list for that task. The total communication cost forms the fitness function. Each particle results in a core graph for the application task graph considered, which is mapped onto the topology graph (2D-mesh) using a predictive heuristic proposed in this paper (Section IV), and the corresponding communication cost is calculated. For example, consider an application task graph with five tasks, {t1, t2, t3, t4, t5}. Also, let the set of cores capable of performing the individual tasks be as follows, where {c1, c2, c3, c4, c5, c6} is the total set of available cores.

t1: c1, c3
t2: c2, c5, c1, c4
t3: c3, c2, c6
t4: c4, c2
t5: c5, c6, c4, c1

Consider a particle with the following structure.

Index: 1    2    3    4    5
Value: 0.4  0.9  0.6  0.3  0.1
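The decoding of a particle into a core selection can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, the rounding is assumed to be a ceiling (which is consistent with the example of this section), and clamping an entry of exactly 0 to the first core is an added assumption.

```python
import math


def select_cores(particle, candidates):
    """Pick one core per task: entry x with k candidates selects the ceil(x*k)-th."""
    selected = []
    for x, cores in zip(particle, candidates):
        k = len(cores)
        # Assumption: an entry of exactly 0.0 still selects the first core.
        position = max(1, math.ceil(x * k))
        selected.append(cores[position - 1])
    return selected


# Candidate cores per task, in the order listed in the example.
candidates = [
    ["c1", "c3"],              # t1
    ["c2", "c5", "c1", "c4"],  # t2
    ["c3", "c2", "c6"],        # t3
    ["c4", "c2"],              # t4
    ["c5", "c6", "c4", "c1"],  # t5
]
particle = [0.4, 0.9, 0.6, 0.3, 0.1]
print(select_cores(particle, candidates))  # ['c1', 'c4', 'c2', 'c4', 'c5']
```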

As already mentioned, the length of the particle is equal to the number of tasks in the application. In the example considered, there are 5 tasks and the particle length is also 5, with its index corresponding to the task number. Information given by the particle is used to select the cores from the set of available cores for each task. For task t1, out of the 2 available cores, core c1 is selected if the first entry in the particle is between 0 and 0.5, and core c3 is selected if the value is more than 0.5. So, for task t1, core c1 gets selected. In the same way, cores c4, c2, c4, and c5 are selected for the tasks t2, t3, t4, and t5 respectively. For every particle, the following steps are used to calculate its fitness value:

1. Core graph is formed from the cores selected using the values noted in the particle.
2. The core graph is mapped to the topology graph using the proposed heuristic (discussed in Section IV).
3. While mapping, the dimensions of the cores and fixed link length of mesh topology are taken into account.
4. For each edge in the core graph, a path is found between the cores in the router graph using XY routing algorithm.
5. Fitness is calculated using the formula,

$$\text{Fitness} = (1 - w) \times \frac{\text{Communication cost}}{\text{Minimum Communication cost}} + w \times \frac{\text{Area}}{\text{Minimum Area}} \qquad (4)$$

Communication cost = ∑ (No. of hops × Bandwidth between each pair of cores in core graph)
Area = X-dimension × Y-dimension in Floorplan

The weight factor w lies in the range 0 to 1. The lower the value of Fitness, the better the quality of the particle. The value of w = 0 gives a solution that optimizes the communication overhead (Minimum Communication cost), while w = 1 puts complete emphasis on area minimization (Minimum Area) of the synthesized NoC. It may be noted that as the standard XY routing algorithm has been used, the deadlock issues have been taken care of at the routing phase.
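The fitness evaluation steps above can be sketched in Python under simplifying assumptions. All names, the edge bandwidths and the placement below are hypothetical. The sketch assumes that bandwidths of task edges falling between the same pair of cores add up, that edges between tasks on the same core need no network traffic, that cores sit on a regular mesh so an XY route has Manhattan-distance hops, and that the two minima of Equation (4) are supplied as normalizing constants.

```python
from collections import defaultdict


def build_core_graph(task_edges, task_to_core):
    """Merge task-graph edges into core-graph edges, summing bandwidths."""
    core_edges = defaultdict(float)
    for (src, dst), bandwidth in task_edges.items():
        core_src, core_dst = task_to_core[src], task_to_core[dst]
        if core_src != core_dst:  # same-core traffic never enters the network
            core_edges[(core_src, core_dst)] += bandwidth
    return dict(core_edges)


def xy_hops(a, b):
    """Hop count of an XY route between mesh coordinates a and b."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(core_edges, placement):
    """Sum of hops times bandwidth over all core-graph edges."""
    return sum(
        xy_hops(placement[u], placement[v]) * bandwidth
        for (u, v), bandwidth in core_edges.items()
    )


def fitness(comm_cost, area, min_comm_cost, min_area, w):
    """Weighted, normalized cost as in Equation (4); lower is better."""
    return (1 - w) * comm_cost / min_comm_cost + w * area / min_area


# Core selection taken from the running example; bandwidths are made up.
task_to_core = {"t1": "c1", "t2": "c4", "t3": "c2", "t4": "c4", "t5": "c5"}
task_edges = {
    ("t1", "t2"): 100, ("t2", "t4"): 50, ("t3", "t4"): 30,
    ("t1", "t4"): 20, ("t4", "t5"): 10,
}
placement = {"c1": (0, 0), "c4": (0, 1), "c2": (1, 1), "c5": (1, 0)}

core_edges = build_core_graph(task_edges, task_to_core)
cost = communication_cost(core_edges, placement)
print(core_edges)
print(cost)  # 170.0
```

Here t2 and t4 share core c4, so their 50-unit edge disappears from the core graph, while the two edges from c1 to c4 merge into one edge of bandwidth 120.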

B. Local and Global bests

Every particle has a local best (pbest), which is one set of selections from the available core set, giving minimum communication cost and area combination, among all sets that the particle has seen so far in the evolution process. This local best partially guides the evolution of the particle. For a particular generation, the global best (gbest) is the particle resulting in the minimum fitness for that generation. It also controls the evolution of particles. The local best of each particle and the global best are modified if the corresponding values in the current iteration are less than the values till the previous iteration.

C. Evolution of generation

Evolution of the particles is done over generations to create new particles which are expected to give results closer to the optimum. To start with, the initial population is created randomly and the fitness of individual particles is evaluated. The local best (pbest) of each particle is initialized to be the same as that of the initial particle. The global best of the generation is initialized with the particle giving the least communication cost and area combination (smallest fitness value) in the generation. The second generation results through random exchange of probabilities to select cores within the particles. The local best and the global best values are updated if they give better fitness values. Further generations are created through a series of swap operations. The local best of each particle and the global best are modified if the corresponding values in the current generation are less than the values in the previous generation. The local best and the global best evolution thus center around the basic operator, swap, explained next.

C.1 Swap Operator

Each particle is a sequence of n real numbers to select cores. To effect a change in the particle, the swap operator is used. The operator takes two indices (say, i and j) of particle P as input and creates a new particle P1. The particle P1 is the same as P except that the entries at positions i and j of P are interchanged in it. Let the particle P be

s1 s3 s5 s7 s4 s2 s6 s8

where sx represents the probability of selecting a core from a set of cores available for a task. The indices of s1, s3, s5 are 0, 1 and 2 respectively. The swap operator SO(3, 5) swaps positions 3 and 5 in P to generate a new particle as shown below.

s1 s3 s5 s2 s4 s7 s6 s8
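A minimal Python sketch of the swap operator follows, using the 0-based positions of this subsection. The function names are hypothetical; apply_swap_sequence simply applies several operators one after another.

```python
def swap_operator(particle, i, j):
    """Return a new particle with the entries at 0-based positions i and j interchanged."""
    new_particle = list(particle)  # the input particle is left untouched
    new_particle[i], new_particle[j] = new_particle[j], new_particle[i]
    return new_particle


def apply_swap_sequence(particle, operators):
    """Apply a list of (i, j) swap operators in order."""
    for i, j in operators:
        particle = swap_operator(particle, i, j)
    return particle


P = ["s1", "s3", "s5", "s7", "s4", "s2", "s6", "s8"]
print(swap_operator(P, 3, 5))
# ['s1', 's3', 's5', 's2', 's4', 's7', 's6', 's8']
```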

C.2 Swap Sequence

A swap sequence is a sequence of swap operators. For example, a swap sequence SS = {SO(1, 7), SO(3, 4)} creates particle Pnew working on particle P in two steps as follows.

Particle P: s3 s6 s8 s4 s1 s5 s2 s7
SO(1, 7) on particle P creates intermediate particle Pint: s3 s7 s8 s4 s1 s5 s2 s6
SO(3, 4) on Pint results in new particle Pnew: s3 s7 s8 s1 s4 s5 s2 s6

As discussed earlier in this section, in PSO, each particle tries to move towards the local best and the global best with some inertia of movement. After all particles have undergone the evolution, a new generation gets created. The best fitness of this generation gives the global best for the population.

IV. PREDICTIVE HEURISTIC FOR MAPPING

In this section, the algorithm designed for obtaining an efficient mapping of the core graph (obtained from the application task graph) has been presented. The basic idea has been taken from [24], and modified for irregular cores, combined with core selection. This algorithm is executed for every particle in the PSO procedure to calculate its fitness. 2D-mesh topology has been considered for mapping by considering non-uniform core sizes and a fixed link length in the mesh topology. While mapping non-uniform sized cores onto the topology graph, all the routers in the topology may not be available for mapping due to the link length constraint. This situation is explained by taking an example shown in Fig. 3.

[Fig. 3: Mapping of non-uniform sized cores onto 2D-mesh topology. Router labels R1 to R9, core labels C1 and C2; legend: Core, Router]

Suppose that the core ci attached to router Rj is of dimension xi × yi. If xi is greater than the link length of the mesh, no core can be attached to the left neighbor of Rj. Similarly, if yi is greater than the link length, no core can be attached to the bottom neighbor of Rj. For example, Fig. 3 shows a part of the mesh topology graph. If core c1 (whose dimension exceeds the link length in both the x and y directions) is mapped onto router R3, the routers adjacent to R3 (R2, R5 and R6) are not available for mapping the remaining cores, since c1 covers these routers as well. In a similar manner, mapping core c2 (whose dimension exceeds the link length in the y direction) makes router R4 unavailable for mapping. These routers route packets but do not have a core connected to them. For the sake of implementation, these routers can be placed in a different layer in a 3-D environment. In 2-D, these routers are removed and links are established between the routers present in the floorplan, taking care of the link length constraints. Pipelining can be used, with repeaters introduced in between.

In the NoC architecture with non-uniform sized cores, the area consumed by the mapped on-chip network becomes higher due to the irregularity of the network. At the time of mapping, core dimensions need to be taken into account to satisfy the link length constraints. As a result, the number of routers required in the topology graph may be larger than the number of cores in the core graph obtained from the task graph, in order to accommodate cores whose dimensions exceed the link length. To satisfy this condition, before the mapping phase is invoked, the network size is set equal to the number of cores in the obtained core graph plus the additional routers required for the cores that are larger than the link length in width, height or both. At the end of the mapping phase, any row or column of the topology graph with no cores connected to it is discarded.

The algorithm works as follows. First, the edges of the core graph are sorted in descending order of communication requirement. Let e = (c1, c2) be the edge with the maximum bandwidth requirement. The mapping process starts with this edge. For core c1, the total bandwidth requirement is computed by summing the labels of all edges from c1 to its neighbors. The same is done for c2. Let the value computed for c1 be higher than that for c2. The mapping process generates solutions with c1 mapped to each router position of the topology. For a particular placement of c1, the remaining cores are mapped judiciously to obtain a good solution. The mapping resulting in the minimum cost is accepted as the final solution.

At any point during the execution of the mapping algorithm, let C' be the set of already mapped cores and U' the corresponding set of routers (the routers with already assigned cores). The algorithm determines the core ci, neighboring any core in C', with the highest bandwidth requirement. Routers at one hop distance from U' are considered as candidate positions for ci. For each such router, the cost of mapping is evaluated by considering the sub-graph consisting of the cores in the set C' ∪ {ci}. If a single position yields the minimum cost, it is accepted as the mapping of ci. In general, let M = {m1, m2, m3, ..., mk} be the set of k candidate positions for ci that result in equal mapping cost for the sub-graph with vertex set C' ∪ {ci}. To distinguish between these k positions, m1 is first selected temporarily as the mapping of ci.
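One placement step of the heuristic described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the cost of a candidate position is taken to be bandwidth × Manhattan hop count to the already mapped neighbors (Eq. (4) of Section III may contain further terms), and an oversized core is assumed to cover at most one extra router per axis. The sketch does not check that the routers a candidate core would cover are themselves free.

```python
from typing import Dict, List, Set, Tuple

Pos = Tuple[int, int]                  # (row, col); rows grow downward
Edges = Dict[Tuple[str, str], float]   # undirected core-graph edge -> bandwidth


def total_bandwidth(core: str, edges: Edges) -> float:
    """Sum of the labels of all edges incident to `core`."""
    return sum(bw for pair, bw in edges.items() if core in pair)


def seed_core(edges: Edges) -> str:
    """Endpoint of the heaviest edge that has the larger total bandwidth."""
    c1, c2 = max(edges, key=edges.get)
    return c1 if total_bandwidth(c1, edges) > total_bandwidth(c2, edges) else c2


def next_core(edges: Edges, placed: Dict[str, Pos]) -> str:
    """Unmapped end of the heaviest edge that has exactly one end mapped."""
    frontier = {p: bw for p, bw in edges.items()
                if (p[0] in placed) != (p[1] in placed)}
    a, b = max(frontier, key=frontier.get)
    return b if a in placed else a


def blocked_by(pos: Pos, width: float, height: float, link: float) -> Set[Pos]:
    """Routers covered by a core at `pos` (assumption: one extra router per axis)."""
    r, c = pos
    out: Set[Pos] = set()
    if width > link:
        out.add((r, c - 1))            # left neighbor
    if height > link:
        out.add((r + 1, c))            # bottom neighbor
    if width > link and height > link:
        out.add((r + 1, c - 1))        # diagonal router, like R5 in Fig. 3
    return out


def hops(p: Pos, q: Pos) -> int:
    """Manhattan distance between two routers of the mesh."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def candidates(placed: Dict[str, Pos], unavailable: Set[Pos],
               rows: int, cols: int) -> List[Pos]:
    """Available routers at minimum distance from the mapped routers."""
    used = set(placed.values()) | unavailable
    free = [(r, c) for r in range(rows) for c in range(cols) if (r, c) not in used]
    dist = {p: min(hops(p, q) for q in placed.values()) for p in free}
    nearest = min(dist.values())
    return [p for p in free if dist[p] == nearest]


def min_cost_positions(core: str, placed: Dict[str, Pos], unavailable: Set[Pos],
                       edges: Edges, rows: int, cols: int) -> List[Pos]:
    """Candidate positions with minimum added cost (ties are all returned)."""
    def added_cost(p: Pos) -> float:
        total = 0.0
        for (a, b), bw in edges.items():
            other = b if a == core else a if b == core else None
            if other in placed:
                total += bw * hops(p, placed[other])
        return total

    cand = candidates(placed, unavailable, rows, cols)
    costs = {p: added_cost(p) for p in cand}
    best = min(costs.values())
    return [p for p in cand if costs[p] == best]
```

When `min_cost_positions` returns more than one position, these play the role of the set M; the prediction step then completes the mapping once for each of them and keeps the position with the lowest final cost.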

Mapping of the remaining cores is then determined in a similar fashion. That is, for the next core to be mapped, the router positions neighboring the topology sub-graph U' ∪ {m1} are evaluated. However, in this case contending positions with the minimum cost value are not distinguished. Instead, the first such position is taken and mapping continues with the remaining cores. When all cores have been mapped, the cost of the final mapping solution is taken as the predicted cost of selecting router position m1 for ci. The other k-1 positions m2, m3, ..., mk are evaluated similarly, and the core ci is mapped onto the router position with the minimum predicted cost. The process continues by selecting the next core. Thus, for each of the possible mappings of the first core, the algorithm generates the mapping for all other cores of the core graph. The mapping resulting in the minimum cost is the final mapping solution. The algorithm is given in the following.

Mapping algorithm:
Input: Core graph G, Topology graph P
Output: Mapping of G onto P
Begin
    Sort edges of G on descending order of communication requirement
    Best_Cost = ∞
    Best_Mapping = ∅
    For each router position r of P do
        Mark all nodes of G as unmapped
        Mapping = Find_Mapping(G, P, r);
        Cost = Compute_Mapping_Cost(Mapping, G);
        If (Cost < Best_Cost) then
            Best_Cost = Cost;
            Best_Mapping = Mapping;
        End if
    End For
End

Procedure Find_Mapping
Input: Core graph G, Topology graph P, Start_Posn
Output: Mapping of all cores of G onto P with the first core mapped to Start_Posn
Begin
    Let (c1, c2) ∈ G be the first edge in sorted order
    Node ← c1 if it has higher communication than c2, else Node ← c2
    Mapping[Start_Posn] = Node
    While there exist unmapped nodes in G do
        Let (ci, cj) ∈ G be the highest communicating edge with one end mapped
        Let c be the unmapped core of the edge
        Positions ← {Available routers in P with minimum distance from mapped routers}
        Evaluate_Positions(Positions); // Using Eq. (4), Section III
        Min_Positions ← Subset of Positions with minimum cost
        Best_Position ← Predict_Best(Min_Positions, G, P, c);
        Mapping[Best_Position] ← c;
        Mark c as mapped
        Mark routers in P covered by area of c as unavailable
    End While
    Return Mapping
End

Procedure Predict_Best
Input: Core graph G, Topology graph P, Node to be mapped Node, Set of positions Min_Positions
Output: Predicted best position of Node amongst Min_Positions
Begin
    Min_Cost = ∞
    Newly_Marked_Nodes = ∅
    For each position p in Min_Positions do
        Mapping[Node] = p;
        Newly_Marked_Nodes ← Newly_Marked_Nodes ∪ {Node};
        Mark Node mapped
        While there exist unmapped nodes in G do
            Let (ci, cj) ∈ G be the highest communicating edge with one end mapped
            Let c be the unmapped core of the edge
            For each r ∈ P at minimum distance from mapped routers do
                Map c to r and evaluate cost using Eq. (4), Section III
            End For
            Map c to router requiring minimum cost
            Mark c as mapped
        End While
        Cost ← Total communication cost for this mapping
        If Min_Cost > Cost then
            Min_Cost ← Cost
            Min_Posn ← p
        End If
        Unmark all nodes in Newly_Marked_Nodes
        Newly_Marked_Nodes = ∅;
    End For
    Return Min_Posn
End

V. EXPERIMENTAL RESULTS

In this section, the results of the experimentation on the integrated core selection and mapping problem are presented. A number of task graphs with different numbers of tasks have been generated using the TGFF [43] tool. The graphs have been named G1 through G5. A random core library has been generated with 100 cores. Each core is capable of performing up to 30 different types of functions. While generating the task graphs using TGFF, the nodes of the graphs belong to this set of types. The parameters used in generating the task graphs are given in Table 1. The details of the core library and task graphs can be downloaded from the link [49]. G6 is a real benchmark formed by combining the real benchmarks named 263ENC_MP3DEC, MP3ENC_MP3DEC and 263DEC_MP3DEC [24], and is shown in Fig. 4. The benchmarks are encoders and decoders for multimedia applications.

Fig. 4: Task Graph G6

| Graph No | No. of Tasks | TGFF task type count | TGFF bandwidth range |
|---|---|---|---|
| G1 | 8 | 30 | 650-750 |
| G2 | 16 | 30 | 650-750 |
| G3 | 32 | 30 | 700-800 |
| G4 | 64 | 30 | 700-800 |
| G5 | 128 | 30 | 450-550 |
| G6 | 29 | - | - |

Table 1: TGFF parameters used to generate graphs

A. Static Performance Analysis

To check the impact of integrated core selection and mapping, the core selection and mapping phases have also been carried out separately. Since the integrated approach can see up to the final mapping, it is expected to come up with better solutions. The comparison between the integrated approach (with weight w=0, depicting a fully communication-aware mapping) and the step-by-step approach, in terms of static communication cost and overall NoC area, is shown in Table 2. In the proposed approach, the area is calculated from the selected core dimensions and the mapping, using the formula described in Section III. NMAP and DPSO do not take core dimensions into account while mapping, so in their case the area is calculated after adjusting the resultant mapping according to the selected core dimensions. In the case of POSEIDON, the area is calculated using the formula described in Section III. To check the quality of the integrated approach, a comparison has been made with Chaos DPSO [39], where core selection and mapping are done separately assuming uniform core sizes. As irregular core sizes are not taken into consideration while mapping, the area is calculated after adjusting the resultant mapping according to the selected cores' dimensions. To establish the goodness of the proposed mapping algorithm, the mapping solutions produced by NMAP [27] and DPSO [24] have been considered as well. Comparisons have also been made with an ASNoC approach, POSEIDON [48], to check the efficiency of the proposed approach. In the NMAP case, PSO based core selection and NMAP based mapping have been used, while in the DPSO case, PSO based core selection and DPSO based mapping have been used. In the POSEIDON case, PSO based core selection and POSEIDON based synthesis have been used. For a fair comparison, the link length in the synthesis process has been taken to be the same as the link length of the mesh topology used in the proposed approach. In the proposed technique, PSO based core selection and the heuristic mapping strategy reported in this paper have been used.

As can be observed from Table 2, in both the integrated and step-by-step approaches, the mapping algorithm developed in this paper produces much better results than NMAP and DPSO. The proposed approach shows improvements in communication cost and area compared to Chaos DPSO in all the cases, since both phases have been integrated in the proposed approach. Even in the step-by-step approach, the results of the proposed technique are better than those of Chaos DPSO because of the efficient heuristic developed in this paper. The proposed approach shows better area results as well, since in the step-by-step approach the core selection has been done by optimizing area, which is not the case in Chaos DPSO. The proposed approach also produces better results than the POSEIDON approach in terms of communication cost. In some cases, the POSEIDON approach could not find a path (shown as NO PATH in Table 2) for the communication between the cores. This is due to the link length constraint given to the approach, as the link length has been taken to be the same as that of the mesh topology. On the area front, the proposed approach does not fare well compared to POSEIDON in either the integrated or the step-by-step approach. This is because the POSEIDON approach produces an ASNoC topology, resulting in a more compact floorplan than the mesh based one.
As the integrated approach attempts to minimize the communication cost, a good reduction in communication cost could be achieved compared to the step-by-step approach. It may be noted that even though the task graph G5 is larger than G4, the communication cost of G5 is lower than that of G4. This is because the average bandwidth in G5 is 457.49, whereas in G4 it is 793.23. The average hop distance for cores in G5 is 4.3, while for G4 it is 2.7. The effect of this router hop distance can be observed in the latency results. On the area front, the integrated approach does not fare well compared to the step-by-step one. This happens because, in the step-by-step approach, core selection is guided completely by area minimization. At this stage, no mapping has been performed, and thus the communication cost cannot be taken into consideration.

| Metric | Approach | Method | G1 (8 tasks) | G2 (16 tasks) | G3 (32 tasks) | G4 (64 tasks) | G5 (128 tasks) | G6 (29 tasks) |
|---|---|---|---|---|---|---|---|---|
| Communication cost (no. of hops * bandwidth) | Integrated | PSO+NMAP [27] | 3465.92 | 12950.7 | 64653.0 | 85374.7 | 44466.3 | 83.23 |
| | Integrated | PSO+DPSO [24] | 2762.81 | 8267.2 | 60623.2 | 81231.8 | 29823.8 | 75.6 |
| | Integrated | PSO+POSEIDON [48] | 2708.33 | 7934.27 | 60501.2 | NO PATH | NO PATH | 74.91 |
| | Integrated | PSO+Our Approach | 2652.16 | 7520.07 | 57513.7 | 79973.9 | 27381.1 | 72.81 |
| | Step-by-step | PSO and NMAP [27] | 4743.55 | 22089.6 | 102103.8 | 188134.6 | 59111.4 | 127.34 |
| | Step-by-step | PSO and DPSO [24] | 4743.55 | 14673.4 | 98823.7 | 156269.8 | 57834.8 | 124.6 |
| | Step-by-step | PSO and POSEIDON [48] | 4743.55 | 14064.2 | NO PATH | NO PATH | NO PATH | NO PATH |
| | Step-by-step | PSO and Our Approach | 4743.55 | 12390.5 | 96499.5 | 139814.5 | 50799.81 | 119.85 |
| | - | Chaos DPSO [39] | 4743.5 | 12399.2 | 96512.5 | 140013.7 | 51211.64 | 121.71 |
| Area (X-dimension * Y-dimension) | Integrated | PSO+NMAP [27] | 39.5 | 130.0 | 234.0 | 448.7 | 582.7 | 120.3 |
| | Integrated | PSO+DPSO [24] | 38.5 | 110.2 | 213.7 | 432.82 | 571.2 | 108.2 |
| | Integrated | PSO+POSEIDON [48] | 34.1 | 96.6 | 201.1 | 417.91 | 560.1 | 105.3 |
| | Integrated | PSO+Our Approach | 36.0 | 99.75 | 201.5 | 420.25 | 563.5 | 110.6 |
| | Step-by-step | PSO and NMAP [27] | 20.0 | 55.0 | 127.5 | 310.0 | 560.9 | 89.7 |
| | Step-by-step | PSO and DPSO [24] | 19.0 | 42.7 | 124.0 | 291.2 | 523.6 | 55.8 |
| | Step-by-step | PSO and POSEIDON [48] | 13.0 | 34.4 | 120.9 | 276.4 | 500.9 | 49.3 |
| | Step-by-step | PSO and Our Approach | 14.0 | 36.0 | 123.5 | 282.4 | 510.3 | 52.3 |
| | - | Chaos DPSO [39] | 41.0 | 132.7 | 242.0 | 471.2 | 601.3 | 125.5 |

Table 2: Communication cost and area comparison between integrated and step-by-step approach

Area minimization attempts to achieve the solution with minimum network dimension. Thus, area values are lower, but communication cost is higher than in the integrated approach. This is explained further in the context of Tables 4 and 5 (presented later).

| Graph No | No. of Tasks | PSO+NMAP [27] | PSO+DPSO [24] | PSO+POSEIDON [48] | Chaos DPSO [39] | PSO+Our Approach |
|---|---|---|---|---|---|---|
| G1 | 8 | 395.11 | 402.81 | 525.2 | 511.23 | 241.53 |
| G2 | 16 | 5397.10 | 5923.63 | 7123.6 | 6220.3 | 2970.15 |
| G3 | 32 | 90362.37 | 111236.27 | 144278.03 | 124563.2 | 19765.71 |
| G4 | 64 | 45778.47 | 79823.72 | 98938.8 | 81123.12 | 23860.21 |
| G5 | 128 | 219989.79 | 327286.69 | 487234.2 | 388447.6 | 105194.75 |
| G6 | 29 | 7323.81 | 7926.27 | 10215.1 | 8474.71 | 4721.78 |

Table 3: CPU time (in sec) comparison between different techniques

Table 3 shows a comparison of the CPU times of the different approaches for the generated task graphs and the real benchmark G6. The results have been obtained on a machine with an Intel dual core processor operating at 2 GHz with 3 GB of main memory. It can be observed that PSO takes a considerable amount of time when run with 3000 particles. The PSO evolution is terminated when there is no improvement in the fitness of the best particle for the last 30 generations, or when the PSO has already run for 100 generations. PSO provides good scope for parallelization, as each particle can evolve independently. This possible parallelization has not been explored in the current work. A parallel implementation may reduce this CPU time overhead significantly.
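The termination rule stated above can be sketched as a small driver loop. This is only an illustration, assuming a minimization problem: `run_pso` and the `step` callable (which advances the swarm by one generation and returns the fitness of the best particle) are hypothetical names, not part of the authors' tool.

```python
from typing import Callable, Tuple


def run_pso(step: Callable[[], float],
            max_generations: int = 100,
            patience: int = 30) -> Tuple[float, int]:
    """Run until `patience` generations pass without improvement or the cap is hit."""
    best = float("inf")
    stale = 0            # generations since the best fitness last improved
    generations = 0
    while generations < max_generations and stale < patience:
        fitness = step()                 # one generation of the swarm (hypothetical)
        generations += 1
        if fitness < best:
            best = fitness
            stale = 0
        else:
            stale += 1
    return best, generations


# Example: fitness improves for three generations and then stagnates.
history = iter([10.0, 9.0, 8.0] + [8.0] * 200)
print(run_pso(lambda: next(history)))
```

Because every particle's fitness is obtained by an independent run of the mapping heuristic, the body of `step` is the natural place to distribute work across processes if parallelization is explored.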

B. Dynamic Performance Analysis

Even though the static analysis gives a rough idea about the performance of the approaches in NoC, there is a need to test the approaches in a dynamic environment, as real systems may face problems of network congestion affecting throughput and latency. Next, the performance of the proposed integrated approach in static and dynamic environments is presented with different weights (w) assigned to the area optimization. The weight w=0 puts complete emphasis on performance optimization. The results are shown in Table 4 for w = 0, 0.2, 0.5, 0.8, 1.0.

| Graph No | No. of Tasks | Weight of area (w) | Network size | Communication Cost | Area | Latency | Throughput | Average packet energy (*10^-3 Joule) | Average link energy (*10^-12 Joule) |
|---|---|---|---|---|---|---|---|---|---|
| G1 | 8 | 0 | 2x3 | 2652.16 | 36.0 | 74.0 | 0.0196 | 0.5501 | 52.53 |
| G1 | 8 | 0.2 | 2x3 | 2711.50 | 24.5 | 74.2 | 0.0248 | 0.4329 | 49.79 |
| G1 | 8 | 0.5 | 2x3 | 2789.50 | 22.75 | 74.34 | 0.0212 | 0.5066 | 49.26 |
| G1 | 8 | 0.8 | 2x2 | 4103.50 | 19.25 | 74.51 | 0.0342 | 0.2629 | 58.44 |
| G1 | 8 | 1 | 2x2 | 6136.04 | 16.0 | 79.23 | 0.0395 | 0.2286 | 50.70 |
| G2 | 16 | 0 | 4x4 | 7520.07 | 99.75 | 74.83 | 0.0250 | 0.5176 | 44.74 |
| G2 | 16 | 0.2 | 4x4 | 7597.30 | 89.25 | 74.56 | 0.0251 | 0.618 | 40.96 |
| G2 | 16 | 0.5 | 3x4 | 7651.56 | 68.25 | 74.6 | 0.0282 | 0.3064 | 41.88 |
| G2 | 16 | 0.8 | 3x4 | 12921.60 | 51.0 | 77.01 | 0.0320 | 0.3832 | 46.99 |
| G2 | 16 | 1 | 3x3 | 12959.30 | 45.5 | 79.22 | 0.0436 | 0.2330 | 53.22 |
| G3 | 32 | 0 | 5x6 | 57513.70 | 201.5 | 83.3 | 0.0776 | 0.1870 | 64.19 |
| G3 | 32 | 0.2 | 5x4 | 59317.30 | 143.0 | 85.64 | 0.084 | 0.1028 | 60.07 |
| G3 | 32 | 0.5 | 5x5 | 61140.30 | 131.25 | 85.81 | 0.0881 | 0.1543 | 54.75 |
| G3 | 32 | 0.8 | 5x5 | 66569.30 | 126.5 | 85.91 | 0.0997 | 0.1368 | 63.38 |
| G3 | 32 | 1 | 4x5 | 79317.10 | 114.0 | 102.5 | 0.1031 | 0.1056 | 64.42 |
| G4 | 64 | 0 | 8x8 | 79973.90 | 420.25 | 81.3 | 0.0480 | 0.2925 | 55.65 |
| G4 | 64 | 0.2 | 7x8 | 87822.70 | 342.0 | 82.33 | 0.0493 | 0.2637 | 61.97 |
| G4 | 64 | 0.5 | 7x7 | 91599.20 | 288.0 | 85.13 | 0.0546 | 0.2262 | 63.43 |
| G4 | 64 | 0.8 | 7x7 | 107681.0 | 272.0 | 86.33 | 0.0547 | 0.2325 | 71.31 |
| G4 | 64 | 1 | 6x7 | 123372.0 | 243.0 | 90.13 | 0.0548 | 0.2067 | 91.28 |
| G5 | 128 | 0 | 9x10 | 27381.1 | 563.5 | 91.9 | 0.087 | 0.165 | 87.42 |
| G5 | 128 | 0.2 | 9x9 | 28688.6 | 506 | 98.16 | 0.087 | 0.153 | 83.06 |
| G5 | 128 | 0.5 | 8x9 | 32026.3 | 460 | 100.39 | 0.092 | 0.135 | 87.92 |
| G5 | 128 | 0.8 | 8x9 | 35163.0 | 416.25 | 104.54 | 0.095 | 0.137 | 100.33 |
| G5 | 128 | 1 | 8x9 | 39508.3 | 409.5 | 107.47 | 0.096 | 0.134 | 99.01 |
| G6 | 29 | 0 | 5x5 | 72.81 | 110.6 | 80.2 | 0.052 | 0.321 | 56.67 |
| G6 | 29 | 0.2 | 5x5 | 77.63 | 101.2 | 81.0 | 0.059 | 0.282 | 53.21 |
| G6 | 29 | 0.5 | 5x4 | 80.21 | 96.2 | 82.1 | 0.060 | 0.301 | 52.06 |
| G6 | 29 | 0.8 | 5x4 | 82.89 | 90.3 | 88.3 | 0.068 | 0.242 | 60.23 |
| G6 | 29 | 1 | 4x4 | 91.6 | 88.6 | 90.0 | 0.072 | 0.231 | 58.17 |

Table 4: Performance of integrated approach for different values of w

The corresponding network sizes have also been shown. As the value of w increases, the tool attempts to realize the application with a mesh of lower dimensions. This increases the communication cost and latency values. However, the energy requirement shows a different trend. To obtain the energy consumed by the router network, the tool Orion [44] has been used. Orion provides fast and accurate NoC power and area models for early-stage design-space exploration at various process technologies, such as 90nm, 65nm, 45nm and 32nm. Since the routers consume a good amount of energy, decreasing the network dimension often plays a useful role in reducing the average packet energy. Hence, it is advisable to keep a weight on the area optimization as well. For the set of examples we have worked with, a w value between 0.5 and 0.8 seems to work well. To get a better understanding of the performance, the NoCs (output from the integrated core selection and mapping approach and from the step-by-step core selection and mapping approach) have been simulated using a cycle-accurate SystemC based simulator [45, 46]. The performance of the step-by-step approach in static and dynamic environments has also been evaluated with different weights (w) assigned to the area optimization. The results are shown in Table 5 for w = 0, 0.2, 0.5, 0.8, 1.0.

| Graph No | No. of Tasks | Weight of area (w) | Network size | Communication Cost | Area | Latency | Throughput | Average packet energy (*10^-3 Joule) | Average link energy (*10^-12 Joule) |
|---|---|---|---|---|---|---|---|---|---|
| G1 | 8 | 0 | 2x3 | 4743.55 | 14.0 | 77.10 | 0.012 | 0.6433 | 59.15 |
| G1 | 8 | 0.2 | 2x3 | 4703.38 | 8.84 | 77.22 | 0.019 | 0.5010 | 55.51 |
| G1 | 8 | 0.5 | 2x3 | 4808.65 | 7.91 | 77.28 | 0.02 | 0.5789 | 54.43 |
| G1 | 8 | 0.8 | 2x2 | 6653.96 | 6.42 | 77.42 | 0.029 | 0.2977 | 64.21 |
| G1 | 8 | 1 | 2x2 | 8944.66 | 5.08 | 82.20 | 0.033 | 0.2548 | 55.11 |
| G2 | 16 | 0 | 4x4 | 12390.5 | 36.0 | 77.92 | 0.022 | 0.6206 | 50.72 |
| G2 | 16 | 0.2 | 4x4 | 12475.04 | 31.95 | 77.59 | 0.025 | 0.7313 | 45.76 |
| G2 | 16 | 0.5 | 3x4 | 12502.55 | 23.81 | 77.56 | 0.029 | 0.3601 | 46.43 |
| G2 | 16 | 0.8 | 3x3 | 20901.97 | 16.72 | 80.04 | 0.032 | 0.4441 | 51.98 |
| G2 | 16 | 1 | 3x3 | 20472.83 | 14.42 | 82.26 | 0.044 | 0.2656 | 58.22 |
| G3 | 32 | 0 | 5x5 | 96499.5 | 123.51 | 91.80 | 0.08 | 0.2314 | 74.21 |
| G3 | 32 | 0.2 | 5x4 | 98944.62 | 85.51 | 94.12 | 0.092 | 0.1259 | 69.04 |
| G3 | 32 | 0.5 | 5x4 | 100477.1 | 76.65 | 94.11 | 0.098 | 0.1854 | 62.5 |
| G3 | 32 | 0.8 | 5x4 | 103208.2 | 72.99 | 94.07 | 0.102 | 0.1615 | 71.61 |
| G3 | 32 | 1 | 5x4 | 112987.3 | 64.86 | 111.66 | 0.108 | 0.1239 | 72.38 |
| G4 | 64 | 0 | 8x8 | 139814.5 | 282.40 | 91.76 | 0.051 | 0.3716 | 64.63 |
| G4 | 64 | 0.2 | 7x8 | 147106.7 | 228.45 | 92.19 | 0.058 | 0.3337 | 71.39 |
| G4 | 64 | 0.5 | 7x7 | 150656.6 | 188.35 | 94.58 | 0.059 | 0.2821 | 72.49 |
| G4 | 64 | 0.8 | 6x7 | 171466.6 | 174.08 | 95.60 | 0.062 | 0.2842 | 80.85 |
| G4 | 64 | 1 | 6x7 | 188067.1 | 151.38 | 98.93 | 0.064 | 0.2472 | 102.79 |
| G5 | 128 | 0 | 9x9 | 50799.81 | 510.30 | 105.99 | 0.084 | 0.2162 | 102.84 |
| G5 | 128 | 0.2 | 9x9 | 51598.2 | 453.83 | 112.43 | 0.085 | 0.1971 | 97.03 |
| G5 | 128 | 0.5 | 8x9 | 55794.95 | 409.4 | 113.17 | 0.092 | 0.1706 | 101.29 |
| G5 | 128 | 0.8 | 8x9 | 58410.3 | 366.71 | 116.15 | 0.094 | 0.1685 | 114.27 |
| G5 | 128 | 1 | 8x9 | 60410.24 | 356.26 | 118.34 | 0.096 | 0.1616 | 111.74 |
| G6 | 29 | 0 | 5x5 | 119.85 | 52.313 | 85.35 | 0.058 | 0.3812 | 64.98 |
| G6 | 29 | 0.2 | 5x4 | 125.67 | 46.349 | 85.73 | 0.063 | 0.3317 | 60.46 |
| G6 | 29 | 0.5 | 5x4 | 127.72 | 43.29 | 86.51 | 0.069 | 0.3516 | 58.89 |
| G6 | 29 | 0.8 | 4x4 | 128.71 | 39.09 | 92.26 | 0.07 | 0.2811 | 67.82 |
| G6 | 29 | 1 | 4x4 | 133.91 | 36.94 | 93.65 | 0.074 | 0.2649 | 65.35 |

Table 5: Performance of step-by-step approach for different values of w
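The trade-off between the two approaches can be read off by comparing Tables 4 and 5 row by row. The short sketch below does this for G1 only, with the communication cost and area values copied from the two tables; the helper name `percent_change` is an illustrative choice, and the script is not part of the original experiments.

```python
# Communication cost and area of G1 for each weight, copied from Tables 4 and 5.
WEIGHTS = [0, 0.2, 0.5, 0.8, 1.0]
INTEGRATED = {"cost": [2652.16, 2711.50, 2789.50, 4103.50, 6136.04],
              "area": [36.0, 24.5, 22.75, 19.25, 16.0]}
STEP_BY_STEP = {"cost": [4743.55, 4703.38, 4808.65, 6653.96, 8944.66],
                "area": [14.0, 8.84, 7.91, 6.42, 5.08]}


def percent_change(integrated, step_by_step):
    """Signed change of the integrated value relative to the step-by-step value."""
    return [100.0 * (i - s) / s for i, s in zip(integrated, step_by_step)]


cost_change = percent_change(INTEGRATED["cost"], STEP_BY_STEP["cost"])
area_change = percent_change(INTEGRATED["area"], STEP_BY_STEP["area"])

for w, dc, da in zip(WEIGHTS, cost_change, area_change):
    print(f"w={w}: communication cost {dc:+.1f}%, area {da:+.1f}%")
```

A negative value is a reduction achieved by the integrated approach and a positive value is a penalty. For G1 the communication cost is lower at every weight, while the area is higher at every weight.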

Table 6 shows a comparison of average network latency and throughput for the integrated and step-by-step core selection and mapping approaches. Throughput is almost the same in both cases. An improvement in average latency can be observed in the integrated approach over the step-by-step one. This is because the proposed integrated approach decreases the number of hops required for the communication between the cores selected for the tasks, which in turn reduces the latency. This shows the efficiency of the proposed approach in a dynamic environment as well. The dynamic results of the other approaches are also shown in Table 6. As can be observed from Table 6, in both the integrated and step-by-step approaches, the mapping algorithm developed in this paper produces much better results than NMAP and DPSO, except in the case of graph G1. The latency of G1 is slightly higher in the proposed case compared to the other approaches even though the communication cost is the same (refer to Table 2). This slight deviation is due to the traffic pattern generated for the simulation, as G1 consists of very few tasks. The proposed approach produces better results than Chaos DPSO in all the cases in terms of throughput and latency. Since the integrated approach aims to reduce the number of hops in order to minimize the communication cost, the effect is also visible in the dynamic environment in terms of latency. The proposed approach also produces better results than the POSEIDON approach. This shows the efficiency of the proposed approach.

| Metric | Approach | Method | G1 (8 tasks) | G2 (16 tasks) | G3 (32 tasks) | G4 (64 tasks) | G5 (128 tasks) | G6 (29 tasks) |
|---|---|---|---|---|---|---|---|---|
| Average network latency (in router cycles) | Integrated | PSO+NMAP [27] | 77.03 | 77.56 | 90.35 | 87.13 | 92.18 | 87.27 |
| | Integrated | PSO+DPSO [24] | 75.6 | 76.6 | 86.62 | 85.2 | 92.0 | 82.3 |
| | Integrated | PSO+POSEIDON [48] | 74.2 | 75.7 | 84.12 | - | - | 82.1 |
| | Integrated | PSO+Our Approach | 74.0 | 74.83 | 83.3 | 81.3 | 91.9 | 80.2 |
| | Step-by-step | PSO and NMAP [27] | 76.71 | 80.11 | 105.96 | 87.83 | 106.66 | 101.2 |
| | Step-by-step | PSO and DPSO [24] | 77.0 | 78.21 | 98.29 | 86.6 | 103.21 | 94.49 |
| | Step-by-step | PSO and POSEIDON [48] | 77.1 | 78.01 | - | - | - | - |
| | Step-by-step | PSO and Our Approach | 77.11 | 77.93 | 96.04 | 83.08 | 99.82 | 92.23 |
| | - | Chaos DPSO [39] | 77.1 | 78.0 | 96.91 | 85.62 | 100.91 | 92.89 |
| Throughput (flits/cycle/IP) | Integrated | PSO+NMAP [27] | 0.02 | 0.024 | 0.079 | 0.045 | 0.025 | 0.041 |
| | Integrated | PSO+DPSO [24] | 0.019 | 0.024 | 0.077 | 0.046 | 0.042 | 0.047 |
| | Integrated | PSO+POSEIDON [48] | 0.019 | 0.024 | 0.077 | - | - | 0.048 |
| | Integrated | PSO+Our Approach | 0.0196 | 0.025 | 0.077 | 0.048 | 0.087 | 0.052 |
| | Step-by-step | PSO and NMAP [27] | 0.018 | 0.022 | 0.08 | 0.049 | 0.077 | 0.05 |
| | Step-by-step | PSO and DPSO [24] | 0.015 | 0.022 | 0.08 | 0.051 | 0.079 | 0.05 |
| | Step-by-step | PSO and POSEIDON [48] | 0.014 | 0.022 | - | - | - | - |
| | Step-by-step | PSO and Our Approach | 0.012 | 0.022 | 0.08 | 0.051 | 0.084 | 0.058 |
| | - | Chaos DPSO [39] | 0.012 | 0.022 | 0.08 | 0.05 | 0.081 | 0.051 |

Table 6: Average network latency and throughput comparison between integrated and step-by-step approach

C. Impact of no. of tasks and weight

In this section, the impact of the number of tasks and the weight factor on the performance is shown. For this, the results of the proposed approach in the integrated and step-by-step cases have been plotted. The resultant graphs are shown in Fig. 5. Different dash-dot line styles correspond to different numbers of tasks. The X-axes of the graphs give the weight factors, while the Y-axes give the percentage improvements of various quantities in the integrated approach over the step-by-step one. The percentage improvement in communication cost decreases as the weight increases and also as the number of tasks decreases. The fact that the improvement increases with the number of tasks shows the efficiency of the proposed integrated approach. As the weight of the area increases, the improvement decreases, since the proposed integrated approach optimizes the communication cost. The same trend can be observed for latency, average packet energy and average link energy. In the case of area, the integrated approach does not show improvement, as it optimizes the communication cost. The graph shows the percentage degradation in area of the integrated approach over the step-by-step approach. The percentage degradation in area increases with the weight value and decreases with an increasing number of tasks. The fact that the area penalty decreases with an increasing number of tasks further reinforces the applicability of the integrated approach.

VI. CONCLUSION

In this paper, an integrated core selection and mapping strategy has been presented for mesh based NoC by considering non-uniform core sizes. An efficient heuristic has been proposed for mapping non-uniform sized cores onto the mesh topology. The proposed integrated approach has been compared with the step-by-step core selection and mapping approach and showed significant improvement in the communication cost and floorplan area while considering static operation of the system. The dynamic performance is also comparable in terms of throughput, and an improvement in average latency is achieved. The performance of the integrated and step-by-step approaches has been shown for different values of the weight. The results of the proposed heuristic have been compared with the existing mapping algorithms, and better results have been achieved. Future work involves core selection for application-specific NoCs, scheduling of the tasks and integration of thermal management into core selection. In the present PSO formulation, elements of a particle are exchanged. The formulation can be extended to allow for changes in the values as well. Parallelization of PSO is also an important direction to enable better exploration of the solution space.

Fig. 5: Impact of number of tasks and weight on performance

REFERENCES

[1] W. J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in Proceedings of Design Automation Conference (DAC), pp. 684–689, 2001.
[2] D. Atienza, F. Angiolini, S. Murali, A. Pullini, L. Benini, G. De Micheli, Network-On-Chip Design and Synthesis Outlook, Integration, the VLSI Journal, vol. 41, no. 2, pp. 340–359, May 2008.
[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, Cambridge, Mass.: MIT Press and McGraw-Hill, pp. 1033–1038, 2001.
[4] R. Pop, S. Kumar, A survey of techniques for mapping and scheduling applications to network on chip systems, ISSN 1404-0018, Research Report 04:4, School of Engineering, Jönköping University, 2004.
[5] P.K. Sahu, S. Chattopadhyay, A survey on application mapping strategies for Network-on-Chip design, Journal of Systems Architecture, vol. 59, no. 1, pp. 60–76, January 2013.
[6] G. Chen, F. Li, M. Kandemir, Compiler-directed application mapping for NoC based chip multiprocessors, in Proceedings of ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), pp. 155–157, 2007.
[7] C.L. Chou, R. Marculescu, User-aware dynamic task allocation in Network-on-Chip, in Proceedings of Design, Automation and Test in Europe (DATE), pp. 1232–1237, March 2008.
[8] C.L. Chou, U.Y. Ogras, R. Marculescu, Energy- and performance-aware incremental mapping for NoCs with multiple voltage levels, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1866–1879, October 2008.
[9] A.K. Singh, W. Jigang, A. Prakash, T. Srikanthan, Mapping algorithms for NoC based heterogeneous MPSoC platforms, in Euromicro Conference on Digital System Design/Architecture, Methods and Tools, pp. 133–140, August 2009.

[10] A.K. Singh, T. Srikanthan, A. Kumar, W. Jigang, Communication-aware heuristics for run-time task mapping on NoC-based MPSoC platforms, Journal of System Architecture, vol. 56, no. 7, pp. 242–255, July 2010.
[11] A. Bender, MILP based task mapping for heterogeneous multiprocessor systems, in Proceedings of International Conference on Design and Automation (EURO-DAC), pp. 190–197, 1996.
[12] C. Ostler, K.S. Chatha, An ILP formulation for system-level application mapping on network processor architecture, in Proceedings of Design, Automation and Test in Europe (DATE), pp. 1–6, April 2007.
[13] P. Ghosh, A. Sen, A. Hall, Energy efficient application mapping to NoC processing elements operating at multiple voltage levels, in IEEE International Symposium on Network-on-Chip (NoCS), pp. 80–85, May 2009.
[14] J. Hu, R. Marculescu, Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures, in Proceedings of Design, Automation and Test in Europe (DATE), pp. 688–693, 2003.
[15] G. Ascia, V. Catania, M. Palesi, Multi-objective mapping for mesh-based NoC architectures, in ACM International Conference on Hardware/Software Codesign and System Synthesis, pp. 182–187, September 2004.
[16] G. Ascia, V. Catania, M. Palesi, Multi-objective genetic approach to mapping problem on Network-on-Chip, Journal of Universal Computer Science, vol. 12, no. 4, pp. 370–394, April 2006.
[17] A.H. Benyamina, P. Boulet, Multi-objective mapping for NoC architecture, Journal of Digital Information Management, vol. 5, no. 6, pp. 378–384, December 2007.
[18] K. Bhardwaj, R.K. Jena, Energy and bandwidth aware mapping of IPs onto regular NoC architectures using multi-objective genetic algorithms, in International Symposium on System-on-Chip (SOC), pp. 27–31, October 2009.
[19] G. Fen, W. Ning, Genetic algorithm based mapping and routing approach for network on chip architectures, Chinese Journal of Electronics, vol. 19, no. 1, pp. 91–96, January 2010.
[20] M.J. Sepulveda, M. Strum, W.J. Chau, A multi-objective adaptive immune algorithm for NoC mapping, in International Conference on Very Large Scale Integration (VLSI-SOC), pp. 193–196, October 2009.
[21] M.J. Sepulveda, M. Strum, W.J. Chau, G. Gogniat, A multi-objective approach for multi-application NoC mapping, in IEEE Latin American Symposium on Circuits and Systems (LASCAS), pp. 1–4, February 2011.
[22] W. Zhou, Y. Zhang, Z. Mao, Link-load balance aware mapping and routing for NoC, WSEAS Transactions on Circuits and Systems, vol. 6, no. 11, pp. 583–591, November 2007.
[23] P.K. Sahu, P. Venkatesh, S. Gollapalli, S. Chattopadhyay, Application mapping onto mesh structured Network-on-Chip using particle swarm optimization, in IEEE International Symposium on VLSI (ISVLSI), pp. 335–336, July 2011.
[24] P. K. Sahu, T. Shah, K. Manna, S. Chattopadhyay, Application Mapping Onto Mesh-Based Network-on-Chip Using Discrete Particle Swarm Optimization, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 2, pp. 300–312, February 2014.
[25] A. Hansson, K. Goossens, A. Radulescu, A unified approach to constrained mapping and routing on Network-on-Chip architectures, in IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 75–80, 2005.
[26] S. Tosun, New heuristic algorithm for energy aware application mapping and routing on mesh-based NoCs, Journal of System Architecture, vol. 57, no. 1, pp. 69–78, January 2011.
[27] S. Murali, G. De Micheli, Bandwidth constrained mapping of cores onto NoC architectures, in Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), vol. 2, pp. 896–901, February 2004.
[28] K. Srinivasan, K.S. Chatha, A technique for low energy mapping and routing in Network-on-Chip architecture, in IEEE International Symposium on Low Power Electronics and Design (ISLPED), pp. 387–392, August 2005.
[29] C. Marcon, N. Calazans, F. Moraes, A. Susin, I. Reis, F. Hessel, Exploring NoC mapping strategies: an energy and timing aware technique, in Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), vol. 1, pp. 502–507, March 2005.
[30] C. Marcon, A. Borin, A. Susin, L. Carro, F. Wagner, Time and energy efficient mapping of embedded applications onto NoCs, in Proceedings of Asia and South Pacific Design Automation Conference (ASP-DAC), vol. 1, pp. 33–38, January 2005.
[31] X. Wang, M. Yang, Y. Jiang, P. Liu, Power-aware mapping approach to map IP cores onto NoCs under bandwidth and latency constraints, ACM Transactions on Architecture and Code Optimization, vol. 7, no. 1, pp. 1–30, May 2010.
[32] R. Tornero, V. Sterrantino, M. Palesi, J.M. Orduna, A multi-objective strategy for concurrent mapping and routing in Networks on Chip, in IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–8, May 2009.
[33] H. Yu, Y. Ha, B. Veeravalli, Communication-aware application mapping and scheduling for NoC-based MPSoCs, in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 3232–3235, May-June 2010.
[34] H.C. Chi, F. Ferng, Y.C. Hsieh, Area Utilization Based Mapping for Network-on-chip Architectures with Over-sized IP Cores, 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), pp. 1520–1525, June 2012.
[35] G. Haiyun, L. Changwen, S. Shu, Research on mapping algorithm of irregular mesh NoC for portable multimedia appliances, IET Conference on Wireless, Mobile and Sensor Networks (CCWMSN07), pp. 697–700, December 2007.
[36] J. Hu, R. Marculescu, Energy- and performance-aware mapping for regular NoC architectures, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 4, pp. 551–562, April 2005.

[37] Li Guangshun, Wu Junhua, Ma Guangsheng, Mapping of Irregular IP onto NoC Architecture with Optimal Energy Consumption, Tsinghua Science and Technology, ISSN 1007-0214 26/49, vol. 12, no. S1, pp. 146–149, July 2007.
[38] W. Jang, D.Z. Pan, A3MAP: Architecture-aware analytic mapping for Network-on-Chip, in Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 523–528, January 2010.
[39] W. Lei, L. Xiang, Energy- and latency-aware NoC mapping based on chaos discrete particle swarm optimization, in Proceedings of IEEE International Conference on Communications and Mobile Computing, vol. 1, pp. 263–268, April 2010.
[40] J. Kennedy, R. Eberhart, Particle swarm optimization, Proceedings of IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948, November/December 1995.
[41] K.P. Wang, L. Huang, C.G. Zhou, W. Pang, Particle swarm optimization for traveling salesman problem, International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1583–1585, November 2003.
[42] L. Guilan, Z. Hai, S. Chunhe, Convergence Analysis of a Dynamic Discrete PSO Algorithm, First International Conference on Intelligent Networks and Intelligent Systems (ICINIS '08), pp. 89–92, November 2008.
[43] R.P. Dick, D.L. Rhodes, W. Wolf, TGFF: task graphs for free, Proceedings of the Sixth International Workshop on Hardware/Software Codesign (CODES/CASHE), pp. 97–101, March 1998.
[44] A.B. Kahng, L. Bin, S.P. Li, K. Samadi, ORION 2.0: A Power-Area Simulator for Interconnection Networks, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 1, pp. 191–196, January 2012.
[45] S. Kundu, S. Chattopadhyay, Network-on-Chip architecture design based on mesh-of-tree deterministic routing topology, International Journal of High Performance System Architecture, vol. 1, no. 3, pp. 163–182, December 2008.
[46] S. Kundu, K. Manna, S. Gupta, K. Kumar, R. Parikh, S. Chattopadhyay, A comparative performance evaluation of Network-on-Chip architecture under self-similar traffic, in Proceedings of International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom), pp. 414–418, October 2009.
[47] L. Tang, S. Kumar, A two-step genetic algorithm for mapping task graphs to a network on chip architecture, Euromicro Symposium on Digital System Design, pp. 180–187, September 2003.
[48] K. Soohyun, S. Pasricha, C. Jeonghun, POSEIDON: A framework for application-specific Network-on-Chip synthesis for heterogeneous chip multiprocessors, 12th International Symposium on Quality Electronic Design (ISQED), pp. 1–7, March 2011.
[49] https://drive.google.com/folderview?id=0B87DOi0yR8r9fnU4T0otdDZZTktNdGVPZmNLUVRsa1VGNXBVbWZudjZVUW1uelM2VHFXb1E&usp=sharing

Soumya J. received the B.Tech degree in Electronics and Communications Engineering from JNTU, Hyderabad, and the Masters degree in Embedded Systems from IIT Kharagpur. She is currently working towards the Ph.D. degree in the Department of E&ECE, IIT Kharagpur. Her research interests include Application Specific Network-on-Chip and Reconfigurable Network-on-Chip design.

K. Naveen Kumar is an M.Tech student in the Department of Electronics and Communication Engineering at Indian Institute of Technology, Kharagpur. His research interests include Network-on-Chip architecture design and algorithm analysis, design and implementation.

Santanu Chattopadhyay did his B.E. in Computer Science and Technology from University of Calcutta (Bengal Engineering College) in 1990. He completed his M.Tech in Computer and Information Technology and Ph.D. in Computer Science and Engineering in 1992 and 1996 respectively, both from Indian Institute of Technology Kharagpur, India. He is currently a Professor in the Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology Kharagpur, India. His research interests include low power circuit design and test, System-on-Chip test, and Network-on-Chip design and test. He has published more than 150 technical papers in refereed international journals and conferences.
