Cluster-based application mapping method for Network-on-Chip

Advances in Engineering Software 42 (2011) 868–874 Contents lists available at ScienceDirect Advances in Engineering Software journal homepage: www...


Advances in Engineering Software 42 (2011) 868–874

Contents lists available at ScienceDirect

Advances in Engineering Software journal homepage: www.elsevier.com/locate/advengsoft

Suleyman Tosun ⇑

Computer Engineering Department, Ankara University, Besevler, 06500 Ankara, Turkey

Article info

Article history: Received 29 March 2011; received in revised form 11 May 2011; accepted 6 June 2011; available online 7 July 2011.

Keywords: Network-on-Chip; Mesh topology; Application mapping; Clustering; ILP; Communication

Abstract

Network-on-Chip (NoC) is a recently introduced paradigm for overcoming the communication problems of System-on-Chip architectures. Mapping applications onto a mesh-based NoC architecture is an NP-hard problem, and several heuristic methods have been presented to solve it. Scalability is the main problem of these heuristics, and it is very difficult to conclude that one heuristic is better than the others. Integer Linear Programming (ILP) based methods determine the optimum mappings, but they require very long execution times. In this paper, we propose a clustering-based relaxation for ILP formulations. Our experiments on several multimedia benchmarks and custom graphs show that the proposed method obtains optimal or close-to-optimal results within tolerable time limits.

© 2011 Elsevier Ltd. All rights reserved.

⇑ Tel.: +90 544 337 6547. E-mail address: [email protected]

0965-9978/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.advengsoft.2011.06.005

1. Introduction

Network-on-Chip (NoC) was proposed at the beginning of this century as a new communication infrastructure for integrated circuits [1,2]. NoC architectures mimic traditional interconnection network concepts. Although various topologies exist for NoC architectures, the basic and well-accepted one is the mesh topology. Even commercial multi-core architectures, such as Intel's Teraflops Research Chip [3], have adopted a mesh-based topology; in this chip, 80 processing cores are connected in a 2D mesh network. One of the biggest remaining problems of mesh-based NoC architectures is optimum application mapping.

Application mapping onto mesh topologies is a well-known NP-hard problem [4]. Several methods [5–10] have been proposed to solve it, mainly with energy minimization as the objective criterion. While [5] presents a mapping algorithm called NMAP that supports single-minimum-path routing and split-traffic routing, [6] proposes a fast branch-and-bound algorithm that exploits routing flexibility and improves the solution quality. MOCA [7] uses slicing-tree-based task mapping and generates routes on the mapping result. ONYX [8] and CastNet [9] are two heuristic methods that use the symmetry of the mesh as a starting point. ONYX maps the tasks in a lozenge-shaped path order, whereas CastNet maps them one by one based on the communication weights between candidate tasks and already mapped tasks. CGMAP [10] employs a chaos-genetic algorithm that obtains results close to those of other meta-heuristic algorithms. However, none of these methods guarantees an optimal mapping onto mesh architectures.

Optimum solutions can be obtained by Integer Linear Programming (ILP) based methods. ILP is a mathematical method for obtaining the best solution among several alternatives. Our mapping problem can be expressed as a linear programming problem with communication cost minimization as the objective function. However, since the ILP method searches the entire, very large solution space, it may take very long CPU times to determine the optimum solution. Our earlier work [11] demonstrates this performance bottleneck of the ILP-based method. One way to overcome this timing problem is an efficient numerical implementation of the Simplex method, as suggested in [12]; another is to decompose the constraint variables into a finite number of polyhedra and solve each decomposition separately. We chose the latter approach.

In this paper, we present a cluster-based ILP formulation of the application mapping problem for mesh-based NoC architectures. Our method partitions the task graph that represents the given application, and the mesh, into smaller sub-graphs and sub-meshes in order to decompose the solution space into smaller polyhedra. It then maps each sub-graph onto the corresponding sub-mesh using our ILP-based method. Finally, it merges the individual mappings to determine the final solution. We implemented our method using a commercial ILP tool [13] and tested its effectiveness on several multimedia benchmarks and randomly generated graphs. Our experiments show that the proposed method greatly reduces the execution time of the ILP method while producing similar results.

The rest of this paper is organized as follows. Section 2 defines the mapping problem. Section 3 presents the ILP formulations.


Section 4 presents our cluster-based mapping method. Section 5 gives the experimental results. Finally, Section 6 concludes this paper.

2. Problem definition

In this section, we formally define the application mapping problem. We use a communication task graph (CTG) and a topology graph (TG) to represent the given application and the mesh architecture, respectively.

Definition 1. A CTG is a graph $G(V, E)$, where each vertex $v_i \in V$ represents a task in the application and each edge $e_{i,j} \in E$ represents a dependency between two tasks $v_i$ and $v_j$. The amount of data transferred between $v_i$ and $v_j$ is represented by the weight $w_{i,j}$ of $e_{i,j}$ and is given in bits per second.

Fig. 1a shows an example CTG. This CTG represents the multimedia benchmark 263-enc mp3-dec presented in [16]; its edge weights give the amount of data transferred between two tasks in kbits/s. The example has twelve vertices, so we need at least twelve tiles to map these tasks onto, since our ILP formulation maps at most one task to a tile. To map more than one task to a tile, a preprocessing step must be added that determines which tasks can share a tile based on their worst-case execution times. The selected tasks can then be merged into a single node to be mapped onto a tile.

Definition 2. A TG is a graph $M(P, L)$, where each node $p_i \in P$ denotes a tile (i.e. a processing core) in the topology and each edge $l_{i,j} \in L$ denotes a physical link between $p_i$ and $p_j$.

Fig. 1b shows an example TG with twelve tiles connected as a 4 × 3 mesh. As illustrated in this figure, each tile has 2D coordinates in the mesh. Fig. 1c gives the structure of a tile. A tile contains a router to forward data between tiles, a memory to store program and data, and a processing core to process data.

Using the above definitions, the application mapping problem can be formulated as follows: Given a CTG and a TG that satisfy

$$|V| \le |P| \tag{1}$$

find a one-to-one mapping function $F: V \to P$ from CTG to TG with

$$\min \left\{ \mathit{CommCost} = \sum_{\forall e_{i,j} \in E} g_{f(v_i),f(v_j)} \times w_{i,j} \right\} \tag{2}$$

such that:

$$\forall v_i \in V,\ \exists p_k \in P,\ f(v_i) = p_k \tag{3}$$

$$\forall v_i \ne v_j \in V,\ f(v_i) \ne f(v_j) \tag{4}$$

where CommCost is the total communication on the network (i.e. the number of bits transferred per second between tiles) and $g_{f(v_i),f(v_j)}$ is the length of the minimal path (i.e. the minimum number of hops) between tiles $f(v_i) = p_k$ and $f(v_j) = p_l$. Since CommCost is directly proportional to $g$, our solution minimizes the number of hops $g$ in order to minimize CommCost.

In our problem formulation, we do not consider bandwidth constraints, and we assume minimal-path routing between every pair of communicating tasks. The XY routing algorithm [17] is the commonly accepted deterministic routing algorithm because of its simplicity and very small hardware requirements. However, it may cause network congestion if two communication paths share the same link, which may add latency. Several minimal-path routing algorithms for mesh-based NoCs [7,9,14] can be applied to our mapping result to eliminate congestion. Additionally, virtual channels can be used in router ports for congestion control.

3. ILP formulation

An ILP problem is formulated with a linear objective function and linear constraints, and its solution variables are restricted to integers. In this work, we use a 0–1 ILP formulation, a special case of the general ILP problem in which each solution variable is restricted to be either 0 or 1. We used Xpress MP [13], a commercial tool, to formulate and solve our ILP problem, though this choice is orthogonal to the focus of this paper. In our ILP formulation, we view the chip area as a 2D grid and assign tasks to tiles within this grid. Table 1 gives the constant terms and variables used in our formulations. To make this paper self-contained, we restate the ILP formulations presented in [11] in the following paragraphs.

For our formulations, we define a binary variable $a_{i,x,y}$, where $a_{i,x,y} = 1$ if task $i$ is mapped to the tile at coordinates $(x, y)$ and $a_{i,x,y} = 0$ otherwise. Eq. (5) states that every task $i$ must be mapped to exactly one tile with coordinates $(x, y)$. Only one task can be mapped to a single tile, and the number of tasks in the CTG may be less than the number of available tiles, in which case some tiles have no task mapped onto them. Eq. (6) captures this constraint.

$$\sum_{x=0}^{\mathit{Xdim}} \sum_{y=0}^{\mathit{Ydim}} a_{i,x,y} = 1, \quad \forall i \tag{5}$$

$$\sum_{i=1}^{n} a_{i,x,y} \le 1, \quad \forall x, y \tag{6}$$
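Constraints (5) and (6) can be checked mechanically for any candidate mapping. The sketch below is only an illustration, not the Xpress MP model used in the paper: it builds the 0–1 variables $a_{i,x,y}$ from a task-to-tile dictionary and tests both constraints. The function names, the dictionary format, and the choice of grid coordinates running from 0 to Xdim − 1 and 0 to Ydim − 1 are assumptions made for this example.

```python
from itertools import product

def assignment_vars(mapping, xdim, ydim):
    """Build a[(i, x, y)] = 1 if task i sits on tile (x, y), else 0."""
    return {(i, x, y): int(mapping[i] == (x, y))
            for i in mapping
            for x, y in product(range(xdim), range(ydim))}

def satisfies_eq5_eq6(a, tasks, xdim, ydim):
    """Check Eq. (5) (each task on exactly one tile) and
    Eq. (6) (each tile holds at most one task)."""
    tiles = list(product(range(xdim), range(ydim)))
    eq5 = all(sum(a[(i, x, y)] for x, y in tiles) == 1 for i in tasks)
    eq6 = all(sum(a[(i, x, y)] for i in tasks) <= 1 for x, y in tiles)
    return eq5 and eq6

# Hypothetical 2 x 2 mesh with three tasks.
good = {1: (0, 0), 2: (1, 0), 3: (0, 1)}   # one task per tile
bad = {1: (0, 0), 2: (0, 0), 3: (0, 1)}    # tasks 1 and 2 share a tile
```

Here `satisfies_eq5_eq6(assignment_vars(good, 2, 2), good, 2, 2)` holds, while the `bad` mapping violates Eq. (6) because two tasks share tile (0, 0).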

As indicated in Eq. (2), in order to calculate the total communication of the architecture (i.e. CommCost), we must calculate the minimum number of hops $g_{f(v_i),f(v_j)}$ between two tasks $v_i$ and $v_j$ mapped on the mesh. The Manhattan distance (i.e. the city block distance) gives the minimum number of hops between two tiles.

Fig. 1. (a) An example CTG, (b) an example TG with the size of 4 × 3, and (c) the structure of a tile.
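To make the objective concrete, the following sketch evaluates CommCost for a given mapping using the Manhattan distance as the hop count, and finds the best mapping of a tiny graph by exhaustive search. Exhaustive search stands in for the ILP solver purely for illustration and is only feasible for very small instances; the function names, the edge dictionary format, and the three-task example are assumptions, not data from the paper.

```python
from itertools import permutations

def hops(p, q):
    """Manhattan distance between tiles p = (x, y) and q = (x, y)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def comm_cost(mapping, edges):
    """Sum of hop count times edge weight over all CTG edges (Eq. (2))."""
    return sum(w * hops(mapping[i], mapping[j]) for (i, j), w in edges.items())

def brute_force_mapping(tasks, edges, xdim, ydim):
    """Try every one-to-one placement of tasks on the mesh; return (cost, mapping)."""
    tiles = [(x, y) for x in range(xdim) for y in range(ydim)]
    best = None
    for chosen in permutations(tiles, len(tasks)):
        mapping = dict(zip(tasks, chosen))
        cost = comm_cost(mapping, edges)
        if best is None or cost < best[0]:
            best = (cost, mapping)
    return best

# Hypothetical CTG: three tasks, weights in arbitrary units.
edges = {(1, 2): 10, (2, 3): 5, (1, 3): 1}
best_cost, best_mapping = brute_force_mapping([1, 2, 3], edges, 2, 2)
```

On a 2 × 2 mesh any three tiles form an L shape, so the best placement puts the lightest edge on the two-hop pair.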


Table 1. Constants and variables used in the formulations.

| Constants/variables | Definitions |
| --- | --- |
| $n$ | The number of tasks (i.e. nodes) in the input CTG |
| $w_{i,j}$ | The communication weight between tasks $i$ and $j$ in the CTG |
| $\mathit{Xdim}$ | The size of the mesh architecture in the x dimension |
| $\mathit{Ydim}$ | The size of the mesh architecture in the y dimension |
| $a_{i,x,y}$ | Binary variable. $a_{i,x,y} = 1$ if task $i$ is mapped to the tile at coordinates $(x, y)$. Otherwise $a_{i,x,y} = 0$ |
| $X_{i,j,a}$ | Binary variable. $X_{i,j,a} = 1$ if the distance in the x dimension between tasks $i$ and $j$ is equal to $a$. Otherwise $X_{i,j,a} = 0$ |
| $Y_{i,j,b}$ | Binary variable. $Y_{i,j,b} = 1$ if the distance in the y dimension between tasks $i$ and $j$ is equal to $b$. Otherwise $Y_{i,j,b} = 0$ |
| $\mathit{Xcost}$ | The total communication cost of the network in the x dimension |
| $\mathit{Ycost}$ | The total communication cost of the network in the y dimension |

For this distance calculation, we define two binary variables $X_{i,j,a}$ and $Y_{i,j,b}$ (written $\mathit{Xdist}_{i,j,a}$ and $\mathit{Ydist}_{i,j,b}$ in Eqs. (7)–(10)), representing the distances in the x and y dimensions between tasks $v_i$ and $v_j$, respectively. As an example, assume that tasks $v_2$ and $v_5$ in Fig. 1a are mapped onto the tiles (0, 1) and (2, 2), respectively, in Fig. 1b. Then the variables $X_{2,5,2}$ and $Y_{2,5,1}$ become 1, meaning that $a = 2$ and $b = 1$. The sum of these two values, $a + b = 3$, is the length of the shortest path between the two tiles. We use Eqs. (7) and (8) to determine the distances $a$ (for the x dimension) and $b$ (for the y dimension) for each pair of communicating tasks. In these equations, $(x_i, y_i)$ and $(x_j, y_j)$ are the coordinates of the tiles that tasks $v_i$ and $v_j$ are mapped onto, respectively.

$$\mathit{Xdist}_{i,j,a} \ge a_{i,x_i,y_i} + a_{j,x_j,y_j} - 1, \quad \forall i,j : e_{i,j} \in E,\ 0 \le x_i, x_j \le \mathit{Xdim},\ 0 \le y_i, y_j \le \mathit{Ydim} \ \text{such that}\ a = |x_j - x_i| \tag{7}$$

$$\mathit{Ydist}_{i,j,b} \ge a_{i,x_i,y_i} + a_{j,x_j,y_j} - 1, \quad \forall i,j : e_{i,j} \in E,\ 0 \le x_i, x_j \le \mathit{Xdim},\ 0 \le y_i, y_j \le \mathit{Ydim} \ \text{such that}\ b = |y_j - y_i| \tag{8}$$

After formulating the hop counts between communicating tasks, we formulate the total communication cost in the x and y dimensions using Eqs. (9) and (10). In these formulas, we multiply the number of hops by the communication weight $w_{i,j}$ of each pair of communicating tasks $v_i$ and $v_j$ to obtain the total communication cost of the system.

$$\mathit{Xcost} = \sum_{e_{i,j} \in E} \sum_{a=1}^{\mathit{Xdim}-1} w_{i,j} \times a \times \mathit{Xdist}_{i,j,a} \tag{9}$$

$$\mathit{Ycost} = \sum_{e_{i,j} \in E} \sum_{b=1}^{\mathit{Ydim}} w_{i,j} \times b \times \mathit{Ydist}_{i,j,b} \tag{10}$$

Our objective is to minimize the total communication cost:

$$\text{MIN}: \mathit{CommCost} = \mathit{Xcost} + \mathit{Ycost} \tag{11}$$

4. Cluster-based method

Our ILP-based mapping tool obtains optimum results for the stated problem. However, as the number of tasks in the application graph increases, the computation time increases dramatically, which makes this method inapplicable in practice. To remedy this timing problem, we propose a cluster-based method that reduces the computation times. In this method, we decompose the mapping problem into several sub-problems and solve each sub-problem separately. In other words, we first decompose the mesh into sub-meshes and partition the application graph into smaller clusters. We then apply our ILP-based mapping method to each cluster/sub-mesh pair. The proposed cluster-based mapping method follows the steps given below.

4.1. Mesh partitioning

Mesh dimensions are designer-guided inputs to our system. The system designer inputs Xdim and Ydim values such that $|V| \le |T| = \mathit{Xdim} \times \mathit{Ydim}$. Our mesh partitioning method is a recursive procedure. It cuts the given mesh $M$ into two sub-meshes $M_1$ and $M_2$ and keeps cutting the sub-meshes until each partition has at most $t$ tiles. The stopping criterion $t$ is a predefined value. In our experiments, we observed that our ILP-based system can map around 12 tasks in a tolerable time; therefore, we used $t = 12$. The granularity of the partitions has a twofold effect. If a partition contains many nodes, the ILP solution time increases dramatically. On the other hand, if we partition the graph into very small portions, the final solution moves away from the optimal one, since the number of mapping options for each node decreases.

Fig. 2 illustrates our mesh partitioning procedure on a 9 × 3 mesh. Our method cuts the mesh recursively in a specified dimension to partition it into two sub-meshes. In the example in Fig. 2, we cut the mesh in the x dimension; however, the same procedure can cut meshes in either the x or the y dimension. If the mesh is square, we can first partition it in one dimension and then partition each sub-mesh in the other dimension. For demonstration purposes, we show the cut procedure only for the x dimension. In order to obtain equal-sized (or as close to equal-sized as possible) partitions, we partition the mesh into two sub-meshes satisfying the conditions given in (12)–(14).

$$\mathit{Xdim}_1 = \lceil \mathit{Xdim} / 2 \rceil \tag{12}$$

$$\mathit{Xdim}_2 = \mathit{Xdim} - \mathit{Xdim}_1 \tag{13}$$

$$\mathit{Ydim}_1 = \mathit{Ydim}_2 = \mathit{Ydim} \tag{14}$$

Fig. 2. Mesh partitioning example. (a) Initial mesh and the cut, (b) the sub-meshes after the first cut, and (c) the result. In the figures, $M_i$ represents a sub-mesh and $P_j$ represents the dummy tiles connecting two sub-meshes.


After the cut, we insert a 1 × Ydim dummy mesh $P_1$ between the two sub-meshes $M_1$ and $M_2$, as shown in Fig. 2b. Subsequently, we update the coordinates of each tile, since the x coordinates of the tiles located after the inserted dummy tiles increase by 1. For example, between Fig. 2a and b, the tile at (8, 0) moves to (9, 0) after the dummy tile insertion. Note that the number of dummy tiles between two neighboring sub-meshes (Ydim in the present case) is an important criterion for our graph clustering method, as we explain in Section 4.2. When we partition a graph, we add dummy nodes between two clusters. The number of dummy nodes in graph clustering can be at most the number of dummy tiles between the two neighboring sub-meshes. We keep cutting each sub-mesh until we reach the stopping criterion given in (15). Fig. 2c shows the result of the mesh partitioning procedure.

$$\forall M_i,\quad |P_{i-1}| + |M_i| + |P_i| \le t, \qquad i = 1, 2, \ldots \ \text{and}\ P_0 = \emptyset \tag{15}$$
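The recursive cut can be sketched as follows. This is one possible reading of Eqs. (12)–(15) for cuts in the x dimension only, not the author's implementation: a piece is cut again while its own tiles plus the Ydim dummy tiles on each neighboring side exceed $t$. The function names and the boolean flags marking whether a dummy column borders a piece are assumptions made for this example.

```python
from math import ceil

def partition_columns(xdim, ydim, t, left_dummy=False, right_dummy=False):
    """Return the widths (in columns) of the sub-meshes, left to right."""
    # Size counted by criterion (15): own tiles plus adjacent dummy columns.
    size = xdim * ydim + ydim * (int(left_dummy) + int(right_dummy))
    if size <= t or xdim == 1:        # small enough, or cannot be cut further
        return [xdim]
    x1 = ceil(xdim / 2)               # Eq. (12)
    x2 = xdim - x1                    # Eq. (13)
    # The cut places a dummy column between the two halves.
    return (partition_columns(x1, ydim, t, left_dummy, True) +
            partition_columns(x2, ydim, t, True, right_dummy))

def submesh_x_ranges(widths):
    """First and last x coordinate of each sub-mesh after dummy columns are inserted."""
    ranges, start = [], 0
    for w in widths:
        ranges.append((start, start + w - 1))
        start += w + 1                # skip the dummy column that follows
    return ranges

widths = partition_columns(9, 3, 12)  # the 9 x 3 mesh with t = 12
```

For the 9 × 3 mesh with $t = 12$, this sketch yields four sub-meshes, which agrees with the four sub-meshes and three dummy columns shown in Fig. 2c. After the first cut alone, `submesh_x_ranges([5, 4])` places the second sub-mesh at x = 6 to 9, consistent with tile (8, 0) moving to (9, 0).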

4.2. Graph clustering

Our graph clustering follows recursive steps similar to those of the mesh partitioning. We recursively divide the application graph into clusters until each cluster contains no more nodes than the number of tiles in the corresponding sub-mesh. We modified the Kernighan–Lin algorithm [15] for our partitioning method. Before elaborating on the details of our clustering method, we give some definitions.

Definition 3. A cut $C_{G_A,G_B}$ of graph $G(V,E)$ is a partition of $V$ into $G_A$ and $G_B = V - G_A$. The cut degree $d_{G_A,G_B}$ is the total number of edges crossing the cut $C_{G_A,G_B}$.

The cut degree $d_{G_A,G_B}$ of the cut $C_{G_A,G_B}$ can be determined by using the equation

$$d_{G_A,G_B} = \sum_{\forall e_{i,j} \in E} b_{i,j}, \quad \forall v_i \in G_A \tag{16}$$

where $b_{i,j}$ is a binary variable that is 1 when $v_i \in G_A \wedge v_j \in G_B$ and 0 otherwise.

Definition 4. A node $v_i$ in cluster $G_A$ is called a free node if $v_i$ does not have any neighbor in a dummy node set $D$; $F_A$ is the set of the free nodes in $G_A$. Formally, $F_A = \{v_i \mid \forall e_{i,j}, (v_j \notin D)\}$.

Definition 5. A node $v_i$ in cluster $G_A$ is a bound node if there is at least one node $v_j$ connected to $v_i$ such that $v_j$ is a dummy node in a dummy node set $D$; $B_A$ is the set of the bound nodes in $G_A$. Formally, $B_A = \{v_i \mid \exists e_{i,j}, (v_j \in D)\}$.

We define the free and bound nodes in a cluster to decide which nodes are candidates to be swapped between two clusters. By moving only the free nodes between clusters, we force neighboring nodes of the CTG to be either in the same cluster or in neighboring clusters. The rationale is to increase the probability that they are mapped within one hop of each other. Initially, all nodes are free nodes, since there is no dummy node set before the partitioning starts. For each cluster, the free and bound nodes together make up all nodes of the cluster; that is, $G_A = F_A \cup B_A$. In Fig. 3a, all the nodes are free nodes, whereas $v_9$, $v_{17}$, $v_8$, and $v_{16}$ are bound nodes in Fig. 3b.

Fig. 3. Graph clustering.

Given the necessary definitions, we can now explain our partitioning method. The partitioning starts by cutting the initial graph randomly into two clusters $G_A$ and $G_B$, as shown in Fig. 3a, such that $|V_A| \le |M_1|$ and $|V_B| \le |M_2|$. Then, we calculate the cut degree $d_{G_A,G_B}$. As we indicated in Section 4.1, the cut degree can be at most the number of dummy tiles between two sub-meshes. That is,

$$d_{G_A,G_B} \le |D| = \mathit{Ydim} \tag{17}$$

where $|D|$ represents the number of dummy tiles between any two neighboring sub-meshes. We write $|D| = \mathit{Ydim}$ since we cut the mesh vertically in our explanation; if the cut is horizontal, then $|D| = \mathit{Xdim}$. If the constraint in (17) is not satisfied (i.e. $d_{G_A,G_B} > |D|$), we apply the modified Kernighan–Lin algorithm [15] in an attempt to reduce $d_{G_A,G_B}$. In this algorithm, we swap two free nodes based on their total gain, as defined in [15], until $d_{G_A,G_B} \le |D|$. The total gain (see Lemma 1 below) of a node pair is calculated from the internal cost IC and external cost EC of each node.
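The quantities used in this clustering step can be sketched in a few lines. The example computes the cut degree of Eq. (16) and the swap gain $G_{i,j} = D_i + D_j - 2b_{i,j}$ of Lemma 1, counting internal and external costs as the numbers of neighbors inside and outside a node's own cluster. The undirected edge-list format, the function names, and the six-node graph are illustrative assumptions; the restriction of swaps to free nodes is omitted for brevity.

```python
def cut_degree(edges, ga, gb):
    """Number of edges with one endpoint in ga and the other in gb."""
    return sum(1 for i, j in edges
               if (i in ga and j in gb) or (i in gb and j in ga))

def costs(edges, node, own, other):
    """Return (IC, EC): neighbors of node in its own and in the other cluster."""
    nbrs = {j if i == node else i for i, j in edges if node in (i, j)}
    return len(nbrs & own), len(nbrs & other)

def swap_gain(edges, i, j, ga, gb):
    """Reduction in cut degree if i (in ga) and j (in gb) are swapped."""
    ic_i, ec_i = costs(edges, i, ga, gb)
    ic_j, ec_j = costs(edges, j, gb, ga)
    b_ij = int((i, j) in edges or (j, i) in edges)   # 1 if i and j are adjacent
    return (ec_i - ic_i) + (ec_j - ic_j) - 2 * b_ij

# Hypothetical graph: two clusters of three nodes each.
edges = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (5, 6), (6, 1), (6, 2)]
ga, gb = {1, 2, 3}, {4, 5, 6}
gain = swap_gain(edges, 3, 6, ga, gb)
ga_new, gb_new = (ga - {3}) | {6}, (gb - {6}) | {3}
```

In this hypothetical graph, nodes 3 and 6 each have external cost 2 and internal cost 1 and are not adjacent, so the gain is 2 and the cut degree drops from 4 to 2 after the swap.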

Definition 6. The internal cost $IC_i$ of a node $v_i \in G_A$ is equal to the number of neighbors of $v_i$ in $G_A$. The external cost $EC_i$ of a node $v_i \in G_A$ is equal to the number of neighbors of $v_i$ in $G_B = V - G_A$. In other words, the external cost $EC_i$ of the node $v_i$ is the number of outgoing edges of $v_i$ that cross the cut $C_{G_A,G_B}$, and the internal cost $IC_i$ is the number of remaining outgoing edges of $v_i$.

Lemma 1. Let $D_i = EC_i - IC_i$ be the difference between the external and internal costs of $v_i \in G_A$. Similarly, we define $D_j$ for $v_j \in G_B$. The total gain $G_{i,j}$ (i.e. the reduction in $d_{G_A,G_B}$) of swapping $v_i$ and $v_j$ can be calculated as

$$G_{i,j} = D_i + D_j - 2 b_{i,j} \tag{18}$$

Proof. Given in [15]. □

Let us revisit our running example. In Fig. 3a, initially $d_{G_1,G_2} = 4 > |D| = 3$, meaning that we have to reduce $d_{G_1,G_2}$ by swapping nodes between $G_1$ and $G_2$. To decide which two nodes to swap, we calculate the external and internal costs of all free nodes in these clusters. We then pick the two nodes with the highest gain calculated by Eq. (18). In our example, $v_8$ and $v_{17}$ have $EC_8 = EC_{17} = 2$ and $IC_8 = IC_{17} = 1$. Additionally, $b_{8,17} = 0$. The total gain of swapping $v_8$ and $v_{17}$ is then $G_{8,17} = (EC_8 - IC_8) + (EC_{17} - IC_{17}) - 2b_{8,17} = 2$, which is the highest total gain. Thus, we swap $v_8$ and $v_{17}$, and $d_{G_1,G_2}$ becomes 2. We then insert the dummy nodes $D_1$ between the two clusters, as shown in Fig. 3b.

In the presented example, we satisfied the constraint given in (17). If we cannot meet this constraint, we select the solution with the minimum $d_{G_A,G_B}$. Then, we add $|D|$ dummy nodes (i.e. the maximum number of dummy tiles) between the two clusters and connect them to all of the crossing edges. In such a case, we may sacrifice optimality.

In the clustering procedure, dummy tiles and dummy nodes play an important role, since they are the reference points between two neighboring sub-meshes and clusters. Dummy nodes are mapped onto the dummy tiles when a cluster is mapped onto the corresponding sub-mesh. Their positions on the mesh are then fixed and serve as reference points for the mapping of the adjacent cluster. By doing this, we aim to minimize the communication between two neighboring clusters. If we did not place dummy nodes, we would end up with two disjoint clusters and lose the communication traces between them during their mappings. Consequently, the final mapping result would be far from the optimal one.

After we insert dummy nodes between two adjacent clusters, we adjust the weights accordingly. For each $e_{i,j}$ with $v_i \in G_A$ and $v_j \in G_B$, we add a dummy node $v_k \in D_A$ such that $e_{i,j}$ is decomposed into two edges $e_{i,k}$ and $e_{k,j}$. The weight $w_{i,j}$ of the edge $e_{i,j}$ is assigned to the new edges as $w_{i,k} = w_{k,j} = w_{i,j}$. Fig. 3c shows the final clustering result.

4.3. Mapping and merging

Having decomposed the mesh into sub-meshes and the input graph into clusters, we apply our ILP formulation after adapting it to the new problem space. The modified ILP method maps each cluster and its attached dummy nodes onto the corresponding sub-mesh and dummy tiles based on the following formulation: Given that

$$f(d_m) = p_n, \quad \forall d_m \in D_{i-1},\ \exists p_n \in P_{i-1}, \quad |D_{i-1}| \le |P_{i-1}|,$$

find mapping functions

$$F: (V_i \cup D_i) \to (T_i \cup P_i),$$

such that

$$f(v_k) = t_l \ \text{and}\ f(d_m) = p_n, \quad \forall v_k \in V_i,\ \exists t_l \in T_i,\ \forall d_m \in D_i,\ \exists p_n \in P_i, \quad |V_i| \le |T_i|,\ |D_i| \le |P_i|,$$

and the total communication CommCost is minimized.

Fig. 3c illustrates the mapping relationship between the clusters and the sub-meshes. As stated in the formal definition above, each mapping step involves three clusters: the cluster of interest and its two neighbors, which are the clusters containing the dummy nodes. Recall that $D_0 = \emptyset$. As shown in Fig. 3c, $V_1$ (i.e. the tasks of $G_1$) and the dummy nodes of $D_1$ are mapped to the tiles of sub-meshes $M_1$ and $P_1$, respectively. After this mapping, the nodes mapped to $P_1$ are fixed to their positions for the next mappings. That is, when we map the nodes in $D_1$, $G_2$, and $D_2$, we take the mapping variables for $F_1: D_1 \to P_1$ from the previous mapping. By doing this, we aim to map communicating nodes close to each other across the dummy tiles. Finally, we remove the inserted dummy tiles after all clusters are mapped to the corresponding sub-meshes.

5. Experimental results

We evaluated our cluster-based method by comparing it with ILP solutions on real benchmarks and randomly generated graphs. Before presenting the results, we demonstrate the working procedure of our cluster-based method. For this demonstration, we used the ILP and cluster-based methods to map the multimedia benchmark 263-enc mp3-dec onto a 4 × 3 mesh. We present both mappings in Fig. 4. In this figure, (a) presents the CTG of the multimedia benchmark 263-enc mp3-dec with weights given in kbits/s, and (b) shows the ILP-based mapping onto the 4 × 3 mesh, which took 1292 s and resulted in a total communication cost of 230,407 kbits/s. Fig. 4c and d show the mesh partitioning and graph clustering steps, respectively. Finally, Fig. 4e and f present the mapping and merging results, respectively. Our mappings took 0.9 s for the first cluster and 0.3 s for the second, resulting in a total communication cost of 230,426 kbits/s. As the numbers illustrate, we achieved tremendous time savings while sacrificing 0.01% of the optimum result.
Fig. 4. Example design. (a) CTG, (b) ILP mapping, (c) mesh partitioning, (d) graph clustering, (e) cluster mapping, and (f) merging.

We tested our cluster-based method on six multimedia benchmarks and four custom-generated graphs. We give the results in Table 2. The first two columns of Table 2 list the name and the number of nodes of these graphs. Columns three and four give the total communication costs (CommCost) of the ILP-based and cluster-based methods, respectively. Note that we limit the running time of the ILP tool to 8 h and accept the best solution it returns within this time limit. In six experiments, the ILP-based method could not finish within this time limit. We mark these cases with T.o. (timeout) in column six of Table 2. Column five of Table 2 gives the difference between our cluster-based method and the ILP-based method in percent.

Table 2. Experimental results on multimedia benchmarks and custom generated graphs.

| Application | Number of tasks | Total communication (MBits/s): ILP | Total communication (MBits/s): Cluster | Difference from ILP (%) | CPU time (s): ILP | CPU time (s): Cluster |
| --- | --- | --- | --- | --- | --- | --- |
| VODP [5] | 16 | 4119 | 4205 | 2.01 | T.o. | 114.6 |
| MWD [18] | 12 | 1184 | 1184 | 0.0 | 380 | 0.3 |
| MPEG4 [18] | 12 | 3567 | 3567 | 0.0 | 7750 | 2.6 |
| 263 Dec. [16] | 14 | 19.823 | 20.098 | 1.38 | T.o. | 34.8 |
| 263 Enc. [16] | 12 | 230.407 | 230.426 | 0.001 | 5825 | 1.2 |
| Mp3 Enc. [16] | 13 | 17.021 | 17.021 | 0.0 | 8476 | 17.4 |
| Graph 1 | 20 | 1856 | 1867 | 0.59 | T.o. | 198.8 |
| Graph 2 | 25 | 2874 | 2874 | 0.0 | T.o. | 286 |
| Graph 3 | 30 | 4572 | 4613 | 0.89 | T.o. | 425 |
| Graph 4 | 35 | 5128 | 5096 | 0.62 | T.o. | 562 |

As seen from these results, we obtain the optimum results most of the time, while the worst case, the VODP benchmark, is within 2.01% of the optimum. One interesting point is the result for Graph 4, where the cluster-based method obtains a better solution than ILP. One may object to this result, claiming that the ILP-based method always obtains solutions at least as good as those of the cluster-based method. That is true if the optimum result can be obtained within the given time limit. However, as stated above, we take the best solution of the ILP-based method after 8 h if the ILP tool is still searching for a better result. That means the solution returned by the ILP tool for Graph 4 is not the optimum one. The last two columns of Table 2 show the CPU times of the two methods in seconds. These timing results show that we achieved huge time savings with very little or no communication cost overhead.

6. Conclusion and future work

In this paper, we proposed a new cluster-based method for application mapping onto NoC architectures. The proposed method is a superior version of ILP-based methods, since it can determine optimum or very close to optimum results in very short times. We believe that our cluster-based mapping method can also be used to map the tasks in such a way that the communication is distributed evenly over the chip. To do this, the graph clustering algorithm must be modified so that the flow between two clusters is minimized.

Acknowledgments

This work is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under the project ID 108E233.

References

[1] Dally WJ, Towles B. Route packets, not wires: on-chip interconnection networks. In: Proc design automation conference, Las Vegas, Nevada, USA; 2001. p. 684–9.
[2] Benini L, De Micheli G. Network-on-Chip: a new SoC paradigm. IEEE Comput 2002;35(1):70–8.
[3] http://techresearch.intel.com/articles/Tera-Scale/1421.html.

[4] Garey MR, Johnson DS. Computers and intractability: a guide to the theory of NP-completeness. San Francisco (CA): Freeman and Co.; 1979.
[5] Murali S, De Micheli G. Bandwidth-constrained mapping of cores onto NoC architectures. In: Proc DATE'04. vol. 2, 2004. p. 896–901.
[6] Hu J, Marculescu R. Energy- and performance-aware mapping for regular NoC architectures. Comput-Aid Des Integr Circ Syst IEEE Trans 2005;24(4):551–62.
[7] Srinivasan K, Chatha KS. A technique for low energy mapping and routing in Network-on-Chip architectures. In: Proc ISLPED'05, San Diego, California; 2005. p. 387–92.
[8] Janidarmian M, Khademzadeh A, Tavanpour M. Onyx: a new heuristic bandwidth-constrained mapping of cores onto tile-based Network on Chip. IEICE Electron Express 2009;6(1):1–7.
[9] Tosun S. New heuristic algorithms for energy aware application mapping and routing on mesh-based NoCs. J Syst Architect 2011;57(1) [Special Issue On-Chip Parallel And Network-Based Systems].
[10] Moein-darbari F, Khademzadeh A, Gharooni-fard G. CGMAP: a new approach to Network-on-Chip mapping problem. IEICE Electron Express 2009;6(1):27–34.
[11] Tosun S, Ozturk O, Ozen M. An ILP formulation for application mapping onto Network-on-Chip. In: 3rd International conference on application of information and communication technologies, AICT2009, Azerbaijan, Baku; 2009.
[12] Nguyen DT, Bai Y, Qin J, Han B, Hu Y. Computational aspects of linear programming Simplex method. Adv Eng Soft 2000;31(8-9):539–45.
[13] http://www.dashoptimization.com/pdf/Mosel1.pdf.
[14] Hu J, Marculescu R. DyAD: smart routing for networks-on-chip. In: Proceedings of the 41st annual design automation conference, San Diego (CA, USA); 2004.
[15] Kernighan BW, Lin S. An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 1970;49:291–307.
[16] Srinivasan K, Chatha KS, Konjevod G. Linear-programming-based techniques for synthesis of Network-on-Chip architectures. IEEE Trans Very Large Scale Integr Syst 2006;14(4):407–20.
[17] Duato J, Yalamanchili S, Ni L. Interconnection networks. Morgan Kaufmann; 2002.
[18] Chang K-C, Chen T-F. Low-power algorithm for automatic topology generation for application-specific networks on chips. IET Comput Digit Technol 2008;2(3):239–49.
