
Stochastic Multi-Objective Pareto-Optimization Framework for Fully Automated Ab Initio Network-on-Chip Design

Tzyy-Juin Kao (1) and Wolfgang Fink (1)

(1) Visual and Autonomous Exploration Systems Research Laboratory, Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ, USA

Abstract

With the advent of multi-core processors, network-on-chip design has become crucial in addressing network performance metrics, such as bandwidth, power consumption, and communication delays, when dealing with on-chip communication between the increasing number of processor cores. As the number of cores increases, network design becomes inherently more complex. Therefore, there is a critical need to solicit computer aid in determining network configurations that afford optimal performance given multiple objectives, such as resource and design constraints. We have devised a stochastic multi-objective Pareto-optimization framework that fully automatically explores the space of possible network configurations to determine optimal network latencies, power consumption, and the corresponding link allocations, i.e., the actual network-on-chip design, ab initio with only the number of routers given. For a given number of routers, average network latency and power consumption as example performance objectives can be displayed in the form of Pareto-optimal fronts, thus not only offering a powerful automatic network-on-chip design tool, but also affording trade-off studies for chip designers.

Keywords

Automated ab initio network-on-chip design; multi-objective stochastic Pareto-optimization; BookSim2.0 and gem5; network latency; power consumption


Introduction

Computational performance increases in single-core processors come with proportional increases in power consumption. Thus, soaring power dissipation has become one of the main performance bottlenecks. In contrast, multi-core processors achieve much higher computational power than single-core processors by harnessing the parallelism of computing tasks. When using several less powerful small cores that can compute data simultaneously, performance is boosted, at least in theory, proportionally to the number of cores, with only a linear increase in power consumption.

Due to the rapid increase in the number of processor cores, the data movement among all cores and memories poses another critical issue. Therefore, to keep improving system performance, network-on-chip (NoC) design has gained considerable attention in recent years. Traditionally, linking all cores with a shared bus through parallel electrical wires causes serious congestion because every core uses the same link to transmit and receive data. Although there are many existing network topologies that allocate resources with different pros and cons, most of them cannot be scaled efficiently to interconnect hundreds of cores on a chip. Therefore, many on-chip high-radix network topologies that feature low latency and better scalability have been proposed [1-3]. Implementing a well-designed network is crucial to meet the heavy-load bandwidth needs of multi-core processors.

In addition to the communication bandwidth, network performance in terms of latency and power consumption is critically affected by the chosen network topology. For example, by adding bypass links to a network, it is possible to simultaneously reduce latency and power consumption when connecting two critical routers. Therefore, even if two network designs have an identical aggregate bandwidth, their latency and power performance may differ significantly, depending strongly on the respective link allocations. In light of the above, we have devised a stochastic multi-objective Pareto-optimization framework (POF) as a fully automated ab initio design tool for NoCs that explores different combinations of network configurations to determine link allocations that optimize network performance, e.g., to arrive at low-latency and power-efficient NoC architectures. This stochastic multi-objective Pareto-optimization framework is an instantiation of the Stochastic Optimization Framework (SOF) devised by Fink in 2008 [4].

Background

In the following we provide definitions and brief descriptions of some of the technical terms used in the remainder of this paper.

Network-on-Chip (NoC)

Figure 1 shows the schematic of a small 16-core NoC. Each core generates flits, i.e., smaller pieces of a packet, and transmits them to one designated router. Most flits have to pass through a few intermediate links and routers on the way from a source node to a destination node because it is impractical and not cost-effective to implement a fully connected network, i.e., n(n-1)/2 links, where n is the number of routers. Each router puts the received flits into buffers, where they wait for the switch to direct them to the output towards the destination node. Links are the core-to-router and router-to-router connections; they have parameters, such as latency and length, based on the physical inter-router distances on the chip. The activity of all components is recorded to calculate the total power consumption.

Deterministic Routing Protocol

The routing paths used to send packets between any two routers are determined in advance. The most commonly used deterministic routing is shortest-path routing: packets follow the paths with the shortest hop count, without adapting to the current traffic load. The hop count is the performance unit that represents the total number of relay routers a data packet must pass through between any two nodes. The benefits of this protocol are its simplicity and robustness.
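To make the hop-count notion concrete, here is a minimal sketch (ours, not the authors' implementation) that computes the shortest-path hop count over a router graph via breadth-first search; the adjacency-list representation is an assumption for illustration:

```python
from collections import deque

def hop_count(adj, src, dst):
    """Minimum number of links between routers src and dst.
    adj: dict mapping each router to the list of its neighbors."""
    if src == dst:
        return 0
    visited = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adj[node]:
            if nbr == dst:
                return dist + 1
            if nbr not in visited:
                visited.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # unreachable (network has isolated groups)

# Example: 4 routers in a ring; the hop count from router 0 to router 2 is 2.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
assert hop_count(ring, 0, 2) == 2
```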

Pareto-Optimal Front

Economist Vilfredo Pareto proposed the concept that, for any allocation of resources in a multi-objective system, there exists an optimal solution at which no further improvement can be made without sacrificing one of the performance objectives. Because these performance objectives are usually conflicting, the Pareto-optimal front represents the optimal solution boundary after all performance evaluations, thus enabling trade-off studies [5].

Stochastic Optimization Framework

Figure 2 shows the functional schematic of a Stochastic Optimization Framework (SOF) [4]: A SOF efficiently samples the parameter space associated with a model, process, or system (1) by repeatedly running the model, process, or system in a forward fashion, and (2) by comparing the respective outcomes against a desired outcome, which results in a fitness measure. The goal of the SOF is to optimize this fitness by using multi-dimensional stochastic optimization algorithms, such as Simulated Annealing [6, 7], Genetic Algorithms [8, 9], or other Evolutionary Algorithms, as the optimization engine to determine optimal parameter values. One way to determine optimal parameter values would be to analytically invert models, processes, or systems, or to run them backwards. However, in many cases, as with NoC design, this is analytically or practically infeasible due to the inherent complexity and high degree of nonlinearity. A SOF overcomes this problem by effectively "inverting" these models, processes, or systems to determine parameter values that, when applied, yield the desired outcomes, or approximate them as closely as possible according to a user-defined threshold (Figure 2).
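Conceptually, the SOF forward-run/compare/propose loop can be sketched as follows (a generic sketch with hypothetical function arguments; the actual SOF [4] is optimizer-agnostic and not limited to this shape):

```python
def sof_optimize(run_forward, fitness, propose, accept, x0, max_iters, threshold):
    """Generic SOF loop (sketch).
    run_forward: runs the model/process/system forward on parameters x.
    fitness: compares the outcome against the desired outcome (lower is better).
    propose: generates new candidate parameters from the current ones.
    accept: optimizer-specific stochastic acceptance rule (may accept worse moves)."""
    x = x_best = x0
    f = f_best = fitness(run_forward(x0))
    for i in range(max_iters):
        if f_best <= threshold:        # desired outcome approximated closely enough
            break
        x_new = propose(x)
        f_new = fitness(run_forward(x_new))
        if accept(f_new, f, i):        # e.g., Simulated Annealing acceptance
            x, f = x_new, f_new
        if f_new < f_best:             # always remember the best parameters seen
            x_best, f_best = x_new, f_new
    return x_best, f_best
```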

NoC Stochastic Multi-Objective Pareto-Optimization Framework Setup

A C++ program was developed that interfaces with the widely used open-source network simulator BookSim2.0 [10] to explore link allocations and the resulting network performance (here: latency and power consumption) for any given number of routers on a chip. To quickly iterate the network simulations and evaluate the performance of each configuration, the simulation adopts synthetic traffic (uniform random) instead of real application traffic. Compared to synthetic traffic, real application traffic provides more realistic results, but takes longer to evaluate. It should be noted, though, that real traffic could be accommodated as well. Furthermore, to obtain optimal results efficiently, instead of searching all combinations exhaustively, the program employs three optimization algorithms that are detailed below:

• Random Search (RS) as a pure "guessing" performance baseline;

• Special Greedy (SG) as a deterministic optimization algorithm, generating locally optimal solutions;

• Simulated Annealing (SA) [6, 7] as a representative of a stochastic optimization algorithm that can overcome local minima (as opposed to SG), thereby approaching globally optimal solutions.

The pros and cons of the three optimization algorithms listed above are as follows: RS explores different configurations randomly across the entire search space. As such, the results are in general far from optimal. Using RS we obtain the entire spectrum of latency and power consumption, i.e., both good and bad results across the board. Any optimization that aspires to be meaningful/useful will, by definition, have to perform (much) better than RS. To that effect, SG is a custom, deterministic algorithm that is guaranteed to obtain at least a local minimum in a relatively predictable and reasonable time compared to RS. However, due to the deterministic nature of the SG algorithm, and due to the fact that NoC optimization is a multi-objective optimization challenge with numerous local minima, SG will in general not reach an optimal solution of significant practical value, let alone global Pareto-optimal solutions, which is one of the goals of this work. To overcome this deficiency of easily getting trapped in local minima, we employed SA as a representative of stochastic optimization algorithms [6-9]. As a side note, it should be mentioned that SA is mathematically proven to converge to a global minimum (e.g., [11]); however, from a practical point of view this process may take infinitely long.

Moreover, we also "parallelized" the SA-based optimization by performing numerous independent SA-runs simultaneously on the institutional cluster-computing infrastructure, with different start values and algorithmic parameters, such as start temperature, end temperature, and cooling rate (see below). For the 64-router scenario, i.e., the largest simulated NoC in this study, we limited the computation time to ten days. It should be stated, though, that the computation time is entirely dependent on the infrastructure/hardware available. As such, we refrain from listing any particular runtimes in the following. In general, the SA-based optimization can be readily scaled because each computing node or CPU can perform its own independent SA-run, as opposed to population-driven stochastic optimization algorithms, such as Genetic Algorithms and Evolutionary Algorithms (see also "Conclusion" below).


The Pareto-optimization framework program records the lowest network latency achieved in each power consumption interval, along with the corresponding NoC architecture, i.e., link allocation, for further analysis.

The overall flow of the proposed Pareto-optimization framework (Figure 3) is as follows:

1. Once the optimization process starts, one of the three optimization algorithms creates an initial network configuration and feeds it to the BookSim2.0 simulator.
2. BookSim2.0 sets up objects for nodes, routers, and links based on this configuration and starts issuing packets from each node to another according to a preselected traffic pattern.
3. Those packets, generated according to the chosen traffic pattern, are routed from their source nodes to their destination nodes via the shortest paths calculated by the deterministic routing protocol.
4. When BookSim2.0 produces the simulation results, the optimization algorithm creates the next network configuration based on the current and previous performance objective values (i.e., latency and power consumption).
5. The optimization algorithm stores the best latencies for each power interval in a log file.
6. This loop (2–6) is executed until a termination condition is met (i.e., the desired performance objective values are reached), or a maximum number of iterations is exceeded.

Given the number of routers, the number of possible links can be derived as p = n(n−1)/2, where n is the number of routers. The Pareto-optimization framework then determines the link allocation that forms the NoC. The link allocation is represented by a p-tuple, where p is the number of possible links. The elements of the p-tuple contain 0 and 1, representing absent and present links, respectively. The resulting number of combinations equals 2^p = 2^(n(n−1)/2). For example, the link allocation of a 9-router network is a 36-tuple, resulting in about 68.7 billion (2^36) combinations. Figure 4 shows the number of total network combinations as the number of routers increases. Facing this combinatorial explosion, an optimization algorithm that efficiently approximates the optimal result is therefore critical.
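For concreteness, the bit-tuple encoding and the resulting search-space size can be reproduced in a few lines (a sketch following the text; the function names are ours):

```python
def num_possible_links(n):
    # p = n(n-1)/2 possible router-to-router links
    return n * (n - 1) // 2

def num_combinations(n):
    # each link is either present (1) or absent (0), so 2**p allocations
    return 2 ** num_possible_links(n)

print(num_possible_links(9))   # 36 -> a 9-router allocation is a 36-tuple
print(num_combinations(9))     # 68719476736, i.e., ~68.7 billion
```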


As mentioned above, the program employs three optimization algorithms:

a) Random search (RS): The purpose of this random search is to provide a "guessing" baseline result for the other optimization algorithms to compare against. It is simple to implement. In each loop, it generates a random link allocation, simulates this network configuration, and stores the result in a table if the network is stable and has a better latency than the best NoC encountered so far.
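A minimal sketch of such an RS loop, simplified to track latency only (ours; evaluate_network stands in for a BookSim2.0 run and is assumed to return a (stable, latency, power) triple):

```python
import random

def random_search(p, evaluate_network, iterations, rng=random.Random(0)):
    """Baseline: sample random link allocations (p-tuples of 0/1)."""
    best = None  # (latency, allocation)
    for _ in range(iterations):
        alloc = tuple(rng.randint(0, 1) for _ in range(p))
        stable, latency, power = evaluate_network(alloc)
        if stable and (best is None or latency < best[0]):
            best = (latency, alloc)  # new best NoC encountered so far
    return best
```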

b) Special greedy (SG): This deterministic algorithm starts from a fully connected point-to-point network. All links are present, which is guaranteed to yield the lowest latency of all possible link allocations; it is the optimal solution in terms of latency, with the caveat of the highest power consumption. The following iterations then explore the neighboring networks (i.e., networks that have exactly one link fewer than the current network) one by one and move to the neighbor with the minimum latency increase. The algorithm continues until there is no stable neighboring network and no further link can be removed. The greedy algorithm eventually yields a local minimum because each next step is based on a local condition and the history of deterministic choices. However, it usually requires many fewer steps than stochastic or non-deterministic algorithms. For example, a 9-router network has 36 possible links, all of which are present at first. In the first iteration, the program simulates all 36 neighboring networks, each with one link removed. The following iterations simulate 35, 34, ..., down to 9 neighboring networks. No link can be removed once the number of links has reached 8, i.e., n−1. The total number of simulations for this algorithm is (n^4 − 2n^3 − n^2 + 2n)/8, i.e., O(n^4), where n is the number of routers.
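A sketch of this greedy descent under the same assumed evaluate_network interface (ours, for illustration):

```python
def special_greedy(p, evaluate_network, n):
    """Start fully connected; greedily remove one link at a time."""
    alloc = [1] * p
    while sum(alloc) > n - 1:            # n-1 links is the connectivity floor
        best_move = None                  # (latency, link index)
        for i in range(p):
            if alloc[i] == 0:
                continue
            alloc[i] = 0                  # tentatively remove link i
            stable, latency, power = evaluate_network(tuple(alloc))
            alloc[i] = 1                  # undo
            if stable and (best_move is None or latency < best_move[0]):
                best_move = (latency, i)
        if best_move is None:             # no stable neighboring network remains
            break
        alloc[best_move[1]] = 0           # commit the minimum-latency-increase removal
    return tuple(alloc)
```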

c) Simulated annealing (SA) [6, 7] (Figure 5): This algorithm starts from a random link allocation and keeps moving to neighboring networks based on a stochastic condition until the temperature variable is cooled below a terminating point. The neighboring network is created by flipping one random location/link within the current p-tuple of link allocations (i.e., a random element changed from 1 to 0 or vice versa). Then, the decision to either ACCEPT or REJECT the new network configuration is made by comparing the fitnesses of the two networks. The fitness E is the weighted summation of the performance objectives (here: latency and power). If the new network has a better fitness, the algorithm accepts it and moves on; if it has a worse fitness, the algorithm makes a decision based on a Boltzmann probability:

random number [0…1] < exp[−(E_temp − E_best)/T]    (1)

where E_temp is the new fitness, E_best is the best fitness recorded thus far, and T is the temperature variable. The temperature decreases after each iteration according to a user-defined cooling rate α: When the temperature is high, i.e., in the initial phases of the optimization, the algorithm has a high probability of accepting a worse network (according to (1)), thereby sifting broadly through the entire design space. As the temperature decreases gradually over the course of the optimization, the probability of accepting a worse network also decreases (according to (1)), thus homing in on a (global) optimum. At that point, only networks with better fitness will be accepted. The Simulated Annealing algorithm ends once a user-defined end or minimum temperature has been reached (Figure 5). The user-defined cooling rate α determines how quickly Simulated Annealing turns from a random search into a greedy downhill algorithm. If α is chosen very close to 1.0 (Figure 5), the temperature T decreases very slowly, so Simulated Annealing remains much longer in the random search phase, randomly searching through the entire design space and potentially not yielding good solutions quickly enough; moreover, the number of iterations will be very large. In contrast, if α is chosen closer to 0.0, the temperature decreases rapidly, turning Simulated Annealing prematurely into a greedy downhill algorithm that has a high potential of getting stuck in local minima, thus yielding inferior solutions after only a small number of total iterations, since it has not yet had the opportunity to explore large areas of the design space. Therefore, the user can influence the performance of Simulated Annealing through the empirical choice of the start temperature, the end/minimum temperature, and the cooling rate α (Figure 5). It should be emphasized that it is the Boltzmann probability (1) that affords Simulated Annealing the ability to avoid getting trapped in local minima (as opposed to, e.g., deterministic, gradient-descent based optimization algorithms), thus approaching the global minimum.
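The core SA loop, with the Boltzmann acceptance rule (1) and geometric cooling, can be sketched as follows (ours; fitness stands in for the weighted evaluation of a simulated network, and the temperature values are illustrative, chosen so that the iteration count is roughly one million, as in our runs):

```python
import math
import random

def simulated_annealing(p, fitness, T_start=100.0, T_end=0.001, alpha=0.99999,
                        rng=random.Random(0)):
    """SA over p-tuple link allocations; alpha is the cooling rate (sketch).
    ln(T_end/T_start)/ln(alpha) ~ 1.15 million iterations for these values."""
    alloc = [rng.randint(0, 1) for _ in range(p)]
    best = list(alloc)
    E_best = fitness(alloc)
    T = T_start
    while T > T_end:
        i = rng.randrange(p)
        alloc[i] ^= 1                     # flip one random link -> neighbor network
        E_temp = fitness(alloc)
        # ACCEPT if better, or with Boltzmann probability per Eq. (1)
        if E_temp < E_best or rng.random() < math.exp(-(E_temp - E_best) / T):
            if E_temp < E_best:           # remember the best design seen so far
                E_best, best = E_temp, list(alloc)
        else:
            alloc[i] ^= 1                 # REJECT: undo the flip
        T *= alpha                        # geometric cooling
    return best, E_best
```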


In our optimization runs (see Results) we preset a start and end temperature as well as a cooling rate α such that the maximum number of iterations is about one million (note: these parameters are entirely user-defined). In addition, the fitness has tunable multi-objective weights to explore the areas of the design space the user is interested in (i.e., latency and power consumption in our case). For our optimization runs, the overall fitness E of each network design is expressed as a weighted summation of latency and power consumption:

Fitness E = weight × latency + (1 − weight) × power    (2)

Because the Simulated Annealing algorithm eventually iterates and converges to one optimal result (lowest fitness), based on the tunable multi-objective weights in the fitness E, we sweep the weight parameter from 0.1 to 1 in increments of 0.1 to generate a Pareto-optimal front across a wide range of power consumption. Hence, when the weight is low (0.1 or 0.2), the algorithm explores the leftmost side of the Pareto-optimal front, where the power is low; as the weight increases, it moves gradually to the right side, where the latency is low. Programs with different weight settings can be executed in parallel on a cluster computer and record their results simultaneously, finally generating a Pareto-optimal front. The weight cannot be 0, i.e., power only, because without taking latency into consideration, Simulated Annealing gets into unstable networks with diverging latencies.
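Expressed directly, the weighted fitness (2) and the weight sweep look as follows (a sketch; run_sa_with_weight stands in for one complete SA optimization at a fixed weight, e.g., one cluster job per weight):

```python
def fitness(latency, power, weight):
    # Eq. (2): weighted summation of the two performance objectives
    return weight * latency + (1.0 - weight) * power

def weight_sweep(run_sa_with_weight):
    """One SA optimization per weight; the union of all runs forms the front."""
    results = {}
    for k in range(1, 11):               # weight = 0.1, 0.2, ..., 1.0 (never 0)
        w = k / 10.0
        results[w] = run_sa_with_weight(w)
    return results
```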

Because of the use of BookSim2.0, all resulting network designs guarantee that any two routers are connected via other routers (i.e., no orphans and no isolated groups). While the program keeps simulating each newly generated network, the two performance objectives, i.e., latency and power consumption, are monitored for the Pareto-optimization analysis. Power consumption in watts is rounded to the nearest integer value (i.e., binned) to reduce the overall amount of data that needs to be recorded. Therefore, within each integer power interval, only the minimum latency is recorded.
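The per-watt binning can be sketched as a small dictionary update (ours; the rounding-to-nearest-watt rule follows the text, the example values are made up):

```python
def record(pareto_log, latency, power, alloc):
    """Keep, per integer power bin, only the lowest latency seen so far."""
    bin_w = round(power)                        # bin by nearest integer watt
    entry = pareto_log.get(bin_w)
    if entry is None or latency < entry[0]:
        pareto_log[bin_w] = (latency, alloc)    # store the NoC design for analysis

pareto_log = {}
record(pareto_log, 18.2, 44.7, alloc=(1, 0, 1))
record(pareto_log, 17.9, 45.3, alloc=(1, 1, 0))  # same 45 W bin: better latency wins
```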


BookSim2.0 requires the latencies (in cycles) of all links to be assigned manually, since it cannot calculate the time delay for a signal to travel from one end of a link to the other based on its length. Therefore, DSENT [12] is used in our simulations to calculate in advance the minimum required latencies for different physical link lengths. Assuming that all links on a chip stretch only in horizontal and vertical directions (not diagonally, for better layout formatting) and that all routers are distributed evenly across the chip (i.e., tiled architecture [13]), each inter-router distance (i.e., in rectangular directions) can be derived from the die size and the number of routers. Whenever a link allocation is generated, the program calculates the link latencies based on their inter-router distances, and then creates a complete network configuration for BookSim2.0 to simulate.
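Under the tiled-layout assumption, the rectilinear link length and a cycle-count latency can be derived as in the following sketch (ours; the per-millimeter wire delay is an illustrative stand-in for the DSENT-derived values [12]):

```python
import math

def link_latency_cycles(r1, r2, cols, die_mm, n, delay_ns_per_mm=0.05, f_ghz=5.0):
    """Latency in cycles of a rectilinear link between routers r1 and r2.
    Routers are tiled on a sqrt(n) x sqrt(n) grid over a die_mm x die_mm chip."""
    pitch = die_mm / math.sqrt(n)               # inter-router distance in mm
    x1, y1 = r1 % cols, r1 // cols
    x2, y2 = r2 % cols, r2 // cols
    length_mm = (abs(x1 - x2) + abs(y1 - y2)) * pitch  # horizontal + vertical wiring
    return max(1, math.ceil(length_mm * delay_ns_per_mm * f_ghz))

# 16 routers on a 21 x 21 mm^2 die: router 5 to router 15 spans 2+2 grid hops
# (cf. Figure 7a), i.e., 4 inter-router-distances of rectilinear wire.
print(link_latency_cycles(5, 15, cols=4, die_mm=21.0, n=16))
```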

All network devices are based on a 32 nm CMOS technology and are assumed to operate at 5 GHz on a 21 × 21 mm² chip (i.e., die size) for the performance analysis. Table 1 below shows the configuration variables for BookSim2.0; anynet is the topology function that reads the network configuration file:

Table 1: BookSim2.0 Variables

  Variable           Value
  Topology           anynet
  Routing function   min
  Traffic            uniform
  Sample period      1000
  Injection rate     0.1

The min routing function is the deterministic routing protocol that generates routing path tables based on the shortest hop counts between routers. Uniform traffic is the random synthetic traffic pattern: all packets are generated randomly based on the injection rate of each router and are sent to a random destination. The injection rate is the frequency at which each node generates a new packet. The sample period is the cycle time of each simulation. In addition, to ensure that a network is stable (i.e., generates converging results), every network is simulated at least four times. If one of the results diverges, the simulator discards it and runs another simulation.


Results

Figure 6a shows the Pareto-optimal front between the two performance objectives, i.e., latency as a function of power consumption, for the 16-router network scenario. First, a POF-designed NoC with lower latency at the same power as the well-known mesh network is found. A mesh network consists of short links only. In contrast, the POF-designed NoC contains some longer links, which reduce the number of relay routers between some routers. A lower number of relay routers results in a decrease in latency and in the power consumed for writing data into the router buffers. The POF is thus capable of finding the balance between the performance objectives. Second, Simulated Annealing produces better results than the other two algorithms (i.e., RS and SG). A fully connected point-to-point (pt2pt) network is also plotted as a reference to show the lowest possible latency of all networks.

Figure 6b shows the Pareto-optimal front between the two performance objectives, i.e., latency as a function of power consumption, for the 36-router network scenario. Because the number of links in a 36-router network is between 35 and 630, the range of power consumption increases to a maximum of 350 watts (for a fully connected point-to-point network), which is an unaffordable cost for a multi-core processor. When the power increases to about 45 watts, the latency decreases from about 23 to 18 cycles. Hence, those networks (in the inset on the left) are the more advantageous network designs from a practical standpoint. As the power keeps increasing, the latency decreases only slightly (i.e., a flat Pareto-optimal front), indicating that the remaining designs spend additional power for little latency gain. Furthermore, the POF-designed NoC again outperforms the mesh NoC, and this gap has grown compared to the 16-router network scenario above. In addition, Simulated Annealing still finds better networks than the special greedy (SG) algorithm, which indicates that the greedy algorithm may have gotten trapped in a local minimum, as anticipated.

Figure 6c shows the Pareto-optimal front between the two performance objectives, i.e., latency as a function of power consumption, for the 64-router network scenario. When comparing the mesh network with the lowest-power POF-designed NoC, although the mesh requires only 24 watts while the POF-designed NoC requires 44 watts, there is a huge difference in latency, from about 197 cycles down to 24 cycles. This result shows that the mesh network has reached its saturation point [14] for the current traffic load and is a poor design based on the Pareto-optimal front. Other networks that use hundreds or thousands of watts are also impractical designs due to the power consumption budget of a single chip. A fully connected pt2pt network demands the highest power, more than 3000 watts, resulting from placing all 2016 links between any two routers.

The weights for the Simulated Annealing algorithm to generate the Pareto-optimal fronts for the 16-, 36-, and 64-router scenarios, respectively, are shown in Table 2: Table 2: The Weight Sweep for Simulated Annealing Algorithm

The POF-designed network topologies (i.e., link allocations) are also recorded during each simulation, allowing for the actual display of their design. The lowest-power NoCs found by Simulated Annealing are plotted in Figure 7: black lines represent the links that the mesh network topology has in common, and blue lines represent the differences with respect to the mesh network. It is important to note that Figure 7 shows the network topologies, indicating the connections between routers rather than the actual physical layout of the electric wires. Typically, electric wires are laid out in horizontal and vertical directions on a chip, distributed among multiple metal layers to avoid crosstalk, and wires in different metal layers are connected by vias (i.e., vertical interconnect accesses). For example, the longest link in Figure 7a, connecting router 5 and router 15, is 4 inter-router-distances long, i.e., 2 inter-router-distances in both the horizontal and vertical directions. Because longer links consume more power, Simulated Annealing, starting from a random network configuration, still converges to a mesh-like network that uses many one-inter-router-distance links to connect neighboring routers. The results also show that the mesh topology has limited scalability: as the number of routers increases, one-inter-router-distance links are no longer efficient in terms of latency and power consumption, and are gradually replaced by longer links (Figure 7b, c). When packets are routed, long links lower the latency and reduce the usage of relay routers.

Discussion

To appreciate, assess, and corroborate the feasibility and quality of the NoC designs found fully automatically by the introduced stochastic multi-objective Pareto-optimization framework (i.e., no human-in-the-loop), the following has to be emphasized:

1. The BookSim2.0 simulator always guarantees that a network is stable before producing latency and power results [15]. A stable network means that all packets can be sent to the correct destinations within a given (finite) simulation time. Hence, no packet is lost or stuck in the network when following the given routing protocol.

2. The deterministic routing protocol, which always directs packets along the same path between any two routers, is adopted and works properly even for the unconventional and irregular networks created by the Pareto-optimization framework. The protocol calculates the minimal paths among routers in advance and follows them even if other paths are possible. In doing so, any livelock loop (i.e., a packet travelling around and never reaching its destination) is prevented, and all path lengths are at least the shortest possible. If a deadlock occurs (i.e., some packets hold resources while requesting other resources held by other packets; if their dependencies form a cycle, none of these packets can proceed and the deadlock is permanent), the simulator aborts those unstable networks.

3. The BookSim2.0 simulator simulates a newly generated network at least four times to ensure that the network produces converging results under different random traffic. Therefore, the simulated network should be robust enough to allow smooth traffic flow without sudden congestion or uneven delays on certain paths, thus lending credibility to the latency results.

The Pareto-optimization framework uses random synthetic traffic to quickly evaluate the latency and power consumption of all the generated network combinations during the optimization process. Although more sophisticated multi-core processor simulators exist, such as gem5 [16], they are usually computationally much more expensive than BookSim2.0, i.e., their incorporation into the Pareto-optimization framework, albeit feasible in principle, is computationally rather unrealistic unless they were amenable to parallelization. Therefore, it is advantageous to first use a computationally cheaper simulator, such as BookSim2.0, to optimize NoC architectures much more rapidly in an iterative manner, and to subsequently benchmark the resulting optimal POF-designed NoC architectures with a sophisticated and comprehensive NoC simulator, such as gem5 (see section Full System Real Application Benchmarking below for a justification of this procedure).

Given the quartic expression for the number of iterations, (n^4 − 2n^3 − n^2 + 2n)/8, the special greedy algorithm becomes difficult to complete for larger router numbers: a 64-router scenario already requires about 2 million iterations per simulation, i.e., twice as many as the Simulated Annealing algorithm, and a 256-router scenario about 532 million iterations, quickly becoming impractical to use. In contrast, the complexity of Simulated Annealing is based on the user-defined temperature parameters and the cooling rate, i.e., it is independent of the number of routers. In addition, compared to special greedy, Simulated Annealing produces better results within an adjustable (i.e., user-defined) finite simulation time.
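The quoted iteration counts follow directly from the quartic expression (a quick check):

```python
def sg_iterations(n):
    # (n^4 - 2n^3 - n^2 + 2n) / 8 simulations for the special greedy descent
    return (n**4 - 2 * n**3 - n**2 + 2 * n) // 8

print(sg_iterations(64))    # 2031120    (~2 million)
print(sg_iterations(256))   # 532668480  (~532 million)
```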

So far, POF-designed NoCs tend to adopt more links to create networks with lower latency (i.e., fewer cycles) at the expense of power consumption. However, when restricting the number of links to 112 for a 64-router network, i.e., the same as a 64-router (8 × 8) mesh network, our POF simulations have so far found no better, only similarly performing, POF-networks, as shown in Figure 8. Mesh networks also retain many other benefits that POF-networks do not have, such as being able to utilize many different routing protocols and exhibiting high path diversity and fault-tolerance due to their link symmetry. POF-designed NoCs, though, show that there exist many more alternative networks to be explored/considered, while also allowing users to add further restrictions/constraints based on their respective application needs.

Full System Real Application Benchmarking

Although the introduced POF, when using BookSim2.0, is able to quickly evaluate large numbers of network combinations and generate a full Pareto-optimal front, these simulations are conducted under random synthetic traffic only, as discussed previously. Therefore, to provide detailed evaluations of a system under real-world applications and to validate our POF simulation results, we further simulated and compared a standard mesh NoC architecture and the lowest-power 16-router and 64-router POF-designed NoC architectures, respectively, using the full-system, cycle-accurate gem5 simulator [16].

The gem5 simulator consists of CPU models, a detailed cache/memory system, on-chip interconnection networks (including links, routers, routing protocols, and flow control), and a variety of cache coherence protocols. In the full-system mode, gem5 builds a system based on configuration input (e.g., our POF-designed NoC) and boots a Linux operating system on it, all in virtual space. Application benchmarks are then executed at runtime of the operating system. We selected the PARSEC benchmark suite [17, 18] for our NoC benchmarking due to its emerging parallel applications, especially for multi-core processors. Table 3 below shows the system configurations on gem5.

Table 3: gem5 System Configurations

  Core count:          16 and 64
  Core frequency:      2 GHz
  L1 I/D cache:        32 KB / 32 KB
  L2 cache per core:   2 MB
  Cache coherence:     decentralized directory-based
  State machine:       MOESI
  Link latency:        proportional to inter-router-distance
  Link bandwidth:      128 bits

We implemented the POF-designed NoC custom topologies in gem5 to simulate/assess their average network latency while running the PARSEC benchmark suite. Power consumption is evaluated by the Orion2.0 power simulator [19] interfaced with gem5. Figure 9 shows the evaluations of both the 16- and 64-router scenarios. The POF-designed NoC exhibits a similar latency reduction compared to the mesh NoC when using gem5 instead of BookSim2.0. The traffic loads generated by the random synthetic traffic and by the chosen benchmarks differ between the two simulators used (i.e., BookSim2.0 and gem5); thus, the latency results differ quantitatively. However, they are qualitatively consistent. Furthermore, no deadlock occurs, and the deterministic routing protocol also works properly in the full-system environment (i.e., gem5). The power consumption also shows results similar to the POF simulations. For the 16-router scenario, the POF-designed NoC has an average latency reduction of 3.02 cycles at a cost of 1.07 watts. For the 64-router scenario, the POF-designed NoC has an average latency reduction of 12.69 cycles at a cost of 24.95 watts. In addition, based on the previous POF simulations, POF-designed NoCs are expected to have higher saturation points, enabling them to handle higher traffic loads. Although mesh NoCs have the advantage of a much simpler and regular architecture/link layout, POF-designed NoCs, which are selected from an enormous number of combinations beyond human/manual design abilities, appear to be superior with regard to latency and power consumption, and are implementable and controllable (protocol-wise).

Conclusion

When taking inter-router distances into consideration, it is hard to find an efficient set of short and long links without the aid of a computer, especially in large-scale multi-core systems. Long links consume more power than short links, but they reduce the number of relay routers on a path between two routers, thereby decreasing the latency. Therefore, adding long links at opportune locations can be critical. The framework is capable of iterating and exploring the trade-offs between performance objectives (i.e., latency and power consumption in our case) in the form of a Pareto-optimal front. Among the three tested optimization algorithms (RS, SG, SA), Simulated Annealing proved to be the most efficient and exhibits high flexibility, because the weights for each performance objective can be fine-tuned to fulfill different design and application needs, thereby enabling trade studies.


The devised stochastic multi-objective Pareto-optimization framework shows encouraging results, paving the way towards fully automated ab initio NoC design, i.e., no-human-in-the-loop NoC design, by leveraging the increasing computing power of high-performance computation paired with well-established, well-proven multi-objective stochastic optimization techniques, such as Simulated Annealing. Earlier proposed NoC optimization frameworks have employed multi-objective Genetic Algorithms (e.g., [20, 21]) and multi-objective Evolutionary Algorithms (e.g., [22, 23]). It should be noted that Simulated Annealing and Genetic Algorithms perform similarly well if used correctly, i.e., their respective use is a matter of personal preference. However, there are implementation differences. As mentioned above, SA-based optimization can in general be readily scaled because each computing node or CPU can perform its own independent SA-run, as opposed to population-driven stochastic optimization algorithms, such as Genetic Algorithms and Evolutionary Algorithms. Moreover, as discussed before, the performance of SA is exclusively driven by the empirical choice of only the start temperature, the end/minimum temperature, and the cooling rate α for any given fitness function. In contrast, when using Genetic Algorithms, the user is faced with the empirical choice of many more parameters, such as, but not limited to: the population size, i.e., the number of parameter vectors; the probability of applying genetic operators (e.g., point mutation, crossover, and inversion [20, 21]) to the individual parameter vectors of the population; the choice of how to select crossover partners; and the selection of the clipping points, i.e., where to perform the crossover within the parameter vectors. As such, using SA makes it potentially more straightforward to tackle the actual optimization problem rather than the intricacies of the optimization algorithm itself (i.e., GA). It is for these reasons, and because of our extensive experience with applying SA successfully to highly diverse optimization tasks in the past, that we chose SA over GA.

Prior work in the area of automated NoC design appears to have focused on application mapping onto existing NoC platforms (e.g., [22, 23]) rather than on actual architecture optimization, such as link connectivity. At best, earlier proposed frameworks select among well-known, pre-existing NoC standard architectures, such as mesh, torus, binary tree, Clos, butterfly tree, ring, and polygon architectures (e.g., [21]), rather than performing ab initio NoC-architecture optimization, i.e., where only the number of routers is known but nothing else, which is the baseline for this manuscript.


Since the devised POF is agnostic to the optimization algorithm used, more effective and/or computationally efficient optimization algorithms can readily be incorporated. Furthermore, the choice of traffic model, e.g., synthetic vs. real application traffic (in our case: uniform random), does not affect the Pareto-optimization framework per se. It does, however, have an effect on the computation/optimization time for the NoC architecture, as some traffic models are computationally more expensive than others. Additional design parameters, such as load balancing, adaptive routing protocols, and photonic links [24], as well as the ability for compartmentalized optimization or re-optimization after partial reconfiguration, can be considered and incorporated in future work to further enhance the scope and quality of automated NoC designs to meet the exploding needs of multi-core systems.

Competing Financial Interests

Authors T.-J.K. and W.F. may have a competing financial interest in the stochastic multi-objective Pareto-optimization framework for automated network-on-chip design presented here, as a patent application has been filed on behalf of the University of Arizona.


References

1. J. Kim, J. Balfour, and W. Dally, "Flattened butterfly topology for on-chip networks," in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), Dec 2007, pp. 172–182.
2. Y. H. Kao, M. Yang, N. S. Artan, and H. J. Chao, "CNoC: High-radix Clos network-on-chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 12, pp. 1897–1910, Dec 2011.
3. B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, "Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees," in 2011 38th Annual International Symposium on Computer Architecture (ISCA), June 2011, pp. 401–412.
4. W. Fink, "Stochastic Optimization Framework (SOF) for Computer-Optimized Design, Engineering, and Performance of Multi-Dimensional Systems and Processes," SPIE Defense & Security Symposium, Orlando, Florida; Proc. SPIE, vol. 6960, 69600N (2008); DOI: 10.1117/12.784440 (invited paper).
5. A. V. Lotov and K. Miettinen, Visualizing the Pareto Frontier. Springer Berlin Heidelberg, 2008.
6. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, "Equation of State Calculations by Fast Computing Machines," J. of Chem. Phys., 21, 1087–1091, 1953.
7. S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, "Optimization by Simulated Annealing," Science, 220, 671–680, 1983.
8. J. H. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, Michigan, 1975.
9. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
10. N. Jiang, J. Balfour, D. U. Becker, B. Towles, W. J. Dally, G. Michelogiannakis, and J. Kim, "A detailed and flexible cycle-accurate network-on-chip simulator," in 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2013, pp. 86–96.
11. M. Locatelli, "Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions," Journal of Optimization Theory and Applications, 104: 121 (2000). https://doi.org/10.1023/A:1004680806815
12. C. Sun, C.-H. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, and V. Stojanovic, "DSENT – a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling," in Proc. 6th IEEE/ACM Int. Symp. Netw. Chip (NoCS), May 2012, pp. 201–210.
13. J. Balfour and W. J. Dally, "Design tradeoffs for tiled CMP on-chip networks," in Proceedings of the 20th Annual International Conference on Supercomputing, ser. ICS'06. New York, NY, USA: ACM, 2006, pp. 187–198.
14. S. Park, T. Krishna, C.-H. Chen, B. Daya, A. Chandrakasan, and L.-S. Peh, "Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI," in Proceedings of the 49th Annual Design Automation Conference, ser. DAC '12. ACM, 2012, pp. 398–405.
15. N. Jiang, G. Michelogiannakis, D. Becker, B. Towles, and W. J. Dally, "BookSim 2.0 user's guide," Stanford University, March 2010.
16. N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
17. C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC Benchmark Suite: Characterization and Architectural Implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, October 2008.
18. M. Gebhart, J. Hestness, E. Fatehi, P. Gratz, S. W. Keckler, "Running PARSEC 2.1 on M5," The University of Texas at Austin, Department of Computer Science, Technical Report #TR-09-32, October 27, 2009.
19. A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, "ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration," in Proceedings of the Conference on Design, Automation and Test in Europe (2009), pp. 423–428.
20. R. K. Jena and G. K. Sharma, "A multiobjective evolutionary algorithm-based optimisation model for network on chip synthesis," International Journal of Innovative Computing and Applications, vol. 1, no. 2, January 2007, pp. 121–127; DOI: 10.1504/IJICA.2007.016793.
21. A. A. Morgan, H. Elmiligi, M. W. El-Kharashi, and F. Gebali, "Multi-objective Optimization of NoC Standard Architectures using Genetic Algorithms," in Proceedings of the 10th IEEE International Symposium on Signal Processing and Information Technology, 2010; DOI: 10.1109/ISSPIT.2010.5711730.
22. M. V. Carvalho da Silva, N. Nedjah, L. de Macedo Mourelle, "Power-Aware Multi-objective Evolutionary Optimization for Application Mapping on NoC Platforms," in Trends in Applied Intelligent Systems, N. Garcia-Pedrajas et al. (Eds.): IEA/AIE 2010, Part II, LNAI 6097, pp. 143–152. Springer-Verlag Berlin Heidelberg, 2010.
23. N. Nedjah, M. V. Carvalho da Silva, L. de Macedo Mourelle, "Customized computer-aided application mapping on NoC infrastructure using multi-objective optimization," Journal of Systems Architecture 57 (2011) 79–94.
24. T. J. Kao and A. Louri, "Optical multilevel signaling for high bandwidth and power-efficient on-chip interconnects," IEEE Photonics Technology Letters, vol. 27, no. 19, pp. 2051–2054, Oct 2015.


Figure 1. A typical example of a 16-core NoC. (Figure: a 4 × 4 array of tiles, each tile consisting of a core with L1-I/L1-D and L2 caches attached to a router R; routers exchange packets over inter-router links.)


Figure 2. Functional schematic of a Stochastic Optimization Framework (SOF; after [4]).


Figure 3. Schematic of the Pareto-optimization framework setup – an instantiation of a Stochastic Optimization Framework (SOF; after [4]). (Diagram: the optimization algorithm generates the next network configuration; BookSim2.0 creates all necessary objects of a network, simulates performance with synthetic traffic, and produces network latency and power consumption results that feed back into the optimization algorithm.)


Figure 4. Total number of network combinations as a function of number of routers in a semilogarithmic plot.


Figure 5. Schematic flowchart diagram of the Simulated Annealing optimization algorithm.


Figure 6. The Pareto-optimal front results for (a) 16-router networks, (b) 36-router networks, and (c) 64-router networks under random traffic (latency in cycles as a function of power in watts). In addition to the three optimization algorithms (Random, Greedy, Simulated Annealing), a mesh network and a fully-connected point-to-point (pt2pt) network are also plotted. Panel (b) marks Region 1 (more advantageous network designs) and Region 2 (wasting power); panel (c) shows that the mesh's latency increases significantly, indicating that it has reached its saturation point at an injection rate of 0.1 packets/cycle/node, while high-power networks are infeasible due to soaring power consumption. The Simulated Annealing uses different weight parameters (0.1, 0.2, ..., 1.0) for various power consumption regimes and combines the results to form the Pareto-optimal front, e.g., for the 16-router NoC scenario (a): for power 7 weight=0.1, for power 15 weight=0.9, and for power 22 weight=1.0.


Figure 7. Compared to the mesh topology, POF-designed NoCs add more and longer links in place of shorter links as the network size increases (see the link length distribution histograms in the figure: normalized number of links versus normalized inter-router-distance, for the mesh NoC and the POF-designed NoC). Black lines represent links that the mesh also has; blue lines represent links that the mesh does not have. The total numbers of links are: (a) 16 routers: 28 for the POF-NoC versus 24 for mesh, (b) 36 routers: 78 for the POF-NoC versus 60 for mesh, and (c) 64 routers: 206 for the POF-NoC versus 112 for mesh.


Figure 8. Two POF-designed NoCs with the restriction of 112 links compared with two mesh networks. (Plot: latency in cycles versus power in watts; curves for a mesh with shortest-path routing, a mesh with dimension-order routing, a POF-designed 112-link network with shortest-path routing, and a POF-designed 206-link network with shortest-path routing.)


Figure 9. Average network latency (in cycles) and power consumption (in watts) comparison between mesh NoCs (blue bars) and POF-designed NoCs (orange bars) for (a) the 16-router and (b) the 64-router scenario, across the PARSEC benchmarks (blackscholes, bodytrack, canneal, fluidanimate, ferret, freqmine, swaptions, vips, x264) [17, 18], run on the full-system cycle-accurate gem5 simulator [16] with the Orion2.0 power simulator [19].


Dr. Tzyy-Juin Kao received his bachelor's and master's degrees in electronics engineering from National Tsing Hua University, Taiwan, in 2008 and 2011, respectively, and his Ph.D. from the University of Arizona in 2018, where he performed his research in the Visual and Autonomous Exploration Systems Research Laboratory under the direction of Prof. Wolfgang Fink. His research interests include computer architecture, silicon photonics, network-on-chip, high performance computing, and performance and power simulation.


Prof. Dr. Wolfgang Fink is the inaugural Edward & Maria Keonjian Endowed Chair of Microelectronics with joint appointments in the Departments of Electrical and Computer Engineering, Biomedical Engineering, Systems and Industrial Engineering, Aerospace and Mechanical Engineering, and Ophthalmology and Vision Science at the University of Arizona in Tucson. He was a Visiting Associate in Physics at the California Institute of Technology (2001–2016) and a Visiting Research Associate Professor of Ophthalmology and Neurological Surgery at the University of Southern California (2005–2014). Dr. Fink is the founder and director of the Visual and Autonomous Exploration Systems Research Laboratory at Caltech (http://autonomy.caltech.edu) and at the University of Arizona (http://autonomy.arizona.edu). He was a Senior Researcher at NASA’s Jet Propulsion Laboratory (2001–2009). He obtained a B.S. and M.S. degree in Physics and Physical Chemistry from the University of Göttingen, Germany in 1990 and 1993, respectively, and a Ph.D. in Theoretical Physics from the University of Tübingen, Germany in 1997. Dr. Fink’s interest in human-machine interfaces, autonomous/reasoning systems, and evolutionary optimization has focused his research programs on artificial vision, autonomous robotic space exploration, biomedical sensor/system development, cognitive/reasoning systems, and computer-optimized design. Dr. Fink is an AIMBE Fellow, PHM Fellow, UA da Vinci Fellow, UA ACABI Fellow, and Senior Member IEEE and SPIE. He has over 253 publications (including journal, book, and conference contributions), 6 NASA Patent Awards, as well as 21 US and foreign patents awarded to date in the areas of autonomous systems, biomedical devices, neural stimulation, MEMS fabrication, data fusion and analysis, and multi-dimensional optimization.
