
To appear in: Integration, the VLSI Journal
PII: S0167-9260(16)30091-8
DOI: http://dx.doi.org/10.1016/j.vlsi.2016.10.015
Reference: VLSI1261
Received: 29 April 2016; Revised: 12 September 2016; Accepted: 13 October 2016
Cite this article as: Adam Kostrzewa, Selma Saidi, Leonardo Ecco and Rolf Ernst, Ensuring Safety and Efficiency in Networks-On-Chip, Integration, the VLSI Journal, http://dx.doi.org/10.1016/j.vlsi.2016.10.015

Ensuring Safety and Efficiency in Networks-On-Chip

Adam Kostrzewa+, Selma Saidi*, Leonardo Ecco+, and Rolf Ernst+

+ Technische Universität Braunschweig, Germany
* Technische Universität Hamburg, Germany

Abstract— Networks-on-Chip (NoCs) for real-time systems require solutions for safe and predictable sharing of network resources between transmissions with different quality-of-service requirements. In this work, we present a mechanism for global and dynamic admission control in NoCs dedicated to real-time systems. It introduces an overlay network to synchronize transmissions using arbitration units called Resource Managers (RMs), which allows global and work-conserving scheduling. We present a formal worst-case timing analysis for the proposed mechanism and demonstrate that this solution not only achieves higher performance in simulation but, even more importantly, consistently reaches smaller formally guaranteed worst-case latencies than TDM for realistic levels of system utilization. Our mechanism does not require the modification of routers and can therefore be used together with any architecture utilizing non-blocking routers.

Keywords: Networks-on-chip, Real-time systems, Safety-critical multicores, Dynamic admission control, Shared resources.

I. INTRODUCTION

Ensuring safety in Multiprocessor Systems-on-Chip (MPSoCs) typically implies worst-case dimensioning to provide guarantees for the response time of running applications. However, providing worst-case service guarantees in such setups is difficult. Indeed, even with a static task-to-processor mapping, the execution of applications is usually not independent due to accesses to shared resources, where tasks running on different processors interfere. Moreover, in modern MPSoCs, to conduct a single memory access, a task must acquire several resources, e.g. the interconnect and the memory controller, with independent arbiters and often provided by different vendors. The designer must therefore assure that the effects resulting from coupling these different arbiters will not lead to pessimistic formal guarantees or decreased utilization. Consequently, such systems require mechanisms which allow a seamless integration of critical workloads by assuring a composable, efficient and safe coordination of accesses to the shared resources.

Networks-on-Chip (NoCs) are frequently considered in multi- and manycore architectures as the interconnect solution for future real-time systems due to their modular design allowing superior efficiency and scalability. However, in safety-critical domains, such as automotive and avionics, the predominant requirement is to provide temporal guarantees, i.e. to prove that the worst-case system behavior is predictable and adheres to the application's timing constraints resulting from real-time requirements, e.g. worst-case network latency. This requires spatial and temporal separation of concurrent transmissions, e.g. the “sufficient independence” requested by the avionics safety standard DO-178B or, following the same principle, the “freedom from interference” requested by the automotive standard ISO 26262.

Commonly used wormhole-switched NoCs with multi-stage arbitration are usually not designed to meet these requirements but rather to deliver high average performance, e.g. [1], [2]. In such NoCs, in order to progress, ongoing transmissions must acquire buffers and links separately in each router along their path, i.e. packets are switched as soon as they arrive and all traffic receives equal treatment. Moreover, some interference cannot be resolved locally by the router's arbiter and requires inputs from adjacent neighbours, e.g. a joint allocation of the crossbar switch and the router output due to a possible lack of buffers at the output (input-buffered router). This results in a complex spectrum of direct and indirect interferences between data streams which may endanger the system safety, since the service depends on the runtime behavior of other streams. Consequently, such networks are considered to be hardly analyzable and generally not applicable to safety-critical systems. Nevertheless, standard NoC architectures remain appealing since they are affordable, fast and flexible [3].

Safety, in the context of NoCs, has already been addressed by many custom mechanisms and architectures providing safe and predictable sharing of the interconnect resources (e.g. links and buffers) in order to bound or avoid the interference between concurrent transmissions. There exist two established solutions to this problem: non-blocking routers with rate control [4] and Time-Division Multiplexing (TDM) [5]. The first mechanism is based on a local arbitration performed independently in routers, conducting dynamic scheduling between transmissions competing for the same output port. Although this approach is capable of providing worst-case guarantees, in particular when using virtual channels for isolation, it comes at a high hardware cost and does not scale well with the number of isolated streams. TDM offers a different solution where each transmission receives, in a cyclic order, a dedicated time slot with exclusive access to the NoC. TDM allows an easy implementation and provides timing guarantees, but also results in average latencies which are very close to the worst case even when the system is not highly loaded [6]. This is mainly because traffic from general-purpose applications hardly ever follows the constant and predictable pattern assumed by TDM schemes. Hence, an efficient execution is only possible for a single selected use-case with a known and static behavior for which the TDM scheme is fully optimized.

The contribution of this work is an alternative mechanism for providing efficient and safe sharing of resources in NoCs dedicated to real-time systems.

We introduce a global and dynamic admission control based on scheduling units, called Resource Managers (RMs), with which applications have to negotiate their accesses to the NoC. Transmissions are scheduled using round-robin arbitration and are granted exclusive access to network resources. Synchronization is achieved using control messages and a dedicated protocol. In particular, we investigate the effect of combining this mechanism with the SDRAM controller, since memory traffic constitutes most of the traffic in an MPSoC system. The proposed mechanism overcomes the limitations of state-of-the-art solutions and allows applying existing wormhole-switched and performance-optimized NoCs in safety-critical domains, without complex hardware modifications. As the main advantages, we highlight: 1) It decreases the hardware overhead compared to non-blocking routers due to a global arbitration. 2) It reduces the temporal overprovisioning compared to TDM due to the work-conserving scheduling. 3) The presented solution does not require the modification of routers and can therefore be used in conjunction with any architecture utilizing non-blocking routers. 4) The proposed approach preserves the locality of transfers and greatly simplifies the design of the SDRAM controllers used in the platform. As a matter of fact, a simple SDRAM controller that serves incoming requests using first-come first-served arbitration suffices to provide tight timing bounds at a very low hardware overhead.

The rest of the paper is structured as follows: Section II provides a detailed discussion of the related work. In Section III we discuss our system model and initial assumptions. Section IV describes the flow of the mechanism. Based on that, a formal timing analysis is presented in Section V. In Section VI, we describe the design of a simplified SDRAM controller to be used in conjunction with the proposed admission control layer. Finally, Section VII presents an experimental evaluation and validation of the timing properties along with the corresponding overheads.

II. RELATED WORK

Multiple research efforts have investigated the problem of ensuring worst-case guarantees in NoCs. The most frequently deployed solution is enforcing isolation with time-division multiplexing (TDM), such as [5, 7]. According to this scheme, resources are shared in time and each application has a dedicated static time slot during which it acquires exclusive access to the NoC. TDM-based systems are easy to implement and analyze, and the global TDM-based scheduling guarantees the absence of contention and therefore reduces the amount of necessary hardware resources, e.g. buffers or logic in routers. However, TDM-based arbiters offer static, non-work-conserving scheduling, since the network latency depends only on the duration of the cycle, i.e. the number of applications and the size of their time slots, and not on the frequency of their accesses to the interconnect. Moreover, if an application does not have any pending transfer, its time slot is wasted. Therefore, the efficient utilization of such architectures is only possible when the system is highly loaded, i.e. continuously requested. This implies dedicated solutions, such as an offline-generated schedule statically applied to the whole NoC [8]. Consequently, average latencies of non-optimized applications are close to the worst case even in lightly loaded systems, introducing a major temporal overhead [6].
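To make the non-work-conserving behavior concrete, the following minimal sketch contrasts the waiting time of a single request under TDM with that under a work-conserving arbiter; all parameters (slot length, number of senders, arrival times) are hypothetical and chosen only for illustration, not taken from the paper.

```python
# Minimal sketch (hypothetical parameters): waiting time of a single request
# under TDM versus a work-conserving arbiter on an otherwise idle NoC.

def tdm_wait(arrival, sender, n_senders, slot):
    """Cycles a request waits until the start of its sender's next TDM slot."""
    cycle = n_senders * slot                          # one full TDM cycle
    t = (arrival // cycle) * cycle + sender * slot    # sender's slot in this cycle
    if t < arrival:                                   # slot missed: wait a whole cycle
        t += cycle
    return t - arrival

def work_conserving_wait(pending_ahead, slot):
    """Work-conserving arbitration: waiting depends only on transmissions
    actually pending ahead of the request; idle slots are never inserted."""
    return pending_ahead * slot

if __name__ == "__main__":
    # Sender 0 of 4, slot length 100 cycles, activation jittered 1 cycle past
    # its slot start (cf. Fig. 1.b): TDM blocks for almost a full cycle.
    print(tdm_wait(arrival=1, sender=0, n_senders=4, slot=100))  # -> 399
    # A work-conserving arbiter would start the transmission immediately.
    print(work_conserving_wait(pending_ahead=0, slot=100))       # -> 0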

Fig. 1. Effect of jitter on the total latencies of applications synchronized with TDM.

Consider the example depicted in Fig. 1, composed of two applications running in a NoC-based MPSoC arbitrated using TDM. App1 is composed of three tasks (A, B and C) with precedence constraints and App2 is composed of one task (D). When the behavior of the system is static and fully predictable, running tasks can be synchronized using TDM cycles, see Fig. 1.a), exploiting their periodic activation scheme where an application's transmission starts in a cyclic way whenever it is granted a time slot. However, whenever tasks expose dynamics in their behavior, even with a small jitter, transmissions are blocked and their execution is delayed for the duration of a whole TDM cycle, see Fig. 1.b). The activation of task A arrives with a small jitter and misses its slot, therefore tasks B and C cannot start their execution before the next cycle. Consequently, task B is granted access in the second cycle, but due to an activation jitter of task C, a third TDM cycle is required for App1 to complete its execution. Such a drastic decrease in system performance forces designers to optimize setups, e.g. by reducing the number of tasks running on the processing nodes or modifying the functionality and the number of tasks, i.e. to move from requirements resulting from the nature of the controlled processes to requirements enforced by the platform. Moreover, the resulting architectures are not flexible, and the worst-case guarantees can easily be endangered by changes in the toolchain, in the distribution of the load on the cores, or by modifications in the software (adding/removing tasks) as well as the hardware.

Furthermore, as the overhead of TDM depends only on the duration of the cycle, using large slots increases the latency of all applications [5]. Small slots lead to a distribution of longer transmissions over several TDM cycles, even when the remaining slots are not used, which drastically decreases performance. Moreover, too short TDM slots are undesirable when accessing stateful shared resources that benefit from spatial locality, such as DDR memories. In case of short TDM slots, longer transmissions are distributed over multiple TDM cycles and packets from different senders may freely interleave in the memory controller, potentially leading to undesirable timing effects [9]. Therefore, in order to employ short TDM slots in safety-critical systems, the sufficient independence between interfering senders must be enforced using specialized predictable memory controllers which, to deal with the lack of locality, employ a combination of a close-page policy and static bank interleaving. This introduces custom hardware, increasing costs and power consumption.

In order to mitigate these effects, multiplexing of time slots between several channels with independent TDM schedules was proposed in [6]. The effectiveness of such approaches depends directly on the number of independent channels and their utilization. If there are only a few heavily utilized channels, the performance improvement will be low.

Moreover, such schemes rely on static budgets, which leads to the same problems as with the TDM-slot granularity. Other approaches, such as SurfNoC [7], employ optimized TDM scheduling to minimize the latency overhead. This is performed by replacing the cycle-by-cycle TDM schedule with more flexible solutions, e.g. domain-oriented waves. Although this decreases the negative side-effects of TDM arbitration, it does not fully eliminate them, providing only a more optimized solution. Moreover, these solutions require much more complex routers, thereby drastically increasing hardware costs and power consumption.

The alternative solution to TDM is based on non-blocking routers with rate control, where arbitration is performed locally in routers [10]. In such NoCs, transmissions must acquire output ports in routers as they progress through the interconnect, and the arbitration is performed locally and independently within each router. Additionally, this scheme requires ensuring that a packet blocked on an output port cannot block a link for other arriving packets. The implementations in the scope of NoCs frequently utilize low-level priority-preemptive scheduling (at the flit level) between VCs, based on commonly deployed wormhole switching and virtual-channel flow control, e.g. [2], QNOC [11] and [12]. Another implementation of this general principle is offered by MANGO [13], where buffering is performed through whole virtual circuits rather than output ports. The scheduling may optimize different NoC properties such as latency, buffer sizes and utilization [10]. Consequently, this allows offering guaranteed service for safety-critical streams by mapping them to VCs with higher priorities, while running best-effort transmissions on low-priority VCs, i.e. forming hybrid, mixed-criticality systems. For instance, in [14] the authors proposed a new protocol, WPMC, which introduces two modes of operation for the NoC. In the low-criticality mode, transmissions of all criticality levels are scheduled based on their typical-case behavior. When a monitor detects a violation of this behavior, the NoC switches to a high-criticality mode where only streams allocated to VCs of a high criticality are scheduled. Another advanced architecture applying similar principles is IDAMC [15], which uses a back-suction mechanism [16] to maximise the bandwidth given to low- (or non-) critical messages while ensuring that high-criticality messages arrive by their deadline.

Although the approaches based on non-blocking routers offer isolation, support for multiple criticality levels and work-conserving scheduling, they introduce a high hardware overhead. Firstly, the arbitration is based on the assumption that packets can be forwarded as they arrive [17], i.e. there is no back pressure and no correlation between routers, which requires larger buffers than TDM-based approaches [5]. Secondly, a major challenge emerges from the constant increase in the number of applications integrated into a single chip, such as the Flight Management System [18], encompassing multiple tasks with different importance to the system's safety (criticality levels). In the case of non-blocking routers, the number of VCs must be equal to the number of criticality levels and therefore must increase accordingly, otherwise the system is not predictable [5]. Finally, since local arbitration is applied, the same issue appears with the locality of memory accesses, which cannot be guaranteed.

In this work, we consider a system in which applications must be mapped to the same VC due to insufficient hardware resources, therefore excluding solutions based on non-blocking routers. The proposed solution overcomes the drawbacks of the previously described mechanisms. It reduces the hardware overhead compared to non-blocking routers due to the global arbitration, which decreases the blocking and the size of the necessary buffers in routers. Moreover, it supports safe and predictable sharing of the same VC between different transmissions. In comparison to TDM, our mechanism drastically decreases average latencies, i.e. temporal overprovisioning, due to the work-conserving arbitration. Finally, the proposed mechanism maintains the locality of memory accesses [19] through the isolation of a whole transmission. This improves performance and decreases the power consumption of DRAM memory modules, which constitute the most common hot-module in MPSoCs.

The proposed solution is inspired by the principles of Software Defined Networks (SDNs) [20], developed originally for off-chip interconnects. It is based on disassociating the mechanisms responsible for decisions about where traffic is sent (the control plane) from the underlying physical infrastructure which forwards traffic to the selected destination (the data plane). Existing NoC architectures implementing the concept of SDN concentrated on the reconfiguration of independent NoC switches to optimize the average performance [21], [22], [23] by adjusting the scheduling arbitration during runtime. The proposed solution differs significantly from these approaches as it: i) controls traffic at the source, i.e. checks the availability of the interconnect resources before the sender can obtain physical access to the interconnect, and ii) allows providing worst-case guarantees for synchronized senders. To the best of our knowledge, in the context of NoCs, client-server admission control schemes have so far only been applied to improve average latencies of synchronized transmissions by limiting the access rates to hot-modules, such as in [24] and [25].

III. PRELIMINARIES

In real-time NoCs, synchronization between interfering transmissions is essential, since the effects of both back pressure and head-of-line blocking in routers, where switch arbitration between packets/flits is performed, may endanger the system's safety [24], [5], [11]. Therefore, in order to provide service guarantees, e.g. an upper bound on the worst-case latencies, we must arbitrate between interfering transmissions, i.e. transmissions sharing the same VC and whose paths overlap in at least one physical link. Our goal is to provide a validation method to check if the currently available resources are sufficient for a requested transmission, i.e. admission control, in order to acknowledge or deny the physical access to the interconnect for this transmission. The evaluation is based on information about all other transmissions/applications currently running in the system, i.e. the global state of the NoC. Additionally, as this state may change during runtime, e.g. due to dynamically arriving external events such as interrupts, we dynamically adjust the arbitration to efficiently accommodate arriving workloads. Therefore, the proposed mechanism introduces work-conserving arbitration, i.e. it tries to keep the NoC resources busy whenever there are ready pending transmissions.
In the following, we show that this increases the system's utilization and reduces the pessimism of the worst-case guarantees in setups which do not fully utilize the interconnect resources.
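Whether two transmissions interfere in this sense is a purely structural check: same VC and at least one shared physical link. The following minimal sketch (the data layout and helper names are ours, not from the paper) groups transmissions into the disjoint synchronization scenarios that Sec. IV assigns to individual RMs.

```python
# Minimal sketch (hypothetical data structures): two transmissions interfere
# iff they use the same VC and their paths share at least one physical link;
# disjoint synchronization scenarios are the connected components of that
# interference relation.

from itertools import combinations

def interfere(t1, t2):
    return t1["vc"] == t2["vc"] and bool(set(t1["path"]) & set(t2["path"]))

def synchronization_scenarios(transmissions):
    # Union-find over transmission indices.
    parent = list(range(len(transmissions)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(transmissions)), 2):
        if interfere(transmissions[i], transmissions[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(transmissions)):
        groups.setdefault(find(i), []).append(transmissions[i]["id"])
    return list(groups.values())

if __name__ == "__main__":
    # Links named as (router, router) pairs along each deterministic route.
    ts = [
        {"id": "T1", "vc": 0, "path": [(0, 1), (1, 2)]},
        {"id": "T2", "vc": 0, "path": [(1, 2), (2, 3)]},  # shares link (1,2) with T1
        {"id": "T3", "vc": 0, "path": [(4, 5)]},          # disjoint: own scenario/RM
    ]
    print(synchronization_scenarios(ts))  # [['T1', 'T2'], ['T3']]
```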

Fig. 2. Global admission control at the top-layer before sending data on the bottom-layer.
Fig. 3. Workflow of the RM-based admission control in a NoC.

The proposed mechanism can be applied to different NoC architectures. However, we make the following assumptions about the underlying architecture. We assume a 2D mesh network implementing prioritized virtual channels (VCs) (cf. [11]) and wormhole switching (cf. [2]), where packets are decomposed into flits which constitute the granularity of transmission and arbitration. Each virtual channel has an assigned fixed static priority. Routers schedule transmissions according to i) the availability of buffers in the next router, ii) the priority of a particular VC, and iii) a packet-based round-robin arbitration between the input ports for packets using the same VC. Finally, we assume that the traffic requiring guaranteed service imposes a high load but exhibits well-understood and predictable patterns, e.g. streaming applications [26] or predictable execution models such as superblocks [27], [28]. Therefore, we assume that the traffic requiring service guarantees is DMA-based, thereby targeting system-level predictability, i.e. the guarantees are given for transmissions composed of multiple packets, not for single cache lines. In safety-critical MPSoC platforms, memory traffic performed using DMA engines is predominant and constitutes the main bottleneck for both performance and predictability [27].

IV. DYNAMIC ADMISSION CONTROL IN NOCS

In this work, we propose a global and dynamic mechanism for admission control in NoCs used for real-time systems. The proposed solution introduces an overlay network built over the existing NoC architecture, as depicted in Fig. 2. Therefore, we distinguish between the top and bottom resource arbitration layers. The bottom layer refers to the low-level NoC architecture, e.g. the flow-control method responsible for the switching of particular packets/flits without considering the logical dependencies between packets, e.g. locality. The locality principle requires that packets belonging to the same transmission/stream reach the destination in a pre-defined order, e.g. without interleaving with packets from other senders. In order to comply with the isolation requirement, we introduce the top layer, which is used to arbitrate between interfering transmissions, i.e. transmissions sharing the same VC and whose paths overlap in at least one physical link, see Sec. III. We refer to a set of interfering transmissions as a synchronization scenario, arbitrated/synchronized using a scheduling unit - the Resource Manager (RM). Each sender belonging to a synchronization scenario must, for each transmission, request a permission from the RM. This admission control is performed at each network node using local supervisors called clients, which trap outgoing transmissions and synchronize them with the RM. The RM conducts dynamic scheduling between entire transmissions constructed from multiple packets. Different arbitration policies can be employed, as long as they obey the following rules: i) only one transmission from the synchronization scenario can be active at a time, ii) the change of the active transmission is possible only after the previous one has finished or has been preempted by a transmission request with a higher priority, and iii) the RM must prevent starvation of requestors. For easier comprehension, in this work we apply standard round-robin scheduling for resolving arbitration between interfering transmissions on their native and detoured paths. However, priority-based solutions [29], complex mixed-criticality setups [30], as well as TDM-based arbitration [31] are also possible.

The proposed mechanism has the following advantages: 1. Predictability, through the isolation of the entire transmission composed of multiple packets. Hence, timing guarantees are provided for the whole transmission instead of a single packet. 2. Efficiency, since it becomes possible, given the global view, to dynamically adapt to the current state of the system and perform optimizations. 3. Preservation of the locality of accesses, which is particularly suitable for memory traffic performed using DMA. In safety-critical MPSoC platforms, memory traffic constitutes the main bottleneck for both performance and predictability [27, 32].

The communication between processing nodes and the RM is protocol-based and is realized with special control messages transmitted on a dedicated virtual channel. The transmission of data starts only after being acknowledged by the RM. Later, when the transmission completes, the sender must inform the RM in order to release the resources.

A. Protocol

The communication between the RM and the clients is performed with a protocol using three control messages: reqMsg (request), relMsg (release) and ackMsg (acknowledge). The workflow (cf. Fig. 3) is explained in the scope of the conducted operations. Whenever a sender is trying to start a transmission, its request is trapped by the client. The corresponding client then sends a request message to the RM to obtain access to the network (clnt_req()). The RM is equipped with a queue for storing pending requests from clients. If the queue is empty, the RM must wait for a new request to arrive. Otherwise, the scheduler decides which request from the queue should be served first (rm_proc()). Each RM is sequential and serves one request at a time. The selected request is removed from the queue, and from this moment on the resource is considered to be occupied. After that, the RM must notify the selected sender with the ackMsg. After receiving the ackMsg, the communication may start (clnt_strt()). Once granted, the connection holds until the end of the transmission or until it is aborted by the client based on a predefined timeout used to prevent unbounded connection times. When the client detects the end of the transmission (e.g. based on its time budget or the injection of the last flit), it issues a relMsg to the RM (clnt_rel()). As soon as the relMsg arrives, the RM considers the resource to be free again (rm_rel()).

The latency of control messages is crucial for the performance of the proposed mechanism. We use prioritized VCs and assign a dedicated VC (with the highest priority) to control messages in order not to be blocked by any other traffic. More generally, control messages can be allocated to any available VC capable of giving latency guarantees, to a dedicated independent control NoC, or to signal lines for maximum performance.

Fig. 4. Structure of the client’s logic as an extension of the NI.

Recall that, in the considered embedded safety-critical systems, the behavior and characteristics of real-time applications are usually well specified and tested, cf. Sec. III, contrary to off-chip networks where the behavior of nodes is often unknown at design time. This allows a trade-off analysis during the design phase, providing an estimation of the overhead resulting from the global arbitration. For instance, the designer may decide to monitor and enforce a minimum distance (i.e. time) between consecutive request messages from some applications (without endangering their deadlines) in order to limit the number of necessary synchronizations and the interference. Additionally, a client may limit the number of outstanding requests from a given sender, i.e. grant no permission for sending a new request before receiving the release for an already granted one. This may also increase the mechanism's scalability and remove the hot-module effect in the case of very large synchronization scenarios, cf. [24].

B. Implementation of the RMs and Clients

In the following, we discuss a hardware implementation of the modules necessary for the operation of the proposed overlay network, including their requirements and introduced overheads. Note that the clients as well as the RMs can be implemented as independent hardware modules controlling accesses before they arrive at the network interfaces (NIs) connecting processing nodes to the NoC, or fully in software. This allows a smooth integration with existing, commercially available components and MPSoC architectures, e.g. Tilera Tile64 or Kalray MPPA, without expensive and custom router modifications. Moreover, the actions of the introduced modules can be completely transparent to the running applications, which allows compatibility with legacy software, i.e. there is no need for costly code adjustments.

Clients are designed as an extension of the NIs, see Fig. 4. Whenever a sender running on a node is trying to access the NoC, the NI augmented with a client must detect and trap this access (intercept) and distinguish between different transmissions. This is done by the qualifier module, which is equipped with a lookup table (LUT) with an entry for each transmission requiring synchronization, allowing the qualifier to pre-filter outgoing communication. For instance, senders which are not in the LUT are automatically blocked and no request is sent to the RM. This validation may be optional, as the designer may allow some unsupervised transmissions if they do not belong to any synchronization scenario. Later, the arbiter module must generate and send request messages as well as process the acknowledge messages. Finally, the arbiter must detect the end of a transmission and send the release message. This can be done at the arrival of the last flit or by monitoring the transmission time in order to prevent unbounded connections caused by malicious or malfunctioning senders.

The RM can be implemented in the form of a stand-alone processing node or built into one of the existing network modules, e.g. the DRAM module.

Fig. 5. Resource manager logic for admission control.

Fig. 5 presents a block diagram of an exemplary RM unit with its two major components: the message qualifier and the resource arbiter. The qualifier module must detect the arrival of control messages, classify them and enqueue them. As discussed in Sec. III, the arbitration is based on information about shared links and VCs. Note that, due to the applied deterministic routing of transmission paths, synchronization scenarios remain constant during system execution. Consequently, it is sufficient for the qualifier module to check the sender and receiver IDs of a particular transmission (i.e. their placement in the NoC) to unambiguously determine the synchronization scenario. The necessary information is stored in the LUT. Finally, the arbiter module is equipped with the round-robin scheduler which grants the unique access to the interconnect based on the availability of requests in the queue.

The scalability of the system is ensured through the use of multiple RMs protecting separate synchronization scenarios, e.g. transmissions directed towards different hot-modules. Synchronization scenarios are separated (disjoint) if none of the transmissions belonging to a particular scenario shares a link on the same VC with transmissions belonging to a different scenario. In principle, there can be as many RMs as there are disjoint scenarios in the system, which allows separating and distributing the control messages in the NoC, i.e. decreasing the average transmission time and reducing the response time of the RM. However, to further decrease the hardware overhead, one RM unit may also protect multiple synchronization scenarios, after accounting for the additional temporal overhead, see Sec. V and Sec. VII. The payload of control messages is small, as it consists only of information uniquely identifying the transmission, e.g. an identification number (IDs of the sender and receiver), and the type of the message (reqMsg, relMsg, ackMsg). Moreover, the number of outstanding requests for a particular transmission may be limited by clients to one, i.e. no permission for sending a new request from the network node if there exists at least one currently pending request.

In order to evaluate the hardware overhead required for the implementation of clients, we implemented a client in the IDAMC architecture [15] on a Virtex-6-LX760 Xilinx FPGA. The client required only 200 look-up tables (LUTs), which corresponds to only 3% of the area of a Network Interface (NI) module. Overall, the area overhead of clients is comparable with the TDM-based solution [33]. A TDM arbiter must also trap and distinguish between different transmissions and monitor their duration to detect the end of a time slot. In the described setup, the scheduler used by the RM has the size of a standard round-robin arbiter, whereas the main area overhead of the central RM module comes from the request queues. Our implementation in the IDAMC architecture on a Virtex-6-LX760 Xilinx FPGA for eight synchronized senders resulted in 800 LUTs, which is only 12% of the area of the NI module. The observed low hardware requirements of the RMs allow the implementation of multiple units in the same NoC in order to decrease the overhead and increase the scalability of the system.
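As a behavioral illustration of the protocol from Sec. IV-A, the following minimal sketch models the RM's queue and grant logic; the message encoding, method names and purely sequential event handling are simplifications of ours, not the hardware design. The client side would trap the sender (clnt_req()), block until its ackMsg arrives (clnt_strt()), and issue the relMsg at the last flit (clnt_rel()).

```python
# Minimal behavioral sketch of the RM side of the protocol (Sec. IV-A);
# message encoding and the event handling are assumptions for illustration.

from collections import deque

REQ, ACK, REL = "reqMsg", "ackMsg", "relMsg"

class ResourceManager:
    """Admission control for one synchronization scenario (cf. Fig. 5):
    at most one active transmission at a time, requests served in order."""

    def __init__(self):
        self.queue = deque()   # pending requests (sender ids)
        self.busy = False      # True while a granted transmission is running

    def receive(self, msg, sender):
        """Handle one control message; return the ackMsg to emit, if any."""
        if msg == REQ:
            self.queue.append(sender)   # rm_proc(): enqueue the request
        elif msg == REL:
            self.busy = False           # rm_rel(): resource is free again
        # Work-conserving: grant the next pending request immediately.
        if not self.busy and self.queue:
            self.busy = True
            return (ACK, self.queue.popleft())
        return None

if __name__ == "__main__":
    rm = ResourceManager()
    print(rm.receive(REQ, "client-A"))  # ('ackMsg', 'client-A'): idle, granted
    print(rm.receive(REQ, "client-B"))  # None: queued behind the active sender
    print(rm.receive(REL, "client-A"))  # ('ackMsg', 'client-B'): next grant
```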

V. WORST-CASE ANALYSIS OF THE MECHANISM

In order to prove the predictability of the mechanism, we compute a bound on the worst-case latency for each transmission synchronized with a RM, taking into consideration other interfering senders and the overhead of the protocol. The timing relations of the individual transmissions are abstracted by event models [34] that capture the worst-case and best-case behavior of every possible transmission arrival/activation pattern. Therefore, we may use temporal-analysis frameworks which apply event models to capture the dynamics of the system's behavior, such as the Compositional Performance Analysis (CPA) [35] applied in this work.

Note that the provided analysis concerns only the top layer. We assume that the bottom-layer architecture of a particular NoC is analyzable (e.g. [17]) and provides basic network latency bounds, which constitute the input to the top-layer analysis, for any transmission conducted in the network. The bottom-layer latency depends on the following factors: the source/destination routing distance, packet size, link bandwidth, additional protocol overheads and other ongoing communication.

Definition 1 (Basic Network Latency Bounds). Let $C_i^-$ and $C_i^+$ denote the minimum and maximum time required to transfer all packets of a transmission i in a specific VC when no contention and maximal NoC contention are considered, respectively. Therefore, the time during which the packets from transmission i are physically present in the network can be bounded by $[C_i^-; C_i^+]$.

A. Analysis of the Top-Layer

In CPA [34], we consider every RM as a resource under round-robin scheduling and every transmission belonging to the synchronization scenario as a task. The latency of a transmission under RM control can be decomposed into three phases: the release time, denoting the time between the client issuing a request and its arrival at the RM; the blocking time, which a request has to wait in the RM queue before being granted access to the NoC by the RM; and the task execution, denoting the transfer and processing of the acknowledge message, the maximum time necessary to transmit all packets belonging to the transmission, and the time necessary to send and process the release message.

Definition 2 (Request arrival functions). Let $\eta_i^-(\Delta t)$ and $\eta_i^+(\Delta t)$ define the minimum and maximum number of requests (events), i.e. transmissions, which can be issued by a sender to the RM within a time window of size $\Delta t$. Their pseudo-inverse counterparts $\delta_i^-(n)$ and $\delta_i^+(n)$ denote the minimum and maximum time interval between the first and the last event in any sequence of n event/request arrivals [34].

The release time includes the propagation delay of the RM's control messages through the NoC along with the interference with other synchronized senders. It introduces a delay and jitter for the requests arriving at the RM. The jitter can be captured using the method from [36].

Definition 3. The worst-case propagation jitter of a transmission i, $J_{i,ctrl}$, is the time interval

$J_{i,ctrl} = C_{i,ctrl}^+ - C_{i,ctrl}^-$   (1)

where $C_{i,ctrl}^-$ and $C_{i,ctrl}^+$ denote the best- and worst-case network latencies of a control message for transmission i.

Based on this, we derive (similarly to [35]) the RM input request arrival functions for each transmission i, which are identical to the output applications'/senders' request arrival functions including the propagation jitter.

Definition 4. The output event model $\delta_{i,out}^-(n)$ for arrivals of request messages for a transmission i is denoted as:

$\delta_{i,out}^-(n) = \max\{\delta_i^-(n) - J_{i,ctrl},\ (n-1) \cdot C_{i,ctrl}^-\}$   (2)

where $\delta_i^-(n)$ denotes the minimum time interval necessary for n activations of transmission i by a sender and $C_{i,ctrl}^-$ the minimum network latency of a single request [35].

We analyze the RM with the busy-window approach (see [35]). It constructs a critical instant, which marks the beginning of the busy-window time interval, and considers the worst-case arrival sequence of events, the events' durations and the scheduling policy to compute the maximum delay for a transmission (task) under analysis. It maximizes the response time, i.e. the duration between the activation of the transmission and its completion.

Definition 5. Let the minimum and maximum q-event busy windows $\omega_i^-(q)$ and $\omega_i^+(q)$ describe the minimum and maximum time interval required for sender i to conduct q consecutive transmissions belonging to the same synchronization scenario protected by a RM.

Corollary 1. The minimum q-event busy window can always be bounded by $\omega_i^-(q) \ge q \cdot C_i^-$.

Theorem 2. The worst-case time necessary to conduct q transmissions i belonging to a synchronization scenario protected by a RM is bounded by:

$\omega_i^+(q) \le q \cdot C_i^+ + 3q \cdot C_{i,ctrl}^+ + B_i(\omega_i^+(q))$   (3)

where $C_{i,ctrl}^+$ denotes the maximum latency of the control messages for transmission i and $B_i(\omega_i^+(q))$ the maximum blocking resulting from the scheduling of other transmissions belonging to the same synchronization scenario.

Proof. The theorem directly results from the description of the mechanism and protocol (see Sec. IV). The busy window of q consecutive transmissions i is bounded by the time necessary to send all packets belonging to these transmissions ($q \cdot C_i^+$), the time ($3q \cdot C_{i,ctrl}^+$) necessary to exchange the control messages for each transmission (reqMsg, ackMsg, relMsg), plus the maximum time interval during which a particular request can be blocked due to other ongoing transmissions in the top layer.

Note that $\omega_i^+(q)$ appears on both sides of Eq. 3, forming a fixed-point iteration problem. It can be solved iteratively, starting with $\omega_i^+(q) = q \cdot C_i^+ + 3q \cdot C_{i,ctrl}^+$, until two consecutive iterations produce the same result.

Lemma 3. The blocking time which q requests experience in a time window $\Delta t$ can be bounded by

$B_i(\Delta t) = \sum_{\forall j \in S} (C_j^+ + 2 \cdot C_{j,ctrl}^+) \cdot \min\{q,\ \eta_{j,out}^+(\Delta t)\}$   (4)

where S is the set of all transmissions belonging to the same synchronization scenario as transmission i, $C_j^+$ denotes the maximum latency of transmission j, $2 \cdot C_{j,ctrl}^+$ denotes the maximal latency of the acknowledge and release messages for j, and $\eta_{j,out}^+(\Delta t)$ is the maximum number of requests for transmission j within the interval $\Delta t$, taking into consideration the worst-case propagation jitter resulting from the transmission of control messages.

Proof. According to the assumptions, the considered RM uses a round-robin scheduler. The sum in Eq. 4 computes the interference from all other transmissions belonging to the same synchronization scenario. Following this arbitration policy, each transmission j may block transmission i only once per activation. However, q activations of i can be blocked only q times and also no more than $\eta_{j,out}^+(\Delta t)$ times. Finally, we assume that the first requests from all streams arrive exactly at the same moment in order to construct the critical instant. Therefore, it is a conservative overestimation of the actual interference.

B. Derived Metrics

Based on the computed busy window $\omega_i^+(q)$, we derive the following QoS metrics, commonly used in the analysis of real-time systems: the worst-case response time and the worst-case backlog.

Let $R_i$ be the worst-case response time of transmission i, i.e. the longest time interval between its activation and completion:

$R_i = \max_{\forall q \ge 1}\{\omega_i^+(q) - \delta_i^-(q)\ ,\ \omega_i^+(q) \ge \delta_i^-(q+1)\}$   (5)

The response time $R_i$ is represented by the difference between the busy window $\omega_i^+(q)$ and the earliest possible activation $\delta_i^-(q)$. Later, the schedulability test has to confirm that the constraint $R_i \le D_i$ is satisfied, where $D_i$ defines the transmission's deadline, for every transmission in the system. If not, the system is not schedulable given the constraints.

Let $b_i$ be the worst-case backlog of a transmission i, i.e. the maximum number of pending, unserved requests. $b_i$ can be computed as follows:

$b_i = \max_{\forall q \ge 1}\{\eta_i^+(\omega_i^+(q)) - (q - 1)\}$   (6)

Calculation of the backlog provides a conservative upper bound on the size of the buffer required by the RM.
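The analysis above translates directly into a small fixed-point computation. The following sketch implements Eqs. (3)-(6) for strictly periodic senders with jitter; the event model and all numeric parameters are illustrative assumptions of ours, and the output-jitter correction of Def. 4 is omitted for brevity.

```python
# Minimal sketch of the top-layer analysis (Eqs. 3-6) for periodic-with-jitter
# senders; the event model and all numbers are illustrative assumptions.

import math

class Sender:
    def __init__(self, C_plus, C_ctrl, period, jitter):
        self.C_plus = C_plus    # C_i^+: max bottom-layer transmission latency
        self.C_ctrl = C_ctrl    # C_{i,ctrl}^+: max control-message latency
        self.period, self.jitter = period, jitter

    def eta_plus(self, dt):     # max number of requests in any window dt
        return math.ceil((dt + self.jitter) / self.period) if dt > 0 else 0

    def delta_minus(self, n):   # min distance spanning n activations
        return max(0, (n - 1) * self.period - self.jitter)

def blocking(others, q, dt):    # Eq. 4: round-robin blocking in window dt
    return sum((j.C_plus + 2 * j.C_ctrl) * min(q, j.eta_plus(dt))
               for j in others)

def busy_window(i, others, q):  # Eq. 3, solved by fixed-point iteration
    w = q * i.C_plus + 3 * q * i.C_ctrl
    while True:
        w_new = q * i.C_plus + 3 * q * i.C_ctrl + blocking(others, q, w)
        if w_new == w:
            return w
        w = w_new

def response_time_and_backlog(i, others):   # Eqs. 5 and 6
    R, b, q = 0, 0, 1
    while True:
        w = busy_window(i, others, q)
        R = max(R, w - i.delta_minus(q))
        b = max(b, i.eta_plus(w) - (q - 1))
        if w < i.delta_minus(q + 1):        # busy window closed
            return R, b
        q += 1

if __name__ == "__main__":
    a = Sender(C_plus=100, C_ctrl=5, period=1000, jitter=100)
    others = [Sender(100, 5, 800, 80), Sender(100, 5, 1200, 120)]
    print(response_time_and_backlog(a, others))   # -> (335, 1) for these numbers
```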

VI. INTERFACE WITH SDRAM CONTROLLERS

In this section, we first discuss SDRAM devices and their operation. Then, we discuss the drawbacks of designing SDRAM controllers for TDM-arbitrated NoCs. Finally, we describe how our NoC admission control layer simplifies the design of SDRAM controllers.

A. Background on SDRAMs

In this article, we focus on DDR3 SDRAM devices. The acronym DDR stands for Double Data Rate and means that data is transferred on both the rising and the falling clock edge. Hence, on a data bus clocked at 400 MHz, potentially 800 mega-transfers per second can be performed.

Fig. 6. DRAM Device Organization.

We depict the internal structure of an SDRAM device and the commands used to operate it in Fig. 6. An SDRAM device is structured in banks; for DDR3, the number of banks is 8. Each bank contains a matrix-like structure and a row buffer (highlighted in the figure). The matrix-like structures are not visible to the memory controller. All data exchanges are instead performed through the corresponding row buffer.

TABLE I
TIMING CONSTRAINTS FOR DIFFERENT DDR3 DEVICES (AVAILABLE IN [39]), CONSIDERING A ROW SIZE OF 2KB. VALUES ARE GIVEN IN DATA BUS CYCLES ACCORDING TO THE JEDEC DDR3 SPECIFICATION.

Constraint | Description | DDR3-800E | DDR3-1600K

Exclusively intra-bank constraints (same bank):
tRCD | ACT to RD/WR delay | 6 | 11
tRP | PRE to ACT delay | 6 | 11
tRC | ACT to ACT delay | 21 | 39
tRAS | ACT to PRE delay | 15 | 28
tWL | WR to data bus transfer delay | 5 | 8
tRL | RD to data bus transfer delay | 6 | 11
tRTP | RD to PRE delay | 4 | 6
tWR | End of a WR operation to PRE delay | 6 | 12

Exclusively inter-bank constraints (different banks):
tRRD | ACT to ACT delay | 4 | 6
tFAW | Four ACT window | 20 | 32

Inter- and intra-bank constraints (any bank):
tRTW | RD to WR delay | 7 | 9
tWR-to-RD | WR to RD delay | 13 | 18
tWTR | End of WR to RD delay | 4 | 6
tBURST | Data bus transfer | 4 | 4
tCCD | RD-to-RD or WR-to-WR delay | 4 | 4

There are four commands used to move data into/from a row buffer: ACT, PRE, RD and WR.¹ The activate (ACT) command loads a matrix row into the corresponding row buffer, which is known as opening a row. The precharge (PRE) command writes the contents of a row buffer back to the corresponding matrix, which is known as closing a row. The read (RD) and write (WR) commands are used to retrieve or forward words from or to a row buffer. We use the acronym CAS (column address strobe) to refer to both RD and WR commands. The operation of CAS commands is controlled by the burst length parameter (BL). For DDR3, BL=8 and, hence, each RD or WR command performs 8 data transfers and occupies the data bus for BL/2=4 cycles (because of the double data rate). The amount of data transferred by each command then depends on the data bus width (W) of the SDRAM device. If W=8 bits, each CAS command transfers W·BL=64 bits, or 8 bytes, assuming the burst length BL=8 supported by the majority of DDR3 and DDR4 standards.

Finally, we discuss SDRAM command scheduling constraints. An SDRAM controller can execute at most one command per cycle. Moreover, it must respect several timing constraints [38, 39] that dictate how far apart consecutive commands must be. We enumerate such timing constraints for two different DDR3 devices in Table I. For instance, tRCD refers to the minimum distance between an ACT and a RD/WR command that targets the same SDRAM bank. Its value amounts to 6 and 11 cycles for a DDR3-800E and for a DDR3-1600K, respectively.

B. Timing Analysis of an FCFS SDRAM Controller

In this section, we describe a simple SDRAM controller to be employed in conjunction with our NoC admission control layer. The controller exploits row buffer locality and employs First-Come First-Served (FCFS) arbitration. As a consequence, the access to the SDRAM becomes transparently managed by the Resource Managers that regulate access to the NoC. A more detailed discussion can be found in Section VI.D.

¹ SDRAMs have a fifth command, the refresh, which is not related to data transfers. The refresh has to be executed regularly to prevent the capacitors that store data from discharging. The scheduling of refreshes is an extensively studied problem [37] and, for the sake of simplicity, will not be addressed in this article.

We now describe the SDRAM controller. Analyzing the timing constraints from Table I, it is easy to notice that consecutive CAS commands that target the same bank row can be executed faster than those that do not. This is because there is no need for precharging and activating a row and, hence, the timing constraints associated with ACT and PRE commands, e.g. tRCD, tRTP and tWR, play no role in the command scheduling. Hence, and given that our NoC admission control enforces the locality of transfers, we can map consecutive physical SDRAM addresses to the same bank/row in an SDRAM device, i.e. we can employ a contiguous address mapping, as depicted in Fig. 7. The alternative would be to map consecutive addresses to different banks, the so-called interleaved address mapping. As we discuss in Section VI.C, bank interleaving is attractive when the locality of incoming accesses is not enforced. However, it increases power consumption.
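The two mappings differ only in which address bits select the bank. The following minimal sketch (device geometry and interleaving granularity are hypothetical, chosen to match the 4-bank system of Fig. 7) makes the difference explicit.

```python
# Minimal sketch (hypothetical geometry): bank selection under the two
# address mappings of Fig. 7. Contiguous mapping keeps consecutive
# addresses in one bank/row; interleaved mapping spreads consecutive
# blocks across banks.

N_BANKS = 4        # banks in the example system of Fig. 7
ROW_SIZE = 2048    # bytes per row (cf. Table I)
BLOCK = 64         # interleaving granularity in bytes (assumed)

def contiguous_bank(addr):
    # A whole row-sized region stays in one bank: a local transmission
    # keeps hitting the same open row buffer.
    return (addr // ROW_SIZE) % N_BANKS

def interleaved_bank(addr):
    # Consecutive blocks hit different banks, hiding ACT/PRE latency but
    # giving up the open-row locality our admission control preserves.
    return (addr // BLOCK) % N_BANKS

if __name__ == "__main__":
    addrs = [0, 64, 128, 2048]
    print([contiguous_bank(a) for a in addrs])   # [0, 0, 0, 1]
    print([interleaved_bank(a) for a in addrs])  # [0, 1, 2, 0]
```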

Fig. 7. Mapping of four consecutive addresses to SDRAM banks in two different mapping configurations (in a system with 4 banks).

We now compute a timing bound on the execution time of a request. In order to facilitate the inclusion of such a bound into the timing analysis of the NoC (see Section V), we derive a single bound valid for both read and write requests. For that purpose, we define a new timing constraint which represents the worst-case timing interval between the execution of a CAS command and the beginning of the corresponding data transfer. We call this constraint tCL and compute its value as max{tRL, tWL}. Moreover, we use the letter G to refer to the granularity of each SDRAM request (measured in bytes). Notice that the number of CAS commands (N) required to fulfill a request depends on the granularity (G) of the request and the interface width (W) of the SDRAM. Algebraically speaking, N = G/(W/8), with W/8 being the bus width in bytes. So, for instance, if W=8 bits and G=64 bytes, each SDRAM request demands 64 CAS commands to be fulfilled². With that in mind, the worst-case execution time of a request can be computed according to Theorem 4.

Theorem 4. The worst-case execution time of an SDRAM request ($L_{req}^{sdram}$) is computed with Eq. 7:

$L_{req}^{sdram} = t_{Residual} + t_{RP} + t_{RCD} + t_{CL} + t_{BURST} \cdot N$   (7)

where:

$t_{Residual} = \max(case_r, case_w)$   (8)
$case_r = \max(t_{RAS} - (t_{RCD} + t_{RL} + t_{BURST}),\ 0)$   (9)
$case_w = t_{WR}$   (10)

Proof. For the computation, we conservatively assume that when the SDRAM controller starts the execution of the request under analysis (u.a.), the desired row is not currently present in the corresponding row buffer. Moreover, we also assume that the request under analysis targets the same bank as the previous request that was executed. Hence, the computation of $L_{req}^{sdram}$ becomes the sum of five terms.

² We assume that the demanded data is always aligned within the boundaries of a single DRAM row.

To clarify the purpose of each of these terms, we depict a graphical decomposition of the worst-case execution time of a request in Fig. 8.

Fig. 8. Latency decomposition of an SDRAM request. In the figure, the letter C represents a CAS command, while the letters A and P represent an activate and a precharge command, respectively.

The first term in the computation is $t_{Residual}$, and it refers to the minimum distance between a CAS command used to serve the previous request and a PRE command, which is required before the SDRAM controller can load the appropriate bank row for the request under analysis into the row buffer. This distance depends on whether the previous request was a read or a write and, hence, we use the max function to select between the two cases. Eqs. 9 and 10 come directly from the meaning of the constraints and, hence, we do not elaborate further on them. After the execution of the precharge, the SDRAM controller waits for $t_{RP}$ cycles before executing an ACT (see Table I). Then, after the execution of the ACT, $t_{RCD}$ more cycles are required before the controller executes a CAS, which then causes the corresponding data transfer to start after $t_{CL}$ cycles. Finally, each request demands a total of N CAS commands to be served, as previously discussed. The last term of the equation simply accounts for the corresponding N data transfers, each occupying the data bus for $t_{BURST}$ cycles. Notice that after the first CAS is executed, all the following CAS commands can be executed $t_{BURST}$ cycles apart from each other. Hence, the distance between the first and the last CAS of the request u.a. is $(N-1) \cdot t_{BURST}$. This concludes the proof of the theorem by construction.

Notice that the contiguous address mapping only provides acceptable timing bounds if N is large. For instance, consider a DDR3-1600G SDRAM device [39] (whose data bus is clocked at 800 MHz) with W=8 bits. In such a configuration, an SDRAM request of G=64 bytes would require N=64 CAS commands. Using Theorem 4, we compute $L_{req}^{sdram}$ as 304 cycles. However, the overhead to open and close the row, which is represented by the first four terms of Eq. 7, amounts to only 48 cycles. This means that, on average, the SDRAM spends 84% of the cycles (required to execute a request) transferring data. The remaining 16% is a consequence of the time required for managing the row buffer.

C. Considerations about the SDRAM Controller in TDM-Arbitrated NoCs

In NoCs with large TDM slots (that exceed the granularity of the DRAM controller), DRAM locality is also enforced and, consequently, the same SDRAM controller proposed in Section VI.B can be employed. However, as described in Section II, large TDM slots have the disadvantage of increasing the latency of all requestors in a system, including those which generate non-SDRAM traffic. Such a shortcoming is tackled by employing smaller TDM slots. Nevertheless, if we still want to employ a simple FCFS SDRAM controller, we would have to match the granularity of the SDRAM with the granularity of the NoC, which would consequently decrease N.
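The following sketch evaluates Eq. (7) with the DDR3-1600K column of Table I for both granularities discussed in Sections VI.B and VI.C. Note that the paper's 304- and 112-cycle figures refer to a DDR3-1600G device whose constants differ slightly from the 1600K values in Table I, so the numbers below are close to, but not exactly, those in the text.

```python
# Worked sketch of Theorem 4 / Eq. (7) using the DDR3-1600K column of
# Table I; the paper's 304/112-cycle examples use a DDR3-1600G device,
# so the results here are close but not identical.

T = dict(tRCD=11, tRP=11, tRAS=28, tWL=8, tRL=11, tWR=12, tBURST=4)

def l_sdram_req(G, W):
    """Worst-case execution time (bus cycles) of one G-byte request on a
    W-bit-wide device; returns (row management overhead, total)."""
    N = G // (W // 8)                   # CAS commands per request (Sec. VI.B)
    tCL = max(T["tRL"], T["tWL"])       # CAS to data-transfer delay
    case_r = max(T["tRAS"] - (T["tRCD"] + T["tRL"] + T["tBURST"]), 0)  # Eq. 9
    case_w = T["tWR"]                                                  # Eq. 10
    t_residual = max(case_r, case_w)                                   # Eq. 8
    overhead = t_residual + T["tRP"] + T["tRCD"] + tCL  # close/open the row
    return overhead, overhead + T["tBURST"] * N         # Eq. 7

if __name__ == "__main__":
    print(l_sdram_req(G=64, W=8))  # (45, 301): cf. the 48/304 cycles in the text
    print(l_sdram_req(G=16, W=8))  # (45, 109): cf. the 48/112 cycles for short slots
```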

As an example, consider a scenario in which the TDM time slots carry 16 bytes of data (which also represents G, the granularity of SDRAM requests), and a DDR3-1600G device with a W=8 bits wide data bus. In this scenario, N=16 CAS commands are required per request. Computing $L_{req}^{sdram}$ with Theorem 4, we obtain 112 cycles (of which 48 cycles are required to close and open the row buffer). Hence, with this configuration, the SDRAM spends 57% of the time required to execute a request transferring data, which is 27% less than what is observed in the example from the previous section.

In order to hide the latency of the row buffer management, a combination of bank interleaving (distributing the CAS commands over multiple banks, as depicted in Fig. 7) and a close-row policy (automatically closing the row buffer after using it) [40] can be employed. However, bank interleaving combined with the close-row policy also increases power consumption [41]. In our experimental evaluation, we show that if the locality of data transfers is preserved, then the controller described in Section VI.B achieves similar timing bounds as a controller that employs bank interleaving (without relying on a power-hungry approach).

D. Admission Control for SDRAMs

The proposed admission control mechanism allows controlling interference not only on the NoC side but also on the shared SDRAM side. Indeed, since memory traffic constitutes most of the traffic in an MPSoC system, controlling the interference on the NoC and the memory becomes a main issue in real-time multicore systems. In this case, predictable memory controllers, such as [40, 42], must be used in order to guarantee a predictable behavior of concurrent memory accesses. However, these approaches often consider as input arbitrary traffic to the memory controller. In this paper, the proposed admission control mechanism allows a holistic approach covering the NoC and the shared SDRAM, as follows: i) it preserves the locality of memory accesses, since the access to the NoC is allocated to the entire transmission, which allows to fully benefit from the open-row policy and to optimize the performance of the system, in addition to ensuring predictability; ii) it mitigates the management of interference between the NoC and the memory controller.

Consider the example in Fig. 9, where three applications are accessing the shared DRAM memory through the interconnect. Data streams triggered by applications T1 and T2 interfere on the NoC since they are sharing the same path in the network, and therefore form a synchronization scenario S1 managed by RM1. The data stream triggered by application T3 uses a different path in the network, but shares the input to the DRAM controller. Therefore, this traffic stream must also be synchronized, using RM2, with the rest of the traffic in the network - synchronization scenario S2. Note that, depending on the tolerable synchronization overhead, a single RM can be used to synchronize all the traffic in the NoC accessing the DRAM memory. In this case, whenever an application is granted access by the RM, it acquires exclusive access to both resources: the NoC as well as the SDRAM memory. Then no predictable memory controller is required in the system, since temporal isolation from other streams is guaranteed by the RM both at the NoC and at the SDRAM level.

VII. EXPERIMENTAL RESULTS

In this section, we evaluate the proposed mechanism.

Fig. 9. Multiple RMs to mitigate the interference on the NoC and memory.

Fig. 10. Analytical comparison of worst-case latency guarantees for applications (A1-A4) generating different NoC load. A3 and A4 have the same settings.

The analysis from Sec. V is implemented in the pyCPA framework [35], using the approach presented in [17] for deriving the bottom-layer latencies ($C_i^{+/-}$ and $C_{i,ctrl}^{+/-}$). Simulations are carried out with the OMNeT++ event-based simulation framework and the HNOCS library [43]. The evaluation is done through comparison with TDM-based solutions (cf. Sec. II) utilizing long TDM slots, adjusted to the maximum network latency of a transmission ($C^+$) taking into consideration possible interference with other VCs, as well as short TDM slots, adjusted to the transmission of a single packet. As synchronized transmissions, we assume memory transfers, which constitute the majority of the NoC traffic (cf. [44, 32]). They allow exploiting the locality of transfers in the memory context and challenge the performance and the scalability of the system, since memories are the most common hot-module in a system-on-chip.

A. Evaluating Worst-Case Latency Guarantees

In the experiments, we consider synchronization scenarios with x senders performing a burst of y transmissions per activation. Senders are activated periodically with a period $P = 16 \cdot x \cdot C^+$ and a small jitter J equal to 10% of the period. We vary the system's load L, defined as the total number of transmissions from synchronized applications per period P, i.e.

$L = \sum_x (y_x \cdot C^+)/P$

First, we analyze a system with four (x=4) applications (A1-A4), where we vary the burst size to generate different loads L on the network (L equal to 15%, 65% and 90% of P). We measure for each application the worst-case latencies obtained per burst, see Fig. 10. We observe that synchronization with the RM always results in better guarantees than TDM (even when short slots are considered), despite the additional communication protocol overhead. This overhead increases with the load but remains very low compared to the transmission time (4.1% of C+ for L=90%). Moreover, we assume that all synchronized transmissions are of equal length, i.e. they have the same C+, corresponding to 16 kB.

Fig. 11 illustrates the same results, i.e. transmission latency and protocol overhead in the worst case, as we vary the number of synchronized applications (x=[2..16]) and the interfering load (yx=[2..16]). The results concern an application periodically transferring a burst of 16 transmissions. Single transmissions from all synchronized senders are of equal length, i.e. they have the same C+, corresponding to 8 kB.

Fig. 11. Worst-case guarantees for a burst of 16 transmissions with jitter J=10% of P: (a) transmission latency, (b) protocol overhead resulting from the RM.

Fig. 13. Latencies of the CHSTONE benchmark with TDM and RMs.

Fig. 14. Effect of memory locality on the total transmission latencies for the MPEG-4 module using TDM and RMs.

Fig. 12. MPEG-4 average communication demands specified in MB/s (a) and mapping (b).

The worst-case latencies, depicted in Fig. 11(a), increase for both TDM and RM along with the size of the synchronization scenario. For TDM, however, they remain constant and independent of the system's load. This allows our approach to significantly outperform TDM (by up to 80%) due to the applied work-conserving scheduling. Note that TDM with short slots performs better than TDM with long slots, as it is less sensitive to the jitter. The protocol overhead, depicted in Fig. 11(b), is presented as a percentage of the transmission's length. It increases proportionally to the number of synchronized senders. This effect can be mitigated by implementing multiple RMs in the same NoC. Moreover, the protocol overhead depends directly on the frequency and number of synchronized transmissions, i.e. the system's load. Finally, the relative overhead decreases with increasing transmission length, since the absolute protocol overhead is constant w.r.t. the transmission length.

B. Application and Benchmark-Based Results

We evaluate the average performance with the CHSTONE benchmark [45]. The benchmark traces are extracted using the gem5 simulator. We use an ARMv7-a core with a 32 kB L1 cache and 64-byte cache lines (10-packet long transmissions). Compilation is performed without any optimization with respect to RM or TDM (standard gcc compiler, ver. 4.7.3). Next, we establish different scenarios resulting from possible mappings and numbers of senders, assuming a constant placement of the RM and one sender per node. Fig. 13 presents the average latencies for different sizes of synchronization scenarios. Here, too, the RM significantly outperforms the other solutions. However, the difference compared to TDM in case of small synchronization scenarios, e.g. 2 senders, is rather low (around 8%). This has two reasons: the short duration of TDM cycles and the relatively high RM overhead (three control messages per transmission). In case of larger scenarios, the solution based on RMs is up to 60% better than TDM. Note that the activation patterns of senders are not necessarily synchronized with respect to the TDM cycle. Tailoring the TDM schedule in order to optimize such systems, which are not fully loaded, requires dedicated solutions and introduces additional hardware overhead, e.g. SurfNoC [7]. The RM allows effective arbitration without additional effort.
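To make the overhead scaling concrete, the following sketch (our own illustration, not part of the original evaluation; the per-message cost and per-packet latency are assumed placeholder values) computes the relative protocol overhead for growing transmission lengths, using the three control messages per transmission mentioned above:

```python
# Sketch: why the relative RM overhead shrinks with transmission length.
# The protocol costs a constant 3 control messages per transmission.

C_CTRL = 20        # assumed worst-case latency of one control message
N_CTRL = 3         # control messages per synchronized transmission

for packets in (1, 2, 4, 8, 16):
    c_plus = packets * 64                 # assumed 64 cycles per packet
    overhead = N_CTRL * C_CTRL            # constant w.r.t. length
    print(f"{packets:2d} packets: {overhead} cycles "
          f"= {overhead / c_plus:.1%} of C+")
```

The absolute overhead stays fixed while C+ grows with the transmission length, so the ratio falls, matching the trend observed in Fig. 11(b).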

Finally, we evaluate the average performance of the mechanism using the MPEG-4 video decoder application [26]. This comparison is relevant for all cases in which a synchronized application not only requires worst-case guarantees but also profits from faster execution. We identify three modules with high communication requirements: SDRAM (the target of 7 senders), SRAM2 (the target of 4 senders) and SRAM1 (the target of 2 senders), see Fig. 12(a). Requests to each memory module constitute an independent synchronization scenario. We map the different scenarios on independent VCs and protect each of them with an independent RM mapped on a different node. Each module of the MPEG-4 decoder is modeled with a traffic generator conducting 8 kB long DMA transfers. This is because, in commercial SDRAM modules with 64-bit data buses, the row buffer size is 8 kB. Transmissions are performed periodically, and the periods are calculated based solely on the required bandwidth, including some release jitter (J=10% of P). Fig. 14 presents the average latencies of a single transmission achieved in simulation with TDM- and RM-based arbitration. The depicted values include both network and memory latencies, to assess the effect of memory locality on the described mechanisms. The latencies of the SDRAM memory are modeled after the specification of DDR3-2133N SDRAM [39]. As explained previously, TDM with short slots performs better than TDM with long slots in the network. However, when the memory effect is considered, the memory latency for TDM with long slots is better than with short slots, since long slots allow maintaining the locality of memory accesses to the SDRAM. Indeed, SDRAMs have an internal level of caching (the row buffer), which in standard DDR3 modules amounts to 8 kB. Consequently, contiguous and aligned 8 kB long transfers fully benefit from this caching. The RM also maintains the locality of memory accesses, since applications are granted access to the NoC for the entire transmission. Overall, the RM performs better than TDM for memory traffic traversing the NoC and accessing the main SDRAM memory.

C. Memory Effects

In the next series of experiments, we extend our evaluation to account for memory effects caused by the introduced arbitration. We show that implementing a RM in a NoC-based MPSoC allows not only to improve the performance of the memory module but, even more importantly, to enforce predictability of memory accesses. In our experiments, we considered DDR3 memories (800E, 1066G, 1066E, 1333H, 1333J, 1600K, 1866M, 2133N).
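Before presenting the measurements, the following toy model sketches why a contiguous transfer is served faster than the same amount of data split into independent small requests. This is our own illustration; the timing constants are loosely inspired by the DDR3 example from Sec. VI and are not taken from any datasheet:

```python
# Toy model of the locality effect measured in Fig. 15 (assumed constants).

ROW_MGMT = 48        # assumed cycles to close + open a row
CAS_CYCLES = 4       # assumed data-transfer cycles per CAS command
BYTES_PER_CAS = 8    # assumed bytes moved per CAS on an 8-bit device

def request_latency(chunk_bytes, locality):
    n_cas = chunk_bytes // BYTES_PER_CAS
    if locality:
        # One contiguous request: the row is opened once and reused.
        return ROW_MGMT + n_cas * CAS_CYCLES
    # Several independent small requests (here: 16 bytes each): in the
    # worst case, every one of them pays the row-management penalty.
    cas_per_req = 2
    n_req = -(-n_cas // cas_per_req)   # ceiling division
    return n_req * (ROW_MGMT + cas_per_req * CAS_CYCLES)

for size in (16, 64, 256, 1024):
    print(f"{size:4d} B: {request_latency(size, True):4d} cycles with "
          f"locality vs. {request_latency(size, False):4d} without")
```

The gap grows with the transfer size, which is exactly the trend the locality experiment below demonstrates across the tested devices.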

Fig. 15. Effect of memory locality on the request latencies for the tested DDR3 devices. The calculations assumed an SDRAM device with an interface width of 8 bits. Notice that the latency is measured in data bus clock cycles of the SDRAM devices.

Fig. 16. Worst-case NoC and memory latencies for increasing transfer granularity using TDM and RMs.

We begin with the effects of spatial locality on the response time of a memory module, assuming the simple SDRAM controller presented in Sec. VI. In this series of experiments, we use a benchmark application which writes between 8 and 1024 bytes of data into the memory, in systems with (solid lines) and without (dashed lines) support for locality of transfers. In the context of this experiment, supporting locality means ensuring that the entire chunk of data arrives in one piece at the SDRAM controller, which would be the case in a system with large TDM slots or a RM. Not supporting locality means that several independent small requests are performed in order to transfer the entire data chunk, which is the case in a system with small TDM slots. Fig. 15 summarizes the obtained results. For all tested SDRAM devices, whenever locality is enforced, the latency of a request is significantly decreased. The performance gain is proportional to the length of the transmissions and scales with the frequency of the memory module, see the zoomed section of the figure. The faster the SDRAM device, the larger the latency measured in clock cycles. This is because SDRAM timing constraints vary little between different speed bins when measured in nanoseconds. Consequently, devices with faster clock frequencies need more clock cycles to satisfy the constraints.

The former experiment clearly shows that increasing the transmission granularity (length) decreases the memory overhead in a NoC-based MPSoC. However, in case of TDM-based arbitration, it simultaneously increases the worst-case NoC latencies. Recall that, whenever the senders in a system exhibit dynamics in their behavior, even with a small jitter, transmissions are blocked and their execution is delayed for the duration of a whole TDM cycle. Therefore, longer transmissions increase this penalty. Fig. 16 presents the worst-case latencies in a NoC-based MPSoC for an application synchronized with 10 other senders and writing 1024 bytes to the SDRAM memory. We assume that each packet is capable of delivering 64 bytes of payload. In consecutive runs, we increase the granularity, conducting the synchronization for 1, 2, 4, 8 and 16 packets respectively, while assuming a small dynamic behavior in the activation of a sender (around 5%). We assume the load from the other synchronized senders to be equal to 70% of the interconnect capacity. In case of TDM, the NoC latency increases proportionally to the length of the slot, i.e. the number of packets. Consequently, by decreasing the memory latency, we increase the NoC response time. On the contrary, when the RM is employed, the latency decreases with increasing granularity of the transfer, i.e. we can simultaneously improve network and memory latency.
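The opposing trends can be sketched with a simple toy model. This is our own illustration, not the formal analysis from Sec. V; all constants (per-packet latency, control-message cost) are assumed:

```python
# Toy model: a 1024-byte transfer (16 packets of 64 bytes) synchronized
# with 10 other senders, for varying granularity g (packets per grant).

SENDERS = 11        # sender under analysis + 10 interferers
PKT = 64            # assumed NoC cycles to forward one 64-byte packet
RM_CTRL = 3 * 20    # assumed cost of the 3 RM control messages

def tdm_wc(g):
    """The transfer needs 16/g slots, one per TDM cycle, plus one extra
    cycle of blocking when release jitter makes the sender miss a slot.
    The TDM cycle itself grows with the slot length."""
    cycle = SENDERS * g * PKT
    return (16 // g) * cycle + cycle

def rm_wc(g):
    """Work-conserving RM: the control overhead is paid once per grant;
    interference from the other senders is ignored in this sketch."""
    return (16 // g) * (RM_CTRL + g * PKT)

for g in (1, 2, 4, 8, 16):
    print(f"g={g:2d}: TDM wc = {tdm_wc(g):5d}, RM wc = {rm_wc(g):5d}")
```

Even in this crude model, the TDM bound grows with the granularity while the RM bound shrinks, mirroring the opposing curves in Fig. 16.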

Fig. 17. Worst-case latency for a 256-byte request on DDR3 devices (a) with 8-bit wide interfaces and (b) with 32-bit wide interfaces.

However, in case of very short transmissions (1 or 2 packets), the RM's protocol overhead is a major performance barrier (recall that we require 3 control messages per transmission). In these scenarios, TDM manages to guarantee lower transmission latencies. On the other hand, as soon as the granularity of a single transfer increases, the number of necessary synchronizations decreases and we can take full advantage of the work-conserving synchronization. This results in a significant improvement for 1024-byte long transmissions (almost 70%) and confirms our previous findings.

Finally, we demonstrate that preserving the locality of large transfers can have a beneficial impact on the worst-case latency of SDRAM requests. We consider a setup in which an SDRAM controller with a single port is connected to the NoC. As for the NoC arbitration, we consider a regular NoC managed with our RM and a TDM-based NoC. In case of the RM-managed NoC, we employ the simple SDRAM controller described in Section VI.B. In case of the TDM-managed NoC, we consider a non-trivial predictable SDRAM controller that employs static bank interleaving and a closed-page policy. We refer to such a controller as a Dedicated Close-Page Controller (DCPC). The operation of the considered DCPC is controlled by two parameters: BI and BC. BI (Bank Interleaving) determines how many banks a single request accesses; BC (Burst Count) determines how many read or write commands are executed per bank. Our evaluation consists of computing the worst-case latency of a 256-byte long data transfer. For the TDM-based NoC, we consider that each slot is 64 bytes long, i.e. the 256-byte transfer will demand 4 independent SDRAM requests. For the RM-managed NoC (which enforces the locality of a single large transfer), the 256-byte transfer is served in one chunk. We depict the results in Fig. 17(a) and Fig. 17(b). The first figure depicts the results for a scenario in which a single SDRAM device with an 8-bit data bus is considered. Notice that, for SDRAM devices with slow operating frequencies (DDR3-800E), the DCPC provides better latency bounds than the combination of a standard controller and the RM. This is because, for slow devices, the penalty to close and open SDRAM rows is smaller (in terms of data bus clock cycles). This situation changes for devices with higher operating frequencies, because the overhead becomes larger. The second figure depicts the results for a scenario in which an SDRAM module with a 32-bit data bus is considered. As the data bus is wider, the possibility to perform interleaving is reduced (because each SDRAM CAS command transfers 4 times more data in comparison with the previous scenario). Hence, there is no efficient way to mitigate the overhead of closing and opening SDRAM rows, and exploiting the locality of large transfers has a solid advantage.
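The dependence on the speed bin can be illustrated by converting a (nearly) fixed nanosecond row-cycle constraint into clock cycles. The tRC value of roughly 50 ns below is an assumed, illustrative constant, not a datasheet figure for any specific bin:

```python
# Sketch: a fixed ns constraint costs more clock cycles on faster bins,
# so the row-management penalty paid by the DCPC on every request grows
# (in cycles) with the device frequency, eroding its advantage.

T_RC_NS = 50.0   # assumed, illustrative row-cycle constraint in ns

for name, clock_mhz in [("DDR3-800E", 400), ("DDR3-1066E", 533),
                        ("DDR3-1333H", 667), ("DDR3-1600K", 800),
                        ("DDR3-2133N", 1066)]:
    cycles = T_RC_NS * clock_mhz / 1000.0   # ns -> cycles at this clock
    print(f"{name}: ~{cycles:.0f} clock cycles per row close/open")
```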

VIII. CONCLUSION

Designing predictable embedded multicores for safety-critical and real-time systems is a major challenge. In this context, managing shared resources such as NoCs and the main DRAM memory is an unavoidable obstacle. In this paper, we propose an efficient and predictable solution for safely sharing NoC resources, which constitute the main communication backbone of modern MPSoC platforms. The solution provides a significant improvement over well-established TDM-based solutions due to its work-conserving arbitration, which improves both average latencies and the provided worst-case service guarantees. This comes at the cost of small hardware implementation and protocol latency overheads.

REFERENCES

[1] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance evaluation and design trade-offs for network-on-chip interconnect architectures," IEEE Transactions on Computers, vol. 54, pp. 1025–1040, Aug 2005.
[2] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., 2003.
[3] J. Kim, "Low-cost router microarchitecture for on-chip networks," in MICRO, 2009.
[4] H. Zhang, "Service disciplines for guaranteed performance service in packet-switching networks," in Proc. of the IEEE, 1995.
[5] K. Goossens and A. Hansson, "The aethereal network on chip after ten years: Goals, evolution, lessons, and future," in DAC, pp. 306–311, June 2010.
[6] A. Hansson, M. Coenen, and K. Goossens, "Channel trees: Reducing latency by sharing time slots in time-multiplexed networks on chip," in CODES+ISSS, pp. 149–154, Sept 2007.
[7] H. M. G. Wassel et al., "SurfNoC: A low latency and provably non-interfering approach to secure networks-on-chip," in ISCA, 2013.
[8] M. Schoeberl et al., "A statically scheduled time-division-multiplexed network-on-chip for real-time systems," in NOCS, pp. 152–160, 2012.
[9] M. D. Gomony, B. Akesson, and K. Goossens, "A real-time multi-channel memory controller and optimal mapping of memory clients to memory channels," ACM Transactions on Embedded Computing Systems (TECS), 2014.
[10] H. Zhang, "Service disciplines for guaranteed performance service in packet-switching networks," Proc. of the IEEE, 1995.
[11] E. Bolotin et al., "QNoC: QoS architecture and design process for network on chip," Journal of Systems Architecture, 2004.
[12] Z. Shi and A. Burns, "Real-time communication analysis for on-chip networks with wormhole switching," in NOCS, 2008.
[13] T. Bjerregaard and J. Sparso, "A router architecture for connection-oriented service guarantees in the MANGO clockless network-on-chip," in DATE, pp. 1226–1231, March 2005.
[14] A. Burns, J. Harbin, and L. Indrusiak, "A wormhole NoC protocol for mixed criticality systems," in RTSS, pp. 184–195, Dec 2014.
[15] S. Tobuschat, P. Axer, R. Ernst, and J. Diemer, "IDAMC: A NoC for mixed criticality systems," in RTCSA, 2013.

[16] J. Diemer and R. Ernst, "Back suction: Service guarantees for latency-sensitive on-chip networks," in NOCS, pp. 155–162, 2010.
[17] E. A. Rambo and R. Ernst, "Worst-case communication time analysis of networks-on-chip with shared virtual channels," in DATE, 2015.
[18] G. Durrieu et al., "Predictable flight management system implementation on a multicore processor," in ERTS, 2014.
[19] Z. P. Wu, Y. Krish, and R. Pellizzoni, "Worst case analysis of DRAM latency in multi-requestor systems," in RTSS, Dec 2013.
[20] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: Enabling innovation in campus networks," SIGCOMM Comput. Commun. Rev., 2008.
[21] L. Cong, W. Wen, and W. Zhiying, "A configurable, programmable and software-defined network on chip," in WARTIA, 2014.
[22] R. Sandoval-Arechiga, R. Parra-Michel, J. L. Vazquez-Avila, J. Flores-Troncoso, and S. Ibarra-Delgado, "Software defined networks-on-chip for multi/many-core systems: A performance evaluation," in ANCS '16, 2016.
[23] J. Wang, M. Zhu, C. Peng, L. Zhou, Y. Qian, and W. Dou, "Software-defined photonic network-on-chip," in ICeND, 2014.
[24] I. Walter, I. Cidon, R. Ginosar, and A. Kolodny, "Access regulation to hot-modules in wormhole NoCs," in NOCS, pp. 137–148, May 2007.
[25] A. Kostrzewa, S. Tobuschat, P. Axer, and R. Ernst, "Supervised sharing of virtual channels in networks-on-chip," in SIES, 2014.
[26] D. Bertozzi, "NoC synthesis flow for customized domain specific multiprocessor systems-on-chip," IEEE Transactions on Parallel and Distributed Systems, vol. 16, pp. 113–129, Feb 2005.
[27] A. Schranzhofer et al., "Timing analysis for TDMA arbitration in resource sharing systems," in RTAS, 2010.
[28] R. Pellizzoni, E. Betti, S. Bak, G. Yao, J. Criswell, M. Caccamo, and R. Kegley, "A predictable execution model for COTS-based embedded systems," in RTAS, 2011.
[29] A. Kostrzewa, S. Saidi, and R. Ernst, "Dynamic control for mixed-critical networks-on-chip," in RTSS, pp. 317–326, Dec 2015.
[30] A. Kostrzewa, S. Saidi, and R. Ernst, "Slack-based resource arbitration for real-time networks-on-chip," in DATE, 2016.
[31] A. Kostrzewa, S. Saidi, L. Ecco, and R. Ernst, "Flexible TDM-based resource management in on-chip networks," in RTNS '15, pp. 151–160, ACM, 2015.
[32] R. Pellizzoni et al., "A predictable execution model for COTS-based embedded systems," in RTAS, 2011.
[33] R. Stefan, A. Molnos, A. Ambrose, and K. Goossens, "A TDM NoC supporting QoS, multicast, and fast connection set-up," in DATE, pp. 1283–1288, March 2012.
[34] R. Henia et al., "System level performance analysis - the SymTA/S approach," in IEE Proceedings - Computers and Digital Techniques, 2005.

[35] J. Diemer, P. Axer, and R. Ernst, "Compositional performance analysis in Python with pyCPA," in WATERS, July 2012.
[36] S. Schliecker et al., "Providing accurate event models for the analysis of heterogeneous multiprocessor systems," in CODES+ISSS, pp. 185–190, ACM, 2008.
[37] B. Bhat and F. Mueller, "Making DRAM refresh predictable," Real-Time Systems, vol. 47, no. 5, pp. 430–453, 2011.
[38] JEDEC, Arlington, VA, USA, JESD79-2F: DDR2 SDRAM Specification, Nov. 2009.
[39] JEDEC, Arlington, VA, USA, JESD79-3F: DDR3 SDRAM Specification, July 2012.
[40] B. Akesson, K. Goossens, and M. Ringhofer, "Predator: A predictable SDRAM memory controller," in CODES+ISSS, pp. 251–256, ACM, Sept. 2007.
[41] K. Chandrasekar, B. Akesson, and K. Goossens, "Improved power modeling of DDR SDRAMs," in DSD, pp. 99–108, 2011.
[42] L. Ecco, S. Tobuschat, S. Saidi, and R. Ernst, "A mixed critical memory controller using bank privatization and fixed priority scheduling," in RTCSA, pp. 1–10, Aug 2014.
[43] Y. Ben-Itzhak et al., "HNOCS: Modular open-source simulator for heterogeneous NoCs," in SAMOS, 2012.
[44] R. Pellizzoni, P. Meredith, M.-Y. Nam, M. Sun, M. Caccamo, and L. Sha, "Handling mixed-criticality in SoC-based real-time embedded systems," in EMSOFT, ACM, 2009.
[45] Y. Hara et al., "Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis," Journal of Information Processing, 2009.