An architecture and design tool flow for embedding a virtual FPGA into a reconfigurable system-on-chip




ARTICLE IN PRESS

JID: CAEE

[m3Gsc;April 21, 2016;18:12]

Computers and Electrical Engineering 0 0 0 (2016) 1–11

Contents lists available at ScienceDirect

Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

Tobias Wiersema∗, Arne Bockhorn, Marco Platzner

Department of Computer Science, Paderborn University, 33098 Paderborn, Germany

Article info

Article history: Received 10 June 2015; Revised 30 March 2016; Accepted 5 April 2016; Available online xxx

Keywords: Virtual FPGA; FPGA Overlay; ZUMA; ReconOS; Timing analysis

Abstract

Virtual field programmable gate arrays (FPGA) are overlay architectures realized on top of physical FPGAs. They are proposed to enhance or abstract away from the physical FPGA for experimenting with novel architectures and design tool flows. In this paper, we present an embedding of a ZUMA-based virtual FPGA fabric into a complete configurable system-on-chip. Such an embedding is required to fully harness the potential of virtual FPGAs, in particular to give the virtual circuits access to main memory and operating system services, and to enable a concurrent operation of virtualized and non-virtualized circuitry. We discuss our extension to ZUMA and its embedding into the ReconOS operating system for hardware/software systems. Furthermore, we present an open source tool flow to synthesize configurations for the virtual FPGA, along with an analysis of the area and delay overheads involved.

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Virtualization of resources has a long tradition in computing. Generally, virtualization is an abstraction technique that presents a different view on the resources of a computing system than the physically accurate one. Virtualization is mostly used to give users the impression of complete and exclusive access to a resource, to isolate users of a resource and guarantee their non-interference, to optimize resource usage, or to simplify application development by abstracting away from the details of physical devices.

Virtual FPGAs, also denoted as FPGA overlays, apply the concept of virtualization to the domain of reconfigurable computing. Interest in FPGA virtualization has been fueled by several motivations, which include overcoming the limited hardware resources, enriching the capabilities of existing FPGAs, achieving portability of reconfigurable logic implementations and, finally, providing an experimental testbed for FPGA architecture and computer-aided design (CAD) tool research. Although research in FPGA virtualization has been increasing over the last decade, the field is still in its infancy, with only a few prototypical systems described in the literature and fewer yet freely available to use and modify for research.

Recent work has been addressing two main issues with virtual FPGAs: minimizing the overheads of FPGA overlays with respect to area and speed, and devising novel reconfigurable architectures and corresponding tool flows. An equally important issue that has not been sufficiently addressed yet is the embedding of virtual reconfigurable fabrics into complete configurable systems-on-chip. Circuits configured onto virtual FPGAs cannot exist in isolation; they require interfaces to other, non-virtualized reconfigurable hardware, memory, peripherals and software running on central processing units (CPU).

∗ Corresponding author. Tel.: +49 5251604343. E-mail address: [email protected] (T. Wiersema).

http://dx.doi.org/10.1016/j.compeleceng.2016.04.005 0045-7906/© 2016 Elsevier Ltd. All rights reserved.

Please cite this article as: T. Wiersema et al., An architecture and design tool flow for embedding a virtual FPGA into a reconfigurable system-on-chip, Computers and Electrical Engineering (2016), http://dx.doi.org/10.1016/j.compeleceng.2016.04.005


This paper is an extended version of our work presented in [1]. Its contribution is the conceptual and practical integration of an FPGA overlay into a platform FPGA comprising CPU and reconfigurable hardware cores, managed by a Linux operating system. To this end we extend ZUMA [2], a freely available state-of-the-art virtual FPGA architecture and tool flow, and embed it into the ReconOS architecture and operating system [3,4]. ReconOS is open source and semantically integrates reconfigurable logic cores into a standard Linux environment by introducing a multithreaded programming model not just for software but also for hardware. The resulting so-called hardware threads have complete access to virtualized main memory and all operating system services, including communication to software threads. We present the architectural integration of ReconOS and ZUMA, the corresponding extended tool flow and experimental results.

The paper is structured as follows. In Section 2 we review related work in the area of virtual FPGAs, in Section 3 we present our extensions of ZUMA and its tool flow and in Section 4 we explain the integration of the extended ZUMA overlay into ReconOS. Section 5 shows experimental results and Section 6 concludes the paper.

2. Related work on virtual FPGAs

Virtualization of reconfigurable hardware has been addressed in research for more than a decade. Early concepts of virtualization drew an analogy to virtual memory and proposed to load and remove reconfigurable hardware modules from an FPGA similar to pages of memory that can be swapped in or out of main memory frames, e.g., Brebner [5] or Fornaciari and Piuri [6]. The main motivation of this early work was to overcome the limited hardware resources of FPGAs. Some years later, Lagadec et al. [7] introduced a definition of a virtual FPGA as a separate overlay on top of a physical FPGA. The authors discussed advantages of having an overlay that is unconstrained by the underlying physical FPGA.
The main advantages were described as the portability of circuits and, provided the virtual architecture is open and adaptable, a means to investigate and experiment with new FPGA architectures. In FPGAs and application-specific integrated circuits (ASIC) alike, virtual overlays can introduce features which the underlying hardware does not have, most notably fast partial and dynamic reconfiguration. Lagadec et al. also mentioned potential disadvantages of using overlays, namely the area overhead, the reduced clock frequency, and a lack of tool chains for synthesizing to virtual FPGAs.

The concept of virtual FPGAs has also been used by researchers from the domain of evolvable hardware, e.g., Sekanina [8] and Glette et al. [9]. Evolutionary circuit design requires very frequent synthesis and evaluation of evolved circuit candidates. Synthesis and reconfiguration times for commercial fine-grained FPGAs have been found to be far too long for this purpose. Hence, most approaches in evolvable hardware leverage some form of coarse-grained reconfigurable architecture and reconfigure this overlay through setting multiplexers, a process denoted as virtual reconfiguration.

In 2004, Plessl and Platzner [10] published a survey of approaches for virtualization of hardware. One of the approaches, which is denoted as virtual machine [11], uses an abstract overlay with a different architecture than the underlay. In this approach, the virtual machine is a runtime system that adapts and synthesizes the configuration for an abstract FPGA to an actual reconfigurable device. The configuration was termed hardware byte code.

Lysecky et al. [12] presented first measurements of an actual virtual FPGA, reporting a 100× area overhead and a 6× decrease in circuit performance through virtualization. They concluded that virtualization is only viable if circuit portability is of paramount importance.
Brant and Lemieux later improved on these findings by presenting ZUMA [2], an FPGA overlay that lowers the area overhead to a reported 40× through careful architectural choices. One of these choices is to store the virtual configuration not in flip-flops but in LUTRAM, which is distributed random-access memory (RAM) built from lookup tables (LUT), by far the most abundantly available resource on FPGAs. Modern FPGAs allow designs to use these LUTs both as RAM and in data paths at the same time, making them ideal building blocks for virtual FPGAs. Brant and Lemieux also addressed the lack of tool chains for virtual FPGAs and used the well-known open source tool flow VTR (from Verilog To Routing) [13] to generate the hardware of and the configurations for their virtual FPGA. The generator source code for ZUMA virtual FPGAs has been released as open source.¹

Hübner et al. [14] present a system-on-chip with an ARM Cortex M1 soft core processor and a virtual FPGA on one physical FPGA. They describe their virtual FPGA and a supporting tool chain, which is based on VPR, the place & route tool at the heart of the VTR flow. Unfortunately, Hübner et al. do not present a quantitative analysis of the area overhead for their virtual FPGA and the architecture is not openly available to the research community, but judging by some of the technological details, such as using flip-flops to store the configuration, ZUMA presumably is the more advanced architecture.

Coole and Stitt present a slightly different approach to FPGA overlays called intermediate fabrics [15]. They do not address the advantages and disadvantages of overlays discussed in earlier work, but instead focus on FPGA synthesis times. Placing and routing sophisticated designs on high density devices using vendor tools can take hours or days, which Coole and Stitt consider a weakness.
Consequently, they came up with intermediate fabrics as a general concept for virtual overlays built from more coarse-grained building blocks than lookup tables. These intermediate fabrics should greatly simplify the placement and routing steps, speeding them up by a factor of up to 800×. Fine-grained virtual FPGAs such as ZUMA can be seen as a special case of this approach, albeit not a very interesting one for their metric, as place and route would not be significantly faster than for usual FPGAs for overlays of the same size.

¹ The official ZUMA Git repository is available at https://github.com/adbrant/zuma-fpga and includes many of the extensions described in this paper.


Finally, Jain et al. [16] present an approach similar to ours, integrating an overlay into a Zynq system-on-chip and reporting on the resulting area and timing overhead. In contrast to our work, however, the authors use much more restricted functional units that implement only a few operators. The resulting DySER overlay is thus rather coarse-grained, sacrificing generality and flexibility for performance. As the authors use the exact same system-on-chip (SoC) as we do, the maximum overlay size of 6 × 6 is directly comparable to our overlay sizes, only differing in the type of the constraining resource of the underlay. Because the overlay can use special digital signal processing (DSP) blocks to implement the small set of operators, and because it is coarse-grained and thus does not include the combinational loops found in ZUMA, the authors can actually determine the safe operating frequency for the overlay. The high performance of the otherwise less optimized overlay shows that our pessimistic timing results are a direct consequence of ZUMA's flexibility and fine-grained nature.

In summary, we can identify a number of reasons why researchers have been looking into virtual FPGA architectures. First, portability of synthesized hardware designs across FPGA devices, families or even vendors is a long term goal and would help reduce dependence on single manufacturers and lower costs of migrating to new hardware. In addition, the overlay can provide architectural features the underlay lacks, for example, dynamic and partial reconfigurability. However, hardware portability still remains an active research topic rather than a practically used feature given the huge overheads in area and delay as well as the rather limited virtual architectures presented so far. Second, when speeding up place & route or the reconfiguration process is the main motivation, then the overheads of current overlays might be bearable.
Third, FPGA overlays are excellent experimental environments to study new reconfigurable architectures and design tool flows. This holds especially true if researchers have open access to virtual architectures and their bitstream formats, as well as to the corresponding tool flows.

In our work we follow the definition of a virtual FPGA provided by Lagadec et al. [7] and use an extended version of the original ZUMA virtual FPGA architecture and tool flow [2]. As stated above, this paper is an extended version of our work presented in [1]. In [17], we have presented an example for an application using the integration of ZUMA and ReconOS. We have included a ZUMA overlay into a ReconOS hardware thread of an image stream processing application that applies a runtime-exchangeable filter on each image of a stream, e.g., from a webcam or video file, and displays the original and filtered images alongside each other. We have implemented the filter itself in the overlay, allowing us to write portable filters for different host FPGAs with provably correct functionality, and to easily verify and switch between them at the press of a button.

To showcase the flexibility and power of the integration of a virtual FPGA into a complete configurable system-on-chip, we also included ZUMA into the base ReconOS system. The inclusion of such an overlay gives designers the flexibility to quickly evaluate new features or concepts for parts of the system without having to resynthesize the complete system, or even while using a different architecture with different features for the overlay. For this integration, the designer has to implement the reconfiguration of the overlay in a similar way as our configuration controller, using adequate data channels to obtain the configuration at that point of the design.
In [18] we have reported on one such integration scenario, in which we integrated a ZUMA overlay into the arbiter of the ReconOS memory interface, where it can interact with all memory accesses originating from any hardware thread. We successfully integrated a memory access monitoring and enforcement circuit into the overlay, enforcing a runtime-reconfigurable memory policy for all hardware threads.

3. Extending ZUMA

To the best of our knowledge, ZUMA is the most advanced FPGA overlay freely available today. ZUMA was designed for a low virtualization overhead and the open reference implementation helps others to integrate it easily into any given design.

We use and extend the original ZUMA tool flow depicted in Fig. 1. We start with a behavioral description of the virtual circuit in Verilog and a parametrized description of the regular island-style overlay that is translated to an architecture file in extensible markup language (XML) format, and use the VTR flow tools ODIN, ABC and VPR to synthesize, technology map, as well as pack, place and route the virtual circuit on the virtual hardware. Additionally, VPR is used to compute an abstract routing resource graph for the overlay architecture. From this routing resource graph we generate the behavioral description of the overlay hardware itself, to be synthesized to an actual FPGA, and from the netlist and placement & routing information we create the resulting bitstream that can be used to configure the virtual FPGA with the virtual circuit.

To operate ZUMA overlays in a complete hardware/software system-on-chip platform, we had to extend the reference implementation in several ways.
In this section, we first describe our ZUMA extensions for interfacing between the overlay and the underlay, for implementing virtual sequential circuits, and for more accurately estimating the safe operating frequency of the overlay with regard to the currently configured virtual circuit. We then discuss ongoing and future work on making the tool flow for ZUMA more sophisticated.

3.1. Virtual-physical interface

To embed a virtual FPGA architecture into a physical reconfigurable fabric we must define an interface, just as FPGAs have input and output (I/O) pads to connect to off-chip components. This interface needs to have a fixed interpretation on both ends, the virtual and physical, because the embedding and wiring of the overlay into the underlay will be fixed once the overlay architecture is synthesized. All virtual configurations will have to adhere to the resulting virtual I/O pad locations. Since the ZUMA reference implementation had no concept of constraining the placement of virtual I/O pads,


Fig. 1. Tool flow to create a ZUMA overlay and configurations for it. Dashed lines denote further improvements on the tool flow, described in Section 3.4.

we have developed a work-around that effectively enables us to synthesize virtual circuits that can be connected to the underlay. We have introduced an extra ordering logic layer between the underlay and the overlay, which we use to connect all of the virtual I/O pads at their random locations to the physical wires at their fixed locations. The ordering layer consists of large multiplexers for all the virtual I/O pads that we automatically insert in place of the virtual FPGA I/Os during the second ZUMA script stage of the tool flow. To that end we have extended ZUMA to offer support for larger multiplexers, which were capped in the original ZUMA design at the square of the number of LUT inputs of the physical FPGA, $k_{\text{host}}^2$, e.g., 36 for modern FPGAs with 6-input LUTs. We automatically deduce the correct permutation, and thus the correct multiplexer configurations, from VTR's output files and also include the corresponding configuration bits in the bitstream. The ordering is thus transparent to the designer, as it happens automatically in the background if it is enabled.

3.2. Sequential virtual circuits

We have extended the ZUMA reference design by adding optional virtual flip-flops after each virtual LUT to enable sequential virtual circuits. While the description of the ZUMA architecture [2] includes flip-flops for this purpose, the actual reference implementation supported only combinational circuits. To create virtual flip-flops in the most area-conserving manner, we have developed a special version of the LUTRAM macro that is the basic building block of the ZUMA overlay. The special version is only used for the eLUTs (embedded LUTs), i.e., the physical LUTs representing a virtual LUT that are each implemented by a single LUTRAM macro, and not for any of the multiplexers of the virtual routing fabric, which are still implemented using the original (combinational) LUTRAM macro.
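To make the idea of an eLUT with an optional virtual flip-flop concrete, the following behavioural model may help. It is only a sketch under stated assumptions: the class `ELut`, its fields, and the single selection bit `use_ff` are hypothetical names chosen for illustration and do not reproduce the ZUMA LUTRAM macro, which is written in Verilog.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ELut:
    """Behavioural model of one eLUT with an optional virtual flip-flop.

    Hypothetical illustration only; not the ZUMA Verilog macro.
    """
    k: int                   # number of virtual LUT inputs
    truth_table: List[int]   # 2**k configuration bits (the LUTRAM contents)
    use_ff: bool = False     # assumed config bit: pick registered or direct output
    _reg: int = 0            # state of the registered output

    def __post_init__(self) -> None:
        if len(self.truth_table) != 2 ** self.k:
            raise ValueError("truth table must hold 2**k bits")

    def _lookup(self, inputs: Sequence[int]) -> int:
        # Input i is treated as address bit i of the LUTRAM.
        addr = sum(bit << i for i, bit in enumerate(inputs))
        return self.truth_table[addr]

    def output(self, inputs: Sequence[int]) -> int:
        """Value driven into the virtual routing fabric."""
        return self._reg if self.use_ff else self._lookup(inputs)

    def clock(self, inputs: Sequence[int]) -> None:
        """Rising edge of the overlay clock: capture the combinational value."""
        self._reg = self._lookup(inputs)


# A 2-input AND gate, once combinational and once registered.
comb = ELut(k=2, truth_table=[0, 0, 0, 1])
seq = ELut(k=2, truth_table=[0, 0, 0, 1], use_ff=True)
seq.clock([1, 1])  # the registered variant only changes on a clock edge
```

In this model the combinational variant follows its inputs immediately, whereas the registered variant holds its last captured value until the next call to `clock`.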
The new macro uses both possible outputs of the LUTRAM, the unregistered one, as before, and the registered one to be used as virtual flip-flop. To make the flip-flop optional, as required by the ZUMA architecture, we have added a 2-input multiplexer node after each eLUT and derive its configuration from the netlist and routing of the virtual circuit. The LUTRAM macro accepts a second clock to be used with the registered output, so that the configuration of the overlay and its operation can be driven by different clocks. We have made use of this possibility, because in this way the clock network used for the overlay will be an actual clock network of the physical FPGA, allowing for fast clock signals that are synchronized with the underlay. Accordingly, we instruct the VTR flow to treat the clock of the ZUMA overlay as an external network, which does not have to be routed using virtual resources.

3.3. Timing analysis

Similar to related work on virtual FPGAs, the ZUMA tool flow lacked a timing analysis. Running a timing analysis on the overlay architecture using the underlay's timing information results in extremely pessimistic delay estimates, which are basically given by the longest possible combinational path in the overlay without taking into account the actual circuit configuration; see Section 5 for examples. Among other things, the physical timing analysis is impeded by the hundreds of potential combinational logic loops introduced by the overlay. To derive more meaningful bounds on the clock frequency for the overlay, we have implemented a way to propagate the timing information from the underlay's synthesis tools back to the VPR tool flow for the overlay.


This process consists of several steps, each with their own challenges. First, we generate a mapping between physical and virtual paths. Second, we measure the delay of the virtual paths by measuring their physical counterparts. Finally, we identify the physically critical virtual path, i.e., the physical path with the maximum delay that represents an actually used virtual path of the current overlay configuration. Due to the place & route step of the Xilinx tools, this is not necessarily the critical path of the overlay configuration without timing information. As the large number of possible combinational loops in the overlay allows for arbitrarily long connections between two nodes in it, and since we do not want to rerun the Xilinx tool flow for each new overlay configuration, we cannot use the naive approach and let the Xilinx tools determine the correct delay of all virtual paths.

We thus have modified the ZUMA tool flow as shown in Fig. 1: The generator scripts of ZUMA create the hardware description of the overlay in Verilog from the routing resource graph of the VTR flow. We include this overlay into a ReconOS design, synthesize and implement it as before, but then extract the timing information for all physical paths that implement virtual edges. Since ZUMA employs Wilton routing [19] in the switch boxes and uses Clos networks instead of fully connected crossbars inside the logic clusters, there are several virtual interconnection points that are not really programmable, i.e., several nodes in the routing resource graph have $\mathit{fanin} = \mathit{fanout} = 1$. In these cases the incident edges are actually contracted to a single edge in the hardware description language (HDL) representation of the overlay by the ZUMA scripts. As the Xilinx synthesis tools thus view these segments as one, we cannot obtain separate timing information for them on the physical side, but as we also cannot use them separately in the virtual routing inside the overlay, this poses no problem.
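The third step above, finding the physically critical virtual path once every used virtual edge carries a measured physical delay, can be sketched as a longest-path search. This is a minimal illustration and not the ZUMA implementation: the function `critical_path`, the edge and delay data layout, and the example delays are hypothetical, and the sketch assumes that the edges used by one configuration form an acyclic graph (paths are cut at virtual flip-flops and I/O pads).

```python
from collections import defaultdict
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def critical_path(used_edges, delay):
    """Return (total_delay, path) of the slowest path over the used edges.

    used_edges: iterable of (src, dst) virtual edges used by the configuration.
    delay:      dict mapping (src, dst) to the measured physical delay.
    Raises graphlib.CycleError if the used edges contain a combinational loop.
    """
    preds = defaultdict(list)
    nodes = set()
    for src, dst in used_edges:
        preds[dst].append(src)
        nodes.update((src, dst))
    if not nodes:
        return 0.0, []

    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    order = TopologicalSorter({n: preds[n] for n in nodes}).static_order()

    arrival = {}    # latest arrival time at each node
    best_pred = {}  # predecessor on the slowest path into each node
    for node in order:
        arrival[node] = 0.0
        best_pred[node] = None
        for p in preds[node]:
            t = arrival[p] + delay[(p, node)]
            if t > arrival[node]:
                arrival[node] = t
                best_pred[node] = p

    # Walk back from the node with the latest arrival time.
    node = max(arrival, key=arrival.get)
    total = arrival[node]
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return total, path[::-1]


# Hypothetical delays in ns: the direct edge is slower than the two-hop detour,
# so the critical path is not the one with the most virtual hops.
edges = [("in0", "sb1"), ("sb1", "lut3"), ("in0", "lut3")]
delay_ns = {("in0", "sb1"): 1.0, ("sb1", "lut3"): 1.0, ("in0", "lut3"): 3.5}
period_ns, path = critical_path(edges, delay_ns)
f_max_mhz = 1000.0 / period_ns  # minimum clock period gives the maximum frequency
```

The example is constructed so that the single direct edge dominates, which mirrors the observation that physical place and route may produce delays for which the triangle inequality does not hold.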
We operate VPR in the area-driven place & route mode and then afterwards run our own timing analysis of the overlay, using the physical timing information for all overlay edges. This way we can search for the physically critical path of an overlay configuration, which is, as mentioned above, not necessarily the critical path inside of the overlay, as the physical place and route may well have produced an overlay in which the triangle inequality does not hold for the delays. We deduce the minimum clock period for the overlay configuration from the critical path we have found.

As an alternative way to use the extracted delays, we could annotate the ZUMA architecture description with them, by combining and averaging (or taking the worst case of) the fine-grained delays until we obtain the delays for the coarse-grained structure elements of the VPR architecture file. This way we could launch VPR in the timing-driven clustering and packing mode in order to give a meaningful timing analysis of the virtual circuit, but we would lose most of the potential timing accuracy present in the physical data, because we can, e.g., only give VPR the details of one deterministically used wire type, and only one switch box delay for the whole overlay, as well as only one cluster delay per column of clusters. We therefore decided to use the first option instead. These steps ensure that we find a maximum clock frequency that allows for safe operation without setup or hold time violations in the overlay, as we use the relevant part of the actual timing information of the underlay.

We have implemented two different versions to extract the timing information from the Xilinx synthesis tools. For the first one we generate a Xilinx user constraint file containing one trivially satisfiable upper delay bound for each virtual wire. This forces the synthesis tools to report the final delay for each constraint, and thus for each virtual wire.
Unfortunately, the reported delays are given as a list of worst case delays for connections between the endpoints of the virtual wire, and there is no easy way to restrict the possible paths to the physical counterparts (mapped wires) of the virtual wire only. The obtained path delay is hence still quite pessimistic using this version.

The second method uses the netgen tool of the Xilinx synthesis flow to export a standard delay format (SDF) file for the placed and routed netlist of the underlay, which includes detailed delay information for each physical component. Since the naming scheme of the components is transparent, we can map the physical components back to the virtual ones, and thus use the information to annotate the complete routing resource graph with accurate aggregated timing information. The drawback of this method lies in a shortcoming of the tool: netgen does not support every structure needed to implement ReconOS on a Zynq SoC using the ISE flow. For a Zynq we can thus only observe the timing information for the overlay in isolation, which results in a good estimate of the actual delays of the overlay implemented in the underlay, as the overlay clearly dominates all other circuit parts in terms of area and delay (see Table 1). In Section 5 we report on experimental results for both methods.

3.4. Further extensions to the ZUMA tool flow

For the experiments shown in this paper we have used the original ZUMA tool flow with the extensions described in Sections 3.1 and 3.2. The improvement and automation of the tool flow along the following three directions is ongoing and future work; Fig. 1 shows them as dashed lines.

First, we are implementing a second work-around for constraining the placement of virtual I/O pads.
This second work-around uses the optional io.pads placement file of the VTR tool flow (see Section 3.1), in which the location of each pad can be fixed using the pad names which are used internally in the flow, i.e., the names assigned by ODIN. Unfortunately, VTR does not allow designers to specify a generic file for this purpose. Thus, one has to interrupt the ZUMA flow after VTR, generate the constraints for the I/O pads and then re-run the flow from the VTR step with this new constraints file. Both work-arounds come with different trade-offs. The ordering layer requires additional logic but gives VTR full flexibility in placing I/O pads, which can potentially lead to more efficient designs. Fixing the I/O pad locations saves the logic for the ordering layer but constrains VTR's placement.

Second, a disadvantage of the current ZUMA tool flow as shown in Fig. 1 is the simultaneous processing of both layers in the VTR flow step, which means that the overlay description will be re-generated every time a new virtual configuration


Fig. 2. Xilinx Zynq version of ReconOS [3], with n + 1 software threads (SWT), m + 1 hardware threads (HWT), their m + 1 delegate threads (DT), and a 5 × 5 ZUMA overlay embedded into HWT 0.

is computed. Apart from being unnecessary overhead, this can also lead to incompatibilities if a new VTR version introduces changes to the routing graph computation, as this would render ZUMA unable to generate any new configurations for previously generated overlays. Here we are working on an import routine for VTR that can parse routing graphs instead of generating them anew from the architecture description. New ZUMA configurations could then be computed by starting with an existing routing graph from a previous run.

Third, we plan several enhancements of our proposed timing method. For example, we would like to embed the resulting frequency into the ZUMA bitstream. In the system-on-chip architecture we plan to employ a solution such as the Xilinx Dynamic Reconfiguration Port to alter the output clock frequency of a Digital Clock Manager to the desired frequency for the ZUMA overlay at runtime. We will furthermore look into alternative ways of obtaining the timing information on the physical side, as well as the possibility of using the constraints in their current form to encourage the Xilinx tools to faithfully map the virtual structure using the physical resources, by lowering the allowed delay for each constraint. As the latter will obviously introduce a significant overhead in the place & route stage of the Xilinx tools, we will have to investigate whether a good balance can be achieved.

4. Embedding virtual fabrics into ReconOS

ReconOS is an architecture and execution environment for hybrid hardware/software systems featuring a multithreaded programming model which allows for regular software threads as well as hardware threads [3,4]. Fig. 2 shows the


Fig. 3. Configuration process for the ZUMA overlay embedded in a ReconOS hardware thread.

architecture for a ReconOS v3.1 system on Xilinx Zynq with n + 1 software threads (SWT) and m + 1 hardware threads (HWT). Hardware threads are basically circuits that can interact with other threads and operating system services, such as semaphores, through a FIFO-based (first in, first out) operating system interface (OSIF). To enable this interaction in a fully transparent way, ReconOS instantiates one delegate thread (DT) in software per running hardware thread. The delegate thread accesses the operating system services and communicates with other threads on behalf of the hardware thread. Additionally, each hardware thread can access the ReconOS memory subsystem through a FIFO-based memory interface (MEMIF). Using ReconOS for embedding a virtual FPGA provides us with a mature, Linux-based infrastructure for implementing hardware/software systems, including a CPU core, memory controller, peripherals and a standard software operating system.

A ZUMA overlay can be embedded into several locations of a ReconOS architecture, depending on the intended use of the overlay. For example, we can embed the overlay into the MEMIF or OSIF to realize functions for controlling memory access or operating system interaction in virtualized hardware. Alternatively, we can integrate the overlay as a separate core connected to an advanced extensible interface (AXI) bus. However, the most flexible way of embedding the overlay is to integrate it into a ReconOS hardware thread, as demonstrated in [1]. Inside the hardware thread, the overlay can either replace or augment the non-virtualized user logic. Fig. 2 depicts this integration option in HWT 0. The shown hardware thread contains a ZUMA overlay with an ordering layer for the virtual I/Os as described in Section 3.1, a configuration controller including a local buffer memory, and a block of non-virtualized user logic.

We have developed a prototype system as shown in Fig. 2, including software functions for configuring and communicating with the virtual FPGA. The software connects to the hardware thread containing the overlay via message boxes and allocates a shared memory region in the system memory for data exchange. The subsequent configuration process is depicted in Fig. 3. The software reads a bitstream for the virtual FPGA from the file system and parses it to verify the integrity of the bitstream using ZUMA's line checksums [2]. The file system in ReconOS versions using Linux as host operating system is either local or on a remote server connected via the network file system (NFS) protocol. To better utilize the bandwidth to the ReconOS memory subsystem and thus to reduce configuration time, the software as well as the configuration controller for the virtual FPGA operate on 8 KiB blocks of configuration data. The shared memory region is operated in a double-buffering scheme: Each time one block of the shared memory is filled with new configuration data, the software sends a message to the hardware thread and continues to parse the bitstream into the second block. In the meantime, the configuration controller copies the first block into a local on-chip RAM buffer and from there it feeds the configuration data into the ZUMA overlay, which shifts them into the LUTRAMs. Once the configuration data in the local memory blocks have been completely processed, the hardware thread sends a message to the software to request the next block.

The software functions can be included in any user-defined software thread. In our test setup we have compiled them to an executable that can be called in ReconOS/Linux with the required bitstream as a command line parameter. After configuring the overlay, the software process remains connected with the hardware thread containing the overlay. New input data are continuously generated and sent to the hardware thread via message box calls.
The hardware thread provides these data to the overlay using the I/O pad ordering layer. The outputs of the overlay are, again properly ordered, written to shared memory from where the executable can read and display them. Naturally, this is just a test setup we are using to showcase the integration of the overlay into ReconOS. The virtual circuit running in the overlay has full access to all ReconOS features, including all operating system services and the memory subsystem.
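The double-buffered block transfer described above can be illustrated with a small software-only model. The following Python sketch is not ReconOS code and does not use the ReconOS API: the two `queue.Queue` objects merely stand in for the message boxes, a two-element list stands in for the shared memory region, a `bytearray` stands in for the LUTRAM shift chain, and all function names are hypothetical. Bitstream parsing and checksum verification are omitted.

```python
import queue
import threading

BLOCK_SIZE = 8 * 1024  # 8 KiB configuration blocks, as stated in the text


def hardware_thread(shared, to_hw, to_sw, lutram):
    """Hypothetical model of the configuration controller in the hardware thread."""
    while True:
        slot = to_hw.get()                  # wait for a "block ready" message
        if slot is None:                    # sentinel: no more blocks
            break
        local_buffer = bytes(shared[slot])  # copy block into the local buffer
        lutram.extend(local_buffer)         # shift the data into the overlay
        to_sw.put(slot)                     # request the next block


def configure_overlay(bitstream):
    """Hypothetical model of the software side.

    Returns the bytes that reached the modelled overlay, in order.
    """
    shared = [b"", b""]                     # two blocks of the shared region
    to_hw, to_sw = queue.Queue(), queue.Queue()
    lutram = bytearray()
    hw = threading.Thread(target=hardware_thread,
                          args=(shared, to_hw, to_sw, lutram))
    hw.start()
    for index, start in enumerate(range(0, len(bitstream), BLOCK_SIZE)):
        slot = index % 2                    # alternate between the two blocks
        if index >= 2:
            to_sw.get()                     # block until this slot was consumed
        shared[slot] = bitstream[start:start + BLOCK_SIZE]
        to_hw.put(slot)                     # notify the hardware thread
    to_hw.put(None)
    hw.join()
    return bytes(lutram)
```

In this model the software can fill the second block while the first is still being consumed, and it only overwrites a block after the matching request message has arrived, which is the property the double-buffering scheme relies on.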

5. Experimental results

In this section we report on experimental results for our extended ZUMA overlays embedded into ReconOS. We have used ReconOS v3.1 on an Avnet ZedBoard, containing a Xilinx Zynq integrated processing system with two ARM Cortex-A9 MPCore application processors and programmable logic fabric on a single die. The ZUMA layout we have synthesized is based on the island-style layout found in [2] and uses a 2D array of k × l clustered logic blocks. Each cluster comprises 8 basic logic elements (BLE), and each BLE consists of one 6-input lookup table and one bypassable flip-flop. Each cluster receives 28 inputs from outside the cluster and 8 feedback connections from the internal BLE outputs. Each cluster input and output can connect to 6 different virtual wires of the surrounding channels, which are 112 virtual wires wide. We have generated all bitstreams, virtual and physical ones, on a machine with an Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70 GHz processor with 16 GiB RAM.

Table 1
Area and speed measurements for a ReconOS system with and without a 3 × 3 overlay.

| Measure | ReconOS with bare HWT | ReconOS with ZUMA and ordered I/Os | ReconOS with ZUMA and unordered I/Os |
|---|---|---|---|
| area [LUTs] | 3270 | 9919 | 9568 |
| area [LUTRAMs] | 181 | 5877 | 5557 |
| Xilinx fmax [MHz] | 102.05 | 0.71 | 0.83 |

Table 1 lists hardware area and speed for different system configurations. The hardware area is measured in LUTs and LUTRAMs, where the LUTRAM count is also included in the LUT measure. LUTRAMs are listed separately, since only 50% of the LUTs available on our Zynq programmable logic fabric can act as LUTRAM, creating an area restriction for larger overlays. The second column of Table 1 presents, as a reference, data for a base ReconOS system with one empty hardware thread, i.e., no actual user logic. The third and fourth columns show data for a hardware thread with an embedded extended ZUMA overlay of 3 × 3 clusters with ordered and unordered I/Os, respectively. The data show that our I/O ordering layer leads to a rather small area overhead, with a 3.7% increase in area compared to an overlay with unordered I/Os, mainly accounted for by added LUTRAMs. To quantify this area overhead also for larger overlays, we have measured the area increase through inserting the I/O ordering multiplexers for overlays of sizes from 2 × 2 up to 100 × 100 clusters. We have found that the addition of the ordering multiplexers resulted in area increases between 3% and 6%. The third and fourth columns of Table 1 also report the maximum clock frequency for the overlays; as discussed in Section 3, this is an overly pessimistic timing estimation.

Table 2
Speed measurements for different virtual circuits in a 3 × 3 overlay.

| Virtual circuit | Constraints method: fmax [MHz] | Constraints method: slowdown factor | SDF method: fmax [MHz] | SDF method: slowdown factor | SDF method: favg [MHz] | SDF method: slowdown factor |
|---|---|---|---|---|---|---|
| Xilinx tools | 0.732 | 139.35× | | | | |
| NOT gate | 1.235 | 82.63× | 42.544 | 2.40× | 91.752 | 1.11× |
| 6bit adder | 0.605 | 168.68× | 26.445 | 3.86× | 56.173 | 1.82× |
| 8bit adder | 0.666 | 153.23× | 19.813 | 5.15× | 43.442 | 2.35× |
| 8bit RCA | 0.591 | 172.67× | 12.967 | 7.87× | 28.098 | 3.63× |
| 4bit mult. | 0.408 | 250.12× | 11.579 | 8.81× | 24.827 | 4.11× |

Table 2 shows the timing information that we were able to generate using our methods described in Section 3.3. The first row shows the reference timing results obtained from the Xilinx tools, which disregard the actual overlay configuration; the other rows list the individual test circuits.
Using the constraints method, whose results are shown in columns two and three, we remain far away from the 100+ MHz achieved without any overlay, but we were able to increase the estimate of a safe fmax by a factor of up to 1.68, or in other words to show a virtualization slowdown factor of only 83× instead of 139× for a simple overlay configuration. Although the method in its current form cannot improve the estimate for larger circuits in the overlay, the lower range of estimates is very close to the Xilinx tools' estimate, and thus very close to actually improving on it. We therefore conclude that the method itself has great potential for working with overlays, but still needs improvement.

The last four columns of Table 2 show the analysis results of the SDF method, with the worst-case results in columns four and five, and the average-case results in columns six and seven. As explained in Section 3.3, we had to measure the overlay in isolation for this method. Using the detailed knowledge base of the Xilinx tools allows us to show a much better bound for the simple test circuits, improving the best slowdown factor from about 83× to only 2.4×. We can now also show that in the average case, the Xilinx tools predict that we can safely operate the overlay at up to nearly 92 MHz, compared to the original speed of 102.05 MHz. The fast degradation of the slowdown factor for increasingly complex circuits is due to the lack of timing optimization during the routing step of VPR, which at the moment only minimizes the area in the ZUMA tool flow. As already explained in Section 3.3, the current version of VPR does not allow for fine-grained timing information in the architecture file, but forces users to consolidate the delay of many different entities into average delays, thereby losing a great deal of the detailed knowledge gathered in the Xilinx tools.
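The slowdown factors discussed above are simply the ratio between the clock frequency of the non-virtualized reference system and the frequency estimated for the virtual circuit. The following Python sketch recomputes them from the published numbers; the function and variable names are hypothetical and chosen for illustration only. Because the published fmax values are rounded, recomputed factors can differ from the tabulated ones in the last digits.

```python
BASE_FMAX_MHZ = 102.05  # ReconOS with a bare hardware thread (Table 1)

# Estimated frequencies in MHz, copied from Table 2
CONSTRAINTS_FMAX = {"Xilinx tools": 0.732, "NOT gate": 1.235,
                    "6bit adder": 0.605, "8bit adder": 0.666,
                    "8bit RCA": 0.591, "4bit mult.": 0.408}
SDF_WORST_FMAX = {"NOT gate": 42.544, "6bit adder": 26.445,
                  "8bit adder": 19.813, "8bit RCA": 12.967,
                  "4bit mult.": 11.579}
SDF_AVG_FMAX = {"NOT gate": 91.752, "6bit adder": 56.173,
                "8bit adder": 43.442, "8bit RCA": 28.098,
                "4bit mult.": 24.827}


def slowdown(fmax_mhz, base_mhz=BASE_FMAX_MHZ):
    """Virtualization slowdown: reference frequency over estimated frequency."""
    return base_mhz / fmax_mhz


def improvement(new_fmax_mhz, old_fmax_mhz):
    """Factor by which a refined estimate raises the safe clock frequency."""
    return new_fmax_mhz / old_fmax_mhz


if __name__ == "__main__":
    for name, fmax in SDF_WORST_FMAX.items():
        print(f"{name}: worst case {slowdown(fmax):.2f}x, "
              f"average case {slowdown(SDF_AVG_FMAX[name]):.2f}x")
    gain = improvement(CONSTRAINTS_FMAX["NOT gate"],
                       CONSTRAINTS_FMAX["Xilinx tools"])
    print(f"constraints method gain for the NOT gate: {gain:.2f}")
```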
Finding a way to guide VPR's routing step with the detailed delay information of the underlay is left for future work. Although we have only tested our timing back annotation approach with Xilinx tools and devices, in principle the approach should also work for Altera FPGAs. Citing from the Quartus Verification handbook, there is a "Standard Delay Format Output File (.sdo)" which "contains the delay information of each architecture primitive and routing element in your design", which is exactly what we need for our second method.

We have also compared the area required for implementing the overlay with the area required in the physical FPGA to realize a circuit with an equal number of LUTs. The 72-LUT overlay used for Table 3 needs 4787 LUTs to implement, so we observe a 66× increase in area with our ZUMA enhancements, which is still rather close to the 40× area overhead reported for the original ZUMA [2]. The last column of Table 3 displays the area requirements for the user logic and the configuration controller. The user logic added to the overlay implements only some basic functionality, such as receiving new inputs from the software side and sending back the outputs, and together with the configuration controller needs about 1/7th of the size of the overlay.

Table 3
Area requirements for a 3 × 3 ZUMA overlay in a ReconOS hardware thread (HWT area usage, unordered I/Os).

| Measure | Overlay | Config controller and UL |
|---|---|---|
| area [LUTs] | 4787 | 669 |
| area [LUTRAMs] | 4784 | 0 |

Table 4
Measurements of ReconOS with overlays of different sizes. Bitstream sizes in parentheses are the compressed sizes; LUTRAM percentages give the utilization of the Zynq programmable logic fabric.

| Size | ZUMA synth. [s] | ZUMA reconf. [s] | ZUMA bitstream [KiB] | Xilinx synth. [s] | Xilinx LUTRAMs |
|---|---|---|---|---|---|
| 1 × 1 | 0.30 | 0.01 | 13 (2.3) | 521.50 | 757 (4%) |
| 2 × 2 | 0.44 | 0.03 | 53 (13) | 591.55 | 2437 (14%) |
| 3 × 3 | 0.74 | 0.07 | 118 (30) | 794.89 | 5077 (29%) |
| 4 × 4 | 1.12 | 0.13 | 206 (53) | 1298.80 | 8677 (49%) |
| 5 × 5 | 1.70 | 0.19 | 317 (85) | 2183.01 | 13,237 (76%) |
| 6 × 6 | 2.36 | n/a | 452 (121) | 1316.76 | 37,514 (>100%) |

Table 4 shows, for differently sized overlays, the area requirements, synthesis and reconfiguration times, and the bitstream sizes. The left-hand part of the table details the ZUMA synthesis and reconfiguration, i.e., the mapping of virtual circuits onto the overlay. The right-hand part contains the synthesis time and LUTRAM count for the overlay itself, as reported by the Xilinx tools. The area requirements in the last column state the number of LUTRAMs and the percentage of LUTRAM utilization on our Zynq programmable logic fabric. Using the ZUMA architecture parameters detailed above, we can only fit overlays with a size of up to 5 × 5 clusters on our ZedBoard. The synthesis of a new overlay configuration is quite fast; the runtimes in the second column of Table 4 comprise the complete tool flow from Fig. 1 up to the virtual configuration, as well as the mentioned overhead of re-creating the HDL file for the whole overlay every time. The overlay reconfiguration times in the third column include every step from Fig. 3, measured as wall time on the software side. As the hardware side still spends about 95% to 99% of the virtual reconfiguration cycles receiving, sending or waiting for messages from/to the software side, this number could probably be improved upon by using even more streaming or pipelining techniques for the reconfiguration process.
In our ZUMA tool flow setup, bitstream sizes depend only on the virtual architecture and not on the actually implemented virtual circuit. The fourth column of Table 4 lists the virtual bitstream sizes and, in parentheses, the bitstream sizes when compressed using standard ZIP. On average, the textual representation of ZUMA bitstreams allows for a 75% size reduction using compression. As expected, the virtual bitstream sizes are quite small compared to the bitstream for the physical Zynq fabric, which amounts to 3.9 MiB. The time for synthesizing a ReconOS system with a hardware thread containing a k × k overlay depends on the size and complexity of the overlay, and is listed in the fifth column of Table 4. In our experiments the time increased steadily from about 8 minutes to well over half an hour for a system using large amounts of LUTRAMs.
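The area and bitstream figures quoted in this section follow from a few lines of arithmetic. The Python sketch below reproduces them from the published table values; the helper names are hypothetical, and the default of 8 BLEs per cluster is the architecture parameter stated at the beginning of this section.

```python
def virtual_luts(rows, cols, bles_per_cluster=8):
    """Number of virtual LUTs: one LUT per BLE, 8 BLEs per cluster in our layout."""
    return rows * cols * bles_per_cluster


def area_overhead(physical_luts, virtual_lut_count):
    """Physical LUTs spent per virtual LUT."""
    return physical_luts / virtual_lut_count


def relative_increase(new, old):
    """Relative growth of `new` over `old`, e.g. 0.05 for a 5% increase."""
    return (new - old) / old


def mean_size_reduction(sizes):
    """Average compression saving over (raw, compressed) size pairs."""
    return sum(1 - compressed / raw for raw, compressed in sizes) / len(sizes)


# (raw KiB, compressed KiB) per overlay size, copied from Table 4
BITSTREAM_SIZES = [(13, 2.3), (53, 13), (118, 30),
                   (206, 53), (317, 85), (452, 121)]

if __name__ == "__main__":
    luts = virtual_luts(3, 3)
    print(f"virtual LUTs in a 3 x 3 overlay: {luts}")
    print(f"area overhead: {area_overhead(4787, luts):.1f}x")
    print(f"I/O ordering overhead: {relative_increase(9919, 9568):.1%}")
    print(f"mean compression saving: {mean_size_reduction(BITSTREAM_SIZES):.1%}")
```

The same helpers can be reused for other overlay sizes, for instance to estimate how many physical LUTs a larger overlay would need if the per-LUT overhead stayed constant, which is an assumption that the measurements in Table 4 only partly support.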

6. Conclusion

In this paper we have presented the embedding of an extended ZUMA virtual FPGA fabric into a ReconOS/Linux system running on a Xilinx Zynq. We have discussed our extensions to ZUMA and the detailed embedding into ReconOS, and presented experiments using our prototype architecture and tool flow. While our experiments have confirmed that FPGA overlays still come with considerable virtualization overheads in terms of area and delay, the main result of our work is the greatly simplified experimentation with virtual FPGAs. Circuits mapped to virtual FPGAs can now easily call Linux operating system services and thus communicate with other threads or machines and utilize a standard virtual memory subsystem. Future work includes further improvements and automation of the tool flow, especially concerning the timing analysis, as well as an investigation of and experiments with portable reconfigurable circuits.

Acknowledgments

This work was partially supported by the German Research Foundation (DFG) within the Collaborative Research Centre On-The-Fly Computing (SFB 901). We would like to thank the anonymous CAEE reviewers for their work and suggestions.


References

[1] Wiersema T, Bockhorn A, Platzner M. Embedding FPGA overlays into configurable systems-on-chip: ReconOS meets ZUMA. In: Proceedings of the international conference on ReConFigurable computing and FPGAs (ReConFig). IEEE Computer Society. p. 1–6. doi:10.1109/ReConFig.2014.7032514.
[2] Brant A, Lemieux GGF. ZUMA: An open FPGA overlay architecture. In: Proceedings of the international symposium on field-programmable custom computing machines (FCCM). IEEE Computer Society; 2012. p. 93–6. doi:10.1109/FCCM.2012.25.
[3] Agne A, Happe M, Keller A, Lübbers E, Plattner B, Platzner M, et al. ReconOS: An operating system approach for reconfigurable computing. IEEE Micro 2013;34(1):60–71. doi:10.1109/mm.2013.110.
[4] Lübbers E, Platzner M. ReconOS: Multithreaded programming for reconfigurable computers. ACM Trans Embedded Comput Syst 2009;9(1):8:1–8:33. doi:10.1145/1596532.1596540.
[5] Brebner GJ. The swappable logic unit: A paradigm for virtual hardware. In: Proceedings of the international symposium on field-programmable custom computing machines (FCCM). IEEE Computer Society; 1997. p. 77–86. doi:10.1109/FPGA.1997.624607.
[6] Fornaciari W, Piuri V. Virtual FPGAs: Some steps behind the physical barriers. In: Rolim J, editor. Proceedings of the parallel and distributed processing workshops (IPPS/SPDP). Lecture notes in computer science, vol. 1388. Berlin Heidelberg: Springer; 1998. p. 7–12. ISBN 978-3-540-64359-3. doi:10.1007/3-540-64359-1_665.
[7] Lagadec L, Lavenier D, Fabiani E, Pottier B. Placing, routing, and editing virtual FPGAs. In: Brebner G, Woods R, editors. Proceedings of the international conference on field-programmable logic and applications (FPL). Lecture notes in computer science, vol. 2147. Berlin Heidelberg: Springer; 2001. p. 357–66. ISBN 978-3-540-42499-4. doi:10.1007/3-540-44687-7_37.
[8] Sekanina L. Virtual reconfigurable circuits for real-world applications of evolvable hardware. In: Tyrrell AM, Haddow PC, Tørresen J, editors. Proceedings of the international conference on evolvable systems: from biology to hardware (ICES). Lecture notes in computer science, vol. 2606. Berlin Heidelberg: Springer; 2003. p. 186–97. ISBN 978-3-540-00730-2. doi:10.1007/3-540-36553-2_17.
[9] Glette K, Tørresen J, Yasunaga M. Online evolution for a high-speed image recognition system implemented on a Virtex-II Pro FPGA. In: Arslan T, Stoica A, Suess M, Keymeulen D, Higuchi T, Zebulum RS, et al., editors. Proceedings of the NASA/ESA conference on adaptive hardware and systems (AHS). IEEE Computer Society; 2007. p. 463–70. doi:10.1109/AHS.2007.83.
[10] Plessl C, Platzner M. Virtualization of hardware - introduction and survey. In: Plaks TP, editor. Proceedings of the international conference on engineering of reconfigurable systems and algorithms (ERSA). CSREA Press; 2004. p. 63–9. ISBN 1-932415-42-4.
[11] Ha Y, Schaumont P, Engels M, Vernalde S, Potargent F, Rijnders L, et al. A hardware virtual machine for the networked reconfiguration. In: Proceedings of the international workshop on rapid system prototyping (RSP). IEEE Computer Society; 2000. p. 194–9. doi:10.1109/IWRSP.2000.855224.
[12] Lysecky R, Miller K, Vahid F, Vissers K. Firm-core virtual FPGA for just-in-time FPGA compilation (abstract only). In: Schmit H, Wilton SJE, editors. Proceedings of the international symposium on field-programmable gate arrays (FPGA). ACM; 2005. p. 271. ISBN 1-59593-029-9. doi:10.1145/1046192.1046247.
[13] Rose J, Luu J, Yu CW, Densmore O, Goeders J, Somerville A, et al. The VTR project: architecture and CAD for FPGAs from Verilog To Routing. In: Compton K, Hutchings BL, editors. Proceedings of the international symposium on field-programmable gate arrays (FPGA). ACM; 2012. p. 77–86. doi:10.1145/2145694.2145708.
[14] Hübner M, Figuli P, Girardey R, Soudris D, Siozios K, Becker J. A heterogeneous multicore system on chip with run-time reconfigurable virtual FPGA architecture. In: Proceedings of the international symposium on parallel and distributed processing workshops and PhD forum (IPDPS). IEEE Computer Society; 2011. p. 143–9. doi:10.1109/IPDPS.2011.135.
[15] Coole J, Stitt G. Intermediate fabrics: Virtual architectures for circuit portability and fast placement and routing. In: Givargis T, Donlin A, editors. Proceedings of the international conference on hardware/software codesign and system synthesis (CODES+ISSS). ACM; 2010. p. 13–22. doi:10.1145/1878961.1878966.
[16] Jain AK, Li X, Fahmy SA, Maskell DL. Adapting the DySER architecture with DSP blocks as an overlay for the Xilinx Zynq. ACM SIGARCH Comput Archit News 2015;43. (in press).
[17] Wiersema T, Wu S, Platzner M. On-the-fly verification of reconfigurable image processing modules based on a proof-carrying hardware approach. In: Sano K, Soudris D, Hübner M, Diniz PC, editors. Proceedings of the international symposium on applied reconfigurable computing (ARC). LNCS, vol. 9040. Switzerland: Springer International Publishing; 2015. p. 377–84. doi:10.1007/978-3-319-16214-0-32.
[18] Wiersema T, Drzevitzky S, Platzner M. Memory security in reconfigurable computers: combining formal verification with monitoring. In: Proceedings of the international conference on field-programmable technology (FPT). IEEE Computer Society. p. 167–74. doi:10.1109/FPT.2014.7082771.
[19] Wilton SJE. Architectures and algorithms for field-programmable gate arrays with embedded memories. Ph.D. thesis, National Library of Canada; 1997.


Tobias Wiersema received his M.Sc. degree in Computer Science from Paderborn University, and now works as a research assistant in the Computer Engineering group at the same university. His research interests include provably trustworthy remote verification of reconfigurable hardware circuits and FPGA overlays.

Arne Bockhorn is studying computer science at Paderborn University, and works as a student research assistant in the Computer Engineering group at the same university.

Marco Platzner is a professor for computer engineering at Paderborn University. He holds a diploma and PhD degree in Telematics from Graz University of Technology. His research interests include reconfigurable computing, hardware-software codesign and parallel architectures.

Please cite this article as: T. Wiersema et al., An architecture and design tool flow for embedding a virtual FPGA into a reconfigurable system-on-chip, Computers and Electrical Engineering (2016), http://dx.doi.org/10.1016/j.compeleceng.2016.04.005