J. Parallel Distrib. Comput. 78 (2015) 1–5
J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc
Research note
A case study of parallel JPEG encoding on an FPGA
Chao Wang, Xi Li ∗, Peng Chen, Xuehai Zhou
Department of Computer Science, University of Science and Technology of China, China
Highlights
• We apply Hill & Marty corollaries by implementing heterogeneous cores running JPEG on a real FPGA platform.
• We evaluate theoretical and experimental metrics including the speedup, area, power and core efficiency.
• We explore the hardware/software tradeoff between heterogeneous and homogeneous computing architectures.
Article info
Article history: Received 21 May 2013; received in revised form 16 September 2014; accepted 17 September 2014; available online 23 October 2014
Keywords: JPEG encoding; FPGA; Hill & Marty's findings; Case study
Abstract
In this note we present empirical results from a case study of parallel JPEG encoding on a real FPGA platform, which evaluates and complements Hill & Marty's findings. A hardware prototype is constructed on an FPGA with MicroBlaze processors and JPEG hardware accelerators. Experimental results from this case study demonstrate that Hill and Marty's findings reinforce hardware/software task partitioning for hybrid MPSoC architectures and also provide credible new insights into scalable homogeneous and heterogeneous FPGA-based MPSoC domains.
© 2014 Elsevier Inc. All rights reserved.
1. Review of Amdahl's Law and Hill/Marty's corollaries in the multicore era
In 2008, Hill and Marty [5] proposed an extension of Amdahl's Law [1] for multicore structures built from base core equivalents (BCEs). The extended forms of Amdahl's Law for symmetric and asymmetric multicore chip models introduce two additional parameters, n and r. Under the constraints of the BCE model, n represents the total resources available on the chip and r the resources dedicated to sequential processing, both measured in units of BCE cores. The performance of a processor built from r BCEs is denoted perf(r). The three types of BCE-based multicore system-on-chip models [5] are illustrated in Fig. 1. The symmetric multicore in (a) is composed of single BCEs. (b) presents a symmetric multicore scenario with four microprocessors, each composed of multiple BCE cores. In this case the speedup is calculated as in (1), where f is the fraction of the execution that can be parallelized.

$$\mathrm{Speedup}_{\mathrm{symmetric}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f \cdot r}{\mathrm{perf}(r) \cdot n}}. \tag{1}$$

∗ Corresponding author. E-mail addresses: [email protected] (C. Wang), [email protected] (X. Li), [email protected] (P. Chen), [email protected] (X. Zhou).
http://dx.doi.org/10.1016/j.jpdc.2014.09.010
0743-7315/© 2014 Elsevier Inc. All rights reserved.
Fig. 1(c) illustrates a hypothetical heterogeneous multicore containing one sequential processor coupled with a sea of customized BCE cores. An asymmetric multicore system consists of a single microprocessor for sequential execution, while the surrounding BCE cores execute the parallel sections of code. The speedup depends on whether the central microprocessor also takes part in parallel task execution, so the two cases are discussed separately. On the one hand, if the central processor is involved in the parallel sections, the speedup is calculated as in (2).

$$\mathrm{Speedup}_{\mathrm{asymmetric}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f}{\mathrm{perf}(r) + n - r}}. \tag{2}$$
On the other hand, if the central microprocessor is only in charge of sequential execution, the speedup is calculated as in (3).

$$\mathrm{Speedup}_{\mathrm{asymmetric\text{-}offload}}(f, n, r) = \frac{1}{\frac{1-f}{\mathrm{perf}(r)} + \frac{f}{n - r}}. \tag{3}$$
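As a rough illustration of Eqs. (1)–(3), the following Python sketch evaluates the three speedup models. It assumes perf(r) = sqrt(r), the illustrative choice used by Hill and Marty [5]; the function names and the sample values of f, n and r are hypothetical and are not taken from the experiments in this note.

```python
import math


def perf(r: float) -> float:
    """Performance of one core built from r BCEs.

    Assumption: perf(r) = sqrt(r), an illustrative choice, not a measured value.
    """
    return math.sqrt(r)


def speedup_symmetric(f: float, n: int, r: int) -> float:
    """Eq. (1): n/r identical cores, each built from r BCEs."""
    return 1.0 / ((1.0 - f) / perf(r) + (f * r) / (perf(r) * n))


def speedup_asymmetric(f: float, n: int, r: int) -> float:
    """Eq. (2): one r-BCE core plus (n - r) single-BCE cores; all take part in parallel sections."""
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))


def speedup_asymmetric_offload(f: float, n: int, r: int) -> float:
    """Eq. (3): the r-BCE core runs only the sequential part; (n - r) BCEs run the parallel part."""
    return 1.0 / ((1.0 - f) / perf(r) + f / (n - r))


# Hypothetical sample point: 90% parallel code, 16 BCEs, 4 of them in the large core.
f, n, r = 0.9, 16, 4
print(f"symmetric:          {speedup_symmetric(f, n, r):.2f}")
print(f"asymmetric:         {speedup_asymmetric(f, n, r):.2f}")
print(f"asymmetric-offload: {speedup_asymmetric_offload(f, n, r):.2f}")
```

For this sample point the sketch gives about 6.15× for the symmetric model, 8.75× for the asymmetric model and 8.00× for the offload variant. With r = 1 the symmetric model reduces to the classical Amdahl's Law for n cores.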
Besides Hill and Marty's work, many extensions of Amdahl's Law have been proposed in the literature. For example, Paul and Meyer [8] revisit Amdahl's assumptions and develop sophisticated models for single-chip systems. Morad et al. [7] concentrate on the power efficiency and scalability of asymmetric chip multiprocessors. They predict the scalability limit of asymmetric CMPs, where one core is optimized to accelerate serial sections and the rest of the chip hosts as many cores as possible to accelerate the parallel section. Woo and Lee [12] extend the model into a framework that considers power consumption and related metrics such as performance per watt, while Cho and Melhem [3] focus on energy minimization and power-saving features. Yao et al. [13,14] present a theoretical analysis that gives scalable, quantitative conditions for determining the optimal multicore performance. Sun and Chen [9] analyze multicore scalability under fixed-time and memory-bound conditions and examine the memory wall from the data access perspective. Eyerman and Eeckhout [4] present a fundamental law for parallel performance: it suggests that parallel performance is limited not only by sequential code, as described by Amdahl's Law, but also by synchronization through critical sections. Cho and Melhem [2] study the interaction between parallelization and energy consumption in a parallelizable application. The authors of [10,15] aim to break through the limitations of the BCE-based architecture and obtain general-purpose acceleration engines based on Intellectual Property (IP) cores. Recently, [16,6] generalized Amdahl's Law for multicore architectures, which can be used to optimize heterogeneous multi-accelerator system-on-chip processors.

Fig. 1. Multicore models based on BCE: (a) symmetric with BCEs; (b) symmetric with microprocessors; (c) heterogeneous multicore.

Table 1
Parameters configured in the FPGA prototype.

No.  Parameters                  Selection and configuration
1    FPGA development board      Xilinx Virtex-5 XC5VLX110T
2    Software version            Xilinx ISE 14.1, Power Analyzer, Synplify and ModelSim
3    Microprocessor version      MicroBlaze 7.20a (frequency 125 MHz, local memory of 8 kB)
4    Number of microprocessors   1 as controller and 1 as computing processor
5    Interconnect                Star network based on Xilinx FSL [11]
6    Hardware accelerators       JPEG (including CC, DCT-2D and Quant accelerators)
7    Evaluation metrics          Speedup, area, power and core efficiency (a)
(a) We follow the term core efficiency as defined in Ref. [6].

2. Case study in FPGA
As a sophisticated experimental platform, the FPGA is becoming increasingly popular for fast system prototyping, especially for hardware/software codesign and parallel execution. However, few studies have reported experimental results on a real FPGA that address hardware/software codesign and parallelization. Therefore, in order to evaluate the tradeoffs among performance, area and energy on heterogeneous on-chip architectures, we have implemented a prototype system on an XUPV5 board equipped with a Xilinx Virtex-5 XC5VLX110T FPGA. The parameters configured in the FPGA prototype are listed in Table 1. The hardware architecture is composed of the following components:

(1) One scheduler kernel is implemented on a MicroBlaze processor. The adaptive scheduling algorithms and mapping methods are implemented in software. The scheduler is also in charge of providing application programming interfaces to diverse programs.
(2) Software computing tasks are also run on MicroBlaze processors, where software functions are packaged in C libraries. Decomposed concurrent tasks can be spawned to either MicroBlaze processors or IP cores (hardware computing engines) once all their input parameters are ready, taking the load-balancing status of the entire system into account.
(3) Hardware computing engines are implemented as function blocks described in RTL-level HDL and packaged as standalone IP cores. To enable dynamic partial reconfiguration, every IP core is packaged in a slave module attached to a Xilinx Fast Simplex Link (FSL).
(4) The scheduler MicroBlaze processor is connected to each computing MicroBlaze processor or IP core by a pair of FIFO-based, one-way, peer-to-peer FSL links. After a task is dispatched, the main program can continue with the application, e.g. by dispatching other computing tasks for parallel execution. When the task is finished, the results are returned together with an interrupt signal for synchronization.
(5) Each MicroBlaze has its private instruction and data cache built in Block RAM. We use the processor local bus (PLB) as the interconnection between the MicroBlaze main scheduler and peripherals such as the interrupt controller, UART controller and timer controller.

2.1. A case study: JPEG encoding
Regarded as one of the most indispensable algorithms, JPEG encoding has great potential for hardware implementation in embedded systems. The main body of the algorithm compresses a color bitmap in units of 8 × 8 blocks. The adapted example code of the JPEG application is shown in Fig. 2(a), in which each for statement is responsible for one 8 × 8 data block. In particular, the JPEG process is composed of the following four stages:

(1) Initialization and Color Conversion (CC). At start-up, an 8 × 8 R–G–B (RGB) block is read from the original bmp file and converted to the Y–Cr–Cb color space.
(2) DCT-2D. After each block is transferred into a vector ⟨Y, Cr, Cb⟩, it undergoes a two-dimensional discrete cosine transform.
(3) Quant. All data in the 8 × 8 block are normalized (Quant) for encoding.
(4) ZZ/Huffman. The ZigZag and Huffman (ZZ/Huffman) algorithm is used to compress the block into the final bit stream.

Fig. 2. Example code and profiling results for the JPEG application: (a) example code snippet for JPEG; (b) profiling results of the four stages.

Fig. 3. Theoretical performance from Hill and Marty's findings. The x-axis refers to different hardware configurations, and the y-axis indicates the estimated metrics, including speedup, area, power and efficiency.

We first profiled the JPEG application to identify the fraction of execution time spent in each phase, as shown by the Ratio term in the legend of Fig. 2(b). Because the ZZ/Huffman phase takes only 2.6% of the total execution time, we have not yet implemented it in hardware. The DCT-2D phase is the major bottleneck of the four phases, as it takes 73.8% of the entire execution time. In contrast, the CC and Quant steps take 20.8% and 2.8% of the total execution time, respectively. When the JPEG application is deployed to the FPGA, one or more stages are executed on hardware IP cores. As a consequence, these stages can run in parallel with the main JPEG application, so the ratio of each stage can be regarded as the parallel task fraction (denoted as f).
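To make the role of the profiled ratios concrete, the following Python sketch computes a plain Amdahl ceiling for each choice of offloaded stages. This is a simplification and not the Hill and Marty model evaluated in this note: it assumes the offloaded stages take zero time, that everything else stays sequential on one MicroBlaze, and that FSL transfer cost is ignored, so it will not reproduce the values in Fig. 3. The names STAGE_FRACTION and amdahl_ceiling are hypothetical.

```python
# Share of total execution time per stage, as reported for Fig. 2(b).
STAGE_FRACTION = {
    "CC": 0.208,
    "DCT-2D": 0.738,
    "Quant": 0.028,
    "ZZ/Huffman": 0.026,
}


def amdahl_ceiling(offloaded):
    """Upper bound on speedup if the offloaded stages cost no time at all.

    Assumption: the remaining stages run sequentially in software and
    communication overhead is ignored.
    """
    f = sum(STAGE_FRACTION[stage] for stage in offloaded)
    return 1.0 / (1.0 - f)


for stages in (["Quant"], ["CC"], ["DCT-2D"], ["CC", "DCT-2D", "Quant"]):
    print(" + ".join(stages), f"{amdahl_ceiling(stages):.2f}")
```

Under these assumptions, offloading DCT-2D alone is bounded at roughly 3.8×, while offloading CC, DCT-2D and Quant together is bounded at roughly 38.5×, because only the 2.6% ZZ/Huffman share remains in software. This illustrates why DCT-2D is the first candidate for a hardware accelerator and why ZZ/Huffman offers little additional gain.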
2.2. Experimental results
To account for not only raw performance but also area cost, power consumption and core efficiency, a multi-target design space exploration (DSE) method is used. The theoretical performance derived from Hill and Marty's findings is illustrated in Fig. 3, in which the x-axis indicates different architecture configurations, while the y-axis indicates the respective speedup, full system power consumption, resource cost, and core efficiency. Architecture names reflect the configurations: for example, 4 MB indicates a homogeneous system with 4 MicroBlaze CPUs, and 1 MB + 1CC + 1 Quant indicates a hybrid system with one MicroBlaze processor, a CC IP core and a Quant IP core.

Observing only the first four architectures in Fig. 3, we can see that homogeneous architectures obtain an increasing speedup (1.96× for 2 MB, 2.89× for 3 MB and 3.78× for 4 MB) as the number of CPUs increases; however, the resource cost (from 4.23 to 9.42 mm²) also increases rapidly. To compute a rough estimate of the area, we adopted the Configured Logic Block (CLB) tile area metric from the model by Kuon and Rose [6]. The model reports that the area of a CLB tile with ten 6-input LUTs in the 65 nm technology node is approximately 8069 µm². We used this estimate of 807 µm² per LUT and multiplied it by the total number of LUTs occupied in our design to generate an area estimate. Meanwhile, core efficiency fluctuates between 1 and 0.94, and power consumption fluctuates between 262.23 and 272.73 mW. The power consumption is evaluated by the Xilinx Power Analyzer when each module is verified.

The last seven architectures are hybrid architectures. The results indicate that they can achieve a significant speedup of up to 30.3× with little power consumption (varying between 264.4 and 271.52 mW) and resource cost (varying between 264.76 and 271.52 mm²). As illustrated in Fig. 3, the peak speedup of 30.3× is achieved with the configuration 1 MB + 1CC + 1DCT-2D + 1 Quant. The lowest speedup of 1.08× occurs with the configuration 1 MB + 1 Quant. Core efficiency also reaches its peak of 7.58× for 1 MB + 1CC + 1DCT-2D + 1 Quant.

Fig. 4. Experimental architecture performance. The x-axis refers to different hardware configurations, and the y-axis indicates the experimental results of the metrics, including speedup, area, power and efficiency.

Fig. 5. Difference between experimental and ideal performance.

The experimental results on the FPGA architectures are presented in Fig. 4. The x-axis indicates the architectures that have been implemented on the FPGA platform, while the y-axis refers to the experimental speedup, power consumption, hardware utilization and core efficiency of these architectures. The difference between Figs. 3 and 4 is that we use the CLB tiles occupied by the entire system prototype instead of the sum over every hardware component. The experimental results show a small placement and routing overhead from wasted CLBs, similar to a "dark silicon" area. From Fig. 4 we can see that the actual architecture reaches a speedup of 26.01× for hybrid systems and a speedup of 8.5× for a 4 MB homogeneous system.

Fig. 5 depicts the difference between the experimental results and the theoretical performance. For homogeneous architectures, the difference in resource cost is less than 5.6% and that in power consumption is less than 6.8%. Both the speedup and core efficiency differences are less than 2.9%. For hybrid architectures, all items differ only slightly except for the configurations 1 MB + 1CC + 1DCT-2D and 1 MB + 1CC + 1DCT-2D + 1 Quant, both of which show a difference of up to 14.2%. Because the Quant and ZZ/Huffman steps account for only a small share of the JPEG 8 × 8 block compression, the total execution time is relatively short for these two configurations, so the bus delay or communication cost has a proportionally larger impact on the system speedup.
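The arithmetic behind the area, core efficiency and difference figures can be sketched in Python as follows. The area function follows the per-LUT estimate stated above. The core efficiency formula (speedup divided by the number of cores) and the relative difference formula are assumptions: they are consistent with the reported values, but the exact definitions are those of Ref. [6] and Fig. 5. All function names are hypothetical, and counting 1 MB + 1CC + 1DCT-2D + 1 Quant as four cores is also an assumption.

```python
CLB_TILE_AREA_UM2 = 8069.0  # area of one CLB tile at 65 nm, as quoted in the text
LUTS_PER_CLB_TILE = 10      # ten 6-input LUTs per tile


def area_mm2(lut_count: int) -> float:
    """Rough area estimate: occupied LUTs times about 807 um^2 per LUT."""
    per_lut_um2 = CLB_TILE_AREA_UM2 / LUTS_PER_CLB_TILE
    return lut_count * per_lut_um2 / 1e6  # 1 mm^2 = 1e6 um^2


def core_efficiency(speedup: float, cores: int) -> float:
    """Assumed definition: speedup obtained per core."""
    return speedup / cores


def relative_difference(theoretical: float, experimental: float) -> float:
    """Assumed definition: gap between model and measurement, relative to the model."""
    return abs(theoretical - experimental) / theoretical


print(f"{core_efficiency(3.78, 4):.3f}")           # 4 MB homogeneous system
print(f"{core_efficiency(30.3, 4):.3f}")           # 1 MB + 1CC + 1DCT-2D + 1 Quant
print(f"{relative_difference(30.3, 26.01):.3f}")   # peak hybrid speedup, model vs. FPGA
```

Under these assumptions the sketch gives 0.945 and 7.575 for the two core efficiencies and 0.142 for the relative difference, which agrees with the reported lower bound of 0.94, the peak of 7.58× and the difference of up to 14.2%.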
3. Conclusion
In this short note, we have presented an experimental study of Hill and Marty's findings with a JPEG application on an FPGA hardware platform. The JPEG application is divided into four parts, each of which can be mapped to either software or hardware. The experimental results demonstrate that Hill and Marty's Law provides useful insights for HW/SW codesign and mapping schemes on homogeneous and heterogeneous multicore architectures.

Acknowledgments
This work was supported by the National Science Foundation of China under grants (No. 61379040, No. 61272131, No. 61202053, No. 61222204, No. 61221062), Jiangsu Provincial Natural Science Foundation (No. SBK201240198), Fundamental Research Funds for the Central Universities (No. WK0110000034), Open Project of State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences (No. CARCH201407), and the Strategic Priority Research Program of CAS (No. XDA06010403). The authors deeply appreciate the reviewers' insightful comments and suggestions.

References
[1] Gene M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proceedings of the April 18–20, 1967, Spring Joint Computer Conference, ACM, Atlantic City, New Jersey, 1967.
[2] Sangyeun Cho, Rami Melhem, Corollaries to Amdahl's law for energy, IEEE Comput. Archit. Lett. 7 (1) (2008) 25–28.
[3] Sangyeun Cho, Rami G. Melhem, On the interplay of parallelization, program performance, and energy consumption, IEEE Trans. Parallel Distrib. Syst. 21 (3) (2010) 342–353.
[4] Stijn Eyerman, Lieven Eeckhout, Modeling critical sections in Amdahl's law and its implications for multicore design, in: 37th Annual International Symposium on Computer Architecture, ACM, 2010.
[5] Mark D. Hill, Michael R. Marty, Amdahl's law in the multicore era, IEEE Comput. 41 (7) (2008) 33–38.
[6] Amir Morad, Tomer Morad, Leonid Yavits, Ran Ginosar, Uri Weiser, Generalized MultiAmdahl: optimization of heterogeneous multi-accelerator SoC, IEEE Comput. Archit. Lett. (2013).
[7] Tomer Y. Morad, Uri C. Weiser, Avinoam Kolodny, Mateo Valero, Eduard Ayguade, Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors, IEEE Comput. Archit. Lett. 5 (1) (2006) 4–17.
[8] JoAnn M. Paul, Brett H. Meyer, Amdahl's law revisited for single chip systems, Int. J. Parallel Program. 35 (2) (2007) 101–123.
[9] Xian-He Sun, Yong Chen, Reevaluating Amdahl's law in the multicore era, J. Parallel Distrib. Comput. 70 (2) (2010) 183–188.
[10] Chao Wang, Xi Li, Junneng Zhang, Gangyong Jia, Peng Chen, Xuehai Zhou, Analyzing parallelization and program performance in heterogeneous MPSoCs, in: IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS, 2012.
[11] Chao Wang, Xi Li, Junneng Zhang, Xuehai Zhou, Aili Wang, A star network approach in heterogeneous multiprocessors system on chip, J. Supercomput. 62 (3) (2012) 1404–1424.
[12] Dong Hyuk Woo, Hsien-Hsin S. Lee, Extending Amdahl's law for energy-efficient computing in the many-core era, Computer 41 (12) (2008) 24–31.
[13] Erlin Yao, Yungang Bao, Mingyu Chen, What Hill–Marty model learn from and break through Amdahl's law? Inform. Process. Lett. 111 (23–24) (2011) 1092–1095.
[14] Erlin Yao, Yungang Bao, Guangming Tan, Mingyu Chen, Extending Amdahl's law in the multicore era, ACM SIGMETRICS Perform. Eval. Rev. 37 (2) (2009) 24–26.
[15] Junneng Zhang, Chao Wang, Xi Li, Xuehai Zhou, Aili Wang, Gangyong Jia, Nadia Nedjah, Amdahl's and Hill-Marty laws revisited for FPGA-based MPSoCs: from theory to practice, Int. J. High Perform. Syst. Archit. 5 (2) (2014) 115–126.
[16] Tsahee Zidenberg, Isaac Keslassy, Uri Weiser, MultiAmdahl: how should I divide my heterogeneous chip? IEEE Comput. Archit. Lett. 11 (2) (2012) 65–68.
Chao Wang is an associate professor at the University of Science and Technology of China. His main research interests include multiprocessor systems on chip and reconfigurable systems. He received his B.Sc. and Ph.D. in Computer Science from the University of Science and Technology of China, China.
Xi Li is a Professor and vice dean in the School of Software Engineering, University of Science and Technology of China. There he directs the research programs in the Embedded System Lab, examining various aspects of embedded systems with a focus on performance, availability, flexibility and energy efficiency. He has led several national key projects of China, several national 863 projects and NSFC projects. He is a member of ACM and IEEE, and a senior member of CCF (China Computer Federation).
Peng Chen is a Ph.D. student at the University of Science and Technology of China. His main research interests include multi-objective system design and reconfigurable systems. He received his B.Sc. in Computer Science from the University of Science and Technology of China, China.
Xuehai Zhou is a Professor in the School of Computer Science at the University of Science and Technology of China, China. His main research interests include embedded system design, reliable system design and reconfigurable computing. He received his B.Sc., M.Sc. and Ph.D. in Computer Science from the University of Science and Technology of China, China.