J. Parallel Distrib. Comput. 73 (2013) 1146–1156
A shared matrix unit for a chip multi-core processor

Mostafa I. Soliman a,∗, Abdulmajid F. Al-Junaid b

a Computers & Systems Section, Electrical Engineering Department, Faculty of Engineering, Aswan University, Aswan 81542, Egypt
b Electrical Engineering Department, Faculty of Engineering and Architecture, Ibb University, Ibb, Yemen
Highlights

• We propose extending multi-core processors with a common matrix unit.
• A cycle-accurate model is implemented in SystemC to simulate the proposed idea.
• Linear algebra kernels, DCT, SAD, and affine transformation are used to evaluate the performance.
• 9%–26% improvements in the utilization of the shared matrix unit with dual-core.
• Average speedup ranges from 6% to 24%; maximum speedup ranges from 13% to 46%.
Article info

Article history:
Received 6 July 2012
Received in revised form 4 March 2013
Accepted 16 March 2013
Available online 21 March 2013

Keywords:
SystemC implementation
Multi-core processors
Parallel processing
Vector/matrix processing
Abstract

This paper proposes extending a multi-core processor with a common matrix unit to maximize on-chip resource utilization and to leverage the advantages of the current multi-core revolution to improve the performance of data-parallel applications. Each core fetches scalar/vector/matrix instructions from its instruction cache. Scalar instructions continue execution on the scalar datapath; however, vector/matrix instructions are issued by the decode stage to the shared matrix unit through the corresponding FIFO queue. Moreover, scalar results from reduction vector/matrix instructions are sent back from the matrix unit to the scalar core that issued these instructions. Some dense linear algebra kernels (scalar–vector multiplication, scalar times vector plus another vector, apply Givens rotation, rank-1 update, vector–matrix multiplication, and matrix–matrix multiplication) as well as discrete cosine transform, sum of absolute differences, and affine transformation are used in the performance evaluation. Our results show that the improvement in the utilization of the shared matrix unit with a dual-core ranges from 9% to 26% compared to extending a matrix unit to a single core. Moreover, the average speedup of the dual-core shared matrix unit over a single core extended with a matrix unit ranges from 6% to 24%, and the maximum speedup ranges from 13% to 46%.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

For more than four decades, technology has continued producing chips in step with Moore's Law [19], which states that the number of transistors per die doubles about every two years. Computer architects exploited this wealth of transistors to improve single-thread performance in two ways: clock frequencies and architectural techniques [13]. Clock frequencies of processors were improved by using faster transistors and deeper pipelines. In addition, architectural techniques (multiple instruction issue, out-of-order execution, speculation, aggressive branch prediction, etc.) have been used to further improve performance by exploiting instruction-level parallelism (ILP). However, these traditional sources of performance improvements have all been flattening
∗ Corresponding author.
E-mail addresses: [email protected], [email protected] (M.I. Soliman).
since 2003 due to power limitations (see [3] for more details). Recently, computer architects have found it difficult to convert this large budget of transistors into single-thread performance. Thus, the industry has begun to de-emphasize single-thread performance and focus on integrating multiple cores onto a single die [12,20,2,24,7,6,9]. Fortunately, the number of cores approximately doubles every two years [24]. Programmers need only parallelize their applications to make whole-application performance again track Moore's Law. One of the advantages of using multi-core technology is the exploitation of thread-level parallelism (TLP). TLP is the best way to address the issue of power while maintaining performance [13]. Another advantage of multi-core is that each generation of processors requires only a fairly modest engineering effort [20]. Moreover, the cores may be heterogeneous, which could address the variety of applications executed by the computer [17]. This paper proposes a shared matrix unit for multi-core processors to maximize on-chip resource utilization and to
improve the performance of applications based on data-level parallelism (DLP). Thus, this approach tries to exploit the three common forms of parallelism: ILP, TLP, and DLP. By contrast, extending a single core with a matrix unit can exploit only ILP and DLP. ILP can be exploited via pipelining, superscalar issue, and out-of-order execution techniques. Concerning DLP, the matrix unit architecture relies on a multi-level ISA to express and exploit the DLP found in data-parallel applications (see [26] for more details about multi-level ISAs). Vector/matrix instructions are executed on the extended matrix unit, while scalar instructions are executed on the scalar cores. In addition to ILP and DLP, TLP can be exploited in two ways: (1) the execution of multiple threads on scalar cores using traditional multi-threading techniques, and (2) the efficient utilization of the shared matrix unit using multiple threads of vector/matrix instructions issued by the scalar cores. Obviously, the performance is expected to be higher on computationally intensive kernels than on memory-intensive kernels. Thus, when a memory-intensive kernel and a computationally intensive kernel are executed as two threads on two scalar cores sharing a matrix unit, the performance improves because the memory operations and the data computation overlap, as will be shown in this paper. In particular, this paper offers the following contributions:
• proposing a shared matrix unit for multi-core processors to maximize on-chip resource utilization and to improve the performance of DLP applications,
• implementing a dual-core scalar processor extended with a single matrix unit in SystemC,
• extending the scoreboard algorithm to control the execution of vector/matrix instructions fetched from multiple scalar cores on the extended matrix unit, and
• evaluating the performance of the extended matrix unit on vector/matrix kernels.

This paper is organized as follows. Section 2 discusses some related work. Section 3 presents the architecture of a single scalar core extended with a matrix unit. The adaptation of the extended matrix unit to work with two scalar cores is explained in detail in Section 4. Section 5 evaluates the performance of the same kernel/different kernels when issued by two scalar cores to the shared matrix unit. Finally, Section 6 concludes this paper and gives directions for future work.

2. Related work

2.1. Floating-point unit sharing for multi-cores

Sun introduced T1 in 2006, a multi-core processor focused on exploiting TLP rather than ILP [14,18]. The T1 processor contains eight cores, each supporting four threads. A single set of floating-point functional units is shared by all eight cores, as floating-point performance was not a focus for T1. Each core consists of a simple six-stage, single-issue pipeline (a standard five-stage RISC pipeline with one stage added for thread switching). T1 uses fine-grained multithreading, switching to a new thread on each clock cycle; threads that are idle because of a pipeline delay or cache miss are bypassed in the scheduling.

2.2. Vector coprocessor sharing for multi-cores

Beldianu and Ziavras [4] presented a robust design framework for vector coprocessor sharing in multi-core environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For their adaptive vector unit attached to multiple cores, they proposed three basic sharing policies that enforce coarse-grain, fine-grain, and vector-lane sharing. They benchmarked these vector coprocessor sharing policies for a dual-core system and evaluated them using floating-point performance, resource utilization, and power/energy consumption metrics. Benchmarking of FIR filtering, FFT, matrix multiplication, and LU factorization showed that these coprocessor sharing policies yield high utilization and performance with low energy costs. The proposed policies provided a 1.2–2 times speedup and reduced energy needs by about 50% compared to a dedicated vector processor assigned to each core.

2.3. Cryptography accelerator sharing for multi-cores

Soliman and Abozaid [28] proposed a single crypto unit shared among multiple cores to accelerate the execution of cryptography applications. The shared accelerator is based on the Advanced Encryption Standard (AES) algorithm, where parallel AES pipelines are used for high-throughput encryption/decryption of data. For simplicity, the host processor contains four cores; each core consists of a simple five-stage, single-issue pipeline. Each core fetches an instruction from its instruction cache and sends it in order to the decode stage. Crypto instructions are pushed into a crypto instruction queue (CIQ) during the decode stage, while scalar instructions complete the remaining cycles of execution on the scalar pipeline stages. There is a CIQ in the shared crypto unit for each core, and crypto instructions are read from the CIQs in round-robin fashion for execution on the parallel AES pipelines. On a Xilinx Virtex-5 FPGA, they showed a maximum throughput of 45 Gbps at 400 MHz.

2.4. Efficient GPU sharing on multi-core processors

Wang et al. [37] provided an efficient mechanism for sharing a GPU device among multi-core CPUs in order to achieve higher application performance and scalability on large-scale systems. They closely studied efficient coordination mechanisms for handling parallel requests from multiple hosts of control to a GPU under hybrid programming. Using a set of microbenchmarks and applications on a GPU cluster, they showed that thread- and process-based context hosting have different tradeoffs. Experimental results on application benchmarks suggested that thread-based context funneling and process-based context switching natively perform similarly on the latest Fermi GPU, while manually guided context funneling is the best way to achieve optimal performance.

2.5. Executing multiple threads on shared and dedicated resources

Recently, AMD introduced Bulldozer [8], which represents a new direction in microarchitecture and includes a number of firsts for AMD: (1) AMD's first multithreaded x86 processor, (2) AMD's first implementation of a shared Level 2 cache, and (3) the first AMD x86 processor to incorporate floating-point multiply–accumulate (FMAC). Bulldozer shares hardware where it is affordable and profitable (such as the front-end and the floating-point unit (FPU)) but replicates hardware as necessary for timing and complexity reasons (such as the integer execution core). Bulldozer combines two independent cores intended to deliver high per-thread throughput with improved area and power efficiency. Because the floating-point execution units are so large, Bulldozer shares them between the two threads via SMT. Thus, two threads can be executed via a combination of shared and dedicated resources. AMD designed the Bulldozer FPU to deliver industry-leading performance on HPC, multimedia, and gaming applications using a four-wide, two-way multithreaded, fully out-of-order FPU combined with two 128-bit FMAC units supported by a 128-bit high-bandwidth load/store subsystem. Thus, the FPU is a coprocessor
model shared between two integer cores via two-way multithreading. The major execution units are two 128-bit FMAC units, two 128-bit packed integer (MMX ALU) units, one 128-bit IMAC unit, and one 128-bit permute/shift (XBAR) unit.

3. Extending a scalar core with matrix unit

The logical direct extension of vector processing is matrix processing, where a scalar processor (core) is extended with a matrix unit (core) to improve the performance of applications based on DLP [27,29,30]. The terms matrix unit and matrix core are used interchangeably from now on. A multi-level ISA can be used to explicitly communicate data parallelism to a scalar core extended with a matrix unit in a compact way, instead of extracting it dynamically using complex hardware or statically using sophisticated compiler techniques [32]. High-level instructions, such as vector–scalar, vector–vector, matrix–scalar, matrix–vector, and matrix–matrix instructions, can convey up to 3-D data parallelism to the matrix unit, which reduces the complexity of both the hardware and the compiler.

Fig. 1 shows a general-purpose scalar core (for executing scalar instructions) extended with a matrix unit (for executing vector/matrix instructions).

Fig. 1. Extending a scalar core with matrix unit.

To tolerate memory latency, the extended matrix unit is decoupled into two components: address generation and data computation. The data computation unit is organized in n parallel lanes; each lane contains one pipeline per functional unit and a slice of the matrix register file (a set of register banks). The n register banks (one per lane) form a matrix register, which can store vector/matrix data. Since executing some vector/matrix instructions needs interconnections between lanes, local, global, bus, etc. interconnects can be used to connect the parallel lanes. All of these interconnection types except the local one are not scalable, because longer
wires are needed to connect more lanes. However, for a small number of parallel lanes, full crossbars are a more efficient technique than the other interconnections. The operations that can be performed on the crossbars of the extended matrix unit include Pass, Rotate, and Broadcast (see Fig. 2).

Fig. 2. Operations performed on crossbar.

Decoupled architectures are based on the observation that the execution of a program can be split into two different tasks: moving data to/from the processor and executing the arithmetic instructions that perform the program computations [25,10,23]. The main advantage of decoupled architectures is the toleration of memory latency. In decoupled architectures, arithmetic instructions waiting for memory operands do not block the issue stage. They are sent to an instruction queue, freeing the issue stage to run ahead and find memory instructions later in the instruction stream. In other words, latency is tolerated because the address unit is able to slip ahead of the computation unit and load, early in time, data that the computation unit will need soon. The excess data produced by the address unit is stored in a FIFO queue and stays there until it is retrieved by the computation unit. The extended matrix unit is based on decoupled architectures to hide memory latency.
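To make the access/execute decoupling concrete, here is a minimal SystemC sketch in the spirit of this design; it is our illustration, not the authors' model, and the module names, the 8 ns address-generation delay, and the 16-entry queue depth are assumptions chosen for exposition:

```cpp
// Minimal sketch: an address unit slips ahead, filling a FIFO that a
// compute unit drains, so memory latency overlaps with computation.
#include <systemc.h>

SC_MODULE(AddressUnit) {
    sc_fifo_out<int> load_data;              // models the LDQ being filled
    void run() {
        for (int addr = 0; ; addr += 4) {
            wait(8, SC_NS);                  // assumed address-generation delay
            load_data.write(addr);           // stand-in for a loaded element
        }
    }
    SC_CTOR(AddressUnit) { SC_THREAD(run); }
};

SC_MODULE(ComputeUnit) {
    sc_fifo_in<int> load_data;
    void run() {
        while (true) {
            int v = load_data.read();        // blocks only when the queue is empty
            (void)v;                         // arithmetic on v would go here
        }
    }
    SC_CTOR(ComputeUnit) { SC_THREAD(run); }
};

int sc_main(int, char**) {
    sc_fifo<int> ldq(16);                    // 16-entry FIFO between the units
    AddressUnit au("au");
    ComputeUnit cu("cu");
    au.load_data(ldq);
    cu.load_data(ldq);
    sc_start(200, SC_NS);
    return 0;
}
```

The key point the sketch captures is that the compute side stalls only when the queue runs empty, not on every memory access.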
Thus, it is split into two components: address generation and data computation, which communicate through data queues, as Fig. 3 shows.

Fig. 3. Decoupled architecture of the extended matrix unit.

The address unit performs all address computations and address checking, and loads/stores data from/to memory to/from the queues. The computation unit moves data between the queues and the registers and executes all arithmetic instructions on data loaded into registers. As shown in Fig. 3, high-level vector/matrix instructions are fetched, decoded, and then dispatched in order by the scalar core to the pre-address queue, called instruction and scalar operands queue 1 (ISQ1). The instruction flow controller in the matrix part takes memory/arithmetic vector/matrix instructions in order from the head of ISQ1. A load or store instruction is split into two components: address generation and a pseudo-move. The first component generates a stream of addresses stored in the LAQ (Load Address Queue) or SAQ (Store Address Queue). These addresses are used to fill the LDQ (Load Data Queue) from the L2 cache or to empty the SDQ (Store Data Queue) into the L2 cache. The second component (the pseudo-move) is inserted into instruction and scalar operands queue 2 (ISQ2). Once the pseudo-move is at the head of ISQ2 and its operands are ready, the control unit moves operands between the LDQ/SDQ and the matrix registers. Other instructions, such as arithmetic instructions, operate on data already in the matrix registers.

Thus, a scalar core/matrix unit is a load/store architecture, where memory can be accessed only with load/store instructions (data must be loaded into registers before processing). Scalar data are loaded from a scalar data cache into scalar registers (integer or floating-point), processed (in order or out of order) on the scalar execution datapath, and then stored from scalar registers back to the scalar data cache. Vector/matrix data are loaded directly from the L2 cache into matrix registers through the LDQ, processed in parallel on P execution datapaths, and then stored back from matrix registers to the L2 cache through the SDQ.

Like vector processors, the extended matrix unit has three varieties of vector/matrix load/store [16]. The first is unit-stride access, where the matrix unit accesses a block of contiguous elements. The second is strided access, which transfers memory elements separated by a constant stride (constant inter-element displacement). The last is indexed (scatter/gather) access, where an individual index is given for each element to be accessed.
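The three access modes can be summarized by the address streams they generate. The following sketch is illustrative only; the helper names are ours, and it assumes 32-bit (4-byte) elements in a flat byte-addressed memory:

```cpp
// Address streams for the three vector/matrix memory access modes.
#include <cstdint>
#include <vector>

// Unit-stride: a block of contiguous elements.
std::vector<uint64_t> unit_stride(uint64_t base, size_t n) {
    std::vector<uint64_t> a(n);
    for (size_t i = 0; i < n; ++i) a[i] = base + i * 4;       // contiguous 32-bit elements
    return a;
}

// Strided: constant inter-element displacement (in bytes here).
std::vector<uint64_t> strided(uint64_t base, size_t n, uint64_t stride) {
    std::vector<uint64_t> a(n);
    for (size_t i = 0; i < n; ++i) a[i] = base + i * stride;  // constant stride
    return a;
}

// Indexed (scatter/gather): an individual index per element.
std::vector<uint64_t> indexed(uint64_t base, const std::vector<uint64_t>& idx) {
    std::vector<uint64_t> a(idx.size());
    for (size_t i = 0; i < idx.size(); ++i) a[i] = base + idx[i] * 4;
    return a;
}
```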
4. Extending a multi-core with matrix unit

For simplicity, the first implementation of an on-chip matrix unit shared by multiple cores is a dual-core scalar processor extended with a single matrix unit (see Fig. 4). The scalar cores are connected to the extended matrix unit through FIFO queues. Each scalar core can be a simple in-order core or a complex out-of-order superscalar core. In addition, each core has a hazard detection unit, a forwarding unit, and a control unit (see Fig. 5). In our implementation, the well-known five-stage MIPS pipeline is selected for the scalar part due to its effectiveness in terms of simplicity and processing power; the design of the five-stage MIPS pipeline datapath is described in detail in [22]. Each scalar core fetches an instruction from its instruction cache and decodes it. If the fetched instruction is a scalar instruction, it continues execution on the scalar execution datapath. Otherwise, the fetched vector/matrix instruction is pushed by the scalar core to the extended matrix unit through a FIFO queue. Moreover, a scalar result produced by a reduction vector/matrix instruction in the matrix unit is sent back through another FIFO queue to the scalar core that sent the instruction. Thus, each scalar core is connected to the matrix unit through two queues: (1) an instruction and scalar operands queue and (2) a scalar result queue. These queues are inside the instruction flow unit. Based on our previous performance-area trade-offs (see [31] for more detail), the shared matrix unit is organized in four lanes and the size of the matrix registers is 8 × 4. The adaptation of the matrix unit to work with multiple scalar cores includes modifying both the instruction flow unit and the matrix control unit, and duplicating the matrix registers unit (MRUnit), which contains the matrix register file and the crossbars that connect matrix registers to functional units (see MRUnit1 and MRUnit2 in Fig. 4). Two groups of multiplexers and demultiplexers connect MRUnit1 and MRUnit2 to the functional units and to the load/store unit. Note that the matrix control unit controls the operation of MRUnit1 and MRUnit2 through two separate groups of control signals. Table 1 summarizes the microarchitecture parameters and their settings in our proposed multi-core sharing a single matrix unit.

4.1. Modifying the instruction flow unit

As discussed in the previous section, extending a single core with a matrix unit requires three FIFO queues (ISQ1, ISQ2, and SRQ) and an instruction flow controller in the instruction flow unit. For n scalar cores sharing a matrix unit, 2n + 1 queues are needed in the instruction flow unit, and the operation of the instruction flow controller must be modified, as sketched below.
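The following data-layout sketch (our own, purely for exposition) shows where the 2n + 1 count comes from: one instruction queue and one scalar result queue per core, plus the single post-arbitration queue feeding the matrix control unit:

```cpp
// Queue set of the instruction flow unit for n scalar cores.
#include <cstdint>
#include <queue>
#include <vector>

struct VmInst { uint32_t word; };  // a packed vector/matrix instruction

struct InstructionFlowUnit {
    std::vector<std::queue<VmInst>>   isq;   // per-core instruction + scalar-operand queues
    std::vector<std::queue<uint32_t>> srq;   // per-core scalar result queues (reductions)
    std::queue<VmInst>                isq3;  // shared post-arbitration queue to the control unit
    explicit InstructionFlowUnit(int n) : isq(n), srq(n) {}  // n + n + 1 = 2n + 1 queues
};
```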
Fig. 4. The architecture of dual-core shared matrix unit.

Fig. 5. Scalar core.

Table 1
Microarchitecture parameters and their settings in our proposed multi-core sharing a single matrix unit.

Unit                    Microarchitecture parameter                          Setting
MRUnit1 or MRUnit2      Register file                                        4 lanes containing 8 matrix registers; each matrix register is 8 × 4 elements; each element is 32-bit
                        Crossbars                                            4 inputs and 4 outputs, each 32-bit
Load/Store Unit         Load Address Queue (LAQ)                             16 elements, each 37-bit
                        Store Address Queue (SAQ)                            16 elements, each 37-bit
                        Load Data Queue (LDQ)                                16 elements, each 128-bit
                        Store Data Queue (SDQ)                               16 elements, each 128-bit
Instruction Flow Unit   Instruction and Scalar operands Queues
                        (ISQ1, ISQ2, and ISQ3)                               Each queue has 10 elements, each element is 96-bit
                        Scalar Result Queues (SRQ1 and SRQ2)                 Each queue has 10 elements, each element is 32-bit
                        Instruction flow controller                          Based on the round-robin method to route the vector/matrix instructions from the two scalar cores to the shared matrix unit
Functional Unit         ALU                                                  4-stage pipeline
                        FP adder                                             4-stage pipeline
                        FP multiplier                                        6-stage pipeline
                        FP MAC (multiply–accumulate)                         6-stage pipeline
                        FP divider                                           6-stage pipeline
Matrix Control Unit                                                          Based on the extended scoreboard technique to execute vector/matrix instructions out-of-order
Scalar Core1 or Core2                                                        The instruction execution path is a 5-stage MIPS pipeline
As shown in Fig. 6, the instruction flow unit for the dual-core contains five queues (ISQ1, ISQ2, ISQ3, SRQ1, and SRQ2) besides the instruction flow controller. Vector/matrix instructions and scalar data sent by scalar core1 and scalar core2 are stored in ISQ1 and ISQ2, respectively. The instruction flow controller reads instructions from the heads of ISQ1 and ISQ2 in round-robin fashion. Then, it adjusts the matrix register numbers inside the instruction (destination, source1, and source2) depending on which scalar core sent the vector/matrix instruction. In more detail, the extended matrix unit has 16 matrix registers: the eight matrix registers in MRUnit1 (M0–M7) are used for executing vector/matrix instructions coming from core1, and the remaining eight matrix registers in MRUnit2 (M8–M15) are used for core2. Since the vector/matrix instructions of both core1 and core2 refer to eight
matrix registers (M0–M7), the instruction flow controller must adjust these register numbers inside the vector/matrix instruction. Vector/matrix instructions coming from scalar core2 are changed from (M0–M7) to (M8–M15), while instructions coming from scalar core1 are left unchanged. This adjustment can be done easily by setting the fourth bit of the matrix register number to one when core2 uses the shared matrix unit.
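In code, this renumbering is a single bit operation; the function below is our sketch of the rule just described (the function name is ours):

```cpp
#include <cstdint>

// Instructions from core2 have bit 3 of each 4-bit matrix register number
// set, mapping M0–M7 onto M8–M15; core1 numbers pass through unchanged.
uint8_t adjust_matrix_reg(uint8_t reg, int core /* 1 or 2 */) {
    return (core == 2) ? static_cast<uint8_t>(reg | 0x8) : reg;
}
```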
4.2. Duplicating the matrix register file

The matrix unit is designed around parallel lanes to reduce the execution time of processing vector/matrix data (see [16] for more details about the benefits of a modular, lane-based implementation). In our implementation of the matrix unit, four lanes are used, where the size of a matrix register is 8 × 4 elements. Each lane consists of a slice of the matrix register file (eight elements per matrix register) and one pipeline per functional unit. Matrix registers and functional units are connected through crossbars. The slices of the matrix register file in the four lanes, together with the crossbars, are collectively called a matrix registers unit, as shown in Fig. 4 (MRUnit1 and MRUnit2). The straightforward way to adapt the matrix unit to work with two scalar cores is to duplicate the matrix registers unit and dedicate each unit to one scalar core. This results in minor changes to the other components of the matrix unit. Vector/matrix instructions coming from scalar core1 use only the matrix registers in MRUnit1 (M0–M7), whereas instructions coming from scalar core2 use only the matrix registers in MRUnit2 (M8–M15). A group of multiplexers/demultiplexers selects which matrix registers unit is used as a source/destination for the functional units. Furthermore, another group of multiplexers/demultiplexers selects which matrix registers unit is used as a source/destination for the load/store unit.

4.3. Modifying the matrix control unit

The matrix control unit is based on the scoreboard technique [34,35,33]. Processing vector/matrix instructions fetched from multiple cores on the extended matrix unit requires some changes in the scoreboard algorithm: (1) increasing the number of matrix registers, (2) enlarging the result status table, and (3) enlarging the register read port status table. The number of columns in these tables is increased from 8 to 16, since the matrix registers are duplicated due to the use of two cores. As shown in Fig. 4, the matrix control unit controls the two matrix registers units MRUnit1 and MRUnit2 through two separate groups of signals. Note that it is impossible to use one group of signals with a demultiplexer to control the two matrix registers units, because two vector/matrix instructions from the two scalar cores are independent with respect to their source and destination matrix registers. Thus, these two vector/matrix instructions can be executed simultaneously when the required functional units are free. This means different control signals must be applied by the matrix control unit to MRUnit1 and MRUnit2 simultaneously. The matrix control unit can identify the core that sent a vector/matrix instruction (scalar core1 or scalar core2) from its operands (matrix register numbers). The vector/matrix instruction is then executed under control of the scoreboard algorithm. At each clock cycle, the matrix control unit sends the appropriate control signals to MRUnit1, MRUnit2, the functional units, the load/store unit, and the groups of multiplexers/demultiplexers between these units.
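A rough sketch of the enlarged bookkeeping follows; this is our reconstruction for illustration, not the authors' RTL, and the field names are assumptions:

```cpp
// Scoreboard tables grow from 8 to 16 matrix-register columns; the core
// that owns an instruction is recoverable from bit 3 of its registers.
#include <array>
#include <optional>

constexpr int kMatrixRegs = 16;   // M0–M7 (core1) + M8–M15 (core2)

struct Scoreboard {
    // Which functional unit (if any) will write each matrix register.
    std::array<std::optional<int>, kMatrixRegs> result_status{};
    // Which read port (if any) each matrix register is currently driving.
    std::array<std::optional<int>, kMatrixRegs> read_port_status{};

    bool waw_hazard(int dst) const { return result_status[dst].has_value(); }
    bool raw_hazard(int src) const { return result_status[src].has_value(); }
    int  owner_core(int reg) const { return reg < 8 ? 1 : 2; }  // bit 3 identifies the core
};
```

Because instructions from the two cores never overlap in register columns, hazard checks for one core can never block the other, which is what allows simultaneous execution when functional units are free.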
5. Performance evaluation of a matrix unit shared by multi-cores using SystemC

SystemC is a system design and modeling language. It evolved to meet a system designer's requirements for designing and integrating today's complex electronic systems very quickly while assuring that the final system will meet performance expectations [5,11,21]. The essence of SystemC lies in the availability of hardware primitives together with a simulation kernel. With such features, SystemC is able to support multiple abstraction levels and refinement capabilities, ranging from high-level functional models to low-level timed, cycle-accurate, and RTL models [11]. The IEEE Standard SystemC language has been proposed as an ANSI standard C++ class library for system and hardware design, for use by designers and architects who need to address complex systems [15,36]. SystemC was developed by the Open SystemC Initiative (OSCI) [21] on top of C++, a mature and widely used software development language. With the modeling of hardware behavior as a library in C++, both software and hardware can be modeled in a single language, making it easy to simulate and test a system in the early stages of the design cycle. Since SystemC is C++, it is not a new language, which makes existing software IP ready to be linked into a SystemC project. One can develop cycle-accurate models of hardware, software, and interfaces for simulation and debugging within an existing C++ development environment. Moreover, SystemC provides a unified environment for developing hardware as well as software, which eliminates translation errors and allows fast and easy verification. Thus, SystemC was selected for implementing and simulating our proposed processor as a cycle-accurate model.

Two cycle-accurate models are implemented using SystemC: (1) a matrix unit shared by two scalar cores (dual-core model) and (2) a matrix unit extending a single scalar core (single-core model) for comparison. Some dense linear algebra kernels (scalar–vector multiplication (SVmul), scalar s times vector x plus vector y (SAXPY), apply Givens rotation (Givens), rank-1 update (Rank-1), vector–matrix multiplication (VMmul), and matrix–matrix multiplication (MMmul)) as well as discrete cosine transform (DCT), sum of absolute differences (SAD), and affine transformation (Affine) are used to evaluate the performance of the two processor models. The objective is to answer the question: does the dual-core model utilize the shared matrix unit more efficiently than the single-core model? Two approaches are used in our performance evaluation. In the first, the scalar cores in the dual-core model execute the same kernel simultaneously; that is, the vector/matrix instructions of the same kernel are issued by both scalar cores to the extended matrix unit. The performance is compared with the single-core model, which executes the same kernel serially twice. In the second, the scalar cores execute two different kernels simultaneously in the dual-core model; that is, the vector/matrix instructions of different kernels are issued by the scalar cores to the extended matrix unit. The performance is compared with the single-core model, which executes the two different kernels serially. The performance is measured in FLOPs/cc (floating-point operations per clock cycle).
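For readers unfamiliar with the cycle-accurate SystemC style mentioned above, the following tiny pattern (ours, purely illustrative) shows the basic building block such models are made of: a clocked process that advances one register stage per clock edge:

```cpp
#include <systemc.h>

// One pipeline register stage: q takes the value of d at each rising edge.
SC_MODULE(PipeStage) {
    sc_in<bool>         clk;
    sc_in<sc_uint<32>>  d;
    sc_out<sc_uint<32>> q;
    void tick() { q.write(d.read()); }
    SC_CTOR(PipeStage) { SC_METHOD(tick); sensitive << clk.pos(); }
};
```

Chaining such stages, with the queues and control logic described in Sections 3 and 4, yields a model whose simulated clock cycles can be counted directly, which is how a metric like FLOPs/cc is obtained.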
5.1. Executing the same kernel on the two models

The main objective of extending a multi-core processor with a single matrix unit is the efficient utilization of the extended hardware. Since the kernels used in the performance evaluation are vector/matrix kernels, their execution time on the extended matrix unit dominates the overall time. Thus, according to Amdahl's Law [1], the overall performance of data-parallel applications on a dual-core with a matrix unit is limited by the performance of the matrix unit.
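In its standard form (our notation): with f the fraction of execution time that runs on the matrix unit and S_m the speedup of that fraction,

$$S_{\mathrm{overall}} = \frac{1}{(1 - f) + f / S_{m}}$$

so as f approaches 1, as it does for these vector/matrix kernels, the overall speedup is bounded by the matrix unit's own performance.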
The expectation of higher utilization of the extended matrix unit comes from overlapping the execution of memory-intensive kernels with computationally intensive kernels. This expectation does not hold when the same kernel is executed on the two scalar cores, since the executed kernels have the same characteristics (both are computationally intensive or both are memory intensive). Two copies of the same kernel are executed concurrently on the dual-core model and serially on the single-core model.

Fig. 6. The modified instruction flow unit.

Table 2
Performance evaluation of dual-core and single-core models that execute the same kernel.

Kernel   Semantic                                 FLOPs   Memory references        Single-core      Dual-core (RR)   Dual-core (MRR)
SVmul    xi = a ∗ xi, 1 ≤ i ≤ n                   n       2n                       1.45 FLOPs/cc    1.57 FLOPs/cc    1.60 FLOPs/cc
SAXPY    yi += a ∗ xi, 1 ≤ i ≤ n                  2n      3n                       2.12 FLOPs/cc    2.13 FLOPs/cc    2.13 FLOPs/cc
Givens   [x; y] = [c s; −s c] [x; y]^t            6n      4n                       4.20 FLOPs/cc    3.47 FLOPs/cc    4.21 FLOPs/cc
Rank-1   A(i,j) += xi yj                          2n²     From 2n² + 2n to 3n² + n 2.55 FLOPs/cc    2.50 FLOPs/cc    2.66 FLOPs/cc
VMmul    yj += Σ(i=1..n) xi A(i,j)                2n²     From n² + 3n to 2n² + 2n 4.20 FLOPs/cc    3.63 FLOPs/cc    4.10 FLOPs/cc
MMmul    C(i,j) += Σ(k=1..n) A(i,k) B(k,j)        2n³     O(n²)                    6.43 FLOPs/cc    6.86 FLOPs/cc    6.43 FLOPs/cc

As shown in Table 2, executing the same kernel on the dual-core and single-core models leads to a small degradation/improvement in the overall performance, for the following reasons. Each kernel follows the same sequence for executing its scalar/vector/matrix instructions: loading data into registers (load instructions), processing the loaded data (arithmetic instructions), and finally storing the results (store instructions). This sequence is repeated a number of times according to the problem size. In the dual-core model, reading the vector/matrix instructions in ISQ1 and ISQ2 by the round-robin (RR) method groups them into a sequence of memory instructions and a sequence of arithmetic instructions. The memory-instruction sequence delays the arrival of the arithmetic instructions at the matrix control unit, because memory instructions wait for a done signal from the load/store unit; this signal indicates that address generation has finished. (Note that each memory instruction takes 8 clock cycles for address generation.)
The length of this sequence differs from one kernel to another; therefore, some kernels are affected more than others. In other words, the RR method arranges vector/matrix instructions in a manner that decreases/increases the overlap between address generation and data computation. The RR method in the instruction flow controller is therefore modified to pass arithmetic instructions to ISQ3 as soon as possible, which increases the overlap between address generation and data computation. This is done by continuing to read vector/matrix instructions from the same queue until a memory instruction is found and then switching to the other queue. The modified RR method is called MRR. With MRR, the performance of a matrix unit shared by a multi-core is approximately similar to the performance of a matrix unit extending a single core (see Table 2).
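The following is our simplified reconstruction of one MRR arbitration step, not the simulator's actual code; in the real unit a memory instruction also triggers address generation in the load/store unit, whereas here everything is simply forwarded to ISQ3:

```cpp
// MRR: keep draining the current core's input queue until a memory
// instruction is found, then switch queues.
#include <cstdint>
#include <queue>

enum class Op { Memory, Arithmetic };
struct VmInst { Op op; uint32_t word; };

void mrr_step(std::queue<VmInst> isq[2], std::queue<VmInst>& isq3, int& cur) {
    while (!isq[cur].empty()) {
        VmInst in = isq[cur].front();
        isq[cur].pop();
        isq3.push(in);                  // forward toward the matrix control unit
        if (in.op == Op::Memory) {      // found a memory instruction:
            cur ^= 1;                   // switch to the other core's queue
            return;
        }
    }
    cur ^= 1;                           // current queue empty: give the turn away
}
```

Compared with plain RR, which alternates queues on every instruction, this keeps runs of arithmetic instructions from one core together, so they reach ISQ3 before being stalled behind the other core's memory instructions.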
Moreover, Table 3 shows the performance of DCT and affine transformation on the dual-core and single-core models. Here the plain round-robin (RR) method performs better than the modified version (MRR), because both kernels are computationally intensive: passing arithmetic instructions to ISQ3 as early as possible decreases the overlap between memory instructions and computation instructions. The speedups of the dual-core model over the single-core model are 26% and 9% on affine transformation and DCT, respectively.

Table 3
DCT and affine transformation on dual-core and single-core models that execute the same kernel.

                         Affine transformation    DCT
Size                     n = 4096 vertices        n × n = 100 × 100
FLOPs                    32n                      32n²
Single-core model        5.55 FLOPs/cc            6.44 FLOPs/cc
Dual-core model (RR)     7.00 FLOPs/cc            7.00 FLOPs/cc
Dual-core model (MRR)    6.55 FLOPs/cc            6.56 FLOPs/cc
The speedup of the former is higher than the latter because DCT already achieves higher performance on the single-core model than affine transformation does, since affine transformation is less computationally intensive (32n FLOPs) than DCT (32n² FLOPs). Note that the RR method is better for computationally intensive kernels (MMmul, Affine, and DCT), while MRR is better for memory-intensive kernels (SVmul, Givens, Rank-1, SAXPY, and VMmul).

5.2. Executing different kernels on the two models

As mentioned in the previous section, better utilization of a matrix unit shared by a multi-core is expected when different kernels are executed on the two scalar cores, compared to executing these kernels on a single core with a matrix unit. This is because the execution of memory-intensive kernels overlaps with that of computationally intensive kernels. For example, matrix–matrix multiplication (a computationally intensive kernel) can be executed concurrently with scalar–vector multiplication (a memory-intensive kernel). In practice, a multi-core system executes different threads on its cores; these threads may come from different applications or from the same application. Here, matrix–matrix multiplication is selected as the first kernel, and the second kernel is either scalar–vector multiplication, Givens rotation, or rank-1 update. As shown in the first column of Fig. 7 (16 × 16 MMmul), the matrix size of the first kernel (matrix–matrix multiplication) is fixed at 16 × 16 and the vector length or matrix size of the second kernel (SVmul, Givens, or Rank-1) is varied. The overall performance is close to the performance of matrix–matrix multiplication when the size of the second kernel is small. As the size of the second kernel increases, its execution time increases and the overall performance degrades, because the second kernel is memory intensive and its load/store time cannot be overlapped by the execution of matrix–matrix multiplication. Increasing the size of the matrices in the first kernel from 16 × 16 to 32 × 32 improves the overall performance (see the middle column in Fig. 7: 32 × 32 MMmul). Moreover, the degradation in performance with increasing second-kernel size then occurs at larger vector/matrix sizes. Further improvement in the overall performance occurs when the matrices in matrix–matrix multiplication grow from 32 × 32 to 64 × 64 (see the last column in Fig. 7: 64 × 64 MMmul).

Fig. 8 shows the performance evaluation of the dual-core and single-core models when executing affine transformation (Affine) with either SVmul, Givens, or SAD. As the vector length increases, the overall performance of the single-core model decreases, while the performance of the dual-core model increases slowly as long as the execution time of the affine transformation dominates the overall execution time. The performance of the dual-core model drops sharply once the execution time of SVmul/Givens/SAD dominates the overall execution time. The performance breakpoint occurs at a longer vector length as the affine transformation grows from 256 to 512 and then to 1024 vertices.
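At the application level, this pairing of a compute-bound and a memory-bound kernel amounts to something like the following host-side sketch; it is our illustration, not the simulator's API, and the thread bodies stand in for the issue of each kernel's vector/matrix instructions:

```cpp
// Two threads, one per scalar core, sharing the single matrix unit.
#include <thread>

void mmmul_thread() { /* issue MMmul vector/matrix instructions (core1) */ }
void svmul_thread() { /* issue SVmul vector/matrix instructions (core2) */ }

int main() {
    std::thread t1(mmmul_thread);  // computationally intensive kernel
    std::thread t2(svmul_thread);  // memory-intensive kernel
    t1.join();
    t2.join();
    return 0;
}
```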
Moreover, Fig. 9 shows the performance of executing DCT with either SVmul, Givens, or SAD on the dual-core and single-core models. Note that DCT is more computationally intensive than affine transformation; thus, the overall performance is higher on DCT than on affine transformation. As the vector length in SVmul increases, the overall performance of both the single-core and dual-core models decreases, because SVmul performs only one FLOP per two memory references (see Table 2). Using more computationally intensive kernels like Givens and SAD improves the performance of the dual-core model until the execution time of the DCT no longer dominates the overall execution time. The breakpoint occurs at a longer vector length as the size of the DCT increases from 16 × 16 to 32 × 32 and then to 64 × 64. The performance results show that a multi-core utilizes the extended matrix unit more efficiently than a single core. The maximum utilization of the extended matrix unit is achieved when the execution times of the two different kernels are approximately equal. That is a logical result, because maximum overlap between the memory-intensive kernel and the computationally intensive kernel occurs when the execution times of the two kernels are approximately equal. From Figs. 7–9, the maximum difference in performance between the dual-core and single-core models ranges between 0.71 and 2.1 FLOPs/cc. Since the ideal performance of a four-lane matrix unit is 8 FLOPs/cc, the improvement in the utilization of the matrix unit extended to a dual-core ranges from 9% to 26% compared to extending a matrix unit to a single core. Table 4 presents the average and maximum speedups of the dual-core model over the single-core model. The average speedup ranges from 6% to 24%, while the maximum speedup ranges from 13% to 46%.

6. Conclusion and future work

This paper proposed extending multiple cores with a single common matrix unit to improve the performance of data-parallel applications. This approach can increase the overall utilization and throughput of the shared matrix unit embedded into a multi-core chip by providing a mechanism for the simultaneous sharing of the matrix unit by instructions issued by multiple scalar cores. It can also support multithreading inside the matrix unit, where the threads come from either a single application or multiple applications running on the multi-core chip. Each scalar core fetches a scalar/vector/matrix instruction from its instruction cache and decodes it. If the fetched instruction is a scalar instruction, it continues execution in the scalar execution datapath. Otherwise, the fetched vector/matrix instruction is issued by the scalar core to the extended matrix unit. The adaptation of the extended matrix unit to work with multiple scalar cores includes modifying both the instruction flow unit and the matrix control unit, and duplicating the matrix registers unit, which contains the matrix register file and the crossbars that connect matrix registers to functional units. In addition, this paper evaluated the performance of the proposed matrix unit shared by multi-cores. A cycle-accurate model of our proposed multi-core sharing a single matrix unit was implemented using SystemC to simulate and evaluate its performance. For comparison, a single core extended with a matrix unit was also implemented using SystemC. Some dense linear algebra kernels (scalar–vector multiplication, scalar s times vector x plus vector y, apply Givens rotation, rank-1 update, vector–matrix multiplication, and matrix–matrix multiplication) as well as discrete cosine transform, sum of absolute differences, and affine transformation were used to evaluate the performance of a dual-core scalar processor sharing a matrix unit.
Our results showed that the maximum difference in performance between a multi-core and a single core extended with a matrix unit ranges between 0.71 and 2.1 FLOPs/cc. Since the ideal performance of a four-lane matrix unit is 8 FLOPs/cc, the improvement in the utilization of the matrix unit extended to a dual-core ranges from 9% to 26% compared to extending a matrix unit to a single core.
Fig. 7. Performance evaluation of dual-core and single-core models that execute MMmul with either SVmul, Givens, or Rank-1 (panel columns: 16 × 16, 32 × 32, and 64 × 64 MMmul; panel rows: SVmul, Givens, and Rank-1; y-axes: performance in FLOPs/cc; x-axes: vector length or matrix size of the second kernel).

Fig. 8. Performance evaluation of dual-core and single-core models that execute Affine with either SVmul, Givens, or SAD.
Fig. 9. Performance evaluation of dual-core and single-core models that execute DCT with either SVmul, Givens, or SAD.

Table 4
The average/maximum speedups of the dual-core model over the single-core model.

Kernels                  Avg./Max. (%)   Kernels                  Avg./Max. (%)   Kernels                   Avg./Max. (%)
16 × 16 MMmul+SVmul      24/46           32 × 32 MMmul+SVmul      18/42           64 × 64 MMmul+SVmul       16/41
16 × 16 MMmul+Givens     13/17           32 × 32 MMmul+Givens     8/17            64 × 64 MMmul+Givens      9/17
16 × 16 MMmul+Rank-1     13/28           32 × 32 MMmul+Rank-1     6/15            64 × 64 MMmul+Rank-1      8/18
16 × 16 DCT+SVmul        12/16           32 × 32 DCT+SVmul        10/15           64 × 64 DCT+SVmul         9/14
16 × 16 DCT+Givens       11/15           32 × 32 DCT+Givens       10/14           64 × 64 DCT+Givens        9/13
16 × 16 DCT+SAD          12/20           32 × 32 DCT+SAD          11/19           64 × 64 DCT+SAD           11/28
256 Affine+SVmul         19/36           512 Affine+SVmul         18/38           1024 Affine+SVmul         18/38
256 Affine+Givens        14/25           512 Affine+Givens        13/26           1024 Affine+Givens        12/26
256 Affine+SAD           22/43           512 Affine+SAD           21/44           1024 Affine+SAD           21/45

Moreover, the average (maximum) speedup of the dual-core model over the single-core model ranges from 6% (13%) to 24% (46%). The following are some key areas for future research:

• Improving the performance of the scalar cores to improve the performance of the multi-core processor when the percentage of vector/matrix instructions is low. Some simple superscalar and VLIW techniques can be considered to improve performance without drastically increasing area/power.
• Integrating multiple scalar cores with multiple matrix units (cores) to improve the performance of intensive data-parallel applications.
• Developing a compiler for a matrix processor by extending the vectorization technique, which is a mature research area in supercomputers.
• Performance evaluation of more scientific and multimedia applications.
• Calculating the transistor count, die area, and power consumption of the proposed processor.
• Implementing the matrix core on FPGA as an accelerator card for data-parallel applications.

Acknowledgments

The authors would like to thank the editor and reviewers for their valuable comments and suggestions that greatly helped to improve the quality of this paper.
References

[1] G. Amdahl, Validity of the single-processor approach to achieving large scale computing capabilities, in: Proc. AFIPS 1967 Spring Joint Computer Conference, Vol. 30, AFIPS Press, Atlantic City, New Jersey, 1967, pp. 483–485.
[2] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, K. Yelick, The landscape of parallel computing research: a view from Berkeley, Technical Report, EECS Department, University of California, Berkeley, December 2006.
[3] C. Batten, Simplified vector-thread architectures for flexible and efficient data-parallel accelerators, Ph.D. Thesis, Massachusetts Institute of Technology, 2010.
[4] S. Beldianu, S. Ziavras, Multicore-based vector coprocessor sharing for performance and energy gains, ACM Transactions on Embedded Computing Systems, 2012 (in press).
[5] D. Black, J. Donovan, B. Bunton, A. Keist, SystemC: From The Ground Up, second ed., Springer, 2010.
[6] G. Blake, R. Dreslinski, T. Mudge, A survey of multicore processors, IEEE Signal Processing Magazine 26 (6) (2009) 26–37.
[7] R. Buchty, V. Heuveline, W. Karl, J. Weiss, A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators, Concurrency and Computation: Practice & Experience 24 (7) (2012) 663–675.
[8] M. Butler, L. Barnes, D. Sarma, B. Gelinas, Bulldozer: an approach to multithreaded compute performance, IEEE Micro 31 (2) (2011) 6–15.
[9] Dr. Dobb's, The parallel programming landscape: multicore has gone mainstream, but are developers ready? Available at http://software.intel.com/sites/billboard/sites/default/files/PDFs/TW_1111059_StOfParallelProg_v6.pdf, 2012.
[10] R. Espasa, M. Valero, Decoupled vector architecture, in: Proc. 2nd International Symposium on High-Performance Computer Architecture, San Jose, CA, February 1996, pp. 281–290.
[11] F. Ghenassia, Transaction Level Modeling with SystemC: TLM Concepts and Applications for Embedded Systems, Springer, Netherlands, 2005.
[12] L. Hammond, B. Nayfeh, K. Olukotun, A single-chip multiprocessor, IEEE Computer (1997) 79–85.
[13] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, fifth ed., Morgan-Kaufmann, 2011.
[14] R. Hetherington, The UltraSPARC T1 Processor: Power Efficient Throughput Computing, Sun Microsystems Inc., California, USA, 2005.
[15] IEEE Computer Society, IEEE Standard SystemC Language Reference Manual, IEEE, New York, USA, 2006.
[16] C. Kozyrakis, D. Judd, J. Gebis, S. Williams, D. Patterson, K. Yelick, Hardware/compiler codevelopment for an embedded media processor, Proceedings of the IEEE 89 (11) (2001) 1694–1709.
[17] R. Kumar, D. Tullsen, N. Jouppi, P. Ranganathan, Heterogeneous chip multiprocessors, Computer 38 (11) (2005) 32–38.
[18] A. Leon, B. Langley, J. Shin, The UltraSPARC T1 processor: CMT reliability, in: Proc. IEEE Custom Integrated Circuits Conference, September 2006, pp. 555–562.
[19] G. Moore, Cramming more components onto integrated circuits, Electronics 38 (8) (1965) 114–117.
[20] K. Olukotun, L. Hammond, The future of microprocessors, ACM Queue 3 (7) (2005) 26–29.
[21] Open SystemC Initiative, The SystemC Library. Available at: http://www.systemc.org/.
[22] D. Patterson, J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, fourth ed., Morgan Kaufmann, San Francisco, CA, 2011.
[23] W. Ro, S. Crago, A. Despain, J. Gaudiot, Design and evaluation of a hierarchical decoupled architecture, The Journal of Supercomputing 38 (2006) 237–259.
[24] J. Shalf, J. Bashor, D. Patterson, K. Asanovic, K. Yelick, K. Keutzer, T. Mattson, The manycore revolution: will HPC lead or follow? Journal of SciDAC Review (14) (Fall 2009) 40–49.
[25] J. Smith, Decoupled access/execute computer architectures, ACM Transactions on Computer Systems 2 (4) (1984) 289–308.
[26] M. Soliman, A Technology-Scalable Matrix Processor for Data Parallel Applications: Trident Processor, LAP LAMBERT Academic Publishing, ISBN: 9783844395327, 2011.
[27] M. Soliman, Mat-core: a decoupled matrix core extension for general-purpose processors, in: Neural, Parallel and Scientific Computations, Vol. 19 (1), Dynamic Publishers, Atlanta, USA, 2011, pp. 91–110. ISSN: 1061-5369.
[28] M. Soliman, G. Abozaid, Shared cryptography accelerator for multicores to maximize resource utilization, in: Proc. 7th IEEE International Conference on Computer Engineering and Systems, ICCES 2011, Ain-Shams University, Cairo, Egypt, November/December 2011, pp. 33–38.
[29] M. Soliman, A. Al-Junaid, Out-of-order matrix processor: implementation and performance evaluation, in: Proc. 2nd International Conference on Advanced Computer Theory and Engineering, ICACTE 2009, ASME Press, Cairo, Egypt, 2009, pp. 809–816.
[30] M. Soliman, A. Al-Junaid, SystemC implementation of Mat-core: a matrix core extension for general-purpose processors, in: Proc. IEEE International Conference on Design & Technology of Integrated Systems in Nanoscale Era, DTIS'09, Cairo, Egypt, April 2009, pp. 9–14.
[31] M. Soliman, A. Al-Junaid, Performance-area trade-offs for decoupled matrix processor, in: Proc. 19th International Conference on Computer Theory and Applications, ICCTA 2009, Alexandria, Egypt, October 2009, pp. 44–51.
[32] M. Soliman, A. Al-Junaid, Codevelopment of multi-level instruction set architecture and hardware for an efficient matrix processor, in: Neural, Parallel and Scientific Computations, Vol. 18 (1), Dynamic Publishers, Atlanta, USA, 2010, pp. 59–74. ISSN: 1061-5369.
[33] M. Soliman, A. Al-Junaid, SystemC implementation and performance evaluation of a decoupled general-purpose matrix processor, in: Parallel Processing Letters (PPL), World Scientific Publishing Company, 2010, pp. 103–121. ISSN: 0129-6264.
[34] J. Thornton, Parallel operation in the Control Data 6600, in: Proc. 26th AFIPS Conference, Vol. 2, 1964, pp. 33–40.
[35] J. Thornton, Design of a Computer: The Control Data 6600, Scott Foresman, Glenview, IL, 1970.
[36] C. Tuncali, Implementation and Simulation of 68HC11 Microcontroller Unit Using SystemC for Co-design Studies, M.Sc. Thesis, Middle East Technical University, 2007.
[37] L. Wang, M. Huang, T. El-Ghazawi, Towards efficient GPU sharing on multicore processors, in: Proc. 2nd International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, PMBS'2011, November 2011, pp. 23–24.
Mostafa I. Soliman is an associate professor at Aswan University, Egypt. He received a B.E. in electrical engineering (1994) and an M.E. in computer science and engineering (1998) from the University of Assiut, Egypt. His Ph.D. is in computer science and engineering (2004) from the University of Aizu, Japan. He is an IEEE member. He is interested in computer architecture, parallel processing, high performance computing, vector/matrix processing, performance evaluation of multi-core/many-core processors, parallel algorithms, hardware visualization, and FPGA and SystemC implementations. To contact: [email protected] and [email protected].
Dr. Abdulmajid F. Al-Junaid received his BS degree in control and computer engineering from the University of Technology, Baghdad, Iraq in 2000. He received his M.Sc. and Ph.D. degrees in electrical engineering ‘‘computer engineering’’ in 2007 and 2011, respectively, from Assiut University, Egypt. Currently, he is an assistant professor, Electrical Engineering Department, Faculty of Engineering and Architecture, Ibb University, Ibb, Yemen. His research interests include high performance computation, parallel processing, matrix/vector computation, FPGA/SystemC implementation, and multi-core/many-core processors.