An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing

J. Parallel Distrib. Comput. 70 (2010) 1110–1118. doi:10.1016/j.jpdc.2010.07.002

Jung-Wook Park (a), Hoon-Mo Yang (a), Gi-Ho Park (b), Shin-Dug Kim (a), Charles C. Weems (c,*)

(a) Department of Computer Science, C532, Yonsei University, 134 Shinchon-dong, Seoul 120-749, Republic of Korea
(b) Department of Computer Engineering, Sejong University, 98 Kunja-Dong, Kwangjin-Ku, Seoul 143-747, Republic of Korea
(c) Department of Computer Science, University of Massachusetts, Amherst, MA 01003-4610, United States
(*) Corresponding author

Article history: Received 14 December 2008; Received in revised form 29 June 2010; Accepted 7 July 2010; Available online 14 July 2010.

Keywords: SIMD; Programmable GPU; Instruction systolic; Pipelined management

Abstract

To meet both the performance and programmability demands of 3D graphics applications, recent graphics processing units have employed vector and multithreaded SIMD architectures. This paper introduces a novel instruction-systolic array architecture, which transfers an instruction stream in a pipelined fashion to efficiently share the expensive functional resources of a graphics processor. In such parallel architectures, cache misses and dynamic branches can cause additional latencies and complicate management. To address this problem, we combine a systolic execution scheme with on-demand warp activation, which handles cache miss latency and branch divergence efficiently without significantly increasing hardware resources, either in terms of logic or register space. Simulation indicates that the proposed architecture offers 25% better performance than a traditional SIMD architecture with the same resources, and requires significantly fewer resources to match the performance of a typical modern vector multi-threaded GPU architecture.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Various key applications have characteristics that can exploit parallelism to achieve high performance, and parallel architectures are often designed to achieve maximum performance for specific target applications. For example, GPU (Graphics Processing Unit) architectures were traditionally designed with a fixed pipeline structure to perform 3D rendering in real time. Recently, various shading effects have been added to the requirements placed upon rendering pipeline structures, necessitating greater programmability [6,11].

In general, vector and data-parallel architectures offer significant advantages over superscalar processors for many data- and computation-intensive applications where abundant parallelism can be statically identified. Such architectures can execute many operations concurrently in each cycle without complicated scheduling mechanisms. Hence, vector architectures are one approach to providing the necessary performance and programmability. Vector and multithreaded execution models have been applied to recent SIMD (Single Instruction Multiple Data) based shader architectures [1,12–15] because 3D graphics applications offer rich parallelism over thousands of threads.




SIMD is a proven, cost-effective, simple architecture that offers easy parallelization, but the resources for each PE must be organized efficiently. PEs cannot share their functional resources without suffering performance loss, because every PE accesses the same resource simultaneously. Thus, SIMD PEs are generally designed with their own functional units. In a GPU, a texture unit may be shared among multiple PEs, so delays can be encountered when the PEs contend for this shared resource.

This paper introduces an instruction-systolic programmable shader architecture based on a vector multi-threading approach, in which the instruction stream flows in systolic fashion through the array of processing elements. As a result, some expensive functional resources can be shared efficiently across the array. The proposed architecture also has the ability to form a group of active warps dynamically, which can be executed logically together. Overall, performance is improved 26% over a baseline GPU architecture by enhancing WLP (Warp-Level Parallelism) and register usage efficiency, and by efficient management of shared resources within a warp.

Actual shader programs were evaluated with various simulation models to analyze the effect of different design choices. Our results show that the proposed architecture has almost the same performance as a conventional SIMD architecture that uses four times as many special function and texture units. The proposed architecture can also obtain 25% better performance than SIMD architectures with similar resources, without increasing the number of warps running concurrently.


2. Related work and multi-threading in shader applications

Multithreaded architectures can tolerate the long latency of operations by overlapping these operations with the execution of other contexts. They also increase resource utilization by issuing multiple operations from different contexts [10,18]. The vector-thread architecture [8], SCALE, was developed to utilize explicit data-level parallelism among different threads. To unify vector architectures with multi-threading, SCALE can broadcast a block of instructions to every thread. Each thread then fetches its instructions individually, because it can take a different control flow path from the other threads.

Graphics rendering applications typically comprise tremendous numbers of shader threads. Although GPUs exploit various forms of parallelism, the most significant form is vector multi-threading (VMT), because thousands of threads can execute the same shader program on their individual data sets [7]. Fig. 1 shows the organization of a SIMD-based programmable shader architecture and its execution model in the GPU. As shown in Fig. 1, multiple threads that execute the same shader program on their own data sets can be performed as a group, called a warp [14,16]. Every PE in a SIMD architecture executes a thread based on the instructions broadcast by the control unit. In general, programmable shader architectures have four groups of instructions: general ALU operations, branch instructions, texture instructions, and special mathematics operations such as sine, exponent, and log. Although SIMD architectures have a lower hardware implementation cost than multiprocessors, in terms of the number of gates required to achieve a given degree of parallelism, SIMD suffers from lower effective parallelism due to the inherent limitations of its control architecture. Obtaining more effective parallelism from SIMD requires additional techniques to exploit the potential offered by warp thread groups.

2.1. Interleaved multi-warp

For several reasons, recent GPU architectures interleave multiple warps by fetching their instructions alternately. For example, when a cache miss occurs in an active warp, the other warps can take over the processor array, so off-chip memory access time can be hidden effectively. Another advantage of warp interleaving is increased utilization of functional units. The warp scheduler maintains a group of active warps and fetches an instruction from any selected warp on every cycle. When a multi-cycle operation using a special function unit is issued, instructions using other functional units can be issued from the other warps to keep those units busy; a minimal sketch of this fetch policy appears after the instruction-mix summary below.

2.2. Shader instructions and functional units

Recent programmable shaders are designed as scalable VMT arrays with a SIMD model, called a Single Instruction Multiple Thread (SIMT) architecture [1,15,14,5]. Resource allocation for each processing unit to execute a thread is an important factor in performance. To understand these resource requirements, eleven representative shader programs [4] were analyzed to determine the distribution of instructions among the different categories: basic lighting VS, bulge VS, reflection VS, refraction VS, refraction FS, dispersion VS, bump FS, fog VS, fog FS, HDR_obj VS, and HDR_obj FS, where VS indicates a vertex shader and FS a fragment shader. As shown in Fig. 2, the percentages of each instruction type across this entire suite are: 82% vector ALU instructions (divided into subcategories by vector width, as explained below), 12% special instructions (SI), 4% texture instructions (TI), and around 2% branch instructions.
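To make the interleaving policy of Section 2.1 concrete, the following Python sketch models a fetch-width-one scheduler that rotates among active warps and parks any warp that misses in the cache. The class names, the instruction encoding, and the 300-cycle latency are our own illustrative assumptions, not details of any particular GPU.

```python
from collections import deque

MISS_LATENCY = 300  # assumed off-chip miss latency, in cycles

class Warp:
    """A warp reduced to a linear trace of instruction tokens."""
    def __init__(self, wid, trace):
        self.wid = wid
        self.trace = trace          # list of "alu" or "miss" tokens
        self.pc = 0
        self.stalled_until = 0      # cycle at which a pending miss resolves

    def done(self):
        return self.pc >= len(self.trace)

def run(warps):
    """Fetch one instruction per cycle from any unstalled warp (round robin)."""
    cycle = 0
    ring = deque(warps)
    while any(not w.done() for w in warps):
        cycle += 1
        for _ in range(len(ring)):
            w = ring[0]
            ring.rotate(-1)         # move the candidate to the back of the ring
            if w.done() or cycle < w.stalled_until:
                continue            # finished, or still waiting on memory
            op = w.trace[w.pc]
            w.pc += 1
            if op == "miss":        # long-latency access: park this warp while
                w.stalled_until = cycle + MISS_LATENCY  # the others keep the array busy
            break                   # fetch width is one
    return cycle

# The miss in warp 0 is overlapped with 300 cycles of work from warp 1.
print(run([Warp(0, ["alu", "miss", "alu"]), Warp(1, ["alu"] * 300)]))
```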


Basic ALU instructions are actually vector operations that calculate up to four data elements simultaneously. In Fig. 2, these are divided up according to the size of vector actually calculated (vector 1 through vector 4). Note that only 35% of vector instructions are applied to four-element vectors; the majority of vector operations thus underutilize the vector ALU. VLIW (Very Long Instruction Word) and superscalar architectures have also been introduced for shader processors [3,20] to increase functional unit utilization. These ILP architectures are designed to issue multiple micro-vector operations at a time (typically five in current implementations). Because the potential ILP is limited by the code, an instruction serialization technique [2] can be applied to transform the code to achieve more utilization of the scalar ALU. However, other functional units may remain idle.

Some conventional shader processors have special function units [14], which perform operations such as log, exponent, square root, and reciprocal. These units add complex logic to every processing element despite infrequent execution. Furthermore, many texture unit architectures suffer a significant resource constraint, because a single texture instruction may require multiple texels to be fetched from the texture cache at specifically calculated addresses and interpolated through a filter algorithm. SIMD-based shader processors thus incur a significant on-chip bandwidth bottleneck, because all processing elements need to share one or more of these bandwidth-limited texture units.

3. Systolic execution architecture

The proposed instruction-systolic architecture is designed to fully utilize a limited number of functional units. It also supports the hiding of cache miss latency through multithreading, while avoiding the need for a huge register file. The basic architectural model includes a control unit, sixteen processing elements, four texture fetch units (TFU), four special function units (SFU), a cache memory, a data transfer bus, and a systolic interconnect for instruction transfer, as shown in Fig. 3. The control unit (CU) in the proposed architecture is designed to dispatch threads, organize the warps, and fetch the instruction stream in a manner similar to previous SIMD-based shaders. Available threads are maintained in the local thread pool, and thread IDs, which together constitute a specific warp, are granted to each PE. The control unit has a single program counter and instruction memory for thread execution. The instruction fetch mechanism is also conventional, but the pipelined instruction transfer scheme staggers the instruction issue time of each PE, just as data transfers are staged in a conventional systolic architecture [9]. Instead of an instruction broadcast bus, the proposed architecture is equipped with a PE-to-PE systolic interconnection that transfers each instruction to the neighboring PE. The basic functional units and register file organization are similar to conventional SIMD-based shader architectures. The special function units are placed alongside the texture fetch units, and their use can likewise be shared. Each PE can request any shared resource, but only a limited number of PEs, which have priority, can use a resource at any time. Four SFUs and four TFUs are connected to each PE by a data transfer bus, and a Resource Selector manages access priority and data transfer on the bus. A sketch of this configuration follows.
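The baseline organization just described can be summarized as a small configuration record; this is only our shorthand for the stated resource counts, with field names invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SystolicShaderConfig:
    """Resource counts of the modeled array (field names are ours)."""
    num_pes: int = 16        # processing elements in the systolic array
    num_sfus: int = 4        # shared special function units (sin, exp, log, ...)
    num_tfus: int = 4        # shared texture fetch units
    broadcast_bus: bool = False  # instructions flow PE-to-PE, not broadcast

BASELINE = SystolicShaderConfig()
print(BASELINE)
```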
3.1. Resource-sharing pipelined execution

When a warp is ready to run, the CU allocates a thread executing the same shader program to each PE. To start this shader program on the proposed architecture, the CU continuously fetches instructions from the shader code corresponding to the selected thread and passes them to the first PE through the systolic interconnection network.



Fig. 1. SIMD-based vector multi-threading in a recent GPU.

Fig. 2. Instruction type distribution in eleven shader programs. From left to right, branches, texture instructions, special instructions, and vector instructions with one, two, three, or four operands.

Fig. 3. Overview of resource sharing systolic execution architecture.

Fig. 4. Systolic instruction transfer.

Fig. 4 shows this mechanism, where a warp of 8 threads (T0 to T7) is assigned to 8 PEs (PE0 to PE7), respectively. At every cycle, each PE executes its given instruction and passes the previous instruction to its neighboring PE. For instance, as shown in Fig. 4, the first operation 'OP0' is delivered to PE5 five cycles after PE0 issued it. Each warp's completion is delayed by the pipeline's fill time, which is determined by the array's width, as shown in the box on the right side of Fig. 4. Overall PE utilization does not decrease, because the CU can quickly load a new warp after the first PE completes its own thread.
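The skewed schedule of Fig. 4 is easy to state in code: PE k executes instruction i at cycle i + k, so a program of N instructions on a W-wide array drains in N + W - 1 cycles. The sketch below is our illustration, not the authors' simulator.

```python
def systolic_timeline(num_pes, program):
    """Map cycle -> [(pe, op), ...] under the one-cycle instruction skew:
    PE k receives its left neighbor's previous instruction, so it
    executes instruction i at cycle i + k."""
    timeline = {}
    for i, op in enumerate(program):
        for pe in range(num_pes):
            timeline.setdefault(i + pe, []).append((pe, op))
    return timeline

timeline = systolic_timeline(num_pes=8, program=["OP0", "OP1", "OP2"])
print(timeline[5])    # [(5, 'OP0'), (4, 'OP1'), (3, 'OP2')]: OP0 reaches PE5
print(max(timeline))  # 9: last busy cycle; the program drains in 3 + 8 - 1 = 10 cycles
```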


Fig. 5. Shared resource management: (a) SIMD; (b) systolic without conflict; (c) RF resource allocation.

The proposed architecture and its management techniques improve overall performance by reducing PE idle time caused by resource conflicts and texture cache accesses. Fig. 5 shows the effect of delayed execution in the proposed architecture. To keep the figure simple, only four PEs are shown, with one shared special function unit assumed in Fig. 5(a) and (b), and two shared special function units in Fig. 5(c). Fig. 5(a) shows the SIMD case, where sharing of a resource in OP3 (shaded blocks) causes conflicts among PEs. In contrast, Fig. 5(b) shows that thread pipelining achieves full utilization of the shared resource: because each PE delays its start time by one cycle, accesses to the shared resource are automatically interleaved for different data. Of course, consecutive accesses to the same shared resource can still cause conflicts. When a resource conflict occurs, the resource selector gives priority to the later threads, according to a right-node-first (RF) strategy. For example, as shown in Fig. 5(c), OP4 in T0, OP3 in T1, and OP2 in T2 request the shared resource simultaneously, at the fifth cycle, so a conflict occurs among them when only two shared resources are available. The RF strategy gives higher priority to T2:OP2 and T1:OP3, and T0:OP4 stalls its execution. A stall in any thread causes subsequent stalls of the chained threads behind it, because they cannot receive subsequent instructions; hence, the sequence of operations through the array is not changed.
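The right-node-first rule can be stated in a few lines. This sketch, with our own naming, reproduces the Fig. 5(c) outcome: with two shared units, T2 and T1 win and T0 stalls.

```python
def arbitrate_rf(requesting_pes, num_units):
    """Right-node-first arbitration: when more PEs request a shared unit
    than units exist, the later threads (higher PE indices) are granted;
    the rest stall, which also stalls every thread chained behind them,
    so the order of operations through the array is preserved."""
    granted = sorted(requesting_pes, reverse=True)[:num_units]
    stalled = [pe for pe in requesting_pes if pe not in granted]
    return granted, stalled

# Fig. 5(c): T0 (OP4), T1 (OP3), and T2 (OP2) collide on two shared units.
print(arbitrate_rf([0, 1, 2], num_units=2))  # ([2, 1], [0]): T0 stalls
```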

3.2. On-demand warp activation

Major sources of potential performance degradation in a conventional shader architecture are the cache miss penalty, dependencies among the instructions in a sequence, and a limited number of specific functional units. A conventional GPU exploits only warp-level parallelism to address these problems, so a large number of threads must be executed simultaneously; the cost of this warp-level parallelism is a huge register file for storing the state of the active threads. One approach to addressing this situation, balanced multi-threading (BMT) [19], combines simultaneous multithreading for short-latency operations with coarse-grained multithreading for long-latency operations. According to its findings, it is more efficient to execute only a limited number of contexts simultaneously when other contexts are ready to run.

Alternatively, a simple method, on-demand warp activation, can be used to exploit warp-level parallelism while avoiding the hardware cost of a large register file. In this scheme, the warp manager prepares multiple warps for execution as in the interleaved multi-warp scheme, but only a few of them are allowed to execute simultaneously. A limited number of warps starts together, and another ready warp is started when a running warp encounters a long-latency operation (the policy is sketched below, just before Table 1). In this arrangement, a stalled warp does not need to be switched out. The minimum number of active warps in the proposed systolic architecture can be determined from the condition required to eliminate pipeline stalls caused by data dependencies among the in-flight instructions. Under this approach, it is possible for a single warp to obtain almost optimal SFU utilization.

To evaluate the effect of the number of active warps on the performance-degrading factors noted above, an experimental environment for Simultaneous Multi-Threading (SMT) [18] was modified to simulate the interleaved multi-warp scheme. The SMT processor fetches instructions from multiple threads at every cycle while considering fairness with respect to each thread. It is assumed that each thread in the SMT is a warp consisting of only one thread; thus, fetch width is limited to one, as in the case of multi-warp interleaving. To isolate the effect of other features of the shader architecture, a perfect branch predictor and a zero-miss instruction cache and TLB were used in the simulation. Four integer programs from SPEC2000 and their combinations were used for simulation. To emulate the resource occupancy of short shader threads, each thread is divided into groups of 500 instructions and formed into a batch of short-length threads; functional resources such as registers and instruction queue entries are then automatically freed after committing all 500 instructions. Overall, 16 warps are prepared for execution and a certain number of them is activated simultaneously. After every short-length thread finishes, a new set of active warps starts, and the previous resource occupancy of each warp is restored during this activation time. The number of initially active warps is varied over the values 2, 4, and 8, plus the original multi-warp interleaving size of 16 active warps. At every cycle, an instruction is fetched from a warp selected from among the active warps using the ICOUNT [17] policy. When a cache miss occurs in any active warp, another warp is added to the active warp group. The memory system is designed so that a private data cache is dedicated to each individual thread, because performance can be improved easily when fewer active warps share the same size of cache. Off-chip memory access latency in real systems can vary due to factors such as controller interconnections, queuing delay, and DRAM paging overhead, but it is fixed at 300 cycles in the simulator to simplify the analysis. Detailed configurations for the simulation are summarized in Table 1.

The results are presented in Fig. 6. Each bar shows IPC for a given set of threads. The figure's legend can be explained with an example: D16K_8/16 indicates a 16 kB data cache with 8 warps activated from a total of 16 ready warps. 16 kB and 64 kB data caches are used to explore the effect of this parameter; 16 kB yields overall lower performance due to low cache hit rates, while the 64 kB data cache requires less frequent off-chip memory requests. The cache hit rate for each configuration is denoted below the name of the application in Fig. 6.
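Before turning to the measured results, here is a minimal sketch (ours, with invented names) of the activation policy itself: start a fixed number of warps, and promote one more ready warp each time an active warp encounters a long-latency event.

```python
def on_demand_activation(ready_warps, initial_active, long_latency_events):
    """Activate a small initial set of warps; on each long-latency event
    in an active warp, promote one more ready warp, instead of having
    interleaved all 16 warps (and their registers) from the start."""
    ready = list(ready_warps)
    active = [ready.pop(0) for _ in range(min(initial_active, len(ready)))]
    history = [tuple(active)]
    for _event in long_latency_events:   # e.g., texture cache misses
        if ready:
            active.append(ready.pop(0))  # one extra warp hides the latency
        history.append(tuple(active))
    return history

# 16 ready warps, 8 started, two misses occur: the active set grows to 10.
for step in on_demand_activation(range(16), 8, ["miss in w3", "miss in w5"]):
    print(step)
```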



Table 1. Simulation configurations.

Fetch width: 1
Pipeline depth: 8 stages
Number of ready threads: 16
Branch predictor: perfect predictor
Functional units: 1 integer unit (including 1 LD/ST unit) and 1 FP unit; all instructions have 1-cycle latency
Instruction queues: 8 integer, 8 FP
Rename registers: 128 integer, 128 FP
L1 instruction cache: 1-cycle access, no misses
L1 data cache: 2-way, 64-byte line, private cache per thread (1 kB × 16 or 4 kB × 16)
Memory latency: 300 cycles, fully pipelined

Fig. 6. Overall IPC with various configurations and thread combinations: (a) 16 kB cache (1 kB per thread); (b) 64 kB cache (4 kB per thread). Bars compare 2, 4, 8, and 16 active warps out of 16 ready warps (legend entries D16k_2/16 through D64k_16/16) for crafty, GCC, equake, mgrid, their pairwise combinations, and all four together; per-benchmark cache hit rates range from 69% to 93% with the 16 kB cache and from 93% to 97% with the 64 kB cache.

In most cases, 8 active warps show the best performance, because this is the minimum number of warps needed to ensure that there is no read or write delay within the 8-stage pipeline. Pure multi-warp interleaving, denoted as 16/16, suffers from resource conflicts in most cases, but shows better performance in the crafty and mgrid benchmarks with the 64 kB cache. Overall, we see about a 26% improvement in average performance from halving the number of active warps.

3.3. Handling branch divergence

Both SIMD and systolic multithreading architectures encounter the problem of control flow divergence caused by dynamic branch operations. It is conceptually more challenging for a systolic architecture to address branch divergence, because each thread's execution is interleaved in time. The simplest method to handle a branch divergence is to serialize the warp by fetching instructions from each control flow path in turn and stalling the alternate thread(s) with mask bits. Within a warp, the instruction after a branch cannot be fetched before the corresponding branch is resolved, because a conventional SIMD shader serializes its execution when at least one thread takes a control path that diverges from the others.

In the proposed systolic approach, the branch resolution time of each thread is staggered, and serialization can be delayed until the branch of the thread in the last PE is resolved. Even when some thread in a warp has not yet resolved its branch, at least one or more instructions can be fetched, because the first PE can resolve its branch result without any additional delay. Moreover, the decision to serialize can be made by an earlier thread. The worst case occurs when the control path for the first branch result has only one instruction and only the thread in the last PE diverges. Even so, the overhead of serialization in the proposed systolic scheme is negligible, because this worst case is rarely encountered; shader programs are generally coded to avoid it. By adopting this serializing mechanism, the proposed systolic architecture can use an existing branch reconvergence technique based on a hardware stack and compiler support, called the immediate post-dominator technique [5]. This approach provides a simple and efficient way to reduce the performance degradation.

Here, we provide a detailed description of the stack operations with the example control flow shown in Fig. 7. The example program shown in (a) has the same branch divergence as in [5], but each given branch result is different. The execution order of this program should be ABCDEFG, as shown in (b). Each box in (c) and (d) indicates the basic block for a given lane and its status: plain bold indicates normal execution, and underlined (m) indicates stalled execution with mask bits.


Fig. 7. Example of managing branch divergence: (a) example control flow; (b) status of the stack; (c) SIMD; (d) proposed systolic.

The tables in (b) present the internal stack values, namely the re-convergence PC (R.PC), next PC (N.PC), and resolved mask (R.mask) bits, at each of the following points: (i) after executing A, (ii) after executing B, (iii) after C, (iv) after D, and (v) after E, respectively. At every cycle, the top of the stack is chosen as the target for the next PC. The aforementioned worst case can happen at the end of block B when block C has only one instruction; in that case, the first instruction of block D can be fetched only after the last PE executes the end of block B (the branch to D).
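For concreteness, here is a simplified rendering (ours) of the immediate post-dominator stack of [5]: on a divergent branch the join point is pushed first, then each side with its lane mask, and execution always follows the top of the stack, so the two sides serialize and then rejoin.

```python
def push_divergence(stack, full_mask, taken_pc, taken_mask, fall_pc, reconv_pc):
    """Push the reconvergence point, then the not-taken and taken sides;
    popping from the top therefore runs the taken lanes, the not-taken
    lanes, and finally all lanes together at the post-dominator."""
    fall_mask = full_mask & ~taken_mask
    stack.append((reconv_pc, full_mask))       # R.PC entry: where lanes rejoin
    if fall_mask:
        stack.append((fall_pc, fall_mask))     # lanes that fell through
    if taken_mask:
        stack.append((taken_pc, taken_mask))   # lanes that took the branch

# Four lanes at block A; only lane 0 branches to B, lanes 1-3 fall to C,
# and the immediate post-dominator of the branch is block D.
stack = []
push_divergence(stack, 0b1111, taken_pc="B", taken_mask=0b0001,
                fall_pc="C", reconv_pc="D")
while stack:
    pc, mask = stack.pop()
    print(f"run block {pc} with mask {mask:04b}")
# -> B with 0001, then C with 1110, then D with 1111
```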

4. Experimental results

A skeleton execution-driven shader simulator was developed to analyze overall performance. This simulator does not functionally emulate the instructions, but only calculates the occupancy and latency of each instruction and its corresponding resources. Both the SIMD model and the proposed architecture were implemented with variable parameters such as array width, the number of resources, and the latency of each operation. Assembly-level shader programs were used as execution inputs, and the total number of completed threads was obtained as the performance metric for the different shader array models. The main contribution of this paper is to improve the efficiency of resource utilization without increasing the number of active warps. To evaluate the effect of the proposed mechanism clearly, all performance results are normalized to a SIMD architecture that has dedicated functional units for every PE.

4.1. Resource efficiency

The effect of the proposed resource-sharing architecture is presented in Fig. 8 for eleven shader programs with twelve array configurations. The number of special function units and texture units for the proposed architecture is varied from 1 to 8 on the x-axis. The number of units for the conventional SIMD processor is varied according to the even divisors of the array width; thus, only points for 1, 2, 4, and 8 are shown, and logarithmic trendlines are added to approximate the unexamined points on the SIMD curve. The following assumptions are used in the simulation. In both SIMD and the proposed systolic architecture, only a single warp is active at a given time while the other ready warps wait for execution. The cache miss penalty is not considered, because both interleaved multi-warp and on-demand warp activation can efficiently hide the latency when they have the same number of ready warps. The simulation results therefore show purely the effect of exploiting the proposed systolic-level parallelism, separate from warp-level parallelism.

As the results show, performance of the SIMD-based shader array decreases linearly as the number of functional units decreases. Because all of the processing elements try to access the same resource at the same time, full performance cannot be achieved even when just two PEs share a resource. The proposed systolic execution architecture, however, achieves nearly optimal performance with only four special function units, because that number is sufficient to avoid resource conflicts over the entire array for most of the shader programs. When the entire shader array shares only one special function unit, the lowest performance is observed for both architectures, because most of the operations cause significant resource conflicts on this single resource. With the exception of this degenerate case, the systolic execution architecture achieves performance similar to a SIMD architecture with two to four times as many special function units.
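A back-of-the-envelope argument (ours, not from the paper) for the shape of these curves: in a $P$-wide SIMD array with $R$ shared units, every one of the $N_s$ shared-resource instructions in an $N$-instruction program presents $P$ simultaneous requests and must serialize into $\lceil P/R \rceil$ rounds, so

$$T_{\mathrm{SIMD}} \;=\; N + N_s\left(\left\lceil \frac{P}{R} \right\rceil - 1\right).$$

Under the one-cycle systolic skew, PE $k$ issues instruction $i$ at cycle $i+k$, so a run of $c$ consecutive shared-resource instructions generates at most $c$ overlapping requests. Assuming single-cycle shared operations, $R$ units add no stalls whenever every such run satisfies $c \le R$, which is consistent with four SFUs sufficing for most of the shader programs.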



Fig. 8. Performance of various shader configurations, normalized to that of a shader with fully dedicated functional units. One panel per program (basic lighting VS, bulge VS, reflection VS, refraction VS, refraction FS, dispersion VS, bump FS, fog VS, fog FS, HDR-obj VS, HDR-obj FS) plus the average; each panel plots the proposed architecture against SIMD, with a logarithmic trendline (Log. (SIMD)) interpolating the SIMD points.

4.2. Performance optimization

Even though resource sharing achieves near-optimal performance in the systolic architecture, some remaining resource conflicts can be removed by efficient code scheduling. Consider the sample shader program shown in Fig. 9, which has four operations requiring the shared resource, represented by the gray shaded boxes. Suppose three operations have execution dependencies, namely OP2 on OP1, OP9 on OP3, and OP18 on OP13, and that only three shared resources are available. Instruction reordering can be applied to this code to avoid the resource conflicts while still preserving the operation dependencies (a sketch of such a reordering appears at the end of this subsection).

The overall performance of the chosen shader programs was evaluated for three models: the 16-way SIMD case (s4), and the proposed systolic architecture operating with the original code (p4) and with optimized code (po4). All three configurations have four special function units and four texture units, and their results are normalized to the SIMD architecture with full resources. The results are presented in Fig. 10. The proposed systolic architecture, equipped with a limited number of resources, shows only around a 3% loss of performance compared with the ideal model, which has four times as many resources. After optimization, it was possible to improve the performance of the shader programs by an average of 1.5% without any additional hardware cost. When the SIMD and proposed architectures have the same limited number of resources, the proposed architecture shows 25% better performance than the SIMD model on average.
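A greedy, dependence-preserving reordering in the spirit of Fig. 9 can be sketched as follows; the operation list, the dependence map, and the windowing rule are our illustrative assumptions, not the authors' compiler pass.

```python
def reorder(ops, deps, num_shared):
    """Greedy list scheduling: per issue window, admit at most num_shared
    operations that use the shared resource, reordering independent
    instructions while honoring every dependence in deps."""
    done, order, remaining = set(), [], list(ops)
    while remaining:
        shared_in_window, issued = 0, []
        for name, uses_shared in remaining:
            if any(d not in done for d in deps.get(name, ())):
                continue                    # a predecessor has not issued yet
            if uses_shared and shared_in_window == num_shared:
                continue                    # would conflict on the shared units
            issued.append((name, uses_shared))
            shared_in_window += uses_shared
        if not issued:
            raise ValueError("cyclic dependence graph")
        for op in issued:
            remaining.remove(op)
            done.add(op[0])
            order.append(op[0])
    return order

# Shared-resource ops OP2, OP3, OP9, OP18; OP2 depends on OP1, OP9 on OP3,
# and OP18 on OP13, as in the Fig. 9 example.
ops = [("OP1", 0), ("OP2", 1), ("OP3", 1), ("OP9", 1), ("OP13", 0), ("OP18", 1)]
deps = {"OP2": ["OP1"], "OP9": ["OP3"], "OP18": ["OP13"]}
print(reorder(ops, deps, num_shared=3))
# -> ['OP1', 'OP3', 'OP13', 'OP2', 'OP9', 'OP18']: never more than three
#    shared-resource operations in any window.
```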

5. Conclusion

This paper introduces an instruction-systolic architecture and its associated array management scheme for vector multithreading. Systolic instruction execution makes it possible to share special function unit resources efficiently among PEs by automatically interleaving requests that would otherwise occur simultaneously. Hence, the number of certain expensive resources, such as functional units for special math operations, texture units, and their on-chip memory ports to the caches, can be reduced without performance loss. A conventional GPU uses multi-warp interleaving, which runs multiple warps simultaneously in a time-sharing fashion so that the miss penalty of any active warp can be hidden by the execution of the other warps. The cost of exploiting warp-level parallelism (WLP) this way is a huge register file for maintaining the active threads. The proposed systolic scheme requires less WLP because it can almost fully utilize the SFUs with only a single warp. Therefore, a simple on-demand warp activation approach was devised to avoid the hardware cost of the extra registers. Thus, either the register file size can be minimized, or each warp can utilize more registers as the number of active warps is decreased. A performance gain of approximately 26% over the conventional GPU is obtained by halving the number of active warps that share renaming registers.

The main contribution of this work is the improvement of resource utilization in a vector multi-threading graphics architecture. Eleven representative shader programs were used to estimate the access frequency of functional resources as well as the overall performance of the architecture. According to our simulation results, the proposed architecture shows almost the same performance as a conventional SIMD architecture that has four times as many special function units. The proposed architecture was also shown to offer 25% better performance than a SIMD architecture with the same resources, without increasing the number of warps running concurrently. Hence, the proposed architecture provides an effective alternative for cost-sensitive 3D graphics applications.


Fig. 9. Example code modification for performance tuning: (a) original code with dependency; (b) modified code to reduce resource conflicts.

Fig. 10. Overall performance improvements in the proposed architecture. Values are normalized to the performance of a SIMD array with dedicated functional units for each PE.

Acknowledgment

This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (No. R01-2007-000-11309-0).

References

[1] Advanced Micro Devices, Inc., ATI CTM Guide, 1.01 ed., 2006.
[2] H. Clark, A. Hormati, S. Yehia, S. Mahlke, K. Flautner, Liquid SIMD: abstracting SIMD hardware using lightweight dynamic mapping, in: Proc. Int. Symp. on High Performance Computer Architecture, 2007, pp. 216–227.
[3] M. Doggett, AMD's Radeon HD 2900, Hot 3D session, in: Graphics Hardware, 2007.
[4] R. Fernando, M.J. Kilgard, The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics, Addison-Wesley, 2003.
[5] W.W.L. Fung, I. Sham, G. Yuan, T.M. Aamodt, Dynamic warp formation and scheduling for efficient GPU control flow, in: Int. Symposium on Microarchitecture, 2007.

[6] D. Geer, Taking the graphics processor beyond graphics, IEEE Computer 38 (9) (2005) 14–16.
[7] C. Kozyrakis, D. Patterson, Overcoming the limitations of conventional vector processors, in: Proc. Int. Symposium on Computer Architecture, 2003.
[8] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, K. Asanovic, The vector-thread architecture, in: Proc. Int. Symposium on Computer Architecture, 2004.
[9] H.T. Kung, Why systolic architectures? Computer Magazine 15 (1) (1982).
[10] J. Laudon, A. Gupta, M. Horowitz, Interleaving: a multithreading technique targeting multiprocessors and workstations, ACM SIGOPS Operating Systems Review 28 (5) (1994) 308–318.
[11] E. Lindholm, M.J. Kilgard, H.P. Moreton, A user-programmable vertex engine, in: SIGGRAPH, 2001, pp. 149–158.
[12] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, NVIDIA Tesla: a unified graphics and computing architecture, IEEE Micro 28 (2) (2008) 39–55.
[13] J. Montrym, H. Moreton, The GeForce 6800, IEEE Micro 25 (2) (2005) 41–51.
[14] NVIDIA Corporation, Technical Brief: NVIDIA GeForce 8800 GPU Architecture Overview, November 2006.
[15] NVIDIA Corporation, NVIDIA CUDA Programming Guide, 1.0 ed., 2007.
[16] M. Shebanow, ECE 498 AL: programming massively parallel processors, lecture 12. http://courses.ece.uiuc.edu/ece498/al1/rchive/Spring2007 (February 2007).
[17] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor, in: International Symposium on Computer Architecture, May 1996.
[18] D.M. Tullsen, S.J. Eggers, H.M. Levy, Simultaneous multithreading: maximizing on-chip parallelism, in: Proc. Int. Symposium on Computer Architecture, 1995.
[19] E. Tune, R. Kumar, D. Tullsen, B. Calder, Balanced multithreading: increasing throughput via a low cost multithreading hierarchy, in: International Symposium on Microarchitecture, December 2004.
[20] C. Yu, K. Chung, D. Kim, L. Kim, An energy-efficient mobile vertex processor with multithread expanded VLIW architecture and vertex caches, IEEE Journal of Solid-State Circuits 42 (10) (2007).


Gi-Ho Park received the B.S., M.S., and Ph.D. degrees in Computer Science from Yonsei University, Seoul, Korea, in 1993, 1995, and 2000, respectively. He is currently an Assistant Professor in the Department of Computer Science and Engineering at Sejong University, Korea. Before joining Sejong University, he worked for Samsung Electronics as a Senior Engineer in the Processor Architecture Lab., System LSI Division, during 2002–2008. His research interests include advanced computer architectures, memory system design, System on Chip (SoC) design, and embedded system design.


Shin-Dug Kim received the B.S. degree in Electronic Engineering from Yonsei University, Seoul, Korea, in 1982, and the M.S. degree in Electrical & Computer Engineering from the University of Oklahoma in 1987. In 1991, he received the Ph.D. degree from the School of Electrical & Computer Engineering at Purdue University, West Lafayette, Indiana. He is currently a professor of computer science at Yonsei University, Seoul, Korea. His research interests include advanced computer architectures, media processing systems, intelligent memory system design, and context-aware computing. He is a member of the IEEE Computer Society.

Jung-Wook Park received the B.S., M.S., and Ph.D. degrees in Computer Science from Yonsei University, Seoul, Korea, in 2003, 2005, and 2010, respectively. His current research concerns memory hierarchy optimization for various computer systems and software/hardware co-design for embedded parallel systems.

Hoon-Mo Yang received the B.S. and M.S. degrees in Computer Science from Yonsei University, Seoul, Korea, in 2003 and 2005. He is currently a Ph.D. candidate in computer science at Yonsei University. His research interests include performance analysis and memory subsystem optimization for embedded systems.

Charles C. Weems earned the B.S. (honors) in 1977 and the M.A. in 1979 from Oregon State University, and the Ph.D. in 1984 from the University of Massachusetts at Amherst. He is a co-director of the Architecture and Language Implementation research group at the University of Massachusetts, where he is an Associate Professor. His current research interests include advanced architectures and a system-modeling framework based on specification languages.