A pixel pipeline architecture with selective z-test scheme for 3D graphics processors


Microprocessors and Microsystems 37 (2013) 373–380

Contents lists available at SciVerse ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

Jinhong Park a, Il-San Kim a, Woo-Chan Park b, Yong-Jin Park a, Tack-Don Han a,⇑

a Department of Computer Science, Yonsei University, 134 Shinchon-Dong, Sudaemoon-Ku, Seoul 120-749, Republic of Korea
b Department of Internet Engineering, Sejong University, 98 Kunja-Dong, Kwangjin-Ku, Seoul 143-747, Republic of Korea

Article info

Article history: Available online 23 June 2012

Keywords: 3D graphics; Graphics hardware; Rendering hardware; Pixel cache

Abstract

We propose a pixel pipeline architecture with a selective z-test scheme that reduces the data processed in the pixel pipeline by means of preprocessing. Reducing this data lowers the data transmission between the 3D graphics processor and the memory and also the power consumed by memory accesses, which is critical for mobile devices. In a 3D graphics processor, most memory transmissions occur in the rasterization stage, especially in the pixel pipelines. To reduce memory transmission, the proposed architecture exploits the coherency among pixel fragments to predict the visibility of each fragment. In this way, it eliminates invisible fragments before texture mapping using a single z-test, whereas the mid-texturing architecture requires two z-tests. According to the simulations, the proposed architecture reduces data transmission by 19.9–22.6% compared with the mid-texturing architecture at the expense of a 5% reduction in performance. It also reduces the cell area of the depth cache by 26.4% and the area of the overall architecture by 6% compared with the mid-texturing architecture.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The use of mobile application processors is increasing with advances in semiconductor technology. Recently, the demand for dedicated processors [1,2] for 3D graphics acceleration has increased; to achieve higher throughput, most of these 3D graphics processors use pipelining to process pixel operations [3]. These pipeline architectures can be classified as pre-texturing, mid-texturing, and post-texturing architectures [4,5], depending on the location of the z-test (depth-test) unit, which tests the visibility of fragments using the z-buffer (depth buffer) [6]. Among these, the mid-texturing architecture performs approximately 30% better than the other two by using two z-test units, one before and one after texture mapping, to eliminate invisible fragments, and by employing a prefetch scheme to reduce cache misses. However, the mid-texturing architecture transmits more depth information because of the extra z-test. In addition, to perform two z-reads and one z-write simultaneously, the depth cache requires a 3-port SRAM, one port more than the other two architectures require, which increases the cell area of the depth cache.

In this paper, we propose a selective z-test architecture that improves on the hardware and bandwidth requirements of the mid-texturing architecture. By exploiting the high probability that visibility is continuous across consecutive fragments, the proposed architecture predicts the visibility of the fragments and thereby performs only a single z-test; hence, both the data transmission and the number of ports required in the SRAM of the depth cache are reduced. Compared with the mid-texturing architecture, the simulation results show that the proposed architecture provides 19.9%, 20.9%, 22.6%, and 24.6% reductions in the average bandwidth requirement for texture and z-data transmission per frame in Quake3, UT2004, Half-Life, and Light-Scape, respectively, at the expense of a 5% reduction in performance. Further, by reducing the number of ports in the SRAM, the cell area of the depth cache in the proposed architecture is reduced by 26.4% compared with that of the mid-texturing architecture. The area of the overall architecture is reduced by 6%.

This paper is organized as follows. In Section 2, we provide a brief overview of the three pixel pipeline architectures mentioned above and of visibility coherence. In Section 3, the proposed architecture is described in detail. Simulation results and performance evaluations are shown in Section 4. The conclusions are presented in Section 5.

⇑ Corresponding author. E-mail address: [email protected] (T.-D. Han).
0141-9331/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.micpro.2012.06.001

2. Related work

2.1. The pre-texturing and post-texturing architectures

According to the order of the z-test and texture mapping, 3D graphics rasterization architectures are categorized into three types: pre-texturing, post-texturing, and mid-texturing. The pre-texturing architecture is a standard pixel pipeline architecture for 3D graphics processors; it complies with the procedures in the OpenGL API (Application Programming Interface) [7], an industry standard for 2D and 3D graphics. In the pre-texturing architecture, the z-test is performed after the texture mapping procedures, and for this reason, unnecessary texture mapping is performed on fragments that are hidden by other fragments.

In the post-texturing architecture, the z-test is performed before texture mapping so that invisible fragments are not textured. However, the disadvantage of the post-texturing architecture is that the many pipeline stages between the z-read and z-write stages cause consistency problems [5]. If two fragments with the same address exist between the z-read and z-write stages, the z-write of the first fragment must be completed before the z-read of the second fragment. Although such overlapping is very rare, the additional hardware required to detect it and the resulting losses in performance are unavoidable in the post-texturing architecture [5].

Table 1. Ratio of continuous visibility in spans.

| | Quake3 | Half-Life | UT2004 | Light-Scape |
|---|---|---|---|---|
| Number of fragments | 45,970,174 | 53,178,676 | 105,511,051 | 38,944,325 |
| Ratio of continuously visible fragments | 89.24% | 75.54% | 81.59% | 55.81% |
| Ratio of continuously invisible fragments | 10.43% | 24.15% | 17.58% | 41.30% |
| Total ratio of continuous visibility | 99.67% | 99.69% | 99.17% | 97.11% |

2.2. The mid-texturing architecture

The mid-texturing architecture proposed in [5] is shown in Fig. 1. It comprises two z-test (visibility test) units, placed before and after the texture unit. The 1st z-test unit eliminates invisible fragments to avoid unnecessary texture operations, while the 2nd z-test unit maintains the consistency of the visible fragments. Thus, the mid-texturing architecture retains the advantages of both the pre-texturing and post-texturing architectures while removing the burden of an overlap detector to maintain consistency. Also, when a cache miss occurs in the 1st z-read, a prefetch operation is performed, which allows the pipeline to proceed without stalling; this reduces the penalty of cache misses.

On the other hand, the mid-texturing architecture has two drawbacks compared with both the pre-texturing and post-texturing architectures. First, the bandwidth requirement for the depth cache is increased because the number of z-reads for the visible fragments is doubled. Second, the depth cache in the mid-texturing architecture requires a 3-port SRAM to enable simultaneous access for two z-reads and one z-write, which leads to an increase in cell area compared with the other two architectures, which use 2-port SRAM.

2.3. Visibility coherence

In the pixel calculations for polygon-based 3D graphics, the vertex information received as input is used to generate spans through triangle-setup and edge-walk processing, and these spans are interpolated to create fragments. The fragments contain depth and color values, which are stored separately in the depth buffer and the color buffer after the texture mapping and visibility tests are performed. The fragments within the same span tend to have identical visibility because they lie at consecutive horizontal coordinates. Such spatial coherency was noted in earlier work [8,9]; the latter, the hierarchical z-buffer, was used in HyperZ [10] by ATI.

Table 1 shows the ratio of continuous visibility in spans measured using four benchmarks: Quake3 [11], Half-Life [12], Unreal Tournament 2004 (UT2004) [13], and Light-Scape [20]. These are three commercial games and an OpenGL benchmark, all of which use the OpenGL API for 3D graphics processing. For all benchmarks, 50 frames are used to examine the visibility of the spans. The results in Table 1 show that the ratios of continuously visible fragments in Quake3, Half-Life, UT2004, and Light-Scape are 89.24%, 75.54%, 81.59%, and 55.81%, and the ratios of continuously invisible fragments are 10.43%, 24.15%, 17.58%, and 41.30%, respectively. The total ratios of continuous visibility are greater than 97% in all the benchmarks; thus, there is a high probability that the visibility of the fragments will be continuous. This is attributable to the fact that applications such as 3D games optimize the objects they draw by scene management techniques [14] such as occlusion culling [15,16,20]. As observed in Table 1, the spatial coherence of visibility in Quake3, Half-Life, UT2004, and Light-Scape is 99.67%, 99.69%, 99.17%, and 97.11%, respectively.

3. The proposed selective z-test architecture

This section presents the structure and operational flows of the proposed architecture. Section 3.1 presents the structure and characteristics of the proposed architecture, and Section 3.2 presents the operational flows.

3.1. Proposed architecture and characteristics

Fig. 1. The mid-texturing architecture.

The proposed architecture (shown in Fig. 2) consists of a pixel pipeline with two variable z-test units, a z-tag register for pipeline control, and a depth cache that selects the z-read requests. Similar to the mid-texturing architecture, the proposed architecture has two z-test units, but to reduce the number of z-reads, only one z-test unit is activated at a time by using the spatial coherence of visibility described in Section 2.3. To enable this, the result of the previously performed z-test is stored in the z-tag register and is consulted to decide which z-test unit to activate. If the z-tag register is set to 0, the incoming fragments are presumed to be invisible and the 1st z-test unit is activated to test them. Conversely, if the z-tag register is set to 1, the incoming fragments are presumed to be visible and the 2nd z-test unit is activated to test them; in this case, the 1st z-read requests the depth cache to pre-fetch the z-data (the same procedure as in [5]) and the 1st z-test unit bypasses these fragments. The details of the pre-fetch procedure are discussed in Section 3.2.2.

Fig. 2. Proposed selective depth test architecture.

The proposed architecture improves on the two disadvantages of the mid-texturing architecture mentioned in Section 2.2. First, the control scheme of the z-test units is modified to reduce the number of z-read requests. When visible fragments enter the pixel pipeline, the proposed architecture tests their visibility only in the 2nd z-test unit, as opposed to the mid-texturing architecture, in which the visibility is tested in both the 1st and 2nd z-test units. Hence, the number of z-read requests is reduced in the proposed architecture; the simulation results for this z-test scheme are discussed in Section 4.2.

Second, the cell area of the depth cache is reduced because the depth cache needs fewer ports. The depth cache in the mid-texturing architecture requires 3-port SRAM to process two z-reads and one z-write simultaneously. In the proposed depth cache, the data-ram uses 2-port SRAM for a simultaneous z-read and z-write, and the tag-ram uses 3-port SRAM for a simultaneous z-read, z-write, and pre-fetch process.
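The z-tag control described above can be illustrated with a small functional model. This is only a sketch under simplifying assumptions: it ignores cache misses and pipeline timing, it takes each fragment's true visibility as given, and it assumes the mid-texturing architecture issues its 2nd z-read only for fragments that passed the 1st z-test. The names `Counters`, `selective_z_test`, and `mid_texturing_z_reads` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    z_reads: int = 0          # depth reads issued by whichever z-test unit is active
    textured: int = 0         # fragments that pass through the texture unit
    switches_1_to_0: int = 0  # times the z-tag register changes from 1 to 0


def selective_z_test(visible_flags, initial_tag=0):
    """Functional (not cycle-accurate) model of the z-tag controlled pipeline."""
    tag = initial_tag
    c = Counters()
    for visible in visible_flags:
        c.z_reads += 1  # only one z-test unit is active per fragment
        if tag == 0:
            # 1st z-test unit active: failing fragments are dropped before texturing.
            if visible:
                c.textured += 1
                tag = 1  # previous result was "visible": predict visible next
        else:
            # 2nd z-test unit active: the fragment is textured first, then tested.
            c.textured += 1
            if not visible:
                c.switches_1_to_0 += 1
                tag = 0  # previous result was "invisible": predict invisible next
    return c


def mid_texturing_z_reads(visible_flags):
    """Assumed mid-texturing count: one z-read per fragment plus one per visible fragment."""
    flags = list(visible_flags)
    return len(flags) + sum(flags)


if __name__ == "__main__":
    span = [False, False, True, True, True, False, False]
    print(selective_z_test(span))
    print("mid-texturing z-reads:", mid_texturing_z_reads(span))
```

In this toy span the selective scheme issues one z-read per fragment, while the assumed mid-texturing count adds one extra read for each visible fragment; the only wasted texturing happens on the fragment that arrives just after visibility changes from visible to invisible.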
As the total cell area of the depth cache depends largely on the data-ram, the proposed depth cache has a substantially smaller cell area than the depth cache required in the mid-texturing architecture. The comparison results are discussed in Section 4.4.

3.2. Operation flows

3D graphics scenes have a high ratio of continuous visibility within the same span. The proposed selective depth-test architecture exploits this by using the visibility of the previous fragments to decide which z-test unit accesses the depth cache, so that the two z-test units share the ports of the depth cache SRAM. The operational flows of the proposed architecture can be categorized into three cases according to the value in the z-tag register. First, when processing invisible fragments, the z-tag register value is maintained at 0 and the visibility is tested in the 1st z-test unit. Second, when processing visible fragments, the z-tag register value is maintained at 1 and the visibility is tested in the 2nd z-test unit. Last, when the visibility changes between consecutive fragments, the z-tag register updates its value so that the visibility is tested in the appropriate z-test unit.

3.2.1. Processing invisible fragments: z-tag register is 0

Fig. 3 shows the processing flow when the z-tag register value is 0. In this case, the proposed architecture expects the input fragments to be invisible and activates the 1st z-test unit. When a fragment is input, the 1st z-read requests from the depth cache the z-data corresponding to the fragment's coordinate. The depth cache returns the z-data to the pipeline through its hit/miss handling procedures, and the returned z-data is compared with the fragment's z-value in the 1st z-test. If the fragment fails the 1st z-test, it is not visible. This invisible fragment is eliminated from the pipeline, and therefore the texture unit and the 2nd z-test unit are not operated. The z-tag register value remains at 0 for the subsequent invisible fragments.

3.2.2. Processing visible fragments: z-tag register is 1

Fig. 4 shows the pipeline process when the z-tag register value is 1, in which case the proposed architecture expects that visible fragments will be processed. When the fragment information enters the pipeline, the 1st z-read requests the depth cache to pre-fetch the z-data corresponding to the fragment's coordinate, and this requested z-data is used in the 2nd z-test unit.
For the pre-fetch process, the depth cache performs a tag comparison to determine whether the requested z-data is in the depth cache. If the tag comparison fails, a cache block that includes the requested z-data is fetched from the z-buffer. Thus, the tag-ram in the proposed architecture requires a 3-port SRAM to execute the tag comparison for the pre-fetch process along with the 2nd z-read and z-write processes. After the pre-fetch request, the 1st z-test unit bypasses this fragment and the texture operations are processed. The 2nd z-read re-requests the z-data of the fragment, and the 2nd z-test checks the fragment's visibility when the pre-fetched z-data is ready. If the fragment passes the 2nd z-test, its z-value is written to the depth cache and the z-tag register value is maintained at 1 for the subsequent visible fragments.

Fig. 3. The operational flow when the z-tag register is 0.

Fig. 4. The operational flow when the z-tag register is 1.

3.2.3. Changes in the visibility of the fragments: update z-tag register

There are two cases in which the visibility changes between consecutive fragments. In the first case, a fragment passes the 1st z-test described in Section 3.2.1, which happens when a visible fragment enters the pipeline after an invisible fragment. In this case, the z-tag register changes its value from 0 to 1 with no penalty, because the visibility of the following fragments will be tested in the 2nd z-test unit. Thereafter, the fragments are processed as described in Section 3.2.2.

In the second case, a fragment fails the 2nd z-test described in Section 3.2.2, which happens when an invisible fragment enters after a visible fragment. Unlike the change from 0 to 1, the z-tag register cannot change its value from 1 to 0 immediately because the visibility of the in-flight fragments between the two z-test units has not yet been tested. If the pipeline continued without stalls while the z-tag register value changed from 1 to 0, it would be impossible to decide whether these fragments are visible or not. Therefore, the pipeline stages ahead of the 1st z-test unit must be stalled until the in-flight fragments between the two z-test units have been completely processed through the 2nd z-test unit. Once these fragments are processed in the 2nd z-test unit, the z-tag register changes its value from 1 to 0 and the pipeline continues to process the stalled fragments. The penalty of the pipeline stalls is proportional to the number of pipeline stages between the two z-test units, and the related performance results are discussed in Section 4.3.

4. Simulation results

In this section, we present simulation results obtained with four benchmarks to evaluate the performance of the proposed architecture in comparison with the pre-texturing and mid-texturing architectures. Section 4.1 presents the simulation environments and benchmarks. A comparison of the bandwidth requirements is presented in Section 4.2. Section 4.3 presents a performance comparison in terms of the average rendering cycles per frame, and Section 4.4 presents the cell area comparison of the depth caches.

4.1. Simulation environments

To analyze the performance of the proposed architecture, a trace-driven simulator featuring the proposed architecture was built. The benchmarks for the simulations are the same as those used in Section 2.3, and the Mesa3D API [17], an OpenGL-compatible API, is used for generating traces. For all the benchmarks, the addresses of the texture and z-data transferred to and from memory are traced for 50 frames at a resolution of 640 × 480. Fig. 5 shows the captured images from the benchmarks used in the simulations. Three of the benchmarks are commercial 3D games developed for the efficient processing of 3D graphics procedures, and among these, Quake3 has been frequently used as a benchmark in other related studies [5,6]. With the traces, texture and depth cache simulations are performed using the well-known Dinero IV cache simulator [18].

4.2. Average bandwidth requirements

To calculate the average bandwidth requirements, we measure the transmitted data in the pre-texturing, mid-texturing, and proposed architectures. The transmitted data in these architectures can largely be divided into texture, depth, and color data. Among these, the bandwidth requirement for the color data is identical in the three architectures; thus, only the bandwidth requirements for the texture and z-data are compared in the simulation. Each texture and z-data element is 2 bytes in size, and bilinear filtering is used for texture filtering in the benchmarks.

Table 2 shows the average bandwidth requirements per frame in the three architectures. In Table 2, "Texture" represents the bandwidth requirement for the texture data, "Depth" for the z-data, and "Total" for the sum of the texture and z-data. When a fragment fails the z-test, no texture bandwidth is required for it because the fragment is discarded.

In the "Depth" result, the pre-texturing and proposed architectures have the same bandwidth requirements, and the mid-texturing architecture requires more bandwidth than the other two. This is because the mid-texturing architecture performs two z-tests per fragment (doubling the z-reads for visible fragments, as noted in Section 2.2), while the pre-texturing and proposed architectures perform a single z-test per fragment. The mid-texturing architecture transmits 52.7% more z-data than the others in Quake3, 54.2% more in UT2004, 57.0% more in Half-Life, and 63.2% more in Light-Scape.

In the "Texture" result, the proposed architecture has the lowest requirement because invisible fragments are eliminated through the 1st z-test before texture mapping.
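The percentages quoted for Table 2 can be rechecked directly from the table entries. The following is a minimal sketch: the dictionary holds the (depth, texture) values from Table 2 in Mbytes, while the names `TABLE2`, `percent_change`, `depth_overhead_of_mid`, and `total_reduction_vs_mid` are hypothetical helpers introduced only for this illustration.

```python
# Average bandwidth per frame in Mbytes, copied from Table 2 as (depth, texture).
TABLE2 = {
    "Quake3":      {"pre": (3.32, 7.01),  "mid": (5.07, 6.90),   "proposed": (3.32, 6.27)},
    "UT2004":      {"pre": (7.36, 16.18), "mid": (11.35, 14.65), "proposed": (7.36, 13.20)},
    "Half-Life":   {"pre": (3.56, 8.11),  "mid": (5.59, 6.95),   "proposed": (3.56, 6.14)},
    "Light-Scape": {"pre": (2.34, 5.94),  "mid": (3.82, 3.79),   "proposed": (2.34, 3.40)},
}


def percent_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


def depth_overhead_of_mid(bench):
    """Extra z-data (%) moved by mid-texturing relative to the proposed scheme."""
    row = TABLE2[bench]
    return percent_change(row["mid"][0], row["proposed"][0])


def total_reduction_vs_mid(bench):
    """Reduction (%) in depth + texture traffic of the proposed scheme vs. mid-texturing."""
    row = TABLE2[bench]
    return -percent_change(sum(row["proposed"]), sum(row["mid"]))


if __name__ == "__main__":
    for bench in TABLE2:
        print(f"{bench:12s} depth overhead {depth_overhead_of_mid(bench):5.1f}%  "
              f"total reduction {total_reduction_vs_mid(bench):5.1f}%")
```

Because the table values are rounded to two decimals, the recomputed percentages agree with the quoted ones only to about one decimal place.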
Fig. 5. Four benchmarks used in the simulations.

Table 2. The average bandwidth requirements according to the benchmarks (unit: Mbytes).

| Benchmark | Pre-texturing Total | Pre-texturing Depth | Pre-texturing Texture | Mid-texturing Total | Mid-texturing Depth | Mid-texturing Texture | Proposed Total | Proposed Depth | Proposed Texture | Failure rate of the z-test (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Quake3 | 10.33 | 3.32 | 7.01 | 11.97 | 5.07 | 6.90 | 9.59 | 3.32 | 6.27 | 10.6 |
| UT2004 | 23.54 | 7.36 | 16.18 | 26.00 | 11.35 | 14.65 | 20.56 | 7.36 | 13.20 | 18.0 |
| Half-Life | 11.67 | 3.56 | 8.11 | 12.54 | 5.59 | 6.95 | 9.70 | 3.56 | 6.14 | 24.3 |
| Light-Scape | 8.28 | 2.34 | 5.94 | 7.61 | 3.82 | 3.79 | 5.74 | 2.34 | 3.40 | 42.7 |

Although the mid-texturing architecture also eliminates invisible fragments through the 1st z-test, some of the invisible fragments are not eliminated when the 1st z-read requests miss in the depth cache. Thus, the texture data requirement of the mid-texturing architecture is higher than that of the proposed architecture, and the pre-texturing architecture has the highest texture transmission requirement. Compared with the mid-texturing architecture, the proposed architecture shows approximately 9.1%, 9.9%, 11.7%, and 10.3% reductions in texture requirements in Quake3, UT2004, Half-Life, and Light-Scape, respectively, and approximately 10.6%, 18.4%, 24.3%, and 42.8% reductions compared with the pre-texturing architecture.

In the "Total" result, which includes depth and texture data, the proposed architecture has the lowest bandwidth requirement. It is followed by the pre-texturing and then the mid-texturing architecture in all benchmarks except Light-Scape, where the mid-texturing total (7.61 Mbytes) is below the pre-texturing total (8.28 Mbytes). The differences in bandwidth requirements between the proposed architecture and the other architectures tend to be proportional to the failure rates of the z-test in Table 2. This means that as the failure rate of the z-test increases, the bandwidth requirement of the proposed architecture is reduced more than that of the other architectures. In Quake3, which has the lowest failure rate in the z-test, the bandwidth requirements of the proposed architecture are reduced by 7.2% and 19.9% compared with the pre-texturing and mid-texturing architectures, respectively. The reduction rates increase to 12.7% and 20.9% in UT2004, in which the failure rate is higher. In Half-Life, the bandwidth requirements are reduced by 16.9% and 22.6% compared with the two architectures, respectively. In Light-Scape, which shows the highest failure rate, the bandwidth requirements are reduced by 30.7% and 24.6% compared with the two architectures, respectively.

4.3. Throughput evaluation

To evaluate the performance of the proposed architecture, we calculate the average rendering cycles per frame (ARCFs) for the pre-texturing, mid-texturing, and proposed architectures. For calculating the ARCF, we assume that all the fragment procedures are pipelined and that the simulation environments other than the cache systems are identical. Then, the basic equation for the ARCF is as follows.

$$
\mathrm{ARCF} = (N_{\mathrm{Fragments}} + T_{\mathrm{Penalty}}) / N_{\mathrm{Frame}}, \tag{1}
$$

where $N_{\mathrm{Fragments}}$ denotes the total number of fragments, $T_{\mathrm{Penalty}}$ denotes the cache miss penalties, and $N_{\mathrm{Frame}}$ denotes the total number of frames. $\mathrm{ARCF} = N_{\mathrm{Fragments}} / N_{\mathrm{Frame}}$ when $T_{\mathrm{Penalty}} = 0$, and $T_{\mathrm{Penalty}}$ is incurred when misses occur in the texture, depth, and color caches. Among these misses, we assume that the color cache always hits because the miss ratio of the color cache is identical for all three architectures and does not affect the performance discrepancy. The requests to the depth cache are either reads or writes, but a cache miss occurs only for read requests because write requests always hit. We also assume that the memory access due to a cache miss takes 10 cycles, as in [5]. The parameters of the texture and depth caches are identical: direct-mapped, 16- and 32-Kbyte cache sizes, and 32- and 64-byte block sizes. Using (1), the ARCF of the pre-texturing architecture is determined as follows.

$$
\begin{aligned}
\mathrm{ARCF}_{\mathrm{Pre}} &= (N_{\mathrm{Fragments}} + T_{\mathrm{TexPenalty}} + T_{\mathrm{ZPenalty}}) / N_{\mathrm{Frame}} \\
&= [N_{\mathrm{Fragments}} + M_{\mathrm{Tex}} \times N_{\mathrm{Fragments}} \times P_{\mathrm{mem}} + M_{Z} \times N_{\mathrm{Fragments}} \times P_{\mathrm{mem}}] / N_{\mathrm{Frame}} \\
&= [N_{\mathrm{Fragments}} \times \{1 + (M_{\mathrm{Tex}} + M_{Z}) \times P_{\mathrm{mem}}\}] / N_{\mathrm{Frame}},
\end{aligned}
\tag{2}
$$

where $T_{\mathrm{TexPenalty}}$ denotes the cache miss penalty of the texture cache and $T_{\mathrm{ZPenalty}}$ denotes the cache miss penalty of the depth cache. $M_{\mathrm{Tex}}$ and $M_{Z}$ denote the miss rates of the texture and depth caches, respectively, and $P_{\mathrm{mem}}$ denotes the memory access cycles due to a cache miss. Next, the basic ARCF of the mid-texturing architecture with two z-tests can be computed as follows.

$$
\mathrm{ARCF}_{\mathrm{Mid}} = (N_{\mathrm{Fragments}} + T_{\mathrm{Z1Penalty}} + T_{\mathrm{TexPenalty}} + T_{\mathrm{Z2Penalty}}) / N_{\mathrm{Frame}}, \tag{3}
$$

where $T_{\mathrm{Z1Penalty}}$ and $T_{\mathrm{Z2Penalty}}$ denote the cache miss penalties due to the 1st and 2nd z-test units, respectively. In the mid-texturing architecture, $T_{\mathrm{Z1Penalty}}$ is 0 because the fragments that miss at the 1st z-read are pre-fetched and the pipeline proceeds without stalls. Thus, (3) can be revised as follows.

$$
\begin{aligned}
\mathrm{ARCF}_{\mathrm{Mid}} &= (N_{\mathrm{Fragments}} + T_{\mathrm{TexPenalty}} + T_{\mathrm{Z2Penalty}}) / N_{\mathrm{Frame}} \\
&= [N_{\mathrm{Fragments}} + M_{\mathrm{Tex}} \times (N_{\mathrm{Z1Visible}} + N_{\mathrm{Z1Missed}}) \times P_{\mathrm{mem}} + M_{Z2} \times (N_{\mathrm{Z1Visible}} + N_{\mathrm{Z1Missed}}) \times P_{\mathrm{mem}}] / N_{\mathrm{Frame}},
\end{aligned}
\tag{4}
$$

where $N_{\mathrm{Z1Visible}}$ (fragments that passed the 1st z-test) and $N_{\mathrm{Z1Missed}}$ (fragments that missed at the 1st z-read and were pre-fetched) are the numbers of fragments that must be texture mapped. $M_{Z2}$ denotes the cache miss rate of these fragments during the 2nd z-read, and its value is close to 0%, similar to the results in [5].

Lastly, the ARCF of the proposed architecture has three types of penalties. The first is the miss penalty of the depth cache caused by the 1st and 2nd z-reads. The second is the cache miss penalty of the texture cache, and the third is the penalty due to the pipeline stalls that occur when the z-tag value changes, as mentioned in Section 3.2.3. The basic ARCF of the proposed architecture according to these penalties is as follows.

$$
\mathrm{ARCF}_{\mathrm{Proposed}} = (N_{\mathrm{Fragments}} + T_{\mathrm{Z1Penalty}} + T_{\mathrm{TexPenalty}} + T_{\mathrm{Z2Penalty}} + T_{\mathrm{Switching}}) / N_{\mathrm{Frame}}, \tag{5}
$$

where $T_{\mathrm{Switching}}$ denotes the penalty caused by changing the value of the z-tag register. It is obtained by multiplying $N_{\mathrm{1to0}}$ and $N_{\mathrm{Separation}}$, which denote the number of times the z-tag register changes from 1 to 0 and the number of pipeline stages between the 1st and 2nd z-test units, respectively. Then, (5) can be revised as follows.

$$
\mathrm{ARCF}_{\mathrm{Proposed}} = [N_{\mathrm{Fragments}} + M_{Z1} \times N_{\mathrm{Invisible}} \times P_{\mathrm{mem}} + M_{\mathrm{Tex}} \times N_{\mathrm{Visible}} \times P_{\mathrm{mem}} + M_{Z2} \times N_{\mathrm{Visible}} \times P_{\mathrm{mem}} + N_{\mathrm{1to0}} \times N_{\mathrm{Separation}}] / N_{\mathrm{Frame}}. \tag{6}
$$
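Equations (2), (4), and (6) can be evaluated numerically with a few small helpers. The sketch below is a plain transcription of those formulas: the parameter names mirror the paper's symbols, `p_mem` defaults to the 10-cycle memory access assumed in the text, and the function names and the numbers in the usage example are made-up placeholders rather than values from the paper's simulations.

```python
def arcf_pre(n_fragments, n_frame, m_tex, m_z, p_mem=10):
    """Eq. (2): every fragment is exposed to texture and depth cache misses."""
    return n_fragments * (1 + (m_tex + m_z) * p_mem) / n_frame


def arcf_mid(n_fragments, n_frame, m_tex, m_z2, n_z1_visible, n_z1_missed, p_mem=10):
    """Eq. (4): only fragments that reach the texture unit pay miss penalties."""
    n_textured = n_z1_visible + n_z1_missed
    cycles = n_fragments + (m_tex + m_z2) * n_textured * p_mem
    return cycles / n_frame


def arcf_proposed(n_fragments, n_frame, m_z1, m_tex, m_z2,
                  n_invisible, n_visible, n_1to0, n_separation, p_mem=10):
    """Eq. (6): cache-miss penalties plus the z-tag switching penalty."""
    cycles = (n_fragments
              + m_z1 * n_invisible * p_mem
              + (m_tex + m_z2) * n_visible * p_mem
              + n_1to0 * n_separation)
    return cycles / n_frame


if __name__ == "__main__":
    # Placeholder workload: 1000 fragments over 10 frames, 800 of them visible.
    print(arcf_pre(1000, 10, m_tex=0.05, m_z=0.05))
    print(arcf_mid(1000, 10, m_tex=0.05, m_z2=0.0, n_z1_visible=800, n_z1_missed=0))
    print(arcf_proposed(1000, 10, m_z1=0.05, m_tex=0.05, m_z2=0.0,
                        n_invisible=200, n_visible=800, n_1to0=5, n_separation=10))
```

Setting all miss rates and `n_1to0` to zero reduces each function to `n_fragments / n_frame`, which matches the zero-penalty case of (1).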

In (6), the cache miss penalty at the 1st z-read is proportional to $N_{\mathrm{Invisible}}$, the number of invisible fragments, while the miss penalties of the texture cache and the 2nd z-read are proportional to $N_{\mathrm{Visible}}$, the number of visible fragments.

Fig. 6 shows the ARCFs of the three architectures when the separation lengths are 10, 20, and 30; here, a lower ARCF implies higher performance. In Fig. 6, "16K32B" represents the cache parameters of a 16-Kbyte cache size and a 32-byte block size, and "Separation10" indicates that the number of pipeline stages between the 1st and 2nd z-test units in the proposed architecture is 10. The simulation results show that the mid-texturing architecture has the lowest ARCF because it has the lowest miss rates of the texture and depth caches among the three architectures. The proposed architecture has the next lowest ARCF, and the pre-texturing architecture shows the highest ARCF.

The cache miss rates of the mid-texturing and proposed architectures differ only slightly; thus, the performance discrepancy between the two largely depends on the $T_{\mathrm{Switching}}$ penalty. Because the $T_{\mathrm{Switching}}$ penalty is proportional to the separation length, the proposed architecture performs better if the separation length is reduced. When the separation length is increased from 10 to 20, the average ARCF of the proposed architecture increases by approximately 3.5%, while an increase in the separation length from 10 to 30 increases the average ARCF by approximately 6.1%.

Another factor that affects the $T_{\mathrm{Switching}}$ penalty is $N_{\mathrm{1to0}}$, which is proportional to the failure rate of the z-test shown in Table 2. As the failure rate of the z-test increases, $N_{\mathrm{1to0}}$ also increases. In the results from Half-Life with "Separation10", the failure rate of the z-test is 24.3% and the performance discrepancy between $\mathrm{ARCF}_{\mathrm{Mid}}$ and $\mathrm{ARCF}_{\mathrm{Proposed}}$ is 21.8%. However, in Quake3, in which the failure rate is 10.6%, the performance discrepancy between $\mathrm{ARCF}_{\mathrm{Mid}}$ and $\mathrm{ARCF}_{\mathrm{Proposed}}$ drops to 4.8%. Therefore, the performance of the proposed architecture is close to that of the mid-texturing architecture if scene management gives the application a low failure rate in the z-test.

When z-test passes occur in succession, the visible region is continuous. In this case, the z-tag register does not change, so no switching penalty is incurred and performance improves. When the failure ratio of the z-test is very low, the access to texture data increases and the performance becomes similar to that of the pre-texturing architecture. In addition, the bandwidth reduction achieved by this work is limited when there is no texture operation and the success ratio of the depth test is high.

4.4. Area comparison

The memory in the depth cache can be divided into the tag-ram, which stores the tag data for hit/miss comparison, and the data-ram, which stores the z-data (depth data) of the fragments. To reduce latency, the tag-ram and the data-ram consist of SRAM, and the cell area of the cache increases with the number of ports used by the SRAM. According to [19], each additional port in the SRAM increases the cell area by 60%, and the equation for calculating the ratio of cell area is as follows.

$$
\text{Ratio of Cell Area} = 1 + 0.6 \times (N - 1), \tag{7}
$$

where N denotes the number of ports used in the SRAM. With (7), the ratio of cell area with 1-port SRAM (N = 1) is 1, and the ratio of cell area with 2-port SRAM (N = 2) is 1.6. Based on (7), the cell areas of the depth caches in the three architectures are listed in Table 3. The depth caches of the three architectures have the same parameters: a cache size of 32 Kbytes and a block size of 64 bytes. With regard to the results in Table 3, the cell area of the tag-ram has little effect on the total cell area because its size is considerably smaller than that of the data-ram. When comparing the total cell area, the mid-texturing architecture has the largest area, and the difference between the areas of the pre-texturing and the proposed architectures is small. As the depth cache in the proposed architecture uses a 2-port SRAM for its data-ram, which occupies most of the cell area, the total cell area is increased by only 1.2% as compared to that of the pre-texturing architecture, and is 26.4% less than that of the mid-texturing architecture.

Fig. 6. ARCFs of the pre-texturing, mid-texturing, and proposed architectures.

Table 3. The cell area results of the depth caches.

| Architecture | RAM | Number of ports | Ratio of cell area | Total bytes | Cell area | Total cell area (per architecture) |
|---|---|---|---|---|---|---|
| Pre-texturing | Tag-ram | 2 | 1.6 | 1,088 | 13,926.4 | 433,356.8 |
| Pre-texturing | Data-ram | 2 | 1.6 | 32,768 | 419,430.4 | |
| Mid-texturing | Tag-ram | 3 | 2.2 | 1,088 | 19,148.8 | 595,865.6 |
| Mid-texturing | Data-ram | 3 | 2.2 | 32,768 | 576,716.8 | |
| Proposed | Tag-ram | 3 | 2.2 | 1,088 | 19,148.8 | 438,579.2 |
| Proposed | Data-ram | 2 | 1.6 | 32,768 | 419,430.4 | |

For the comparison of the overall hardware area, we implemented 3D graphics rasterization hardware consisting of a triangle setup unit, an edge processing unit, a span processing unit, a texture mapping unit (supporting up to tri-linear texture mapping), a depth test unit, a stencil test unit, a fog unit, an alpha test unit, and a color blending unit. It is also equipped with three 16 KB SRAM caches: a texture cache, a depth cache, and a color cache. The implemented hardware was designed in Verilog HDL and synthesized with Synopsys Design Vision. The hardware including the proposed selective z-test architecture consists of 1.33 million gates, while the mid-texturing architecture requires 1.42 million gates; therefore, the proposed architecture achieves a 6.3% reduction in total area compared with the mid-texturing architecture.
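The area figures in this section can be checked with a few lines of arithmetic. The following Python sketch applies the port-count ratio of (7) to the byte counts of Table 3 and recomputes the quoted percentages. It assumes that each cell area entry equals the number of bits (bytes × 8) multiplied by the ratio of cell area; the paper does not state this formula explicitly, but it matches the tabulated values. The function and variable names are illustrative only.

```python
def cell_area_ratio(num_ports: int) -> float:
    """Relative SRAM cell area for an N-port cell, following Eq. (7)."""
    return 1 + 0.6 * (num_ports - 1)


def cell_area(total_bytes: int, num_ports: int) -> float:
    # Assumption: area = number of bits * ratio of cell area.
    # This is inferred from the Table 3 values, not stated in the paper.
    return total_bytes * 8 * cell_area_ratio(num_ports)


# Byte counts from Table 3 (identical for all three architectures).
TAG_BYTES = 1088
DATA_BYTES = 32768

# (tag-ram ports, data-ram ports) per architecture, from Table 3.
PORTS = {
    "pre-texturing": (2, 2),
    "mid-texturing": (3, 3),
    "proposed": (3, 2),
}


def total_cell_area(arch: str) -> float:
    tag_ports, data_ports = PORTS[arch]
    return cell_area(TAG_BYTES, tag_ports) + cell_area(DATA_BYTES, data_ports)


totals = {arch: total_cell_area(arch) for arch in PORTS}

# Relative differences quoted in the text.
increase_vs_pre = (totals["proposed"] / totals["pre-texturing"] - 1) * 100
saving_vs_mid = (1 - totals["proposed"] / totals["mid-texturing"]) * 100

# Gate counts (millions of gates) reported for the synthesized hardware.
gate_saving = (1 - 1.33 / 1.42) * 100

for arch, area in totals.items():
    print(f"{arch}: total cell area {area:.1f}")
print(f"proposed vs. pre-texturing: +{increase_vs_pre:.1f}%")
print(f"proposed vs. mid-texturing: -{saving_vs_mid:.1f}%")
print(f"gate count reduction: {gate_saving:.1f}%")
```

Under the stated assumption, the sketch reproduces the totals of Table 3 as well as the 1.2%, 26.4%, and 6.3% figures given in the text.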

5. Conclusion

In this paper, we proposed a selective z-test architecture with variable z-test units. The proposed architecture improves on the mid-texturing architecture by performing only a single z-test at a time, exploiting the coherence among fragments. With this improvement, the proposed architecture reduces average bandwidth requirements by 19.9%, 20.9%, 22.6%, and 24.6% in Quake3, UT2004, Half-Life, and Light-Scape, respectively, at the expense of a 5% reduction in performance. Further, the area of the overall architecture is reduced by 6%. These reductions in memory bandwidth and area can lead to lower power consumption.

References

[1] Advanced Micro Devices Inc., "ATI Delivers First 3D Gaming Chip For Cellphones", 2007.
[2] nVidia Corporation, "Handheld Graphics Processing Units (GPUs) for Advanced Handheld Devices", 2007.
[3] J. McCormack, R. McNamara, C. Gianos, L. Seiler, N.P. Jouppi, K. Correl, T. Dutton, J. Zurawski, "Neon: a (big) (fast) single-chip 3D workstation graphics accelerator", Research Report 98/1 (WRL-98-1), Western Research Laboratory, Compaq Corporation, August 1998 (revised July 1999).
[4] Moon-Hee Choi, Woo-Chan Park, F. Neelamkavil, Tack-Don Han, Shin-Dug Kim, An effective visibility culling method based on cache block, IEEE Transactions on Computers 55 (8) (2006) 1024–1032.
[5] Woo-Chan Park, Kil-Whan Lee, Il-San Kim, Tack-Don Han, Sung-Bong Yang, An effective pixel rasterization pipeline architecture for 3D rendering processors, IEEE Transactions on Computers 52 (11) (2003) 1501–1508.
[6] K.S. Booth, D.R. Forsey, A.W. Paeth, Hardware assistance for Z-buffer visible surface algorithms, IEEE Computer Graphics and Applications 6 (11) (1986) 31–39.
[7] http://www.opengl.org, 2007.
[8] M. Kaplan, The use of spatial coherence in ray tracing, in: D. Rogers, R.A. Earnshaw (Eds.), Techniques for Computer Graphics, etc., Springer-Verlag, New York, 1987.
[9] N. Greene, M. Kass, G. Miller, Hierarchical Z-buffer visibility, in: Proceedings of SIGGRAPH '93, July 1993, pp. 231–238.
[10] Advanced Micro Devices Inc., "High Definition Gaming White Paper", 2007.
[11] http://www.idsoftware.com/games/quake/quake3-arena/, 2007.
[12] http://www.valvesoftware.com/games.html, 2007.
[13] http://www.epicgames.com/index_2k4.html, 2007.
[14] L. Bishop, D. Eberly, T. Whitted, M. Finch, M. Shantz, Designing a PC game engine, IEEE Computer Graphics and Applications 18 (1) (1998) 46–53.
[15] D. Bartz, M. Meißner, T. Hüttner, Extending graphics hardware for occlusion queries in OpenGL, in: Proceedings of Eurographics/Siggraph Workshop on Graphics Hardware, Lisbon, 1998.
[16] N. Greene, "Efficient occlusion culling for z-buffer systems", in: Proceedings of Computer Graphics International, June 1999, p. 78.
[17] http://www.mesa3d.org/, 2007.
[18] Jan Edler, Mark D. Hill, 2007.
[19] H.J. Mattausch, "Hierarchical n-port memory architecture based on 1-port memory cells", in: Proceedings of the 27th European Solid-State Device Research Conference, September 1997, pp. 348–351.
[20] http://www.spec.org/gwpg/gpc.static/viewperf71info.html, 2003.

Jinhong Park works at LG Electronics. His research interests include 2D/3D graphics hardware architecture, GPGPU, and SoC platforms. He received the BS, MS, and PhD degrees in computer science from Yonsei University, Seoul, Korea.

Il-San Kim works at Samsung Electronics. His research interests include 3D rendering processor architecture and ASIC design. He received the BS, MS, and PhD degrees in computer science from Yonsei University, Seoul, Korea.

Woo-Chan Park is an associate professor at Sejong University, Korea. His research interests include 3D rendering processor architecture, ray tracing accelerators, parallel rendering, high performance computer architecture, computer arithmetic, and ASIC design. He received the BS, MS, and PhD degrees in computer science from Yonsei University, Seoul, Korea.

Yong-Jin Park is a PhD student in the Department of Computer Science at Yonsei University, Korea. He is currently researching architecture for 3D computer graphics. He received the BS and MS degrees in computer science from Yonsei University.

Tack-Don Han is a professor in the Department of Computer Science at Yonsei University, Korea. His research interests include high performance computer architecture, media system architecture, and wearable computing. He received a PhD in computer engineering from the University of Massachusetts.