Area-efficient pixel rasterization and texture coordinate interpolation


Computers & Graphics 32 (2008) 669–681


Technical Section

Donghyun Kim a,*, Lee-Sup Kim b

a Qualcomm Inc., USA
b Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, Republic of Korea


Abstract

Article history:
Received 19 October 2007
Received in revised form 3 August 2008
Accepted 20 August 2008

In this paper, new pixel rasterization and texture coordinate interpolation algorithms are presented to reduce silicon area. The proposed pixel rasterization, based on the characteristics of the edge function, saves silicon area in terms of gate count by 38.9% and 35.3% compared to the previous centerline and scanline algorithms, respectively. The proposed texture coordinate interpolation combines the benefits of division and midpoint iteration in order to reduce silicon area without performance loss in computing the fraction part of texture coordinates, which is required for texture filtering. The proposed texture coordinate interpolation architecture uses fewer silicon gates than an architecture using dividers, and the gate count reduction ratios are 25.2% and 37.0% for 16- and 32-bit texture coordinates, respectively. The hardware feasibility of the proposed architecture is proven by implementation in a three-dimensional (3D) graphics SoC.
© 2008 Elsevier Ltd. All rights reserved.

Keywords: Rasterization; Scan conversion; Texture coordinate interpolation

1. Introduction

Three-dimensional (3D) graphics is now commonly used in mobile and consumer devices with the widespread use of liquid crystal displays. The major applications of 3D graphics, such as computer games and graphical user interfaces, need real-time interactivity. Therefore, the typical rendering algorithm based on rasterization, which maps 3D primitives into a raster format, is popularly used due to its speed compared to other approaches such as ray tracing or image-based rendering. A 3D scene is typically described as triangles represented by three vertices. Rasterizers take a stream of triangles and transform them into corresponding pixels on the viewer's monitor [1–3]. Previous mobile 3D graphics chipsets with limited area and power used to integrate only rasterizers, which are essential in the 3D graphics pipeline [4,5], and the rasterizer still accounts for a large proportion of silicon area in current mobile 3D graphics accelerators [6–8].

An important part of rasterization is how to fill in all the pixels inside a given triangle. Various algorithms have been developed for moving from pixel to pixel to cover all pixels inside the triangle, and the most popular solution is the scanline algorithm [9]. It first searches for boundary pixels along the triangle edges and fills the internal pixels row by row between boundary pixels. This approach is intuitive and does not traverse outside the triangle, but separate hardware logic for edge traversal and span filling is

* Corresponding author at: Qualcomm Inc., 5775 Morehouse Drive, San Diego, CA 92121, USA. Tel.: +1 858 805 5993; fax: +82 42 879 9860.
E-mail address: [email protected] (D. Kim).

doi:10.1016/j.cag.2008.08.007

typically required [4]. Instead of a scanline edge search, several algorithms that traverse pixel by pixel using edge functions have been presented in [1]. The centerline algorithm, also based on edge functions, has been implemented in real hardware with the tiling method [2,10]. The centerline algorithm can generate pixels in parallel to any degree, but it needs to keep the traversal point locked inside the given triangle because the determination of the traversal direction is based on intersection tests. There are other approaches, such as zigzag rasterization and Hilbert-order rasterization, that increase memory coherency to improve the performance of texturing and visibility tests [11], but they are not suitable for forward-differencing interpolation in hardware because their rasterized pixels are not always continuous in the horizontal or vertical direction. Forward rasterization does not sample at the pixel center, but it is efficient for small primitives [12]. The performance of texturing and visibility testing can also be improved with pre-visibility tests in the rasterizer [13–15].

Another important part is the interpolation of texture coordinates, where each interpolated value must be divided by the depth of the pixel in order to avoid perspective foreshortening problems. Since this per-pixel division makes it difficult to achieve a high pixel-fill rate with small hardware cost, various previous studies have sought to eliminate the division in perspective-correct texturing. Techniques to approximate hyperbolic curves with linear and quadratic interpolation have been studied [16]. A hardware architecture for quadratic interpolation that uses only adders instead of dividers has also been presented [17]. However, these studies are based on approximations with possible errors, and the error may cause conspicuous pixel defects in some particular cases such as large


polygon rendering. There have been other approaches that use midpoint algorithms [18–20], which are well known for tracing hyperbolic functions. The midpoint algorithm computes exact integer texture coordinates for point sampling using only several additions instead of a division, but point sampling does not provide good image quality. Common filtering methods such as trilinear filtering require the fraction part of the texture coordinates as filtering coefficients. Fixed-point representations including the fraction part can be scaled so that they take integer values only, as mentioned in [20], but the number of iterations is then greatly increased.

In this paper, the proposed rasterization improves on the previous works with two schemes: pixel rasterization and texture coordinate interpolation [23]. The proposed pixel rasterization scheme uses the characteristics of the edge function [1] in the pixel traversal, and it reduces silicon area compared with previous schemes [1,2]. The texture coordinate interpolation proposed in our earlier work [23] combines the merits of the midpoint iteration for the integer part and a pipelined digit-recurrence division for the fraction part. Since the precision of the fraction part is much lower than that of the integer part, the major hardware cost gain is obtained from the integer part iteration, and the throughput is sustained by the short pipeline of the fraction part division.

Section 2 describes the proposed pixel rasterization scheme, and Section 3 reviews the proposed texture coordinate interpolation [23]. After the hardware architecture and implementation of the two proposed techniques are described in Section 4, the analysis and discussion of the proposed pixel rasterization and texture coordinate interpolation are presented in Section 5. Section 6 summarizes our work.

2. Pixel rasterization

2.1. Previous pixel rasterization algorithms

The intuitive scanline traversal and its variants have been used in several hardware architectures [4,7]. A graphics engine [7] has been developed for mobile applications, and it has separate hardware logic for edge traversal and span filling. A hardware rasterizer [4] also has two separate edge-traversal blocks and a span-filling block. There is hardware redundancy in the edge-traversal blocks because their throughputs are different from that of the span-filling block. In addition, it is hard to generate multiple pixels in parallel with any degree of efficiency.

Various algorithms have been developed to remove the hardware redundancy by traversing pixel by pixel. These algorithms are based on the edge function, which classifies a point within the plane as falling into one of three regions: the region to the left of the line direction, the region to the right of the line direction, or the region representing the line itself [1,3]. Let x and y denote the horizontal and vertical screen coordinates, respectively. The edge function E(x, y) is

defined as follows:

E(x, y) = (x − x0)Δy − (y − y0)Δx   (Δx = x1 − x0, Δy = y1 − y0)   (1)

There is a relationship between the sign of the edge function and the position of the point:

E(x, y) > 0 if (x, y) is to the right side of the edge direction
E(x, y) = 0 if (x, y) is on the edge
E(x, y) < 0 if (x, y) is to the left side of the edge direction

Assuming a triangle that consists of three edges in clockwise direction, the pixel validity, which indicates whether a pixel is inside the triangle or not, is known by checking whether all three edge functions are positive. However, as the edge function guarantees only pixel validity, additional traversal algorithms are required to traverse all the pixels of an object.

Several traversal algorithms based on the edge function have been presented, as depicted in Fig. 1. The centerline algorithm shows the smartest traversal among them because it avoids unnecessary traversal in the bounding box and repeated pixel generation. The centerline algorithm was implemented in a silicon chip, PixelVision [10], and is well described in [2]. Before deciding on the pixel traversal direction, the rasterizer should check which directions are possible. The centerline algorithm checks the intersection of the boundary of a pixel rectangle (or a pixel stamp rectangle) and the triangle edges. There are four probe points for the edge functions at the four corners of the pixel for the intersection test. PixelVision samples the pixel value at the left-top corner, and hence three additional probe points are used. For commodity hardware, four additional probe points are required since OpenGL semantics stipulate pixel-center sampling. Computing the intersection of a pixel boundary segment with the object consists of two steps. The first step determines whether one or both probe points at the ends of each pixel boundary segment are inside each edge segment. The second step tests whether the pixel boundary segment is inside the object bounding box. If a segment passes both tests, then it probably intersects the object.

The rasterization algorithm of PixelVision starts the traversal from the top-most vertex. Before the traversal moves down to the next horizontal line, the centerline algorithm sweeps all the pixels on the current line in the object. When the traversal is started on a horizontal line, the centerline algorithm checks whether the right direction of the current pixel is valid or not. If the right direction is valid, the stamp contexts such as edge function values, color, and depth are stored into an additional context register named RightSave. After that, the centerline algorithm traverses in the left direction. After all the pixels on the left side of the line are traversed, the pixel context in RightSave is restored to traverse the right direction. While the traversal sweeps left or right, the algorithm also looks for valid down positions.

Fig. 1. Conventional traversal algorithms.


Fig. 2. The centerline algorithm. Octagons are DownSave positions, and circles are RightSave positions.

The first valid down position is saved into an additional context, DownSave. After all the pixels in the line are swept, the algorithm restores DownSave and moves the traversal down. If a valid DownSave is not found during the horizontal sweeping, the rasterization of the object is finished. Fig. 2 describes an example of the pixel generation order and the saved contexts in rendering a triangle.

This algorithm requires two additional pixel contexts, DownSave and RightSave. A context includes not only the edge function values but also all of the pixel data such as color, depth, and texture coordinates. Since this is a large amount of data, the number of contexts influences the silicon area of the rasterizer. Because an additional context save point is required per level of tiling, there is a trade-off between silicon area and memory efficiency [2].


2.2. Proposed movement decision

The centerline algorithm used in PixelVision determines the rasterization direction by searching the validity of the adjacent pixels with the intersection test [2]. It cannot find the movement direction if a pixel is fully outside the object. This property demands two context save points, DownSave and RightSave, to bind the current pixel inside the object. The proposed method determines the movement direction of a pixel using only the characteristics of the edge function, regardless of the pixel position.

The edge function of a point describes the relative position of the point with respect to the edge. To determine whether a pixel is left or right of an edge in screen coordinates, it is necessary not only to compute the sign of the edge function value but also to examine the direction of the edge. We can distinguish the edge direction by examining the sign of the vertical difference Δy. If Δy is positive, the edge points in the up direction; otherwise, the edge points in the down direction. Therefore, if a pixel is outside an edge, the horizontal direction in which to meet the edge is simply decided by examining the sign of Δy. Since Δy is already required to update the edge function values on an x-movement, the sign of Δy is easily obtained. For a pixel P(x, y) and an edge with edge function E0 and vertical difference Δy0, a 2-bit traverse-code T0 is introduced and defined as follows:

T0 = 00 if E0(x, y) ≥ 0   (2)

T0 = 01 if E0(x, y) < 0 and Δy0 ≥ 0   (3)

T0 = 10 if E0(x, y) < 0 and Δy0 < 0   (4)

Traverse-code 11 is undefined. The traverse-code indicates the relative location of the pixel with respect to the edge. Traverse-code 00 means that the pixel is inside the edge, that is, on the edge or in the clockwise half-plane of the edge direction. If the traverse-code is 01, the pixel is outside the edge and the edge is located on the right side of the pixel. Traverse-code 10 implies that the pixel is outside the edge and the edge is located on the left side of the pixel. The traversal direction should be right when the traverse-code is 01, and the direction should be left when the traverse-code is 10. When an edge is horizontal and the current traversal position is outside the edge, the traverse-code is defined as 01 even though the edge is not located on the right side of the pixel. In this case, the edge is never reached by moving right, but there will be a movement in the up direction caused by another edge of the triangle.

When a pixel lies on the shared edge of two triangles, it should be drawn by only one triangle. The shared edge of two adjacent triangles must have opposite directions according to the OpenGL specification, and the pixel is drawn according to the edge direction. In the pixel traversal, all the pixels on edges are visited once regardless of the edge direction, and they are filtered out according to the edge directions.

Now let us consider a triangle of three edges. Fig. 3 shows a triangle with three edges E0, E1, and E2. Traverse-codes T1 and T2 are defined similarly for the other two edges E1 and E2. Let us define a 2-bit traverse-code Ttri for a triangle by a bit-OR operation of T0, T1, and T2. This is a very simple computation, but the traverse-code Ttri of a triangle directly informs the traverse direction of a pixel. If Ttri is 00, the pixel is inside the triangle (Region 0). If Ttri is 10, the pixel must be traversed left (Region 2). If Ttri is 01, the pixel must go right to meet the triangle (Regions 1, 3 and 6). Traverse-code Ttri 11 implies that the pixel must move vertically to meet the triangle (Regions 4 and 5). By this characteristic, the pixel movement is determined from anywhere on the screen. This is a useful property in pixel traversal that is not available in the approach based on the intersection of object edges and stamp edge segments.
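As a concrete illustration of Eqs. (1)–(4), the following C sketch evaluates the edge function of a directed edge at a pixel center, maps it to a 2-bit traverse-code, and combines the three per-edge codes into Ttri with a bit-OR. The types and function names are assumptions for exposition, and shared-edge ownership rules are omitted; this is not the paper's hardware implementation.

#include <stdint.h>

/* Edge function of Eq. (1) for the directed edge (x0, y0) -> (x1, y1). */
static int64_t edge_function(int32_t x,  int32_t y,
                             int32_t x0, int32_t y0,
                             int32_t x1, int32_t y1)
{
    int64_t dx = (int64_t)x1 - x0;
    int64_t dy = (int64_t)y1 - y0;
    return ((int64_t)x - x0) * dy - ((int64_t)y - y0) * dx;
}

/* 2-bit traverse-code of Eqs. (2)-(4) from an edge function value e and the
 * edge's vertical difference dy: 00 inside, 01 edge to the right, 10 to the left. */
static unsigned traverse_code(int64_t e, int64_t dy)
{
    if (e >= 0) return 0x0;                 /* Eq. (2) */
    return (dy >= 0) ? 0x1 : 0x2;           /* Eqs. (3) and (4) */
}

/* Triangle traverse-code Ttri: bit-OR of the three edge codes.
 * 00 = inside, 01 = move right, 10 = move left, 11 = move vertically. */
static unsigned triangle_traverse_code(const int64_t e[3], const int64_t dy[3])
{
    return traverse_code(e[0], dy[0]) |
           traverse_code(e[1], dy[1]) |
           traverse_code(e[2], dy[2]);
}

In a hardware rasterizer the edge function values would be updated incrementally (by ±Δy or ∓Δx per pixel step) rather than recomputed from the vertices as above.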

2.3. Proposed pixel rasterization

The centerline algorithm uses additional probe points of the edge functions at the four corners of the pixel for the intersection test. This requires additional sets of adders and increases the amount of data in a pixel context. The proposed traversal algorithm uses only the probe points at the centers of the pixel rectangles, which are required for the pixel validity test anyway. Fig. 4 shows the probe point positions for OpenGL semantics. The proposed rasterization uses the direction characteristics of the edge function instead of intersection tests. It is similar to the previous centerline algorithm, but it requires only one context save point, RightSave, while the centerline algorithm requires two context save points, RightSave and DownSave. The centerline algorithm searches for the first valid down point for DownSave during horizontal sweeping, but this is not necessary in the proposed algorithm because it proceeds down to the next stamp line immediately after it finishes sweeping one stamp line horizontally.

We now describe the proposed pixel rasterization algorithm. The algorithm always starts at the top-most vertex, sweeping out an entire horizontal pixel line before moving down to the next pixel line. First, if the start pixel of the horizontal traversal is valid with Ttri = 00, which means that the pixel is inside the triangle, there are probably other pixels on the left and the right side of the pixel. Therefore, it must sweep both sides. To sweep both sides, the context of the current position is saved into RightSave for future reload, and the traversal goes to the left side until there are no valid pixels. After that, the pixel context is restored from RightSave, and the traversal sweeps the remaining right-side area


Fig. 3. L and R for three edges by divided regions.

Fig. 4. (a) Additional probe points (diamonds) are required in the centerline algorithm. (b) Proposed algorithm removes redundant probe points at the corners.

of the pixel line. If the horizontal traversal starts outside the triangle, the direction of the traversal is decided by the triangle traverse-code Ttri. If Ttri is 10, the traversal direction is to the left, and it moves left until Ttri becomes 01 or 11. If Ttri is 01, it moves right until Ttri becomes 10 or 11. If the traversal starts where Ttri is 11, it finishes this horizontal line. After sweeping one horizontal pixel line, the traversal goes down directly and repeats these processes until there are no more horizontal lines to draw. Fig. 5 depicts the cases of the start of the pixel line traversal.
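The per-line behavior described above can be summarized by the following C-style outline. The functions tri_code(), step_left(), step_right(), save_right(), restore_right(), and emit() are hypothetical placeholders for the incremental context updates and the RightSave register of the actual hardware, so this is a sketch of the traversal logic under those assumptions rather than the implemented FSM.

/* Hypothetical interface to the traversal context (assumed, not from the paper). */
extern unsigned tri_code(void);                 /* current 2-bit Ttri              */
extern int step_left(void), step_right(void);   /* move one stamp; 0 at the guard  */
extern void save_right(void), restore_right(void), emit(void);

/* Sweep one horizontal pixel (stamp) line of the proposed traversal. */
void sweep_line(void)
{
    unsigned t = tri_code();
    if (t == 0x0) {                       /* start pixel is inside the triangle */
        save_right();                     /* remember the start position        */
        do { emit(); } while (step_left() && tri_code() == 0x0);
        restore_right();                  /* reload and sweep the right side    */
        while (step_right() && tri_code() == 0x0) emit();
    } else if (t == 0x2) {                /* triangle lies to the left          */
        while (step_left() && (t = tri_code()) != 0x1 && t != 0x3)
            if (t == 0x0) emit();
    } else if (t == 0x1) {                /* triangle lies to the right         */
        while (step_right() && (t = tri_code()) != 0x2 && t != 0x3)
            if (t == 0x0) emit();
    }                                     /* t == 0x3: nothing on this line     */
}
/* The full rasterization starts at the top-most vertex and calls sweep_line()
 * once per line, moving down one pixel line after each sweep. */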

3. Division-free texture coordinate interpolation

3.1. Perspective-correct texture mapping

Texture mapping maps an image, called a texture map, onto a surface to obtain a realistic look by using simple geometry instead of precise geometry. The transformation of texture coordinates consists of two steps. The first is a transform from the 2D texture space to the homogeneous object space, and the second is a transform from the homogeneous object space to the 2D screen space [21]. For triangles, the homogeneous object coordinates and the texture coordinates are connected by a homogeneous linear

transformation, and the homogeneous object coordinates and the 2D screen coordinates are related hyperbolically. The relation among the screen coordinates (x, y), the homogeneous object coordinates (x′, y′, w′), and the texture coordinates (u, v) is given (up to the homogeneous scale factor) by

[u]   [K L M] [x′]
[v] ≅ [N P Q] [y′],   (x, y) = (x′/w′, y′/w′)   (5)
[1]   [R S T] [w′]

Therefore, the texture coordinates (u, v) can be represented in terms of the screen coordinates (x, y) by the relation

(u, v) = ((Kx + Ly + M)/(Rx + Sy + T), (Nx + Py + Q)/(Rx + Sy + T))   (6)

As shown in (6), two division operations, or one reciprocal with two multiplications, are required to calculate the exact texture coordinates of one pixel.
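For reference, the following C sketch evaluates Eq. (6) directly, with two divisions per pixel; the coefficient layout and the floating-point types are illustrative assumptions. This is the per-pixel cost that the division-free schemes below are designed to avoid.

/* Perspective-correct texture coordinates of a pixel center (x, y) from the
 * nine plane coefficients of Eq. (6), using two divisions per pixel. */
static void texcoords_by_division(float x, float y,
                                  const float c[9],  /* K L M N P Q R S T */
                                  float *u, float *v)
{
    float w = c[6] * x + c[7] * y + c[8];      /* Rx + Sy + T                   */
    *u = (c[0] * x + c[1] * y + c[2]) / w;     /* (Kx + Ly + M) / (Rx + Sy + T) */
    *v = (c[3] * x + c[4] * y + c[5]) / w;     /* (Nx + Py + Q) / (Rx + Sy + T) */
}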


Fig. 5. Four cases of the start points in horizontal traversals.

3.2. Previous midpoint algorithm

The per-pixel division can be avoided by using the midpoint algorithm in perspective texture mapping, as proposed in [20]. The key idea of [20] is based on three characteristics of 3D graphics texture mapping. First, it is not necessary to know the exact value with infinite precision; an error smaller than the unit in the last position (ulp) is allowed for finite precision. Second, the order of pixel generation is not arbitrary but sequential in a rasterizer. Third, derivatives such as ∂u/∂x are less than two if the level of the mipmap, the pyramid architecture of preprocessed textures, does not change [20,22]. With these three characteristics, repeated additions enable us to trace hyperbolic curves within a given precision, as shown in Fig. 6. The ulp of the pixel coordinate x and the texture coordinate u = f(x) is 1 in the example of Fig. 6. Since the exact value f(xi) for a given xi is quantized to the nearest value of limited precision, the nearest quantized value ui is acceptable for f(xi). The difference between ui and f(xi) must be smaller than half an ulp, which is 0.5 in this case. This is formulated as

(Kxi + Lyi + M)/(Rxi + Syi + T) − 0.5 < ui ≤ (Kxi + Lyi + M)/(Rxi + Syi + T) + 0.5   (7)

Fig. 6. Integer points in the acceptable region for a hyperbolic curve.

Two variables d and E are introduced to simplify the expression:

d(x, y) = 2(Rx + Sy + T)   (8)

E(x, y, u) = u·d(x, y) − (R + 2K)x − (S + 2L)y − (T + 2M)   (9)

Assuming the case that d is positive, (7) is simplified to the next inequality by using (8) and (9):

−d(xi, yi) < E(xi, yi, ui) ≤ 0   (10)


A new variable A is introduced for simplicity. When x increases by 1, d and E are changed as below:

A(u) = (2u − 1)R − 2K   (11)

d(x + 1, y) = d(x, y) + 2R   (12)

E(x + 1, y, u) = E(x, y, u) + A(u)   (13)

When u increases by 1, A and E can be linearly updated as follows:

E(x, y, u + 1) = E(x, y, u) + d(x, y)   (14)

A(u + 1) = A(u) + 2R   (15)

Therefore, when E and d are changed by increases or decreases of x or y, E can be restored to satisfy inequality (10) by adjusting u with iterative additions or subtractions using (14) and (15). For y-direction movement, similar operations can be performed; a variable B for y can be defined similarly to A. The restoration of E into the range of (10) incurs multiple iterative additions of (14) and (15), depending on the partial derivatives of the hyperbolic curve. In general, this iteration of additions or subtractions does not have an upper bound, but the characteristics of the mipmap [22] limit derivatives such as ∂u/∂x to two, as mentioned previously. However, texture filtering methods such as trilinear filtering are usually used to increase the image quality of texturing, and they require the fraction part of the texture coordinate. In these cases, the ulp of u is much less than 1. For m-bit precision of the fraction part, the upper bound on the number of iterations of (14) and (15) becomes 2^(m+1). If 4 bits are used to represent the fraction of u, 32 iterations must be performed in the worst case. A large number of iterations results in a long latency in clock cycles, or consumes a large hardware cost for parallel additions of (14) and (15) in a hardware implementation, just as divisions do.
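A minimal software sketch of this integer-level midpoint iteration for an x-step is given below. It assumes fixed-point inputs already scaled as in Eqs. (8)–(15), omits the y-direction variable B and mipmap switching, and its structure and names are illustrative rather than taken from [20].

#include <stdint.h>

/* Per-pixel state of the midpoint iteration, Eqs. (8)-(15). */
typedef struct {
    int64_t u;    /* integer texture coordinate             */
    int64_t E;    /* error term of Eq. (9), kept in (-d, 0] */
    int64_t A;    /* A(u) of Eq. (11)                       */
    int64_t d;    /* d(x, y) of Eq. (8), assumed positive   */
    int64_t R2;   /* the constant 2R                        */
} midpoint_t;

/* Advance the traversal by one pixel in +x and restore -d < E <= 0. */
static void midpoint_step_x(midpoint_t *m)
{
    m->E += m->A;                      /* Eq. (13): E(x+1, y, u) = E + A(u) */
    m->d += m->R2;                     /* Eq. (12): d(x+1, y)   = d + 2R    */
    while (m->E > 0) {                 /* u is too large: step u down       */
        m->u -= 1;  m->E -= m->d;  m->A -= m->R2;   /* Eqs. (14)-(15)       */
    }
    while (m->E <= -m->d) {            /* u is too small: step u up         */
        m->u += 1;  m->E += m->d;  m->A += m->R2;   /* Eqs. (14)-(15)       */
    }
}

With mipmapping, each while loop runs at most about two iterations per pixel at integer precision; if an m-bit fraction were folded into u, the same loops would need up to 2^(m+1) passes, which is the cost the scheme of Section 3.3 avoids.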

3.3. Proposed fraction part evaluation with midpoint algorithm

To solve this problem of the midpoint algorithm in texture filtering, we combine the benefits of midpoint iteration and division. The midpoint iteration requires only small hardware for the addition of several variables, but its iteration bound depends on the precision of the fraction part. On the contrary, most dividers provide the same throughput regardless of their precision, but the hardware size becomes larger as the pipeline gets deeper. We observe that the precision of the fraction part is much lower than that of the integer part. Therefore, we obtain the major area gain from the midpoint iteration on the integer part and remove the performance loss on the fraction part by using digit-recurrence division. The proposed algorithm [23] separates the integer part iteration and the fraction part evaluation so that there is no dependency between the fraction part computation of the previous pixel and that of the current pixel. The mathematical induction follows below.

We first move the acceptable region from rounding to truncation, as shown in Fig. 7. The maximum error increases from half an ulp to one ulp, but truncation is simpler for the computation of the fraction part in our method, and the difference is negligible when the fraction part has enough precision. For nearest filtering, which chooses the nearest texel, only a 1-bit fraction has to be calculated to indicate which texel is the closest. The next inequality formulates the acceptable region described in Fig. 7 for m-bit precision of the texture coordinate fraction part:

(Kxi + Lyi + M)/(Rxi + Syi + T) − 1/2^m < ui ≤ (Kxi + Lyi + M)/(Rxi + Syi + T)   (16)

We introduce E′ and A′ in place of E and A in (9) and (11) so that E′ is independent of m, as follows. Note that the second and third update relations, (13) and (14), remain valid with E′ and A′ in place of E and A:

Fig. 7. The acceptable region is changed by truncation.

E′(x, y, u) = u·d(x, y) − 2(Kx + Ly + M)   (17)

A′(u) = 2uR − 2K   (18)

Therefore, substituting E′ into (16) yields the next relation:

−d(xi, yi)/2^m < E′(xi, yi, ui) ≤ 0   (19)

Let the sequence kj represent the binary digits of the fraction part of ui, as shown in the next relation, and let ⌊ui⌋ denote the largest integer equal to or less than ui:

ui = ⌊ui⌋ + Σ_{j=1}^{m} kj/2^j,   kj ∈ {0, 1}   (20)

Naturally, ⌊ui⌋ satisfies (19) in the case of m = 0:

−d(xi, yi) < E′(xi, yi, ⌊ui⌋) ≤ 0   (21)

Notice that (21) and (10) have the same form, so ⌊ui⌋ is evaluated in the same way as described in Section 3.2. We now show how the fraction part, the digit sequence kj, is evaluated. Let ui|_n denote the number with n fraction bits that is equal to or less than ui. Then the next relations are induced easily:

ui|_{n+1} = ui|_n + k_{n+1}/2^{n+1} = ⌊ui⌋ + Σ_{j=1}^{n+1} kj/2^j   (22)

ui|_0 = ⌊ui⌋   (23)

ui|_m = ui   (24)

E′(xi, yi, ui|_{n+1}) = E′(xi, yi, ui|_n) + (k_{n+1}/2^{n+1})·d(xi, yi)   (25)

−d(xi, yi)/2^n < E′(xi, yi, ui|_n) ≤ 0   (26)

The rule to evaluate the sequence kj is induced from (22)–(26) as follows:

−d(xi, yi)/2^{n+1} < E′(xi, yi, ui|_n) ≤ 0   →   k_{n+1} = 0   (27)

−d(xi, yi)/2^n < E′(xi, yi, ui|_n) ≤ −d(xi, yi)/2^{n+1}   →   k_{n+1} = 1   (28)

Fig. 8 depicts how the sequence kj is obtained from k1 to km step by step. The number of these sequential evaluations is m for an m-bit fraction part of the texture coordinates, while the iteration bound of the previous midpoint algorithm is 2^(m+1). More importantly, these sequential evaluations do not have any dependency on the computation of the previous texture coordinates.


Fig. 8. Finding k1 to km sequentially: (a) integer level, evaluation of u_{i+1}|_0 from u_i|_0; (b) finding k1, evaluation of u_{i+1}|_1 from u_{i+1}|_0; (c) finding k2, evaluation of u_{i+1}|_2 from u_{i+1}|_1; (d) finding k3, evaluation of u_{i+1}|_3 from u_{i+1}|_2.

Fig. 9. State transition diagram for the proposed traversal algorithm.

For a hardware implementation, the fraction part evaluation can be pipelined and performed simultaneously with the integer part iterations. The hardware resource needed to evaluate one bit of the fraction part is a single adder, which computes (28) and updates E′.
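The following C sketch shows one way to carry out the recurrence of Eqs. (27) and (28) in software. Scaling E′ by 2^n at every step turns the threshold −d/2^(n+1) into a fixed comparison against −d, which mirrors the single add per fraction bit mentioned above; the function and variable names are illustrative assumptions, not the paper's RTL.

#include <stdint.h>

/* Evaluate m fraction bits k1..km of the texture coordinate from the
 * integer-level error term E0 = E'(xi, yi, floor(ui)), which satisfies
 * -d < E0 <= 0 with d = d(xi, yi) > 0 (Eq. (21)).  Returns k1..km, MSB first. */
static uint32_t fraction_bits(int64_t E0, int64_t d, unsigned m)
{
    uint32_t frac = 0;
    int64_t  F = E0;                 /* invariant: F = 2^n * E'(xi, yi, ui|_n) */
    for (unsigned n = 0; n < m; n++) {
        F *= 2;                      /* compare E'|_n against -d / 2^(n+1)     */
        frac <<= 1;
        if (F <= -d) {               /* Eq. (28): the next fraction bit is 1   */
            F += d;                  /* Eq. (25): update E' for ui|_{n+1}      */
            frac |= 1u;
        }                            /* Eq. (27): otherwise the bit is 0       */
    }
    return frac;
}

Each loop iteration corresponds to one Fraction Evaluator stage of Fig. 12, so m cascaded stages can deliver all m bits per clock cycle once the pipeline is full.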

4. Hardware architecture and implementation

4.1. Hardware architecture

The rasterizer architecture presented in this paper receives triangle data from the triangle setup engine [6,24], and it then generates pixels with texture coordinates. The pixel rasterization proposed in Section 2 is implemented with only a simple FSM. The state transition diagram is illustrated in Fig. 9. There is only one major 2-bit input, the triangle traverse-code Ttri, and two control inputs, triangle_start and triangle_finish. Control signals at the hardware implementation level, such as stall signals, are ignored here. Table 1 describes each state.

Table 1
State description of the FSM

State name: Description
IDLE: Idle state, ready to rasterize a triangle.
START: Renders the first pixel (stamp). Saves the context into RightSave.
LEFT: Traversal to the left side.
RIGHT: Traversal to the right side.
DOWN: Traversal down to the next pixel line. Saves the context into RightSave.
LEFT2: Traversal to the left side. After this state, the traversal restores RightSave and traverses the right side.
RIGHT2: Restores RightSave.

The proposed texture coordinate interpolation is adopted in a scalable pixel pipeline, named the Basic Rasterization Unit (BRU), as depicted in Fig. 10, which shows the data flow for only x and u for simplicity. The dimension of the screen and texture coordinates is scalable.


Fig. 10. The block diagram of BRU for 1D screen coordinate x and 1D texture coordinate u.

Fig. 11. The Integer Interpolator. The dashed rounded boxes indicate the additional hardware for the 2 × 2 pixel stamp.

For example, there are screen coordinates (x, y) and texture coordinates (u0, v0, u1, v1) for two-level multi-texture in our SoC implementation [6]. Therefore, the internal variables (E′, A′, B′) are also extended into four variable sets: (E′u0, A′u0, B′u0), (E′v0, A′v0, B′v0), (E′u1, A′u1, B′u1), and (E′v1, A′v1, B′v1). The Context Register contains the variables of the current pixel, such as position, color, depth, the integer parts of the texture coordinates, and the internal variables defined in Section 3.3. There are two register sets, one for the current pixel context and one for RightSave. The Pixel Interpolator consists of adder arrays that interpolate the pixel contexts along the rasterization traversal. It receives only two control signals indicating the traversal direction. In our implementation, Ef0, Ef1, and Ef2, the 32-bit edge function values of the three edges, are updated for every pixel to determine whether the pixel is inside a triangle or not. The 26-bit depth Z, the 10-bit fog factor

f, and the 4 × 18-bit color channels (R, G, B, A) are also interpolated. The internal variables in (11)–(13) are updated in the same architecture. The pixel rasterizer, which consists of the Context Register and the Pixel Interpolator described above, produces all the attributes of one pixel per clock cycle.
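As an illustration of what one entry of the Context Register holds, the following C struct collects the per-pixel attributes listed above; the field widths follow the text, but the grouping, the types, and the inclusion of d are assumptions made for exposition only.

#include <stdint.h>

/* One pixel (stamp) context; two such register sets exist, the current
 * context and RightSave.  Widths in comments follow the description above. */
typedef struct {
    int32_t  x, y;             /* screen position of the stamp                   */
    int32_t  ef[3];            /* Ef0, Ef1, Ef2: 32-bit edge function values     */
    uint32_t z;                /* 26-bit depth                                   */
    uint16_t fog;              /* 10-bit fog factor f                            */
    uint32_t rgba[4];          /* four 18-bit color channels (R, G, B, A)        */
    int32_t  tex[4];           /* integer parts of u0, v0, u1, v1                */
    int32_t  E[4], A[4], B[4]; /* internal variables (E', A', B') per coordinate */
    int32_t  d;                /* d(x, y) of Eq. (8), assumed shared             */
} pixel_context_t;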


Fig. 12. Cascading the Fraction Evaluators.

A texture coordinate interpolator consists of the other parts of the BRU, which are the Integer Interpolator, the Mipmap Switching Detector, the Mipmap Switch, and at least one Fraction Evaluator. The Integer Interpolator finds the integer part of the texture coordinates by checking (10) and updating (11)–(13). For the 2D two-level multi-texture coordinates (u0, v0) and (u1, v1), four Integer Interpolator blocks are required in a BRU. Fig. 11 shows the architecture of the Integer Interpolator. The interpolator receives the negative value of E′ and determines whether E′ is positive or not. The sign signal, the MSB of −E′, drives the subtraction control ports of all the adders. If E′ is positive, −E′ is negative, and the MSB of −E′ is high. Two parallel adders evaluate E′+d and E′+2d in the case of the single-pixel stamp, but Δx is two in the case of a 2 × 2 pixel stamp; therefore, the iteration bound is doubled from 2 to 4, and four parallel adders are required to evaluate E′+d, E′+2d, E′+3d, and E′+4d. If E′ is negative, −E′ is positive, and the MSB of −E′ is low; in the case of a 2 × 2 pixel stamp, the four adders then evaluate E′−d, E′−2d, E′−3d, and E′−4d. The Decision Logic examines the signs of the five results and of the bypassed E′, and selects the result that satisfies condition (10). It also generates the control signals of the MUX and finally selects the add-term for the texture coordinate u. The internal variables A′ and B′ are updated in the same way.

The Mipmap Switching Detector determines whether the current mipmap level must be changed. When the mipmap level is changed, the outputs of the Pixel Interpolator and the Integer Interpolator are not updated into the Context Register; instead, the outputs of the Mipmap Switch are updated. There is a one-cycle loss to change a mipmap level, but mipmap-level shifting does not occur frequently. The Fraction Evaluator produces the fraction part of the texture coordinate by (28). A Fraction Evaluator produces only one fraction bit by (28), but pipelining multiple Fraction Evaluators makes it possible to obtain multiple fraction bits per cycle, as shown in Fig. 12.

4.2. Hardware implementation

The proposed rasterization algorithms were implemented in a real 3D graphics system to test the feasibility of the proposed architecture [6]. The developed SoC integrates a RISC processor, 3D graphics IP, and other peripheral blocks. The 2 × 2 rasterizer in the 3D graphics IP evaluates texture coordinates by the proposed method. All the internal variables related to texture coordinates are represented in 16-bit fixed-point format. The developed rasterizer, including the pre-depth test block, consists of 290 k gates. It runs at 166 MHz and hence delivers 666 M pixels and 1.3 G texture coordinates per second. Fig. 13 shows the test board of the chip rendering real-time images successfully on an LCD.

5. Analysis

5.1. Analysis setup

There are many kinds of rasterizer architectures for different target applications, and their performance and silicon area vary widely.

The proposed pixel rasterizer is compared with rasterizers based on the bounding box, span filling, and centerline methods in terms of gate count in Section 5.2. Each rasterizer for the test is implemented in Verilog-HDL and synthesized for a 166 MHz clock frequency in a 0.13 μm CMOS technology. It is assumed that each edge function is 32-bit, each color channel is 18-bit, the depth is 26-bit, and the fog factor is 10-bit. All the rasterizers are designed to traverse one pixel per clock cycle at peak performance, and the critical-path delays of all the rasterizers are almost the same. Therefore, the performance of each rasterizer depends entirely on the pixel traversal efficiency, which indicates the ratio of valid pixels inside a given triangle to the traversed pixels. The comparison of pixel traversal efficiency is shown in Section 5.3.

The proposed texture interpolator is compared with an architecture using pipelined dividers in terms of gate count in Section 5.4, and the performance comparison with the general midpoint iteration in the calculation of the fractional part is described in Section 5.5. Two kinds of pipelined dividers are used for the comparison in Section 5.4. The first one is a radix-2 digit-recurrence divider, and the second adopts very-high radix algorithms using a lookup table. Regarding the very-high radix dividers, the divider based on Hung's algorithm [25] is used for 16-bit texture coordinate precision, and the divider using a modified Hung's algorithm with Newton–Raphson iteration [26] is used for 32-bit texture coordinates. The digit-recurrence divider for 16-bit texture coordinates consists of 4 pipeline stages and consumes 6773 NAND-equivalent gates. Hung's divider has 2 pipeline stages and 14,025 gates, but it processes two dividends with one common divisor. The 32-bit digit-recurrence divider consumes 23,486 gates in 5 pipeline stages, and the 32-bit very-high radix divider consumes 41,102 gates in 3 pipeline stages.

5.2. Area comparison of the proposed pixel rasterization

There are several kinds of traversal algorithms, including the conventional centerline algorithm. The bounding box scan is the simplest algorithm, but the algorithms for memory locality, such as the zig-zag scan or the Hilbert-order scan [11], are also based on the bounding box scan. The concept of the scanline traversal algorithm looks intuitive, but the throughput balance between edge traversal and span filling is important. If the hardware resources for edge traversal equal those for span filling, the pixel throughput is one pixel per clock, but the required logic becomes twice that of edge function-based algorithms such as the centerline or the proposed algorithm.


Fig. 13. Real-time verification of the 3D graphics SoC adopting the proposed rasterizer.

Fig. 14 shows the gate count comparison of the single-pixel rasterizers implemented without texture coordinate interpolation. Common logic includes the control logic and the registers that store the derivatives used in interpolation, and the pixel register counts all the flip-flops used to store pixel contexts. The pixel interpolator includes the movement decision logic and all the adders used in pixel interpolation. The bounding box rasterizer is the smallest in terms of gate count because there are no backup register files and the logic for the movement decision is simple. However, its pixel throughput is much less than one half, and this performance degradation will be shown in Section 5.3. The centerline rasterizer costs the most gates in this case because there are two sets of pixel backup registers and complex logic for the intersection test. The test result in Fig. 14 shows that the centerline rasterizer uses more gates than the scanline rasterizer due to the additional intersection test logic.

The proposed rasterizer reduces the gate count in all three parts compared to the centerline rasterizer and the scanline rasterizer. The movement decision logic is much simpler than the intersection test logic of the centerline rasterizer, and the size of the interpolation adder set is almost half that of the scanline rasterizer. The size of the register that stores the pixel context is almost two thirds of the pixel register in the centerline rasterizer. The common logic is also the smallest; the centerline rasterizer has additional storage for the corner probe points, and the scanline rasterizer has additional pixel context fields for the edge traversals. In the test implementation, the gate count of the proposed pixel rasterization architecture is 38.9% less than that of the centerline method and 35.3% less than that of the scanline method.

Fig. 14. The gate count comparison of the single-pixel rasterizer without texture coordinate interpolation.

5.3. Performance comparison of the proposed pixel rasterization

As all the rasterizers are implemented to traverse one pixel per clock cycle, their performance depends on how many of the generated pixels are valid. Pixel efficiency denotes the number of valid pixels over the number of traversed pixels, and Fig. 15 shows the pixel efficiency of the above rasterization methods for scenes with various average triangle sizes. While the scanline rasterizer always shows a pixel throughput of one because it first finds the correct edges, the pixel throughput of the centerline and the proposed methods is lower when the average triangle size is smaller. There are null pixels, that is, traversed pixels outside a given triangle, in the centerline and the proposed algorithms. The pixel throughput of the bounding box method is much less than that of the other methods.

Fig. 16 shows the performance per area of each rasterizer, considering the cost in gate count shown in Section 5.2. The performance efficiency per gate count is normalized to the scanline rasterizer in Fig. 16. The proposed and the centerline rasterizers have a similar performance tendency, but the proposed rasterizer improves the average performance per area by 78.2% compared to the centerline rasterizer. The bounding box rasterizer proves to be the worst due to its low pixel efficiency. The proposed rasterizer architecture shows the best results over the whole simulation except when the average triangle size is less than 16. Considering that the average triangle sizes in most game applications are much bigger than 16 [27], the proposed rasterizer architecture proves to be better than the other rasterizers.

5.4. Comparison of the proposed texture coordinate interpolation and division

The proposed texture coordinate interpolator and pipelined dividers have the same performance, as both can produce outputs every clock cycle, but their silicon area usages are different. Fig. 17 shows the gate count comparison of the proposed texture coordinate interpolator and the interpolator architectures using dividers in the cases of 16- and 32-bit texture coordinates, respectively. Each interpolator produces a set of 2D texture coordinates per clock cycle.


Fig. 15. The pixel efficiency according to scenes with various average triangle sizes.

Fig. 16. The performance per area normalized to scanline rasterizer.

All the interpolators have different critical logic delays, but they are pipelined for the same clock frequency, 166 MHz. The additional pipelining registers are counted in the logic part (Fig. 17). In the proposed texture coordinate interpolation architecture, storage for the intermediate terms d, E, A, and B is additionally required. Therefore, the gate count of the register part in the proposed texture coordinate interpolator is increased by 17.7% over the divider-based interpolators in both cases. However, the gate count of the logic part is greatly reduced in the proposed architecture, since it uses only adders instead of dividers. The logic part is reduced by 34.1% for 16-bit texture coordinates compared to the very-high radix divider. In the case of 32-bit texture coordinates, the reduction ratio of the logic gate count grows to 45.1% because divider size usually increases quadratically as the precision increases. This gain in the texture logic covers the penalty in the texture registers, and overall it saves 25.2% and 37.0% of the gate count without performance loss for 16- and 32-bit texture coordinates,

respectively. If a rasterizer is designed to produce multiple sets of texture coordinates for multi-layer texturing, the area gain of the proposed texture coordinate interpolator architecture becomes dominant in the whole rasterizer system.

5.5. Comparison of the proposed texture coordinate interpolation and midpoint algorithm

Fig. 18 shows image examples rendered by the proposed algorithm. Fig. 18a is an image rendered without texture filtering using the conventional midpoint algorithm, and Fig. 18b is an image rendered with trilinear texture filtering using the proposed algorithm. The rendering times of the two scenes are the same, but it is clear that Fig. 18a shows many visual defects, which are pixels contorted by aliasing noise. Trilinear texture filtering alleviates the aliasing noise, as shown in Fig. 18b. Although the precision of the fraction part of the texture coordinates is only 4 bits, Fig. 18b shows much better image quality. The image quality is greatly improved


Fig. 17. Gate count comparison for 16- and 32-bit texture coordinate interpolator. (a) 16 bit, (b) 32 bit.

Fig. 19. Performance comparison for midpoint iteration and the proposed texture coordinate interpolation.

Fig. 18. Image quality comparison between nearest filtering without a fraction part (a) and trilinear filtering with a 4-bit fraction part (b).

with the same performance by the proposed algorithm, and the additional hardware is only the 4 adders used in the 4 Fraction Evaluators.

The fractional part of the texture coordinates could also be produced by the general midpoint iteration, but the iteration upper bound increases exponentially with the number of fractional bits. Fig. 19 shows the performance difference between the proposed technique and the midpoint iteration for a 4-bit fractional part. The texture coordinate restorations (14) and (15) are checked in parallel for the two cases in which the texture coordinate differences are 1 and 2 ulp. Therefore, the midpoint algorithm calculates integer texture coordinates every clock cycle only if the ulp is 1, while the proposed technique calculates full texture coordinates every clock cycle. The maximum texture coordinate difference is two owing to the mipmap, and the iteration upper bound is up to 16, since one integer unit corresponds to 16 ulp for the 4-bit fractional part. The low performance of the previous midpoint technique shown in Fig. 19 is caused by the increased average iteration count, and it is 24% of full performance on

average. The performance degradation, which reflects the average iteration count, depends on the characteristics of each scene. For example, scene 19.a shows smaller performance degradation than the other scenes because magnified textures are used; there are no negative mipmap levels, so the texture coordinates vary only slightly.

6. Conclusion

In this paper, 3D graphics rasterization algorithms that reduce hardware area have been presented. The proposed pixel traversal algorithm is based on edge function characteristics instead of the intersection test between polygon edges and pixel stamp edge segments. It removes not only the edge function probe points at the four corners of the pixel stamp but also one context save point. The gate count of the proposed pixel rasterization architecture is 38.9% less than that of the centerline method and 35.3% less than that of the scanline method.


The proposed texture coordinate interpolation benefits from the low cost of the midpoint algorithm and the high throughput of a pipelined divider for the case of texture filtering, which requires fractional texture coordinates. The hardware area cost of the proposed texture coordinate interpolation is 25.2% and 37.0% lower than the area cost of the architecture using dividers in the cases of 16- and 32-bit texture coordinates, respectively. The rasterizer of the proposed architecture with four parallel pixel processing units is implemented in a 3D graphics SoC. The implemented rasterizer achieves a throughput of 666 M pixels and 1.3 G texture coordinates per second.

Acknowledgment

This work is supported in part by SAMSUNG Electronics, the university IT research center program, and the Consortium of Semiconductor Advanced Research through the SYSTEM IC 2010 project, Korea.

References

[1] Pineda J. A parallel algorithm for polygon rasterization. In: Proceedings of SIGGRAPH, 1988. p. 15–21.
[2] McCormack J, McNamara R. Tiled polygon traversal using half-plane edge functions. In: Proceedings of SIGGRAPH, 2000. p. 15–21.
[3] Lentz DJ, Kosmal DR, Poole GC. Polygon rasterization. US Patent 5,446,836, 1995.
[4] Lee J, Kim LS. SPARP: a single pass antialiased rasterization processor. Computers and Graphics 2000;4:233–43.
[5] Park YH, Han SH, Lee JH, Yoo HJ. A 7.1-GB/s low-power rendering engine in 2-D array-embedded memory logic CMOS for portable multimedia system. IEEE Journal of Solid-State Circuits 2001;32(6):944–55.
[6] Kim D, Chung K, Yu CH, Kim CH, Lee I, Bae J, et al. An SoC with 1.3 Gtexels/s 3-D graphics full pipeline for consumer applications. IEEE Journal of Solid-State Circuits 2006;41(1):71–84.
[7] Woo R, Choi S, Sohn JH, Song SJ, Yoo HJ. A 210 mW graphics LSI implementing full 3D pipeline with 264 Mtexels/s texturing for mobile multimedia applications. In: Proceedings of the IEEE international solid-state circuits conference, 2003. p. 44–5.
[8] Akenine-Moller T, Strom J. Graphics for the masses: a hardware rasterization architecture for mobile phones. ACM Transactions on Graphics 2003;22(3):801–8.
[9] Wylie C, Romney GW, Evans DC, Erdahl A. Halftone perspective drawings by computer. In: Proceedings of the AFIPS fall joint computer conference, 1967. p. 49.


[10] Kelleher B. PixelVision architecture. Technical note 1998-013, System Research Center, Compaq Computer Corporation, 1998. Available at: http://www.research.digital.com/SRC/publications/src-tn.html.
[11] McCool MD, Wales C, Moul K. Incremental and hierarchical Hilbert order edge equation polygon rasterization. In: Proceedings of the SIGGRAPH/EUROGRAPHICS workshop on graphics hardware, 2001. p. 65–72.
[12] Popescu V, Rosen P. Forward rasterization. ACM Transactions on Graphics 2006;25(2):375–411.
[13] Yu CH, Kim LS. An adaptive spatial filter for early depth test. In: Proceedings of the IEEE international symposium on circuits and systems, vol. 2, 2004. p. 137–40.
[14] Park WC, Lee KW, Kim IS, Han TD, Yang SB. An effective pixel rasterization pipeline architecture for 3D rendering processors. IEEE Transactions on Computers 2003;52(11):1501–8.
[15] Park WC, Lee KW, Kim IS, Han TD, Yang SB. A mid-texturing pixel rasterization pipeline architecture for 3D rendering processors. In: Proceedings of the IEEE 13th international conference on application-specific systems, architectures and processors, 2002. p. 173–82.
[16] Demirer M, Grimdale RL. Approximation techniques for high performance texture mapping. Computers and Graphics 1996;20(4):483–90.
[17] Abbas A, Szirmay-Kalos L, Szijarto G, Horvath T, Foris T. Quadratic interpolation in hardware Phong shading and texture mapping. In: Proceedings of the spring conference on computer graphics, 2001. p. 25–8.
[18] Blinn JF. Hyperbolic interpolation. IEEE Computer Graphics and Applications 1992:89–94.
[19] Pitteway M. Algorithms for drawing ellipses or hyperbolae with a digital plotter. Computer Journal 1967;10(3):282–9.
[20] Barenbrug B, Peters FJ, Overveld CWAM. Algorithms for division free perspective correct rendering. In: Proceedings of the SIGGRAPH/EUROGRAPHICS workshop on graphics hardware, 2000. p. 7–13.
[21] Watt A. 3D computer graphics. 2nd ed. Boston: Addison-Wesley Publishing Company; 2000.
[22] Ewins J, Waller MD, White M, Lister PF. MIP-map level selection for texture mapping. IEEE Transactions on Visualization and Computer Graphics 1998;4(4):17–29.
[23] Kim D, Kim LS. Division-free rasterizer for perspective-correct texture filtering. In: Proceedings of the IEEE international symposium on circuits and systems, vol. 2, 2004. p. 153–6.
[24] Chung K, Kim D, Kim LS. A 3-way SIMD engine for programmable triangle setup in embedded 3D graphics hardware. In: Proceedings of the IEEE international symposium on circuits and systems, 2005. p. 4570–3.
[25] Hung P, Fahmy H, Mencer O, Flynn MJ. Fast division algorithm with a small lookup table. In: Proceedings of the 33rd Asilomar conference on signals, systems, and computers, vol. 2, 1999. p. 1465–8.
[26] Jeong J, Park WC, Jeong W, Han TD, Lee MK. A cost-effective pipelined divider with a small lookup table. IEEE Transactions on Computers 2004;53(4):489–95.
[27] Roca J, Moya V, Gonzalez C, Solis C, Fernandez A, Espasa R. Workload characterization of 3D games. In: Proceedings of the IEEE international symposium on workload characterization, 2006. p. 17–26.