A VLSI-design for fast vector normalization


Comput. & Graphics, Vol. 19, No. 2, pp. 261-271, 1995. Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. 0097-8493/95 $9.50 + .00. Pergamon.



0097-8493(94)00152-9 Graphics Hardware


GÜNTER KNITTEL
Wilhelm-Schickard-Institut für Informatik, Graphisch-Interaktive Systeme (WSI/GRIS), Universität Tübingen, Auf der Morgenstelle 10, C9, D-72076 Tübingen, Germany, e-mail: [email protected]

Abstract-We describe the design of a high-speed, high-precision single-chip vector normalizer for 3D vectors. It was constructed as a pipelined unit to speed up our graphics system for scientific visualization, but can profitably be employed in any application where a large number of vectors must be processed rapidly. The circuitry accepts 3D vectors with 33 bit two's complement components. The components of the normalized vectors are computed as 16 bit two's complement fixed-point numbers. Due to the overall pipeline architecture, the chip is able to receive one 3D vector and to produce one normalized vector each clock. The architecture was implemented for a clock frequency of 80 MHz using a 0.8 µm Gate Array technology. The design consumes about 66,500 gates. To normalize a 3D vector, three square operations, two additions, one square root operation and three divisions must be done, that is, nine operations per vector. Thus, at 80 MHz the chip provides a sustained performance of 720 MOPS.

1. INTRODUCTION

Many computer graphics algorithms require fast and frequent vector normalizations. For example, the Phong illumination model [2]

$$I_\lambda = I_{a\lambda} k_a C_\lambda + I_{l\lambda} \left( k_d C_\lambda (\vec{L}_N \cdot \vec{G}_N) + k_s (\vec{H}_N \cdot \vec{G}_N)^n \right) \quad \text{(simplified)}^1 \qquad (1)$$

calculates the light intensity $I_\lambda$ of a point on an object surface according to four unit vectors:

• the surface normal $\vec{G}_N$,
• the normalized vector $\vec{L}_N$ in direction to the light source and
• the so-called halfway vector $\vec{H}_N$, which in turn is the normalized sum of $\vec{L}_N$ and the normalized vector $\vec{E}_N$ in direction to the viewpoint.

¹ Given in Blinn's alternative formulation [1]. $\lambda$: red, green or blue, $I_{a\lambda}$: ambient light, $I_{l\lambda}$: light coming from the light source, $k_a$, $k_d$, $k_s$: ambient, diffuse and specular reflection coefficients, $C_\lambda$: color of the object, $n$: specular reflection exponent.

Applications aiming at virtual reality, e.g., the graphics subsystem for volume rendering developed at WSI/GRIS [5,6,7], must provide perspective projection and non-parallel light, that is, none of the vectors is constant. Unfortunately, normalizing a vector is computationally expensive (especially the square root operation) and, moreover, (1) has to be evaluated several million times each second.

This was the motivation to develop a high-speed single-chip vector normalizer. The large number of vectors to be processed sequentially permits the use of a moderately deep pipeline structure without any performance penalty. For the square root function, an algorithm was adapted which computes one result bit per stage and uses only a small circuitry within each stage. The architecture of the chip is scalable with respect to speed and required chip space (by placing more or fewer functional units into a single pipeline stage) or precision (by adding or discarding an appropriate number of stages and operand bits).

2. ARCHITECTURAL OVERVIEW

The circuitry described on the following pages accepts the components of a 3D vector $\vec{V} = (v_x; v_y; v_z)$ and produces its associated normalized vector $\vec{N} = (n_x; n_y; n_z)$.

The block diagram (see Fig. 1) shows the deep but regular pipeline structure of the chip. The boxes with the small filled triangle represent registers. The register structure within the pipelined units (square root unit and divide unit) has been omitted for clarity, but will be explained in later sections. Operands which skip certain functional units must travel through pipeline registers (FIFOs) to maintain synchronization. Thus, FIFO memories must also be placed onto the chip. No feedback paths or functional units for exception handling are required, which makes the control structure extremely simple. The zero vector in turn produces the zero vector. There is an additional valid flag which travels along with each vector and a small circuitry to mask the clock. Besides that, the chip just has to be clocked. The deep pipeline structure relies on a great number of vectors to be processed sequentially, as is the case in most computer graphics applications and especially in the algorithms used in our voxel subsystem. Thus, the pipeline will always be filled and so operate at maximum efficiency.

Fig. 1. Overall architecture.

We assume a global space of 32 bit extent in each direction, that is

$$-2^{31} \le x, y, z \le 2^{31} - 1. \qquad (2)$$

Therefore, the input operands are expected to be 33 bit two's complement integers. Smaller operands must be sign-extended to 33 bits. The components of the normalized vector are computed as 16 bit two's complement fixed-point numbers

$$n = -n_0 \times 2^0 + \sum_{j=-1}^{-15} n_j \times 2^j. \qquad (3)$$

Thus, the chip has 147 I/O-pins (three 33 bit inputs and three 16 bit outputs, excluding control-, test- and clock-terminals). We will now describe all functional units in data flow order in detail, i.e., by Boolean equations or by schematic drawings. For each functional unit, an approximate gate count is given.

3. NAMING CONVENTIONS

A vector is denoted by an uppercase letter with an arrow. The components are designated by the lowercase letter with the indices x, y and z, e.g., $\vec{V} = (v_x; v_y; v_z)$. If an operation is applied to any component, the index is omitted. The particular bits of a component or a magnitude are identified by subscript numbers, e.g., $v = (v_{15}; v_{14}; v_{13} \cdots v_0)$. The vector length is represented by the uppercase letter without any diacritical marks. The bits of squared variables are quoted, e.g., $V^2 = (V'_{31}; V'_{30} \cdots V'_0)$.

4. THE SIGN UNIT (INPUT)

The sign unit at the inputs converts a 33 bit two's complement number $v$ into a 32 bit unsigned integer $a$ preceded by a sign flag $S$. Thus, the range is restricted to

$$-2^{32} + 1 \le v \le 2^{32} - 1. \qquad (4)$$

The sign flag is 1 if the number is negative. All sign flags are propagated through the whole circuit and passed to the sign units at the outputs. The arithmetic operation is to invert all bits and add 1 if the highest bit is set, otherwise to leave everything unchanged. Thus:

$$a_0 = v_0; \qquad (5)$$

$$a_1 = \bar{v}_{32} v_1 \lor v_{32}(\bar{v}_1 v_0 \lor v_1 \bar{v}_0) = \bar{v}_{32} v_1 \lor v_{32}(v_1 \oplus v_0); \qquad (6)$$

$$a_2 = \bar{v}_{32} v_2 \lor v_{32}(\bar{v}_2 (v_1 \lor v_0) \lor v_2 \bar{v}_1 \bar{v}_0) = \bar{v}_{32} v_2 \lor v_{32}(v_2 \oplus (v_1 \lor v_0)); \qquad (7)$$

$$a_3 = \bar{v}_{32} v_3 \lor v_{32}(\bar{v}_3 (v_2 \lor v_1 \lor v_0) \lor v_3 \bar{v}_2 \bar{v}_1 \bar{v}_0) \qquad (8)$$

$$\phantom{a_3} = \bar{v}_{32} v_3 \lor v_{32}(v_3 \oplus (v_2 \lor v_1 \lor v_0)); \qquad (9)$$

In general:

$$a_p = \bar{v}_{32} v_p \lor v_{32}(v_p \oplus (v_{p-1} \lor v_{p-2} \lor \cdots \lor v_1 \lor v_0)); \quad S = v_{32}. \qquad (10)$$

Gate count: 3,000

5. THE ALIGNMENT UNIT

In order to reduce the width of the arithmetic units, the components of a vector are uniformly scaled up or down until no component is greater than $2^{15} - 1$ and at least one component is greater than or equal to $2^{14}$. Theoretically, no error emerges from this operation since

$$\vec{N} = \frac{\vec{V} \times 2^{n}}{\sqrt{(v_x \times 2^{n})^2 + (v_y \times 2^{n})^2 + (v_z \times 2^{n})^2}} = \frac{\vec{V}}{\sqrt{v_x^2 + v_y^2 + v_z^2}}. \qquad (11)$$

However, due to the possible truncation of large vectors, a rounding error might arise. See Section 14, Error Estimation.
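The behaviour of the input sign unit (Section 4) and of the alignment step (Section 5) can be mimicked in a few lines of software. The following is only a behavioural sketch under stated assumptions, not the gate-level design: the function names `sign_unit_in` and `align` are hypothetical, and the shift amount is derived from the position of the highest set bit rather than from the shift signals used in the Boolean equations.

```python
from typing import Tuple

Vec3 = Tuple[int, int, int]


def sign_unit_in(v: int) -> Tuple[int, int]:
    """Split a 33 bit two's complement value into (sign flag S, magnitude a).

    Hypothetical helper: it models Eqs. (4) and (10) arithmetically
    instead of bit by bit.
    """
    if not -(2**32) + 1 <= v <= 2**32 - 1:
        raise ValueError("component outside the range of Eq. (4)")
    return (1 if v < 0 else 0), abs(v)


def align(a: Vec3) -> Tuple[Vec3, int]:
    """Uniformly shift three magnitudes so the largest lies in [2**14, 2**15 - 1].

    Returns (q, z); z models the active-low zero flag (0 for the zero vector).
    Right shifts truncate, which is the source of the rounding error
    mentioned in the text.
    """
    top = max(a)
    if top == 0:
        return (0, 0, 0), 0
    shift = top.bit_length() - 15  # > 0: shift right, < 0: shift left
    if shift > 0:
        q = tuple(c >> shift for c in a)
    else:
        q = tuple(c << -shift for c in a)
    return q, 1


if __name__ == "__main__":
    signs, mags = zip(*(sign_unit_in(c) for c in (-3_000_000, 12, 700_000)))
    print(signs, align(mags))
```

For 32 bit magnitudes the shift computed here ranges from 14 places to the left (largest component equal to 1) to 17 places to the right (bit 31 set).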

To describe the function of this unit we use the following abbreviations:

...

$$\mathit{SHL14} = (a_{x0} \lor a_{y0} \lor a_{z0}) \land \bar{a}_{x31} \land \cdots \land \bar{a}_{x1} \land \bar{a}_{y31} \land \cdots \land \bar{a}_{y1} \land \bar{a}_{z31} \land \cdots \land \bar{a}_{z1}. \qquad (18)$$

Besides that, the alignment unit activates the flag Z(ero, active low) in case the zero vector was received. The function of the alignment circuitry is defined by:

$$q_0 = a_0 \land \mathit{SH0} \lor a_1 \land \mathit{SHR1} \lor \cdots \lor a_{17} \land \mathit{SHR17}; \qquad (19)$$

$$q_1 = a_0 \land \mathit{SHL1} \lor a_1 \land \mathit{SH0} \lor a_2 \land \mathit{SHR1} \lor \cdots \lor a_{18} \land \mathit{SHR17}; \qquad (20)$$

$$q_2 = a_0 \land \mathit{SHL2} \lor a_1 \land \mathit{SHL1} \lor a_2 \land \mathit{SH0} \lor a_3 \land \mathit{SHR1} \lor \cdots \lor a_{19} \land \mathit{SHR17}; \qquad (21)$$

$$\cdots$$

$$q_{14} = a_0 \land \mathit{SHL14} \lor \cdots \lor a_{13} \land \mathit{SHL1} \lor a_{14} \land \mathit{SH0} \lor a_{15} \land \mathit{SHR1} \lor \cdots \lor a_{31} \land \mathit{SHR17}. \qquad (22)$$

Gate count: 3,000

6. THE SQUARE UNITS

Since the input operands are 15 bit integers, the results are 30 bit positive numbers. We use standard multiplier networks. However, the computing pattern shows some redundancy which can be exploited to cut the required chip space by one half. We will demonstrate the scheme for a 6 bit number.

$$q^2 = (q_5 2^5 + q_4 2^4 + q_3 2^3 + q_2 2^2 + q_1 2^1 + q_0 2^0)^2 = (q'_{11} 2^{11} + q'_{10} 2^{10} + \cdots + q'_0 2^0). \qquad (24)$$

The computing pattern is shown in Table 1. Most elements occur twice and so Table 1 can be reorganized as shown in Table 2. The circuitry in Fig. 2 performs this function ($q'_0 = q_0$; $q'_1 = 0$).

Table 1. Squaring binary integers.

Fig. 2. 6 bit square unit. HA: Half Adder, FA: Full Adder.

Gate count: 3,500

7. THE Σ UNIT

This is a standard triple input integer adder. The results are 32 bit integers. The alignment unit assures that

$$10000000\mathrm{H} \le Q^2 \le \mathrm{BFFD0003H}. \qquad (25)$$

Gate count: 3,000

8. THE SCALE UNIT

The components of the vector and the vector length are scaled so that

$$10000000\mathrm{H} \le T^2 \le \mathrm{3FFF0001H} \quad \text{or} \quad 2^{14} \le T \le 2^{15} - 1. \qquad (26)$$

That is, the squared vector length is shifted right two places and each of the components is shifted right one digit if one of the two most significant bits of the squared vector length is set. The function is then described as follows:

$$T'_j = Q'_j \land \bar{Q}'_{31} \land \bar{Q}'_{30} \lor Q'_{j+2} \land (Q'_{31} \lor Q'_{30}) \quad \text{for } 0 \le j \le 29, \qquad (27)$$

$$t_j = q_j \land \bar{Q}'_{31} \land \bar{Q}'_{30} \lor q_{j+1} \land (Q'_{31} \lor Q'_{30}) \quad \text{for } 0 \le j \le 13 \text{ and} \qquad (28)$$

$$t_{14} = q_{14} \land \bar{Q}'_{31} \land \bar{Q}'_{30}. \qquad (29)$$

Again, truncation errors must be taken into consideration (see Section 14, Error Estimation).

Gate count: 1,500

9. THE SQUARE ROOT UNIT

Table 2. Removing redundancy by combining identical terms.

The algorithm is first explained for decimal numbers [3]. For every two digits of the integer part of a number, the integer part of the square root has one digit. If the number of digits is odd, a zero must be placed in front of the integer. For the computation of the fractional part of the root, the decimals are considered in pairs. Thus, a zero must be appended if necessary.

Let us consider the following example.

$$\sqrt{97\,26\,41.70} = A\,B\,C.D$$

We start the calculation with the most significant digit. $A$ is the integer part of the square root of 97, no matter what follows. Thus, $A = 9$. The remainder $R_A = 16$.

Now let's consider the next two digits. $10 \times A$ is still an approximation of the square root of 9726. The new remainder $R_A^* = 100 \times R_A + 26 = 1626$. If we add $B$, the new square root is $(10 \times A + B)$. By doing so, we increase the square by $20 \times A \times B + B^2$. So we can formulate:

$$R_A^* \ge 20 \times A \times B + B^2. \qquad (30)$$

Since $20 \times A \times B > B^2$, we will for the moment neglect $B^2$. A rough estimation gives:

$$B = \lfloor R_A^* / (20 \times A) \rfloor = \lfloor 1626 / 180 \rfloor = 9. \qquad (31)$$

If this value satisfies (30), we have found $B$. In this example, however, this is not the case ($20 \times 9 \times 9 + 81 = 1701 > 1626$) and therefore we have to decrement $B$ by one. Thus, $B = 8$. The remainder $R_B$ is given by:

$$R_B = R_A^* - (20 \times A \times B + B^2) = 1626 - (1440 + 64) = 122. \qquad (32)$$

Now we are ready for the next two digits. Again, $10 \times AB$ is still an approximation of the square root of 972641 (don't get confused by the notation: $AB = 98$ in this example, whereas $A \times B = 72$). The new remainder $R_B^* = 100 \times R_B + 41 = 12241$. We add $C$ to increase the square by $20 \times AB \times C + C^2$. So we find:

$$R_B^* \ge 20 \times AB \times C + C^2. \qquad (33)$$

Another guesstimate gives:

$$C = \lfloor R_B^* / (20 \times AB) \rfloor = \lfloor 12241 / 1960 \rfloor = 6. \qquad (34)$$

This time (33) is satisfied. The integer part of the root therefore is 986. For the decimals we can proceed in just the same way:

$$R_C = R_B^* - (20 \times AB \times C + C^2) = 12241 - (20 \times 98 \times 6 + 36) = 445; \qquad (35)$$

$$R_C^* = 100 \times R_C + 70 = 44570; \qquad (36)$$

$$R_C^* \ge 20 \times ABC \times D + D^2; \qquad (37)$$

$$D = \lfloor R_C^* / (20 \times ABC) \rfloor = \lfloor 44570 / 19720 \rfloor = 2. \qquad (38)$$

The final result:

$$\sqrt{972641.7} = 986.2. \qquad (39)$$

This procedure can be continued by appending pairs of zeros to the decimals until the required precision is reached.

The advantages of this method will become clear if we consider binary integers. Again the integer part of the root has one bit for every pair of bits of the operand. For better readability we will denote the square bits as $D_i$, that is $T^2 = \{D_{29}; D_{28} \cdots D_0\}$, as shown in Fig. 3.

Fig. 3. Format of operands and results.

Naturally, the most significant bit $T_{14}$ can only be 0 or 1 (accordingly, there are only two square numbers which fit into two bits: 00 and 01). Thus:

$$T_{14} = D_{29} \lor D_{28}. \qquad (40)$$

The remainder $R'_{[29 \ldots 28]}$ is calculated according to

$$R'_{29} = D_{29} \land D_{28} \quad \text{and} \quad R'_{28} = D_{29} \land \bar{D}_{28}. \qquad (41)$$

The square root of the binary number $D_{29} D_{28} D_{27} D_{26}$ is still approximated by $2 \times T_{14}$, the remainder $R^*_{[29 \ldots 26]}$ is represented by $R'_{29} R'_{28} D_{27} D_{26}$. If we add $T_{13}$, the square is increased by $4 \times T_{14} \times T_{13} + T_{13}^2$. So the following relation must be satisfied:

$$R^*_{[29 \ldots 26]} \ge 4 \times T_{14} \times T_{13} + T_{13}^2. \qquad (42)$$

If we assume $T_{13} = 1$, the relation to be satisfied turns into

$$R^*_{[29 \ldots 26]} \ge 4 \times T_{14} + 1. \qquad (43)$$

In other words: $T_{13} = 1$ if the above test passes, else $T_{13} = 0$. The new remainder $R'_{[29 \ldots 26]}$ is computed by:

$$R'_{[29 \ldots 26]} = R^*_{[29 \ldots 26]} - 4 \times T_{14} \times T_{13} - T_{13}^2. \qquad (44)$$

That is, the remainder is left unchanged in the case $T_{13} = 0$. This function is performed by the simple circuitry shown in Fig. 4. The subtractor generates the flag N(egative, active low) as result bit, which also controls the multiplexer.

Fig. 4. $T_{13}$ circuitry.

For clarity, we will demonstrate the computation of the next result bit, $T_{12}$. The new intermediate remainder $R^*_{[28 \ldots 24]}$ is constructed from $R'_{28} R'_{27} R'_{26} D_{25} D_{24}$. If we add $T_{12}$, the square is increased by $4 \times T_{14}T_{13} \times T_{12} + T_{12}^2$, where $T_{14}T_{13}$ denotes the two-bit number formed by the result bits found so far. Hence $T_{12} = 1$ if

$$R^*_{[28 \ldots 24]} \ge 4 \times T_{14}T_{13} + 1. \qquad (45)$$

Depending on the result of this compare operation, the new remainder is either left unchanged or

$$R'_{[28 \ldots 24]} = R^*_{[28 \ldots 24]} - 4 \times T_{14}T_{13} - 1. \qquad (46)$$

Just as we did for decimal numbers, we can repeat this calculation until the required precision is obtained.
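The binary procedure described above maps directly onto a loop in which each iteration plays the role of one pipeline stage: two operand bits are appended to the remainder, the trial value "four times the root so far, plus one" is compared against it, and one result bit is produced. The following is a minimal software sketch of that idea; the function name `isqrt_pipeline` and its `bits` parameter are hypothetical and not taken from the design.

```python
from typing import Tuple


def isqrt_pipeline(d: int, bits: int = 30) -> Tuple[int, int]:
    """Digit-by-digit binary square root, one result bit per loop iteration.

    Returns (root, remainder) with root**2 + remainder == d.
    `bits` is the operand width and must be even.
    """
    if bits % 2 != 0 or not 0 <= d < (1 << bits):
        raise ValueError("operand does not fit the given even width")
    root = 0
    rem = 0
    for stage in range(bits // 2):
        shift = bits - 2 * (stage + 1)
        pair = (d >> shift) & 0b11      # next two operand bits
        rem = (rem << 2) | pair         # intermediate remainder R*
        trial = (root << 2) | 1         # 4 * (root so far) + 1
        if rem >= trial:                # compare, cf. Eqs. (43) and (45)
            rem -= trial                # subtract, cf. Eqs. (44) and (46)
            root = (root << 1) | 1
        else:                           # remainder left unchanged
            root <<= 1
    return root, rem


if __name__ == "__main__":
    # The integer part of the decimal example, treated as a 20 bit operand.
    print(isqrt_pipeline(972641, bits=20))
```

In hardware the comparison and the subtraction are the same operation (the sign of the difference selects the multiplexer input); the sketch separates them only for readability.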

Note that the leading bits of the preceding remainder have been dismissed. They are always 0. In general, the remainder has at most one digit more than the root. This can be shown as follows. Consider the integer numbers $Z$ and $N$ in Fig. 5, where $N$ is the integer part of the square root of $Z$. For the remainder $R$ we can formulate:

$$R \le (N + 1)^2 - N^2 - 1; \qquad (47)$$

$$R \le N^2 + 2N + 1 - N^2 - 1; \qquad (48)$$

$$R \le 2N. \qquad (49)$$

Fig. 5. (Number line: $0$, $N^2$, $Z$, $(N+1)^2$; remainder $R$.)

Fig. 6 shows the circuitry for 30 bit integers $D_{[29 \ldots 0]}$. Except for the first one, a register is inserted after each stage. The multiplexers and registers at the outputs of the subtractors have been merged into a single symbol.

Fig. 6. Square root pipeline.

In general, for the calculation of the integer part of a square root of a number with $N$ bits (where $N$ is assumed even), we need $N/2 - 1$ subtractors, starting with a 4 bit and ending with a $(N/2 + 2)$-bit subtractor. $N/2 - 2$ multiplexers are also required, starting with a 3 bit, ending with a $N/2$ bit multiplexer. In this particular case, the Square Root Unit has 14 stages, 14 subtractors from 4 to 17 bits, 13 multiplexers from 3 to 15 bits and 433 register bits.

Gate count: 6,000

10. COMPONENT FIFO

This is a 14 × 50 bit memory including the valid bit and the Z-flag per vector. The component FIFO should be realized as a register pipeline (as opposed to the usual "fall-through" architecture of FIFOs), so that the components and the vector length arrive at the same time at the inputs of the divide units without special control circuitry.

Gate count: 5,000

11. THE DIVISION UNIT

The components $t$ are taken from the component FIFO as 15 bit unsigned integers preceded by a sign bit. The vector length $T$ arrives as a 15 bit unsigned integer as well. Thus, three unsigned division pipelines, as for example explained in [4], have to be constructed. The results $(m_x; m_y; m_z)$ shall be computed as 15 bit fixed point numbers.

Obviously, $0 \le m \le 1$. In the first instance, we assume that

$$m = \sum_{j=0}^{-14} m_j 2^j \quad \text{where } m = |n|. \qquad (50)$$

The algorithm shall be explained by an example where $t = 011100111010011$ and $T = 100001111010111$. The quotient is computed bitwise using 16 bit two's complement arithmetic. $m_0 = 1$ if $t - T \ge 0$. In Fig. 7, the light grey cells contain the sign extensions of the operands. The dark cell holds the inverted result bit $m_0$.

Fig. 7. Computing $m_0$.

If $m_0 = 0$, the remainder $R^0$ must be corrected by adding $T$. Then, $R^{-1} = R^0 - T/2$, and $m_{-1} = 1$ if $R^{-1} \ge 0$. However, the same can be achieved by adding $T/2$ to $R^0$ if $m_0 = 0$ and subtracting $T/2$ from $R^0$ if $m_0 = 1$ (Fig. 8).

Fig. 8. Computing $m_{-1}$.

Note that the result bits can be excluded from further calculation since

$$|R^0| = |t - T| \le T; \qquad (51)$$

$$|R^{-1}| = \left| |R^0| - T/2 \right| \le T/2; \qquad (52)$$

$$|R^{-2}| = \left| |R^{-1}| - T/4 \right| \le T/4 \quad \text{and so forth.} \qquad (53)$$

Thus, the width of the required adder/subtracters remains constant throughout the complete pipeline. The computation is continued in this way until the required precision is reached. The last remainder is discarded.

However, this scheme makes no good use of the available precision. $m_0$ is set only in the case $t = T$. To increase the precision, we use the following format instead:

$$m = \sum_{j=-1}^{-15} m_j 2^j. \qquad (54)$$

The maximum error is then reduced to $2^{-15}$. For $t = T$, $m$ is expressed as 0.111111111111111. This is achieved by assuming $m_0 = 0$ and starting the computation with the operation $t - T/2$. The first step is given in Fig. 9.

The circuitry shown in Fig. 10 performs this function. Each pipeline stage computes one result bit. The three division pipelines consume approximately 40,000 gates.

12. THE SIGN UNIT (OUTPUTS)

The sign units at the outputs perform the inverse function of the sign units at the inputs; the arithmetic operation, however, is the same. The 15 bit positive components $m$, which are preceded by a sign flag $S$, are converted into 16 bit two's complement components $n$. The Z-flag restores the zero vector.

$$n_{-15} = m_{-15} \land Z; \qquad (55)$$

$$n_{-14} = (\bar{S} m_{-14} \lor S(\bar{m}_{-14} m_{-15} \lor m_{-14} \bar{m}_{-15})) \land Z = (\bar{S} m_{-14} \lor S(m_{-14} \oplus m_{-15})) \land Z; \qquad (56)$$

$$n_{-13} = (\bar{S} m_{-13} \lor S(\bar{m}_{-13}(m_{-14} \lor m_{-15}) \lor m_{-13} \bar{m}_{-14} \bar{m}_{-15})) \land Z \qquad (57)$$

$$\phantom{n_{-13}} = (\bar{S} m_{-13} \lor S(m_{-13} \oplus (m_{-14} \lor m_{-15}))) \land Z; \qquad (58)$$

$$n_{-1} = (\bar{S} m_{-1} \lor S(m_{-1} \oplus (m_{-2} \lor m_{-3} \lor \cdots \lor m_{-14} \lor m_{-15}))) \land Z. \qquad (59)$$

$$n_0 = S. \qquad (60)$$
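The division scheme of Section 11 and the output conversion of Section 12 can be sketched in software as follows. This is an illustrative model under stated assumptions, not the pipeline itself: the remainder is doubled in each step instead of halving $T$ (which keeps everything in integers and is arithmetically equivalent), the names `divide` and `sign_unit_out` are hypothetical, and the output helper returns a signed integer in units of $2^{-15}$ rather than a 16 bit pattern.

```python
def divide(t: int, T: int, bits: int = 15) -> int:
    """Non-restoring division t / T for 0 <= t <= T, yielding `bits` fraction bits.

    The result m is an integer in units of 2**-bits, following the format
    of Eq. (54); t == T gives all ones (0.111...1).
    """
    if T <= 0 or not 0 <= t <= T:
        raise ValueError("expects 0 <= t <= T and T > 0")
    r = 2 * t - T                  # first step: t - T/2, scaled by 2
    m = 0
    for _ in range(bits):
        bit = 1 if r >= 0 else 0   # result bit = inverted sign of the remainder
        m = (m << 1) | bit
        # Subtract after a 1, add after a 0 (no separate restoring step).
        r = 2 * r - T if bit else 2 * r + T
    return m                       # the last remainder is discarded


def sign_unit_out(m: int, s: int, z: int) -> int:
    """Apply sign flag s and the active-low zero flag z to magnitude m.

    Simplification of this sketch: a zero magnitude stays 0 regardless of s.
    """
    if z == 0:
        return 0
    return -m if s else m


if __name__ == "__main__":
    t, T = 0b011100111010011, 0b100001111010111  # example operands of Section 11
    m = divide(t, T)
    print(m, m / 2**15, sign_unit_out(m, 1, 1))
```

Because every step either subtracts or adds $T$ depending on the previous result bit, the quotient bits are the same as those of an ordinary restoring division, which is what the test of the sketch relies on.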

Gate count: 1,500

13. CONTROL STRUCTURE

If the component FIFO is realized as a register pipeline (as opposed to the usual "fall-through" architecture of FIFOs), no internal control structure is required. All operands travel the same distance and so the chip just has to be clocked. Provisions are made to freeze the pipeline. The activation of an external signal masks the clock. This circuitry is designed very carefully to avoid spikes on the internal clock lines. The valid flags, one for each stage, must be reset during initialization. The valid flag must be held active at the inputs whenever a vector is clocked in. Normalized vectors are available as long as the valid vector output maintains an active state.

"Design-for-Testability" features are also taken into account. We use scan-path flip-flops for all registers to construct a number of scan chains.

Fig. 9. Computing $m_{-1}$ for higher precision format.

Fig. 10. Division pipeline.

14. ERROR ESTIMATION

Incoming components $v$ are considered to be "true values." The normalization of $\vec{V}$ without rounding errors will give the exact unit vector $\vec{N}_E$. We will derive an error vector $\Delta\vec{N}$, so that $\vec{N} = \vec{N}_E + \Delta\vec{N}$. The sign units at the inputs operate precision conserving, that is

$$a = |v|. \qquad (61)$$

Depending on their size, the operands are possibly right shifted and truncated by the alignment unit and the scale unit. For simplicity, let's assume that the error $\Delta t$ is defined by

$$-1 \le \Delta t \le 0. \qquad (62)$$

Due to this truncation of the components, a change in the direction of the normalized vector might occur. Instead of the vector $\vec{V} = \{v_x; v_y; v_z\}$, the vector $\vec{T} = \{t_x; t_y; t_z\}$ is normalized. Provided this computation is carried out accurately, the maximum deviation occurs for

$$\vec{T} = \{V_{\max}; 0; 0\} \quad \text{and} \quad \Delta t_y = \Delta t_z = -1 \quad \text{where } V_{\max} = 2^{14}. \qquad (63)$$

The error vector $\vec{M}_D$ is then given by:

$$\vec{M}_D = (0; -2^{-14}; -2^{-14}) \quad \text{where } M_D = 8.63 \times 10^{-5}. \qquad (64)$$

Any other permutation of the components in (63) and (64) yields the same result for $M_D$.

However, there might be an error in $T$, so that $\vec{T}$ is not scaled properly. We have to distinguish two cases:

1. There was no shift operation in the scale unit. $T^2$ is the true squared length of $\vec{T}$. However, the limited precision of the square root unit causes a truncation error $\Delta T$. For simplicity, we assume that

$$-1 \le \Delta T \le 0. \qquad (65)$$

2. The scale unit performed a right shift operation. The squared vector length is divided by 4 and the two LSBs are discarded. After this operation, the range of $T^2$ is given by

$$10000000\mathrm{H} \le T^2 \le \mathrm{2FFF4000H} \qquad (66)$$

and therefore, the truncation error of $T^2$ can be neglected. So it can be said that

$$T^2 = \left(\frac{q_x}{2}\right)^2 + \left(\frac{q_y}{2}\right)^2 + \left(\frac{q_z}{2}\right)^2. \qquad (67)$$

On the other hand,

$$\vec{T} = \{t_x; t_y; t_z\} = \left\{ \left\lfloor \frac{q_x}{2} \right\rfloor; \left\lfloor \frac{q_y}{2} \right\rfloor; \left\lfloor \frac{q_z}{2} \right\rfloor \right\}. \qquad (68)$$

Taking the truncation error of the square root unit into account, the resulting error $\Delta T$ is then given by

$$-1 \le \Delta T \le 0.5 \times \sqrt{3}. \qquad (69)$$

For a given $\Delta T$, the resulting vector length $M$ is given by:

$$M = \sqrt{\left(\frac{t_x}{T + \Delta T}\right)^2 + \left(\frac{t_y}{T + \Delta T}\right)^2 + \left(\frac{t_z}{T + \Delta T}\right)^2} = \sqrt{\frac{t_x^2 + t_y^2 + t_z^2}{(T + \Delta T)^2}}. \qquad (70)$$

For the moment we assume that the divide units operate at infinite precision. Then the error vector $\vec{M}_S$ is given by:

$$\vec{M}_S = s \times \vec{N}_E \quad \text{where } -\sqrt{3} \times 2^{-15} \le s \le 2^{-14}. \qquad (71)$$

The maximum truncation error of the divide units is $-2^{-15}$ for each component. This produces an additional error vector $\vec{M}_T$, given by:

$$\vec{M}_T = (-2^{-15} \ldots 0; -2^{-15} \ldots 0; -2^{-15} \ldots 0). \qquad (72)$$

The error vector $\Delta\vec{N}$ is then defined by:

$$\Delta\vec{N} = \vec{M}_D + \vec{M}_S + \vec{M}_T. \qquad (73)$$

However, the worst-case conditions of $\vec{M}_S$ and $\vec{M}_T$ are mutually exclusive, and thus the upper limit of deviation is given by:

$$\Delta\vec{N} = (2^{-14}; -3 \times 2^{-15}; -3 \times 2^{-15}) \quad \text{and} \quad \Delta N = 1.43 \times 10^{-4}, \qquad (74)$$

or by any other permutation of the components. The sign units at the outputs again operate precision conserving.

15. ERROR STATISTICS

The error statistics were produced by feeding 10,000 randomly generated vectors into the hardware simulator, and by comparing the outputs with the results of appropriate C programs. The results of the C programs were computed at full compiler precision (double precision floating-point) and are considered to be the "true" values. Each vector sets one dot in the diagrams. For Fig. 11, the length of the vectors as they would be returned by the chip was computed. In Fig. 12, the length of the error vector $\Delta\vec{N}$ is shown. The maximum value is about $0.91 \times 10^{-4}$, which agrees well with the estimated upper limit of (74). Fig. 13 shows the angular deviation in degree seconds. All figures reveal the additional truncation error introduced by the scale unit, if the input vector length exceeds $2^{15} - 1$.

16. DESIGN COMPLEXITY
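As a quick consistency check, the approximate gate counts quoted for the individual functional units in Sections 4 to 12 can be added up and converted into a raw gate requirement for a given array utilization. The sketch below does only this arithmetic; the dictionary keys are descriptive labels chosen here, and `raw_gates_required` is a hypothetical helper.

```python
import math

# Approximate gate counts as quoted per functional unit in the text.
GATE_COUNTS = {
    "sign units (inputs)": 3_000,
    "alignment unit": 3_000,
    "square units": 3_500,
    "triple input adder": 3_000,
    "scale unit": 1_500,
    "square root unit": 6_000,
    "component FIFO": 5_000,
    "division pipelines": 40_000,
    "sign units (outputs)": 1_500,
}


def raw_gates_required(used_gates: int, utilization: float) -> int:
    """Raw gates a gate array must offer so that `used_gates` fit at `utilization`."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return math.ceil(used_gates / utilization)


total_gates = sum(GATE_COUNTS.values())

if __name__ == "__main__":
    print(total_gates, raw_gates_required(total_gates, 0.5))
```

With the quoted figures the sum is 66,500 gates, and a 50% utilization corresponds to 133,000 raw gates.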

The total number of gates needed for the functional units is approximately 66,500. Assuming a 50% array utilization, which should be achievable in consideration of the regular structure of the chip, Gate Arrays providing more than 130,000 raw gates come into consideration.

Fig. 11. Length of $\vec{N}$.

Fig. 12. Length of $\Delta\vec{N}$.

Fig. 13. Angular deviation from "true" unit vector in degree seconds.

17. CONCLUSION

We presented a single-chip VLSI solution to one of the essential tasks in computer graphics, the normalization of vectors. This approach is superior to other hardware solutions such as look-up tables or microprogrammed ALUs, since it achieves both high speed and high accuracy at minimum cost. Advances in VLSI technology can be exploited directly. Using the upcoming ASIC generations and enhanced design methodologies (Standard Cells or Full Custom Design), the clock frequency can be increased significantly while decreasing the required chip area at the same time. Thus, the vector normalizer will turn into a macrocell which can be placed along with additional functional units onto a single chip, so that a complete Phong shader with a generation rate in excess of 100M pixel/s will be feasible as a single-chip device in the near future.

Acknowledgements-This work was supervised by Prof. Straßer and is part of the advanced graphics accelerator project at WSI/GRIS, University of Tuebingen, supported partially by the CEC's ESPRIT programme. Claus Dreischer, who implemented the design, gave valuable suggestions and provided the gate counts and error statistics. Thanks to Andreas Schilling for many helpful discussions. The Gate Array library (TC160G) was provided by Toshiba.

REFERENCES

1. J. F. Blinn, Computer Display of Curved Surfaces. Ph.D. Thesis, University of Utah, Department of Computer Science (1978).
2. Phong Bui-Tuong, Illumination for computer-generated pictures. CACM, 18(6), 311-317 (1975).
3. S. Gottwald, H. Küstner, M. Hellwich, and H. Kästner (Eds.), Handbuch der Mathematik. Buch und Zeit Verlagsgesellschaft, Köln, 44-45 (1986).
4. Rolf Hoffmann, Rechnerentwurf: Rechenwerke, Mikroprogrammierung, RISC. Oldenbourg Verlag, München, 130-131 (1993).
5. Günter Knittel, VERVE-Voxel engine for real-time visualization and examination. Computer Graphics Forum, 12(3), 37-48 (1993).
6. Günter Knittel and Wolfgang Straßer, An accelerator for Voxel Graphics. In Proc. of the 2nd ITG/GI Workshop on Workstations (Hagen, May 24-27, 1993), VDE-Verlag, Berlin, 69-78 (1993).
7. Günter Knittel and Wolfgang Straßer, A compact volume rendering accelerator. In Proc. of the ACM/IEEE Symposium on Volume Visualization (Washington, DC, October 17-18, 1994) (1994).