Journal of Microcomputer Applications (1990) 13,43-55
On parallel scan-conversion algorithms for transputer networks
H. E. Bez and L. Parks
Department of Computer Studies, University of Technology, Loughborough, Leicestershire LE11 3TU, U.K.
Two fundamental scan line procedures are considered from the point of view of parallel implementation on transputer networks. The paper describes algorithms for the scan conversion of polygons, and the related hidden surface elimination problem for three-dimensional models with polygonal data structures. In each case parallel designs have been implemented and timed in various configurations against a functionally equivalent, but not necessarily optimized, single processor code. The results presented demonstrate that scan line algorithms can be efficiently implemented on suitably configured networks of transputers.
1. Introduction
Computer graphics is widely recognized as an ideal problem domain for parallel computing [1-3]. Many end-users of systems with graphical interfaces are demanding increasing sophistication and, by implication, a much higher performance from the graphics component. Examples, for which high quality fully rendered images are required on the fly, include computer-aided engineering design, flight simulation, and animation. One solution is to design special hardware accelerators to provide the required performance [4,5]. This approach, however, tends to be inflexible and is not readily updated, generalized or reconfigured. Alternatively, software solutions are feasible using reconfigurable general purpose parallel processing systems which meet the computational demands of these and similar applications. The software approach, to which we address ourselves in this paper, more readily lends itself to rapid prototyping and optimization. Until recently, however, readily usable and affordable parallel processing systems were not widely available; but the advent of the INMOS transputer, and other similar microcomputer chips, has led to a proliferation of relatively inexpensive workstations with a reconfigurable parallel processing sub-system. These novel systems provide an opportunity to design and compare parallel algorithms for a wide variety of application areas, including computer graphics. Variations in the structure of algorithms can be investigated to optimize performance for different transputer configurations, matching algorithm design and network topology. In this paper we describe some experimental work on the parallel implementation, on a transputer-based workstation, of some graphical procedures that are fundamental to many computer graphics applications, and to the implementation of general purpose graphics systems, such as GKS.
Many problems in this domain are ideally suited to coarse-grain parallel implementation and the transputer is therefore a compute node of the appropriate type for our purposes.

0745-7138/90/010043+13 $03.00/0 © 1990 Academic Press Limited
Figure 1. Overview of the transputer system.
The problems chosen, i.e. the scan conversion of polygons and the elimination of hidden surfaces, direct the attack at the potential output bottleneck of many graphics applications. The loan of a transputer system, based on the INMOS B004 and B003 evaluation boards hosted by a PC AT clone, from the RAL/DTI transputer initiative has enabled us to implement and test our algorithms in OCCAM 2. An overview of the architecture, which is an example of a MIMD parallel processing machine without shared memory, is shown in Figure 1. Two standard configurations for MIMD parallel processing systems, to exploit different types of parallelism, are processor pipelines and processor farms. A processor pipeline is an ordered set of processors in which the output of each processor is the input of its successor, and concurrency is achieved by allowing a number of sub-tasks to be in various stages of execution at the same time. A processor farm consists of a network of slave, or worker, processors linked to a master or farm processor which is in turn linked to a host machine or a further network. Each of the slave processors performs the same task and, normally, data packets (i.e. items for one execution of the task) are sent one at a time to the network by the master processor. The master may also receive output from the slave processors and deliver it to the required destination. Alternatively, in cases where this might lead to a bottleneck and impair efficiency, output data may be routed more directly to its destination; this is often the case for transputer representations of processor farms where, due to the limited number of communication links available, it is necessary to construct the farm in a way that requires all but one of the slave processors not to be linked directly to the master processor.
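The processor farm idea can be illustrated with a small sketch in Python. This is only an analogy, not the OCCAM 2 code used in the paper: threads and in-memory queues stand in for transputers and links, and the names `run_farm`, `task` and `n_workers` are hypothetical. The master hands out data packets one at a time, each idle worker takes the next packet, and the master collects the results.

```python
import queue
import threading


def run_farm(packets, task, n_workers):
    """Apply `task` to every packet using `n_workers` identical workers."""
    work = queue.Queue()      # master -> workers
    results = queue.Queue()   # workers -> master

    def worker():
        while True:
            item = work.get()
            if item is None:              # sentinel: no more packets
                break
            index, packet = item
            results.put((index, task(packet)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in enumerate(packets):       # master sends packets one at a time
        work.put(item)
    for _ in threads:                     # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()

    collected = [results.get() for _ in range(len(packets))]
    # Workers finish in an arbitrary order, so restore the packet order.
    return [value for _, value in sorted(collected, key=lambda r: r[0])]


if __name__ == "__main__":
    print(run_farm([1, 2, 3, 4], lambda x: x * x, n_workers=3))
```

Because every worker runs the same task and simply takes the next available packet, the load balances itself when packets differ in cost; the price is the traffic through the master, which is the bottleneck the text describes.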
Figure 2. A pipeline of processors.
Figure 3. A processor farm architecture.
It is to be noted that if there are to be more than three worker processors on a transputer-based farm implementation then there will be insufficient transputer links available to map the network shown in Figure 3 directly onto the hardware. The usual solution to this [6] is to map the processor farm onto the hardware topology illustrated in Figure 4, where it can be seen that the physical connectivity is similar to that of a pipeline of processors; the software configuration is, however, quite different. The mapping uses two transputer links per worker processor, allowing the remaining two to be used, for example, for routing computed data to its destination. Throughout the paper we have based our efficiency calculations on the time required to run the algorithms on a multi-transputer network, relative to the time required to run the same algorithm on a single transputer. We are thus measuring how efficiently the algorithms distribute, rather than estimating their performance against a best serial solution. We define the speedup $S(A)$ and efficiency $E(A)$ of a distributed algorithm $A$, executing on $N$ processors, to be

$S(A) = T_1/T_N$ and $E(A) = 100\,S(A)/N$,

where $T_q$ denotes the time taken for $A$ to execute on a network consisting of $q$ transputers in some configuration.
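As a worked illustration of these definitions, the following Python sketch computes the speedup and efficiency from measured run times. The function names are hypothetical; the sample times are those reported for the smaller test polygon in Table 1.

```python
def speedup(t_1, t_n):
    """S(A) = T_1 / T_N: single-processor time over N-processor time."""
    return t_1 / t_n


def efficiency(t_1, t_n, n):
    """E(A) = 100 * S(A) / N, expressed as a percentage."""
    return 100.0 * speedup(t_1, t_n) / n


if __name__ == "__main__":
    # Times for the smaller test polygon, indexed by processor count.
    times = {1: 480, 2: 267, 3: 194, 4: 163}
    for n, t in times.items():
        print(n, round(speedup(times[1], t), 2), round(efficiency(times[1], t, n)))
```

An efficiency of 100% corresponds to a speedup equal to the number of processors; values above 100% indicate super-linear speedup.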
Figure 4. Transputer realization of a processor farm.
In general, if high efficiency values are to be achieved, it is necessary to both balance the work loads between the processors and to maximize the compute to communication ratio for the network. In the second section of the paper we discuss our parallel solution to the problem of filling the interior of a region defined by a polygonal boundary. The third section describes the extension of the polygon fill primitive to a hidden surface elimination algorithm by incorporating depth processing.
2. Scan converting polygons
Initially two approaches to the scan conversion of polygons were considered: the simple scan line algorithm using scan line coherence, and the approach that incorporates both scan line and edge coherence. Serial algorithms for both techniques are widely documented [7]. Other workers [8] have considered the same problem for a shared bus, coarse-grained parallel processing architecture with global memory. They have sought to compare alternative parallelizations of standard algorithms whereas our algorithm is based on a strategy designed, ab initio, for parallel execution. We compare our approach and results with those of Ghosal and Patnaik [8]. Our method is shown to extend, in a natural way, to a full hidden surface elimination algorithm. The algorithm assigns particular scan lines to particular processors according to the number of worker processors available. If two workers are available, then for each polygon one processor is responsible for the even numbered scan lines that intersect the polygon, and the other for the odd numbered ones. In general, if there are N worker processors available then each one is responsible for processing every Nth scan line within the polygon extent, and the mapping chosen processes those scan lines with numbers having remainder r on division by N on processor number r + 1. The pth processor, where $1 \le p \le N$, uses the incremental formulae

$x_{i+1} = x_i + N/m$ and $y_{i+1} = y_i + N$,

where $N$ is the number of processors and $m$ is the slope of the edge, to determine the intersections that its scan lines have with the edge, see Figure 5. It is to be noted that the formulae use real arithmetic. This could, of course, be replaced by an integer method akin to Bresenham's algorithm for scan converting vectors [7].

Figure 5. The distribution of scan lines to processors. If $N = 4$ then the point $(x_{i+1}, y_{i+1})$ will be computed subsequent to $(x_i, y_i)$ on the same processor, i.e. processor number $1 + y_i \bmod 4$. (Figure labels: polygon edge; $(x(j), y(j))$.)

The question of initializing or seeding this incremental procedure on each processor and on each polygon edge arises. To do this it is necessary to find, for the pth processor, the scan line of least index that intersects the edge being processed. If $\{[x(j), y(j)], [x(j+1), y(j+1)]\}$ denotes a polygon edge, and $y(j+1) > y(j)$, then the y-value of the seed for processor number $p$ is given by

$y(j) - y(j) \bmod N + (p - 1)$ if $(p - 1) \ge y(j) \bmod N$

and

$y(j) - y(j) \bmod N + (N + p - 1)$ if $(p - 1) < y(j) \bmod N$,
from which the corresponding x value may be determined. For cases where y(j + 1)
fashion on a single processor, i.e. each compute node of the network executes an instance of a parameterized PROC, denoted Q(), having the form

    PAR
      throughput
      sc_2
      output
The process for the pth processor can be written Q(p, ...), where it is to be noted that the chief difference between the processes Q(p, ...) and Q(p', ...) is in the set of scan lines processed by each (as determined by the congruence $(p - 1) \equiv y \pmod{N}$, discussed earlier in the paper). The distributed algorithm, on an N processor network, may therefore be expressed as

    PLACED PAR p = 1 FOR N
      Q(p, ...)

The treatment of singularities, relating to the special cases where scan lines are parallel to polygon sides or intersect polygon vertices, is by standard methods and is omitted from the pseudo-code for clarity.

    PROC sc_2
    /* two-dimensional polygon scan conversion procedure */
    WHILE polygons to scan convert DO
    BEGIN
      get a polygon Π = {r_1, ..., r_k} from throughput
      FOR each polygon edge E ∈ {(r_1, r_2), ..., (r_k, r_1)} DO
      BEGIN
        (i)   find the first scan line that intersects the edge E and
              evaluate the x-co-ordinate of the intersection point.
        (ii)  step along the edge E incrementing x appropriately for
              each scan line intersection
        (iii) sort into the existing list of intersections (for this
              line) for the polygon being processed (x value, type)
              sorted on increasing x value
      END
      FOR each scan line DO
      BEGIN
        pair the computed data points for the polygon Π on their x
        values and add these pairs to the list for the whole polygon
        set being processed; note that at this point they are just
        added to the end of the list and not sorted into it. Each
        member of the list, for a given scan line, has the form
        (x1, x2, poly_id) where x1 ≤ x2
      END
    END

Figure 6. Main steps of the two-dimensional polygon scan converter.
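The work done by one worker can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' OCCAM 2 code: vertex coordinates are taken to be integers, horizontal edges are skipped, and each edge is treated as half-open (its upper vertex is excluded) as one standard way of handling the singularities that the pseudo-code omits. The names `seed_scan_line` and `worker_spans` are hypothetical, and the polygon identifier carried in the list elements is left out for brevity.

```python
def seed_scan_line(y_low, p, n):
    """Least scan line y >= y_low with y mod n == p - 1 (processor p of n)."""
    r = y_low % n
    base = y_low - r
    if p - 1 >= r:
        return base + (p - 1)
    return base + n + (p - 1)


def worker_spans(polygon, p, n):
    """Return the spans (y, x1, x2) on the scan lines owned by processor p."""
    hits = {}                                   # scan line -> x intersections
    k = len(polygon)
    for i in range(k):
        (xa, ya), (xb, yb) = polygon[i], polygon[(i + 1) % k]
        if ya == yb:
            continue                            # horizontal edge (assumed rule)
        if ya > yb:                             # orient the edge upwards
            (xa, ya), (xb, yb) = (xb, yb), (xa, ya)
        inv_slope = (xb - xa) / (yb - ya)       # 1/m
        y = seed_scan_line(ya, p, n)
        x = xa + (y - ya) * inv_slope           # seed intersection
        while y < yb:                           # upper vertex excluded (assumed)
            hits.setdefault(y, []).append(x)
            y += n                              # y_{i+1} = y_i + N
            x += n * inv_slope                  # x_{i+1} = x_i + N/m
    spans = []
    for y in sorted(hits):
        xs = sorted(hits[y])
        # Pair successive intersections into interior spans.
        spans.extend((y, xs[j], xs[j + 1]) for j in range(0, len(xs) - 1, 2))
    return spans


if __name__ == "__main__":
    triangle = [(0, 0), (8, 0), (0, 8)]
    for p in range(1, 5):
        print(p, worker_spans(triangle, p, 4))
```

Running the four workers over the same polygon yields disjoint sets of scan lines whose union covers the polygon extent, which is why no communication between workers is needed during the fill.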
The timing tests, shown in Table 1 below, are those obtained for scan converting the polygon Π, of Figure 8, and a polygon O which may be obtained from Π by multiplying its vertex co-ordinates by two.
Figure 7. Slave processor for polygon scan conversion (links: from Previous, to Previous, to Next, from Next; computed data to destination).

Table 1. Timed execution of the scan conversion algorithm for polygons Π and O

No. of            Time              S                E
processors      Π       O        Π       O        Π       O
1              480    1322
2              267     686      1.8     1.93     90%     97%
3              194     478      2.47    2.77     82%     92%
4              163     372      2.94    3.55     74%     90%
Figure 8. The test polygon Π. Labelled vertices: (22, 80), (46, 52), (56, 52), (2, 38), (26, 30), (40, 30), (52, 30), (66, 30), (22, 2), (46, 2), (58, 2).
As the figures in Table 1 show, the performance of the polygon scan converter is problem dependent, and the question arises as to what is a suitable polygon, or set of polygons, on which algorithms can be benchmarked. The larger the polygon, the more computation each processor in the network has to do without any increase in the communication overhead and, consequently, the better the performance. The algorithm is very efficient on the larger polygon O but also performs well on the smaller, and perhaps more typical, polygon Π.
3. The hidden surface elimination algorithm
Most of the widely known hidden surface elimination techniques [7] are amenable to some form of parallel implementation. For example, the ray tracing method has been efficiently implemented on a transputer-based system configured as a processor farm [6]. Our hidden surface algorithm is scan line based and has been developed from the polygon scan conversion algorithm described in the previous section. It incorporates z-depth sorting on line segments, belonging to polygon interiors, in a manner analogous to the painter's algorithm. An outline is given in the pseudo-code algorithm shown in Figure 9 and further details are contained in the appendix; hse executes after the process sc_3 has terminated. The sc_3 procedure is a simple extension of sc_2, incorporating three-dimensional processing by incrementing the z-coordinate in addition to the x-coordinate of each intersection point. The compute node for the complete algorithm is shown in Figure 10.

    PROC hse
    FOR each scan line DO
    BEGIN
      (i)   sort the vector segments on their largest z-coordinate to
            form a list L
      (ii)  re-sort the list L to resolve problems that may occur when
            vectors' z-extents overlap
      (iii) output the modified list L to be written into refresh
            buffer in order
    END

Figure 9. Main steps of the hidden detail removal algorithm.
To resolve the ambiguities alluded to in step (ii) of hse, it is necessary to carry out some z-depth tests to ensure that the vector segments are scan converted in the correct order. If V denotes the vector at the start of the sorted list L, then before the position of V in the list L is confirmed, it must be tested against each vector U in the list L whose z-extent overlaps the z-extent of V. This test is a sequence of up to three sub-tests, performed in order of increasing complexity. As soon as one of the sub-tests succeeds, the position of V on L, relative to a vector U having overlapping z-extent with V, is confirmed. If all three tests fail, the positions of V and U in the list L must be swapped.

(1) The vectors' x-extents do not overlap; hence V must be scan converted before U, and be in its correct position relative to U in the list L.
(2) V (tested line) is wholly on that side of the line containing U (test line) which is further from the viewer; hence V must be scan converted before U, and be in its correct position relative to U in the list L.
(3) U (tested line) is wholly on that side of the line of V (test line) which is nearer to the viewer; hence V must be scan converted before U and be in its correct position, relative to U, in the list L.

Figure 10. Slave processor for hidden surface elimination (links: from Previous, to Previous, to Next, from Next; output; computed data to destination).

A pseudo-code representation of the re-sort procedure is given in the Appendix to the paper. The code uses a left-handed co-ordinate system, as is conventional for image-space computations in computer graphics. We ran the algorithm for a number of worst case polygon data sets, i.e. those for which the initial sorting process produces a list ordering that is the complete inverse of that which is required. These cases provide the highest achievable efficiency values since they require the maximum amount of sorting. The results for a typical worst case are shown in Table 2, where it can be seen that the increased degree of computation involved, over the two-dimensional case, has produced significantly improved efficiency coefficients for the distributed codes. Although the algorithm produces super-linear performance with two processors for these problems, it should be stressed that for a more typical problem the speed-up will be sublinear.
Table 2. Typical worst-case performance for the hse algorithm

No. of processors    Time      S        E
1                    8050
2                    3922     2.05     103%
3                    2713     2.97      99%
4                    2116     3.8       95%
4. Conclusions
The results presented in the paper show that scan-line algorithms for polygon filling and hidden surface elimination can perform well on networks of transputers. They exhibit good speed-up and efficiency coefficients for small numbers of processors. It is to be noted that the algorithms described are also suited to implementation on a shared memory parallel processing system, of the type considered by Ghosal and Patnaik [8]. Their two-dimensional polygon fill methods also appear to perform well, showing comparable speedup figures to ours on similar polygons. However, it does seem that their theoretical performance estimates, and therefore presumably also their experimental results, exclude consideration of the computation overhead involved in the construction of the edge table (which uses a parallel bucket sort), and this makes direct comparison difficult. They do not consider the three-dimensional case in the paper cited. The current version of the scan-line algorithm for hidden surface elimination, whilst taking full account of coherence in the x, y and z directions during the initial phase of processing (in sc_3), does not use z-coherence in the re-sort procedure; i.e. re-sort makes no use of the fact that data items (i.e. segment lists) for scan lines that are 'close together' are likely to require re-sorting in precisely the same way. It should be possible to develop hse to incorporate this depth coherence into the final sorting, enabling scan lines to be re-sorted in groups and thus reducing the total time required to process a model. However, such modifications are unlikely to significantly alter the speedup and efficiency values for the distributed codes. In addition, the use of integer-only methods, such as Bresenham's [7], in the processing of polygon edges will improve the throughput of the code if not the relative performance of the distributed forms.
Acknowledgements
We wish to thank the Rutherford Appleton Laboratory and the Department of Trade and Industry for the provision of both a transputer system and a training course under the transputer initiative, to support the work described in this paper. In addition we would like to thank Professor D. J. Evans, of the Department of Computer Studies at Loughborough, for allowing us access to an alternative transputer system, also on loan under the transputer initiative, for some of the development work.
References
1. A. Glassner & H. Fuchs 1985. Hardware enhancements for raster graphics. In Fundamental Algorithms for Computer Graphics, NATO ASI Series F: Computer and Systems Sciences, Vol. 17. Berlin: Springer-Verlag, 631-658.
2. P. M. Dew, J. Dodsworth & D. T. Morris 1985. Systolic array architectures for high performance CAD/CAM workstations. In Fundamental Algorithms for Computer Graphics, NATO ASI Series F: Computer and Systems Sciences, Vol. 17. Berlin: Springer-Verlag, 659-694.
3. A. C. Kilgour 1985. Parallel architectures for high performance graphics systems. In Fundamental Algorithms for Computer Graphics, NATO ASI Series F: Computer and Systems Sciences, Vol. 17. Berlin: Springer-Verlag, 695-703.
4. G. Abram & H. Fuchs 1984. VLSI architectures for computer graphics. Proc. NATO studies. Berlin: Springer-Verlag.
5. A. Thomas 1987. Specialised hardware for computer graphics. In Techniques for Computer Graphics. Berlin: Springer-Verlag.
6. J. Packer 1987. Exploiting concurrency: a ray tracing example. INMOS Technical Note 7.
7. J. D. Foley & A. Van Dam 1982. Fundamentals of Interactive Computer Graphics. Reading, MA: Addison-Wesley.
8. D. Ghosal & L. M. Patnaik 1986. Parallel polygon scan conversion algorithms: performance evaluation on a shared bus architecture. Computers and Graphics, 10, 7-25.
9. M. Hu & J. D. Foley 1985. Parallel processing approaches to hidden-surface removal in image space. Computers and Graphics, 9, 303-317.
Helmut Bez received a first class degree in Mathematics in 1972 from the University of Wales, and MSc and DPhil degrees from Oxford University in 1973 and 1976 respectively. In 1976 he joined Rolls-Royce Aero Engines, and in 1980 was appointed to the academic staff at Loughborough University of Technology, where he is a senior lecturer in the Department of Computer Studies. His research interests include computer graphics, computer aided design, parallel processing, man-machine interfaces and mathematical methods. Publications include research papers in these areas and a successful textbook on mathematics for computer science.

Lesley Parks received her BSc(Econ) degree in 1969 from the University of London and then worked in education for several years. In 1986 she was awarded an MSc in computer science from the University of Newcastle and was appointed as a programmer in the Department of Computer Studies at Loughborough University of Technology. Her research interests include educational software and parallel processing.
Appendix
This appendix contains pseudo-codes for the sc_3 procedure and the re-sort step of the hse algorithm presented in the paper; pseudo-codes for the depth checking primitives of re-sort are also included. List elements U and V, for a given scan line, have the form

    V = [xv(1), xv(2), zv(1), zv(2), poly_id], xv(1) ≤ xv(2)
    PROC sc_3
    /* three-dimensional polygon scan conversion procedure */
    WHILE polygons to scan convert DO
    BEGIN
      get a polygon Π = {r_1, ..., r_k} from throughput
      FOR each polygon edge E ∈ {(r_1, r_2), ..., (r_k, r_1)} DO
      BEGIN
        (i)   find the first scan line that intersects the edge E and
              evaluate the x- and z-coordinates of the intersection
              point.
        (ii)  step along the edge E incrementing x and z appropriately
              for each scan line intersection
        (iii) sort into the existing list of intersections (for this
              line) for the polygon being processed (x value, z value,
              type) sorted on increasing x value
      END
      FOR each scan line intersecting Π DO
      BEGIN
        pair the computed data points for the polygon Π on their x
        values and add these pairs to the list for the whole polygon
        set being processed; note that at this point they are just
        added to the end of the list and not sorted into it. Each
        member of the list, for a given scan line, has the form
        (x1, x2, z1, z2, poly_id) where x1 ≤ x2
      END
    END
    PROC re-sort (L);
    /* sorting procedure for z-depth processing, L is the input list
       with pointer array NL */
    L_pointer ← L_head
    REPEAT
      V ← L(L_pointer)
      /* determine the sublist of vectors, zlist with pointer array NZ,
         having overlapping z-extent with V */
      compute zlist (V)
      z_pointer ← zlist_head
      swapped ← true
      REPEAT
        IF not (swapped) THEN z_pointer ← NZ(z_pointer)
        U ← zlist(z_pointer)
        swapped ← true
        IF No_Overlap_of_X(V, U) THEN swapped ← false
        ELSE
        BEGIN
          IF V_Behind_U(V, U) THEN swapped ← false
          ELSE
          BEGIN
            IF U_Infront_V(V, U) THEN swapped ← false
          END
        END
        IF swapped THEN
        BEGIN
          swap V and U in L
          V ← L(L_pointer)
          compute zlist (V) /* replace V by U in existing zlist */
          z_pointer ← zlist_head
        END
      UNTIL (NZ(z_pointer) = Null)
      L_pointer ← NL(L_pointer)
    UNTIL (NL(L_pointer) = Null)
The pseudo-codes for the depth checking functions of re-sort now follow.

    FUNCTION No_Overlap_of_X(V, U): Boolean
    BEGIN
      IF ((xv(2)

    FUNCTION V_Behind_U(V, U): Boolean
    /* V tested line (finite), U test line (infinite) */
    compute_test_line_constants (U, A(U), B(U), C(U))
    BEGIN
      IF ((A(U)xv(1) + B(U)zv(1) + C(U) < 0) AND
          (A(U)xv(2) + B(U)zv(2) + C(U) < 0))
      THEN V_Behind_U ← true
      ELSE V_Behind_U ← false
    END

    FUNCTION U_Infront_V(V, U): Boolean
    /* V test line (infinite), U tested line (finite) */
    compute_test_line_constants (V, A(V), B(V), C(V))
    BEGIN
      IF ((A(V)xu(1) + B(V)zu(1) + C(V) > 0) AND
          (A(V)xu(2) + B(V)zu(2) + C(V) > 0))
      THEN U_Infront_V ← true
      ELSE U_Infront_V ← false
    END

    PROC compute_test_line_constants (L, A(L), B(L), C(L))
    BEGIN
      A(L) ← zL(2) - zL(1)
      B(L) ← xL(1) - xL(2)
      C(L) ← -zL(1) B(L) - xL(1) A(L)
    END
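The depth checking functions can be sketched in Python as shown below. The line constants and the two sign tests follow the pseudo-code; the body of the x-overlap test is an assumption, since only its opening survives in the listing, and the names `Segment`, `line_constants` and `must_swap` are hypothetical. The sign convention relies on each segment having x1 < x2 and on z increasing away from the viewer, as in the left-handed system used in the paper.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One list element for a scan line: x-extent, z-extent and polygon id."""
    x1: float
    x2: float
    z1: float
    z2: float
    poly_id: int


def line_constants(s):
    """Constants of the line A*x + B*z + C = 0 through the ends of s."""
    a = s.z2 - s.z1
    b = s.x1 - s.x2
    c = -s.z1 * b - s.x1 * a
    return a, b, c


def no_overlap_of_x(v, u):
    # Assumed form: true when the two x-extents are disjoint.
    return v.x2 < u.x1 or u.x2 < v.x1


def v_behind_u(v, u):
    """Both ends of V lie on the far side of the line containing U."""
    a, b, c = line_constants(u)
    return a * v.x1 + b * v.z1 + c < 0 and a * v.x2 + b * v.z2 + c < 0


def u_infront_v(v, u):
    """Both ends of U lie on the near side of the line containing V."""
    a, b, c = line_constants(v)
    return a * u.x1 + b * u.z1 + c > 0 and a * u.x2 + b * u.z2 + c > 0


def must_swap(v, u):
    """V and U are swapped only when all three sub-tests fail."""
    return not (no_overlap_of_x(v, u) or v_behind_u(v, u) or u_infront_v(v, u))


if __name__ == "__main__":
    u = Segment(0, 10, 5, 5, 1)
    far = Segment(2, 8, 8, 9, 2)     # behind u
    near = Segment(2, 8, 1, 2, 3)    # in front of u
    print(must_swap(far, u), must_swap(near, u))
```

The tests are evaluated in order of increasing cost, so the short-circuiting `or` in `must_swap` mirrors the early exit of the nested IF statements in re-sort.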