JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 10, 188-192 (1990)

A Parallel Algorithm for the Arbitrary Rotation of Digitized Images Using Process-and-Data-Decomposition Approach

HAMID R. ARABNIA

Department of Computer Science, 415 Graduate Studies Research Center (GSRC), The University of Georgia, Athens, Georgia 30602

A parallel algorithm for the rotation of digitized images is presented. The parallelism used is of a type that is not commonly realized by parallel algorithm designers. This algorithm can be regarded as a process-and-data-decomposition type of algorithm. This is the decomposition of a process into a number of subprocesses and the allocation of each subprocess to a processor for execution, together with the decomposition of data into smaller portions and the allocation of each portion to a processor for execution. The algorithm is targeted at an MIMD machine made up of transputers. © 1990 Academic Press, Inc.

1. INTRODUCTION

We present a parallel algorithm for the rotation of digitized images. This algorithm rotates a digitized image by θ degrees about a specified point into a resultant digitized image. It does not perform antialiasing, but it can be modified to do so. The algorithm will be of interest not only to those researchers who need to manipulate digitized images in real time, but also to those who design parallel algorithms in general. We feel that the algorithm will probably be of special interest to the latter researchers, because it uses a type of parallelism that is not commonly realized by parallel algorithm designers. In general, the parallelism of a parallel machine can be exploited by one of the approaches that follow:

(a) Process decomposition: the decomposition of a process into a number of subprocesses and the allocation of each subprocess to a processor for execution. Here, the algorithm designer generates a set of concurrent processes which may operate simultaneously and cooperatively to solve a given problem. This approach is used quite commonly by users of MIMD (Multiple Instruction on Multiple Data stream) machine architectures.

(b) Data decomposition: the decomposition of data into smaller portions (they are not necessarily equal) and the allocation of each portion of data to a processor for execution. This approach is used by users of both MIMD and SIMD (Single Instruction on Multiple Data stream) machine architectures.

(c) Process and data decomposition: the decomposition of data into smaller portions and the allocation of each portion to a processor for execution, together with the decomposition of a process into a number of subprocesses and the allocation of each subprocess to a processor (or a group of processors) for execution. Thus this approach can be regarded as the combination of (a) and (b).

In this paper, we use an approach which we call the process-and-data-decomposition approach.

The rotation of digitized images has been studied on conventional computers [7]. In our previous papers, we presented a series of algorithms for the rotation of digitized images on machines with SIMD architectures [2], and on a particular network topology of transputers [1]. All these algorithms use the data-decomposition approach. Here, we investigate how a network of transputers might be used to rotate a digitized image by an arbitrary angle using the process-and-data-decomposition approach. We use an image data structure called stripcode which we have found has many nice properties in exploiting the parallelism of SIMD and MIMD machine architectures. Stripcode is essentially run-length code. Run-length code exploits the horizontal coherence between adjacent pixels on a scanline. Thus some account is taken of the image structure and, in general, a reduction in the number of data objects in the image representation results. Consider a run-length encoded image and specify a background color. Each run of pixels of the same color is called a "strip." Each strip of color, other than the background color, is specified by the position coordinates of its origin (ordinal number of its first pixel, scanline number), its length, and its color; the background color is not explicitly coded. An image made up of strips is shown in Fig. 1. The numbers inside the strips represent the order in which they would be stored in a list (1 indicates the first strip in the list, 2 indicates the second, and so on). The order used in storing the stripcode of an image is from left to right and bottom to top of the image space. The key to the success of the rotation algorithm described here is in the careful design of the data flow together with overlapping all interprocessor communications with other processing.
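To make the stripcode representation concrete, the following is a minimal C sketch of how a strip and a stripcoded image could be encoded; the field and type names are illustrative assumptions, not the representation used in the implementation described here.

    /* Illustrative sketch only: one strip of a stripcoded image.
       The background color is never stored explicitly. */
    typedef struct {
        int x;      /* ordinal number of the strip's first pixel on its scanline */
        int y;      /* scanline number */
        int length; /* number of pixels in the run */
        int color;  /* color of the strip (not the background color) */
    } Strip;

    /* A stripcoded image: strips stored left to right, bottom to top
       of the image space, as in Fig. 1. */
    typedef struct {
        Strip *strips;
        int    count;
    } Stripcode;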



FIG. 1. A stripcoded image; the numbers indicate their order.

The structure of the paper is as follows. A brief introduction to the transputer and Occam is given in Section 2. In Section 3 the transputer network is described. The rotation algorithm is described in Section 4 and illustrated by a simple example. An assessment and some concluding remarks are given in Section 5.

2. THE TRANSPUTER AND OCCAM

The algorithm presented in this paper is designed for a particular network topology of transputers. A transputer [4] is a programmable VLSI device which provides all the resources of a computer, including processing, memory, and concurrent communications, on a single chip. The reason that the term concurrency is associated with the transputer is that transputers can readily be built into networks and arrays, each working on its own job using its own local memory. The transputer has four bidirectional communication links to other transputers. Communication via a link takes place when both the inputting and the outputting processes are ready. Consequently, the process which first becomes ready must wait until the second one is also ready. The transputer has some on-chip RAM (4K bytes) together with an interface to external memory. On the same chip, there is a high-performance 32-bit processor capable of 10 to 20 RISC MIPS. Occam [5] is the lowest level at which transputers are programmed. It very closely follows the principles of Communicating Sequential Processes, CSP [3]. More conventional languages are also available.

3. THE TRANSPUTER NETWORK

The image space is divided into a number of blocks where each block contains an equal number of scanlines. Figure 2a shows an image space divided into four blocks. A pair of transputers is allocated to each block of the image space and these transputers are connected into two rings, one transputer from each pair in each ring. Each transputer pair is also connected together. The number of transputers on each ring is the same as the number of blocks. One additional transputer handles input and output; this transputer is not considered to belong to either of the two rings. Figure 2b shows the network which is required for an image space subdivided into four blocks. The network consists of two rings, namely, the outer ring (T1, T2, T3, T4) and the inner ring (T'1, T'2, T'3, T'4).

FIG. 2. The transputer network for an image space divided into four blocks. (a) An image space divided into four blocks. (b) The network of four blocks.
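As a rough illustration of this topology, the following C sketch shows, for B blocks, how the 2*B + 1 transputers and their ring neighbors and pairs might be indexed. The indexing scheme is an assumption made for illustration only; the actual link assignment is the one shown in Fig. 2b.

    /* Illustrative indexing of the network for B blocks (assumed scheme):
       outer-ring transputers 0..B-1, inner-ring transputers 0..B-1,
       plus one input/output transputer, giving 2*B + 1 in total. */
    int total_transputers(int B) { return 2 * B + 1; }

    /* Outer-ring transputer k is paired with inner-ring transputer k
       (they serve the same block of scanlines). */
    int inner_pair(int outer_k) { return outer_k; }

    /* Next transputer round a ring of B nodes (direction is arbitrary here). */
    int next_in_ring(int k, int B) { return (k + 1) % B; }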

4. THE ROTATION ALGORITHM

The operation rotates the image by θ degrees about a specified point. The image is encoded in stripcode. The stripcode for each block is held in the memory of the transputer in the outer ring which is assigned to that block. In the example shown in Fig. 2, the strips in Block 1 will be held in the memory of T1, those in Block 2 in T2, and so on. After the rotation operation each transputer in the outer ring will hold the stripcode for that part of the rotated image which is in its block, namely, the block to which the transputer is assigned.

4.1. General Description

The algorithm described in this paper can be regarded as a process-and-data-decomposition type of algorithm as defined earlier, because:

(i) The algorithm is divided into two major processes, one for each of the two rings of transputers. These two processes are executed simultaneously (they are described in Sections 4.1.1 and 4.1.2).

(ii) In addition, the image data are divided into smaller portions (the strips in each block of the image space), each allocated to a transputer in the outer ring. Each of these data portions is operated on by the transputers simultaneously.

Therefore, not only are the data portions processed simultaneously, but the two major processes are also executed simultaneously. In other words, (i) and (ii) are performed concurrently.

4.1.1. Work Done by the Transputers in the Outer Ring. The angle and the center of rotation are made available to the transputers in the outer ring. Each of these transputers performs the steps that follow on its strips:

(O1) The strips are rotated by θ degrees about the specified point: this gives the rotated image exactly, but not digitized; i.e., they cannot be displayed (note that each transputer rotates only those strips which are assigned to it). Then the rotated strips are clipped to the image space; the rectangular parts of the rotated strips that lie completely outside the image space are removed. See Fig. 3 for the pictorial representation of the result of this step. (An illustrative sketch of this rotation follows the example setup in Section 4.2.)

FIG. 3. Rotation and clip of image strips. (a) Unclipped. (b) Clipped.

The next two steps are done in sequence, scanline by scanline. The order in which the scanlines are processed is crucial. An example is shown in Fig. 4, where the image space is made up of 12 scanlines and is divided into four blocks; for each transputer the process order of the scanlines is shown. The numbers in the scanlines show the order in which they are processed. Figure 4a shows the process order of scanlines in the transputer responsible for the first block; Fig. 4b shows the order for the second block, and so on. Therefore, as can be seen from Fig. 4a, T1 first processes the first scanline (from the bottom); second, it processes the fourth scanline; third, it processes the seventh scanline; and so on.

(O2) Find the segments of the rotated strips on the current scanline (each transputer processes its scanlines in the "process order" assigned to it). A segment occupies the portion of the scanline between the points of intersection of the rotated strip and the lower edge of the scanline (see Fig. 5).

(O3) The segments found on the scanline are output to the transputer in the inner ring to which there is a direct connection. Thus in Fig. 2b, T1 sends the segments to T'1, T2 sends to T'2, and so on.

Finally, when all the scanlines have been processed:

(O4) Input the strips from the transputer in the inner ring of the connected pair. These strips are the rotated image strips which belong to the block to which this transputer is allocated.

4.1.2. Work Done by the Transputers in the Inner Ring. Each transputer in the inner ring performs the steps that follow; the first three steps are repeated as many times as there are scanlines:

(I1) Input the segments from the transputer in the outer ring that is connected directly to this transputer. As this is being done, digitize the segments' ends (truncate or round) to form strips and merge them with the strips already held in memory, keeping the strips in their correct order.

(I2) Compact the strips just received. Thus, on a scanline, adjacent strips with the same color are represented by one longer strip.

(I3) Output all strips to the next transputer in the inner ring and input all strips from the previous transputer in the inner ring. Each of these transputers now holds only the strips it received from the previous transputer in the inner ring.

Finally, when all strips on the scanlines have been received:

(I4) Output the strips to the transputer in the outer ring of the connected pair.

Notice that when the segments on a scanline are ready to be output from Tk to T'k, T'k holds the strips which belong to the same block as the segments to be received. Consider the data flow. Recall the order in which the scanlines are processed by the transputers in the outer ring: this is the order in which the data are sent to the transputers in the inner ring. Thus, the data in the memory of a transputer in the inner ring are the stripcode for one particular block. These data are passed round the inner ring so that the data belonging to that block can be added from the outer ring. The data for one scanline are added as the data in the inner ring complete one revolution. Hence, the number of revolutions of the data round the inner ring is equal to the number of scanlines in each block.

4.2. Example

As an example, take an image space which contains 12 scanlines and is divided into four blocks. Assume that step O1 has already been executed. The rotated strips and their segments on the first scanline of the second block are shown in Fig. 5; they all have the same color. The transputer identities inside the rotated strips show in which memory each of these rotated strips is held (cf. Fig. 2). The scanline process order is shown in Fig. 4, the block order in Fig. 2a, and the network in Fig. 2b. The horizontal digitization is not shown in this example.
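For concreteness, the rotation in step O1 can be pictured as rotating each strip's two endpoints about the center of rotation. The C sketch below is a minimal illustration under that assumption; the names, the endpoint-based representation of a rotated strip, and the omission of clipping and digitization are assumptions made for this example, not details taken from the implementation.

    #include <math.h>

    /* Illustrative only: a rotated strip represented by its two endpoints. */
    typedef struct { double x0, y0, x1, y1; int color; } RotatedStrip;

    /* Rotate the two endpoints of a strip (origin (x, y), given length and
       color) by theta radians about (cx, cy).  Clipping to the image space
       and digitization, which follow step O1, are not shown. */
    RotatedStrip rotate_strip(int x, int y, int length, int color,
                              double theta, double cx, double cy)
    {
        double c = cos(theta), s = sin(theta);
        double ex = x + length - 1;          /* x of the strip's last pixel */
        RotatedStrip r;
        r.x0 = cx + (x  - cx) * c - (y - cy) * s;
        r.y0 = cy + (x  - cx) * s + (y - cy) * c;
        r.x1 = cx + (ex - cx) * c - (y - cy) * s;
        r.y1 = cy + (ex - cx) * s + (y - cy) * c;
        r.color = color;
        return r;
    }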

FIG. 4. Process order of scanlines in the transputers in the outer ring. (a) Order for T1. (b) Order for T2. (c) Order for T3. (d) Order for T4.
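The following C sketch gives one scanline ordering that is consistent with the description above (T1 processes the first, fourth, seventh scanline, and so on) and with the trace in Section 4.2; the indexing and the formula are assumptions made for illustration, and the definitive order is the one shown in Fig. 4.

    /* One possible process order, assumed for illustration: with B blocks and
       L scanlines per block, outer-ring transputer k (k = 1..B) processes, on
       its t-th step (t = 1..B*L), the scanline
           ((s + k - 1) mod B) * L + r + 1,
       where r = (t - 1) / B is the revolution of the inner-ring data and
       s = (t - 1) mod B is the step within that revolution.  For B = 4 and
       L = 3 this gives T1 the order 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12. */
    int scanline_to_process(int k, int t, int B, int L)
    {
        int r = (t - 1) / B;        /* revolution of the inner-ring data */
        int s = (t - 1) % B;        /* step within the current revolution */
        return ((s + k - 1) % B) * L + r + 1;
    }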


FIG. 5. An example: a, b, c, d, e, f are segments.

The rotation algorithm finds the strips on the first scanline of each block concurrently. In addition, while steps O2 and O3 are being executed in Tk (a transputer in the outer ring) for a scanline, steps I1, I2, and I3 are being executed in T'k (the corresponding transputer in the inner ring) for another scanline. However, the description which follows only traces the path of the data for the first scanline of the second block.

• T2 outputs a to T'2 (steps O2, O3). T'2 now has a (step I1). Compaction does nothing (step I2). T'2 outputs a to T'1 (step I3). T'1 now has a.

• T1 outputs nothing to T'1 (steps O2, O3). T'1 still has a. Compaction does nothing. T'1 outputs a to T'4 (step I3). T'4 now has a.

• T4 outputs d, e, f to T'4 (steps O2, O3). T'4 now has a, d, e, f (step I1). Compaction produces a, D (step I2), where D is the compacted d, e, f. T'4 outputs a, D to T'3 (step I3). T'3 now has a, D.

• T3 outputs b, c to T'3 (steps O2, O3). T'3 now has a, b, c, D (step I1). Compaction produces a, B, D (step I2), where B is the compacted b, c. T'3 outputs a, B, D to T'2 (step I3). T'2 now has a, B, D.

The data have passed through all the transputers of the inner ring (one complete revolution) and T'2 now holds the strips of the first scanline of its block (i.e., Block 2).
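The compaction performed in step I2, and seen in the trace above (d, e, f become D; b, c become B), is simply the merging of adjacent strips of the same color on a scanline. The following C sketch is an illustrative implementation under the assumption that the strips of one scanline are already sorted by their starting position; it is not the code used on the transputer network.

    /* Illustrative only: merge adjacent same-color strips on one scanline.
       Assumes the strips are sorted by x and all lie on the same scanline. */
    typedef struct { int x; int length; int color; } ScanStrip;

    /* Compacts in place; returns the new number of strips. */
    int compact_scanline(ScanStrip *s, int n)
    {
        int out = 0;
        for (int i = 0; i < n; i++) {
            if (out > 0 &&
                s[out - 1].color == s[i].color &&
                s[out - 1].x + s[out - 1].length == s[i].x) {
                /* Same color and abutting the previous strip:
                   extend the previous strip instead of keeping a new one. */
                s[out - 1].length += s[i].length;
            } else {
                s[out++] = s[i];
            }
        }
        return out;
    }

For example, three abutting strips of the same color with lengths 2, 3, and 4 would compact into a single strip of length 9.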

5. AN ASSESSMENT AND CONCLUDING REMARKS

In summary, the algorithm described divides the whole process of rotation into two major operations: (a) segment calculation and (b) merging, strip compaction, and input/output operations. These two major operations are executed concurrently (in the outer and the inner rings). In addition, the image data are divided into smaller portions, blocks, and each block of data is operated on simultaneously.


Operation (a) is, in general, more time consuming than operation (b), because it involves a search through the list of the rotated strips for each scanline. Therefore it is not necessary to have transputers with the same processing power in both rings: 20 MIPS transputers could be used to execute (a) and 10 MIPS transputers to execute (b) to achieve the same overall performance.

On the network of transputers, the execution time depends on M and N, where M is the maximum number of strips in a block, and N is the number of scanlines in the image space. Thus, if one block contains many strips and the others none at all, then the speed of rotation would be the same as if the other blocks had the same number of strips. In general, the more blocks (i.e., the more transputers), the smaller M becomes. When more transputers are used in the network, less memory is needed in each transputer. This trade-off can be exploited to gain more speed by using more transputers with smaller but faster memories.

A transputer is able to process, input, and output data concurrently. This capability is exploited by using two memory buffers in each transputer (node) in the network rings. By using these buffers in a particular order, all the interprocessor communications (excluding loading the original image into the network, and dumping the result to the outside world) are overlapped with other processes.

Our experiments show that the speedup is almost linear as the number of blocks (i.e., transputers) increases. This means that an image containing S strips, where M < S < B*M, with B blocks in the image space, can be rotated in approximately the same amount of time as an image containing P strips, where M ≤ P < (B + K)*M, with B + K blocks in the image space (assuming that the number of scanlines is not changed). This is not an unexpected result, since the major overhead in parallelizing the algorithm is the interprocessor communications which, as mentioned earlier, have been overlapped with other more time consuming processes. However, when the interprocessor communications become more expensive than the process of segment calculation, this increase in speedup will slow down. This happens when there are J blocks (i.e., 2*J + 1 transputers), where J is close to the number of scanlines in the image space, and the image to start with contains only a few strips.

The algorithm has been implemented on an Occam compiler which runs on a sequential machine [6]. It was tested and timed on a number of computer-generated images. We assumed that, for each transputer, the external memory is IMS2600-12 dynamic RAM (150 ns access time), all data are held in external memory, and the transputer type is T414 (32 bit, 10 MIPS). The execution times were obtained from counts of program elements together with published times taken for each programming element [4]. Typical execution times for small angles of rotation (1° to 10°) with a maximum of 640 strips in a block and with 256 scanlines in the image space were 450 to 510 ms, and for larger angles 485 to 590 ms.


The algorithm works faster for small angles of rotation, because the smaller the angle, the fewer segments are generated before strip compaction. The timings do not include the execution times of reading the image into the transputer network and writing the resultant image onto the frame buffer. These execution times depend on the type of frame buffer used and the way it has been connected to the network.

Our experiments show that the performance of a network with 21 transputers (i.e., 10 blocks in the image space) would be comparable to the performance of most rotation algorithms targeted at medium-size SIMD machines (64 × 64 to 96 × 96 bit-serial processors); for an example see [2]. However, the transputer network is much less expensive.

ACKNOWLEDGMENTS

I am indebted to Dr. Martin Oliver (School of Mathematical Sciences, University of Bath, England) for his comments on the manuscript, and for many stimulating discussions.

REFERENCES

1. Arabnia, H. R., and Oliver, M. A. A transputer network for the arbitrary rotation of digitized images. Comput. J. 30, 5 (1987), 425-433.

2. Arabnia, H. R., and Oliver, M. A. Arbitrary rotation of raster images with SIMD machine architectures. Comput. Graphics Forum 6, 1 (1987), 3-12.
3. Hoare, C. A. R. Communicating sequential processes. Comm. ACM 21 (1978), 666-677.
4. INMOS. IMS T414 transputer. Product description. INMOS Ltd, Bristol, UK, 1985.
5. INMOS. Occam Programming Manual. Prentice-Hall, New York, 1984.
6. INMOS. Occam Programming System on Unix. INMOS Ltd, Bristol, UK, 1985.
7. Johnston, E. G., and Rosenfeld, A. Geometrical operations on digitized pictures. In Lipkin, B. S., and Rosenfeld, A. (Eds.), Picture Processing and Psychopictorics. Academic Press, New York, 1970, pp. 217-241.

Received July 6, 1989; accepted September 5, 1989

HAMID ARABNIA received a B.Sc. honors degree in mathematics and computing in 1983 from the Polytechnic of Wales (Pontypridd, United Kingdom) and a Ph.D. degree in computer science from the University of Kent (Canterbury, England) in 1987. For nine months in 1987, he worked as a Computer Science Consultant for Caplin Cybernetics Corp. (London, England), where he helped in the design and implementation of a number of image-processing algorithms. These algorithms were targeted at an MIMD machine architecture made up of transputers. Dr. Arabnia is currently an assistant professor of computer science at the University of Georgia (Athens, Georgia), where he has been since 1987. His research interests include parallel algorithms in general, and the application of parallel processing to computer graphics and image processing.