COMPUTER GRAPHICS AND IMAGE PROCESSING 14, 159-169 (1980)

On Efficient Global Information Extraction Methods for Parallel Processors

A. P. REEVES

School of Electrical Engineering, Purdue University, West Lafayette, Indiana 47907

Received October 25, 1979; revised November 29, 1979; accepted January 18, 1980

The capability of a parallel processor for the fast implementation of image processing algorithms is demonstrated by the analysis of selected examples. An earlier study considered thresholding by the mode method as a challenge for the parallel processor. In this paper we identify the necessity for an efficient global feature extraction mechanism and show that a small increase in hardware (less than 5%) can result in more than two orders of magnitude speed improvement for this and some other algorithms. A hardware feature extraction (bit counting) mechanism is proposed and its performance is analyzed for several image processing algorithms. The main contention of this paper is that current parallel processors should incorporate this form of hardware if they are to be effectively utilized for many image processing applications.

1. INTRODUCTION

In a paper by Cordella et al. [1] the algorithm of thresholding using the mode method [9] is described, and it is argued that this is a representative task on which to compare parallel and sequential processors. A detailed analysis is given in [1] which shows that a parallel processor with 16,384 simple processing elements can compute the threshold algorithm approximately ten times faster than a sequential computer. The rather poor speed improvement, when compared to the difference in processing hardware, is due to two main reasons. First, in [1] the entire computation is achieved within the parallel processor. In practice, however, the parallel processor is connected to a conventional sequential host processor, and some algorithms can be realized more efficiently by using a combination of parallel processor and host processor operations, especially when feature information is to be processed. Ingenious algorithms are necessary to perform all operations in the parallel processor; although very novel and interesting, this may not be very efficient. Second, the majority of the computation time is required for counting the number of bits in a bit plane distributed throughout the parallel processor. The parallel processor described, and other current parallel processor designs, do not have an efficient mechanism for feature extraction operations such as bit counting. Therefore most of the time is required for the parallel processor to compute a function for which it was not designed. These parallel processors are very efficient for implementing image processing "filtering" algorithms involving near neighbor information but are not very efficient for algorithms requiring global information. An efficient bit-counting mechanism will be described which requires only a small amount of additional hardware. In the following sections the capability of bit counting as a global feature-extraction mechanism is considered.
A practical bit-counting implementation is proposed and the effect of this hardware on the realization of several image processing algorithms is discussed.

2. BINARY ARRAY PROCESSORS

In this study and in [1], parallel binary array processors (BAPs) are considered. Binary array processors are single-instruction-stream multiple-data-stream (SIMD) processors which consist of a large two-dimensional array of simple processing elements (PEs) and a common control unit. Each PE is constrained to bit-serial operations and has one-bit-wide interconnections with its near neighbor PEs. The important features of a BAP architecture are efficient storage of image data, fast access to near neighbor information, and high speed due to the large number of PEs. With the advent of Large Scale Integration technology it has become feasible to construct large parallel BAPs [2-4]. By implementing several PEs on a single chip, BAPs with over 10,000 PEs can be constructed. There are currently three main efforts to implement large parallel BAPs. CLIP4 is being constructed by Duff [2] at University College, London; it consists of special chips containing 8 PEs with local memory. A 96 × 96 array of PEs is currently being constructed; however, in this paper, CLIP will be considered to have 128 × 128 PEs for comparison with other BAPs. The main application of CLIP is the analysis of biomedical images. In [1] a processor called CLIP5 is analyzed, which is a hypothetical MSI CLIP implementation with 128 × 128 PEs. A massively parallel processor (MPP) is currently being constructed by NASA to analyze LANDSAT D images. Special 8-PE chips are being developed for MPP which require additional local memory chips. The MPP processor will have a matrix of 128 × 128 PEs. The distributed array processor (DAP) is being developed by ICL for general applications. Currently a 32 × 32 processor is operational and a 128 × 128 processor is being considered. The new processor will contain special LSI chips. The architecture of a characteristic PE is shown in Fig. 1. It consists of three main parts: an ALU, a local memory, and a near neighbor selector.
The ALU contains an accumulator A and a carry register C; it can perform all logical functions on two operands and can also perform a single-bit add operation. Most instructions involve combining data from the local memory with the accumulator, as in conventional computers. When required, near neighbor information from neighboring PEs can be obtained via the neighbor selector. Some PE architectures [2, 4] enable a combinational logic path to be set up from the near neighbor input to the near neighbor output; in this way a signal can propagate, in one instruction, through a set of PEs. This type of instruction could be achieved with a sequence of single PE near neighbor instructions but, depending upon the technology used for the PE implementation, the single propagating instruction may be up to ten times faster. A very simple OR function feature extraction mechanism is usually implemented which indicates if any bit in a bit plane is set.

FIG. 1. A typical PE architecture.
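To make the bit-serial style of operation concrete, the following sketch (an illustrative Python model of our own, not the instruction set of CLIP, DAP, or MPP) adds two images stored as bit planes; each loop iteration models one single-bit add instruction executed simultaneously by every PE, with each PE keeping a one-bit carry.

```python
# Behavioral model of SIMD bit-serial arithmetic: a bit plane is a flat list
# with one bit per PE, and every PE executes the same operation at once.
# The function names and data layout are illustrative only.

def add_images(x_planes, y_planes):
    """Add two images stored as lists of bit planes (least significant first).
    Each iteration models one single-bit add instruction in every PE."""
    n_pes = len(x_planes[0])
    carry = [0] * n_pes                  # the C register of each PE
    out = []
    for xp, yp in zip(x_planes, y_planes):
        out.append([x ^ y ^ c for x, y, c in zip(xp, yp, carry)])
        carry = [(x & y) | (c & (x ^ y)) for x, y, c in zip(xp, yp, carry)]
    out.append(carry)                    # final carries form the top bit plane
    return out

def to_planes(pixels, n_bits):
    # distribute an image into bit planes, one pixel per PE
    return [[(p >> i) & 1 for p in pixels] for i in range(n_bits)]

def from_planes(planes):
    return [sum(plane[j] << i for i, plane in enumerate(planes))
            for j in range(len(planes[0]))]
```

An n-bit add thus costs n single-bit instructions regardless of the number of PEs, which is the sense in which BAP speed comes from array width rather than per-PE power.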


On some BAPs a second, slightly more powerful mechanism is sometimes available; however, it is not as powerful as the full bit-plane counter to be described.

3. BIT COUNTING FOR FEATURE EXTRACTION

Current BAP designs have very primitive capabilities for extracting global feature information over the whole array. Many low-level image processing algorithms are based on local, near neighbor information and are efficiently implemented on BAPs. However, higher-level algorithms, such as thresholding with the mode method, require global information. The one mechanism which is implemented on these BAPs is an OR function over all PEs. This scheme has been used in many algorithms but is frequently not very efficient. Slightly more powerful schemes are available: a row of bits may be obtained from the DAP, sixteen dispersed data points may be sampled by the MPP hardware, and a bit counter along one edge of CLIP was considered. A mechanism is proposed here which counts all the bits in a bit plane. The bit-counting operation is a very powerful feature extraction scheme, especially if used in conjunction with a sequence of conventional BAP operations. For example, the bit count can be used for the following applications:

(a) For a binary feature image, the bit count gives the area of the object or objects represented by 1's in the image.
(b) After a single edge detection instruction the bit count gives the perimeter of the object or objects in a binary image.
(c) After an exclusive OR operation between two binary images the bit sum gives the Hamming distance between them.
(d) The average value of the pixels in an image with n bits per pixel can be calculated with n bit-counting operations.
(e) The median value of the pixels in an image of n-bit pixels may be computed in n² BAP instructions and n bit-counting operations.
(f) The histogram of an image may be computed in n·2ⁿ BAP instructions and 2ⁿ bit-counting operations. This algorithm will be discussed in the following sections.
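Items (c) and (d) can be illustrated with a small behavioral sketch (a bit plane is modeled simply as a list of 0/1 values; the helper names are our own):

```python
def bit_count(plane):
    # the proposed global primitive: number of 1s in a bit plane
    return sum(plane)

def hamming_distance(plane_a, plane_b):
    # (c): one exclusive-OR pass followed by a single bit count
    return bit_count([a ^ b for a, b in zip(plane_a, plane_b)])

def mean_pixel_value(pixels, n_bits):
    # (d): n bit counts, one per bit plane, each weighted by its bit position
    total = 0
    for i in range(n_bits):
        total += bit_count([(p >> i) & 1 for p in pixels]) << i
    return total / len(pixels)
```

In each case the BAP performs only cheap plane-wide logic; the single expensive global step is the bit count, which is exactly what the proposed hardware accelerates.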

4. BIT COUNTING IMPLEMENTATION

A bit counting mechanism is required which is capable of rapidly counting the number of bits set in a bit plane of 16,384 bits. Several implementations for parallel bit counters have been proposed in the literature. Foster and Stockton [5] have described a scheme of bit counting by a network of full adders. Swartzlander [6] derives an upper bound for the delay of this type of network and suggests and compares several different schemes for implementing bit counters. He considers using ROMs to replace a set of full adders and partially analog circuits to achieve higher performance. Kobayashi and Ohara [7] describe a synthesis method for constructing large bit counters from small bit counter modules. They illustrate this technique with bit-counter modules having 7 inputs and a 3-bit sum output. With modern LSI technology it should be possible to fabricate a 31-input, 5-output module on a single chip. A bit counter for a BAP could be economically constructed with a network of these chips.
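The synthesis idea of Kobayashi and Ohara, building a large counter level by level from fixed-size modules, can be sketched behaviorally as follows (our own formulation; the adder stages that combine multi-bit partial sums are abstracted into ordinary addition here, and the module width is a parameter):

```python
def count_with_modules(bits, module_inputs=31):
    """Count the 1s in `bits` by repeatedly packing values into
    `module_inputs`-wide bit-counter modules; each level replaces a
    group of inputs with its binary sum. Returns (count, levels used)."""
    partial = list(bits)
    levels = 0
    while len(partial) > 1:
        partial = [sum(partial[i:i + module_inputs])
                   for i in range(0, len(partial), module_inputs)]
        levels += 1
    return (partial[0] if partial else 0), levels
```

For example, 128 inputs collapse to 5 partial sums after one level of 31-input modules and to a single sum after two, which is why a network of such chips can count a large plane in a handful of stages.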


The bit-counter module has U inputs, the number of bits to be counted, and V = ⌈log₂(U + 1)⌉ outputs, where U < 2^V. A single full adder (FA) can be considered as a module with U = 3 and V = 2, as shown in Fig. 2. A 7-input module can be constructed from four FAs as shown in Fig. 3. The proposed module size is U = 31, V = 5; this could be constructed with four 7, 3 bit counters and three parallel adders as shown in Fig. 4. This design requires 26 FAs and takes 7 FA delays to generate the output. The design technique using only fixed-size modules is illustrated in Fig. 5, where the 31, 5 counter is implemented using only 7, 3 modules. If the 7, 3 modules were constructed with FAs as shown in Fig. 3, then this implementation would require 28 FAs and take 12 FA delays to generate the output.
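The 7, 3 module of Fig. 3 can be checked with a small gate-level model (our own encoding: each full adder is the usual sum/carry pair, and four of them produce the 3-bit count):

```python
def full_adder(a, b, c):
    # the 3, 2 module of Fig. 2: three input bits in, (sum, carry) out
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def count7(bits):
    # Fig. 3: a 7-input bit counter built from four full adders
    a, b, c, d, e, f, g = bits
    s0a, c0a = full_adder(a, b, c)      # first rank: two independent FAs
    s0b, c0b = full_adder(d, e, f)
    s0, c0 = full_adder(s0a, s0b, g)    # combine sums with the seventh input
    s1, s2 = full_adder(c0a, c0b, c0)   # combine carries: result bits 1 and 2
    return s0 | (s1 << 1) | (s2 << 2)
```

The carries from the first rank have weight 2, so adding them (together with the carry from the sum stage) directly yields bits 1 and 2 of the count.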

FIG. 2. Full adder or 3, 2 module.

FIG. 3. 7, 3 bit counter constructed with 3, 2 modules.

FIG. 4. 31, 5 counter design using a minimum number of FAs.

FIG. 5. 31, 5 counter constructed with 7, 3 modules.

The 31, 5 chip could be based on the design given in Fig. 4; however, groups of FAs should be replaced by two-level logic designs to reduce the number of delays through the network. If we assume that the chip is constructed with the same technology which is used for the PEs and local memory, then it should be possible to sum the 31 bits in less than a PE memory cycle. The proposed chip is shown in Fig. 6 and could be contained in a 40-pin package. The output of the counter is held in a 5-bit register; this makes it possible to pipeline data through a network of these chips. Using the synthesis scheme outlined in [7], a bit counter has been designed for 16,384 inputs with 31, 5 modules. It will take several PE memory cycles for the data to propagate through this network. The throughput of the network may be improved by using a pipeline scheme which allows new data to be input to the network before the current data has reached the output.

FIG. 6. Bit-counting chip organization.

FIG. 7. Pipelined version of a 31, 5 counter with 7, 3 modules. R1-R5 are extra pipeline registers.

The pipeline scheme will require a few extra registers to ensure that the data is synchronized through the network. A pipelined version of the counter shown in Fig. 5 is given in Fig. 7; it contains 18 bits of extra register storage. Also, in this case, each 7, 3 counter contains an output register. The results of the 16,384 counter design are shown in Table 1. Two other counter designs are also considered in Table 1; their organization is shown in Fig. 8. The first, the every-8 counter, samples one out of 8 adjacent PEs for each input. A bit plane is counted by shifting the data in the PEs 8 times to input all of the bit plane to the counter. A 15-bit adder accumulator is required to add up the 8 12-bit numbers which are generated in sequence by the counter.

TABLE 1
Comparison of Different Bit-Counting Schemes

                                                  Full counter   Every-8 counter   Edge counter
No. of inputs to counter                              16,384           2048              128
No. of logic levels                                        8              6                3
No. of 31, 5 chips                                       627             85                6
(Extra register bits for pipeline)                        48             19               16
31, 5 chips as a percentage of PE chips (CLIP4)        30.6%           4.2%             0.3%
31, 5 chips as a percentage of PE chips (MPP)          15.3%           2.1%             0.2%
Memory cycles required to count one bit plane              8             15              132
Memory cycles required to count n bit planes           7 + n         7 + 8n         4 + 128n
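The two memory-cycle rows follow a simple latency-plus-throughput model; assuming the pipeline latencies and per-plane load counts implied by the table, a quick check in Python:

```python
# cycle-count model behind Table 1: total cycles = pipeline latency through
# the counter network + input operations needed per bit plane
def memory_cycles(latency, loads_per_plane, n_planes):
    return latency + loads_per_plane * n_planes

SCHEMES = {
    "full":    (7, 1),    # whole plane loaded in a single PE operation
    "every-8": (7, 8),    # 8 shifts feed one-eighth of the plane at a time
    "edge":    (4, 128),  # 128 shifts feed one row at a time
}
```

For a single plane this gives 8, 15, and 132 cycles respectively, matching the table; for large n the full counter's advantage grows to roughly 8× over the every-8 counter because only its loading cost stays constant per plane.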

FIGURE 8. (a) Full counter; (b) every-8 counter with a 15-bit adder/accumulator; (c) edge counter.

The second, the edge counter, is a 128-input bit counter connected to one edge of the PE matrix. 128 shifts are required to input a whole bit plane to this counter; the bit sum is accumulated in a 15-bit adder accumulator as shown in Fig. 8c. The cost of the counters as a percentage of the PE hardware is given in Table 1. In the CLIP4 design a limited amount (32 bits/PE) of local memory is designed into each chip of 8 PEs. In the DAP and MPP designs the local memory will be on a separate chip. Therefore, a 128 × 128 matrix will require at least 2048 CLIP4 chips and at least 4096 chips for DAP or MPP. From Table 1 it would appear that the full counter is only twice as fast as the every-8 counter but requires 8 times more hardware. However, the every-8 counter requires 8 PE operations to load the bit plane and the full counter only requires one PE operation. The full counter will be more efficient when sets of bit planes are to be counted or when bit counting can be done concurrently with PE operations. This will be demonstrated in the following examples.

5. BIT COUNTING WITH CELLULAR AUTOMATA

An interesting alternative serial bit-counting scheme based on cellular automata theory has been described by Rosenfeld [10]. He has shown that a tree-structured network of cellular automata can generate the sum of m inputs in 2 log₂(m) steps. The network consists of a balanced binary tree of serial full adder nodes. Each node consists of a FA with buffers for the sum and carry out signals. The carry out buffer is connected to one of the inputs of the FA to realize a serial adder. A bit plane to be counted is input at the leaves of the tree, and after log₂(m) clock cycles the least significant bit of the sum is output at the root of the tree. Successively more significant sum bits are generated for the next log₂(m) clock cycles. If the sum of n bit planes is required, then successive bit planes may be loaded into the counter every 1 + log₂(m) clock cycles. Therefore the time required to count n bit planes is n + (n + 1) log₂(m) clock cycles.
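The serial-adder tree can be simulated cycle by cycle. The sketch below is our own model of the scheme (registered sum outputs, carry fed back within each node) and reproduces the LSB-first behavior for a power-of-two number of leaves:

```python
def serial_tree_count(bits):
    """Count the 1s in `bits` (length a power of two, >= 2) with a balanced
    binary tree of serial full adders, collecting sum bits LSB first."""
    m = len(bits)
    depth = m.bit_length() - 1                            # log2(m)
    out = [[0] * (m >> (k + 1)) for k in range(depth)]    # registered sums
    carry = [[0] * (m >> (k + 1)) for k in range(depth)]  # per-node carries
    leaf = list(bits)                     # leaves emit their bit once, then 0
    result = 0
    position = 0
    for t in range(2 * depth + 1):
        new_out = [row[:] for row in out]
        for k in range(depth):
            src = leaf if k == 0 else out[k - 1]          # previous-cycle values
            for i in range(len(out[k])):
                a, b, c = src[2 * i], src[2 * i + 1], carry[k][i]
                new_out[k][i] = a ^ b ^ c                 # serial sum bit
                carry[k][i] = (a & b) | (c & (a ^ b))     # carry feedback
        out = new_out
        leaf = [0] * m
        if t >= depth - 1:                # root register now holds sum bits
            result |= out[-1][0] << position
            position += 1
    return result
```

Each internal node is an ordinary serial adder over LSB-first bit streams, so by induction the root emits the binary expansion of the total count, delayed by the register depth of the tree.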


A network to count 16,384 inputs requires 16,383 nodes and has a delay of log₂(16,384) = 14 nodes. A possible LSI implementation with the 40-pin package constraint is a 32-input, single-output chip containing a balanced tree of 31 nodes. A bit counter for 16,384 inputs would require 512 + 16 + 1 = 529 of these chips. Therefore this scheme would require 16% fewer chips than the full counter design. However, this scheme is also much slower than the full counter, i.e., 3.5 times slower for single bit-plane counts and approximately 14 times slower for multiple bit-plane counts. This scheme is also slower than the every-8 counter design, which requires much less logic. It is possible to extend this concept to a two-dimensional pyramid cellular automaton; however, only a small improvement in performance is obtained. In this case each node receives inputs from four lower-level nodes and contains a 7, 3 bit counter with a 3-bit buffer register. The most significant 2 bits of this buffer are connected back to 3 inputs of the 7, 3 counter. The delay time through the network is reduced to ⌈log₂(m)/2⌉ clock cycles; however, log₂(m) clock cycles are still required to output the sum. The saving in time is a constant ⌈log₂(m)/2⌉ clock cycles for either the single or multiple bit-plane counting applications. LSI packaging is slightly more difficult than for the binary tree scheme; two 16-input, single-output networks can be realized on a 40-pin chip. For a 16,384-input counter, 512 + 32 + 1 + 1 = 546 of these chips would be required. This extended scheme is still slower than the simpler every-8 counter. In many cases a bit plane to be counted will only have a few bits set to one. If it is known a priori that a maximum of K input bits can be set to one, then the counter may be stopped after ⌈log₂(K + 1)⌉ outputs have been generated, as all the more significant sum bits will be zero.
For the general case, a zero detector could be connected to each node, and detector outputs at each level of the tree or pyramid could be ORed together. Then it would be possible to modify the control sequence dynamically to minimize the number of clock cycles required.

6. HISTOGRAM GENERATION

The first example algorithm is the computation of the histogram of an image distributed in the local memory of the PEs. The image size which is considered is 128 × 128 pixels with 32 gray levels (5-bit pixels); this is the same size used in [1]. On a BAP the histogram is generated in two stages. First, 32 bit planes, one for each gray level, are generated; then the number of bits in each bit plane is counted. In [1] each bit plane is generated by evaluating a Boolean expression for each gray level. For example, the expression for gray level 9 is

G9 = x̄4 · x3 · x̄2 · x̄1 · x0,

where xi is the ith bit of the pixel and x̄i is its complement. In the general case the number of operations required to calculate the bit planes for n gray levels is n log₂(n). For 32 gray levels 160 operations are required. An alternative, more efficient method for calculating the bit planes exists, in which partial results of computations are recorded so that repeated identical operations are not required. For example, the equation for the eighth gray level is

G8 = x̄4 · x3 · x̄2 · x̄1 · x̄0.


TABLE 2
Number of Clock Cycles Required to Generate Histogram

Conventional
minicomputer   No bit counter   Edge counter   Every-8 counter   Full counter
  ≈328,000        ≈40,000          4,259             422              168
                                   4,219ᵃ            381ᵃ             128ᵃ

ᵃ Using the optimized bit-plane generation algorithm.

G9 differs from G8 in the value of x0 only. Therefore, we could compute x̄4 · x3 · x̄2 · x̄1 and save this result; G8 could then be computed by ANDing with x̄0, and G9 could be computed by ANDing with x0. This method saves computing x̄4 · x3 · x̄2 · x̄1 twice, once for G8 and once for G9. The general form of this algorithm for n gray levels requires log₂(n) − 2 temporary storage bit planes and requires 4n − 8 operations. For 32 gray levels, 120 operations are required. The figures in Table 2 for the conventional minicomputer and for a BAP without a bit counter are taken from [1]. In their algorithm the parallel processor is only about eight times faster than the conventional computer. The every-8 counter achieves a further two orders of magnitude speed improvement with a less than 5% increase in hardware; a further three times speed improvement is possible with the full bit counter hardware.
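The shared-partial-result idea generalizes naturally: refining from the most significant bit downward, each partial product is computed once and then split in two, one AND per gray level per bit. A Python sketch (our own formulation of the optimization, operating on lists rather than hardware bit planes):

```python
def gray_level_planes(pixels, n_bits):
    """Return 2**n_bits binary planes; planes[g][j] == 1 iff pixels[j] == g.
    Each shared bit-pattern prefix is computed exactly once."""
    planes = [[1] * len(pixels)]               # the empty-prefix plane
    for i in reversed(range(n_bits)):          # most significant bit first
        bit = [(p >> i) & 1 for p in pixels]
        planes = [refined
                  for plane in planes
                  for refined in ([q & (b ^ 1) for q, b in zip(plane, bit)],
                                  [q & b for q, b in zip(plane, bit)])]
    return planes

def histogram(pixels, n_bits):
    # one bit count per gray-level plane completes the histogram
    return [sum(plane) for plane in gray_level_planes(pixels, n_bits)]
```

Because the planes are produced in ascending gray-level order, a host processor can consume each count as it is generated, which is exploited by the thresholding algorithm of the next section.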

7. THRESHOLDING (THE MODE METHOD)

Thresholding with the mode method [9] presupposes that the histogram has a bimodal distribution. The valley of the histogram is located and a bit plane is generated having 1's for all pixels above this value and 0's elsewhere. The most computationally demanding part of this algorithm is generating the histogram. In this example a 128 × 128 image of 5-bit pixels will be considered. The number of memory cycles to achieve thresholding with different hardware schemes is shown in Table 3. The data for the conventional computer and the BAP with no bit counter have been obtained from [1]. In [1] the histogram is accumulated in the PE matrix, and when completed the valley is found in about 330 clock cycles. However, only the valley of the histogram is required. The host computer should be able to compare the value of each element of the histogram, as it is generated, with the previous value. When the second sign change occurs the valley is detected. The host computer should be able to

TABLE 3
Thresholding (The Mode Method)

                       Conventional
                       minicomputer   No bit counter   Edge counter   Every-8 counter   Full counter
Histogram generation     ≈328,000        ≈40,000          4,219             382              128
Valley detection         ≈164,000            827            128             128              128
Thresholding                                                  6               6                6
Total                    ≈492,000        ≈41,000          4,353             516              262


achieve this simple operation while each element is being computed by the BAP. In Table 3, 128 additional memory cycles have been allocated for this in case the host computer cannot do both functions at the same time. The generation of the threshold bit plane described in [1], with the threshold value in the PE matrix, requires 500 memory cycles. If the threshold value is stored in the host computer then a comparison can be made with all image elements in 5 memory cycles; one extra cycle is required to store the result. One advantage of checking each histogram element as it is computed is that only the points up to the valley need to be computed; the rest of the histogram is not needed. In fact, if there is some a priori information which indicates the approximate position of the valley, then this area of the histogram may be examined first. This technique is not suitable for the conventional computer, which must examine all image points, i.e., compute the whole histogram, regardless of any a priori information. In [1], for an image of size n × n, the computation time for the conventional computer is proportional to n² and the computation time for the parallel processor is approximately proportional to n (assuming n × n PEs are available). The proposed scheme's processing time increases very little with increasing n (assuming n × n PEs are available). The only increase in time is due to the extended delay through the bit-count mechanism, which is proportional to log₂(n) [7]. For example, the difference in computation time between n = 128 and n = 2048 is less than 10% of the total time required to compute with n = 128.
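The host-side valley test can be sketched as follows (our own formulation: track the sign of successive histogram differences and report the gray level where the second sign change, falling to rising, occurs):

```python
def find_valley(counts):
    """Given histogram values in gray-level order (first mode rising, then a
    valley, then a second mode), return the gray level of the valley."""
    sign_changes = 0
    prev_sign = 0
    for g in range(1, len(counts)):
        diff = counts[g] - counts[g - 1]
        if diff == 0:
            continue                     # plateau: keep the previous trend
        sign = 1 if diff > 0 else -1
        if prev_sign != 0 and sign != prev_sign:
            sign_changes += 1
            if sign_changes == 2:        # second change: falling -> rising
                return g - 1             # the minimum just passed
        prev_sign = sign
    return None                          # no bimodal valley found
```

Since the test consumes one count at a time, the host can stop the BAP as soon as the second sign change appears, which is exactly why only the histogram entries up to the valley need to be generated.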

8. IMAGE REGISTRATION

A technique has been developed for the real-time registration of time-varying imagery [8] in which the camera is moving. This technique involves adaptively thresholding each image into a binary bit-plane feature image. Feature images from two consecutive frames are compared by computing the Hamming distance or bit difference between them. One bit plane is slid over the other until the bit difference between them is a minimum. The displacement of the bit planes at this minimum indicates the amount of movement of the camera. For this example a 128 × 128 image of 8-bit pixels is considered. The amount of time to compute each binary feature image is 1123 memory cycles [8]. Assume that the displacement of the camera is known to within 10 pixel positions. Therefore one bit plane will be slid over the other by up to 10 pixels in each direction, i.e., 21 × 21 different displacements will be considered. The bit difference must be computed in 21 × 21 = 441 different positions. The cost of implementing this algorithm with different bit-counting hardware is given in Table 4. Note: the time required to shift the image is negligible, as PEs have direct access to near neighbor information.

TABLE 4
Number of Memory Cycles Required to Implement the Image Registration Algorithm

                                         Edge counter   Every-8 counter   Full counter
Computation of feature image                   1123            1123             1123
Computation of 441 difference values         56,452            3535              449
Total                                        57,575            4658             1572
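The registration search itself is easy to state. The sketch below is a plain Python model with hypothetical helper names; on the BAP each Hamming distance would be one XOR pass followed by one bit count, while here the sliding and counting are simulated directly:

```python
def bit_difference(plane_a, plane_b, dy, dx):
    # Hamming distance between plane_a and plane_b displaced by (dy, dx);
    # positions shifted outside the array are ignored
    h, w = len(plane_a), len(plane_a[0])
    return sum(plane_a[y][x] ^ plane_b[y + dy][x + dx]
               for y in range(h) for x in range(w)
               if 0 <= y + dy < h and 0 <= x + dx < w)

def register(plane_a, plane_b, max_disp=10):
    """Return the (dy, dx) in a (2*max_disp+1)^2 search window that
    minimizes the bit difference between two binary feature images."""
    return min(((dy, dx)
                for dy in range(-max_disp, max_disp + 1)
                for dx in range(-max_disp, max_disp + 1)),
               key=lambda d: bit_difference(plane_a, plane_b, d[0], d[1]))
```

With max_disp = 10 this evaluates the 441 displacements of the example; the returned (dy, dx) is the estimated camera movement between the two frames.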


This algorithm demonstrates that the full counter can be significantly faster than the every-8 counter. Also, the edge counter gives a very slow performance, and the no-counter option would not be acceptable for real-time applications.

9. CONCLUSION

It has been shown that the performance of a BAP for many image processing algorithms involving global information can be greatly improved by the addition of bit-counting hardware. The cost of the hardware is only a small percentage of the total BAP cost; a practical design for this hardware has been presented. The full bit counter mechanism produces the best results; however, for many applications the cheaper every-8 counter may be almost as efficient, i.e., when only a small amount of global information is required. Cordella et al. [1] used a thresholding algorithm to compare a sequential and a parallel processor. In this paper a parallel processor scheme is presented which achieves more than two orders of magnitude speed improvement for the parallel case. It is concluded that feature extraction hardware, such as bit counting, is essential for the efficient implementation of many image processing algorithms. This contradicts one of the conclusions given in [1].

REFERENCES

1. L. Cordella, M. J. B. Duff, and S. Levialdi, Thresholding: A challenge for parallel processing, Computer Graphics Image Processing 6, 207-220, 1977.
2. M. J. B. Duff, CLIP4: A large scale integrated circuit array parallel processor, Proc. 3rd International Joint Conference on Pattern Recognition, pp. 728-832, 1976.
3. K. Batcher, MPP: A massively parallel processor, Proc. 1979 International Conference on Parallel Processing, 1979, p. 249.
4. P. M. Flanders, D. J. Hunt, S. F. Reddaway, and D. Parkinson, Efficient high speed computing with the distributed array processor, in High Speed Computer and Algorithm Organization, pp. 113-128, Academic Press, New York, 1977.
5. C. C. Foster and F. D. Stockton, Counting responders in associative memory, IEEE Trans. Computers, 1580-1583, 1973.
6. E. E. Swartzlander, Parallel counters, IEEE Trans. Computers 22, 1021-1024, 1973.
7. H. Kobayashi and H. Ohara, A synthesizing method for large parallel counters with a network of smaller ones, IEEE Trans. Computers 27, 753-757, 1978.
8. A. P. Reeves and A. Rostampour, Sequential image analysis with a parallel binary array processor, Workshop on Computer Analysis of Time-Varying Imagery, April 5-6, 1979, pp. 113-115.
9. A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press, New York, 1976.
10. A. Rosenfeld, Picture Languages, Academic Press, New York, 1979.