Benchmarking processors for image processing

Image processing tasks form an important subclass of DSP operations. M B Sandler, L Hayat and L D F Costa compare the execution of these operations on six general-purpose and DSP microprocessors
A range of commercially available processors is investigated to evaluate their potential in image processing systems. The examination covers a selection of the more important instructions used in image processing algorithms, and a selection of the algorithms themselves as a simple benchmark. It also includes a survey of multiprocessor systems implemented with some of these processors. It is concluded that digital signal processors (DSPs) are better suited to embedded systems, but that the transputer may offer advantages in a research environment, owing to its general usefulness for supercomputing.

Keywords: microprocessors, DSP, image processing, benchmarking

Department of Electronic and Electrical Engineering, King's College London, Strand, London WC2R 2LS, UK. An earlier version of this paper appeared in Proc. IEEE ICASSP'89, © IEEE 1989. Paper received: 5 May 1989. Revised: 17 July 1990
There is now a variety of ways in which to implement digital signal processing (DSP) algorithms; whenever possible, a realtime solution is used. The spectrum of possible solutions may be classified into three categories:

• custom architectures
• general-purpose DSP architectures
• general-purpose architectures.

The custom category comprises systems constructed from low-level building blocks such as multipliers, adders and memory and, arguably, ASICs. Amongst the latter, a recent development is the Cathedral software suite 1 for silicon compilation of DSP functions and algorithms. This style of implementation has the greatest potential for high-speed operation and offers low silicon area and power consumption. The general-purpose DSP category consists of systems implemented using mass-produced processors, such as general-purpose microprocessors and special-purpose DSP microprocessors. In these there may be one or more processors together with support devices such as
multipliers and convolvers. The advantages of this approach are that development time, and hence cost, may be significantly reduced; development tools are available in a style familiar to microprocessor engineers, which accelerates the learning phase; and systems developed for one application may be reprogrammed relatively easily with another algorithm for a different purpose. The class of algorithms is, however, somewhat restricted by the architecture chosen. The use of assembly language programming is gradually dying away as the C programming language becomes available for all microprocessors, both conventional (e.g. the Motorola 68000) and non-conventional (e.g. transputers and DSPs like the Texas Instruments TMS320 series).

The final, general-purpose, category covers non-realtime implementations of algorithms, presumably in a high-level programming language on a general-purpose computer, which could be a PC or a mainframe. The advantages of this approach are that the algorithm designer has complete flexibility in the operations that may be implemented, and may also be able to employ mixed-language programming, for example in an application which requires both numerical and symbolic manipulation (e.g. FORTRAN and Lisp). However, the system is unlikely to operate in realtime and will often have a restricted I/O bandwidth, meaning that processing must normally occur off-line. Such systems are generally used to experiment with and develop new algorithms.

This paper deals with the second category, as this would appear to offer both high performance and reasonable ease and cost of acquiring tools and skills in small, medium-sized and large companies. Image processing will provide the focus for the applicability of this style of implementation. It is gaining great importance as a subclass of DSP operations (ISDN, satellite and remote sensing, computer and robot vision) and incorporates those instructions of greatest importance in DSP algorithms. It also exhibits a high throughput requirement. Consider processing a 512 x 512 pixel monochrome video image in a single frame period: this requires the computation of some 256 000 new items of data in 40 ms, or more than 6 M compound operations per second. Typical compound operations are discussed below.
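The arithmetic behind this figure can be written out explicitly. The short C fragment below is merely a sketch of the calculation, using the frame size and frame period quoted above and assuming one compound operation per output pixel, as the text implies.

```c
#include <stdio.h>

/* Back-of-envelope throughput estimate for frame-rate processing:
 * a 512 x 512 monochrome image must be renewed every 40 ms frame period. */
int main(void)
{
    const double pixels       = 512.0 * 512.0;   /* ~262 144 new items of data */
    const double frame_period = 40e-3;           /* seconds per frame          */

    /* assume one compound operation (e.g. multiply/accumulate) per pixel */
    double ops_per_second = pixels / frame_period;

    printf("compound operations per second: %.2e\n", ops_per_second);
    /* prints about 6.55e+06, i.e. more than 6 M compound operations per second */
    return 0;
}
```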
The processors discussed fall into three classes: the general-purpose microprocessors, typified by the Motorola 68020 2; the general-purpose digital signal processing microprocessors, such as the TI 320 series (16-bit integer and floating-point versions) 3,4, the Motorola 56000 (24-bit integer) 5 and the NEC μPD77230 (24-bit integer and floating-point) 6; and, last, the ubiquitous transputer, examined here in its floating-point form, the T800 7.
COMPARISON OF INSTRUCTION EXECUTION

First, the various processors are compared on the basis of individual instructions, chosen as being typical of image processing algorithms. These are numerical instructions (add, multiply, subtract, divide and the floating-point versions thereof) and data access instructions (storing and loading). The speeds in terms of the number of instruction cycles are shown in Table 1, together with typical instruction cycle times. These figures should only be taken as a rough guide, since they will depend strongly on the type and speed of memory used in an implementation; this might be on-chip or off-chip and, in the case of off-chip memory, may involve zero, one or more wait states. In all cases in the table, the best case has been selected. It is clear that for arithmetic instructions, which are microcoded in the transputer and the 68020, the DSP chips show a clear advantage. It is interesting to note that integer multiplication on the transputer is by default executed in the normal ALU: it is obviously good programming policy to use floating-point multiplication even for integer quantities when using the T800. It is not surprising to see that the DSP chips outperform the others in this selection of instructions, as this is the purpose of their design. It is clear from the table that there is not much to choose between the various DSP chips, until instruction cycle timings are taken into account. More importantly, the two strongest influences on the choice of DSP device should be ease of programming and interface potential (particularly with respect to multiprocessor systems). The execution speed of the T800's floating-point operations is fast, but still significantly slower than the floating-point versions of the DSPs.
This can probably be explained by the asynchronous coupling between the CPU and the floating-point unit (FPU), which requires each to wait for the other to become ready. This would also account for the uncertainty over exact execution speeds; such uncertainty must make it difficult to design realtime systems. It is worth noting that all the DSP chips are designed with some level of internal parallelism (often in the form of pipelining). This implies that many of the instructions in the table can be executed together. For example, the TMS320c25 can simultaneously perform a multiply/accumulate operation together with an index register update and a result scaling. Conventional microprocessors do not have this feature, and nor do transputers, except in that floating-point operations and communications may occur in parallel with CPU operations. The payoff is that the execution of algorithms is far faster than the sum of the execution times of the individual instructions would suggest.
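To make the compound operation concrete, the C fragment below sketches the multiply/accumulate loop that forms the core of the 3 x 3 convolution benchmarked later. On a device such as the TMS320c25 each iteration (multiply, accumulate and index register update) collapses into a single instruction cycle, whereas a conventional microprocessor issues these as separate instructions. The function and variable names are illustrative, not taken from the benchmark code.

```c
/* Multiply/accumulate kernel: the compound operation that the DSP
 * devices execute in a single instruction cycle.  'coeff' and 'data'
 * are illustrative names only.                                       */
long mac(const int *coeff, const int *data, int n)
{
    long acc = 0;
    for (int i = 0; i < n; i++)
        acc += (long)coeff[i] * data[i];  /* multiply, accumulate, index update */
    return acc;
}
```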
Table 1. Instruction timings

Processor     Instruction cycle (ns)    Cycles per instruction (best case)
TMS320C25     100                       single cycle for its arithmetic and load/store instructions
TMS320C30     60                        single cycle for its arithmetic and load/store instructions
M56000        97.5                      single cycle for its arithmetic and load/store instructions
NEC           150                       single cycle for its arithmetic and load/store instructions
T800          33                        ADD, SUB 1; MULT 38; DIV 39; ADDF, SUBF 6-9; MULTF, DIVF 11-18; LOAD, STORE 2
M68020        60                        ADD, SUB 3; MULT 28; DIV 56; LOAD, STORE 5

Instructions compared: ADD, SUB, MULT, DIV, ADDF, SUBF, MULTF, DIVF, LOAD, STORE
COMPARISON OF ALGORITHM EXECUTION

An attempt was made to benchmark the various processors. Only in the cases of the Hough transform and the 3 x 3 convolution did the on-chip multiplier (where present) come into operation. Each algorithm was implemented, as far as possible, in the same way on the different processors. Assembly language was used throughout, except for the transputers, where Occam was used. No attempt was made in any case to optimize the code by using expert assembly language programming techniques. For each of the seven algorithms, the code was generated from the same flow diagram; these are depicted in Figure 1. Figures 1b to 1g are the flow diagrams for the algorithms themselves and Figure 1a is the overall diagram, with the individual algorithms fitting into the box marked RPROCS. In the case of the Hough transform, the trigonometric functions are implemented by table lookup. Only the TMS320c25 and T414 transputer were actually available for this comparison, so the figures for the other processors were obtained by coding the algorithms and calculating the operational speeds from manufacturers' data sheets.

Figure 1. Flow diagrams for algorithm implementation on single processors: a, overall algorithm harness; b, thresholding; c, histogramming; d, Sobel edge-detection; e, Laplacian edge-detection; f, thinning; g, Hough transform
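Since the flow diagrams of Figure 1 cannot be reproduced here, the C sketch below paraphrases two of the benchmarked kernels: the histogramming loop of Figure 1c and the Sobel operator of Figure 1d, using the usual |Gx| + |Gy| gradient approximation. It illustrates the operations being timed rather than the assembly language or Occam actually benchmarked, and the array and function names are invented.

```c
#include <stdlib.h>

#define N 256                       /* image side used in the benchmarks */

/* Figure 1c: grey-level histogram of an 8-bit image. */
void histogram(unsigned char img[N][N], unsigned long hist[256])
{
    for (int i = 0; i < 256; i++) hist[i] = 0;
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            hist[img[y][x]]++;      /* hist[i] := hist[i] + 1 */
}

/* Figure 1d: Sobel edge magnitude, |Gx| + |Gy|, interior pixels only. */
void sobel(unsigned char img[N][N], unsigned char out[N][N])
{
    for (int y = 1; y < N - 1; y++) {
        for (int x = 1; x < N - 1; x++) {
            int gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1])
                   - (img[y-1][x-1] + 2*img[y][x-1] + img[y+1][x-1]);
            int gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1])
                   - (img[y-1][x-1] + 2*img[y-1][x] + img[y-1][x+1]);
            int d  = abs(gx) + abs(gy);
            out[y][x] = (unsigned char)(d > 255 ? 255 : d);
        }
    }
}
```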
Table 2. Algorithm timings for a 256 x 256 image (execution speed in s)

Algorithm               TMS320C25   TMS320C30   M56000   NEC     T800    M68020
Instruction cycle (ns)  100         60          97.5     150     33      60
Threshold               0.085       0.066       0.057    0.097   0.35    0.224
Histogram               0.11        0.041       0.038    0.08    0.35    0.228
Sobel                   0.249       0.11        0.236    0.55    1.25    1.227
Laplacian               0.19        0.071       0.153    0.45    0.97    0.688
3 x 3 convolution       0.203       0.063       0.165    0.71    1.27    2.461
Thinning                0.38        0.193       0.204    0.9     2.025   0.888
Hough                   1.367       0.465       0.639    0.725   4.75    6.222
For the T800 this procedure was adapted slightly: the algorithms were run on the T414, substituting the floating-point instructions with others of the same execution time, and the timings were then scaled to account for the different instruction cycle speeds. In the case of the TMS320c30, the code was adapted from TMS320c25 code. All the timings are for the processing of a single 256 x 256 pixel image. Thresholding, thinning and the Hough transform all exhibit some degree of data dependency. To take account of this, the same simple test images were 'executed' on all processors (real or modelled). In the case of the Hough transform it was assumed that 10% of the pixels were white (approximately 6500 pixels) and that there were 90 increments in theta (i.e. the resolution of straight-line identification is two degrees). The timings for these algorithms are given in Table 2, with execution speeds in seconds.

The advantage of using DSP processors for this class of algorithm is clear. The importance of the internal parallelism is demonstrated even for those algorithms which do not use any multiplications. An unusual feature is that the 68020 is predicted to outperform the T800 in most algorithms, even though its clock runs at approximately half the speed. The exceptions are the algorithms needing multiplication, and even then the transputer, with its on-chip floating-point coprocessor, is predicted to be only about twice as fast, i.e. faster by an amount accounted for entirely by the different clock speeds. In comparing the various DSP processors, it should be noted that the investigators have more experience of Texas Instruments devices; even so, the 56000 is still faster than the TMS320c25 (by factors of up to about 3) despite their similar clock speeds. The TI320c30 is the faster floating-point DSP and the M56000 is the faster integer unit. This is not surprising, since the faster units are also the newer ones. It is also clear that the newer TI device has advantages over the older, even for implementations which do not use its floating-point ability. The only algorithm for which the c30 is not as efficient as the c25 is thresholding (where the execution time is reduced by only about 20% for a 40% reduction in cycle time), and this can be explained by the branching instruction taking four cycles on the c30, as opposed to only two cycles on the c25. The transputer was the slowest floating-point device, even when compared with the old NEC processor, whose clock is nearly five times slower. The T800 is 10-20 times slower in processing than the TI320c30, though it does have a cost advantage.
(Note that the NEC device is difficult to program in assembly language, owing to its wide instruction word, and this may have influenced the timings.)
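A C sketch of the Hough transform for straight lines, as described above (trigonometric functions by table lookup and 90 increments in theta, i.e. 2 degree resolution), shows where the data dependency arises: only white pixels cause accumulator updates, so the timing scales with image occupancy. The accumulator dimensions and all names are illustrative assumptions rather than details taken from the benchmark code.

```c
#include <math.h>

#define N       256                  /* image side                             */
#define NTHETA  90                   /* 90 increments in theta, i.e. 2 degrees */
#define NRHO    (2 * 363 + 1)        /* assumed: covers |rho| <= N*sqrt(2)     */

static double cos_tab[NTHETA], sin_tab[NTHETA];

/* Trigonometric functions by table lookup. */
void build_tables(void)
{
    for (int t = 0; t < NTHETA; t++) {
        double arg = t * (3.14159265358979 / NTHETA);   /* 0 .. pi in 2 degree steps */
        cos_tab[t] = cos(arg);
        sin_tab[t] = sin(arg);
    }
}

/* Accumulate straight-line evidence; only white pixels cost any work,
 * which is where the data dependency of the timing comes from.        */
void hough(unsigned char img[N][N], unsigned int acc[NTHETA][NRHO])
{
    for (int y = 0; y < N; y++)
        for (int x = 0; x < N; x++)
            if (img[y][x])
                for (int t = 0; t < NTHETA; t++) {
                    int r = (int)lround(x * cos_tab[t] + y * sin_tab[t]);
                    acc[t][r + NRHO / 2]++;    /* offset so negative rho indexes safely */
                }
}
```

With the assumed 10% occupancy the inner loop body executes roughly 6500 x 90, or about 585 000, times per image, which accounts for the Hough transform being the slowest entry in Table 2.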
COMPARISON OF MULTIPROCESSOR IMPLEMENTATIONS

In this section, some multiprocessor implementations of the same image processing algorithms are compared. Figures of merit have been obtained for only four architectures: two from the literature and two from work done in the vision engineering group at King's College (very few published papers supply details of execution speeds). Fortunately, in all cases the work has been performed on 256 x 256 pixel images. In Table 3, the various systems are referred to by a coding scheme. System TMS1, the so-called OSMMA system (see Figure 2) 8,9, consists of eight TMS320c25s clocked at 32 MHz. System TMS2 consists of eight TMS32010 processors clocked at 10 MHz 10. System Trans1 11 consists of 16 T414 15 MHz transputers and is a prototype supernode 12, now constructed using T800s. System Trans2 13 (see Figure 3) consists of eight T414 20 MHz transputers in a commercially available development system. Further details of all these architectures may be obtained from the relevant papers 8-11, 13. As before, the thinning operation timings must not be regarded as definitive, since the algorithm is strongly data-dependent. The timings for the Hough transform are also data-dependent, but depend only on the number of edge pixels in the image, not on the image content or the absolute dimensions of the image.
Table 3. Algorithm timings for multiprocessors (execution speed in ms)

Algorithm     TMS1    TMS2    Trans1    Trans2
Threshold     18
Histogram     16
Sobel         63
Laplacian     42
Thinning      225
Hough         128

(TMS2, Trans1 and Trans2 report timings for only a subset of the algorithms; the reported values are 105, 57, 40, 203, 580, 230 and 650 ms.)
Figure 2. The OSMMA image processing system: a, processor-memory interconnection; b, mapping of memories and processors onto the image
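The mapping of Figure 2b divides the image into segments, one per processor-memory pair. The C fragment below is a minimal sketch of one such partitioning, assuming contiguous row bands split evenly across eight processors; this is an illustrative assumption about the scheme, not the actual OSMMA memory mapping.

```c
#define N      256    /* image side                                   */
#define NPROC  8      /* number of processors, as in the OSMMA system */

/* Rows [*row0, *row1) assigned to processor p under an even row-band split.
 * For 3 x 3 neighbourhood operators each processor would also need to read
 * one extra (halo) row from each neighbouring band.                        */
void segment_bounds(int p, int *row0, int *row1)
{
    int band = N / NPROC;                        /* 32 rows per processor */
    *row0 = p * band;
    *row1 = (p == NPROC - 1) ? N : *row0 + band;
}
```

In such a scheme each processor runs the single-processor kernels over its own band; only global results, such as the histogram or the Hough accumulator, need a final merge across processors.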
Figure 3. Transputer architecture for the Hough transform
For the transputer network, the timing is also data-dependent, and is given for 4k image points (approximately 7% image occupancy). It is not easy to draw conclusions from such a mix of systems, even though they represent only two processor families. It is clear, however, that the DSP-based systems outperform their transputer counterparts. It is not possible, nor is it intended, to give comprehensive performance figures for such multiprocessor image processing architectures; this section is intended only to give a broad picture of what can be achieved with multiprocessor systems.

CONCLUSIONS

DSP processors, conventional microprocessors and transputers have been compared on the basis of their ability to perform image processing tasks, with a view to assessing their potential for implementation in realtime systems. A full selection of image processing algorithms has not been attempted, and the algorithms used are those of use mainly in computer vision applications. Future studies will extend both the range of algorithms and the range of processors under scrutiny.

Transputer systems are generally more expensive for similar throughput, but are more flexible in the sense of reprogrammability, and a single system might be used for all processing levels, numeric and symbolic, in a computer vision system. The transputer's versatility is such that the same system could also be used for other tasks, such as matrix equation solution, computer graphics and other 'supercomputing' tasks. DSP devices exhibit a clear advantage on a performance-per-processor basis, and in embedded realtime systems they will prove the more cost-effective. At video rates of 40 ms per frame, it appears that histogramming can be performed on 256 x 256 images with a single M56000 or a single TMS320c30. If 128 x 128 images are considered, the range of algorithms which can be implemented on a single processor is enlarged, and all the DSPs presented are capable of executing at least one algorithm within the time constraint.

The selection of a particular DSP device should not be based solely on data of the type presented here. In particular, all considerations of cost, board area, power consumption and ease of implementation have been omitted. This is important because it may well be possible, for example, to use two lower-cost processors in place of one expensive one. If an application cannot be implemented on a single processor, the other major factor is the ease with which a multiprocessor system can be constructed using a particular device. All new DSP processors have provision for multiprocessor architectures (e.g. global memory, serial communications etc.), though their programming provisions are not as well developed as those for the transputer, with Occam or its other languages.

REFERENCES
1 Pype, P et al 'A technology transfer from research to development' Proc. 5th Annual ESPRIT Conf. (November 1988) pp 3-4
2 MC68020 32-bit Microprocessor User's Manual 2nd edition, Motorola Inc., Austin, TX, USA
3 TMS320c25 User's Guide Texas Instruments, Austin, TX, USA
4 The Third Generation of the TMS320 Family of Digital Signal Processors -- Functional Specification Texas Instruments, TX, USA (1988)
5 DSP56000 Digital Signal Processor User's Manual Motorola Inc., Austin, TX, USA (1986)
6 NEC μPD77230 Digital Signal Processor Product Description NEC
7 The Transputer Instruction Set -- A Compiler Writer's Guide Inmos, Bristol, UK (1987)
8 Naqvi, A A An Efficient Multiprocessor Architecture for Low Level Computer Vision PhD Thesis, University of London, UK (1988)
9 Naqvi, A A and Sandler, M B 'Performance of the OSMMA image processing system' in Dew, P M, Earnshaw, R A and Heywood, T R (Eds.) Parallel Processing for Computer Vision and Display Addison-Wesley, Reading, MA, USA (1989) pp 145-152
10 Ngan, K N et al 'Parallel image processing system based on the TMS32010 digital signal processor' IEE Proc. Vol 134 Pt E No 2 (March 1987) pp 119-124
11 Sleigh, A C et al 'RSRE experience implementing computer vision algorithms on transputers, DAP and DIPOD parallel processors' in Page, I (Ed.) Parallel Architecture and Computer Vision Clarendon Press, Oxford, UK (1988)
12 Jesshope, C et al 'Base level software for the P1085 Supernode' Proc. 5th Annual ESPRIT Conf. (November 1988) pp 796-813
13 Sandler, M B and Eghtesadi, S 'Transputer based implementations of the Hough transform for computer vision' Microprocess. and Microprogram. Vol 24 (1988) pp 403-408

Liaqat Hayat was born in Sargodha, Pakistan in 1963. He graduated from Punjab University, Lahore, Pakistan in 1983 and obtained an MSc in electronics from Quaid-i-Azam University, Islamabad, Pakistan in 1986. He joined the image processing and computer vision group at King's College, London, UK in 1988, where he is working on the implementation of various image processing algorithms on a multiprocessor computer architecture. His research interests include multiprocessor architectures and their application to image processing and computer vision.
Luciano da Fontoura Costa was born in Sao Carlos, Sao Paulo, Brazil in 1962. He graduated in electronic engineering in 1985 at the University of Sao Paulo, Brazil and received a BSc in computer science from the Federal University of Brazil in 1986. He also has an MSc in applied physics from the University of Sao Paulo, obtained in 1986. He has been studying for a PhD degree at King's College, London, UK in the image processing and computer vision group of the Department of Electronic and Electrical Engineering since April 1988, and is working on transputer applications and the Hough transform. His special interests are in computer vision, pattern recognition, digital signal processing and parallel architectures.
Mark Sandler has BSc and PhD degrees in electrical engineering from the University of Essex, UK, obtained in 1978 and 1984 respectively. He has lectured at King's College, London since October 1982. His research interests are in computer architectures for image processing and computer vision, and the application of digital signal processing to digital audio.