Transputer implementation of interpolators for radar image transformations J E L Hollis* and T E Cronk tt
The need for enhanced performance in small scale image processing systems in order to achieve real-time image transformation to match image capture rates has been increasing with the availability of economic large-scale images, such as those from earth satellites. In this paper the design of a floating point, transputerbased interpolator for synthetic aperture radar (SAR) image transformations is investigated and methods are devised for the measurement of interpolation performance. Assessment of interpolator speed and accuracy, to maximize interpolation throughput within the bounds of required accuracy, has allowed the practical compromises to be improved for real images. An interpolator accuracy test for band-limited, sampled image data has been employed to gauge the accuracy and efficiency of multi-point interpolators for application to 2-D SAR image interpolation. The performance improvements realized by a multi-transputer implementation have been evaluated and assessed. With sufficient processors the system achieved image transformation at a rate approaching that of the image capture.
Keywords: transputer optimization, multi-processor image rotation, synthetic aperture radar image processing
Computationally intensive, multi-point interpolators are required for remotely sensed synthetic aperture radar (SAR) image processing 1. The processor requirements are demanding. In practice, direct implementation of a twodimensional multi-point interpolator of the type required for SAR images leads to slow operation because of the complex data access and large number of computational operations. Data accessing can be simplified, and the number of computations reduced considerably, if the twodimensional interpolator is replaced by two one-dimensional interpolators ~'~ or if data can be rearranged in memory quickly. *Cranlield University, C(~mmunications& Information SystemsEngineering Group, Shrivenham, Oxfordshire, SN6 8LA, UK fUniversity of Portsmouth, School of Systems Engineering, Portsmouth, Hampshire, PO1 3DJ, UK .f-Now at Hitachi UK Ltd., Maidenhead, UK Paper received: ~ November 1994. Revised: 12 December 1994
The transputer capability in these directions is examined in this paper. The interpolation throughput sustainable from a floating point transputer has been measured to establish its potential in SAR processing. Transputer implementation of a multi-point interpolator for band-limited, SAR image data interpolation has been investigated and performance quantified. Further speed improvements for multi-processor implementations for this application have been investigated and presented.
INTERPOLATOR IMPLEMENTATION For an efficient interpolation technique with complex band-limited SAR image data, it was necessary to minimize the loading imposed by the interpolator on the digital processor. Whilst it is not a problem to implement an interpolator it is a problem to measure how well it is doing. Performance criteria were required as a basis on which to judge improvements.
Performance quantification An estimate of the theoretical maximum FLOP rate for the processor in an interpolation application was calculated, and used to gauge the performance improvement achieved with optimization techniques. It was also used to assess the efficiency of the multi-point interpolators. The method of calculating the FLOP rate needs to be borne in mind when comparing the performance figures reported here with those elsewhere. The processing rate was defined in terms of the number of floating-point operations required to carry out the mathematical interpolation, not the number of operations that are actually executed by the transputer in implementing an interpolation in a particular rendering. Defined in this sense the measure represents the best that can be achieved.
0141-9331/95/$9.50 '!" 1995 Elsevier Science B.V. All rights reserved Microprocessors and Microsystems Volume 19 Number 4 May 1995
179
Transputer implementation of interpolators: J E L Hollis and T E Cronk For example, in a four-point Sinc interpolator (Equation (1), below), each interpolation requires: An imaginary and real multiplication for each point in the four-point interpolation - 8 multiplies Plus (interpolator length - 1) additions for both imaginary and real terms = 6 adds = 14 FLOPs per interpolation. Using the number of processor cycles per floating point instruction on a T800 transputer gave the number of cycles taken for a floating point multiplication: 2 cycle load + 2 cycle load + 11 cycle multiplication + 2 cycle store = 17 cycles/multiplication Similarly the actual number of cycles per floating point addition could be obtained: 2 cycle load + 2 cycle load + 6 cycle add + 2 cycle store = 12 cycles/addition. Thus the total number of processor cycles per interpolation could be derived: 136 cycles (8 multiplies) + 72 cycles (6 additions) = 208 cycles/interpolation. The average number of cycles per FLOP operation = 208/ 14 = 14.857 cycles/operation. At a clock rate of 20MHz, the theoretical maximum throughput for the interpolation on this transputer is 20"10~'/14.85 = 1.346 MFLOPs. In general the number of FLOPs per interpolation was first calculated; then using the average number of processor cycles per floating point operation for a T800 transputer, the average number of cycles per operation was estimated, and hence the MFLOP rate. This figure takes no account of the additional floating point operations necessary in the overall interpolation process, such as those required to align the interpolator: the actual throughput therefore falls short of this idealized level. Running the interpolator on a floating point transputer using Occam coding with no specific consideration given to performance enhancements, an interpolation throughput of 0.173 MFLOPs was measured. This figure represented only a 12.8% efficiency on our criterion and implied that a significant speed improvement could potentially be gained.
make significant reductions in real-time computational loading. The key optimization factors that were used to enhance the interpolation process were: • Multi dimensional array access. The use of two-dimensional array constructs imposes a CPU overhead in the generation of addresses for data access. Minimization of the use of two-dimensional variables produced an improvement in interpolation speed (14.5°/,,). • Use of loops. For efficient use it is important that each loop should contain adequate instructions to make the looping time insignificant. This was the case in the interpolator kernel loop. A good speed improvement (53.3%) was obtained by expanding the interpolation loop. • Assembly code. Assembly code programming allowed more control over the processor than was possible using higher level languages (HELIOS) and compilers. Several useful techniques were: (a) Regularly used data was maintained on the stack. This saved address recalculation and reduced the total data access time. (b) Careful utilization of integer and floating point processing minimized the delays that parallel processors experienced waiting for each other. (c) Calculation of multi-dimensional addresses was enhanced by storing them in local workspace. Invoking these techniques and interpolating from a constant position (no repositioning calculations) produced a 331.8% improvement. Pre-rounding. Pre-rounding the interpolating function look-up table data removed the slow rounding operations from the interpolator repositioning calculations (7% improvement). • Use of fast on-chip memory. The on-chip RAM, located at the bottom of the address map, can be accessed approximately five times faster than the off-chip memory. Interpolation throughput was therefore enhanced by storing the interpolation function look-up table and frequently used variables and data values in this on-chip memory (12% improvement).
•
Table 1 presents the interpolation speed improvement achieved by each optimization technique. Applying these techniques allowed a 20MHz floating-point transputer implementing a four-point interpolation kernel to achieve a maximum complex interpolation throughput of 63.5"10 ~ interpolations per second (66% of theoretical maximum). At this rate it took some 14 min to perform the four-point interpolation to geometrically alter a typical SAR image of
Performance enhancement Table 1 Summary of interpolator speed enhancements
By utilizing the inherent parallelism of the transputer, making good use of on-chip cache memory and employing careful code organization and algorithms, a speed improvement on the original code of 418% was achieved, allowing a 66.5% speed efficiency to be extracted from the processor in this practical interpolation application. This result illustrates how coding for speed improvement can
180
Technique
Speed improvement (%)
Removal of 2D array variable Loop expansion Assembly code insertion Rounding/use of on chip RAM
14.5 53.3 331.2 19.0
Microprocessors and Microsystems Volume 19 Number 4 May 1995
Transputer implementation of interpolators: J E L Hollis and T E Cronk 5000 by 5000 pixels, when using sufficient memory to store both the complete input and output images simultaneously.
INTERPOLATOR CONSIDERATIONS Sinc function interpolation gives good results for bandlimited data 4. In practice, however, the sinc function should be truncated, and applied via a look-up table of restricted size to improve speed. Unfortunately, this can sometimes distort the data and produce undesirable artifacts. To implement the standard one-dimensional sinc function multi-point interpolator on a digital computer, the summation was truncated to the required length with ± m either side of the point of interpolation. The interpolation equation is: ,o
m
x(k') = ~ /;
p
x(k')
(1)
(ap c, + ibp cp)
(2)
m
~ p
x(k + p).sinc(k'- k - p) m
Testing interpolator accuracy The measurement of interpolator accuracy presents a problem in that some method of testing interpolation across the full data range was required. Since SAR data is band limited and sampled at discrete intervals, it can be represented as a linear sum of complex exponentials, exp(±i2xnA). Testing a limited number of complex exponentials across the data bandwidth is then adequate. In a typical SAR radar application the sample rate is some 19 MHz and the maximum signal frequency 8 MHz. To test up to the Nyquist sampling frequency requires a highest normalized test frequency A of 8/19 cycles per sample. Varying A from 1/19 to 8/19 provided tests across the bandlimited range. With this degree of test data curtailment it was possible to achieve results with a practical number of tests and allow interpolator implementations to be assessed for their accuracy. A measure of the accuracy of an interpolation is the error between the exact value of pixel intensity Kk..... n and the interpolated value /interpolated. A more useful measure is the mean square error (MSE) over N interpolations:
MSE
(Ik ......
/___, 0
m
where x ( k ' ) = integer data samples, k ' = point for interpolation, k - nearest known sample point, p = sample displacement, m - integer, ap = Real(x(k + p), b, = Imag(x(k + p) and c, = sinc interpolating function. An important feature of this interpolation, providing the band limits are suitably arranged, was that it could be implemented as purely real operations and both the real and imaginary terms of the complex data interpolated independently using the same interpolation kernel with the same addressing. If the band limits are defined from 0 to N, rather than from -N/2 to N/2, then an additional phase term (e i~((N 1)/N(k' k))) is introduced in the interpolation which prevents the real and imaginary terms of the data from being independently interpolated. However, the data may be frequently shifted to the band limits -N/2 to N/2, interpolated, and then shifted back. The speed of the interpolator was enhanced by making use of this fact. The sinc function interpolation coefficients (ap, bp) were dependent on the displacement of the output image pixel from the input image pixels (the interpolator 'index'). Adjacent pixels in the input image were addressed by an increment of one; thus, the sinc operand was always in the range -0.5 to 0.5. A look-up table was used to store the required sinc function coefficients to avoid recalculation at every interpolation; however, this restricted the number of stored coefficient values, and effectively limited the spatial resolution of the interpolator. Interpolators for spatial resolutions of 1/8th to 1/32nd were investigated in this work. Using a look-up table structure allows a broad class of interpolators to be implemented (such as Lagrange polynomial interpolators, sinc function interpolators, etc.) by simply calling the corresponding function coefficient lookup table data. This provides a powerful interpolator test tool.
- - /interpolated) 2
(3)
N
This gives a measure of the interpolator performance. Recording this for both real and imaginary parts of the data allowed the overall accuracy performance to be assessed. The peak error produced by an interpolator across the whole bandwidth is also an indicator. The total joint mean square error and peak error values characterize the interpolator accuracy, allowing a comparison between interpolators and indicating the suitability of an interpolator for a given application. In general, if the mean square and peak variations are less than any existing data noise, the interpolator accuracy is adequate. Tests were carried out on a wide range of sinc function and Lagrange polynomial multi-point interpolators with lengths 4, 5, 6 and 7 points, and with interpolator spatial resolutions limited between 1/8th to 1/32nd.
Effects of spatial resolution Limiting the interpolator spatial resolution to 1/16ths of the pixel spacing compared to 1/32nds was found to have little effect on the interpolator accuracy. When limited to 1/8ths, however, the interpolator's performance degraded unacceptably. An example of the degradation is given in Table 2, which shows the interpolation error increasing to a level above the data error. Table 2
Effect o f spatial r e s o l u t i o n l i m i t a t i o n o n i n t e r p o l a t o r a c c u r a c y
Spatial resolution
Total joint MSE
Peak error
1/32 1/16 1/8
5.29"10 ~ 5.32"10 ~ 8.76"10 ~
4.88"10 `2 5.23"10 _2 8.46"10 `2
Microprocessors and Microsystems Volume 19 Number 4 May 1995
181
Transputer implementation of interpolators: J E L Hollis and T E Cronk The fact that the resolution can be limited quite severely (1/16ths) without adversely affecting interpolator accuracy is significant as it supports the use of look-up tables, and allows an associated improvement in interpolator speed.
Interpolation Rate Against Number of Slaves 35 y " Interpolations/second (Thousands)
26
20
Interpolator length
For many applications a four-point interpolation, implemented on a transputer, was found to be sufficiently accurate and offered the minimum computational requirement. Interpolator tests confirmed that even-length interpolators (4, 6 point), in general, produced smoother responses than odd-length interpolators (5, 7 point), which at certain sample rates produced pronounced steps. These steps occurred at the points where the interpolator kernel moved along the data by one sample point. In the 'even length' interpolator case, the shift occurred when interpolating at a known sample point where an exact result was produced because a zero offset sinc function was used. In the 'odd-length' case, however, the shift occurred when the interpolation point was between two sample points. If the input data value was such that the discarded sample point had a low intensity value and the new sample point had a high intensity value, a step in the response was created. This effect was further exaggerated if the interpolator bias changed, i.e. if the 'odd-point' took its weight from above, rather than below, the point of interpolation. With certain images, the 'odd-length' interpolator may produce undesired artifacts, Figure 1.
Parallel implementations
A processor farming multi-processor architecture maps well to the processing of image data in sections (tiles)4. Figure 2 shows the image resampling rate attained (number of output image pixels generated/second) against the number of slave processors. The image resampling rate was calculated by dividing the number of transformed
Example of Odd Interpolator A= 2/19 Pixel
16 .....
.....
10
J
o
1
o
2 3 4 X " Number of slave processors
5
Figure 2 Multi-processorimplementations
pixels produced by the time taken to produce them. The graph illustrates the advantage gained in using multiple processors. The processing rate increases with the number of processors used.
Application to real images
Figure 3 shows a processed image of Portsea Island. This was produced by first extracting data from a Landsat image of the Solent area (Matra Marconi Space, ERS-] data); this data was then interpolated and rotated using the developed multi-processor, transputer based system. The graphic system's pixel intensity was defined to a 256 grey scale palette. Due to monitor limitations, this produced a coarse pixel resolution, allowing only sections of a complete Landsat image scan to be viewed at a time. The results confirrned the predicted speed enhancement of the multiprocessor system.
Step
Intensity
1.5~
0.5 0
-
0.5 .1
.
.
- 1.5
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
I
K
I
J
I
i
2
4
6
8
10
12
14
Integer Data Points YJ
Figure 1
182
Real Calculated
+
Real
Imaginary
D
Imaginary
Calculated
Interpolated
Odd-length multi-point interpolator artifact
6
Interp.
Figure 3
Landsat image after rotation with a transputer interpolator
Microprocessors and Microsystems Volume 19 Number 4 May 1995
Transputer implementation of interpolators: J E L Hollis and T E Cronk
CONCLUSIONS The following conclusions were drawn from the work: • ]he maximum complex interpolation throughput obtainable from a single 2 0 M H z floating-point transputer implementing a four-point interpolation kernel is 63.5"10 ~ interpolations per second. Speed improvements to approach this ideal can be obtained by a number of judicious techniques in the processing organization. With sufficient memory to store both the complete input and output images it took about 14 min to perform the interpolation required to geometrically alter a typical SAR image of 5000 by 5000 complex data points. • It was found adequate to apply the interpolator via a look-up table such that interpolator spatial resolution was limited to 1/16ths of input data separation. • For many applications a four-point interpolator, implemented on a transputer, is sufficiently accurate and offers the minimum computational system. • 'Even-length' interpolators produce smoother responses than 'odd-length' interpolators. • Interpolator speed can be greatly enhanced by utilizing the fact that, when the data pass-band is at baseband, interpolation can be implemented by purely 'real' operations. • Parallel processing of an image, by sub-tiling, concurrently transforming the sub-tile image sections, and finally reassembling the transformed tiles to construct the final image, can greatly enhance the interpolator speed. We have since employed similar approaches in correction of image warping using a Sun workstation and Intel i860 vector processors.
ACKNOWLEDGEMENTS The authors wish to thank Matra-Marconi Space for their scientific involvement and input in the work described in this paper. Thanks must also be given to IBM for their sponsorship of the parallel processing equipment. The work was carried out under a Science and Eng-
ineering Research Council (SERC) grant together with Matra Marconi Space support. It was undertaken at the request of Matra Marconi Ltd.
REFERENCES 1 Shlien, S 'Geometric correction, registration and re-sampling of Landsat imagery' Can. J. Remote Sens. Vol 5 No 1 (1979) pp 74-89 2 Friedmann, D E '~wo-dimensional re-sampling of line scan imagery by one-dimensional processing' Photogramm. Eng. Remote 5ens. Vol 47 No /0 (1981) pp 1459-1467 Friedmann, D E 'Operational re-sampling for (orreeting images to a geocoded format' Presented at the Fifteenth International Symposium on Remote Sensing of Environment, Ann Arbor, MI (1981) pp 195-212 4 Cronk, T E 'Application of transputers to high speed interpolation of complex SAR images', Doctorate, Portsmouth Llniversity (September 1992)
After graduating in electronic engineering from Imperial College, London University, James Hollis undertook, on invitation, a doctorate al Imperial College. A ten year period of service with the Mhristry of Defence provided experience in the design and installation of electronic systems. He entered the academic environment first at Imperial College and later with Portsmouth Polytechnic/University of Portsmouth. He was appointed to Director of Resources in the Department of Electronic Engineering, at Portsmouth. Secondments to industry included periods at Plessey Ltd, UK, and Motorola Lid, FRG. In 1990 he joined the Oanfield University, Communications and Intormation Systems Group, where he is currently engaged on research ~n in]age processing.
.....
Initially schooled at the Nautical ('ollege, Sunningdale, Thomas Cronk pursued his higher education with the University of Portsmouth, graduating in electronic engineering in 1989. He joined Matra-Marc oni Lid on a s[,onsored studentship in the space systems group at MSS, Portsmouth. Research into the application of transputers to high speed interpolators k~r complex radar images with Matra-Mareoni led to registering tot a higher degree at Portsmouth which he completed it] 1992. His previous experience includes digital design work with British Petroleum Research Centre in the computers and communications group and a teaching associateship with the University ot Portsnrouth in computer aided design. Currently he is with Hitachi (UK) l td, Maidenhead, UK.
M i c r o p r o c e s s o r s and M i c r o s y s t e m s V o l u m e 19 N u m b e r 4 M a y 1995
183