Pattern Recognition Letters 6 (1987) 101-106 North-Holland
July 1987
DIPOD: An image understanding development and implementation system
A.C. SLEIGH
Royal Signals & Radar Establishment, St. Andrews Road, Great Malvern WR14 3PS, United Kingdom
P.K. BAILEY
Logica
Received October 1985
Revised 3 March 1986
Abstract: A recently completed processing system aimed at real-time image analysis using a data-flow network of powerful processors is described. The system has many unique features, including a new language Fith, which combines high efficiency with an advanced programme environment offering both rapid execution and rapid algorithm development and tuning. The paper describes DIPOD and outlines the experience gained during its commissioning at RSRE.

Key words: Parallel architectures, real-time processing, image understanding.
British Crown Copyright 1985

1. Introduction

During 1982-1983, a programme in image understanding and high level scene analysis at the Royal Signals and Radar Establishment led to the identification of a set of requirements for a processor architecture which could support research into real-time advanced image processing, including the capability to demonstrate credible performance in certain key applications. These requirements were summarised as:
- provide raw processing performance consistent with the demands of real-time advanced image analysis;
- offer an interactive programming environment with high-level language features to support data dependent and list processing algorithms;
- include a fully integrated editor, with rapid programme modification for 'tuning' processes with live data;
- allow processors to be configured to map the parallel and pipelined nature of data-driven stages in complete image analysis heteroarchies;
- be extensible in order that new hardware can be added to perform particular intensive tasks, such as using convolution or median filter chips, without extensive modification to the system software or compilers;
- have integrated TV image input and output facilities.

Systems then available failed to meet even a subset of these requirements, being either too slow, too inflexible, difficult to programme or involving extremely long compilation and linking operations after any programme or parameter change.

Several approaches were initially pursued. One option was many 68000 type processors, but the desired data-flow tasks were too intensive to be executed on a single processor, leading to complex and inefficient interprocessor communication. Another option was to use one of the ADA engines emerging at that time, but these were again much too slow and the software environments were too inflexible to support hardware extensions without rewriting the ADA support environment with every change.

Bit-slice processors were considered and quickly looked attractive. A bit-slice processor uses standard processing elements which perform functions such as arithmetic, stack operations, and memory access across a small number of bits. These components are arranged to produce the width of processor word required, giving an architecture under the control of the designer. The bit-slice components are activated by setting bits across a wide microcode word, each word corresponding to one micro-command in a sequence of micro-code instructions to perform functions such as memory access or arithmetic.

It was clear that a bit-slice processor could increase processing speed by a factor of 10 or more over standard microprocessors, and this would allow most operations to be executed on a single processing node at the desired update rate, greatly simplifying the inter-processor communication requirements. Data is then only passed between major stages, such as boundary list extraction or shape classification; a much more tractable problem than, say, implementing a Sobel operation across 10 processors.

Additionally, it was argued that while current single chip processors have a performance of 0.1 to 0.5 MIPS, future processors are likely to be an order of magnitude faster, and the architectural issues of these future faster processors will be qualitatively different. A machine which could provide this performance today would serve as a valuable research tool for parallel processing architectures implemented in VHPIC and WSI.

However, previous bit-slice machines failed to realise their theoretical performance, except when doing repetitive operations on large amounts of data. Complex, data dependent algorithms ran more slowly than anticipated, mainly because of program language shortcomings which either offered crude facilities (encouraging number-crunching rather than elegance) or involved large overheads mapping a high level language onto unsuitable hardware.
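The way narrow bit-slice elements are ganged together to give the required word width can be illustrated with a small sketch. This is not the DIPOD hardware: the 4-bit slice width and the function names `slice_add` and `bitslice_add` are assumptions made for illustration, since the text only says that each element works across "a small number of bits".

```python
SLICE_BITS = 4  # assumed slice width; the paper does not state one


def slice_add(a, b, carry_in):
    """Add two SLICE_BITS-wide values plus a carry; return (sum, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << SLICE_BITS) - 1), total >> SLICE_BITS


def bitslice_add(x, y, n_slices):
    """Add two words by rippling the carry through n_slices identical slices."""
    mask = (1 << SLICE_BITS) - 1
    result, carry = 0, 0
    for k in range(n_slices):
        shift = k * SLICE_BITS
        # Each slice sees only its own few bits of the operands.
        part, carry = slice_add((x >> shift) & mask, (y >> shift) & mask, carry)
        result |= part << shift
    return result, carry
```

With four slices this behaves as a 16-bit adder; changing `n_slices` changes the word width without changing the slice itself, which is the design freedom the text refers to.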
Hence our main concern was how to combine bit-slice processing with a powerful but efficient programming environment. The solution we adopted elegantly overcame these problems by designing the programming language and the machine as an integrated whole. The hardware gave the language special support, such as hardware stacks and memory address generation, while the language took account of the limitations of the machine. The result was a processor which can support high level language concepts directly in hardware or in short sequences of bit-slice micro-code.

The underlying concept adopted was based loosely on Forth (Brodie, 1981), with the 'immediate' and 'run-time' action of words giving great control of the hardware and unlimited extensibility. However, Forth is a difficult language to use and is very insecure: the programmer has more freedom than is good for him. Hence a new language, Fith, was devised which uses a functional notation (in sympathy with POP) to provide the much needed checking absent in Forth. Additionally, scoping, full mutual recursion, higher order functions and dynamic memory allocation were provided.

The language runs directly on the bit-slice machine (there is no inner-interpreter), with the primitives mapping either onto single microcode words (magic primitives) or onto short sequences of micro-code. Special primitives have been written for operations such as convolution, and were easily substituted for equivalent Fith secondary functions, but for most operations the Fith programme is almost as fast, thanks to the pre-fetch tree scanning used to fill the microcode sequencer FIFO.

The remainder of this paper describes the main features of the system, shown in Figure 1, outlines the language Fith, and discusses the performance achieved so far in use at RSRE.
2. The system architecture

There are two levels of architecture: the interprocessor communication through the Inter Node Bus (INB), Figure 2, and the design of the processing nodes hung onto the Bus, Figure 3. Currently there are three types of node: the Programme Support Environment (PSE), the Input/Output Node (ION) and the Fith Execution Node (FEN).

Figure 1. [System diagram. Recoverable labels: host super mini-computer for Fith language development and control (Unix operating system); image input and output, with input frame store and output frame store; high-speed system bus, minimum 20 MBytes/s; M68000 control processor; high capacity memory module; high-speed Fith language processor; up to 13 additional high-speed Fith language processors.]

The INB is a 32-bit wide passive bus used for token passing between nodes, each node having hardware for receiving the token, giving efficient data-driven communication. Data is transferred at 20 Mbytes/s as pre-defined data objects within the programmes on each FEN. The virtual pathways thus set up between nodes refer to each data object, so different objects can go to different nodes (or not be sent at all).

The PSE consists of a standard mini computer which itself runs emulated Fith in C on Unix. The editor, incremental compiler and downloader are all written in Fith on top of this emulator.
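The idea of virtual pathways that are defined per data object, so that different objects leave a node for different destinations or are not sent at all, can be sketched as a small routing table. This is a toy model only: the class `Bus`, its methods and the node and object names are hypothetical, and the token-passing arbitration of the real INB is not modelled.

```python
from collections import defaultdict


class Bus:
    """Toy inter-node bus: delivers named data objects along configured pathways."""

    def __init__(self):
        self.pathways = defaultdict(list)  # (source, object name) -> destinations
        self.inboxes = defaultdict(list)   # node -> [(object name, payload)]

    def connect(self, source, obj_name, dest):
        """Declare a virtual pathway for one data object."""
        self.pathways[(source, obj_name)].append(dest)

    def send(self, source, obj_name, payload):
        """Deliver payload to every destination of this object; return the count."""
        dests = self.pathways.get((source, obj_name), [])
        for dest in dests:
            self.inboxes[dest].append((obj_name, payload))
        return len(dests)


# Hypothetical configuration: two objects from one node take different routes,
# and a third object has no pathway and so is never sent.
bus = Bus()
bus.connect("FEN1", "edge_map", "FEN2")
bus.connect("FEN1", "boundary_list", "FEN3")
bus.connect("FEN1", "boundary_list", "ION")
```

Keeping the routing in a table outside the algorithm code mirrors the text's point that pathways are set up per object rather than hard-wired into each programme.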
The ION uses two CCTV framestores with a 6800 emulating Fith to take IO Objects to and from the inter-node bus, as determined by a configuration file set up prior to activating the token.

The most important nodes are the FENs. With the PSE and ION, up to 14 FENs can be connected to one bus (but busses could in principle be chained through gateways to provide greater parallelism). Figure 2 shows the structure of each FEN. A 68000 control processor is used to pass data, programme and microcode from the INB to the bit-slice processor. Three areas of memory exist:
- Common Memory, which is permanently connected to the bit-slice processor while it is active, but may be loaded during initialisation by the 68000 control processor.
- Bank Switch Memory, which comprises two blocks of memory with identical addresses which may be alternately connected to the control processor and INB buffer or to the bit-slice processor in a 'ping-pong' arrangement. In many algorithms this enables IO between nodes to take place in parallel with the active task.
- Tagged Program Memory, which has pre-fetch tree scanning hardware, including a return stack, to thread through the Fith code delivering only primitive functions to the sequencer FIFO.

Figure 2 and Figure 3. [Block diagrams titled 'Top level of DIPOD system' and 'Fith execution node architecture'. Recoverable labels: Inter-node Bus; INB interface; Program Support Environment; Control Bus (Addr + Data); Prog. memory; Pre-fetch.]

There are three 16-bit busses enabling low-level pipelines to be set up to fetch, process and replace values in a single 120 nanosecond cycle, and to allow effective use of a 32-bit multiply and accumulate chip. Memory is addressed via a 24-bit bus, accessed either by a symbol table or by offset calls. Note that Fith pre-fetch, memory address generation, stack access and arithmetic can all occur in parallel, leading to a very high cycle utilisation factor compared with standard microprocessor architectures. A novel feature is the use of distributed micro-code, so that new hardware can be added with its own section of the micro-code word addressed by the control bus.
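The 'ping-pong' arrangement of the Bank Switch Memory can be sketched in a few lines. This is an illustrative model under stated assumptions, not the DIPOD implementation: the class name `PingPongMemory`, its methods and the explicit `swap` call are hypothetical, and the real hardware switches banks of identically addressed memory rather than Python lists.

```python
class PingPongMemory:
    """Two equal banks: one visible to the processor, the other to the IO side."""

    def __init__(self, size):
        self.size = size
        self.banks = [[0] * size, [0] * size]
        self.processor_bank = 0  # index of the bank the processor currently sees

    @property
    def io_bank(self):
        """Index of the bank currently connected to the IO side."""
        return 1 - self.processor_bank

    def io_write(self, data):
        """IO side fills its bank while the processor works on the other one."""
        if len(data) > self.size:
            raise ValueError("data does not fit in one bank")
        self.banks[self.io_bank][:len(data)] = data

    def processor_read(self):
        """Return a copy of the bank the processor is connected to."""
        return list(self.banks[self.processor_bank])

    def swap(self):
        """Exchange the roles of the two banks, e.g. at the end of a cycle."""
        self.processor_bank = self.io_bank
```

Because both banks share the same addresses, the active task never needs to know which physical bank it is reading; only the `swap` point has to be coordinated with the IO transfer.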
3. Fith(TM)

Fith is a threaded-interpreted language. Each function creates an entry in a vocabulary which contains pointers to immediate and run-time function calls, so execution consists of scanning the tree formed by this threaded code until a function is encountered which can be executed directly on the processor.

Each function is defined in a similar way to most block structured languages, with input and output parameters being specified. The usual control structures are provided, such as looping and IF THEN ELSE branches. Variables have a scope restricted to their defining block, an essential feature for complex programmes. Objects can be static or dynamic, although allocation and freeing of dynamic memory must be performed explicitly (at present). This is not a significant limitation in a real-time data driven system, since all dynamic memory can be reset at the beginning of each cycle, and in any case garbage collection would introduce unpredictable pauses.

Higher order functions are available, and these can be used for extending the language facilities. Templates are used to define and efficiently access complex data structures. Basic list processing primitives are included, and the language caters for an integrated screen editor, incremental compiling and auto-loading of functions from disc. Debugging tools and a help system complete the comprehensive user interface.

An important aspect is the ability to perform all operating system functions in Fith, and a measure of Fith's power is its use to program all aspects of the DIPOD system, from the editor, FEN emulation and virtual pathway definition to debugging aids. This uniform arrangement makes the system very flexible at all levels, and avoids the frustrations of hitting arcane operating system calls when delving into the heart of the system. A sample of Fith is shown in Figure 4.

    { EXAMPLE OF INNER SOBEL PRIMITIVE IN Fith }
    { These primitives can be microcoded for speed }

    def (sobevpt(r evoff) > sum : i)
      { r is sobel size, evoff is size of image, sum is output, i is a local scalar }
      putexmac(0)                    { Initialise multiply/accumulator }
      putlomac(0)
      puthimac(0)
      for (i 0 r i)
        mac(gethibyte peekcache+)
        mac(getlowbyte peekcache+)
        mac(gethibyte peekcache+)
        putoffset(add(offset decs(evoff)))
      endfor
      put(lomac sum)
    end

    { CONFIGURATION FILE FOR RUNNING ALGORITHM ON FEN's }

    def (configure : i j)
      setproc(tm(F2))                { Initialise network, set to F2 }
      cursor("fpatch")               { Declare "fpatch" as cursor }
      setcursor(256 256)             { Move cursor to 256,256 }
      vom("outarray" 64 64)          { Declare "outarray" at 64,64 }
      patch(STREAM)                  { Capture "fpatch" all time }
      signedimage(true)              { Visible signed cursor }
      signedcolour                   { False colour output (signed) }
      expandVOM                      { expand by 2 in x,y directions }
      readmask("/usr/dipod/fith/ip/SOBELMASKS/maskl")   { Read up sobel mask }
      { Initialise the output array outarray }
      for (i 0 decs(sizecolumn(iadr(outarray))) i)
        for (j 0 decs(sizerow(iadr(outarray))) j)
          putarr(0 outarray(i j))
        endfor
      endfor
      primitive(iadr(sobevpt) previous("sobevpt"))
      primitive(iadr(soboddpt) previous("soboddpt"))
      download("run sobel")          { Program to run on FEP }
    end

Figure 4.
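The scanning of threaded code described in this section, where the tree of function definitions is walked until directly executable primitives are reached, can be sketched as follows. It is a simplified model under stated assumptions: the function `scan_primitives`, the dictionary form of the vocabulary and the word names are hypothetical, and control flow (loops, branches, recursion) is deliberately left out. In DIPOD this job is done by the pre-fetch hardware with its return stack, feeding the sequencer FIFO.

```python
def scan_primitives(vocabulary, word):
    """Yield the primitives reached by threading through `word`, in execution order.

    `vocabulary` maps a secondary word to the list of words in its body;
    any word absent from it is treated as a primitive.
    """
    return_stack = []              # saved (body, position) of enclosing words
    body, pos = [word], 0
    while True:
        if pos == len(body):       # end of this body: resume the caller
            if not return_stack:
                return
            body, pos = return_stack.pop()
            continue
        current = body[pos]
        pos += 1
        if current in vocabulary:  # secondary: descend into its body
            return_stack.append((body, pos))
            body, pos = vocabulary[current], 0
        else:                      # primitive: hand it on for execution
            yield current


# Hypothetical vocabulary: "quad" is built from "square", which is built
# from two primitives.
vocab = {"square": ["dup", "mul"], "quad": ["square", "square"]}
stream = list(scan_primitives(vocab, "quad"))
```

The consumer of the generator only ever sees primitives, which is the property the text attributes to the Tagged Program Memory: the sequencer is never asked to interpret a secondary function itself.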
4. Experience during commissioning of DIPOD at RSRE

The basic DIPOD system delivered to RSRE in June 1985 had two nodes, and has been programmed to perform a number of low and medium level image analysis algorithms, including number-crunching and data-dependent list processing. A further 4 nodes are due to be installed in 1986, and some work has taken place devising data flow graphs for the resulting 6 node system.

Two aspects have been of special concern during the commissioning: the update rate and latency which can be achieved for typical algorithms, and the ease with which algorithms can be implemented, debugged and tuned. Experience has been good on both counts.

The processing speed has been consistent with executing a 'useful' operation every 120 nanosecond clock cycle (i.e. taking full account of any overheads such as fetching and storing data, setting loop counters, etc). A Sobel process has been implemented with an update rate of 0.11 seconds for a 128 x 128 patch, corresponding to a mean speed of 6 MOPS for the 43 operations necessary to perform each pixel evaluation, close to the maximum possible. A similar efficiency has been achieved for convolution of an 8 x 8 template over the same size patch.

List processing algorithms, such as breaking a boundary chain at points of curvature change, have been implemented efficiently, with a simple list reversal operating at 750K elements per second. A benchmark recursive reversal of a 30 element list takes 190 microseconds.

The high level language facilities have proved effective in writing these quite complex algorithms, and the ability to modify programme and parameters in running algorithms within seconds has been very important in developing stable algorithms. A suite of algorithms has been tested which forms a Sobel map, uses this to 'seed' boundary tracings, forms linked lists of boundaries, breaks these into straight line segments and arcs of measured parameters, and then makes nets of the line and arc segments which form closed shapes. This whole process takes several minutes on our VAX, but on DIPOD it updates at a rate of once every 2 seconds for a 128 x 128 image, including reconstructing the image from the output symbolic net list.
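The Sobel throughput figure quoted in this section can be checked with a few lines of arithmetic. The constants are taken from the text; treating one operation per 120 nanosecond cycle as the peak rate is an assumption about what "the maximum possible" means, and the function name `sobel_throughput` is only illustrative.

```python
CYCLE_S = 120e-9     # clock cycle quoted in the text
PATCH = 128 * 128    # pixels in the patch
OPS_PER_PIXEL = 43   # operations per Sobel pixel evaluation
UPDATE_S = 0.11      # quoted update time for the patch


def sobel_throughput():
    """Return (achieved ops/s, assumed peak ops/s, fraction of peak)."""
    achieved = PATCH * OPS_PER_PIXEL / UPDATE_S
    peak = 1.0 / CYCLE_S  # assumes at most one operation per clock cycle
    return achieved, peak, achieved / peak


achieved, peak, fraction = sobel_throughput()
```

Under these assumptions the achieved rate comes out at roughly 6.4 million operations per second against a peak of about 8.3 million, which is consistent with the rounded figure of 6 MOPS given in the text.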
Being able to see the results of processing in real time, and to modify code or parameters within a few seconds, is a very exciting experience which has brought new insights into why some stages are sensitive to small image changes, and how parameters should be tuned in response to image data or shapes found.
5. Conclusion
The DIPOD system has many features important to real-time image analysis. It provides fast and transparent data driven communication between processors which are 10 or more times faster than current single chip microprocessors. This speed is equally realisable for complex list processing as it is for regular arithmetic, and each processor is powerful enough to avoid the complexities of intensive inter-processor communication. The programming environment is flexible with many advanced features, yet provides direct access to the innermost hardware when desired. Both software and hardware are extensible, and other hardware elements, such as 'silicon algorithms', can be introduced as special nodes.

DIPOD aims to promote research into real-time machine vision, and could also be exploited in application problems. It forms an important element in the RSRE Machine Vision programme, and will soon be commercially available from Logica.
References

Brodie, L. (1981). Starting Forth. Prentice-Hall, Englewood Cliffs, NJ.