Microprocessors and Microsystems 32 (2008) 95–106
doi:10.1016/j.micpro.2007.09.001
Design and evaluation of a hardware/software FPGA-based system for fast image processing

J.A. Kalomiros a,*, J. Lygouras b

a Technological and Educational Institute of Serres, Department of Informatics and Communications, Terma Magnisias, 62100 Serres, Greece
b Section of Electronics and Information Systems Technology, Department of Electrical and Computer Engineering, Polytechnic School of Xanthi, Democritus University of Thrace, Greece

* Corresponding author. Tel.: +30 2321022854; fax: +30 2321039703. E-mail address: [email protected] (J.A. Kalomiros).

Available online 1 October 2007
Abstract

We evaluate the performance of a hardware/software architecture designed to perform a wide range of fast image processing tasks. The system architecture is based on hardware featuring a Field Programmable Gate Array (FPGA) co-processor and a host computer. A LabVIEW host application controlling a frame grabber and an industrial camera is used to capture and exchange video data with the hardware co-processor via a high-speed USB2.0 channel, implemented with a standard macrocell. The FPGA accelerator is based on an Altera Cyclone II chip and is designed as a system-on-a-programmable-chip (SOPC) with the help of an embedded Nios II soft processor. The SOPC system integrates the CPU, external and on-chip memory, the communication channel and typical image filters appropriate for the evaluation of the system performance. Measured transfer rates over the communication channel and processing times for the implemented hardware/software logic are presented for various frame sizes. A comparison with other solutions is given and a range of applications is also discussed.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Hardware/software co-design; Image processing; FPGA; Embedded processor
1. Introduction

The traditional hardware implementation of image processing uses Digital Signal Processors (DSPs) or Application Specific Integrated Circuits (ASICs). However, the growing need for faster and cost-effective systems triggers a shift to Field Programmable Gate Arrays (FPGAs), where the inherent parallelism results in better performance [1,2]. When an application requires real-time processing, like video or television signal processing or real-time trajectory generation of a robotic manipulator, the specifications are very strict and are better met when implemented in hardware [3–5]. Computationally demanding functions like convolution filters, motion estimators, two-dimensional Discrete Cosine Transforms (2D DCTs) and Fast Fourier
Transforms (FFTs) are better optimized when targeted on FPGAs [6,7]. Features like embedded hardware multipliers, an increased number of memory blocks and system-on-a-chip integration enable video applications in FPGAs that can outperform conventional DSP designs [2,8]. On the other hand, solutions to a number of imaging problems are more flexible when implemented in software rather than in hardware, especially when they are not computationally demanding or when they need to be executed only sporadically in the overall process. Moreover, some hardware components are hard to re-design and transfer onto an FPGA board from scratch when they are already a functional part of a computer-based system. Such components are frame grabbers and multiple-camera systems already installed as part of an imaging application or other robotic control equipment. Following the above considerations, we conclude that it is often necessary to integrate components from an already
installed computer-based imaging application dedicated to some automation system with FPGA-based accelerators that exploit the low-level parallelism inherent in hardware structures. Thus a critical need arises for an embedded software/hardware interface that allows high-bandwidth communication between the host application and the hardware accelerators.

In this paper we apply and evaluate the performance of an example mixed hardware/software design that includes on one side a host computer running a National Instruments (NI) LabVIEW imaging application, equipped with a camera and a frame grabber, and on the other side an Altera FPGA board [9] running an image filter hardware accelerator and other system components. The communication channel transferring image data from the host computer to the hardware board is a high-speed USB2.0 port, implemented by means of an embedded macrocell. The various hardware parts and peripherals on the FPGA board are controlled and interconnected by a Nios-II embedded soft processor. As a result of this evaluation one can explore the range of applications suitable for a host/co-processor architecture that includes an embedded Nios-II processor and utilizes a USB2.0 communication channel.

In the following, we first give a short account of the tools we used for system design. We also present an overview of the particular image filtering application we embedded in the FPGA chip for the evaluation of the host/co-processor system architecture. We describe the modular interconnection of the different system parts and assess the performance of the system. We examine the speed and frame-size limits of such a design when it is dedicated to image processing. Finally, we compare our mixed host/co-processor USB-based design with other architectures and other communication media.

2. Design tools overview

The design of a DSP system with FPGAs often utilizes both high-level algorithm development tools and hardware description language (HDL) tools. It can also make use of third-party intellectual property (IP) cores implementing typical DSP functions or high-speed communication protocols [1]. In our application we use model-based design tools, namely The MathWorks Simulink (based on MathWorks' MATLAB) with the libraries of Altera's DSP Builder. DSP Builder uses model design to produce and synthesize HDL code, which can then be integrated with other hardware design files within a synthesis tool, like the Quartus II development environment. In the present work, we designed image filter components using DSP Builder libraries and the resulting blocks were integrated with the rest of the system in Quartus' System-On-a-Programmable-Chip (SOPC) Builder.

SOPC Builder resides as a tool in the Quartus environment. Its purpose is to integrate an embedded software processor like Altera's Nios-II with hardware
logic and custom or standard peripherals within an overall system, often called a System-On-a-Programmable-Chip (SOPC). SOPC Builder provides an interface fabric that interconnects the Nios-II processing path with embedded and external memory, the filter co-processors, other peripherals and the channels of communication with the host computer. Nios-II applications were written in ANSI C and were compiled and downloaded to the FPGA board by means of Altera's Nios II Integrated Development Environment (IDE), a tool dedicated to assembling code for Nios processors. The purpose of the Nios-II application is to control processing and data streaming between the components of the system and its peripherals.

On the host side one may develop a control application in any suitable language, like C. We use LabVIEW software by National Instruments Corporation [10], which provides a very flexible platform for image acquisition, image processing and industrial control.

3. Modeling and implementation of the filter design

The main target of this work is to evaluate the performance of a host/co-processor architecture that includes an embedded Nios-II processor and utilizes a communication channel between host and hardware board, like a USB2.0 channel. The task logic performed by the embedded accelerator can be any image function within the limitations of existing FPGA devices. For our purpose we built a typical image-processing application to target the FPGA co-processor. It consists of a noise filter followed by an edge detector. Noise reduction and edge detection are two elementary processes required for most machine vision applications, like object recognition, medical imaging, lane detection in next-generation automotive technology, people tracking, control systems, etc.

We model noise and edge filtering using the Altera DSP Builder libraries in Simulink. An example of this procedure can be found in [11]. Noise reduction is applied with a Gaussian 3 × 3 kernel, while edge detection is designed using typical Prewitt or Sobel filters. These functions can be applied in series to achieve edge detection after noise reduction. The main block diagram of our filter accelerator is shown in Fig. 1. Apart from the noise and edge filter blocks, there is also a block representing the intermediate logic between the Nios-II data and control paths and our filter task logic. This intermediate hardware fabric follows a specific protocol referred to as the Avalon interface [12]. This interface cannot be modeled in the Simulink environment and is instead inserted in the system as a Verilog file. Design examples implementing the Avalon protocol can be found in Altera reference designs and technical reports [13]. In brief, our Avalon implementation consists of a 16-bit data input and output path, the appropriate Read and Write control signals and a control interface that allows for selection between the intermediate output from
Fig. 1. Main block diagram of the filter accelerator. Gauss and edge filters are connected with the rest of the system through the Avalon interface.
the Gauss filter or the output from the edge detector. Data input and output to and from the task logic blocks is implemented with the help of Read and Write instances of a 4800-byte FIFO register. Each image frame, when received by the hardware board, is loaded into an external SDRAM memory buffer and is converted into an appropriate 16-bit data stream by means of Nios-II instruction code. Data transfer between the external memory buffers and the Nios-II data bus is achieved through Direct Memory Access (DMA) operations controlled by appropriate instruction code for the Nios-II soft processor. Nios-II code flow for this system is discussed in Sections 5 and 6.
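To illustrate how such DMA-driven streaming is typically coded on a Nios-II system, the following minimal sketch uses the standard Nios II HAL DMA API. The device names ("/dev/dma_tx", "/dev/dma_rx"), the 8-bit frame format and the helper name stream_frame() are our own illustrative assumptions, not details taken from the paper.

```c
/* Minimal sketch of DMA-driven streaming through the filter on a
   Nios-II system, using the Nios II HAL DMA API. Device names and
   the 8-bit frame format are illustrative assumptions. */
#include <sys/alt_dma.h>

#define FRAME_BYTES (640 * 480)

static volatile int rx_done_flag = 0;

/* Called by the HAL when the filtered frame has been written back. */
static void rx_done(void *handle, void *data) { rx_done_flag = 1; }

int stream_frame(void *src_buf, void *dst_buf)
{
    alt_dma_txchan tx = alt_dma_txchan_open("/dev/dma_tx"); /* assumed name */
    alt_dma_rxchan rx = alt_dma_rxchan_open("/dev/dma_rx"); /* assumed name */
    if (!tx || !rx)
        return -1;

    rx_done_flag = 0;
    /* Post the receive first, then push the frame into the task logic. */
    if (alt_dma_rxchan_prepare(rx, dst_buf, FRAME_BYTES, rx_done, NULL) < 0)
        return -1;
    if (alt_dma_txchan_send(tx, src_buf, FRAME_BYTES, NULL, NULL) < 0)
        return -1;

    while (!rx_done_flag)   /* spin until the result lands in memory */
        ;
    return 0;
}
```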
Incoming pixels are processed by means of a simple 2D digital Finite Impulse Response (FIR) filter convolution kernel, working on the grayscale intensities of each pixel's neighbors in a 3 × 3 region. Image lines are buffered through delay lines, producing primitive 3 × 3 cells where the filter kernel applies. The line-buffering principle is shown in Fig. 2. A z^-1 delay block produces a neighboring pixel in the same scan line, while a z^-640 delay block produces the neighboring pixel in the previous image scan line. We assume an image size of 640 × 480 pixels. The line-buffer circuit is implemented in the same manner for both noise and edge filters. Frame resolution is incorporated in the line-buffer diagram as a hardware built-in parameter.
Fig. 2. Line buffering principle for a 2D FIR filter implemented for a 640 × 480 frame.
Fig. 3. Gauss filter calculations for the 3 × 3 kernel shown in the inset. In the lower matrix are the pixels that correspond to the matrix kernel, as produced by the line buffer in Fig. 2.
Fig. 4. Horizontal gradient calculations for a Prewitt kernel shown in the inset. In the lower matrix are the pixels that correspond to the matrix kernel, as produced by the line buffer in Fig. 2.
If a change in frame size is required, we need to re-design and re-compile. The number of delay blocks depends on the size of the convolution kernel, while the delay-line depth depends on the number of pixels in each line. Each incoming pixel is at the center of the mask and the line buffers produce the neighboring pixels in adjacent rows and columns. Delay lines with considerable depth are implemented as dedicated RAM blocks in the FPGA chip and do not consume logic elements. After line buffering, pipelined adders and embedded multipliers calculate the convolution result for each central pixel. Fig. 3 shows the model design for the implementation of the 3 × 3 Gauss kernel calculations. As shown in Fig. 3, model-based design transfers the necessary arithmetic into a parallel digital structure in a straightforward manner. Logic-consuming calculations, like multiplications, are implemented using dedicated multipliers available in medium-scale Altera FPGAs, like the Cyclone II chip.
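For reference, the following short C model reproduces in software what the line-buffered hardware computes per pixel. The 3 × 3 binomial coefficients (rows 1-2-1, sum 16) are a common choice for a Gaussian smoothing kernel and are our assumption, since the paper does not list the exact weights.

```c
/* Software reference model of the line-buffered 3x3 Gaussian smoothing
   stage. Kernel weights are an assumption (binomial 3x3, sum = 16). */
#include <stdint.h>

#define W 640
#define H 480

void gauss3x3(const uint8_t *in, uint8_t *out)
{
    static const int k[3][3] = { {1, 2, 1}, {2, 4, 2}, {1, 2, 1} };

    for (int y = 1; y < H - 1; y++) {
        for (int x = 1; x < W - 1; x++) {
            int acc = 0;
            /* The hardware obtains these 9 samples from two line buffers
               (z^-640 delays) and z^-1 registers; here we simply index. */
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    acc += k[dy + 1][dx + 1] * in[(y + dy) * W + (x + dx)];
            out[y * W + x] = (uint8_t)(acc >> 4);  /* divide by 16 */
        }
    }
}
```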
When the two filters work in combination, the output of the Gaussian kernel is input to a 3 × 3 Sobel or Prewitt edge-detection kernel. First, the kernel pixels are produced using a line-buffer block identical to the one in Fig. 2. The edge detector is composed of horizontal and vertical components. Fig. 4 shows the model design for the Prewitt kernel calculations of the horizontal image gradient. A similar implementation is used for the vertical gradient. For simplicity we combine horizontal and vertical edge-detection filtering by simply adding the corresponding magnitudes. A binary image is produced by thresholding the result by means of a comparator. These final steps are shown in Fig. 5. Fig. 6 shows an input and the successive outputs of the hardware co-processor for a 640 × 480 pixel image.
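A software sketch of this gradient-plus-threshold step, under the same assumptions as the Gaussian model above; the paper combines |Gx| + |Gy| and compares against a threshold (e.g. T = 70, as in Fig. 6).

```c
/* Reference model of the Prewitt edge stage: horizontal and vertical
   gradients, magnitudes added, then thresholded to a binary image. */
#include <stdint.h>
#include <stdlib.h>

#define W 640
#define H 480

void prewitt_edge(const uint8_t *in, uint8_t *out, int threshold)
{
    static const int kx[3][3] = { {-1, 0, 1}, {-1, 0, 1}, {-1, 0, 1} };
    static const int ky[3][3] = { {-1, -1, -1}, {0, 0, 0}, {1, 1, 1} };

    for (int y = 1; y < H - 1; y++) {
        for (int x = 1; x < W - 1; x++) {
            int gx = 0, gy = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int p = in[(y + dy) * W + (x + dx)];
                    gx += kx[dy + 1][dx + 1] * p;
                    gy += ky[dy + 1][dx + 1] * p;
                }
            /* |Gx| + |Gy| approximates the gradient magnitude, as in the
               hardware design (Fig. 5); the comparator yields a binary image */
            out[y * W + x] = (abs(gx) + abs(gy) > threshold) ? 255 : 0;
        }
    }
}
```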
Fig. 5. Total gradient calculation by combining vertical and horizontal Prewitt components. A binary image is produced by thresholding.
Fig. 6. Result of successive processing using a 3 × 3 Gauss kernel and a Prewitt edge filter, both implemented in hardware. From left to right: the original grayscale image (a), the output of the Gauss filter (b) and the final edge result (c). Threshold value for edge filtering is T = 70.
4. Embedded system design

The co-processor parts described above were implemented as components of an embedded system controlled by a Nios-II processor. The Nios-II soft CPU, which is used here for data-streaming control, is often the basis for industrial as well as academic projects. It can be used in its evaluation version along with the tools for assembling and downloading instruction code [14]. Once installed within the synthesis software, the Nios processor becomes integrated as a library component in Quartus' SOPC Builder tool. DSP Builder converts the model-based design into HDL code appropriate for integration with other hardware components. The filter is readily recognized by the synthesis software as a System-on-a-Programmable-Chip (SOPC) module and can be integrated within a Nios-II system with suitable hardware fabric. Other modules that are necessary for a complete system are the Nios-II soft processor, external memory controllers, DMA channels, and a custom IP peripheral for high-speed USB communication with the host. A VGA controller can be added in order to monitor the result on an external screen. Many such peripheral functions can be found as open-source custom HDL Intellectual Property (IP) or as evaluation cores provided by Altera or third-party companies.

USB 2.0 high-speed connectivity is added to the FPGA board by means of a daughter card by System Level Solutions (SLS) Corporation [15]. It can be added to any Altera board featuring a Santa Cruz peripheral connector. This daughter card provides an extension based on the CY7C68000 USB2.0 PHY transceiver. A USB2.0 IP core compliant with the USB Transceiver Macrocell Interface (UTMI) protocol allows integration of the USB function with the Nios-II system. We tested evaluation versions of the IP
core and present practical transmit and receive rates in Section 8. The FPGA chip along with the embedded Nios-II processor is always the slave device in the communication via the USB channel, while the host computer is always the master device. A block diagram of the hardware system implemented on the FPGA board is shown in Fig. 7. The channel to the host computer is also shown.

The embedded system is assembled by means of the SOPC Builder tool of the synthesis software, by selecting library components and defining their interconnection. After being generated by SOPC Builder, the system can be inserted as a block in a schematic file for synthesis and fitting. The only additional components that are necessary are PLLs for Nios and memory clocking. After we synthesize and simulate the design by means of the tools described in Section 2, we target a Cyclone II 2C35F672C6 FPGA chip incorporated on a development board manufactured by Altera Corporation. This board, along with the peripheral USB2.0 extension, is shown in the picture of Fig. 8. The board also features external memory and several typical peripheral circuits.

5. Nios programming

After device configuration, we assemble and download instruction code for the Nios-II processor, according to the specifications of Nios' Hardware Abstraction Layer
(HAL). The interested reader can find details in the reference manuals for Nios-II and its dedicated Integrated Development Environment (Nios-II IDE) [16]. Nios code is originally written in ANSI C and has a twofold purpose:

(a) It controls hardware operations, like DMA transfers between hardware units. It also offers a programming interface for handling data channels, with API commands like "open", "read", "write" and "close".
(b) It allows the system to perform simple software operations on the input data instead of using dedicated hardware stages for such processing. For example, Nios instruction code can be used to convert image arrays into appropriate one-dimensional data streams.

Nios instruction code is downloaded to on-chip memory. Our code for the particular design opens the USB port and waits to read the transmitted pixel array. It then controls pixel streaming from input to final output, as described in the following section.
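The Nios II HAL exposes peripherals as character devices with UNIX-style calls, so the open-and-read behavior described above can be sketched as follows. The device name "/dev/usb0", the helper name receive_frame() and the transfer size are illustrative assumptions.

```c
/* Sketch of the receive path in the Nios-II application: open the USB
   device and block until a full pixel array has arrived. Device name
   and frame size are assumptions for illustration. */
#include <unistd.h>
#include <fcntl.h>

#define FRAME_BYTES (640 * 480)

unsigned char frame[FRAME_BYTES];

int receive_frame(void)
{
    int fd = open("/dev/usb0", O_RDWR);   /* assumed device name */
    if (fd < 0)
        return -1;

    int got = 0;
    while (got < FRAME_BYTES) {           /* read may return partial data */
        int n = read(fd, frame + got, FRAME_BYTES - got);
        if (n < 0) { close(fd); return -1; }
        got += n;
    }
    close(fd);
    return 0;
}
```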
6. Activity flow

The flow of software and hardware activity that supports the operation of a system designed according to the mixed architecture detailed in this work can be outlined as follows (a minimal code sketch of the loop is given after the list):
Fig. 7. Basic hardware architecture. The FPGA system is shown in the dashed square.
Fig. 8. The development board (DSP Cyclone II Edition) with the USB2.0 extension card.
(a) An image stream is transferred from the host computer to the hardware board for processing through the high-speed USB2.0 communication channel. As described in the next section, the host application communicates with the USB port using Application Programming Interface (API) calls for data input and output.
(b) According to the Nios processor instructions, embedded DMA hardware operations transfer data from memory to the Nios data path and into the hardware task logic by means of the Avalon interface.
(c) The data stream is processed through the hardware accelerator.
(d) DMA operations stream the filtered output result from the hardware task logic back to external memory (see Fig. 7).
(e) The final result is output to the on-board VGA digital-to-analog channel, which is peripheral to the Nios-II processor and is supported by embedded DMA hardware transfers. However, a digital-to-analog converter for a VGA port is not always implemented on a development board, so a possible alternative is for the resulting binary image to be channeled back to the host computer via the USB connection for further processing or simply for display.
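A compact sketch of steps (a)–(e) as they might appear in the Nios-II main loop, reusing the hypothetical helpers from the earlier sketches (receive_frame, stream_frame); this is our illustration of the flow, not the authors' code.

```c
/* Illustrative Nios-II main loop for activity steps (a)-(e).
   receive_frame() and stream_frame() are the hypothetical helpers
   sketched in Sections 3 and 5. */
extern int receive_frame(void);
extern int stream_frame(void *src, void *dst);
extern unsigned char frame[];            /* filled by receive_frame() */
static unsigned char filtered[640 * 480];

int main(void)
{
    for (;;) {
        if (receive_frame() < 0)         /* (a) frame arrives over USB   */
            continue;
        /* (b)-(d) DMA the frame through the filter and back to memory */
        if (stream_frame(frame, filtered) < 0)
            continue;
        /* (e) hand the result to the VGA path, or send it back to the
           host over USB (not shown); both are DMA-backed transfers     */
    }
    return 0;
}
```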
The assessment of the design performance that is presented in Section 8 includes all the above activity steps. It is important to note that the above system is not merely a black-box custom design implemented for a particular application, but represents a design methodology that can be used for a wide range of custom applications. The technical details of the operations abstracted above are well documented for the user of the particular development platform, so that every aspect of the design can be tested for repeatability. A more detailed analysis of the system development techniques is, however, beyond the scope of the present article.

7. Host-based setup and application

On the host side, a vision system is implemented, appropriate for a spectrum of industrial applications. The host computer is a Windows XP Pentium IV machine featuring an on-board high-speed USB 2.0 controller and an NI 1408 PCI frame grabber. The host application is a LabVIEW virtual instrument (VI) that controls the frame grabber and performs initial processing of the captured images (see Fig. 9). The frame grabber can support up to five industrial cameras performing different tasks. In our system the VI application captures a full frame sized 640 × 480 pixels from a CCIR analog B&W CCD camera (Samsung BW2302). It may then reduce the image resolution, applying an image-extraction function, down to 320 × 240 pixels. It can produce even smaller frames in order to trade size for transfer rate.

The LabVIEW host application communicates with the USB interface using API functions and a Dynamic Link Library (DLL) and transmits a numeric array built from the captured frame. An advantage of using LabVIEW as a basis for developing the host application is that it includes a VISION library able to perform fast manipulation of image data, or pre-processing of the image if necessary. When the reception of the image array is completed at the hardware-board end, the system loads the image data into the filter co-processor and sends the output to the VGA controller via SRAM memory (see Fig. 7). Alternatively, the output can be sent back to the host application by means of a "write" Nios command. The procedure is repeated with the next captured frame.
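The paper does not list the DLL entry points the VI calls; a plausible, purely hypothetical C interface for such a wrapper might look like this (all names invented for illustration):

```c
/* Hypothetical C interface of the host-side DLL that the LabVIEW VI
   calls through its Call Library Function node. All names invented. */
#ifdef __cplusplus
extern "C" {
#endif

int  usb_dev_open(void);                       /* returns handle or < 0 */
int  usb_dev_send(int h, const unsigned char *pixels, int nbytes);
int  usb_dev_recv(int h, unsigned char *pixels, int nbytes);
void usb_dev_close(int h);

#ifdef __cplusplus
}
#endif
```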
8. Evaluation of the system performance

The above setup was tested with various capture rates and frame resolutions. Using several test versions of the SLS USB2.0 megafunction, we measured receive (rx) and transmit (tx) throughput between the host PC and the target hardware system. We use a payload of 307,200 bytes in both directions. We find that, using the Nios II HAL driver, the latest evaluation version 1.2 of the IP core transfers in high-speed operation 65 Mbits per second in receive mode and about 80 Mbps in transmit mode. In full-speed operation the transfer rate is 9 Mbps.

However, the data transfer rate from the host computer to the hardware board is only one factor that affects the performance of an image processing system designed according to a host/co-processor architecture, like the one studied here. There are also software issues to be taken into account, both at the host end and at the Nios-II embedded processor side. For example, frame capturing and serialization prior to transfer are factors that limit frame rate in video applications. On the other hand, the Nios-II embedded processor controls the data flow following instruction code downloaded to embedded memory, as described in Section 5. So the overall performance of the system depends on the fine-tuning of all these factors. The LabVIEW software allows for efficient handling of array structures and also possesses image-grabbing and vision tools that reduce processing time on the host side.

Besides the above software limitations, there are also hardware issues related to an integrated System-on-a-Programmable-Chip, like the time needed for Direct Memory Access (DMA) transactions between units. The performance of the hardware board is divided into the processing rates of the hardware filter co-processor and the performance of
Fig. 9. Front panel of example host application designed in LabVIEW environment for fast image processing and data exchange. The USB2.0 hardware is accessed by incorporating API calls.
the rest of the system, like external memory buffers and the interconnect fabric. This second factor adds an overhead depending on memory clocking and the structure of the interconnect units. We evaluate the performance of the proposed architecture taking into account, and measuring where possible, the following delay times:

(a) Time to grab an image frame and serialize it.
(b) Transfer time over the USB2.0 channel.
(c) Nominal time needed by the co-processor filter in order to process the image frame.
(d) Overhead time needed for data flow and control in the integrated hardware system.

Table 1 summarizes the response times of the above operations and reports frame rates for two typical frame sizes. As a whole, the system achieves a practical and stable video rate of 20 frames per second at an image resolution of 320 × 240 pixels. Similarly, larger frames with dimensions 640 × 480 pixels can be transferred and processed at a rate of approximately seven frames per second. When the board is clocked at 100 MHz the hardware image filter processes a 640 × 480 pixel frame in a minimum of 3.1 ms, while DMA transfers and other control flow add approximately 30 ms in order to transfer image data between units. The frame is transferred from the host computer to the hardware board in approximately 26 ms. Other possible latencies include proper data manipulation by the Nios-II instruction code and depend on the frame size.

In conclusion, delays are divided between software and hardware procedures on both sides of the system. The hardware filter co-processor does not substantially delay the overall system performance. Host and Nios-II processing time and transfer rates over the communication channel can be a bottleneck for large frames. Since open-core USB2.0 embedded technology is still developing, one may expect it soon to make even better use of the available bandwidth.

Table 2 summarizes the hardware requirements for the overall system we implemented in this study (see also Fig. 7). The table reports the number of logic elements and memory bits needed to implement the functions presented above in a medium FPGA chip, the Altera Cyclone II EP2C35F672C6.
Table 1
Summary of measured processing times and frame rates

Image frame size                                          320 × 240    640 × 480
Hw filter minimum processing time (clock 100 MHz)         0.77 ms      3.1 ms
Overhead time for data flow and control in SOPC system    15 ms        30 ms
Host processing delays                                    20 ms        70 ms
USB transfer rate                                         65 Mbps      65 Mbps
Practical frames per second (max)                         20           7
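As a rough consistency check of Table 1 (our arithmetic, treating the stages as sequential): a 320 × 240 frame is 76,800 bytes = 614,400 bits, so the USB transfer at 65 Mbps takes about 9.5 ms, and the total budget is roughly 20 + 15 + 9.5 + 0.77 ≈ 45 ms, i.e. about 22 frames per second, in line with the measured 20 fps. For 640 × 480 frames the transfer takes about 2,457,600 bits / 65 Mbps ≈ 38 ms, and the budget 70 + 30 + 38 + 3.1 ≈ 141 ms gives about 7 fps, matching the reported rate.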
Table 2
Hardware and clock requirements (Cyclone II)

Logic elements   Memory bits   Nios-II clk   DDR2 clk     USB2.0 clock
16,000           200,000       100 MHz       133.33 MHz   60 MHz
The full capacity of this chip is 33,000 logic elements and 480,000 memory bits. The table also reports clock frequencies for the soft processor and the external DDR2 memory. Clocks were implemented by means of two Phase-Locked Loops (PLLs).

9. Comparison with other systems

In the following, we present a comparison with other image processing solutions, in terms of performance and flexibility. In order to establish a numerical comparison between the presented architecture and a purely software solution, we performed a series of experiments. We synthesized designs with variations of the basic filter operations, introducing different degrees of computational complexity. We also implemented in hardware a Sum of Absolute Differences (SAD) algorithm for dense depth-map calculations, which is based on correlation operations and is much more computationally intensive than simple convolution. We compared the results with the same algorithms implemented in software and running on a Pentium IV processor at 3 GHz with 512 MB RAM. Typical results are presented in Fig. 10 and summarized in Table 3. The software results were obtained by programming the corresponding procedures analytically in NI's LabVIEW G language, using optimized library functions for array processing. We used pre-captured AVI video sequences and processed each frame with the same image functions as in the hardware version of the algorithm. Frame resolutions of 320 × 240 and 640 × 480 pixels were assessed separately, since they need different transfer times to our hardware co-processor.

Table 3 shows frame rates for processing video files of both resolutions as a function of computational complexity. In the first column of Table 3, the corresponding image processing operation is given. Computational complexity in the second column is measured as multiples of n × N, where n is the size of the convolution kernel, equal to 3 × 3, and N is the total of 320 × 240 image pixels. In the case of the SAD algorithm, complexity is taken to be the product n × N × D, where n is the size of the comparison window applied on an image of N pixels, for a total disparity range of D pixels [17]. Our hardware version of this algorithm, implemented as an FPGA co-processor, will be published in a future article. The third and fourth columns of Table 3 give the total frame rate for each operation when implemented in pure software and in our proposed architecture, respectively. Note that the frames-per-second (fps) value in the case of the SAD algorithm refers to a single transmitted frame instead of the stereo pair, in order to make the result comparable with the other values.
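To make the complexity units concrete (our arithmetic): for the Gauss filter at 320 × 240, n × N = 9 × 76,800 = 691,200 ≈ 0.7 × 10^6 operations, which is why the unit of the second column of Table 3 is 0.7 × 10^6 and the Gauss entry is 1. For SAD with n = 3 × 3 and D = 8, n × N × D = 9 × 76,800 × 8 ≈ 5.5 × 10^6, i.e. about 8 units, as listed.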
Fig. 10. Frame-rate comparison between software implementations of a variety of complex image functions (lower curve, rhombs) and mixed software/hardware implementations of equivalent complexity (upper curves). Results are given as a function of computational complexity (see Table 3). The left figure corresponds to frame resolution 320 × 240, while the right figure corresponds to resolution 640 × 480 pixels.
Table 3
Frame-rate comparison between PC-based implementations and host/co-processor designs for a number of image processing operations

Frame resolution 320 × 240
Operation                              Complexity       Pure software   Mixed design            Logic
                                       (× 0.7 × 10^6)   (fps)           (host/co-processor)     elements
                                                                        (fps)
Gauss filter (n × N), n = 3 × 3        1                15              20                      14,500
Edge (2 × n × N), n = 3 × 3            2                6               19.5                    14,800
Gauss + edge (3 × n × N), n = 3 × 3    3                4.5             19.5                    15,900
SAD (n × N × D), n = 3 × 3, D = 8      8                1.7             18                      13,000
SAD n = 3 × 3, D = 16                  16               1               17.5                    16,000
SAD n = 3 × 3, D = 32                  32               0.43            17.5                    23,000
SAD n = 5 × 5, D = 32                  89               0.3             17                      30,000

Frame resolution 640 × 480
Gauss filter (n × N), n = 3 × 3        4                6               7.2                     14,500
Edge (2 × n × N), n = 3 × 3            8                2               7.3                     14,900
Gauss + edge (3 × n × N), n = 3 × 3    12               1.17            7.2                     16,000
SAD (n × N × D), n = 3 × 3, D = 8      32               0.63            7                       13,200
SAD n = 3 × 3, D = 16                  64               0.33            7                       16,000
SAD n = 3 × 3, D = 32                  128              0.16            7                       23,500

Resource usage is also shown.
For both stereo images, the whole cycle of transmitting, processing and projecting on a monitor screen runs at 14 fps (resolution 320 × 240) when working with the current version of the SLS USB2.0 megafunction. The last column of Table 3 gives the total logic elements required for mapping each algorithm on a Cyclone II 2C35F672 Altera FPGA. The required hardware resources increase with an increasing number of stages in the hardware pipeline. However, resources do not necessarily increase with increasing computational complexity. The reason is that the same hardware structure can in principle be adequate for both small and large frames, since the same parallel computations are needed per pixel and per clock cycle as pixels stream into the system pipeline. Frames of increased resolution only need deeper line buffers, as shown in Fig. 2. Hence, smaller frames need fewer dedicated memory bits than larger frames. On-chip memory requirements for the implementations of Table 3 vary between 185,000 and 230,000 memory bits, out of a total of 480,000. The hardware implementations in Table 3 require a nominal processing time of 0.77 ms for every 320 × 240 frame and 3.1 ms for every 640 × 480 frame (see also Table 1). SAD needs twice that time in order to process a stereo pair. Additional times for data transfer and control are as discussed in Section 8.

Fig. 10 presents the above results in diagram form. In Fig. 10a the processing results for a video file with frame resolution 320 × 240 are presented. Our hardware/software co-design results are shown with triangles. In Fig. 10b our hardware processing results are shown with squares and refer to a frame resolution of 640 × 480. The rhombs at the lower part of both diagrams represent frame rates achieved by pure software running on the host computer. It can be seen that the performance of the host PC without the hardware co-processor is good in the case of simple processes but, as expected, it falls rapidly with increasing computational complexity. The performance of our hardware/software mixed architecture is almost constant as complexity increases, because in all cases processing is completed in a single pass, at clock rate. A slight decrease is due to pre-processing steps, according to the type of operation.
However, the performance of our system does depend on frame size, as shown by the corresponding curves in Fig. 10a and b. The reduction of frame rate in the case of increased resolution is clearly attributable to the increased payload during USB transfers and DMA transactions from board memory to the FPGA system. By better optimizing the software algorithms or by adopting faster processors in the future, the performance of the software system can be enhanced; however, the gap will persist with increasing computational demands.

The bottom line of the above analysis is that the suitability of the proposed mixed architecture is application-dependent. It is best justified in the case of algorithms with increased computational complexity, where the PC alone cannot respond at a reasonable frame rate. Keeping a moderate frame resolution helps our design respond at a decent rate, which however is still less than true video rate. Depending on the system requirements, other approaches may also be used. Using a video input card like the Altera DC-Video-TVP5146N, one can input video data to the FPGA and perform a range of image functions, like simple filtering, color processing, recasting to different video formats, compression, etc. [18]. In this case one can bypass the computer-based frame grabber and avoid the use of a host PC. This has the principal cost of losing the flexibility inherent in the software part of the system, as well as losing the PC-based control over the cameras and frame grabber, but the result is an all-hardware system that excels in performance. The processing rate is in principle limited by the frame grabber, which for an NTSC composite input signal is 30 fps, while for PAL video input it is 25 fps.

Similarly, processing at video rate may be obtained using the ASIC approach, which usually incorporates all parts in an all-hardware custom system. Image processing implementations of medium and high complexity appear in the literature, and most of them can manipulate images in real time [19,20]. They can reach a frame rate of 30 frames per second, or even more in some custom systems. In some cases they can also incorporate a CCD interface [21]. However, ASIC implementations are complex and expensive systems that suffer in terms of flexibility. They have a slow design cycle and are certainly far from the plug-and-play approach adopted in the present article.

Pure DSP-based designs support thousands of MIPS and are comparable in performance with desktop PCs, even running at slower clock speeds. They also consume much less power. Being purely software-based, they are much more flexible than hardware implementations. In addition, the control-flow part of an algorithm is better mapped to software than to hardware. Moreover, they can easily incorporate video input and output channels with little additional hardware [22]. However, just as is the case with ordinary serial computers, they cannot perform computationally demanding tasks at a high video rate [23].
The use of FPGAs in hardware implementations can partly bridge the flexibility gap, because FPGAs are re-programmable and have a rapid design cycle. Compared with DSPs, systems implemented using FPGAs are more efficient, especially for data-flow algorithms with minimal control processing. However, building with FPGAs, we also have to implement protocols and interfaces for video I/O in hardware. In the host/co-processor architecture applied in our design, the FPGA absorbs the sections of the algorithm that are costly in software, while Nios-II executes flow control. Although Nios is a medium-performance processor, it can greatly help by implementing the control path, as well as simple pre-processing and post-processing steps. In our system, an application can be further partitioned between the host computer and the hardware board. The host part interfaces with a frame grabber and controls peripheral equipment. It can also execute parts of the algorithm that are software-friendly. The hardware task logic operates as an accelerator of any computationally demanding function and can be added to the SOPC system as a library component. It can also be re-used, shared and imported in any SOPC system that follows the same design platform. These characteristics testify to the high level of flexibility inherent in this design.

As the discussion above shows, the presented system performs better than a general-purpose computer or a pure DSP design in the case of heavy computational tasks, since the hardware task logic is very fast. It is less efficient than a pure hardware system, since such systems can perform at 30 frames per second or more. However, it is more flexible and has a clear potential to evolve. USB2.0 transfer rates will certainly be enhanced in future versions. In addition, this particular IP core complies with Altera's OpenCore Plus program [24], which offers free evaluation and simulation of a megafunction. Finally, the proliferation of USB2.0 communication ports in modern computers and video equipment makes the proposed channel a suitable commodity choice when building a hardware/software architecture based on an FPGA co-processor.

10. Conclusions

The design of a general hardware/software system based on a host computer and an FPGA co-processor was presented in this article. The system performance was studied in the case of image processing algorithms and represents a design methodology that can be used for a wide range of custom applications. It is based on a hardware board featuring a medium FPGA device, which is configured as a System-On-a-Programmable-Chip, with a Nios-II soft processor in the system control path. Other hardware components are local and on-chip memory and a UTMI-compliant macrocell by SLS Corporation, allowing fast communication with a host computer.

An integrated system controlled by a Nios-II soft processor provides significant flexibility for the transferring
and processing of image data. An optimum choice of hardware/software features may accelerate the system performance. Partitioning a machine vision application between a host computer and a hardware co-processor may solve a number of problems and can be appealing in an academic or industrial environment where compactness and portability of the system are not of primary importance. The processing and transfer rates reported in Tables 1 and 3 can be sufficient for a spectrum of academic or industrial vision applications. Obviously, the system is best justified in the case of very demanding image processing computations that fail to perform at a reasonable rate on a personal computer. Such tasks are point-pattern matching for biometric applications [25] or block matching for finding correspondences between the images of a stereo pair [17]. The latter application usually depends on many cameras and has significant computational demands. Cameras and frame grabbers are better controlled by a computer with custom software, since image grabbing and transmission protocols are not easily transferred into hardware designs. On the other hand, the real-time calculation of disparities from two or more cameras is better performed by hardware [26–28].

Alternatively, the host/co-processor Nios-II based architecture can be used for implementing and testing in hardware a variety of image-processing academic designs, given its modular flexibility and potential for using open-source code. It can also be used to manage image processing in a number of educational applications and student machine-vision exercises [29]. Future work can include more tests of complex algorithms, or comparisons of the presented architecture with a future high-throughput Ethernet channel or a PCI channel for host/co-processor communication. It is also certainly meaningful to work towards better optimization of the USB2.0 communication channel. Next-generation USB2.0 macrocells and Nios soft processors can further increase the range of applications of the proposed design.

Acknowledgements

This work was conducted in communication with the support team of System Level Solutions Corporation, who provided successive evaluation versions of the USB2.0 IP core for testing. We would especially like to thank software manager Tejas Vaghela for his constant support and advice.

References

[1] W.J. MacLean, An evaluation of the suitability of FPGAs for embedded vision systems, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 3, San Diego, California, USA, June 2005, p. 131.
[2] R.J. Petersen, B.L. Hutchings, An assessment of the suitability of FPGA-based systems for use in digital signal processing, in: 5th International Workshop on Field-Programmable Logic and Applications, Oxford, England, August 1995, pp. 293–302.
[3] B.A. Draper, J.R. Beveridge, A.P.W. Bohm, Ch. Ross, M. Chawathe, Accelerated image processing on FPGAs, IEEE Transactions on Image Processing 12 (12) (2003).
[4] M.A. Vega-Rodriguez, J.M. Sanchez-Perez, J.A. Gomez-Pulido, Real time image processing with reconfigurable hardware, in: 8th IEEE International Conference on Electronics, Circuits and Systems (ICECS2001), vol. 1, Malta, September 2001, pp. 213–216.
[5] M. Leeser, S. Miller, H. Yu, Smart camera based on reconfigurable hardware enables diverse real time applications, in: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM2004), Napa, CA, USA, 2004, pp. 147–155.
[6] I.S. Uzun, A. Amira, A. Bouridane, FPGA implementations of fast Fourier transforms for real time signal and image processing, IEE Proceedings—Vision, Image and Signal Processing 152 (3) (2005) 283–296.
[7] N. Shirazi, P.M. Athanas, A.L. Abbott, Implementation of a 2D fast Fourier transform on an FPGA-based custom computing machine, in: Proceedings of the 5th International Workshop on Field-Programmable Logic and Applications, vol. 975 of Lecture Notes in Computer Science, Oxford, UK, August–September 1995, pp. 282–292.
[8] M. Rogers, M. Won, A. Soohoo, Altera FPGA co-processors accelerate the performance of 3-D stereo image processing, Altera Corporation, News & Views, Spring/Summer 2005.
[9] Altera Corporation home page: <http://www.altera.com>.
[10] National Instruments Corporation home page: <http://www.ni.com>.
[11] Edge Detection Using SOPC Builder and DSP Builder Tool Flow, Altera Technical Paper AN-377-1.0, May 2005.
[12] Avalon Interface Specification, Altera Technical Paper MNLAVABUSREF-3.1, 2005.
[13] Quartus II Handbook, vol. 4, Chapter 9, Altera Technical Paper QII54007-5.1.0, 2005.
[14] Nios II Hardware Development, Altera Technical Paper TUN2HWDV-2.0, 2005.
[15] System Level Solutions Corporation home page: <http://www.slscorp.com>.
[16] Nios II Software Developer's Handbook, NII5V2-5.0, Altera, May 2005.
[17] M.Z. Brown, D. Burschka, G.D. Hager, Advances in computational stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (8) (2003) 993–1008.
[18] Video and Image Processing Design Using FPGAs, Altera White Paper WP-VIDEO-0306-1.1, March 2007.
[19] F.M. Alzahrani, T. Chen, A real-time edge detector: algorithm and VLSI architecture, Real-Time Imaging 3 (1997) 363–378.
[20] I. Andreadis, K. Stavroglou, Ph. Tsalidis, Design and implementation of an ASIC for real-time manipulation of digital colour images, Microprocessors and Microsystems 19 (5) (1995) 247–253.
[21] D. Doswald, J. Halfiger, O. Blessing, N. Felber, P. Niederer, W. Fichtner, A 30-frames/s megapixel real-time CMOS image processor, IEEE Journal of Solid-State Circuits 35 (11) (2000) 1732–1743.
[22] I.Y. Soon, C.K. Yeo, H.C. Ng, An analogue video interface for general-purpose DSP, Microprocessors and Microsystems 25 (2001) 33–39.
[23] L. Tenze, S. Carrato, S. Olivieri, Design and real-time implementation of a low-cost noise reduction system for video applications, Signal Processing 84 (2004) 453–466.
[24] OpenCore Plus Evaluation of Megafunctions, Altera Technical Paper AN-320-1.3, 2005.
[25] N.K. Ratha, A.K. Jain, Computer vision algorithms on reconfigurable logic arrays, IEEE Transactions on Parallel and Distributed Systems 10 (1) (1999) 29–43.
[26] A. Darabiha, J. Rose, W.J. MacLean, Video-rate stereo depth measurement on programmable hardware, in: Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision & Pattern Recognition, vol. 1, Madison, WI, USA, June 2003, pp. 203–210.
[27] T. Kanade, A. Yoshida, K. Oda, H. Kano, M. Tanaka, A stereo machine for video-rate dense depth mapping and its new applications, in: Proceedings of the 15th IEEE Computer Vision and Pattern Recognition Conference (ICVPR’96), San Francisco, June 1996, pp. 196–202. [28] J. Diaz, E. Ros, S. Mota, E. Ortigosa, B. Del Pino, High performance stereo computation architecture, International Conference on Field Programmable Logic and Applications, Tampere, Finland, August 2005, pp. 463–468. [29] T.S. Hall, A framework for teaching real-time digital signal processing with field-programmable gate-arrays, IEEE Transactions on Education 48 (3) (2005) 551–558.
John A. Kalomiros received his degree in Physics from the Aristotle University of Thessaloniki in 1984 and a Masters degree in Electronics in 1987. His doctoral thesis was on the characterization of materials for micro-electronic devices. His recent research interests include microcontrollers and digital systems design, with applications in robotics. He is also working on systems for digital measurements and instrumentation. He teaches electronics and related subjects at the Technological and Educational Institute of Serres, Greece.
John N. Lygouras was born in Kozani, Greece in May 1955. He received the Diploma degree and the Ph.D. in Electrical Engineering from the Democritus University of Thrace, Greece in 1982 and 1990, respectively, both with honors. From 1982 he was a research assistant, and since 2000 he has been an Associate Professor at the Democritus University of Thrace, Department of Electrical and Computer Engineering. In 1997 he spent six months at the University of Liverpool, Department of Electrical Engineering and Electronics, as an Honorary Senior Research Fellow. His research interests are in the field of robotic manipulator trajectory planning and execution. His interests also include research on analog and digital electronic systems implementation and the position control of underwater remotely operated vehicles.