Sensors and Actuators A 172 (2011) 552–560
Low power FPGA-based image processing core for wireless capsule endoscopy

Pawel Turcza a,c,∗, Mariusz Duplaga b,c

a AGH University of Science and Technology, al. A. Mickiewicza 30, 30-059 Krakow, Poland
b Jagiellonian University Medical College, Grzegorzecka Str. 20, 31-531 Krakow, Poland
c Center of Innovation, Technology Transfer and University Development, Jagiellonian University, Czapskich Str. 4, 31-110 Krakow, Poland

∗ Corresponding author at: AGH University of Science and Technology, al. A. Mickiewicza 30, 30-059 Krakow, Poland. Tel.: +48 12 6173972. E-mail address: [email protected] (P. Turcza).

doi:10.1016/j.sna.2011.09.026
Article history: Received 27 March 2011; received in revised form 6 August 2011; accepted 22 September 2011; available online 29 September 2011.

Keywords: Image compression; Wireless capsule endoscopy; FPGA
Abstract

The paper presents an FPGA-based image and data processing core for future generation wireless capsule endoscopy (WCE). The main part of the presented core is an image compressor, for which a hardware implementation architecture requiring only two clock cycles for processing a single image pixel is proposed. Apart from the image compressor, the presented core includes a camera interface, a FIFO queue storing the compressed image bitstream, a forward error correction encoder protecting transmitted data against random and burst transmission errors, and a system controller supervising internal WCE operations. The presented core has been implemented in a single ultra low power, 65 nm FPGA chip. The power consumption of the designed FPGA core was determined to be comparable to that of other ASIC-based WCE systems.
1. Introduction

Until recently, flexible endoscopy was one of the main methods available for diagnostic assessment of the upper (esophagogastroscopy) and lower (colonoscopy, rectoscopy) parts of the gastrointestinal (GI) tract. The combined efforts of interdisciplinary research and development teams, building on recent achievements in microelectronics, yielded a new type of endoscopic device called the wireless capsule endoscope (WCE). It enables the evaluation of the entire GI tract in a single passage and with minimum distress to the patient. The first WCE was made by Given Imaging Ltd. [1]. It is equipped with a tiny camera with LED-based illumination, a data processing unit, a small battery, and a radio transmitter. The capsule moves through the GI tract due to peristalsis and acquires images which are wirelessly transmitted to a recorder worn by the patient. At present, research efforts are focused on the development of future generation WCE, which could be manipulated remotely in real time [2–4]. Accurate remote manipulation of a WCE requires transmission of images with a high frame rate, good quality and resolution. However, a single image with VGA (640 × 480) resolution and a Bayer pattern amounts to 2.45 Mb, while the WCE radio transmitter can reach only 2 Mb/s [5,6] due to low power operation requirements and the considerable attenuation of radio waves in the human body. Therefore transmission of images with a frame
rate appropriate for real-time remote WCE manipulation requires efficient image compression to reduce the amount of transmitted data. Depending on the required image resolution, quality and frame rate, the compression factor should be in the 5 to 20 range, which is only possible when lossy compression is applied. Due to space limitations, the capsule endoscope uses a single-chip image sensor with a Bayer mosaic color filter array (CFA). Because of the filtering, each pixel in the CFA image represents the intensity of only one of the three basic colors. Recovery of a full-color image from CFA data requires color interpolation, which is a complex, resource-demanding operation, and as such cannot be performed inside the capsule. Standard image compression algorithms, including JPEG, JPEG2000 or MPEG, cannot be used to compress CFA data directly. Moreover, their hardware complexity and related power consumption are too high to enable their application in power-constrained devices such as capsule endoscopes [7]. Therefore dedicated, low-power, low-complexity yet efficient image compression solutions [7–10] have been proposed for the WCE application.

The paper presents a fully integrated, dedicated, FPGA-based image processing system for future generation WCE. Its main part is an image compressor operating on CFA data directly. It is based on the algorithm of [9], for which a new, improved hardware architecture is proposed. In this architecture the color space transformation is performed immediately after pixel order conversion (see Section 4.1). Such an approach offers two advantages: (1) the color space transformation does not require a dedicated memory block, and (2) the memory used by the pixel order converter stores raw image pixels instead of transformed ones, so its size is reduced. The second main improvement is the pixel order converter itself. In the proposed solution it requires only a single memory module.
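As a back-of-the-envelope check (ours, combining the QVGA resolution, the 24 fps target and the 2 Mb/s link quoted in this paper):

$$\frac{320 \times 240 \times 8\ \text{bit/frame} \times 24\ \text{frame/s}}{CR} \leq 2\ \text{Mb/s} \quad\Rightarrow\quad CR \geq \frac{14.7\ \text{Mb/s}}{2\ \text{Mb/s}} \approx 7.4,$$

which falls inside the 5 to 20 range stated above and is satisfied by the average compression ratio of about 10 reported in Section 5.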
Fig. 1. A simplified block diagram of image processing in a wireless endoscopy capsule.
Fig. 2. An RS frame series representing the bitstream resulting from compression of a single image frame.
The memory address generator unit has also been simplified. Apart from the image compressor, the presented system includes the following new modules: a camera interface, a FIFO storing the compressed image bitstream, a bit-serial Reed–Solomon encoder protecting the transmitted bitstream against errors, and a system controller supervising the operation of all of the core modules. The proposed system has been implemented in a single ultra low power 65 nm FPGA manufactured by SiliconBlue, a start-up company. The implemented system has been successfully validated. Due to the very low clock frequency, the power consumption of the digital core part is just 9.6 mW. This value is comparable to that achieved by modern ASIC designs [8,10].

The paper is organized as follows. Section 2 presents an overview of the proposed WCE system. The camera interface is presented in Section 3. In Section 4, the image compression algorithm and its efficient low-power hardware architecture are discussed in detail. The performance of the image compression algorithm and the implementation results are shown in Sections 5 and 6, respectively. Finally, the conclusions are presented in Section 7.

2. System architecture

A simplified block diagram of the image and data processing system for the future generation WCE is presented in Fig. 1. It consists of a QVGA (320 × 240) CMOS image sensor [11] with a color filter array, an FPGA chip implementing all image and data processing blocks, a radio data transmitter, and a PC-based image decoder. The image captured by the CMOS sensor is sent to the FPGA chip serially via the vdata line using a low-voltage differential double data rate (LVDS-DDR) interface. The frequency of the DDR clock signal (vclk) is 48 MHz, which means that the interface speed is 96 Mb/s and the entire image is read out in less than 10 ms. Such a fast readout is necessary to minimize charge leakage in the CMOS sensor. Serially received image data is converted to parallel form using a DDR deserializer and compressed using the image compressor presented in Section 4. Due to the very fast image readout, the compressor output bitstream rate is too high for direct transmission, therefore the bitstream is stored temporarily in the FIFO buffer. The system controller takes care of the image compression process. It observes the empty/full FIFO flags and appropriately adjusts the quantization factor to maximize the image quality without overflowing the FIFO.

Compressed image data is highly error-prone. In order to protect it against transmission errors, Reed–Solomon (RS) forward error correction (FEC) encoding [12] is applied. First, the transmitted bitstream is divided into 223 byte long data frames (see Fig. 2). Next, a 32 byte checksum is appended to each data frame. This enables correction of up to 16 erroneously transmitted bytes in each 255 byte long RS frame. The Reed–Solomon decoder requires precise frame synchronization. It is established by the decoder based on a unique marker (SOF – start of frame), which is placed at the beginning of the first RS frame of each image frame. A part of the last RS frame of each image frame is used to carry non-image data such as pH, pressure or temperature measurements. To assure system testability in the development stage and later during the production stage, a special virtual camera (VC) module has been implemented. It produces signals compatible with those from the DDR deserializer, based on data provided to the system controller. The system controller is also able to switch remotely between the real and virtual cameras and to change compressor and image sensor parameters.
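The framing just described can be sketched in a few lines of Python. This is an illustration only: it relies on the third-party reedsolo package for the RS(255, 223) arithmetic, and the SOF marker value and the exact placement of the non-image data are our assumptions, as the paper does not specify them.

    from reedsolo import RSCodec

    RS_DATA, RS_PARITY = 223, 32            # RS(255, 223): corrects up to 16 bytes
    SOF = bytes([0x5A, 0xA5, 0x5A, 0xA5])   # hypothetical start-of-frame marker
    rsc = RSCodec(RS_PARITY)

    def encode_image_frame(bitstream: bytes, sensor_data: bytes = b"") -> list:
        """Split one compressed image bitstream into 255-byte RS frames;
        non-image data (pH, pressure, temperature) rides in the last frame."""
        payload = SOF + bitstream + sensor_data
        frames = []
        for i in range(0, len(payload), RS_DATA):
            chunk = payload[i:i + RS_DATA].ljust(RS_DATA, b"\x00")
            frames.append(bytes(rsc.encode(chunk)))   # 223 data + 32 parity bytes
        return frames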
3. Camera interface

A serial DDR interface is frequently used to reduce the number of pads and PCB traces, which in turn saves silicon chip area and simplifies the PCB design. Since DDR uses both the rising and falling clock signal edges to transfer data (Fig. 3), it offers double the data throughput of a single data rate (SDR) interface at the same clock frequency.

Fig. 3. DDR signalling.

Fig. 4. A standard PLL-based DDR deserializer.

The standard DDR receiver (Fig. 4) samples the incoming data using a clock signal (clk2) generated by a PLL running at twice the frequency of the interface clock (vclk). The drawback of this approach is the need for a PLL, which as a mixed analogue-digital
circuit is difficult to design and requires a considerable amount of power. Moreover, a PLL is not present in the iCE65L08-CC72, the FPGA chip employed in the presented system. Therefore a new DDR deserializer (Fig. 5) using the vclk signal directly has been developed for the presented system. The proposed deserializer uses two separate D flip-flops to sample incoming data on the two opposite clock edges and two data shift registers to deserialize the odd and even data samples separately. The proposed interface runs at half the frequency of the standard deserializer, which lowers its power consumption.

Fig. 5. The proposed PLL-less DDR deserializer.

The comparators and state machine block in both circuits analyzes the incoming video stream (Fig. 6), searching for the start/stop markers of the image frame (SOF) and row (SOR), as well as delineating pixel data. When the SOF or SOR marker is recognized, the vsync or hsync signal, respectively, is asserted. A valid pixel on the pxi bus is signalled by the active state of the vpxi line.
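A behavioral model helps clarify the even/odd sampling idea. The following Python sketch mimics Fig. 5 at the level of bits; it illustrates the principle under an assumed MSB-first bit ordering and is not a description of the actual RTL:

    def ddr_deserialize(samples, width=8):
        """samples: (rising_bit, falling_bit) pairs, one pair per vclk cycle,
        most significant bits first. Returns the deserialized data words."""
        even_sr = odd_sr = 0
        words, half = [], width // 2
        for n, (rise, fall) in enumerate(samples):
            even_sr = ((even_sr << 1) | rise) & (2**half - 1)  # rising-edge shift register
            odd_sr = ((odd_sr << 1) | fall) & (2**half - 1)    # falling-edge shift register
            if (n + 1) % half == 0:            # a full word is ready every width/2 cycles
                word = 0
                for k in range(half - 1, -1, -1):   # re-interleave the two bit streams
                    word = (word << 2) | (((even_sr >> k) & 1) << 1) | ((odd_sr >> k) & 1)
                words.append(word)
                even_sr = odd_sr = 0
        return words

    # 0xA7 = 10100111b arrives as rising bits 1,1,0,1 and falling bits 0,0,1,1
    assert ddr_deserialize([(1, 0), (1, 0), (0, 1), (1, 1)]) == [0xA7]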
4. Image compression

The miniaturized digital camera inside the WCE employs a CMOS image sensor with a color filter array to produce a color image. The CFA (Fig. 1) uses a 2 × 2 repeating pattern with two green, one red and one blue pixel. Recovery of the full-color image based on CFA data requires color interpolation. However, the most straightforward bilinear interpolation does not produce satisfactory results [13], while more advanced interpolation methods [14] are too complex to be integrated inside a WCE. Moreover, the interpolation step introduces redundancy (it triples the amount of data) that is difficult to remove in the following compression steps. Therefore there are two different approaches to compressing CFA data: the conventional and the alternative (Fig. 7). In the conventional approach the CFA image is interpolated first (demosaicing) to recover the missing color information and then compressed using traditional methods such as JPEG or MPEG. In the alternative approach the order of the interpolation and compression steps is reversed. Such an approach has two important advantages over the conventional one: (1) demosaicing is done on the decompressor side, where there is enough processing power to use one of the most advanced algorithms, and (2) the amount of data to be compressed does not increase because the interpolation step is omitted inside the capsule. Because of space and power limitations, image and data processing for WCE must be implemented in a single low-power integrated circuit (ASIC or FPGA). This means that the only feasible approach to image compression in WCE is the alternative one.

Fig. 7. Two approaches to CFA image compression: (a) conventional: demosaicing first; (b) alternative: compression first.
As has already been pointed out in the introduction, only lossy compression algorithms can guarantee the required compression ratio. Although lossy compression of medical data should be avoided whenever possible, it seems to be the best choice in WCE. The reasons are as follows: (1) Image acquisition using a CFA sensor results in the collection of incomplete information, with only one third of the intensities recorded. Recovery of the full-color image from CFA data requires color interpolation, therefore the application of a CFA sensor is itself a form of lossy compression. As pointed out in [13], reconstruction errors due to interpolation are comparable to the errors introduced by compression (see Table 1). (2) Image sensors produce a certain amount of noise, therefore the raw image undergoes some preprocessing prior to being displayed. Image compression algorithms reduce the amount of data to be transmitted by discarding some details; this is done by 2D approximation of the image with smooth functions, which under certain conditions leads to noise reduction. (3) Given the transmission bandwidth limitations in WCE, there are two alternatives: (a) to lose some image details due to compression, or (b) to avoid compression and decrease the frame rate. However, accurate remote WCE manipulation requires low-delay, high frame rate image transmission, while the assessment of pathological lesions occurring in the GI tract requires images with high resolution and quality. With the advent of super-resolution techniques [15,16], which use a set of sub-pixel shifted low-resolution images of the same scene to create a higher resolution image, a reconciliation of these diverse goals is possible.

A block diagram of the discussed image coder is presented in Fig. 8. It involves several processing steps. In the first step the image to be compressed is divided into separate N × N pixel blocks (Section 4.1). This operation is followed by a color space transformation (Section 4.2). In the next step each N × N pixel block undergoes a linear transformation (Section 4.3). The resulting coefficients are quantized and, in the last processing step, efficiently encoded using an appropriate entropy encoder (Section 4.4).

Fig. 6. The CMOS imager transmission protocol: SOF/EOF – start/end of frame markers, SOR/EOR – start/end of row markers.

4.1. Conversion of a progressive scan to block-wise order
Due to internal design constraints, most of the available CMOS video sensors offer progressive scan readout only. Since computation of the 2D transform requires block-wise image access, special hardware with a sufficiently large buffer is necessary to convert the progressive scan order to a block-wise one.
Table 1
Compression results for six images (a)–(f) (QVGA resolution) from different parts of the GI tract. CR is the compression ratio, defined as the ratio of the size of the CFA image to the length of the resulting bitstream.

Algorithm                Metric           (a)     (b)     (c)     (d)     (e)     (f)
Bilinear interpolation   RGB PSNR (dB)    30.84   32.34   30.58   34.16   35.36   34.64
JPEG                     CFA PSNR (dB)    34.49   34.87   36.59   37.0    38.53   37.92
                         RGB PSNR (dB)    34.10   34.36   35.92   36.72   38.15   37.56
                         CR               8.16    10.44   10.42   9.72    12.93   13.78
JPEG 2000                CFA PSNR (dB)    35.59   35.98   38.12   35.52   39.61   39.54
                         RGB PSNR (dB)    33.33   33.87   35.17   37.16   38.36   37.70
                         CR               8.06    10.39   10.48   9.67    12.91   13.71
Proposed                 CFA PSNR (dB)    34.96   35.19   36.14   35.50   37.10   37.18
                         RGB PSNR (dB)    33.91   34.36   35.41   35.13   36.61   36.72
                         CR               8.06    10.39   10.48   9.67    12.91   13.71
Fig. 8. Block diagram of a transform-based image coder.
The principle of operation and the possibility of an efficient implementation of such a module are analyzed based on the example presented in Fig. 9. In the example it is assumed that the transform dimensions N × N are 3 × 3, and that C = 12 is the number of pixels in a single image line. The 2D buffer, essential in the operation, is addressed with A = rC + c, where r and c are the row and column numbers, respectively. Incoming pixels are represented by natural numbers: 0, 1, 2, 3, . . . Initially the empty 2D buffer is filled with the incoming pixels row by row. Once the buffer is full, it is read out column-wise, i.e. 0, 12, 24, 1, . . ., which results in block-wise pixel ordering. Completion of the readout process ends the current conversion and makes it possible to start a new one.

Fig. 9. Example contents of a 2D image buffer of the progressive to block-wise order conversion module (N = 3, C = 12).

The requirement to sequentially write and read the entire 2D buffer in the approach presented above forces the video sensor readout circuit and the compressor module to operate in an alternating mode. Although it is simple, such an alternating workflow organization has one important drawback: usually, the image sensor readout cannot be interrupted. In order to make the image compressor operation independent from the image sensor readout, two buffers are necessary. An appropriate implementation is presented in Fig. 10(a). In addition to the two buffers, it includes two independent address generation units and two data multiplexers. The address generation units work according to the equation

$$A_n = \begin{cases} (A_{n-1} + d) \bmod (NC - 1), & A_{n-1} + d \neq NC - 1 \text{ and } A_{n-1} \neq NC - 1 \\ NC - 1, & A_{n-1} + d = NC - 1 \\ 0, & A_{n-1} = NC - 1 \end{cases} \tag{1}$$

where $A_n$ denotes the address of the cell undergoing a read/write operation at time instant n = 1, . . ., NC − 1 and $A_0 = 0$. Parameter d depends on the type of operation performed on the buffer. If the buffer is being written, the cell address should be incremented linearly, i.e. $A_n = A_{n-1} + 1$, therefore d = 1. Reading out the buffer contents column-wise results in block-wise pixel ordering, therefore d = C. Since $A_{n-1} + d \leq 2(NC-1)$ and $d \in \{1, C\}$, the modulo operation required in (1) can be implemented as a conditional reduction of $A_{n-1} + d$ by NC − 1, denoted as CR(NC − 1) in Fig. 10, which leads to Eq. (2) and the corresponding division-less, hardware-efficient implementation shown in Fig. 10(a):

$$A_n = \begin{cases} A_{n-1} + d, & A_{n-1} + d \leq NC - 1 \text{ and } A_{n-1} \neq NC - 1 \\ A_{n-1} + d - (NC - 1), & A_{n-1} + d > NC - 1 \text{ and } A_{n-1} \neq NC - 1 \\ 0, & A_{n-1} = NC - 1 \end{cases} \tag{2}$$
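A minimal software model of this address generator (ours, for illustration) confirms that a single counter with conditional reduction produces both the row-wise write order and the column-wise read order:

    def address_sequence(N, C, d):
        """Yield the NC buffer addresses visited with step d (d = 1: row-wise
        write, d = C: column-wise read). N: transform size, C: pixels per line."""
        M = N * C - 1
        a = 0
        for _ in range(N * C):
            yield a
            if a == M:                # last cell: wrap to the beginning
                a = 0
            elif a + d <= M:          # no overflow: plain increment by d
                a = a + d
            else:                     # conditional reduction CR(NC-1)
                a = a + d - M

    # Example from Fig. 9 (N = 3, C = 12):
    print(list(address_sequence(3, 12, 1))[:6])   # write order: 0, 1, 2, 3, 4, 5
    print(list(address_sequence(3, 12, 12))[:6])  # read order:  0, 12, 24, 1, 13, 25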
If the image sensor readout circuit and the image compressor operate in a fully synchronous mode, the implementation from Fig. 10(a) can be further simplified to the one presented in Fig. 10(b). In fully synchronous mode, the readout operation from one buffer (d = C) follows the write-in operation to the other buffer (d = 1). In such a case the two separate address generators (for buffers A and B) operating alternately can be replaced with a single one. The two separate SRAMs implementing the buffers can then also be replaced with a single one of double capacity, which saves power and silicon area in an ASIC implementation. It also eliminates the input and output data multiplexers. In order to distinguish one buffer from the other, the highest address bit is used. The value of parameter d is set alternately to 1 or C according to the required operation. Register (1) in Fig. 10(b) holds the current address in the same way as the registers in Fig. 10(a), while register (2) stores the previous address value.

Fig. 10. Basic (a) and improved (b) versions of the progressive to block-wise order converter. CR(NC − 1) denotes conditional reduction of the input value by NC − 1 according to Eq. (2).

4.2. Color and structure transformation of the CFA image

Although each pixel in the CFA sensor represents a single color component, neighbouring pixels are still significantly correlated. Therefore the first step in the compression of CFA images should be a color transformation that maps the CFA color space into a space in which better compression can be achieved. Since the CFA image results from down-sampling an RGB image, an appropriate color transformation can be derived from a transformation proposed for the RGB space. In order to reduce implementation costs, the authors propose to base it on the RGB to YCgCo conversion originally designed for the H.264 video coding standard in its Fidelity Range Extensions (FRExt):

$$Y = \frac{1}{2}\left(G + \frac{R+B}{2}\right), \qquad C_g = \frac{1}{2}\left(G - \frac{R+B}{2}\right), \qquad C_o = \frac{R-B}{2},$$

where Y is the luma component, while Cg and Co are the green and orange chroma components, respectively. Since the CFA image is
composed of 2 × 2 repeating patterns with two green, one red and one blue pixel, these four elements should be transformed together [9,17] as

$$\begin{bmatrix} Y_1 \\ Y_2 \\ C_g \\ C_o \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 1 & 0 & 1/2 & 1/2 \\ 0 & 1 & 1/2 & 1/2 \\ 1/2 & 1/2 & -1/2 & -1/2 \\ 0 & 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} G_1 \\ G_2 \\ B \\ R \end{bmatrix} \tag{3}$$
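In software form, transformation (3) reduces to a few additions and divisions per 2 × 2 pattern. The sketch below is illustrative only, using plain floating-point arithmetic in place of the shift-based integer arithmetic a hardware version would use:

    def cfa_to_ycgco(g1, g2, b, r):
        """Transformation (3): map one Bayer 2x2 pattern (G1, G2, B, R)
        to (Y1, Y2, Cg, Co)."""
        s = (r + b) / 2
        y1 = (g1 + s) / 2              # luma at the first green position
        y2 = (g2 + s) / 2              # luma at the second green position
        cg = ((g1 + g2) / 2 - s) / 2   # green chroma
        co = (r - b) / 2               # orange chroma
        return y1, y2, cg, co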
4.2.1. Image structure transformation

The image resulting from the color transformation (3) is shown in the top row of Fig. 11 (on the right). The samples of the Cg and Co data constitute regular arrays, which can be compressed directly. The luma component (Y1, Y2) forms a diamond grid (known as a quincunx array), which requires an additional structure transformation to remove the empty pixels. Two such transformations, structure separation and structure conversion (Figs. 12 and 13), have been proposed in [18]. Structure separation results in two rectangular arrays: the first is composed of the odd luma pixels (Y1), and the second contains the even luma pixels (Y2). Prior to the separation, low pass filtering is applied to reduce aliasing. The application of a non-separable diamond 2D filter results in a high hardware implementation cost, especially in terms of the required memory accesses. The structure conversion method, sketched in code after Fig. 12, is simply array squeezing (Fig. 13). The false high frequencies in both the horizontal and vertical directions, which arise during this operation, can be reduced by the application of reversible filtering [18] (deinterlacing) prior to squeezing.
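For illustration, the structure conversion of Fig. 13 amounts to compacting the valid samples of each quincunx row. A minimal sketch, assuming the luma samples sit at positions where the row and column indices have an even sum (the exact phase depends on the CFA layout):

    import numpy as np

    def squeeze_quincunx(y):
        """Structure conversion (Fig. 13): compact an H x W quincunx luma
        array (valid samples where (r + c) is even) into an H x W/2 array."""
        h, w = y.shape
        out = np.empty((h, w // 2), dtype=y.dtype)
        for r in range(h):
            out[r, :] = y[r, (r % 2)::2]   # keep only the valid samples of row r
        return out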
Fig. 11. Image structure after color space transformation.
Fig. 12. Structure separation of the quincunx array into two rectangular arrays: one of odd pixels and the other of even pixels.
From the authors' research it follows that reversible filtering is worthwhile only when lossless compression is used. In the case of lossy compression the filtering should be omitted [9].

4.3. Linear transformation

Statistical analysis reveals that neighbouring pixels in natural images are strongly correlated. It has been found that the inter-pixel correlation in such images decays exponentially with increasing inter-pixel distance, i.e. $E[x_i x_j] = \rho^{|i-j|}$, i, j = 0, 1, . . ., N − 1, with $\rho$ close to 0.95. Linear orthogonal transformations show high effectiveness in reducing inter-pixel correlation and packing the pixel energy into very few transform coefficients.
Fig. 13. Structure conversion of the quincunx array into a rectangular array.
Fig. 14. 2D-block transformation of the luma component.
The energy packing efficiency $G_{TC}$ of a given transformation [19] is determined as the ratio of the arithmetic to geometric means of the coefficient variances $\sigma_n^2$, i.e.

$$G_{TC} = 10 \log_{10} \left( \frac{\frac{1}{N}\sum_{n=0}^{N-1} \sigma_n^2}{\left(\prod_{n=0}^{N-1} \sigma_n^2\right)^{1/N}} \right).$$
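The following Python snippet (an illustrative check of ours, not part of the original paper) computes $G_{TC}$ for the DCT of such an exponentially correlated source and should reproduce, approximately, the dB figures quoted in the next paragraph:

    import numpy as np

    def gtc_dct(N, rho=0.95):
        """Energy packing gain of the N-point orthonormal DCT-II for an
        AR(1) source with correlation E[x_i x_j] = rho**|i - j|."""
        n = np.arange(N)
        C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
        C[0] *= np.sqrt(1 / N)
        C[1:] *= np.sqrt(2 / N)                      # orthonormal scaling
        R = rho ** np.abs(n[:, None] - n[None, :])   # source covariance matrix
        var = np.diag(C @ R @ C.T)                   # transform coefficient variances
        return 10 * np.log10(var.mean() / np.exp(np.log(var).mean()))

    print(gtc_dct(8), gtc_dct(16))   # approx. 8.8 dB and 9.5 dB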
The value of $G_{TC}$ depends on the transformation itself and on its size N. The optimal transform, the Karhunen–Loève transform (KLT), is however not used in practice, because it depends on the signal statistics, and its 2D version is not separable and does not have an efficient implementation algorithm. The discrete cosine transform (DCT), the discrete wavelet transform, or the discrete Walsh–Hadamard transform are used instead of the KLT. When choosing the optimal N the following should be considered: (a) the buffer size in the pixel order converter (see Section 4.1) depends linearly on the transform size N; (b) the computational complexity of the N-point transform grows faster than N; (c) for most natural images ($\rho$ = 0.95) the 16 × 16 DCT ($G_{TC}$ = 9.45 dB) is only marginally better than the 8 × 8 DCT ($G_{TC}$ = 8.82 dB); (d) coefficient quantization artefacts become more visible as the block size increases. In most of the current coding standards, the 8 × 8 DCT block size is considered to be optimal. When adapting this observation to CFA image compression, it should be noted (Fig. 11) that every second pixel in the original chroma image Cg or Co is empty. Therefore a 4 × 4 block of a Cg or Co image corresponds to an 8 × 8 block of an image resulting from the color transformation of the full-color image. For the Y component the situation is more complicated. When the Y image is transformed row-wise (Fig. 14), the distance between neighbouring pixels is 2, as in the Cg or Co case. When it is transformed column-wise the distance is $\sqrt{2}$. For these reasons, and due to the need to conserve hardware resources, the 4 × 4 block was chosen for the proposed image compression algorithm.

4.3.1. Discrete cosine transformation

The 2D-DCT is a separable transformation, usually implemented as a 1D row-wise transformation followed by a 1D column-wise transformation. In the discussed algorithm the 2D-DCT of a 4 × 4 input data block B is computed as

$$X = (C_f B C_f^T) \otimes S_f, \tag{4}$$
where $C_f$ denotes an integer approximation of the 1D-DCT matrix [20]:

$$C_f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} \tag{5}$$
Fig. 15. Pipeline implementation of the 2D 4-point integer DCT (4), (5) without scaling (6).
Here $\otimes$ denotes element-by-element multiplication and the superscript T a transposition. Since the transformation (5) is only an approximation of the original DCT, an additional scaling by the matrix

$$S_f = \begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix} \tag{6}$$

is required, with $a = 1/2$ and $b = \sqrt{2/5}$. The inverse DCT of a 4 × 4 input data block X is computed as

$$\hat{B} = C_i^T (X \otimes S_i) C_i,$$

where

$$C_i = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1/2 & -1/2 & -1 \\ 1 & -1 & -1 & 1 \\ 1/2 & -1 & 1 & -1/2 \end{bmatrix}, \qquad S_i = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix}$$
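A numerical sketch of (4)–(6) (ours, for illustration; in hardware only the additions and shifts of $C_f$ are computed, while the scaling $S_f$ is folded into quantization) verifies that the forward/inverse pair reconstructs the input exactly in the absence of quantization:

    import numpy as np

    Cf = np.array([[1, 1, 1, 1],
                   [2, 1, -1, -2],
                   [1, -1, -1, 1],
                   [1, -2, 2, -1]], dtype=float)
    Ci = np.array([[1, 1, 1, 1],
                   [1, 0.5, -0.5, -1],
                   [1, -1, -1, 1],
                   [0.5, -1, 1, -0.5]])
    a, b = 0.5, np.sqrt(2 / 5)
    u, v = np.array([a, b / 2, a, b / 2]), np.array([a, b, a, b])
    Sf = np.outer(u, u)                     # matrix (6): Sf[i, j] = u_i * u_j
    Si = np.outer(v, v)                     # inverse scaling matrix

    def fdct4(B):
        return (Cf @ B @ Cf.T) * Sf         # Eq. (4); '*' is element-wise

    def idct4(X):
        return Ci.T @ (X * Si) @ Ci         # inverse transformation

    B = np.arange(16, dtype=float).reshape(4, 4)
    assert np.allclose(idct4(fdct4(B)), B)  # exact reconstruction w/o quantization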
4.3.2. 2D-DCT pipeline implementation

The 2D-DCT presented above can be implemented very efficiently using the pipeline architecture shown in Fig. 15. D and D_i denote data registers, which store temporary results, shift in image pixels, or shift out transformation results. The presented circuit operates in a continuous manner: the transformation of a new 4 × 4 image block can be started every 16 clock cycles. The shift register composed of D0–D3 is used as the data input to the unit. Input data are transformed column-wise first; next they are transposed and transformed row-wise. Registers D1–D4 are used to shift out the results of the transformation of the previous image block. After the first column of the new image block has been entered into registers D0–D3 (which requires 4 clock cycles), registers D0–D3 are connected to the 1D DCT unit by the multiplexer M. The result
of the transformation is stored in the last column of the transposition buffer in the next (5th) clock cycle. Also in the 5th clock cycle, the first pixel of the second column of the transformed image block is shifted into the D0 register. After 16 clock cycles the whole image block has been transformed column-wise and the result is stored in the register file of the transposition unit. The last storing operation is performed jointly with a transposition of the result, which is implemented as a register-to-register copy operation according to the doubled arrows pointing to the D registers of the transposition unit (Fig. 15). In the next (17th) clock cycle the computation of the row-wise part of the 2D transformation is started. For this purpose the outputs of the transposition buffer are connected to the 1D DCT unit by the multiplexer M. The transformation result is stored in the shift register D1–D4 using multiplexers m3–m0. At the same time (17th clock cycle) the transformation of a new image block is started: the first pixel of the first column of the new image block is stored in D0. After an additional 3 clock cycles all pixels from the first column of the new image block are stored in registers D0–D3. At this time the first row of coefficients (stored in D2–D4) resulting from the transformation of the previous image block has already been shifted out, which means that the first column of the transposition buffer is free. Therefore, at the next clock cycle, the data inside the transposition buffer are shifted from columns 2–4 into columns 1–3 and the result of the transformation of the first column of the new block is stored in the last column of the transposition buffer.

4.4. Coefficient encoding

The coefficients (4) resulting from the transformation of an image block are quantized and entropy encoded using a technique similar to the one known from the JPEG standard. DC and AC coefficients are encoded separately due to their different statistical properties. The DC coefficients of adjacent 4 × 4 blocks exhibit strong correlation, and they also represent a significant fraction of the total image energy; therefore their differences, resulting from DPCM processing, are entropy encoded. The remaining AC coefficients are encoded in a 2-step process. First, they are converted into intermediate sequences of symbols (z, v), where z is the number of consecutive zero-value AC coefficients preceding the nonzero AC coefficient v. The AC coefficients are processed in the order they become available at the output of the 2D-DCT pipeline processor (Fig. 15). The zig-zag ordering known from the JPEG standard is not used, since for a small block size (4 × 4) it does not provide any advantages but requires additional hardware resources. If the remaining AC coefficients in the block are all zero, they are represented by a single symbol (0, 0). In the second step, variable-length Huffman codes (VLC) are assigned to the symbols s = 16|v|₂ + z, where |v|₂ denotes the length of the binary representation (without leading zeros) of v (if v > 0) or −v (if v < 0). The value of v is encoded with a variable length integer (VLI) code whose length in bits, |v|₂, has already been encoded using the VLC. The VLI of v is equal to the binary representation of v if v > 0, or of −v if v < 0. In the encoding process separate Huffman tables for the DC and AC coefficients are used.
In order to limit the number of entries in the AC Huffman table to 64, any symbol s for which |v|₂ > 4 is encoded as an escape code (ESC) followed by 4 bits representing |v|₂ and 4 bits representing z. The ESC code is represented by the symbol with |v|₂ = 0 and z = 15, which would correspond to a run of 16 zeros (15 + 1); such a run never occurs normally, because the number of AC coefficients in a block is just 15.
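A compact sketch of this AC coding path (ours, for illustration; the Huffman tables themselves and the bit-level sign convention of the VLI are omitted):

    def encode_ac_block(ac):
        """ac: the 15 quantized AC coefficients of one 4x4 block, in 2D-DCT
        pipeline output order. Returns (huffman_symbol, extra_bits) pairs,
        where extra_bits is a list of (value, number_of_bits) fields."""
        out, run = [], 0
        for v in ac:
            if v == 0:
                run += 1
                continue
            size = abs(v).bit_length()          # |v|_2: bits without leading zeros
            if size > 4:                        # rare large value: ESC symbol,
                out.append((15, [(size, 4), (run, 4), (abs(v), size)]))
            else:                               # regular symbol s = 16*|v|_2 + z
                out.append((16 * size + run, [(abs(v), size)]))
            run = 0
        if run:                                 # trailing zeros: end-of-block (0, 0)
            out.append((0, []))
        return out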
Table 2
FPGA resource utilization reported by the Magma Blast logic synthesiser for the different modules of the proposed WCE core. The numbers in parentheses give the total numbers of available individual resources.

Module name         LUT4 (7680)   Flip-flops (7680)   RAM4K (32)
Camera interface    291           149                 1
Image compressor    3169          1418                11
Bitstream FIFO      462           111                 16
Error correction    65            264                 0
5. Compression results

The performance of the implemented image compression algorithm was evaluated and compared to the JPEG and JPEG2000 standards. The results are presented in Table 1. The peak signal-to-noise ratio (PSNR) was used as the image quality measure. The CFA images used in the test were obtained by appropriate down-sampling of full-color images from a high quality standard endoscope. The availability of the original RGB images enables the evaluation of the PSNR in both the RGB and CFA color spaces (marked as RGB and CFA in Table 1). The first row of Table 1 shows the PSNR of an RGB image resulting from bilinear demosaicing of the corresponding CFA image. This result is presented to provide a baseline for further comparison [13] and to show how large a drop in picture quality is caused by the application of a CFA camera. JPEG and JPEG2000 operate on full-color images only, therefore prior to compression the CFA images were interpolated using the bilinear method. CFA images resulting from compression by the proposed algorithm were converted to RGB space using the advanced demosaicing method of [14]. From Table 1 it can be observed that, for the chosen set of test endoscopic images, the proposed image coder performs similarly to the JPEG and JPEG2000 standards, while its implementation complexity is only a small fraction of the complexity of these standards. The achieved compression ratio is 10 on average, which enables the low-delay, real-time, 24 fps image transmission required for accurate remote manipulation of a WCE.

6. Implementation results

The prototype board used for system validation is presented in Fig. 16. It consists of a CMOS image sensor with illumination and optics [11], a miniaturized board with the FPGA chip, and a wireless transmitter [6]. All of the modules of the proposed image processing core, including the compressor, camera interface, FIFO, error correction, and system controller, were implemented in a single iCE65L08-CC72, an ultra low power 65 nm FPGA chip [21] in a wafer-level chip-scale package with dimensions 4.38 mm × 4.79 mm.
Fig. 16. The prototype FPGA environment used for system validation: (1) iCE65L08-CC72 FPGA chip, (2) FPGA configuration memory, (3) camera connector, (4) CMOS sensor with optics and illumination [11], and (5) wireless transmitter [6].
Table 3
Comparison with different WCE systems (1 kbit = 1024 bits).

System                    Clock      On-chip memory size                  Power consumption of the digital core       Technology
Work [8] (8 fps, QVGA)    40 MHz     750 kbits (93.81 kBytes [8])         Mean: 6.2 mW (1.8 V);                       ASIC 180 nm
                                                                          mean per image frame: 0.77 mW
Work [10] (8 fps, QVGA)   20 MHz     764 kbits                            Mean: 1.3 mW (0.95 V);                      ASIC 180 nm
                                                                          mean per image frame: 0.16 mW
Proposed (24 fps, QVGA)   24 MHz     107 kbits = 40 kbits (converter,     Active: 12.5 mW, 8.2 ms; idle: 8.9 mW,      FPGA 65 nm
                                     Fig. 10(b)) + 64 kbits (FIFO)        33.4 ms; mean: 9.6 mW (1.2 V);
                                     + 3 kbits (Huffman table)            mean per image frame: 0.4 mW
Proposed (12 fps, QVGA)   12 MHz     107 kbits = 40 kbits (converter,     Active: 10.2 mW, 16.5 ms; idle: 4.8 mW,     FPGA 65 nm
                                     Fig. 10(b)) + 64 kbits (FIFO)        66.8 ms; mean: 5.9 mW (1.2 V);
                                     + 3 kbits (Huffman table)            mean per image frame: 0.49 mW
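The mean figures in Table 3 follow from duty-cycle weighting of the two phases; for the proposed system at 24 fps:

$$P_{\text{mean}} = \frac{12.5\,\text{mW} \times 8.2\,\text{ms} + 8.9\,\text{mW} \times 33.4\,\text{ms}}{8.2\,\text{ms} + 33.4\,\text{ms}} \approx 9.6\,\text{mW}, \qquad \frac{9.6\,\text{mW}}{24\,\text{fps}} \approx 0.4\,\text{mW per frame}.$$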
Due to its optimized architecture it was possible to implement all of the required memory as on-chip memory, i.e. as part of the FPGA chip. Its size and distribution among the system's modules are given in Table 2, and compared to other systems in Table 3. The only external system components are the voltage regulators with decoupling capacitors, a quartz-based clock generator, and a serial FLASH configuration memory. Since the iCE65L08-CC72 includes a one-time-programmable configuration memory, the external configuration memory can be removed in the final version of the system. The utilization of the most important FPGA resources, including 4-input lookup tables (LUT4), flip-flops, and 4 kbit memory blocks (RAM4K), by the different modules of the proposed core is given in Table 2.

The performance of the designed FPGA system is shown and compared to other systems in Table 3. The power consumption of the system in its two phases, active and idle, is presented together with the lengths of the phases. In the active phase all of the modules are working. In the idle phase the image compressor does not work, but the FIFO and FEC modules are still active. The length of the active phase is equal to the time necessary for image acquisition, since the image data are compressed on the fly. The mean power consumption per single compressed image frame amounts to 0.4 mW when the system is working at a 24 fps frame rate. This value is lower than the 0.77 mW reported in [8] but higher than the 0.16 mW reported in [10]. The higher power consumption in comparison to [10] results mainly from the power consumed in the idle state. As such, it could easily be reduced by the application of a clock gating technique, as was done in [10], since in the idle state the only active modules are the FIFO and FEC. However, in the chosen FPGA this technique is not well supported at present. The power consumption of the system in the active state is kept low thanks to the innovative hardware architecture, which (1) needs only two clock cycles to process a single image pixel, (2) uses 7 times less on-chip memory than other designs [8,10], and (3) employs a very efficient yet low power Reed–Solomon (255, 223) encoder making use of the Berlekamp bit-serial architecture [12] for error protection of the transmitted data. The employed encoder reduces the random bit error rate from 10⁻³ (a typical value for a wireless channel) to about 10⁻¹³ (error-free transmission), and corrects burst-like transmission errors (up to 16 bytes). Error-free transmission means that the FIFO space occupied by data already transmitted is immediately available for new incoming image data, which reduces the system memory footprint and simplifies the overall system design.

7. Conclusions

The paper presented a new, complete, FPGA-based image and data processing core for future generation wireless capsule
endoscopy. The presented core, owing to its high processing efficiency (two clock cycles per image pixel), is able to compress an acquired image directly (8.2 ms per QVGA image at a 24 MHz system clock) and therefore does not require temporary storage for raw image data. The requirement to retransmit erroneously received data was eliminated by FEC encoding of the data. In this way the system memory footprint was reduced and the overall system design was simplified. Following successful tests, the developed core was integrated with an illumination module, an accelerometer sensor, a wireless transmitter [6], and a 3D inductive power supply module [23]. Its miniaturized version is shown in Fig. 7 of [22]. When the WCE system operates at 24 fps, its power consumption amounts to 86 mW, which is split between the system's components as follows: 9.6 mW is consumed by the presented digital core, 40 mW by the CMOS sensor [11], 34 mW by the LED-based illumination module [22], and 2 mW by the radio transmitter [6]. Sufficiently long operation of the capsule can be guaranteed by the 3D inductive power supply module [23], which can provide up to 330 mW of power.

Acknowledgments

This work is supported by the European Community, within the 6th Framework Programme, through the Vector project (contract number 0339970). The authors would like to thank the project partners and the funding organization.

References

[1] A. Glukhovsky, Wireless capsule endoscopy, Sensor Review 23 (2003) 128–133.
[2] R. Carta, G. Tortora, J. Thonè, B. Lenaerts, P. Valdastri, A. Menciassi, P. Dario, R. Puers, Wireless powering for a self-propelled and steerable endoscopic capsule for stomach inspection, Biosensors and Bioelectronics 25 (4) (2009) 845–851.
[3] P. Swain, A. Toor, F. Volke, J. Keller, J. Gerber, E. Rabinovitz, R.I. Rothstein, Remote magnetic manipulation of a wireless capsule endoscope in the esophagus and stomach of humans (with videos), Gastrointestinal Endoscopy 71 (7) (2010) 1290–1293.
[4] M. Simi, P. Valdastri, C. Quaglia, A. Menciassi, P. Dario, Design, fabrication, and testing of a capsule with hybrid locomotion for gastrointestinal tract exploration, IEEE Transactions on Mechatronics 15 (2) (2010) 170–180.
[5] L. Wang, T.D. Drysdale, D.R.S. Cumming, In-situ characterization of two wireless transmission schemes for ingestible capsules, IEEE Transactions on Biomedical Engineering 54 (11) (2007) 2020–2027.
[6] J. Thonè, S. Radiom, D. Turgis, R. Carta, G. Gielen, R. Puers, Design of a 2 Mbps FSK near-field transmitter for wireless capsule endoscopy, Sensors and Actuators A: Physical 156 (1) (2009) 43–48.
[7] D. Turgis, R. Puers, Image compression in video radio transmission for capsule endoscopy, Sensors and Actuators A 123–124 (2005) 129–136.
[8] X. Xie, G. Li, X. Chen, X. Li, Z. Wang, A low-power digital IC design inside the wireless endoscopic capsule, IEEE Journal of Solid-State Circuits 41 (11) (2006) 2390–2400.
[9] P. Turcza, T. Zielinski, M. Duplaga, Hardware implementation aspects of new low complexity image coding algorithm for wireless capsule endoscopy, Springer-Verlag, LNCS 5101 (2008) 476–485.
[10] X. Chen, X. Zhang, L. Zhang, X. Li, N. Qi, H. Jiang, Z. Wang, A wireless capsule endoscope system with low-power controlling and processing ASIC, IEEE Transactions on Biomedical Circuits and Systems 3 (1) (2009) 11–22.
[11] M. Vatteroni, D. Covi, C. Cavallott, L. Clementel, P. Valdastri, A. Menciassi, P. Dario, A. Sartori, Smart optical CMOS sensor for endoluminal applications, Sensors and Actuators A 162 (2010) 297–303.
[12] E.R. Berlekamp, Bit-serial Reed–Solomon encoders, IEEE Transactions on Information Theory IT-28 (6) (1982) 869–874.
[13] N.-X. Lian, L. Chang, V. Zagorodnov, Y.-P. Tan, Reversing demosaicking and compression in color filter array image processing: performance analysis and modeling, IEEE Transactions on Image Processing 15 (11) (2006) 3261–3278.
[14] K. Hirakawa, T.W. Parks, Adaptive homogeneity-directed demosaicing algorithm, IEEE Transactions on Image Processing 14 (3) (2005) 360–369.
[15] S. Farsiu, M. Elad, P. Milanfar, Multiframe demosaicing and super-resolution of color images, IEEE Transactions on Image Processing 15 (1) (2006) 141–159.
[16] S. Farsiu, M.D. Robinson, M. Elad, P. Milanfar, Fast and robust multiframe super resolution, IEEE Transactions on Image Processing 13 (10) (2004) 1327–1344.
[17] S.-Y. Lee, A. Ortega, A novel approach of image compression in digital cameras with a Bayer color filter array, in: Proc. of Int. Conf. Image Process., vol. 3, 2001, pp. 482–485.
[18] C.C. Koh, J. Mukherjee, S.K. Mitra, New efficient methods of image compression in digital cameras with color filter array, IEEE Transactions on Consumer Electronics 49 (4) (2003) 1448–1456.
[19] K. Sayood, Introduction to Data Compression, Morgan Kaufmann, San Francisco, CA, 1996.
[20] H.S. Malvar, A. Hallapuro, M. Karczewicz, L. Kerofsky, Low-complexity transform and quantization in H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology 7 (2003) 598–603.
[21] http://www.siliconbluetech.com.
[22] C. Cavallotti, P. Merlino, M. Vatteroni, P. Valdastri, A. Abramo, A. Menciassi, P. Dario, An FPGA-based versatile development system for endoscopic capsule design optimization, Sensors and Actuators A (2011), doi:10.1016/j.sna.2011.01.010.
[23] R. Carta, J. Thonè, R. Puers, A wireless power supply system for robotic capsular endoscopes, Sensors and Actuators A: Physical 162 (2) (2010) 177–183.
Biographies

Pawel Turcza received the M.S. degree in computer science in 1993 and the Ph.D. degree in electronic engineering in 2001 from the AGH University of Science and Technology in Cracow, Poland. Since 1993 he has been working in the Department of Instrumentation & Measurement of AGH-UST, as a Research & Teaching Assistant (1993) and as an Assistant Professor (2001). He is a co-author of more than 45 scientific publications in international journals and conference proceedings. His primary research interests include image processing, image coding, and signal processing for communication and identification.

Mariusz Duplaga graduated from the Jagiellonian University Medical College in Kraków in 1991, completed a doctoral thesis in medicine in 1999, and holds specialisations in internal medicine and pulmonology. He currently holds a research and teaching post in the Institute of Public Health and the Department of Respiratory Medicine, Jagiellonian University Medical College. His teaching activities cover e-health and telemedicine, e-inclusion, the evolution of public health, as well as respiratory medicine. He has participated in many national and international projects (MATCH, eHealth ERA, MPOWER, VECTOR). His main research topics cover the use of ICT for the support of medical services, with special focus on telemedicine and e-health, e-inclusion, telemonitoring and endoscopic diagnostics.