Fast hardware architecture for fixed-point 2D Gaussian filter


Int. J. Electron. Commun. (AEÜ) 105 (2019) 98–105

Contents lists available at ScienceDirect

International Journal of Electronics and Communications (AEÜ) journal homepage: www.elsevier.com/locate/aeue

Regular paper

Debasish Mukherjee ⇑, Susanta Mukhopadhyay
Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad 826004, India

Article info

Article history: Received 15 January 2019; accepted 30 March 2019

Keywords: FPGA; Fixed-point representation; Generalized filter; Separable filter; Generalized filter architecture; Separable filter architecture

Abstract

This study presents a hardware architecture for a 16-bit, 5 × 5 fixed-point 2D Gaussian kernel. Two filters are proposed, one using a generalized kernel and the other using a separable kernel. The quality of both filters is analyzed with several metrics, and the generalized filter achieves the best performance among the existing methods considered. The proposed filters are evaluated on different images and perform similarly to the original filter, as is evident from the differences in Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Metric (SSIM) between the proposed filters and the original filter. Based on the principle of Distributed Arithmetic (DA), two hardware architectures, the Generalized Filter Architecture (GFA) and the Separable Filter Architecture (SFA), are proposed. SFA improves on GFA in area, delay, power, Area Delay Product (ADP) and Area Power Product (APP). GFA, on the other hand, achieves a higher maximum clock frequency and a shorter processing time than SFA. The performance of both architectures is analyzed in terms of maximum speed and frames per second, and both achieve better performance than the existing techniques considered. © 2019 Elsevier GmbH. All rights reserved.

⇑ Corresponding author. E-mail addresses: [email protected] (D. Mukherjee), msushanta2001@gmail.com (S. Mukhopadhyay).
https://doi.org/10.1016/j.aeue.2019.03.020
1434-8411/© 2019 Elsevier GmbH. All rights reserved.

1. Introduction

Digital images are often degraded by noise during the image acquisition and transmission stages. As a preprocessing step, the noise must be removed without significantly degrading image details such as edges and texture. Noise smoothing aims at identifying the noisy pixels and modifying their intensity values using prescribed rules. The Gaussian filter is widely used for smoothing noise in digital images. Its working mechanism resembles the human visual system [1]. It is used as the initial step of many edge detection algorithms, such as those of Canny [2] and Marr and Hildreth [3].

The Gaussian filter involves floating-point arithmetic because the Gaussian kernel coefficients are represented as floating-point numbers. A hardware implementation of floating-point operations on a dedicated platform consumes a large share of the available resources, which deteriorates the performance of the system. An alternative is a fixed-point representation, which enhances the performance of the system at the cost of an acceptable loss in the quality of the filtered image.

An important property of the Gaussian kernel is that it is separable. A 2D kernel is separable if it can be expressed as the product of two 1D kernels. The advantage is reduced computational complexity, since two 1D kernels are processed instead of one 2D kernel (for a 5 × 5 kernel, 10 multiply-accumulate operations per output pixel instead of 25). Although separability is an important topic, few designs focus on optimizing the separable kernel on a hardware platform [4,5]. These designs focus on processing arbitrary kernels without giving much attention to the Gaussian filter.

The existing literature contains several works on designing an efficient architecture for the 2D Gaussian filter on a suitable hardware platform [6–10]. Some of these works approximate the filter coefficients by powers of two, while the remaining works either consider integer kernel coefficients or use a fixed-point approximation of the original filter coefficients. A power-of-two approximation can be realized using shift and adder logic, which achieves low area utilization at the cost of a higher error rate. For instance, Jaiswal et al. [6] and Garg et al. [7] proposed energy-efficient architectures for the Gaussian filter using the power-of-two approximation technique. The total energy consumption is reduced in the former by exploiting nearest-pixel similarity and in the latter by eliminating the contribution of the coefficients in the outer boundary region. Kalali et al. [8] proposed a low-energy 2D adaptive Gaussian blur hardware architecture based on the similarity among neighboring pixels. The kernel coefficients are integers in this case. Hsiao et al. [9] also proposed a 5 × 5 power-of-two approximation architecture that exploits the symmetry of the Gaussian kernel. Khorbotly and Hassan [10] proposed a fixed-point Gaussian filter architecture that suffers from a significant error rate, which can be reduced by increasing the bit width of the filter coefficients.

The Gaussian filter is implemented as a convolution of a fixed-size kernel over each pixel of the input image. This operation is time consuming, since it involves computing the sum of products of the kernel weights with the pixels in the neighborhood region. The state-of-the-art designs focus on approximating the 2D Gaussian kernel either by powers of two or by a fixed-point representation, without optimizing the computation involved in processing the entire image. This in turn increases the processing time and reduces the performance of the system. To address these issues, efficient design strategies are employed here to devise a performance-enhanced architecture at the cost of an acceptable loss in the quality of the output image. The contributions of this article can be summarized as follows:

• An approximate 5 × 5 Gaussian kernel is presented. For a fixed-point representation, increasing the bit length of the filter coefficients reduces the error rate at the cost of higher resource utilization. Conversely, reducing the bit length increases the performance of the system at the cost of a higher error rate. Thus, the bit length of the filter coefficients must be chosen to achieve a trade-off between area utilization and error rate. Here, two kernels with 16-bit fixed-point representations are proposed: a generalized kernel and a separable kernel.
• Two filters making use of the two kernels are designed: the generalized filter and the separable filter. A quality analysis of the proposed filters shows that the generalized filter achieves the lowest error rate compared to the existing filters and the proposed separable filter.
• The efficacy of the proposed filters is evaluated on a set of noisy images, and the results show that applying the filters does not deteriorate the output image quality. Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index Metric (SSIM) are used to analyze the performance of both proposed filters on different images, and both achieve performance similar to that of the original filter.
• Based on the principle of Distributed Arithmetic (DA), two architectures, namely the Generalized Filter Architecture (GFA) and the Separable Filter Architecture (SFA), are devised. SFA achieves lower area, delay, power, Area Delay Product (ADP) and Area Power Product (APP), whereas GFA achieves a higher maximum clock frequency and a shorter processing time than SFA. Both architectures achieve low processing times when evaluated on varied image dimensions, which implies that both perform reasonably well on high-resolution images.
• The performance of both architectures is measured in terms of maximum speed and frames per second, and both achieve better performance than the existing works for processing a 5 × 5 kernel.

The rest of the paper is organized as follows. Section 2 explains the preliminaries: the fundamentals of the 2D Gaussian smoothing filter and a brief description of DA. Section 3 presents the proposed architecture for 2D 5 × 5 convolution. Section 4 explains the experimental results of implementing the approximated Gaussian kernel in the Matlab simulation environment and on an FPGA, followed by a comparison with existing architectures. Finally, Section 5 concludes the paper.

2. Preliminaries

This section briefly explains the Gaussian smoothing filter and the conventional DA approach. These descriptions provide the groundwork required for the formulation of the proposed approach.

2.1. Gaussian filter

A 2D Gaussian function with standard deviation σ (i.e. σ_x = σ_y = σ) and mean μ (μ_x = μ_y = μ) is given by:

$$G(x,y) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{(x-\mu)^2 + (y-\mu)^2}{2\sigma^2}\right) \tag{1}$$

where x, y are the spatial coordinates and the standard deviation determines the width of the Gaussian function. The amount of smoothing (or blurring) increases with the standard deviation. The Gaussian filter is extensively employed for smoothing out the effect of noise in an image:

$$\tilde{f}(x,y) = f(x,y) \ast g(x,y) \tag{2}$$

where f is the 2D input image, \tilde{f} is the 2D output image and g represents the Gaussian kernel. The coefficients of the Gaussian kernel are generally normalized, as in the following example of a 5 × 5 Gaussian kernel with σ = 1 and μ = 0:

$$G_{org} = \begin{bmatrix}
0.0030 & 0.0133 & 0.0219 & 0.0133 & 0.0030 \\
0.0133 & 0.0595 & 0.0983 & 0.0595 & 0.0133 \\
0.0219 & 0.0983 & 0.1622 & 0.0983 & 0.0219 \\
0.0133 & 0.0595 & 0.0983 & 0.0595 & 0.0133 \\
0.0030 & 0.0133 & 0.0219 & 0.0133 & 0.0030
\end{bmatrix} \tag{3}$$

2.2. Principle of distributed arithmetic

Convolution involves the sum of products of the kernel weights and the pixel values in the neighbourhood region. The multiplications are carried out by the built-in multipliers found in modern FPGAs. However, as the kernel size increases, the number of multiplier units grows rapidly (one multiplier per coefficient, i.e. n² for an n × n kernel). This limits the design to small kernel sizes or forces the use of a high-cost, high-density FPGA device. DA is a popular multiplier-less approach that performs multiplication using precomputed look-up tables (LUTs) [11]. Implementing DA on FPGAs is highly area efficient, as it can take advantage of the slice LUTs, which in turn enhances the performance of the system. To understand the DA paradigm, consider the sum of products shown in Eq. (4):

$$y = \sum_{n=0}^{N-1} a[n]\, x[n] \tag{4}$$

where a[n] are constant coefficients and x[n] are variable inputs, each with N sample values. An unsigned representation of x[n] is given by:

$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b \tag{5}$$

with x_b[n] ∈ {0, 1}, where x_b[n] denotes the b-th bit of x[n]. Therefore Eq. (4) can be written as:

$$y = \sum_{n=0}^{N-1} \sum_{b=0}^{B-1} a[n]\, x_b[n]\, 2^b \tag{6}$$

Rearranging Eq. (6) we get:

$$y = \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} a[n]\, x_b[n] \tag{7}$$
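The rearrangement in Eq. (7) can be checked numerically. The following is a minimal software sketch, not the hardware design of this paper: because the inner sum over n depends only on one bit from each of the N samples, it can be precomputed for all 2^N bit patterns and read back from a table. The function names, the integer coefficients and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List


def build_da_lut(coeffs: List[int]) -> List[int]:
    """Precompute the inner sum of Eq. (7) for every possible N-bit address."""
    n = len(coeffs)
    lut = []
    for address in range(1 << n):
        # Bit i of the address plays the role of x_b[i].
        lut.append(sum(c for i, c in enumerate(coeffs) if (address >> i) & 1))
    return lut


def da_inner_product(coeffs: List[int], samples: List[int], bits: int = 8) -> int:
    """Evaluate Eq. (7): one table read and one shift-add per bit plane."""
    lut = build_da_lut(coeffs)
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into an N-bit table address.
        address = 0
        for i, x in enumerate(samples):
            address |= ((x >> b) & 1) << i
        acc += lut[address] << b  # weight the table output by 2**b
    return acc


coeffs = [3, 5, 7, 5, 3]         # hypothetical integer coefficients a[n]
samples = [12, 200, 255, 0, 97]  # hypothetical 8-bit unsigned inputs x[n]
direct = sum(a * x for a, x in zip(coeffs, samples))  # Eq. (4)
print(direct, da_inner_product(coeffs, samples))
```

Both printed values agree, since Eq. (7) is only a reordering of Eq. (4). The table has 2^N entries, which is why its size grows quickly with the number of taps.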

The inner sum of the terms a[n]·x_b[n] in Eq. (7) is realized using one LUT. A 2^N-word LUT is configured to accept the N-bit input x_b = (x_b[0], x_b[1], ..., x_b[N−1]) as its address. The outputs are weighted by the appropriate powers of two, which can be realized as shift logic, and the inner product y takes B cycles to compute. The ROM size can act as a bottleneck and can restrict the design to a few kernel sizes, since the ROM size increases exponentially with the kernel size. To eliminate this limitation, a single ROM unit is partitioned into several ROM units whose outputs are combined using a tree-based adder unit.

3. Proposed architecture

This section presents the proposed architecture for 2D convolution. Throughout the discussion, the input image is of dimension M × N and the kernel is of dimension R × S. The section first presents the proposed 16-bit fixed-point approximation of the filter coefficients, followed by the DA-based architecture for processing a 1 × 5 kernel. Based on the fixed-point approximation, two designs, namely the Generalized Filter Architecture (GFA) and the Separable Filter Architecture (SFA), are then proposed.

3.1. Proposed fixed point implementation

Filtering using Gaussian kernel coefficients involves floating-point arithmetic, which consumes a lot of hardware circuitry when implemented on an FPGA device. Generally, floating-point operations are performed using a specialized Floating Point Unit (FPU) integrated into an FPGA device. Fixed-point approximations allow fractions to be represented in a manner suitable for hardware implementation. The package fixed_pkg provides synthesizable VHDL code for fixed-point operations [12]. It is included in VHDL-2008 and also has a VHDL-1993 compatibility version. However, these packages are less flexible and have limited usage for designing synthesizable HDL code. An alternative is to design a test bench that can be used to visualize synthesizable fixed-point numbers. This is done by converting the std_logic_vector type to (l,-r) format. Fig. 1 shows the 16-bit fixed-point representation of the decimal value 0.364.
For example, the (1,-15) format is used to represent 16-bit data. The decimal equivalent of a coefficient x = x15 x14 ... x2 x1 x0 represented in (1,-15) format is given by:

$$x_{15}\,2^{0} + x_{14}\,2^{-1} + \dots + x_{1}\,2^{-14} + x_{0}\,2^{-15} \tag{8}$$

Each of the original filter coefficients shown in Eq. (3) is represented in (1,-15) format. These 16-bit numbers are converted into their decimal equivalents (Eq. (8)) and are shown below. The resulting 5 × 5 matrix is named the generalized kernel (G_GK) for the rest of this article.

$$G_{GK} = \begin{bmatrix}
0.0029 & 0.0132 & 0.0218 & 0.0132 & 0.0029 \\
0.0132 & 0.0594 & 0.0982 & 0.0594 & 0.0132 \\
0.0218 & 0.0982 & 0.1621 & 0.0982 & 0.0218 \\
0.0132 & 0.0594 & 0.0982 & 0.0594 & 0.0132 \\
0.0029 & 0.0132 & 0.0218 & 0.0132 & 0.0029
\end{bmatrix}$$
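The (1,-15) format of Eq. (8) can be illustrated with a short sketch. It assumes that a real value is quantized by truncation towards zero. The text does not state the rounding mode, so this is an assumption, and the helper names to_fixed and from_fixed are hypothetical.

```python
FRAC_BITS = 15  # number of fractional bits in the (1,-15) format


def to_fixed(value: float) -> int:
    """Quantize a value in [0, 2) to a 16-bit word by truncation (assumed mode)."""
    if not 0.0 <= value < 2.0:
        raise ValueError("value outside the range of the (1,-15) format")
    return int(value * (1 << FRAC_BITS))


def from_fixed(word: int) -> float:
    """Decimal equivalent of a 16-bit word, evaluated bit by bit as in Eq. (8)."""
    return sum(((word >> i) & 1) * 2.0 ** (i - FRAC_BITS) for i in range(16))


word = to_fixed(0.364)
print(format(word, "016b"), from_fixed(word))
```

With truncation, the value recovered by from_fixed differs from the input by less than 2^-15, which is the resolution of the format.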

For the separable filter, the 1D horizontal mask of dimension 1 × 5 is given below. It is named the horizontal kernel (G_HOR) for the rest of this article.

$$G_{HOR} = \begin{bmatrix} 0.0599 & 0.2209 & 0.3649 & 0.2209 & 0.0599 \end{bmatrix}$$

A kernel of the same dimension but oriented vertically (G_VER) is used together with the horizontal kernel to obtain the separable kernel (G_SK):

$$G_{SK} = G_{HOR} \times G_{VER}$$

Two filters, namely the generalized filter and the separable filter, are derived using these two kernels.

Fig. 1. A 16 bit fixed-point representation of the number 0.364.

3.2. Proposed architecture for 1 × 5 convolution

Fig. 2 shows the block diagram for 8-bit 1 × 5 convolution. The architecture is designed based on the principle of DA. As shown in Fig. 2, data are shifted through the register unit (q_reg4 to q_reg0) at the onset of each clock cycle. Input data is shifted in through register REG4 and shifted out from register REG0. A single 128 × 8 ROM unit is partitioned into two 16 × 8 ROM units whose outputs are added using an adder module. The ROM units store the filter coefficients in the fixed-point representation discussed in Section 3.1. Data from each bit of the shift register is grouped into k clusters (k = 4), which are passed as address lines into the ROM unit through the decoder unit. For instance, the output from each bit of the q_reg3 to q_reg0 registers selects one ROM, and the output from q_reg4 together with the remaining bits selects the other ROM unit. These two ROM units are selected at the same clock pulse. Hence, all the data are generated in a single clock cycle. After shifting, the data from the ROM units are added using a pipelined adder to obtain the output of the 1 × 5 convolution.

3.3. Generalized filter architecture

Fig. 3 shows the block diagram for the 5 × 5 GFA. It consists of processing elements (PE1 through PE5), line buffer units and the pipelined adder unit. Input data from the din signal is passed to the PE1 to PE5 units, which are enabled based on the en signal. Each PEi module is designed based on the principle of DA (Fig. 2). The result of processing a 1 × 5 kernel (pout signal) is stored in the line buffer unit. For a 5 × 5 kernel, the results of processing at most five rows must be stored in the line buffer. Thus, each PEi requires at most five line buffer units. Each line buffer is enabled using the chip select (cs) line, which is determined by the line counter module (chip select module). The line counter module is a modulo counter with respect to the row dimension of the kernel and is incremented for each row of the input image. The read (rd) and write (wr) memory signals are generated by the Read & Write Control Unit, which is realized as a Finite State Machine (FSM). Intermediate results from the buffer units are passed to the pipelined adder through a MUX unit that is selected based on the Mod Line Counter unit. The output from the pipelined adder gives the result of processing one 5 × 5 kernel.

3.4. Separable filter architecture

Fig. 4 shows the block diagram for the 5 × 5 SFA. It consists of the ROW PE, COL PE and Memory Module units. As in GFA, both the ROW PE and COL PE modules are implemented based on the principle of DA (Fig. 2), where ROW PE processes the horizontal kernel (1 × 5) and COL PE processes the vertical kernel (5 × 1). Input data from the external memory is passed to the ROW PE module, which stores the result of processing the 1 × 5 kernel in the Memory Module. The Column Counter Module is a modulo-R counter, where R is the row dimension of the kernel, and is incremented at each clock cycle in which a result from ROW PE is available. To minimize the delay, partial results from the Memory Module are accessed concurrently in the COL PE module through the MERGE UNIT. The output from COL PE gives the result of processing one 5 × 5 kernel.
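The data flow of the separable architecture (a row pass followed by a column pass) can be mirrored in software. The sketch below is an illustration rather than a model of the hardware: it assumes that G_VER holds the same values as G_HOR, oriented vertically, and that only fully covered ("valid") output pixels are computed. The test image and all function names are hypothetical.

```python
from typing import List

Image = List[List[float]]


def conv_rows(img: Image, k: List[float]) -> Image:
    """Row pass: apply the 1 x len(k) kernel along every row (valid region only)."""
    r = len(k)
    return [[sum(k[j] * row[c + j] for j in range(r))
             for c in range(len(row) - r + 1)] for row in img]


def conv_cols(img: Image, k: List[float]) -> Image:
    """Column pass: apply the len(k) x 1 kernel down every column."""
    r = len(k)
    return [[sum(k[i] * img[y + i][c] for i in range(r))
             for c in range(len(img[0]))] for y in range(len(img) - r + 1)]


def conv_2d(img: Image, k2: Image) -> Image:
    """Direct pass with the full R x S kernel, used here only for comparison."""
    rows, cols = len(k2), len(k2[0])
    return [[sum(k2[i][j] * img[y + i][x + j]
                 for i in range(rows) for j in range(cols))
             for x in range(len(img[0]) - cols + 1)]
            for y in range(len(img) - rows + 1)]


g_hor = [0.0599, 0.2209, 0.3649, 0.2209, 0.0599]  # G_HOR from the text
# Outer product of the vertical and horizontal masks (G_VER assumed equal to G_HOR).
g_sk = [[v * h for h in g_hor] for v in g_hor]
# Hypothetical 8 x 8 test image.
img = [[float((7 * y + 3 * x) % 256) for x in range(8)] for y in range(8)]

two_pass = conv_cols(conv_rows(img, g_hor), g_hor)
direct = conv_2d(img, g_sk)
print(max(abs(a - b) for ra, rb in zip(two_pass, direct) for a, b in zip(ra, rb)))
```

The two results agree up to floating-point rounding. The kernel is symmetric, so the sliding dot product used here gives the same result as a convolution with a flipped kernel. The two-pass version uses 5 + 5 multiplications per output pixel instead of 25.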


Fig. 2. Block diagram for 1 × 5 convolution.
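The ROM partitioning used in the 1 × 5 datapath of Section 3.2 can be sketched in software as follows. This is a simplified illustration under stated assumptions: the five taps are split so that taps 0 to 3 address one 16-word table and tap 4 addresses a second, smaller table, which may differ from the exact partition in Fig. 2. The loop over bit planes is a software stand-in for logic that the hardware evaluates in parallel. Coefficients, pixel values and function names are hypothetical.

```python
from typing import List, Sequence


def build_rom(coeffs: Sequence[int]) -> List[int]:
    """Table holding every partial sum of the given coefficients (2**len words)."""
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def conv_1x5_partitioned(coeffs: Sequence[int], window: Sequence[int],
                         bits: int = 8) -> int:
    """1 x 5 DA inner product using two small tables instead of one 32-word table."""
    rom_lo = build_rom(coeffs[:4])  # addressed by one bit from each of taps 0..3
    rom_hi = build_rom(coeffs[4:])  # addressed by one bit from tap 4
    acc = 0
    for b in range(bits):
        addr_lo = sum(((window[i] >> b) & 1) << i for i in range(4))
        addr_hi = (window[4] >> b) & 1
        # The adder combines both table outputs before the shift by 2**b.
        acc += (rom_lo[addr_lo] + rom_hi[addr_hi]) << b
    return acc


coeffs = [2, 7, 12, 7, 2]      # hypothetical integer coefficients
window = [10, 20, 30, 40, 50]  # hypothetical 8-bit pixels held in the shift register
print(conv_1x5_partitioned(coeffs, window),
      sum(a * x for a, x in zip(coeffs, window)))
```

Splitting the address reduces the total table size from 32 words to 16 + 2 words in this sketch, at the cost of one extra addition per table read.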

Fig. 3. Block diagram for 5 × 5 generalized filter architecture.

4. Experimental results

To evaluate the performance of the proposed work, all the filters, including the existing ones, are implemented in the Matlab simulation environment. The hardware architecture is prototyped on a Virtex 6 (XC6VLX195T-2ff784) FPGA board and synthesized using the Xilinx ISE 14.2 synthesizer tool. A fully parameterized Very High-Speed Integrated Circuit Hardware Description Language (VHDL) model has been developed to realize the functionality of the design. In addition, the quality analysis is carried out on a set of benchmark images [13] using standard metrics. An FPGA is used as a hardware platform for implementing complex algorithms that exploit parallelism through the concurrent execution of multiple functional units [14,15]. An FPGA is organized as interconnected slice units and dedicated resources such as block RAM and DSP slices, which can achieve significant performance and power improvements over CPUs and GPUs. This section first presents the Matlab simulation results, followed by the FPGA resource utilization and performance analysis.

4.1. Error analysis

The objective of the fixed-point design is to minimize the error between the original values and the rounded values of the filter coefficients. Several design metrics are used to evaluate the efficacy of the proposed filters with respect to the existing ones.

4.1.1. Mean Error Distance (MED)

It is measured as the average of all the error distances (ED):


Fig. 4. Block diagram for 5 × 5 separable filter architecture.

$$MED = \frac{1}{N}\sum_{i=1}^{N} ED_i \tag{9}$$

where ED_i is the difference between the original and rounded versions of the i-th coefficient and N is the size of the kernel, N = n × n.

4.1.2. Mean Square Error (MSE)

It is measured as the average of the squared errors between the true values and the approximated values:

$$MSE = \frac{1}{N}\sum_{r=1}^{n}\sum_{s=1}^{n}\left[f(r,s) - \hat{f}(r,s)\right]^2 \tag{10}$$

where f(r,s) is the filtered image obtained by applying the conventional Gaussian mask of size N = n × n to an input image and \hat{f}(r,s) is the filtered image obtained by convolving the input image with the approximated Gaussian mask of the same size.

4.1.3. Peak Signal to Noise Ratio (PSNR)

It is one of the most commonly used metrics in image and video processing applications. It is defined as the ratio of the peak signal power to the noise power. PSNR, measured in decibels (dB), is given by:

$$PSNR = 20\log_{10}\left(\frac{I_{max}}{\sqrt{MSE}}\right) \tag{11}$$

where I_max is the maximum pixel value (255 for 8-bit unsigned) in the image and MSE is defined by Eq. (10).
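The metrics of Eqs. (9) to (11) can be written compactly as follows. This sketch makes two assumptions that are not stated in the text: the error distance ED_i is taken as an absolute difference, and the MSE is averaged over all pixels of the two images being compared. The function names and the small test data are hypothetical.

```python
import math
from typing import Sequence


def med(original: Sequence[float], rounded: Sequence[float]) -> float:
    """Eq. (9): mean of the absolute coefficient differences (assumed ED_i)."""
    return sum(abs(o - r) for o, r in zip(original, rounded)) / len(original)


def mse(f: Sequence[Sequence[float]], f_hat: Sequence[Sequence[float]]) -> float:
    """Eq. (10): mean squared difference between two equally sized images."""
    total = sum((a - b) ** 2 for ra, rb in zip(f, f_hat) for a, b in zip(ra, rb))
    return total / (len(f) * len(f[0]))


def psnr(f: Sequence[Sequence[float]], f_hat: Sequence[Sequence[float]],
         i_max: float = 255.0) -> float:
    """Eq. (11) in dB; returns infinity when the two images are identical."""
    err = mse(f, f_hat)
    return math.inf if err == 0 else 20.0 * math.log10(i_max / math.sqrt(err))


# First rows of G_org (Eq. (3)) and G_GK as coefficient lists.
row_org = [0.0030, 0.0133, 0.0219, 0.0133, 0.0030]
row_gk = [0.0029, 0.0132, 0.0218, 0.0132, 0.0029]
# Hypothetical 2 x 2 image pair.
f = [[10.0, 20.0], [30.0, 40.0]]
f_hat = [[12.0, 20.0], [30.0, 36.0]]
print(med(row_org, row_gk), mse(f, f_hat), psnr(f, f_hat))
```

The expression in psnr is equivalent to 10·log10(I_max² / MSE), the form in which PSNR is often quoted.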

4.1.4. Structural Similarity Index (SSIM)

This metric measures the structural similarity between the original image (f) and the approximated image (g) and is given by:

$$SSIM(f,g) = \frac{(2\mu_f\mu_g + C_1)(2\sigma_{fg} + C_2)}{(\mu_f^2 + \mu_g^2 + C_1)(\sigma_f^2 + \sigma_g^2 + C_2)} \tag{12}$$

where μ_f, μ_g, σ_f, σ_g and σ_fg are the local means, standard deviations and cross-covariance of the images, respectively. Here C_1 = (0.01 × L)² and C_2 = (0.03 × L)², where L is the specified dynamic range value.

The proposed filters have been tested on the image shown in Fig. 5, in which a Lena image is degraded by 2% salt-and-pepper noise. For our experiment, the noisy image is subjected to the original Gaussian filter of mask size 5 × 5. This filtered output is regarded as the ideal output. We have compared the outputs of the proposed and the existing filters with the output produced by the ideal filter using the metrics mentioned above. Table 1 summarizes the performance of the existing and proposed filters in terms of MED, MSE, PSNR and SSIM. The table shows that the generalized filter obtains a lower error rate (MED and MSE) and higher PSNR and SSIM values than the existing designs. A lower error rate implies that the filtering capability of the proposed filter is closer to that of the original filter, whereas higher PSNR and SSIM values imply better output image quality compared to the existing ones. Fig. 5 depicts the noisy Lena image filtered using the various approximated kernels. Although the separable filter has a higher error rate, the visual quality of its output image (Fig. 5(d)) does not deteriorate compared to that of the original filter (Fig. 5(b)).

The proposed 16-bit fixed-point Gaussian filter has been tested on the Cameraman (512 × 512), Lena (512 × 512), Mandrill (512 × 512), Peppers (512 × 512), Mountain (640 × 480), House (512 × 512) and Boat (512 × 512) images. Tables 2 and 3 show the PSNR and SSIM results, respectively, for the proposed and the original filters on these images. The measured PSNR and SSIM values for the proposed filters are close to those obtained by applying the original filter, as can be observed from the values of |ΔPSNR| and |ΔSSIM|, respectively. These results imply that the proposed filters do not deteriorate the visual quality of the output image.

4.2. FPGA results

Table 4 shows the FPGA area utilization and performance measures for GFA and SFA. Area utilization is measured in terms of (i) slice registers, (ii) slice LUTs and (iii) RAM buffer, whereas performance is measured in terms of (i) maximum clock frequency and (ii) processing time. The slice register and slice LUT figures give the utilized resources as a percentage of the available resources. The low area utilization (about 1%) of both designs implies that they can either be implemented on a low-cost device or support the processing of larger kernel sizes. All the memory units are inferred as block RAM units of the FPGA. SFA shows better area utilization than GFA (55% improvement in slice registers, 46% in slice LUTs and 80% in RAM buffer), whereas its performance is slightly reduced


Fig. 5. (a) Noisy input Lena image (salt & pepper noise with density 0.02), (b) filtering using original 5 × 5 Gaussian kernel, (c) filtering using proposed generalized 16-bit 5 × 5 kernel, (d) filtering using separable 16-bit kernel, (e) filtering using Garg et al. approximation [7], (f) filtering using Khorbotly approximation [10].

Table 1 Quality analysis for 5 × 5 kernel. Metrics

MED MSE PSNR SSIM

Garg et al. [7]

Khorbotly et al. [10]

6.17 61.17 36.17 0.977

Proposed Filter Generalized

Separable

2.88 27.17 54.19 0.996

16.71 182.27 24.17 0.978

4.04 36.77 43.94 0.998

Table 2
PSNR values for the test images, kernel size: 5 × 5.

Image       Original filter   Generalized filter   Separable filter   |ΔPSNR_G| (dB)   |ΔPSNR_S| (dB)
Cameraman   32.0171           31.2519              21.7704            0.7652           10.2467
Lena        31.9077           31.1632              21.8699            0.7445           10.0378
Mandril     28.0794           27.6945              21.0707            0.3849           6.6238
Peppers     35.9273           35.0384              24.9963            0.8889           10.931
Mountain    19.7623           19.6749              17.1080            0.0874           2.6543
House       37.1866           35.7237              24.2255            1.4629           12.9611
Boat        29.032            28.5675              21.1767            0.4645           7.8553

|ΔPSNR_G|: PSNR difference between the original filter and the generalized filter. |ΔPSNR_S|: PSNR difference between the original filter and the separable filter.

Table 3
Structural similarity (SSIM) values for the test images, kernel size: 5 × 5.

Image       Original filter   Generalized filter   Separable filter   |ΔSSIM_G|   |ΔSSIM_S|
Cameraman   0.9521            0.9497               0.9256             0.0024      0.0265
Lena        0.9024            0.9013               0.8738             0.0011      0.0286
Mandril     0.8623            0.8590               0.8011             0.0033      0.0612
Peppers     0.9616            0.9605               0.9400             0.0011      0.0216
Mountain    0.5906            0.5885               0.5449             0.0021      0.0457
House       0.9839            0.9836               0.9673             0.0003      0.0166
Boat        0.8341            0.8312               0.7974             0.0029      0.0367

|ΔSSIM_G|: SSIM difference between the original filter and the generalized filter. |ΔSSIM_S|: SSIM difference between the original filter and the separable filter.


Table 4
Area and performance comparison between GFA and SFA: image size 256 × 256, kernel size: 5 × 5. Device: xc6vlx240t-2ff784.

Design   Slice Reg.         Slice LUTs          RAM Buffer (Kbit)   Max. Clock Freq. (MHz)   Processing time (ms)
GFA      785/301440 (1%)    1239/150720 (1%)    62.5                185.632                  0.371
SFA      355/301440 (1%)    672/150720 (1%)     12.5                171.23                   0.394

Table 5
Area, delay, power, APP and ADP for GFA and SFA.

Design   Area (slices)   Delay (ns)   Power (W)   APP     ADP
GFA      413             8.2          0.242       99.94   3386.6
SFA      221             7.15         0.1         22.1    1580.1

Table 6 Comparative analysis with state-of-the-art hardware implementations for 5 × 5 kernel.

Fig. 6. Processing time for 5 × 5 kernel size.

compared to GFA. Table 5 compares the two designs in terms of area, delay, power, APP and ADP. The area metric is measured as the number of slices consumed. Power consumption is measured using the XPOWER analyzer at the implementation stage of the synthesizer tool. The XPOWER analyzer measures two types of power: (i) dynamic power and (ii) static power. Dynamic power measures the power consumed by the logic resources of the FPGA device and is the value shown in Table 5. Compared to GFA, SFA shows an improvement of 46% in area, 13% in delay and 59% in power consumption. It also improves APP and ADP by 78% and 53% respectively. Considering all the cases (Tables 4 and 5), it can be concluded that SFA achieves lower area utilization at the cost of a slight reduction in maximum clock frequency and a slight increase in processing time. Fig. 6 shows the processing time of both designs for varied input sizes and a fixed 5 × 5 kernel size. The time required to process a 2000 × 2000 image is about 21 ms for GFA and 23 ms for SFA, implying frame rates of 46 and 43 frames per second respectively. These results show that the performance of the architectures does not deteriorate when processing high-resolution images with a moderate-size kernel.
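The processing times and frame rates quoted here can be approximated from the clock frequency alone. The sketch below assumes a throughput of one output pixel per clock cycle and ignores pipeline fill latency and border handling. This throughput is not stated explicitly in the text, but it is consistent with the GFA and SFA rows of Table 6. The function names are hypothetical.

```python
def processing_time_ms(width: int, height: int, f_clk_mhz: float) -> float:
    """Estimated time to stream one frame at one pixel per clock cycle (assumed)."""
    return width * height / (f_clk_mhz * 1e6) * 1e3


def frames_per_second(width: int, height: int, f_clk_mhz: float) -> float:
    """Frame rate implied by the estimated per-frame processing time."""
    return 1e3 / processing_time_ms(width, height, f_clk_mhz)


# Maximum clock frequencies of the Virtex 6 implementations (Table 6).
for name, f_mhz in (("GFA", 185.6), ("SFA", 171.2)):
    print(name,
          round(processing_time_ms(2000, 2000, f_mhz), 1), "ms for 2000 x 2000,",
          round(frames_per_second(1920, 1080, f_mhz)), "frames/s for full HD")
```

Under this assumption the sketch gives about 21.6 ms and 23.4 ms for a 2000 × 2000 frame, close to the values quoted above, and 90 and 83 frames per second for full HD (1920 × 1080), matching the Virtex 6 rows of Table 6.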

4.3. Comparison with existing hardware implementation

For comparison, Table 6 summarizes the maximum speed (MHz) and frames per second of the proposed architectures and of existing works on Virtex 5 and Virtex 6 FPGA devices. The use of different devices in each design makes a comparison of performance and resource utilization difficult. Thus, to obtain a fair comparison, the proposed architecture is implemented with a device configuration, image resolution and kernel size similar to those of the existing works. The proposed architectures (both GFA and SFA) achieve the best maximum speed and frames per second compared to the existing ones.

Jaiswal et al. [6] proposed an energy-efficient approximate Gaussian kernel obtained by rounding the coefficients and expressing them as powers of two. This achieves lower area utilization but reduces the performance, since all the pixels within a given window must be accessed prior to processing. In Song et al. [16], the pixels considered for processing are stored in a line buffer, then multiplied by the Gaussian kernel coefficients and added to obtain the result for one 5 × 5 kernel. As in Jaiswal et al. [6], the pixels within a 5 × 5 window are stored before processing, resulting in reduced performance compared to the proposed works. Kalali et al. [8] proposed a hardware architecture based on pixel similarity within the neighborhood region, but its measured performance is below that achieved by the proposed methods.

Method               Device     Max. speed (MHz)   Frames per sec
Song et al. [16]     Virtex 5   141                50 full HD
Jaiswal et al. [6]   Virtex 6   159                Not reported
Kalali et al. [8]    Virtex 6   152                74 full HD
GFA                  Virtex 6   185.6              90 full HD
GFA                  Virtex 5   182.9              88 full HD
SFA                  Virtex 6   171.2              83 full HD
SFA                  Virtex 5   159.9              77 full HD

5. Conclusion

This study presents a 16-bit 5 × 5 fixed-point Gaussian kernel and its hardware implementation for the 2D convolution operation. The efficacy of the proposed filters is analyzed with different metrics, and in all cases the generalized filter achieves the best performance. Although the separable filter has a higher error rate than the generalized filter, its filtering capabilities are similar to those of the original filter. Both filters are tested on a set of noisy images, and both closely approximate the original filter. Two DA-based architectures, namely GFA and SFA, are proposed. Both architectures show low area utilization (about 1%) and high performance for processing high-resolution images. They also outperform the existing works in terms of maximum speed and frames per second. The advantages and disadvantages of both designs are also highlighted. It can be concluded that SFA shows improved area, delay, power, APP and ADP compared to GFA. GFA, on the other hand, achieves a better maximum clock frequency and processing time than SFA.


References

[1] Haddad Richard A, Akansu Ali N. A class of fast Gaussian binomial filters for speech and image processing. IEEE Trans Signal Process 1991;39(3):723–7.
[2] Canny John. A computational approach to edge detection. In: Readings in computer vision. Elsevier; 1987. p. 184–203.
[3] Marr David, Hildreth Ellen. Theory of edge detection. Proc R Soc Lond B 1980;207(1167):187–217.
[4] Mukherjee Debasish, Mukhopadhyay Susanta. Fast hardware architecture for 2D separable convolution operations. IEEE Trans Circ Syst II Exp Briefs 2018;65(12):2042–6.
[5] Hu Yusong, Prasanna Viktor K. Energy-efficient parameterized 2-D separable convolution on FPGA. In: 2014 International green computing conference (IGCC). IEEE; 2014. p. 1–10.
[6] Jaiswal Ankur, Garg Bharat, Kaushal Vikas, Sharma GK. SPAA-aware 2D Gaussian smoothing filter design using efficient approximation techniques. In: 2015 28th international conference on VLSI design (VLSID). IEEE; 2015. p. 333–8.
[7] Garg Bharat, Sharma GK. A quality-aware energy-scalable Gaussian smoothing filter for image processing applications. Microprocess Microsyst 2016;45:1–9.
[8] Kalali Ercan, Hamzaoglu Ilker. Low complexity 2D adaptive image processing algorithm and its hardware implementation. IEEE Trans Consum Electron 2017;63(3):277–84.
[9] Hsiao Pei-Yung, Chen Chia-Hsiung, Chou Shin-Shian, Li Le-Tien, Chen Sao-Jie. A parameterizable digital-approximated 2D Gaussian smoothing filter for edge detection in noisy image. In: 2006 IEEE International symposium on circuits and systems, 2006. ISCAS 2006. Proceedings. IEEE; 2006. p. 4.
[10] Khorbotly Sami, Hassan Firas. A modified approximation of 2D Gaussian smoothing filters for fixed-point platforms. In: 2011 IEEE 43rd Southeastern symposium on system theory (SSST). IEEE; 2011. p. 151–9.
[11] White Stanley A. Applications of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag 1989;6(3):4–19.
[12] Bishop David. Fixed point package users guide. Packages and bodies for the IEEE; 2010. p. 1076–2008.
[13] Benchmark input for image processing. http://www.imageprocessingplace.com.
[14] Chen Min, Zhang Yuanzhi, Chao Lu. Efficient architecture of variable size HEVC 2D-DCT for FPGA platforms. AEU-Int J Electron Commun 2017;73:1–8.
[15] Indrakanti Raghu, Haridas Nisha, Elias Elizabeth. High performance continuous variable bandwidth digital filter design for hearing aid application. AEU-Int J Electron Commun 2018;92:36–53.
[16] Song Sunmin, Lee SangJun, Ko Jae Pil, Jeon Jae Wook. A hardware architecture design for real-time Gaussian filter. In: 2014 IEEE international conference on industrial technology (ICIT). IEEE; 2014. p. 626–9.

Debasish Mukherjee is currently a Ph.D research scholar in the Department of Computer Science & Engineering, Indian Institute of Technology (erstwhile Indian School of Mines), Dhanbad, India. He did his B.Sc with honours in Computer Science from University of Calcutta, M.Sc in Computer Science from West Bengal State University and M.Tech in Computer Science & Engineering from West Bengal University of Technology in 2008, 2010 and 2013 respectively. His research interests include designing performance efficient architecture for image processing applications.

Susanta Mukhopadhyay did his B.Sc. with Honours in Physics from Presidency College, Calcutta, B.Tech and M.Tech in Radiophysics and Electronics from the University of Calcutta and Ph.D. in Image Processing from the Indian Statistical Institute, Calcutta in 1988, 1992, 1994 and 2003 respectively. During 2001-2003 and 2004-2007 he worked at the Sanford Burnham Prebys Medical Discovery Institute, La Jolla, California, USA as a postdoctoral researcher and at Nanyang Technological University, Singapore, as a research fellow, respectively. His research interests include image and video processing. He has published articles in internationally reputed journals such as IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems II: Express Briefs, Pattern Recognition and Signal Processing. He is currently working as Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology (erstwhile Indian School of Mines), Dhanbad, India.