GPU accelerated training of image convolution filter weights using genetic algorithms




Applied Soft Computing 30 (2015) 585–594


Devrim Akgün a, Pakize Erdoğmuş b

a Sakarya University, Computer Engineering Department, Serdivan, Sakarya 54187, Turkey
b Duzce University, Computer Engineering Department, Konuralp, Duzce 81620, Turkey

Article info

Article history: Received 15 October 2014; Received in revised form 22 January 2015; Accepted 8 February 2015; Available online 16 February 2015

Keywords: GPU computing; CUDA; Genetic algorithms; Image processing; Convolution filter

Abstract

Genetic algorithms (GA) provide an efficient method for training filters to find proper weights using a fitness function in which the input signal is filtered and compared with the desired output. In the case of image processing applications, the high computational cost of the fitness function, which is evaluated repeatedly, can cause the training time to be relatively long. In this study, a new algorithm, called the sub-image blocks based method, is developed on graphics processing units (GPU) to accelerate the training of mask weights using GA. The method is developed by discussing alternative design considerations, including the direct method (DM), the population-based method (PBM), the block-based method (BBM), and the sub-images-based method (SBM). A comparative performance evaluation of the introduced methods is presented using sequential and GPU implementations. Among the discussed designs, SBM provides the best performance by taking advantage of the block shared and thread local memories in the GPU. According to the execution duration and comparative acceleration graphs, SBM provides approximately 55–90 times acceleration on a GeForce GTX 660 over the sequential implementation on a 3.5 GHz processor.

1. Introduction

In image processing applications that use convolution filters, the weights determine the frequency response characteristics and, therefore, the effect of the filtering operation on the output image. Calculating the required image filter weights has been of interest in many image-processing applications, such as noise elimination, blur filtering, and edge detection. In addition to analytical methods, bio-inspired methods are applied widely to train filter parameters using a desired image [1–5]. Image filtering applications usually involve a large number of computing operations, especially in the case of weight training applications for determining the desired parameters. Parallel implementations provide a good means of accelerating numerical algorithms, and find widespread utilization in image processing applications [6–9]. Presently, GPU technology provides good hardware for realizing parallel algorithms to accelerate various computationally demanding numerical methods, including image-processing applications. GPU-based parallel implementations for accelerating GA [10–14] and their applications in scientific and engineering problems are of interest among researchers [15–17].

In the presented study, a new GPU-based method is developed to accelerate the training of image filter weights using GA. The method is developed by discussing alternative GPU implementation approaches in detail. Because evaluating the fitness function requires a relatively long time when compared with the other GA functions, the parallel genetic algorithm (PGA) is realized through a master–slave GA with a single population [18,19]. For this purpose, the fitness values are determined using the GPU to filter the input image rapidly with all of the individuals in the population that represent the filter mask weights. Then, the mean absolute error (MAE) between the original and filtered images is returned to GA as the fitness value for each set of mask weights in the population. This operation can be realized by calling the kernel separately for each set of weights in the population, or by calling the kernel one time for the entire population. These are realized as the direct and population-based methods in the presented paper. For DM, the fitness value for one set of mask weights in the population is computed on the GPU, so the sequential fitness function can simply be replaced with the GPU-based fitness function. According to this approach, the fitness values of the entire population are computed separately for each population member. However, this requires calling the GPU kernel a number of times equal to the population size, which causes significant overhead on the computing duration.

Fig. 1. An illustration of convolution filtering.

Because of the repetitive calls to the kernel, GPU utilization is inefficient. PBM computes all of the fitness values within one kernel call on the GPU, thus significantly reducing the overhead of calling the GPU kernel. Evaluating all population members for a pixel individually increases locality through use of the local memory in compute unified device architecture (CUDA) threads. Another population-based approach discussed in the paper is BBM, which utilizes block shared memory by allocating a mask-sized patch of pixels. All threads within the block filter the same pixel in order to compute MAE for each set of weights data in the population. The most effective approach, called SBM, allocates a sub-block of image data in the shared memory, instead of a mask-sized patch, in order to filter a block of pixels. In addition, SBM takes advantage of local memory for storing a selected set of mask weights from the population, which is used repeatedly according to the sub-block size. The performance of the developed methods is measured by comparing them with the sequential results. Execution times and acceleration gains are obtained for various image, filter mask, and population sizes. The GPU algorithms are realized using CUDA, which is provided by NVidia(R) to programmers in order to develop parallel algorithms that use the computing power of NVidia(R) GPUs. The rest of the manuscript is organized as follows. In the following section, we describe the problem and provide brief introductions to GA and CUDA. In Section 3, the proposed method and alternative programming approaches are discussed in detail. In Section 4, comparative experiment results that use the developed method and the other alternatives are given. In Section 5, we provide brief conclusions regarding the results.

2. Background

2.1. Convolution filtering

Image convolution is an extremely useful mathematical operation in image-processing applications, such as noise filtering, edge detection, and image sharpening or blurring [20]. During filtering, each output pixel is computed by multiplying the pixels within a neighbourhood by a set of filter weights that is sometimes called a convolution kernel. The formulation for implementing convolution is usually given by Eq. (1):

\[
\mathrm{OutImage}(x, y, i) = \sum_{m=-(M-1)/2}^{(M-1)/2} \; \sum_{n=-(M-1)/2}^{(M-1)/2} w_{m,n}\, \mathrm{InImage}(x - m,\, y - n,\, i)
\tag{1}
\]

where w is the weights matrix, sometimes called a mask, InImage is the 2D matrix that represents the image, and M is the filter mask size. Eq. (2) represents the mask, W, where the weights and mask size are determined by the filter characteristics. The filtered pixels are calculated by moving this mask over the input image according to the x and y coordinates, and over the red, green and blue (RGB) colour components according to i. During convolution filtering of a colour image, the same mask defined by Eq. (2) is applied to all RGB components. In the present paper, all experiments are realized with colour images.

\[
W =
\begin{bmatrix}
w_{11} & w_{12} & \cdots & w_{1M} \\
w_{21} & w_{22} & \cdots & w_{2M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{M1} & w_{M2} & \cdots & w_{MM}
\end{bmatrix}
\tag{2}
\]

An illustration of the convolution application for image noise filtering is shown in Fig. 1, where the original image, noisy image, convolution application, and filtered image are given. A filter mask is placed on a pixel for filtering, and then all the corresponding image pixels within the mask are multiplied by the weights of the image filter. These weights can be determined according to the nature of the filtering operation, either by analytical methods or by heuristic methods based on optimization formulations. According to Eq. (1), a filtered pixel is computed from the sum of these multiplications, and in most cases, threshold values are applied to ensure that the results fall within the range 0–255. This operation is repeated for each pixel of the input image in order to obtain each pixel of the output image.
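For readers who want a concrete reference point, the following host-side sketch implements Eq. (1) with clamping to the 0–255 range. It is illustrative only: the interleaved unsigned-char RGB layout, the border clamping, and the function name are assumptions, not taken from the paper.

// Illustrative sequential reference for Eq. (1): convolve an interleaved RGB
// image (unsigned char, 0-255) with an M x M mask and clamp the result.
void convolveRGB(const unsigned char *in, unsigned char *out,
                 int width, int height, const float *w, int M)
{
    int r = (M - 1) / 2;                        // mask radius
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            for (int i = 0; i < 3; ++i) {       // RGB components
                float sum = 0.0f;
                for (int m = -r; m <= r; ++m)
                    for (int n = -r; n <= r; ++n) {
                        // Clamp coordinates at the image borders.
                        int xs = x - m, ys = y - n;
                        if (xs < 0) xs = 0; else if (xs >= width)  xs = width - 1;
                        if (ys < 0) ys = 0; else if (ys >= height) ys = height - 1;
                        sum += w[(m + r) * M + (n + r)] * in[(ys * width + xs) * 3 + i];
                    }
                // Threshold so the filtered value stays in 0-255, as described in the text.
                if (sum < 0.0f) sum = 0.0f; else if (sum > 255.0f) sum = 255.0f;
                out[(y * width + x) * 3 + i] = (unsigned char)(sum + 0.5f);
            }
}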

2.2. Genetic algorithms and training convolution filter weights

GA is a heuristic method that provides a powerful means of finding global minima in a search domain [21]. GA uses a candidate solution set called a population, and GA operations are applied to this set in order to obtain the desired solution within a given search space. Initially, a population of a predetermined size, containing probable solutions to the given problem, is generated randomly. Each individual in the population represents a probable solution that describes the filter weights used in Eq. (1). This is followed by the computation of a fitness value for each individual in the candidate population. A selection operation describes the method for choosing two parents, using schemes such as the roulette wheel or tournament selection. Then, the genetic operations called crossover and mutation are applied to the selected individuals. Elitism selects a specified number of individuals with the best fitness values and adds them to the next generation. A stop criterion can be applied by verifying that the maximum number of iterations has been reached, or by breaking the loop when a tolerance value is met before the maximum iteration. In digital filter design, determining the filter weights is important in order to achieve the desired characteristics [22–25].


Fig. 2. Training filter weights with GA.

Fig. 3. Illustrations for the fitness value variations and mask weights variations (3 × 3 mask size).

Training filter weights is important in image filtering operations such as noise elimination, blur filtering, and edge detection. GA provides a good means of estimating filter weights in the form shown in Eq. (2). In the present application, the parameters searched by GA are the mask weights, as illustrated in Fig. 2. During the run, the filter weights are adjusted by GA according to the value of the fitness function. Based on the generated population, the fitness of each individual in the population is determined by the MAE between the computed and original images. The fitness of the ith population member can be calculated through the MAE of the filtered and original images, as given by Eq. (3) [26,27]. In this equation, X and Y describe the image size, and the result is divided by the number of pixels. Since the image is RGB, the number of pixels is also multiplied by 3 to include the red, green, and blue components. GA is used here to find the best weight values that minimize MAE. A minimization example of MAE used as the fitness value of GA is shown in Fig. 3a. As the generations progress, the fitness value is minimized by varying the mask weights, as shown in Fig. 3b. Depending on the filter dimension, the noise characteristics, and the GA parameters, after some iterations MAE begins to change slowly and the weights converge to some values.

\[
\mathrm{Fitness} = \frac{1}{3XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \sum_{i=0}^{2} \left| \mathrm{OrgImage}(x, y, i) - \mathrm{OutImage}(x, y, i) \right|
\tag{3}
\]

Algorithm 1 (Genetic algorithm.).
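The listing for Algorithm 1 is not reproduced in this version of the text; the following host-side outline is a minimal sketch of the GA steps described above. The genetic operators (evaluateMAE, selectParent, crossover, mutate) are caller-supplied hypothetical helpers, and the structure is an assumption rather than the authors' code.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Host-side sketch of the GA driver described in Section 2.2.
using Individual = std::vector<float>;                  // one set of mask weights

void runGA(std::vector<Individual> &pop, int generations, int eliteCount,
           const std::function<float(const Individual &)> &evaluateMAE,        // fitness, Eq. (3)
           const std::function<int(const std::vector<float> &)> &selectParent, // e.g. roulette wheel
           const std::function<Individual(const Individual &, const Individual &)> &crossover,
           const std::function<void(Individual &)> &mutate)
{
    std::vector<float> fitness(pop.size());
    for (int g = 0; g < generations; ++g) {
        // 1) Fitness: filter the noisy image with each individual's mask and
        //    measure the MAE against the original image (the costly step).
        for (std::size_t i = 0; i < pop.size(); ++i)
            fitness[i] = evaluateMAE(pop[i]);

        // 2) Elitism: keep the eliteCount individuals with the lowest MAE.
        std::vector<int> order(pop.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
        std::partial_sort(order.begin(), order.begin() + eliteCount, order.end(),
                          [&](int a, int b) { return fitness[a] < fitness[b]; });
        std::vector<Individual> next;
        for (int e = 0; e < eliteCount; ++e) next.push_back(pop[order[e]]);

        // 3) Selection, crossover and mutation produce the remaining children.
        while (next.size() < pop.size()) {
            Individual child = crossover(pop[selectParent(fitness)], pop[selectParent(fitness)]);
            mutate(child);
            next.push_back(child);
        }
        pop.swap(next);
        // 4) A tolerance test on the best fitness could break the loop early.
    }
}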

2.3. Parallel genetic algorithms

GA requires intense computation, and most of these operations can be realized independently. Computation of Eq. (3) requires filtering the input image and calculating MAE using a target image. When compared with the other GA functions, the fitness function computations require most of the computing time because of the intense image filtering operations. According to Algorithm 1, where the basic GA steps are provided, the fitness function is computed repeatedly as the iterations continue.

An experimental measurement of the execution time required by the sequential functions relative to the entire execution is shown in Fig. 4. Because only a small ratio of the GA functions is in the sequential region, master/slave parallelization has high efficiency potential for the image filter training application. For example, a 256 × 256 image requires 0.02% of the running duration of the algorithm to be spent on the sequential region, and 99.8% of it to be spent on the parallel region. The running duration of the sequential region further reduces to approximately 0.002% when the image size is increased to 1024 × 1024. Therefore, an efficient parallelization of the fitness function provides high parallelization potential. One of the most popular and efficient methods is the single-population master/slave parallelization of GA [18,19]. As depicted in Fig. 5, a master process executes GA and distributes the population to slave processes for computing the fitness function of each individual. Fitness values are returned to the master process for evaluation. In this process, all steps are the same as for sequential GA, with the exception of population evaluation. For image-processing training applications, most of the computation time is spent computing the fitness function. In the present paper, PGA is realized through master/slave parallelization, where only the fitness function is parallelized.

Fig. 4. The percentage of computations excluding the fitness function relative to the whole computing time (population size: 128, mask size: 3 × 3, number of generations: 100).

Fig. 5. A schematic view of single-population master–slave PGA.

2.4. CUDA architecture and utilization

The computational power of modern GPUs offers a considerable performance increase for parallelizable applications that demand intense arithmetic operations, such as the computation of the fitness function mentioned earlier. Simultaneous execution of a large number of lightweight threads on a GPU allows some parallelizable algorithms to be computed considerably faster than CPU-based implementations [28,29]. NVidia(R) provides programmers with CUDA in order to program NVidia(R) video cards for general purpose computing applications. A CUDA grid consists of a specified number of blocks and threads, as shown in Fig. 6. All threads within a block can access the Global, Constant, and Texture memories. Threads within a block have no access to the shared memory of another block. Threads in a block can be synchronized for some operations or for accurate read/write operations when requested. A thread can access shared memory considerably faster than global memory. CPU and GPU together form a powerful computing mechanism where the CPU executes sequential code and the GPU executes parallel code. In general, there are two main approaches to GPU utilization: pure device computing and hybrid computing, as depicted in Fig. 7a and b, respectively [31]. Both approaches involve allocating memory on the device according to the parameters of the algorithm, and transferring the necessary parameters. Pure device computing involves managing all stages of the algorithm on the device, and requires no interaction until all computations are complete. Hybrid computing involves using both CPU and GPU to realize the computations. Some data transfer operations between CPU and GPU memory are realized during the computations. In the present study, the fitness function computations are executed on the GPU device, and the other GA functions are computed on the host. Because the fitness function computations involve image filtering using all of the weights data in the population, the GPU side carries a relatively high computational load, as explained in the previous section.

Fig. 6. CUDA memory model [30].

Fig. 7. Pure device and hybrid utilization based GPU computing approaches.

3. GPU based acceleration approaches

In the presented study, GA is executed sequentially on the CPU, with the exception of the fitness function computations, as mentioned above. Because most of the computation time is spent determining the fitness function values for the population, the focus in the present paper is parallel computation of the fitness function when it is called in GA. Because of the GPU architecture explained in the previous section, the programming approach used to realize the parallel algorithm is important in GPU-based computing applications. One of the simplest ways to compute the fitness function on a GPU is to transfer the data to the GPU device and perform filtering for a selected population individual, as depicted in Fig. 8a. The other approaches discussed in this paper calculate all of the fitness values in one kernel call, as depicted in Fig. 8b.

Fig. 8. Implementation approaches using GPU.
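To make the hybrid host–device flow of Fig. 7b concrete, the sketch below shows one way the per-population fitness evaluation could be wrapped on the host: device buffers are allocated, the images and the population are copied in, a single kernel call accumulates the absolute errors (the Fig. 8b style), and the host finishes Eq. (3). The kernel name fitnessKernel, the buffer layouts, and the launch configuration are assumptions for illustration; any of the kernels sketched in Section 3 could play this role.

// Hybrid host/device flow (sketch under assumptions; the kernel is only declared here).
__global__ void fitnessKernel(const unsigned char *noisy, const unsigned char *orig,
                              const float *pop, float *absErr,
                              int width, int height, int maskSize, int popSize);

void evaluatePopulationOnGPU(const unsigned char *h_noisy, const unsigned char *h_orig,
                             const float *h_pop, float *h_fitness,
                             int width, int height, int maskSize, int popSize)
{
    size_t imgBytes = (size_t)width * height * 3;                     // interleaved RGB
    size_t popBytes = (size_t)popSize * maskSize * maskSize * sizeof(float);

    unsigned char *d_noisy, *d_orig; float *d_pop, *d_absErr;
    cudaMalloc(&d_noisy, imgBytes);
    cudaMalloc(&d_orig, imgBytes);
    cudaMalloc(&d_pop, popBytes);
    cudaMalloc(&d_absErr, popSize * sizeof(float));

    // Copy the images and the current population to the device.
    cudaMemcpy(d_noisy, h_noisy, imgBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_orig, h_orig, imgBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pop, h_pop, popBytes, cudaMemcpyHostToDevice);
    cudaMemset(d_absErr, 0, popSize * sizeof(float));

    // One kernel call evaluates the whole population.
    dim3 block(256);
    dim3 grid((width * height + block.x - 1) / block.x);
    fitnessKernel<<<grid, block>>>(d_noisy, d_orig, d_pop, d_absErr,
                                   width, height, maskSize, popSize);
    cudaDeviceSynchronize();

    // Accumulated absolute errors come back; the host divides by 3*X*Y, Eq. (3).
    cudaMemcpy(h_fitness, d_absErr, popSize * sizeof(float), cudaMemcpyDeviceToHost);
    for (int p = 0; p < popSize; ++p)
        h_fitness[p] /= 3.0f * width * height;

    cudaFree(d_noisy); cudaFree(d_orig); cudaFree(d_pop); cudaFree(d_absErr);
}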

3.1. Direct method (DM)

DM is implemented by directly replacing the fitness function with its GPU-based equivalent. In each GA iteration, the GPU kernel is run once for each individual that represents a set of mask weights, in order to compute the fitness values separately. The GPU implementation of DM is extremely simple, as shown by the CUDA kernel code provided in Algorithm 2. The pixel to be filtered by a CUDA thread is determined by threadIdx.x and blockIdx.x, and it is filtered using the mask weights selected from the population. The MAE value for the individual is determined by adding the absolute error values in the block shared memory using a block shared array in the kernel. This operation is realized through a parallel sum on the GPU when the fitness function is run. Following the synchronization of all threads in a block, the total block shared MAE value is determined and added to a global variable using the atomicFloatAdd function by all blocks, in order to obtain the MAE value for the entire image. Although simple, this design does not take advantage of either thread local memory or block shared memory for the image data. In addition, the design requires calling the GPU kernel a number of times equal to the population size, which is a costly operation. As discussed in the experiment results, using the GPU to compute the fitness value of each individual separately does not use the computational power efficiently.

Algorithm 2 (Direct method for CUDA kernel.).
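Because the original listing of Algorithm 2 is not included in this version of the text, the following CUDA kernel is a minimal sketch of a DM-style fitness evaluation under stated assumptions (256 threads per block, interleaved RGB data, border clamping). It follows the description above but is not the authors' code.

// Sketch of a DM-style kernel (one launch per individual): each thread filters
// one pixel with a single mask; per-block absolute errors are reduced in shared
// memory and then accumulated atomically into a global total.
__global__ void dmFitnessKernel(const unsigned char *noisy, const unsigned char *orig,
                                const float *mask, float *totalAbsErr,
                                int width, int height, int maskSize)
{
    __shared__ float blockErr[256];                     // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int pixel = blockIdx.x * blockDim.x + tid;          // pixel handled by this thread
    int r = (maskSize - 1) / 2;
    float err = 0.0f;

    if (pixel < width * height) {
        int x = pixel % width, y = pixel / width;
        for (int i = 0; i < 3; ++i) {                   // RGB components
            float sum = 0.0f;
            for (int m = -r; m <= r; ++m)
                for (int n = -r; n <= r; ++n) {
                    int xs = min(max(x - m, 0), width - 1);
                    int ys = min(max(y - n, 0), height - 1);
                    sum += mask[(m + r) * maskSize + (n + r)]
                         * noisy[(ys * width + xs) * 3 + i];
                }
            sum = fminf(fmaxf(sum, 0.0f), 255.0f);
            err += fabsf(sum - orig[(y * width + x) * 3 + i]);
        }
    }
    blockErr[tid] = err;
    __syncthreads();

    // Parallel sum of the block's absolute errors.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) blockErr[tid] += blockErr[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(totalAbsErr, blockErr[0]);  // accumulate over all blocks
}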

Fig. 9. Thread–memory interactions for PBM.


3.2. Population based method (PBM)

In this design consideration, once the threads obtain access to the pixels, all mask weights in the population are evaluated individually, and the results are written to an array in the global memory to be sent back to the host as fitness values. The main disadvantage of the previous algorithm is the cost of repeated kernel runs for each individual in the population in order to calculate the fitness values separately. In PBM, the image and population data are transferred to the global memory, and then all computations for each set of mask weights are realized within one kernel call. During execution, each thread accesses image and population data from the global memory, as shown in Fig. 9. Each thread filters an input pixel selected according to the block and thread indices, and contributes to the MAE of every member of the population. This eliminates repetitive GPU kernel calls for each population member. As shown by Algorithm 3, the GPU implementation of PBM differs slightly from the previous method. The pixel to be filtered and its neighbours are copied to the thread local memory, and the same pixel is filtered by all the weights using a for-loop. After computing the fitness value for the selected mask weights from the population, all threads within a block are synchronized to add all absolute errors. MAE is computed by adding all block shared MAE values using the atomic addition method. Following synchronization, all threads start a new filtering operation on the same pixel data using the next weights data in the population.

During computation, a thread accesses an input pixel and its neighbours in the global memory according to the mask size. In DM, each thread repeatedly reads a mask size of image data from the global memory depending on the population number. The same is also true for the image mask weights, as shown by Eq. (4), and therefore the total number of global memory reads is doubled.

\[
\left.
\begin{aligned}
\mathrm{ImagMemReads} &= \mathrm{Width} \times \mathrm{Height} \times \mathrm{MaskSize}^2 \times \mathrm{PopSize} \\
\mathrm{PopMemReads} &= \mathrm{Width} \times \mathrm{Height} \times \mathrm{MaskSize}^2 \times \mathrm{PopSize} \\
\mathrm{TotalMemReads} &= \mathrm{Width} \times \mathrm{Height} \times \mathrm{MaskSize}^2 \times \mathrm{PopSize} \times 2
\end{aligned}
\right\} \tag{4}
\]

PBM not only eliminates repetitive kernel calls, but also reduces the total number of global memory accesses significantly, as shown by Eq. (5). Because the image data is copied to the thread local memory, global memory access is realized once instead of population-size times. On the other hand, the number of reads for the population remains the same as in DM.

\[
\left.
\begin{aligned}
\mathrm{ImagMemReads} &= \mathrm{Width} \times \mathrm{Height} \times \mathrm{MaskSize}^2 \\
\mathrm{PopMemReads} &= \mathrm{Width} \times \mathrm{Height} \times \mathrm{MaskSize}^2 \times \mathrm{PopSize} \\
\mathrm{TotalMemReads} &= \mathrm{Width} \times \mathrm{Height} \times \mathrm{MaskSize}^2 \times (\mathrm{PopSize} + 1)
\end{aligned}
\right\} \tag{5}
\]

As shown by the experiment results, keeping the image data in the thread local memory produces better results than the direct approach. On the other hand, an alternative approach can be used to improve the results further through the block shared memory.

Algorithm 3 (Population based method for CUDA kernel.).
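The listing of Algorithm 3 is likewise missing from this version; the sketch below illustrates a PBM-style kernel under the same assumptions as before (one thread per pixel, a thread-local copy of the neighbourhood, a for-loop over the population, block-wide summation and atomic accumulation). The constant MAX_MASK, the 256-thread block size, and the data layout are assumptions, not taken from the paper.

// Sketch of a PBM-style kernel: one launch evaluates the whole population.
#define MAX_MASK 7

__global__ void pbmFitnessKernel(const unsigned char *noisy, const unsigned char *orig,
                                 const float *pop,      // popSize * maskSize^2 weights
                                 float *absErr,         // one accumulator per individual
                                 int width, int height, int maskSize, int popSize)
{
    __shared__ float blockErr[256];                     // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int pixel = blockIdx.x * blockDim.x + tid;
    int r = (maskSize - 1) / 2;

    // Copy this thread's pixel neighbourhood to thread-local storage once.
    float neigh[MAX_MASK * MAX_MASK * 3];
    float centre[3] = {0.0f, 0.0f, 0.0f};
    if (pixel < width * height) {
        int x = pixel % width, y = pixel / width;
        for (int i = 0; i < 3; ++i)
            centre[i] = orig[(y * width + x) * 3 + i];
        for (int m = -r; m <= r; ++m)
            for (int n = -r; n <= r; ++n) {
                int xs = min(max(x - m, 0), width - 1);
                int ys = min(max(y - n, 0), height - 1);
                for (int i = 0; i < 3; ++i)
                    neigh[((m + r) * maskSize + (n + r)) * 3 + i] =
                        noisy[(ys * width + xs) * 3 + i];
            }
    }

    // Filter the same pixel with every mask in the population.
    for (int p = 0; p < popSize; ++p) {
        float err = 0.0f;
        if (pixel < width * height) {
            const float *w = &pop[p * maskSize * maskSize];
            for (int i = 0; i < 3; ++i) {
                float sum = 0.0f;
                for (int k = 0; k < maskSize * maskSize; ++k)
                    sum += w[k] * neigh[k * 3 + i];
                sum = fminf(fmaxf(sum, 0.0f), 255.0f);
                err += fabsf(sum - centre[i]);
            }
        }
        blockErr[tid] = err;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // block-wide sum
            if (tid < s) blockErr[tid] += blockErr[tid + s];
            __syncthreads();
        }
        if (tid == 0) atomicAdd(&absErr[p], blockErr[0]);
        __syncthreads();                                // blockErr is reused next iteration
    }
}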

3.3. Block based method (BBM)

Because accessing the block shared memory is considerably faster than accessing the global memory in GPU, using shared memory for some repetitive operations can significantly boost performance depending on the algorithm. In this method, we filter a pixel and its neighbouring pixels by all the individuals in the population using the block shared memory. For this design consideration, block shared memory is allocated for the image data, and each thread in the block is used to filter the same pixel using a different set of mask weights in the population. As shown in Fig. 10, each thread filters the same pixel selected according to the thread number in a block. Therefore, the block size is set to the population number. The CUDA implementation of the GPU kernel is shown by Algorithm 4. In addition to the block shared MAE array defined in the previous methods, a shared array of mask size is allocated for the image data. After the image data are transferred from the global to the block shared memory, the threads are synchronized to guarantee that the copy operations are completed, and then the pixel is filtered by all threads in the block. Because the access time is extremely short when compared to the global memory, utilizing the block shared memory improves performance to some extent. Because the mask weights are used only once, they are read directly from the global memory. Hence, this approach has the same number of global memory accesses as the previous approach, described by Eq. (5).

Fig. 10. Thread–memory interactions for BBM.

Algorithm 4 (Block based method for CUDA kernel.).
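As with the other listings, Algorithm 4 is not reproduced here; the following sketch shows one possible BBM-style kernel in which one thread block is assigned to a pixel, blockDim.x equals the population size, and the pixel's neighbourhood is staged in block shared memory. The names, MAX_MASK, and the data layout are illustrative assumptions.

// Sketch of a BBM-style kernel: one thread block per pixel, one thread per individual.
#define MAX_MASK 7

__global__ void bbmFitnessKernel(const unsigned char *noisy, const unsigned char *orig,
                                 const float *pop, float *absErr,
                                 int width, int height, int maskSize)
{
    __shared__ float neigh[MAX_MASK * MAX_MASK * 3];    // shared neighbourhood of one pixel
    __shared__ float centre[3];

    int pixel = blockIdx.x;                             // one block per image pixel
    int p = threadIdx.x;                                // one thread per individual
    int x = pixel % width, y = pixel / width;
    int r = (maskSize - 1) / 2;

    // A few threads cooperatively load the neighbourhood into shared memory.
    for (int k = p; k < maskSize * maskSize; k += blockDim.x) {
        int m = k / maskSize - r, n = k % maskSize - r;
        int xs = min(max(x - m, 0), width - 1);
        int ys = min(max(y - n, 0), height - 1);
        for (int i = 0; i < 3; ++i)
            neigh[k * 3 + i] = noisy[(ys * width + xs) * 3 + i];
    }
    if (p < 3) centre[p] = orig[(y * width + x) * 3 + p];
    __syncthreads();                                    // the patch is ready for everyone

    // Each thread filters the same pixel with its own mask weights.
    const float *w = &pop[p * maskSize * maskSize];
    float err = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < maskSize * maskSize; ++k)
            sum += w[k] * neigh[k * 3 + i];
        sum = fminf(fmaxf(sum, 0.0f), 255.0f);
        err += fabsf(sum - centre[i]);
    }
    atomicAdd(&absErr[p], err);                         // accumulate over all pixels
}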

3.4. Sub-images based method (SBM)

SBM is the best-performing method developed in this paper, as shown through the experiment results. The previous approach utilizes shared memory to store a mask size of pixels so that a single pixel can be filtered by all threads in the block, and all population members are read from the global memory because each is used only once to filter a pixel. The difference between SBM and the other methods is that SBM moves an image sub-block to the shared memory instead of a mask size of pixels. In addition, the filter weights are transferred to an array within the thread local memory, and they are used to filter the pixels of the sub-image block. In this method, the input image is divided into small sub-images, and each sub-image is managed within a thread block, as shown in Fig. 11. Once an image block is transferred to the shared memory, the threads within the block filter the same sub-image that resides in the shared memory. As mentioned for Eq. (1), filtering each pixel requires using its neighbouring pixels according to the mask size. Once a block of memory is transferred to the shared memory, the pixel to be filtered and its neighbouring pixels can be read significantly faster than from the global memory. Therefore, a number of repeated accesses to the global memory are reduced to one access, which is realized before filtering starts. In all threads, the sub-image pixels are read repeatedly as the mask is moved to other pixels. In addition, storing the mask in a thread local array reduces the global memory accesses significantly. As shown by the experiment results, this helps to gain significant acceleration over the previous approaches.

The pseudocode for the SBM kernel implementation is provided by Algorithm 5, where the block number determines the sub-image location, and the thread number determines the mask data selected for filtering from the candidate population present in the global memory. Hence, before the GPU kernel is run, the number of blocks and threads is determined according to the number of sub-images and the population size, respectively. According to the block number, the selected parts of the input and output images are transferred to the block shared memory using shared variables. The block size should not be selected above a certain size as determined by the GPU architecture. In the present experiments, the block size is set to 16 × 16, which means filtering 256 image pixels. The mask data are also copied from the population array to a thread local array, whose size is determined by the mask size. Before filtering starts, all threads are synchronized to guarantee that all data are copied to the specified variables. Then, the sub-image is filtered with the selected mask data in each thread without reading data from the global memory. The filtered pixels are subtracted from the original sub-image in order to compute MAE. This error is added to the global variable to determine the entire absolute error.

Algorithm 5 (Sub-image based method for CUDA kernel.).
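The listing of Algorithm 5 is also missing from this version of the text; the sketch below illustrates an SBM-style kernel along the lines described above: one block per 16 × 16 sub-image, the haloed sub-image and the matching target pixels staged in shared memory, and each thread filtering the whole sub-image with its own thread-local copy of one individual's mask. SUB, MAX_MASK, the layouts, and the divisibility of the image size by 16 are assumptions.

// Sketch of an SBM-style kernel: one thread block per sub-image, one thread per individual.
#define SUB 16
#define MAX_MASK 7
#define TILE (SUB + MAX_MASK - 1)

__global__ void sbmFitnessKernel(const unsigned char *noisy, const unsigned char *orig,
                                 const float *pop, float *absErr,
                                 int width, int height, int maskSize, int popSize)
{
    __shared__ float tile[TILE * TILE * 3];       // sub-image plus halo (noisy input)
    __shared__ float target[SUB * SUB * 3];       // matching pixels of the original image

    int p = threadIdx.x;                          // one thread per individual
    int bx = (blockIdx.x % (width / SUB)) * SUB;  // top-left corner of this sub-image
    int by = (blockIdx.x / (width / SUB)) * SUB;
    int r = (maskSize - 1) / 2;
    int tileSize = SUB + 2 * r;

    // Cooperative copy of the haloed sub-image and of the target sub-image.
    for (int k = p; k < tileSize * tileSize; k += blockDim.x) {
        int xs = min(max(bx + k % tileSize - r, 0), width - 1);
        int ys = min(max(by + k / tileSize - r, 0), height - 1);
        for (int i = 0; i < 3; ++i)
            tile[k * 3 + i] = noisy[(ys * width + xs) * 3 + i];
    }
    for (int k = p; k < SUB * SUB; k += blockDim.x)
        for (int i = 0; i < 3; ++i)
            target[k * 3 + i] = orig[((by + k / SUB) * width + bx + k % SUB) * 3 + i];

    // Each thread keeps its own mask weights in thread-local storage.
    float w[MAX_MASK * MAX_MASK];
    for (int k = 0; k < maskSize * maskSize; ++k)
        w[k] = pop[p * maskSize * maskSize + k];
    __syncthreads();                              // all shared data is in place

    // Filter every pixel of the sub-image with this thread's mask.
    float err = 0.0f;
    for (int py = 0; py < SUB; ++py)
        for (int px = 0; px < SUB; ++px)
            for (int i = 0; i < 3; ++i) {
                float sum = 0.0f;
                for (int m = 0; m < maskSize; ++m)
                    for (int n = 0; n < maskSize; ++n)
                        sum += w[m * maskSize + n]
                             * tile[((py + m) * tileSize + (px + n)) * 3 + i];
                sum = fminf(fmaxf(sum, 0.0f), 255.0f);
                err += fabsf(sum - target[(py * SUB + px) * 3 + i]);
            }
    atomicAdd(&absErr[p], err);                   // accumulate over all sub-images
}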

The number of global memory reads is reduced significantly when compared with the previous methods, as given by Eq. (6). Here, Nsub denotes the number of sub-blocks, and SBlockSize denotes the height and width of a sub-block. Because of the convolution operation, the sub-block size is increased according to the mask size in order to realize filtering for the edge pixels. In the present case, in order to filter a sub-block of size 16 × 16, the size of the sub-block copied to the shared memory, including the additional edges, is increased to 18 × 18 for a mask size of 3 × 3, 20 × 20 for a mask size of 5 × 5, and so on.

\[
\left.
\begin{aligned}
\mathrm{ImagGMemReads} &= N_{sub} \times (\mathrm{SBlockSize} + \mathrm{MaskSize} - 1)^2 \\
\mathrm{PopMemReads} &= N_{sub} \times \mathrm{PopSize} \times \mathrm{MaskSize}^2 \\
\mathrm{TotalMemReads} &= N_{sub} \times \left( (\mathrm{SBlockSize} + \mathrm{MaskSize} - 1)^2 + \mathrm{PopSize} \times \mathrm{MaskSize}^2 \right)
\end{aligned}
\right\} \tag{6}
\]

Table 1 lists numerical examples that illustrate the memory reads from the global memory for each method. A comparative inspection of the number of reads for all image dimensions shows that SBM has the least number of accesses to the global memory.

Table 1
Comparison of global memory reads.

Image size     DM          PBM         BBM         SBM
256 × 256      7.5497e+7   5.8982e+5   5.8982e+5   8.2944e+4
512 × 512      3.0198e+8   2.3592e+6   2.3592e+6   3.3177e+5
1024 × 1024    1.2080e+9   9.4371e+6   9.4371e+6   1.3271e+6

Fig. 11. Thread–memory interactions for SBM.
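As a worked check of Eqs. (4)–(6), the short host program below recomputes the image-read terms for the three image sizes. With a 3 × 3 mask, 16 × 16 sub-blocks, and a population of 128 for the DM case (a parameter choice inferred from the numbers, not stated in the table caption), it reproduces the values listed in Table 1.

// Worked check of the image-read terms of Eqs. (4)-(6) against Table 1.
#include <cstdio>

int main()
{
    const long long maskSize = 3, popSize = 128, sub = 16;   // assumed parameters
    const long long dims[3] = {256, 512, 1024};
    for (int i = 0; i < 3; ++i) {
        long long w = dims[i], h = dims[i];
        long long dm   = w * h * maskSize * maskSize * popSize;   // Eq. (4), image term
        long long pbm  = w * h * maskSize * maskSize;             // Eq. (5), image term (BBM identical)
        long long nsub = (w / sub) * (h / sub);
        long long sbm  = nsub * (sub + maskSize - 1) * (sub + maskSize - 1); // Eq. (6), image term
        printf("%4lld x %-4lld  DM %.4e  PBM/BBM %.4e  SBM %.4e\n",
               w, h, (double)dm, (double)pbm, (double)sbm);
    }
    return 0;
}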

4. Experimental results

The performance of the GPU-based and sequential algorithms was measured on a host computer with an AMD FX 8320 CPU running at 3.5 GHz, and a GeForce GTX 660 GPU that supports compute capability 3.0. The memory size of the host computer is 4 GB, and the operating system is 64-bit Windows 7. The code that runs on the host was written in Microsoft(R) VC++ using the CUDA library [30]. The performance of the developed methods is evaluated by comparing their execution times to the sequential implementation. In the measurement of GPU running durations, all GPU operations, including memory allocations and data transfers between device and host, are included. Running durations for the sequential and developed methods are measured in seconds for the sample test cases, where the mask sizes are 3 × 3, 5 × 5, and 7 × 7; the population sizes are set to 128, 256, and 512; and the test image sizes are selected as 256 × 256, 512 × 512, and 1024 × 1024.

Table 2 lists the sample quality results that use 3 × 3, 5 × 5, and 7 × 7 masks in terms of MAE for the various test images shown in Fig. 12. As the mask size increases, the obtained MAE also increases, because the number of iterations is fixed to 100 and the population size is fixed to 512. Because GA works in a stochastic manner, an average of 50 runs is provided for each test image. The differences between images are caused mainly by the image content. In order to obtain better results, the number of iterations and the population size can be increased further. The test images used in the performance measurements do not affect the acceleration performance because the number of iterations is fixed to a constant number in order to see the real contribution of the GPU device. Therefore, the acceleration performance measured in this paper depends on the image size in addition to the other GA parameters. The elapsed times are real times measured by including all the steps of Algorithm 1, which describes the GA steps. For this purpose, the current system time is obtained at the start and end of Algorithm 1 to compute the elapsed time. Therefore, in the GPU case, these durations include all GPU operations, such as initialization and data transfer.

The experiment results that show the running times in seconds for the sequential and tested methods are listed in Table 3. As the image and population sizes increase, the running times increase significantly because of the cost of evaluating the fitness function. When compared with the sequential running times, the GPU-based methods show a significant reduction in the running times. On the other hand, DM performs worse than PBM because of the cost of repeated kernel calls. PBM, where the fitness values for the entire population are computed in one GPU kernel call, also shows significantly different running durations because of the design approach explained in the previous section. Whereas BBM performs better than PBM, SBM, where a block of the image is stored within the shared memory, performs the best among all tested methods. Based on these results, acceleration gains can be determined by computing the ratio of the sequential algorithm running duration to the GPU-based algorithm running duration, as shown in Fig. 13. Whereas DM provides approximately three to four times gain, PBM provides approximately nine to ten times acceleration, as shown in Fig. 13a and b, respectively. As shown in Fig. 13c, BBM, where a block of threads is allocated to filter a pixel by all the mask data in the population, performs better than PBM by providing 17–21 times acceleration. As shown in Fig. 13d, the best acceleration is provided by SBM, where the acceleration varies approximately between 55 and 90 times. More detailed results for SBM are shown in Fig. 14, where the mask size is set to 3 × 3, 5 × 5, and 7 × 7, respectively. In general, increasing the mask size has a degrading effect on the acceleration gain. This is mainly because of the growth of the sub-block edges, which causes additional reads from the global memory. When the results are evaluated according to image size, increasing the image size has a raising effect on the acceleration gain. While increasing the population size increases performance for an image size of 256 × 256, it does not increase performance at the same rate for an image size of 1024 × 1024.

Table 2
Obtained MAE after 100 iterations for the test images (noisy image MAE: 26.0, population size: 512, image size: 512 × 512).

Mask size   Baboon   Peppers   Gold Hill   Pens
3 × 3       14.1     9.40      10.10       8.80
5 × 5       17.90    9.90      11.90       10.50
7 × 7       20.10    14.6      15.60       14.50

Table 3
Elapsed real times for the tested methods in seconds (mask size: 3 × 3, iterations: 100).

Image dimensions   Population size   Sequential   DM       PBM      BBM      SBM
256 × 256          128               44.02        14.08    4.09     2.52     0.76
256 × 256          256               88.35        27.35    9.41     4.52     1.224
256 × 256          512               176.7        55.89    18.53    9.57     2.259
512 × 512          128               176.69       48.46    17.66    9.61     2.157
512 × 512          256               359.75       96.28    34.89    17.37    4.091
512 × 512          512               720.33       192.24   69.35    37.44    7.944
1024 × 1024        128               714.88       186.15   69.15    38.58    8.319
1024 × 1024        256               1443.02      373.44   137.4    68.68    15.52
1024 × 1024        512               2893.06      746.72   274.15   148.37   30.7

Fig. 12. Test images used in the experiments. Left to right: peppers, gold hill, pens and baboon.

Fig. 13. Speed-up measurements versus image dimensions for each method.

Fig. 14. Speed-up measurements for the sub-image block based method.
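For readers who want to relate Table 3 to the speed-up curves of Fig. 13d, the small program below divides the sequential times by the SBM times; the resulting ratios run from roughly 58 to 94, which is of the same order as the approximately 55–90 times range quoted in the abstract. The values are copied directly from Table 3, and the calculation is illustrative only.

// Speed-up ratios implied by Table 3 for SBM (sequential time / SBM time).
#include <cstdio>

int main()
{
    const double seq[9] = {44.02, 88.35, 176.7, 176.69, 359.75, 720.33,
                           714.88, 1443.02, 2893.06};
    const double sbm[9] = {0.76, 1.224, 2.259, 2.157, 4.091, 7.944,
                           8.319, 15.52, 30.7};
    for (int i = 0; i < 9; ++i)
        printf("case %d: speed-up = %.1f\n", i + 1, seq[i] / sbm[i]);
    return 0;
}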

5. Conclusions

GPU-based implementation was shown to be extremely useful in accelerating GA-based optimization of mask weights. Sequential execution of GA-based image mask training is a computationally heavy operation. In particular, evaluating the fitness function, because of the image filtering operations for each set of mask data in the population, requires the most computation time. For this purpose, GPU-based acceleration provides a good method for obtaining reductions in the computation duration. Compared with the CPU architecture, the GPU has a large number of thread blocks that can be utilized to realize general purpose computing operations. The methods introduced for GPU-based computation of the fitness function provided significant acceleration when compared to sequential execution. In the experiments, the sequential algorithm and the GPU-based algorithms DM, PBM, BBM, and SBM were compared. DM, where there is a call to the GPU kernel for each set of mask data in the population in order to compute the fitness values, resulted in significant overhead, as shown by the experiment results. Although GPU acceleration using DM showed poor results, GPU acceleration using the other population-based approaches, where the fitness values are computed in one kernel call, showed a significant reduction in computing duration. PBM simply filters the same pixel for all mask data in the population. During this operation, the mask size of image data used in the filtering is maintained in the thread memory. For BBM, a mask size of image data is maintained in the shared memory and all the threads in the block filter the same pixel using one set of mask data in the population. Although this approach achieved better acceleration than PBM, SBM, developed in this paper, achieved the best acceleration. SBM provides a significantly efficient method by taking advantage of the block shared and thread local memories in the GPU. SBM-based GPU acceleration of GA-based image mask training provided approximately 55–90 times acceleration over the sequential implementation using a 3.5 GHz CPU and a GeForce GTX 660 GPU.

References

[1] K. Sri Rama Krishna, et al., Genetic algorithm processor for image noise filtering using evolvable hardware, Int. J. Image Process. 4 (3) (2010) 240–250.
[2] S.M. Bhandarkar, Y. Zhang, W.D. Potter, An edge detection technique using genetic algorithm-based optimization, Pattern Recognit. 27 (9) (1994) 1159–1180.
[3] M. Wang, S. Yuan, A hybrid genetic algorithm-based edge detection method for SAR image, in: 2005 IEEE International Radar Conference, IEEE, 2005.
[4] T.F. Cootes, et al., Active shape models—their training and application, Comput. Vis. Image Understand. 61 (1) (1995) 38–59.
[5] M. Paulinas, A. Ušinskas, A survey of genetic algorithms applications for image enhancement and segmentation, Inf. Technol. Control 36 (3) (2007) 278–284.
[6] D. Akgün, A practical parallel implementation for TDLMS image filter on multi-core processor, J. Real-Time Image Process. (2014), http://dx.doi.org/10.1007/s11554-014-0397-y.
[7] D. Akgün, Performance evaluations for parallel image filter on multi-core computer using Java Threads, Int. J. Comput. Appl. 74 (11) (2013) 13–19.
[8] J.D. Owens, et al., A survey of general purpose computation on graphics hardware, Comput. Graphics Forum 26 (1) (2007).
[9] O. Schenk, M. Christen, H. Burkhart, Algorithmic performance studies on graphics processing units, J. Parallel Distrib. Comput. 68 (10) (2008) 1360–1369.
[10] J.-m. Li, Z.-x. Chi, D.-l. Wan, Parallel genetic algorithm based on fine-grained model with GPU-accelerated, Control Decis. 23 (6) (2008) 697–704.

[11] P. Pospíchal, J. Schwarz, J. Jaros, Parallel genetic algorithm on the CUDA architecture, Appl. Evol. Comput. (2010) 442–451.
[12] M.L. Wong, T.T. Wong, Implementation of parallel genetic algorithms on graphics processing units, Intell. Evol. Syst. (2009) 197–216.
[13] M. Pedemonte, E. Alba, F. Luna, Bitwise operations for GPU implementation of genetic algorithms, in: Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, ACM, 2011.
[14] D. Robilliard, V. Marion-Poty, C. Fonlupt, Population parallel GP on the G80 GPU, Genet. Program. (2008) 98–109.
[15] A. Munawar, et al., Advanced genetic algorithm to solve MINLP problems over GPU, in: 2011 IEEE Congress on Evolutionary Computation (CEC), IEEE, 2011.
[16] W. Zhu, Y. Li, GPU-accelerated differential evolutionary Markov chain Monte Carlo method for multi-objective optimization over continuous space, in: Proceedings of the 2nd Workshop on Bio-inspired Algorithms for Distributed Systems, ACM, 2010.
[17] N. Fujimoto, S. Tsutsui, A highly-parallel TSP solver for a GPU computing platform, Numer. Methods Appl. (2011) 264–271.
[18] E. Cantu-Paz, D.E. Goldberg, Efficient parallel genetic algorithms: theory and practice, Comput. Methods Appl. Mech. Eng. 186 (2) (2000) 221–238.
[19] E. Alba, J.M. Troya, A survey of parallel distributed genetic algorithms, Complexity 4 (4) (1999) 31–52.
[20] R.C. Gonzalez, R.E. Woods, S.L. Eddins, Digital Image Processing Using MATLAB, Pearson Prentice Hall, Upper Saddle River, NJ, 2004.
[21] D.E. Goldberg, Genetic Algorithms, Pearson Education India, 2006.
[22] J. Van de Vegte, Fundamentals of Digital Signal Processing, 2005.
[23] N. Karaboga, A. Kalinli, D. Karaboga, Designing digital IIR filters using ant colony optimisation algorithm, Eng. Appl. Artif. Intell. 17 (3) (2004) 301–309.
[24] H.-C. Lu, S.-T. Tzeng, Design of two-dimensional FIR digital filters for sampling structure conversion by genetic algorithm approach, Signal Process. 80 (8) (2000) 1445–1458.
[25] S.-T. Tzeng, Genetic algorithm approach for designing 2-D FIR digital filters with 2-D symmetric properties, Signal Process. 84 (10) (2004) 1883–1893.
[26] W.-D. Chang, Coefficient estimation of IIR filter by a multiple crossover genetic algorithm, Comput. Math. Appl. 51 (9) (2006) 1437–1444.
[27] F. Toyama, K. Shoji, J. Miyamichi, A new fitness function for shape detection using genetic algorithm, Syst. Comput. Jpn. 32 (9) (2001) 54–60.
[28] D. Akgün, Ü. Sakoğlu, M. Mete, J. Esquivel, B. Adinoff, GPU-accelerated dynamic functional connectivity analysis for functional MRI data using OpenCL, in: Electro/Information Technology (EIT), 2014 IEEE International Conference on, IEEE, 2014, pp. 479–484.
[29] S. Che, A performance study of general-purpose applications on graphics processors using CUDA, J. Parallel Distrib. Comput. 68 (10) (2008) 1370–1380.
[30] NVIDIA CUDA Compute Unified Device Architecture Programming Guide: Version 1.1, 2013, http://www.geforce.com
[31] D.S. Banerjee, P. Sakurikar, K. Kothapalli, Comparison sorting on hybrid multicore architectures for fixed and variable length keys, Int. J. High Perform. Comput. Appl. (2014), http://dx.doi.org/10.1177/1094342014526906.