
Accelerated CPU-GPUs Implementations for Quaternion Polar Harmonic Transform of Color Images

Ahmad Salah (a,b), Kenli Li (a), Khalid M. Hosny (b), Mohamed Darwish (c), and Qi Tian (d)

(a) College of Computer Science and Electrical Engineering, Hunan University, Changsha 410082, China (e-mail: {ahmad, lkl}@hnu.edu.cn)
(b) Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt (e-mail: k hosny@yahoo.com)
(c) Faculty of Science, Assiut University, Assiut, Egypt
(d) The University of Texas at San Antonio, USA (e-mail: [email protected])

Abstract

Image moments are used to capture image features. Moments are successfully used in object description, recognition, and other applications. However, image representation and recognition using quaternion moments are compute-intensive processes. This obstacle makes this type of moment unsuitable for real-time or large-scale applications, despite its high accuracy. In this work, we expose the challenges of parallelizing quaternion moment computations and transform the sequential algorithm into a parallel-friendly counterpart. We propose a set of parallel implementations of quaternion moment image representation on different parallel architectures. The proposed implementations target multicore CPU-only, GPU-only, and hybrid CPU-GPU (multicore CPU with multiple GPUs) environments. The loop mitigation technique is proposed to boost the level of parallelism in massively parallel environments, balance the parallel workload, and reduce both the space complexity and the synchronization overhead of the proposed implementations. The loop mitigation technique is also applicable to other applications with similar loop imbalances. Finally, experiments are performed to evaluate the proposed implementations. Applying a moment order of 60 to a color image of 1,024 × 1,024 pixels, the proposed implementation achieved, on four P100 GPUs and a CPU with 16 cores, speedups of 257× and 277× over the baseline performance of a single Intel Xeon E5-2609 CPU core for image reconstruction and quaternion moment computation, respectively. In addition, a 180× speedup is achieved for a color image watermarking application.

Keywords: Parallel Implementation, Speedup, CPU-GPU, Multi-GPU, Quaternion moments.

1. Introduction

Color images play an essential role in our lives, and the representation and recognition of color images are crucial tasks. Recently, quaternion moments have been used for extracting features of color images [1, 2]. Quaternion polar harmonic transform (QPHT) moments and their variants are powerful tools used in many image processing and pattern recognition applications such as biometric recognition [3], cross-sensor iris spoofing detection [4], medical color image watermarking [5], and copy-move image forgery detection [6].

The computation of QPHT moments faces two major problems, namely, accuracy and computational demands. In [7], the authors proposed a novel recurrence method for accurate computation of the quaternion polar complex exponential transform (QPCET) for high-order moments. Using these methods, attempts to accelerate the computational process resulted in much shorter elapsed CPU times than those of the state-of-the-art methods. Despite the high accuracy of the aforementioned methods and the advances in CPU technology, the problem of long computational times for real-time applications remained unresolved. The computation of color image moments is intensive, especially for higher orders and large-sized images, so these methods are impractical for real-time applications. The use of parallel architectures to compute the quaternion moments of color images is the main key to addressing this compute-intensive problem.

There have been several attempts to parallelize image moment computations. In [8], efficient data partitioning over a cluster of GPUs was proposed for computing both Zernike and Legendre moments of gray-scale images in Cartesian coordinates. Manuel et al. used the conventional zeroth-order approximation (ZOA) method to compute both Zernike and Legendre moments [8] and employed symmetric pixel values to reduce the computational times of these inaccurate moments. Hosny et al. [9] proposed new parallel algorithms to accelerate the exact Legendre moments of 2D and 3D grayscale images and objects. These algorithms use multicore CPU and GPU parallel architectures, where each pixel/voxel of the digital image/object can be handled independently.

Recently, Xuan et al. [10] used the conventional approximated ZOA method to compute Zernike moments (ZMs). These researchers proposed a new image re-layout for their parallel implementation and used the symmetry property to reduce the computational times of ZMs for grayscale images. In this approximated method, the pixels of the input image are mapped over the unit circle. Since the pixels in the four corners fall outside the unit circle, these pixels are ignored. For an input image of size 8 × 8, 12 pixels are ignored; approximately 20% of the pixels of the input image are discarded. This omission leads to a significant loss of image information and makes the computation inaccurate.

The original equation for ZMs contains a double integral over a unit circle [11]. In the approximated ZOA method, these double integrations are replaced by double summations, which are another source of numerical error. High-order moments are required for certain applications such as image reconstruction and image analysis and representation; in contrast, applications such as face recognition use low-order moments [12]. The accumulation of these errors limits the method of Xuan et al. [10] to low-order moments, while image reconstruction requires high-order moments. In their work [10], Xuan et al. did not report any experiments regarding image reconstruction.


All the aforementioned parallel works are devoted to handling grayscale images, and the computational complexity of the quaternion moments of color images is much higher than that of grayscale images. The first attempt at a parallel implementation of quaternion moments for color images was made by the authors in [13], where a straightforward parallel implementation for medical image watermarking on CPUs and GPUs was performed using a simple two-loop fusion. A speedup of less than 20 times was achieved, with the baseline implementation utilizing a single CPU. This relatively limited speedup motivates us to propose novel and more efficient parallel implementations to compute the quaternion polar harmonic transforms of color images.

The recent advances in parallel architectures, i.e., multicore CPUs and graphics processing units (GPUs), have made access to massively parallel resources affordable. Moreover, efforts have been made toward improving the programming capabilities for efficient parallel implementations, i.e., OpenMP [14] and CUDA [15] for multicore CPUs and NVIDIA GPUs, respectively. In addition, many researchers have used hybrid CPU-GPU systems to exploit the otherwise idle CPU while it waits for the GPU(s) to finish the assigned workload(s) [16]. While these heterogeneous computing systems provide massively parallel environments, fully using the parallel resources is a challenging task, especially when computing QPCET moments at low orders.

Therefore, leveraging GPU implementations to significantly accelerate QPCET moment computations is a practically useful and promising direction. These implementations should speed up the sequential QPCET moment computation while maintaining the accuracy of the sequential methods. The main contributions of the proposed work are:

1. We proposed an accurate parallel implementation of the QPCET moment with zero ignored pixels. The implementations are generic for any quaternion-based moments.
2. We proposed a polar moment-based optimized prediction of the balanced workload of heterogeneous processing units (PUs).
3. We proposed a technique, called loop mitigation, for boosting the level of parallelism, reducing the synchronization cost, and balancing the load of loop iterations with unequal computational loads.
4. We proposed a set of parallel implementations of the QPCET moment, including multicore CPU-only, GPU-only, and CPU-GPU implementations.

In this work, the 4-point and 8-point symmetries of the polar system are used to achieve shorter computational times. Finally, the results show significant speedups, relative to a single-CPU baseline, for the QPCET moment and image reconstruction operations for different image sizes and different moment orders.

The rest of the paper is organized as follows. Section 2 presents preliminary concepts on the quaternion representation of color images and CPU-GPU heterogeneous computing. Section 3 explains the QPCET moment from a computational perspective. Based on this perspective, Section 4 lists the challenges of parallelizing the QPCET moment. The details of the proposed implementations are given in Section 5. The experimental results and discussions are reported in Section 6. Finally, we draw conclusions and discuss future directions for the proposed work in Section 7.

2. Background

2.1. Quaternion Representation of Color Images

In 1866, Hamilton generalized the definition of complex numbers and defined the "quaternion" [17]. He defined quaternions with one real component and three imaginary components as follows:

q = a + bi + cj + dk    (1)

The conjugate and the modulus of the quaternion q are defined as follows:

q* = a − bi − cj − dk,   |q| = √(a² + b² + c² + d²)    (2)

where a, b, c, and d are real numbers, while i, j, and k are imaginary units defined according to the following relationships:

i² = j² = k² = ijk = −1,   ij = −ji = k,   jk = −kj = i,   ki = −ik = j    (3)

When the real number a = 0, the quaternion is called a pure quaternion. Ell and Sangwine [18] used quaternions to represent color images. A color image, f(r, θ), is represented by using a pure quaternion-based model:

f(r, θ) = f_R(r, θ) i + f_G(r, θ) j + f_B(r, θ) k    (4)

where the red, green, and blue channels of the color image are represented by f_R(r, θ), f_G(r, θ), and f_B(r, θ), respectively.
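As a concrete illustration of Eqs. 1-4, the following C++ sketch (an assumed helper type, not part of the authors' implementation) encodes a quaternion and the pure-quaternion representation of an RGB pixel:

#include <cmath>

// A quaternion q = a + bi + cj + dk with the conjugate and modulus of Eq. (2).
struct Quaternion {
    double a, b, c, d;                       // real part and the i, j, k components
    Quaternion conjugate() const { return {a, -b, -c, -d}; }
    double modulus() const { return std::sqrt(a*a + b*b + c*c + d*d); }
};

// Encode one color pixel as a pure quaternion (real part a = 0), as in Eq. (4).
inline Quaternion pixelToQuaternion(double red, double green, double blue) {
    return {0.0, red, green, blue};
}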

2.2. Quaternion Polar Complex Exponential Transform

The Polar Complex Exponential Transform (PCET) for gray-level images is defined as follows [19, 20]:

M_pq = (1/π) ∫₀^{2π} ∫₀^{1} e^(−î2πpr²) e^(−îqθ) f(r, θ) r dr dθ    (5)

where the order p and the repetition q satisfy p, q = 0, 1, 2, ..., ∞ and î = √−1.

Since the multiplication of quaternions is not commutative, the right-side and the left-side QPCET moments are different. The right-side QPCET moments of the RGB color image f(r, θ), in polar coordinates, are defined as [21]:

M^R_pq = (1/π) ∫₀^{2π} ∫₀^{1} f(r, θ) e^(−μ2πpr²) e^(−μqθ) r dr dθ    (6)

where μ = (i + j + k)/√3 is called a unit pure quaternion.

Based on the orthogonality of the QPCET, the color image f(r, θ) can be reconstructed as follows:

f̂(r, θ) = Σ_{p=−∞}^{∞} Σ_{q=−∞}^{∞} M^R_pq e^(μ2πpr²) e^(μqθ)    (7)

In computing environments, summation to infinity is impossible. Thus, Eq. 7 is modified, where infinity is replaced by the user-defined maximum orders pmax and qmax:

f̂(r, θ) ≈ Σ_{p=−pmax}^{pmax} Σ_{q=−qmax}^{qmax} M^R_pq e^(μ2πpr²) e^(μqθ)    (8)

Since M^R_{−p,−q} = M^R_{p,q}, the computational complexity of Eq. 8 can be reduced by replacing the lower limits of the double summation, −pmax and −qmax, with 0.

The accuracy of the reconstructed image increases as the values of pmax and qmax increase. Digital images are defined in Cartesian coordinates, while the basis functions of the polar harmonic transforms are defined over a unit circle in polar coordinates. Thus, the traditional method for computing QPCETs requires square-to-circle mapping, as displayed in Fig. 1.

2.3. Accurate Computation of the QPCET in Polar Coordinates

QPCET moments can be computed in either Cartesian or polar coordinates. Computation in polar coordinates is preferable because the basis functions are defined in polar coordinates over a unit disk. Recently, a highly accurate method for computing QPHTs in polar coordinates was proposed in [22]. These researchers used the polar raster, where the conventional square pixels are replaced by circular pixels, as displayed in Fig. 1. A detailed discussion of the polar raster can be found in [11].

FIGURE 1: (a) Image plane, (b) polar raster

For the QPCET symmetry properties, the symmetry points are displayed in Fig. 2. Each point of the first kind has 8-point symmetry, while each point of the second kind has 4-point symmetry. For the illustrated example of an image sized 8 × 8 with a total of 64 pixels, the first kind of symmetry covers 6 × 8 = 48 pixels, while the second kind covers 4 × 4 = 16 pixels. In the remainder of this paper, we use the 4-point symmetry in the QPCET moment computations.

FIGURE 2: QPCET symmetry property

The right-hand side QPCET moments for color images are represented by the function f(x, y), where x and y are the image Cartesian coordinates, and this function is exactly computed in polar coordinates as follows:

M^R_pq = (1/π) Σ_i Σ_j f̂(r_i, θ_{i,j}) I_p(r_i) I_q(θ_{i,j})    (9)

with:

I_p(r_i) = ∫_{U_i}^{U_{i+1}} R_p(r) r dr;   I_q(θ_{i,j}) = ∫_{V_{i,j}}^{V_{i,j+1}} e^(−μqθ) dθ    (10)

where f̂(r_i, θ_{i,j}) represents the interpolated color image function. Based on Eq. 4, this function is defined as follows:

f̂(r_i, θ_{i,j}) = f̂_R(r_i, θ_{i,j}) i + f̂_G(r_i, θ_{i,j}) j + f̂_B(r_i, θ_{i,j}) k    (11)

The interpolated image function, f̂(r_i, θ_{i,j}), and its RGB components, f̂_R(r_i, θ_{i,j}), f̂_G(r_i, θ_{i,j}), and f̂_B(r_i, θ_{i,j}), are deduced from the original input color image by cubic interpolation [11]. After a set of mathematical manipulations, the QPCET moments are represented using the RGB channels as follows:

M^R_pq = A^R_pq + i B^R_pq + j C^R_pq + k D^R_pq    (12)

where:

A^R_pq = −(1/√3) [imag(M_pq(f̂_R)) + imag(M_pq(f̂_G)) + imag(M_pq(f̂_B))]
B^R_pq = real(M_pq(f̂_R)) + (1/√3) [imag(M_pq(f̂_G)) − imag(M_pq(f̂_B))]
C^R_pq = real(M_pq(f̂_G)) + (1/√3) [imag(M_pq(f̂_B)) − imag(M_pq(f̂_R))]
D^R_pq = real(M_pq(f̂_B)) + (1/√3) [imag(M_pq(f̂_R)) − imag(M_pq(f̂_G))]    (13)

M_pq(f̂_R), M_pq(f̂_G), and M_pq(f̂_B) represent the conventional PCETs of the red, green, and blue channels, respectively. Eq. 13 shows that accurate computation of the PCET moments of the three single channels results in accurate computation of the QPCET moments of the input color image.
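The following C++ sketch (hypothetical types and names, not the authors' code) shows how the three per-channel PCET moments would be combined into the four QPCET components of Eqs. 12 and 13:

#include <cmath>
#include <complex>

struct QpcetMoment { double A, B, C, D; };   // the four components of Eq. (12)

// Combine the per-channel PCET moments M_pq(f_R), M_pq(f_G), M_pq(f_B)
// into the QPCET components of Eq. (13).
QpcetMoment combineChannels(std::complex<double> mR,
                            std::complex<double> mG,
                            std::complex<double> mB) {
    const double s = 1.0 / std::sqrt(3.0);
    QpcetMoment m;
    m.A = -s * (mR.imag() + mG.imag() + mB.imag());
    m.B = mR.real() + s * (mG.imag() - mB.imag());
    m.C = mG.real() + s * (mB.imag() - mR.imag());
    m.D = mB.real() + s * (mR.imag() - mG.imag());
    return m;
}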

Color images are reconstructed with a finite number of QPCET moments using the following form:

f̂(r, θ) = f̂_A(r, θ) + f̂_B(r, θ) i + f̂_C(r, θ) j + f̂_D(r, θ) k    (14)

where f̂_A(r, θ), f̂_B(r, θ), f̂_C(r, θ), and f̂_D(r, θ) are computed as proposed in [7].

2.4. CPU-GPU Heterogeneous Computing

The use of GPU-only and multicore CPU-only implementations to accelerate sequential counterparts has been successfully reported for different scientific problems, as in [23, 24]. Considerable research has also been performed on hybrid CPU-GPU implementations to increase the obtained speedups: while the GPU is processing its workload, the CPU sits idle, so assigning a portion of the GPU workload to the CPU should improve the parallel performance. The main issue of such hybrid implementations is maintaining load balance across the heterogeneous PUs.

Determining the workloads of heterogeneous PUs can be addressed through static or dynamic scheduling [25]. In static scheduling, each PU is assigned a fixed workload before the code is executed. In dynamic scheduling, the workload is assigned to each PU at run time and can be changed based on the PU performance. The overhead of dynamic scheduling is monitoring the PU performance and then assigning suitable portions of the workload based on each PU's performance. In static scheduling, this overhead vanishes. Static scheduling measures the performance of the PUs prior to the execution of the code by benchmarking the different PUs. This evaluation scores the performance of each PU on the benchmarks, and based on these scores, each PU receives a suitable workload. A linear programming model is used in [26, 27] to distribute the workload between a CPU, a GPU, and an FPGA, leading to a higher level of acceleration.

2.5. Loop optimization techniques

Modern parallel systems have a low cost per giga floating-point operations per second (GFLOPS). Loops are considered the building blocks of any parallel library; thus, optimizing loops is a key task for any parallel code that aims to fully utilize these inexpensive, massively parallel resources.

Loops can be optimized through several techniques. Loop unrolling reduces the loop-control overhead by increasing the size of the loop body; it can be seen as rewriting the loop body as repeated independent statements (iterations). Loop unrolling is utilized in several CUDA applications. In [28], the authors used the loop unrolling technique to parallelize image convolution with CUDA. That workload contains an abundance of iterations, but each iteration performs only a few computations; loop unrolling increased the computation per iteration and reduced the total number of iterations. Another recent method used loop unrolling to boost the performance of the proposed method [29].

The loop interchange technique may also boost parallel code performance. Nested loops can be interchanged when no dependencies appear in the bodies of the two nested loops. In [30], the authors utilized the loop interchange technique to optimize parallel code and evaluated it on several tasks, including applying a Laplace filter to a 256 × 256 pixel image.

Finally, loop fusion (collapsing) is used to fuse nested loops into a single loop. In [31], the authors proposed a parallel implementation for a computational fluid dynamics problem; they collapsed four nested loops into one loop and mapped the collapsed loop iterations to the parallel resources. In [32], the authors compared these three loop optimization techniques and provided recommendations on choosing the proper technique for several problems.
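As a generic illustration of the loop fusion (collapsing) technique discussed above, the following C++/OpenMP fragment, a minimal sketch rather than code from the cited works, collapses two perfectly nested loops into a single iteration space that is split across the threads:

#include <vector>

// Loop fusion (collapsing): the two nested loops below are merged into one
// iteration space of rows*cols independent iterations, which OpenMP then
// distributes over the available threads.
void scaleImage(std::vector<double>& img, int rows, int cols, double factor) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            img[i * cols + j] *= factor;   // each fused iteration is independent
}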


3. QPCET moments: A Computational Perspective


It is crucial to understand the mechanism of the QPCET moment from a computational perspective as a sequential process; this understanding is the entry point to implementing the parallel counterpart. In this section, the QPCET equations are explained in terms of memory and computational operations, and the method for converting the sequential version to a parallel version is illustrated with figures. For simplicity, the operations of the two versions are explained for a single image channel; the other two channels are treated in exactly the same way.

Since the quaternion moments are defined in polar coordinates over a unit circle [21], an input color image of size N × N is mapped to polar coordinates. The image channel in the polar raster domain consists of a set of rings (tracks), where each ring consists of a set of sectors (polar pixels). The number of rings is M = N/2. For the original image, the total numbers of pixels in the polar and Cartesian domains are identical. In Fig. 3 and the remainder of this paper, S is the number of sectors in the innermost ring. The number of sectors in ring i equals S × (2 × i − 1), where i ∈ [1, M]. For an image sized 8 × 8, the number of rings is 4, and the first four rings consist of 4, 12, 20, and 28 sectors, respectively. The total number of sectors of these four rings equals 64, which is equivalent to the 8 × 8 pixels of the original image.

Fig. 3 shows the polar raster of an image sized 8 × 8 and its counterpart representation using a vector of vectors. This vector of vectors represents only one image channel (red, green, or blue); the entire original image is stored as three vectors of vectors.

FIGURE 3: Polar raster of the interpolated image and its representation by a vector of vectors for one channel
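The following C++ sketch (an assumed layout, not the authors' released code) builds one channel of this vector-of-vectors polar representation, where ring i holds S × (2 × i − 1) sectors:

#include <vector>

// One channel of the polar image: ring i (1-based) holds S*(2*i-1) sectors,
// so an N x N image maps to M = N/2 rings whose sector counts sum to N*N.
std::vector<std::vector<double>> makePolarChannel(int N, int S = 4) {
    const int M = N / 2;                              // number of rings
    std::vector<std::vector<double>> channel(M);
    for (int i = 1; i <= M; ++i)
        channel[i - 1].assign(S * (2 * i - 1), 0.0);  // sectors of ring i
    return channel;                                   // e.g. N = 8: rings of 4, 12, 20, 28 sectors
}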

Fig. 4 depicts Eq. 9, which is used to compute the QPCET moment (M_pq), from a computational perspective. Fig. 4 includes, from left to right, the interpolated image, denoted f̂, the radial kernel of order p, denoted I_p, and the angular kernel of repetition q, denoted I_q. The interpolated image consists of N × N pixels spread over M rings. The computation of M_pq involves the two parameters p and q, as in Eq. 9. The higher the values of p and q, the higher the consumed memory, the computational time, and the accuracy of the extracted features and the reconstructed polar image. In Eq. 10, the possible values of p belong to [0, pmax]. In the top part of Fig. 4, the computed QPCET moment is stored as a two-dimensional array of size (pmax + 1) × (qmax + 1), where pmax and qmax are user defined.

Ring i has radius r_i and (pmax + 1) values of p. These (pmax + 1) values are stored in a two-dimensional array representing the kernel I_p of size M × (pmax + 1). Finally, for each sector θ_{i,j}, there are (qmax + 1) different q values. These values are calculated by Eq. 10. Thus, the kernel I_q values of each ring are represented by a two-dimensional array, and the whole kernel by a vector of such arrays, i.e., a three-dimensional data structure, as illustrated in the rightmost part of Fig. 4. All the two-dimensional arrays of kernel I_q have the same number of rows (qmax + 1), and the number of sectors (columns) equals S × (2 × i − 1), where i is the ring number.

Computing an element m_pq of the moment M_pq includes scanning all the elements of the vector of vectors of the interpolated image (the leftmost part of Fig. 4). This scan accumulates, for each sector_{i,j}, the product of the sector value, the value of element p of ring i from the radial kernel I_p, and the value of element q of the corresponding sector_{i,j} from the angular kernel I_q.

FIGURE 4: QPCET moment computation of a single image channel (interpolated image f̂, radial kernel I_p of size (pmax + 1) × M, angular kernel I_q, and the resulting moment matrix of size (pmax + 1) × (qmax + 1))

FIGURE 5: QPCET image reconstruction of a single image channel (the moment matrix of size (pmax + 1) × (qmax + 1) is mapped back to the reconstructed polar image)

Similarly, image reconstruction from the QPCET moment is depicted in Fig. 5. The process constructs a vector of vectors for the reconstructed image sectors. Each reconstructed sector_{i,j} is computed as a summation of the product of two terms, namely, the moment element m_pq and e^(μ2πpr²) e^(μqθ).

4. Parallel QPCET moment challenges

In this section, we profile the QPCET running time to determine the computational bottleneck. Then, we discuss the challenges of QPCET moment and image reconstruction computations from a parallel perspective.

4.1. QPCET moment profiling

Profiling the sequential QPCET moments is the first step toward implementing the parallel counterpart. Table 1 lists the execution times, in seconds, of the main steps of the moment computation and image reconstruction for an image of 256 × 256 pixels at two moment orders, 20 and 60. Each experiment is repeated ten times, and the reported execution times are given with a confidence interval of 95%. The moment computing step highly dominates the other steps, namely, converting the regular image to a polar image and computing the radial kernel I_p and the angular kernel I_q. On the other hand, image reconstruction needs no component to be prepared, and its execution time is, on average, approximately five times that of computing the QPCET moment. Table 1 shows that using the 4-point symmetry significantly reduces the runtime. Compared with the 4-point symmetry, the 8-point symmetry has a higher memory requirement, and the additional runtime reduction it achieves is not significant. Thus, the 4-point symmetry implementation is considered the base sequential code to be parallelized in this work.

Finally, the computational times of the kernels I_p and I_q are image independent; thus, their computational times are not counted in the moment computation, for two reasons. First, as the kernels I_p and I_q are image independent, they can be precomputed and saved to secondary memory and then loaded in a trivial amount of time before computing the QPCET moment. Second, if the code handles a set of k images of the same size, the kernels I_p and I_q are computed or loaded from secondary memory only once, regardless of the number of images used; the overhead time for computing the kernels I_p and I_q per image is the kernel computation time divided by k.

TABLE 1: Execution times (in seconds) of the sequential QPCET moment and image reconstruction steps for a 256 × 256 image

Step                  | 4-point symmetry           | No symmetry
                      | order 20     | order 60    | order 20      | order 60
Image interpolation   | 1.772±0.01   | 1.772±0.01  | 1.851±0.02    | 1.851±0.02
Computing kernel Ip   | 0.001±0.00   | 0.004±0.00  | 0.003±0.00    | 0.007±0.00
Computing kernel Iq   | 0.526±0.01   | 1.610±0.01  | 1.411±0.01    | 4.313±0.02
Moment computing      | 7.675±0.06   | 65.187±0.52 | 31.621±0.01   | 275.015±0.34
Image reconstruction  | 38.432±0.13  | 337.125±1.05| 108.314±0.375 | 955.412±7.44

4.2. Challenges of parallel QPCET moment implementation

Moment computation consists of four nested loops, as listed in Algorithm 1. The iterations of the first two loops are computationally equal, while the iterations of the third loop have different computational loads. Loop fusion of the outermost two loops is therefore straightforward but provides only a limited level of parallelism. For example, when computing the QPCET moment with an order of 20 (pmax = qmax = 20), the aforementioned two-loop fusion results in (21 × 21) = 441 independent iterations. This level of parallelism is suitable for multicore CPUs but is insufficient for a single modern GPU: the P100 GPU has 3,584 cores, and the number of threads on a P100 should be at least twice the number of GPU cores to hide the memory access latency. Moreover, this two-loop fusion is far from sufficient for a multi-GPU environment.

In addition, within the iterations of the third and fourth loops, the p and q values of the outermost two loops are required in the two innermost loops. Thus, mapping the fused loop iteration number back to the original values of p and q must be addressed.

Fusing the first three loops of the QPCET moment computation yields M × (pmax + 1) × (qmax + 1) dependent iterations; each of these iterations handles one ring of the image. As shown in Fig. 1, the rings have different numbers of sectors. Thus, the iterations of the fused three loops are not computationally equal, which is not preferable for parallel architectures: unequal computational workloads lead to load imbalance.

Finally, fusing all four loops highly leverages code-level parallelism. For example, for a small image of 128 × 128 pixels and a moment order of 20, fusing all four loops results in 128 × 128 × 21 × 21 = 7,225,344 dependent iterations. While fusing all four loops extremely increases the level of parallelism, the increase comes at the cost of increasing the level of synchronization (lines 8 to 10 in Algorithm 1).

For an N × N image with three-loop fusion, each iteration of the fused loop computes the moment contribution of ring i. Thus, the synchronization is at the ring level, with M = N/2 synchronization steps. For four-loop fusion, an iteration of the fused loop computes the moment contribution of sector j of ring i, i.e., of one pixel of the N² pixels. Thus, N² synchronization steps are required, even though the iterations are computationally equal. In concrete numbers, for a small image of size 128 × 128, the numbers of synchronization steps are 64 and 16,384 for the three-loop and four-loop fusions, respectively.

The main challenge of the QPCET moment computation is to choose the proper number of loops to be fused. From the above discussion, two-loop fusion is synchronization-free, but its independent iterations provide only a low level of parallelism; this level of parallelism improves as the moment order increases. Three-loop fusion has a low synchronization cost and a higher level of parallelism than two-loop fusion. Finally, four-loop fusion has an extremely high synchronization cost and the highest level of parallelism. Thus, in the following, we exclude four-loop fusion due to its extremely high synchronization cost and consider only the two-loop and three-loop fusions.

4.3. Challenges of parallel image reconstructions

Algorithm 2 is the sequential interpolated image reconstruction from the QPCET moment. In Algorithm 2, the outermost loop iterations are not computationally equal: the number of sectors equals S × (2 × i − 1), where i is the ring number. A parallel implementation of Algorithm 2 with no loop fusion would assign the sectors of each ring to one parallel resource, e.g., a CPU core. For instance, a 128 × 128 pixel image has 64 rings; given S = 4, the innermost ring includes four sectors (four independent tasks), while the last ring includes 512 sectors (independent tasks); details of the ring structure are depicted in Fig. 2. Thus, one parallel resource would handle 4 independent tasks while another handles 512 independent tasks. This example shows the possible load imbalance if the first and second outer loops are not fused. Thus, this algorithm is inconvenient for multicore CPUs and GPUs without loop fusion. This inequality can be easily demonstrated with a straightforward parallelization of the sequential image reconstruction using the built-in OpenMP scheduling options [14].

Interchanging the two inner loops with the two outer loops would make Algorithm 2 similar in structure to Algorithm 1. However, this cannot be done because of the synchronization cost. In Algorithm 2, the two outer loops iterate over the pixels of the reconstructed image, and each pixel to be reconstructed must iterate over all the moment elements in the two inner loops. Thus, each iteration of the two outer loops uses the two inner loops to traverse the moment and reconstruct one pixel. The reconstruction of a pixel is independent of the other pixels, so fusing the two outer loops requires no synchronization. On the other hand, interchanging the two outer loops with the two inner loops would result in dependent iterations and a synchronization cost: after the interchange, each iteration of the new outer two loops reconstructs only a part of every pixel, and a pixel is complete only after all iterations of the interchanged loops have finished. Thus, loop interchange would degrade the performance of Algorithm 2.

Intuitively, loop fusion is the most convenient approach to parallelizing Algorithm 2; it requires a mapping function to convert the fused loop iteration number to the original ring i and sector j values. Fusing the first two loops results in a number of independent iterations equal to the number of image pixels.

There are two main issues to be addressed. The first issue is avoiding computationally unequal iterations of the fused loops, in order to balance the workloads; otherwise, the parallel resources are underused because certain threads/cores finish earlier and become idle.

Algorithm 1 Sequential QPCET moment computation for a polar image
1:  for p = 0 to pmax do
2:      for q = 0 to qmax do
3:          for i = 1 to M do
4:              for j = 1 to S × (2 × i − 1) do
5:                  Red_Mpq[p][q] += Red_polarImage[i][j] × Ip[p][i] × Iq[i][q][j]
6:                  Green_Mpq[p][q] += Green_polarImage[i][j] × Ip[p][i] × Iq[i][q][j]
7:                  Blue_Mpq[p][q] += Blue_polarImage[i][j] × Ip[p][i] × Iq[i][q][j]
                end for
            end for
8:          Red_Mpq[p][q] ×= 1.0/PI
9:          Green_Mpq[p][q] ×= 1.0/PI
10:         Blue_Mpq[p][q] ×= 1.0/PI
        end for
    end for

In addition, the implementation of these fused loops should have minimal synchronization overhead and space complexity. The second issue is mapping the fused iteration number to the original iteration number of each fused loop in a constant amount of time, so that this mapping step has no effect on the overall computational time.
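To make the two-loop fusion of Algorithm 1 concrete, the following C++/OpenMP sketch (hypothetical data layout, with real-valued kernels as a stand-in for the complex-valued ones, not the authors' released code) fuses the p and q loops so that the (pmax + 1) × (qmax + 1) independent moment elements are distributed over the threads; each fused iteration runs the two inner loops with a private accumulator, so no synchronization is needed:

#include <vector>

// polarImage[i][j]: one channel of the polar image (ring i, sector j), 0-based
// Ip[p][i], Iq[i][q][j]: precomputed radial and angular kernels
// moment[p][q]: output matrix of size (pmax+1) x (qmax+1), preallocated
void momentTwoLoopFusion(const std::vector<std::vector<double>>& polarImage,
                         const std::vector<std::vector<double>>& Ip,
                         const std::vector<std::vector<std::vector<double>>>& Iq,
                         std::vector<std::vector<double>>& moment,
                         int pmax, int qmax, int M, int S) {
    const double PI = 3.14159265358979323846;
    #pragma omp parallel for collapse(2)              // fuse the p and q loops
    for (int p = 0; p <= pmax; ++p) {
        for (int q = 0; q <= qmax; ++q) {
            double sum = 0.0;                         // private accumulator
            for (int i = 1; i <= M; ++i)
                for (int j = 0; j < S * (2 * i - 1); ++j)
                    sum += polarImage[i - 1][j] * Ip[p][i - 1] * Iq[i - 1][q][j];
            moment[p][q] = sum / PI;
        }
    }
}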

5. Proposed parallel implementation of QPCET moment

As discussed in Sec. 4, the main challenge is to find a compromise between a low level of parallelism with no synchronization on one side and high levels of both parallelism and synchronization on the other side. Three-loop fusion has a high level of parallelism and a low synchronization cost but computationally unequal independent tasks. This paper proposes the novel concept of loop mitigation, which mitigates the load imbalance and the synchronization cost of the three-loop fusion of the QPCET moment computation. Loop mitigation produces computationally equal three-loop fused iterations with fewer than M synchronization steps.

Algorithm 2 Sequential polar image reconstruction using the QPCET moment
1:  for i = 1 to M do
2:      for j = 1 to S × (2 × i − 1) do
3:          for p = 0 to pmax do
4:              for q = 0 to qmax do
5:                  Red_Â[i][j] += Red_Mpq[p][q] × e^(µ2πp r_i²) e^(µq θ_i,j)
6:                  Green_Â[i][j] += Green_Mpq[p][q] × e^(µ2πp r_i²) e^(µq θ_i,j)
7:                  Blue_Â[i][j] += Blue_Mpq[p][q] × e^(µ2πp r_i²) e^(µq θ_i,j)
                end for
            end for
        end for
    end for
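A corresponding sketch for Algorithm 2 (again hypothetical, with a placeholder basis function standing in for e^(µ2πpr²) e^(µqθ)): the two outer loops are fused over the N² polar pixels, and each fused iteration reconstructs one sector independently using the index mapping given later in Section 5.1 (Eq. 17), so no synchronization is required:

#include <cmath>
#include <complex>
#include <vector>

// moment[p][q]: one channel's moment matrix; out: reconstructed polar channel
// (vector-of-vectors layout); basis(p, q, i, j) stands in for the exponential
// basis term, whose evaluation is omitted here for brevity.
void reconstructTwoLoopFusion(const std::vector<std::vector<std::complex<double>>>& moment,
                              std::vector<std::vector<std::complex<double>>>& out,
                              int pmax, int qmax, int N, int S,
                              std::complex<double> (*basis)(int, int, int, int)) {
    const long totalPixels = static_cast<long>(N) * N;
    #pragma omp parallel for
    for (long f = 0; f < totalPixels; ++f) {
        int i = static_cast<int>(std::sqrt(static_cast<double>(f) / S));  // 0-based ring
        if (static_cast<long>(S) * (i + 1) * (i + 1) <= f) ++i;           // guard rounding
        int j = static_cast<int>(f - static_cast<long>(S) * i * i);       // sector in ring
        std::complex<double> sum(0.0, 0.0);
        for (int p = 0; p <= pmax; ++p)
            for (int q = 0; q <= qmax; ++q)
                sum += moment[p][q] * basis(p, q, i, j);
        out[i][j] = sum;                                                   // independent per pixel
    }
}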

In addition, for moment computing, the image size and/or the moment order determines the portion of the workload assigned to each PU. Thus, based on the image size and the moment order value, we use static scheduling with a moment-optimized prediction of the heterogeneous PU workloads to achieve load balance.

5.1. From a fused loop iteration to the original loop iterations

Mapping functions of Algorithm 1:

The mapping of the iteration number of the two-loop fusion of the QPCET moment computation to the original loop iteration numbers is calculated by:

p = i_fused / (qmax + 1),   q = i_fused mod (qmax + 1)    (15)

where i_fused is the iteration number of the fused two loops and i_fused ∈ [0, (pmax + 1) × (qmax + 1) − 1] for two-loop fusion.

Similarly, the mapping of the iteration number of the three-loop fusion of the QPCET moment computation to the original loop iteration numbers is calculated by:

p = (i_fused mod ((pmax + 1) × (qmax + 1))) / (qmax + 1)
q = (i_fused mod ((pmax + 1) × (qmax + 1))) mod (qmax + 1)
i = i_fused / ((pmax + 1) × (qmax + 1))    (16)

where i_fused ∈ [0, (pmax + 1) × (qmax + 1) × M − 1].

where if used ∈ [0, (pmax + 1) × (qmax + 1) × M ]. 415

Mapping functions of algorithm 2:

For two-loop fusion, the total number of fused loop iterations equals the total number of image pixels N 2 . In Algorithm 2, the mapping of the fused iteration number to the counterpart ring i and sector j is calculated by:

p i = b( if used )/2c,

j = if used − S × i2

(17)

where i ∈ [0, M − 1], j ∈ [0, (S × (2 × i + 1)) − 1], if used ∈ [0, N 2 − 1], and

S = 4; S is the number of sectors in the innermost ring of the polar image, as

rna

420

shown in the left side of Fig. 1. Ring i of the polar image is defined to be in the range [1,M ] in this paper. Herein, the value for i is computed in the range of [0,M − 1] for simple manipulation. The conversion can be addressed by simply adding one to the value obtained for i from Eq. 17. 5.2. The loop mitigation

Jou

425

For a low or moderate order QPCET moment in a massively parallel envi-

ronment, i.e., a modern powerful single GPU, the number of parallel resources

cannot be fully utilized using the two-loop fusion technique. Thus, a higher level of loop fusion is required to boost the level of parallelism for the QPCET

430

moment computation. For the moment computation, the two innermost loops perform a summation that is stored in a variable at the next outer loop, lines 8 21

Journal Pre-proof

lP repro of

to 10 in Algorithm 1. Thus, loop fusion for both or any of the innermost loops of these two operations include a synchronization overhead.

For example, computing the moment element mp,q is done by completing the 435

two innermost loops, lines 4 to 7 in Algorithm 1. The workload for computing moment element mp,q is divided by n threads. Then, each thread computes a

portion of the workload, and the moment element mp,q is computed by applying

a reduction with a sum operator on the n memory cells. In contrast, one memory cell is required when two-loop fusion is used. In addition, a synchronization step 440

is executed after computing the moment for each of the M rings.

As a compromise to this situation, we propose the loop mitigation tech-

nique, which boosts the level of parallelism for computing the QPCET moment, balances the workloads of the iterations, and reduces the synchronization time overhead. There are M executions of the synchronization step for three-loop 445

fusion. For three-loop fusion, the loop mitigation reduces the number synchro-

nization step executions to M/2 to log(M/2) based on the tunable parameter mmitig value, as explained in the following section.

Boosting the level of parallelism and balancing the workload:

The main idea of the loop mitigation technique is to rearrange the computationally unequal iterations of the third loop of Algorithm 1 into computationally

rna

450

equal iterations. Thus, the inequality of the iterations is hidden, and the workloads are balanced without fusing the third loop. Next, the rearranged iterations are spread over a set of smaller loops. The loop mitigation decomposes a loop of computationally unequal iterations into a set of smaller loops with equal computational loads. Then, the computationally equal load loops can be distributed

Jou

455

over the parallel resources. The loop mitigation uses that ring i has (2×(i−1)+1) sectors, and i ∈ [1,M ];

each sector represents an iteration. Thus, for a given balanced loop, the total number of sectors/iterations is a summation of the assigned ring sectors. Each

460

cell of Fig. 6 represents S sectors. From Fig. 6, loop mitigation divides the M rings into two halves, low and 22

lP repro of

Journal Pre-proof

S×1

S×3

S×5



….

• • • • • • • • • • •

• • •

S× (2M-5)

• • • • • • • • •

S×5

• • • • •

S× (2M-3)

S×(2M-5)

S×3

• • • • • • • • • • •

PU2

S×(2M-3)

• • • • • • • • •

S×1

• • • • • • •

PU1

S×(2M-1)

S×(2M-1)

• • • • •

S×(2M-3)

• • •

S×5

S×3



S×(2M-5)

S×1

PU0

• • • • • • •

S× (2M-1)

× (2 ) Sectors (pixels)

FIGURE 6: Loop mitigation of polar image rings

high. Then, this method balances the load by pairing the first ring of the low half and last ring of the high half, the second ring of the low half and the next to

last ring of the high half and so on. Thus, this method generates M/2 pairs of 465

rings, which is the limit for boosting the level of parallelism for the third loop.

rna

The M/2 pairs of rings are computationally equal.

For a series of odd numbers in the range [1,(2 × M ) − 1], loop mitigation balances the workloads of the smaller loops by rearranging the rings so that the total number of iterations for each smaller loop equals a multiple of S ×(2×M ). 470

For example, in Fig. 6, the polar image consists of six rings, M = 6. Thus, PU0 computes the moment for rings 1 and 6, PU1 computes the moment for rings 2

Jou

and 5, and PU2 computes the moment for the remaining rings. Each PU has S × 2M = S × 12 sectors/iterations. Loop mitigation delivers an equal workload for each PU for different colors. Thus, while the left part of Fig. 6 contains one

475

loop, the right part contains three smaller loops.

23

Journal Pre-proof

lP repro of

Reducing synchronization time overhead:

After a pair of rings is computed, a synchronization step is required. In total, loop mitigation executes the synchronization step M/2 times because there is only one instance of the moment in the memory. 480

Another solution to reduce the number of synchronization step executions is to have M/2 memory instances of the computed moment. Thus, each equal-

load loop can independently store the calculated values in its own instance of

the moment. Then, a reduction sums the M/2 moment instances after the M/2 smaller loops are finished. Reduction of M/2 terms can be accomplished in 485

log(M/2) iterations. These instances can be within a single PU or spread over

several PUs. Thus, loop mitigation significantly reduces the time overhead for the obligatory synchronization step.

Loop mitigation has a memory overhead of (pmax+1)×(qmax+1)×mmitig , where mmitig is the number of moment instances, and 1 ≤ mmitig ≤ M/2. In 490

addition, mmitig is used to divide the M/2 equal-load loops over the parallel resources. Thus, mmitig should be a factor of M/2 to guarantee load balance between the PUs involved. For the loop mitigation, the number of executions ofthe synchronization step is calculated as

M (2×mmitig ) (log(2

× mmitig )).

495

rna

Loop mitigation requires synchronization at the PU level once all the PUs

have finished their workloads. In CUDA [15], a GPU kernel is used for the GPU level of synchronization. Synchronization is required once all the GPUs have finished their workloads.

5.3. Load balancing in a heterogeneous environment

Jou

The proposed implementation can run in heterogeneous parallel computing 500

environments. These environments include CPU-GPU, with a single or multiple GPUs. The loop mitigation enables partitioning the workload between PUs, i.e.,

the GPU(s) and the host multicore CPU. The workloads of each of the GPUs and CPU are tunable. The user can set the workload values of each involved PU in terms of the number of ring pairs, which is discussed in Section 5.2. Summing

24

Journal Pre-proof

the assigned ring pairs for all the PUs yields M/2 pairs. These workloads should

lP repro of

505

be proportional to the computational power of the processing units.

Several efforts have been proposed to predict the proper workload of each

PU; these efforts are summarized in [25]. The most suitable approach for the proposed implementation is that considering the relative performances of the 510

PUs. In this context, the proposed implementation includes a utility that estimates the optimal workloads of the heterogeneous PUs. This utility should run one time for every change in the number of PUs.

In [33], the authors proposed using linear regression to predict the workload

of a CPU and a GPU based on recorded execution times for a single CPU and 515

a single GPU. Similarly, the proposed utility runs the proposed parallel code on CPU-only and GPU-only for several test images with different sizes and

different moment order values. The execution times of these tests are recorded. Then, multiple regression is used to estimate the workload of the CPU and

GPU based on the recorded execution times. For this purpose, we use the QR 520

decomposition approach. The coefficients of the multiple regression equation are calculated using the recorded real execution times. The multiple regression

equation includes CPU-only and GPU-only execution times as the independent

value y. 525

rna

variables x1 and x2 and the GPU workload in terms of ring pairs as the predicted

To analyze the execution time of the CPU-GPU implementation, we have to express the overall computing capacity in a common unit to determine the expected speedup. As GPU core power differs from CPU core power, we use the aforementioned workload portions to express the overall number of CPU and GPU cores as an equivalent number of CPU cores, Ptotal. The proposed method calculates Ptotal from the relative performance of all the CPU and GPU cores for a given image size and moment order, as follows:

Ptotal = Num_of_CPU_cores × (1 + Σ_{i=1}^{g} (CPU_exec_time / GPU_i_exec_time))    (18)

where g is the number of GPUs.



Assume that, for a given image size and moment order, the execution time of the 16-core CPU for moment computation is one second and the execution time of a single GPU running the same code is 0.5 seconds. Given two identical GPUs and using Eq. 18, Ptotal equals 16 × (1 + 4) = 80 CPU cores. In this example, a single GPU is twice as fast as the 16-core CPU; thus, one GPU is equivalent to 32 CPU cores. As a result, Ptotal equals the actual 16 CPU cores plus an estimated 64 CPU cores from the two GPUs. This estimate changes as either or both of the image size and the moment order change.

5.4. Implementation Complexity Analysis

The analysis includes both the time and memory space costs of the proposed implementations. The time and memory costs of moment computations and 545

image reconstruction operations are dominated by two factors, the image size and moment order, pmax and qmax, respectively. These two operations include the same four loops but in a different order, as shown in Algorithms 1 and 2.

Thus, the two algorithms have the same computational cost. The following analysis is for one image channel; for the three image channels, the constant three is omitted. Thus, the final cost equals the cost of one image channel.

rna

550

For the time analysis, the two algorithms include four loops; two for iterating all possible p and q values of sizes (pmax + 1) and (qmax + 1), respectively. The other two loops iterate over all the sectors (pixels) of the polar image, with N × N pixels in total. Thus, the computational cost of the sequential algorithm 555

is O(N 2 × pmax × qmax) and is O((N 2 × pmax × qmax)/Ptotal ) for the parallel

Jou

version, where Ptotal is defined in Subsection 5.3. For the space analysis, Algorithm 1 accesses the three data structures of fˆ

(the interpolated image), kernel Ip , and kernel Iq , with space complexities of O(N 2 ), O(M ×pmax), O(N 2 ×qmax), respectively, as illustrated in Fig. 4. The

560

dominant data structure is the one representing Iq . Thus, the space costs of the CPU-only and GPU-only implementations are O(N 2 × qmax) for moment computation operations. For the CPU-GPU implementation, the required memory 26

Journal Pre-proof

lP repro of

for the CPU is different from that of the GPU, as the CPU accesses all three

of the aforementioned data structures while each GPU accesses parts of these 565

three data structures, due to loop mitigation technique, Section 5.2. The GPU space cost of the CPU-GPU implementation is O((N 2 × qmax)/g) for moment computational operations, where g is number of GPUs. The space cost of the

image reconstruction operation is O(N 2 + pmax × qmax) for the proposed implementations, as illustrated in Fig. 5.

570

6. Experimental Results 6.1. Setup

The experiments are performed on a computer with two Intel(R) Xeon(R) E5-2609 v4 1.70 GHz CPUs with 8 cores, four P100 NVIDIA GPUs with 16

GB memory, and 500 GB RAM. The OS is 64-bit Linux. The implementations 575

are written in the C++ programming language, GCC version 5.4.0. We used

the standard OpenMP threads library [14] for multicore CPUs and CUDA 10.1 [15] for the GPU programming. For the reported results, each experiment is

repeated ten times and the reported execution times are in seconds with a confidence interval of 95%. The test images are of birds dataset [34]. In the following, ’CPU’ refers to using CPU with 16 cores and ’GPU’ refers to using

rna

580

one P100 GPU. In the following, all the reported times are in seconds and the throughputs are the number of processed images per minute. The reported speedups are calculated based on the baseline performance on a single Intel Xeon E5-2609 CPU core. All reported speedups and execution times do not 585

include the data transfer time between the host and device memory, except for

Jou

the image watermarking experiments. The data transfer time overhead between CPU and GPU(s) is discussed in Section 6.4. 6.2. Accuracy

First, the accuracy of the sequential QPCET moment implementation is

590

compared with the proposed parallel implementation counterparts. The se-

27

lP repro of

Journal Pre-proof

FIGURE 7: Visual perception of the accuracy of the proposed method

quential and parallel implementations have identical accuracies for all the im-

ages tested. The normalized image reconstruction error (NIRE) and the visual perception of the reconstructed image are used in quantitative and qualitative evaluation of the reconstruction accuracy of the proposed parallel method. Fig. 595

7 shows the values of NIRE and the visual perception of the reconstructed image of Lena with different moment orders. 6.3. Speedup and time improvement

Second, we compare the speedup of loop mitigation with three-loop fusion

against three-loop fusion and two-loop fusion on a single GPU. Table 2 lists the 600

execution times in seconds alongside the speedups obtained for three-loop fusion and loop mitigation with mmitig = 1 and mmitig = M/2. The performance of

rna

the three-loop fusion with the loop mitigation surpasses that of the two-loop and three-loop fusions, especially for large mmitig . Next, Fig. 8 shows how increasing the value of mmitig affects the speedup 605

at different moment orders for an image of 1,024 × 1,024 pixels. Thus, the loop mitigation implementation is selected for further experiments due to its superior performance.

Jou

Then, we evaluate the speedup, sequential running time improvement (Time Impr.), and throughput of the proposed implementations. Tables 3 and 4 show

610

the speedups and execution time improvements for the proposed implementations of moment computation, respectively. Similarly, Tables 5 and 6 show the speedups and execution time improvements for the proposed implementations of the interpolated image reconstruction, respectively. The execution time im-

28

lP repro of

Journal Pre-proof

TABLE 2: Execution times in seconds & speedup of two-loop fusion vs. three-loop fusion vs. the loop mitigation on a single GPU

Image

Mom.

Sequent.

Size

Order

[7]

Parallel [13]

The proposed implementations

2-loop

3-loop

Speedup

fusion

20 256×

7.68 ± 0.06

256 40

29.47 ± 0.20

60

65.19 ± 0.52

20 512×

31.04 ± 0.19

512

117.61± 40

4.80 ±

4.99x

13.11x

4.85 ±

4.91 ±

0.01

124.07±

76.78± 0.01

478.82±

76.94±

40

Jou 13.35

0.01

0.00

9.87x 0.02

0.06

0.53 ±

1.98 ±

4.33 ±

1.98 ±

7.56 ±

59.45x

59.82x

62.61x

63.30x

16.92± 20.49x

0.08

58.89x

0.00

51.44± 12.93x

62.01x

0.00

48.51±

81.55± 13.26x

29

2.58x

6.19x

1.05 ±

0.01

48.10±

0.00

0.01

20.26x 0.02

77.37±

55.25x

0.01

12.79±

0.00

0.53 ±

0.00

9.72x

1.63x

42.82x

0.00

12.10±

76.09±

1054.32± 79.50±

60

20.54x

2.59x

13.19x

0.18 ±

0.00

0.01

0.09

0.01

9.72x

11.98±

19.65±

6.22x

2.72

3.17 ±

6.07x

1.62x

0.97

3.03 ±

1.65x

13.12x

Speedup

0.00

0.00

0.00

19.76±

20

13.28x

19.39±

0.01

2.55x

0.00

0.00

19.22±

0.04

6.07x

18.85±

1.62x

3.01 ± 0.00

0.02

19.15±

2.04

1.62x

0.00

rna 259.18±

Speedup

M/2

0.00

0.01

60

1024

4.73 ±

6.12x

0.72

1024×

1.60x

0.00

4.86 ±

m=1

fusion

0.00

4.81 ±

m=

Speedup

62.30x 0.01

Journal Pre-proof

provements for the proposed implementations are calculated by:

Time Impr. = ((sequential time − parallel time) / sequential time) × 100    (19)

FIGURE 8: Effect of changing the mmitig value on the obtained speedup (1,024 × 1,024 pixel image, M = 512; speedup vs. mmitig = 2, 4, ..., 256 for moment orders 20, 40, and 60)

The proposed implementations are evaluated using relatively large-sized im-

ages of 2,048 × 2,048 and 4,096 × 4,096 pixels. The execution times of this test is reported in Table 7 for multicore CPU, single GPU, and four GPUs with a

rna

620

CPU. For a 4,096 × 4,096 pixels image with a moment order 60, the required memory is more than the P100 GPU memory capacity, 16GB. This is due to the angular kernel Iq memory size. The Iq kernel is a three-dimensional array of size N × N × qmax; each element of this array is a complex number where a 625

complex number reserves 8 bytes for the real part and another 8 bytes for the

Jou

imaginary part. Thus, the total memory size of the Iq kernel of a 4,096 × 4,096

pixels image with a moment order 60 equals 4,096 × 4,096 × 61 × 16 = 16.3 × 108 bytes, 16.3GB.

In the same context, the GPU memory utilization of the GPU proposed im-

630

plementations of moment computations are listed in Table 8. For the CPU+4×GPU implementation, the memory utilization value is the same for the four GPUs, as the implementation evenly divides the computational loads over the four 30

lP repro of

Journal Pre-proof

TABLE 3: Execution times (in seconds) and speedups of moment computing for the proposed parallel implementations

Image size    | Mom. order | CPU exec. time | CPU speedup | GPU exec. time | GPU speedup | CPU+4×GPUs exec. time | CPU+4×GPUs speedup
256 × 256     | 20         | 0.64 ± 0.00    | 12.08x      | 0.18 ± 0.00    | 42.82x      | 0.10 ± 0.00           | 78.97x
256 × 256     | 40         | 1.90 ± 0.01    | 15.47x      | 0.53 ± 0.00    | 55.25x      | 0.18 ± 0.00           | 164.90x
256 × 256     | 60         | 4.28 ± 0.01    | 15.22x      | 1.05 ± 0.00    | 62.01x      | 0.27 ± 0.00           | 237.80x
512 × 512     | 20         | 2.57 ± 0.00    | 12.09x      | 0.53 ± 0.00    | 58.89x      | 0.19 ± 0.00           | 162.48x
512 × 512     | 40         | 7.63 ± 0.03    | 15.42x      | 1.98 ± 0.01    | 59.45x      | 0.52 ± 0.00           | 228.32x
512 × 512     | 60         | 17.23 ± 0.05   | 15.04x      | 4.33 ± 0.01    | 59.82x      | 1.05 ± 0.01           | 247.30x
1024 × 1024   | 20         | 10.30 ± 0.04   | 12.05x      | 1.98 ± 0.00    | 62.61x      | 0.66 ± 0.00           | 186.73x
1024 × 1024   | 40         | 30.28 ± 0.11   | 15.81x      | 7.56 ± 0.00    | 63.30x      | 1.73 ± 0.00           | 276.00x
1024 × 1024   | 60         | 69.51 ± 0.30   | 15.17x      | 16.92 ± 0.01   | 62.30x      | 3.81 ± 0.00           | 276.89x

TABLE 4: Execution times (in seconds) and improvement ratios for moment computation

Image size    | Mom. order | Sequential [7] time | CPU time | CPU Impr. | CPU+GPU time | CPU+GPU Impr. | CPU+4×GPUs time | CPU+4×GPUs Impr.
1024 × 1024   | 20         | 124.11              | 10.30    | 91.70%    | 1.68         | 98.64%        | 0.66            | 99.46%
1024 × 1024   | 40         | 478.82              | 30.28    | 93.68%    | 6.13         | 98.72%        | 1.73            | 99.64%
1024 × 1024   | 60         | 1054.32             | 69.51    | 93.41%    | 13.28        | 98.74%        | 3.81            | 99.64%


TABLE 5: Speedups of image reconstruction for the parallel implementations

Image size    | Mom. order | CPU     | GPU     | CPU+4×GPUs
256 × 256     | 20         | 15.65x  | 51.71x  | 184.68x
256 × 256     | 40         | 15.75x  | 52.79x  | 199.82x
256 × 256     | 60         | 15.94x  | 52.68x  | 201.87x
512 × 512     | 20         | 15.91x  | 58.13x  | 251.57x
512 × 512     | 40         | 15.92x  | 58.42x  | 257.57x
512 × 512     | 60         | 15.99x  | 58.46x  | 260.22x
1024 × 1024   | 20         | 15.97x  | 61.79x  | 258.84x
1024 × 1024   | 40         | 15.56x  | 61.32x  | 258.87x
1024 × 1024   | 60         | 15.95x  | 61.63x  | 256.98x

TABLE 6: Execution times (in seconds) and time improvement ratios for image reconstruction

Image size    | Mom. order | Sequential [7] time | CPU time | CPU Impr. | GPU time | GPU Impr. | CPU+GPU time | CPU+GPU Impr. | CPU+4×GPUs time | CPU+4×GPUs Impr.
256 × 256     | 20         | 38.43               | 2.46     | 93.61%    | 0.74     | 98.07%    | 0.59         | 98.47%        | 0.21            | 99.46%
256 × 256     | 40         | 150.96              | 9.58     | 93.65%    | 2.88     | 98.09%    | 2.27         | 98.49%        | 0.76            | 99.50%
256 × 256     | 60         | 337.42              | 21.17    | 93.73%    | 6.41     | 98.10%    | 4.99         | 98.52%        | 1.67            | 99.50%
512 × 512     | 20         | 152.79              | 9.60     | 93.72%    | 2.63     | 98.28%    | 2.22         | 98.55%        | 0.61            | 99.60%
512 × 512     | 40         | 601.55              | 37.78    | 93.72%    | 10.29    | 98.29%    | 8.73         | 98.55%        | 2.34            | 99.61%
512 × 512     | 60         | 1348.98             | 84.39    | 93.74%    | 23.07    | 98.29%    | 19.52        | 98.55%        | 5.18            | 99.62%
1024 × 1024   | 20         | 614.51              | 38.44    | 93.74%    | 9.95     | 98.38%    | 8.41         | 98.63%        | 2.37            | 99.61%
1024 × 1024   | 40         | 2399.99             | 154.92   | 93.57%    | 39.14    | 98.37%    | 33.01        | 98.62%        | 9.27            | 99.61%
1024 × 1024   | 60         | 5393.58             | 338.22   | 93.73%    | 87.53    | 98.38%    | 73.90        | 98.63%        | 20.99           | 99.61%


TABLE 7: Execution times (in seconds) of moment computation and image reconstruction for relatively large-sized images

Image size      | Mom. order | CPU moment      | CPU reconst.    | GPU moment     | GPU reconst.   | CPU+4×GPU moment | CPU+4×GPU reconst.
2,048 × 2,048   | 20         | 41.83 ± 0.18    | 153.00 ± 0.22   | 7.53 ± 0.00    | 40.22 ± 0.01   | 2.06 ± 0.00      | 9.67 ± 0.00
2,048 × 2,048   | 40         | 123.13 ± 0.51   | 602.53 ± 0.54   | 31.24 ± 0.02   | 157.90 ± 0.05  | 7.11 ± 0.01      | 37.94 ± 0.01
2,048 × 2,048   | 60         | 279.49 ± 0.72   | 1347.91 ± 0.81  | 69.06 ± 0.07   | 352.81 ± 0.19  | 16.00 ± 0.01     | 84.94 ± 0.03
4,096 × 4,096   | 20         | 170.12 ± 0.53   | 612.00 ± 0.28   | 31.65 ± 0.01   | 161.19 ± 0.03  | 7.73 ± 0.03      | 38.14 ± 0.01
4,096 × 4,096   | 40         | 494.24 ± 4.40   | 2407.84 ± 2.72  | 126.76 ± 0.15  | 632.57 ± 0.15  | 29.92 ± 0.01     | 149.68 ± 0.03
4,096 × 4,096   | 60         | 1113.64 ± 5.44  | 5390.93 ± 2.29  | out of memory  | out of memory  | 66.00 ± 0.01     | 335.50 ± 0.70

The proposed implementation of moment computing divides both the data and the processing between the parallel resources. On the other hand, the proposed implementation of image reconstruction divides only the processing, not the data, as the data size is very small for this task. For the image reconstruction task, the memory utilization is therefore low; it varies from 2.50% to 6.86% for 256 × 256 pixels and 4,096 × 4,096 pixels images, respectively, for both the single-GPU and CPU+4×GPU implementations.
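The paper does not state how the utilization figures in Table 8 were collected; one straightforward way to obtain such numbers on NVIDIA GPUs is the CUDA runtime call cudaMemGetInfo, as in the illustrative host-side sketch below (an assumption for illustration, not the authors' instrumentation).

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Report the fraction of device memory currently in use on one GPU.
// This is one possible way to obtain utilization figures such as those
// in Table 8; it is only an illustrative sketch.
static void report_memory_utilization(int device) {
    cudaSetDevice(device);
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    const double used = static_cast<double>(total_bytes - free_bytes);
    std::printf("GPU %d: %.2f%% of %.1f GB used\n",
                device, 100.0 * used / total_bytes, total_bytes / 1e9);
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d)   // e.g., the four P100s of the test system
        report_memory_utilization(d);
    return 0;
}
```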

6.4. Data transfer times of the CPU and GPU

The data transfer between the CPU (host) and the GPU (device) is of two types: image-dependent and image-independent. The latter type comprises data that can be reused for any image of the same size, namely the radial kernel Ip and the angular kernel Iq values. The former type comprises the image pixels; thus, this data is image dependent. Both types of data transfer between the host and the device(s) incur a time overhead. Table 9 lists the required time of both types for the single-GPU and 4×GPU implementations. The general rule is that the larger the image size and the moment order, the longer the data transfer time.

For the image-independent data, the transferred data volume is determined by the image size and the moment order. The angular kernel Iq is the dominating portion, with N × N × (qmax + 1) values, where N represents the number of image rows and columns and qmax represents the angular kernel order; each value is a complex number, i.e., two double variables.


TABLE 8: GPU(s) memory utilization for the moment computations

Image size      | Mom. order | GPU            | CPU+4×GPU
256 × 256       | 20         | 1.79%          | 0.50%
256 × 256       | 40         | 2.27%          | 1.68%
256 × 256       | 60         | 2.46%          | 1.70%
512 × 512       | 20         | 2.51%          | 1.76%
512 × 512       | 40         | 3.13%          | 2.28%
512 × 512       | 60         | 3.77%          | 2.43%
1024 × 1024     | 20         | 4.21%          | 2.06%
1024 × 1024     | 40         | 6.41%          | 2.61%
1024 × 1024     | 60         | 8.67%          | 2.46%
2,048 × 2,048   | 20         | 10.93%         | 4.08%
2,048 × 2,048   | 40         | 19.16%         | 6.07%
2,048 × 2,048   | 60         | 27.61%         | 8.10%
4,096 × 4,096   | 20         | 37.61%         | 10.46%
4,096 × 4,096   | 40         | 69.80%         | 18.19%
4,096 × 4,096   | 60         | out of memory  | 25.90%

[FIGURE 9: Different moment computing implementation throughputs (images processed per minute) for the CPU, GPU (P100), CPU+P100, and CPU+4×P100 implementations. The underlying chart data:]

(Image size) Mom. order | CPU   | GPU (P100) | CPU+P100 | CPU+4×P100
256×256    20           | 94.4  | 334.7      | 350.0    | 617.3
256×256    40           | 31.5  | 112.5      | 147.5    | 335.8
256×256    60           | 14.0  | 57.1       | 66.4     | 218.9
512×512    20           | 23.4  | 113.8      | 116.6    | 314.1
512×512    40           | 7.9   | 30.3       | 37.2     | 116.5
512×512    60           | 3.5   | 13.8       | 16.4     | 57.3
1024×1024  20           | 5.8   | 30.3       | 35.6     | 90.3
1024×1024  40           | 2.0   | 7.9        | 9.8      | 34.6
1024×1024  60           | 0.9   | 3.5        | 4.5      | 15.8

[FIGURE 10: Different image reconstruction implementation throughputs (interpolated image reconstruction, images processed per minute) for the CPU, GPU (P100), CPU+P100, and CPU+4×P100 implementations. The underlying chart data:]

(Image size) Mom. order | CPU   | GPU (P100) | CPU+P100 | CPU+4×P100
256×256    20           | 24.4  | 80.7       | 102.2    | 288.3
256×256    40           | 6.3   | 20.8       | 26.4     | 79.4
256×256    60           | 2.8   | 9.4        | 12.0     | 35.9
512×512    20           | 6.2   | 22.8       | 27.0     | 98.6
512×512    40           | 1.6   | 5.8        | 6.9      | 25.7
512×512    60           | 0.7   | 2.6        | 3.1      | 11.6
1024×1024  20           | 1.6   | 6.0        | 7.1      | 25.3
1024×1024  40           | 0.4   | 1.5        | 1.8      | 6.5
1024×1024  60           | 0.2   | 0.7        | 0.8      | 2.9

In Table 9, the transfer times of the image-independent data increase as the image size and the moment order increase. The transfer time with 4×GPUs is longer than with a single GPU because the CPU memory buses are used by four GPUs instead of a single GPU.

For the image-dependent data, the transfer goes both ways. First, the image pixel values are transferred from the host to the device; here, the moment order has no effect on the data transfer time. Second, after computing the moments, the device transfers the computed moment instances (the mmitig partial results of the loop mitigation technique) back to the host memory; the higher the moment order, the more data is transferred. Using 4×GPUs slightly improves the transfer time, as only certain portions (rings) of the image pixels need to be moved to each GPU. Finally, Table 9 lists the data transfer times for the moment computations only. The ratio of the data transfer time to the computational time of the image reconstruction task is relatively tiny; for instance, this ratio is 0.4% and 0.03% for images of 256 × 256 pixels and 1,024 × 1,024 pixels, respectively, where the moment order is 20.
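To make the two transfer types concrete, the following CUDA host-side sketch mirrors the flow described above; the array names, element types, and sizes are illustrative assumptions rather than the authors' exact code.

```cpp
#include <cstddef>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Illustrative sketch of the two host<->device transfer types discussed above.
// h_iq/h_pixels/h_moments and their sizes are assumptions for illustration.
void transfer_example(const cuDoubleComplex* h_iq, size_t iq_count,      // image-independent: Iq kernel
                      const double* h_pixels, size_t pixel_count,        // image-dependent: color pixels
                      cuDoubleComplex* h_moments, size_t moment_count) { // image-dependent: results
    cuDoubleComplex* d_iq = nullptr;
    cuDoubleComplex* d_moments = nullptr;
    double* d_pixels = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_iq), iq_count * sizeof(cuDoubleComplex));
    cudaMalloc(reinterpret_cast<void**>(&d_pixels), pixel_count * sizeof(double));
    cudaMalloc(reinterpret_cast<void**>(&d_moments), moment_count * sizeof(cuDoubleComplex));

    // Image-independent data: the Ip/Iq kernels depend only on the image size and
    // moment order, so they can be copied once and reused for every image of that size.
    cudaMemcpy(d_iq, h_iq, iq_count * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

    // Image-dependent data: the pixel values must be copied for each new image.
    cudaMemcpy(d_pixels, h_pixels, pixel_count * sizeof(double), cudaMemcpyHostToDevice);

    // ... launch the moment-computation kernel(s) here ...

    // After the computation, the (much smaller) moment instances are copied back.
    cudaMemcpy(h_moments, d_moments, moment_count * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);

    cudaFree(d_iq);
    cudaFree(d_pixels);
    cudaFree(d_moments);
}
```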


TABLE 9: Data transfer times (in seconds) between the CPU and the GPU(s) for the moment computations

Image size    | Mom. order | GPU, image independ. | GPU, image depend. | 4×GPU, image independ. | 4×GPU, image depend.
256 × 256     | 20         | 0.16 ± 0.00          | 0.02 ± 0.00        | 0.47 ± 0.01            | 0.01 ± 0.00
256 × 256     | 40         | 0.17 ± 0.00          | 0.06 ± 0.00        | 0.48 ± 0.01            | 0.05 ± 0.00
256 × 256     | 60         | 0.18 ± 0.01          | 0.12 ± 0.00        | 0.48 ± 0.01            | 0.11 ± 0.00
512 × 512     | 20         | 0.18 ± 0.01          | 0.04 ± 0.00        | 0.48 ± 0.01            | 0.03 ± 0.00
512 × 512     | 40         | 0.21 ± 0.01          | 0.12 ± 0.00        | 0.48 ± 0.01            | 0.10 ± 0.00
512 × 512     | 60         | 0.23 ± 0.01          | 0.25 ± 0.00        | 0.48 ± 0.01            | 0.23 ± 0.00
1024 × 1024   | 20         | 0.22 ± 0.01          | 0.09 ± 0.00        | 0.49 ± 0.01            | 0.07 ± 0.00
1024 × 1024   | 40         | 0.31 ± 0.01          | 0.25 ± 0.01        | 0.50 ± 0.02            | 0.22 ± 0.01
1024 × 1024   | 60         | 0.40 ± 0.02          | 0.52 ± 0.00        | 0.57 ± 0.20            | 0.45 ± 0.02

6.5. Performance Analysis

In Table 2, the three-loop fusion speedup almost doubles as the image size doubles because of the increase in computation time. For example, for an image of 256 × 256 pixels, the number of rings is M = 128; thus, the synchronization step is executed 128 times. Doubling the image size to 512 × 512 doubles both the number of rings and the number of times the synchronization step is executed. On the other hand, the number of pixels in the latter image is four times that of the former image. Thus, the computational time increases more than the synchronization time, which leads to a better speedup. Despite this improvement, the CPU-only implementation with 16 cores performs better than the GPU-only implementation using the three-loop fusion; the speedups obtained are 15.17× (last row of Table 3) and 12.93× (last row of Table 2), respectively.

The proposed implementation achieves 257× and 277× speedups for the interpolated color image reconstruction and the QPCET moment computation, respectively, using one CPU with 16 cores and four P100 GPUs, with a moment order of 60 and an image of 1,024 × 1,024 pixels. These execution times of the CPU-GPU implementations are three and two


orders of magnitude shorter than the sequential execution times for moment computation and image reconstruction, respectively. The reported speedups and time improvements in Tables 3 and 4 show that the loop mitigation technique scales smoothly when a CPU is used with multiple GPUs. For instance, for a low moment order of 20 with a limited level of parallelism, loop mitigation boosts the level of parallelism to achieve a speedup of 78.97× using the CPU-GPUs implementation, as shown in Table 3. As the moment order and the image size increase, the speedups increase; the speedup reaches 276.89× for a 1,024 × 1,024 pixels image with a moment order of 60. The image reconstruction results, in Tables 5 and 6, show the same behavior as the moment computations.

Comparing the throughput of the loop mitigation technique on the GPU only with that obtained using a CPU together with a single GPU, the latter has a slight advantage. To keep the CPU busy while the GPU is working, a portion of the GPU workload, i.e., some of the ring pairs, is assigned to the CPU. The more ring pairs that are assigned to the GPU, the higher the level of parallelism on the GPU side. Thus, sharing the load increases the overall throughput but decreases the GPU throughput.
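The sketch below only illustrates this work-sharing pattern under stated assumptions; the worker function names, wait_for_gpu, and the cpu_share parameter are hypothetical, and the paper's actual split is chosen by its multiple-regression load-balancing model rather than a fixed fraction.

```cpp
#include <cstddef>
#include <omp.h>

// Hypothetical workers (not from the paper): each processes the QPCET
// contributions of ring pairs, on the CPU or on the GPU respectively.
void process_ring_pair_on_cpu(std::size_t pair_index);
void process_ring_pairs_on_gpu(std::size_t first_pair, std::size_t count);  // assumed asynchronous
void wait_for_gpu();  // hypothetical, e.g., a cudaDeviceSynchronize() wrapper

// Split the ring pairs between the multicore CPU and the GPU; cpu_share is the
// fraction (0.0 - 1.0) of the pairs assigned to the CPU.
void compute_moments_hybrid(std::size_t num_ring_pairs, double cpu_share) {
    const std::size_t cpu_pairs =
        static_cast<std::size_t>(cpu_share * static_cast<double>(num_ring_pairs));

    // GPU side: launch asynchronously over its contiguous block of ring pairs,
    // so the host thread can immediately join the CPU loop below.
    process_ring_pairs_on_gpu(cpu_pairs, num_ring_pairs - cpu_pairs);

    // CPU side: the remaining ring pairs are spread over the available cores.
    #pragma omp parallel for schedule(dynamic)
    for (long long p = 0; p < static_cast<long long>(cpu_pairs); ++p)
        process_ring_pair_on_cpu(static_cast<std::size_t>(p));

    wait_for_gpu();  // ensure the GPU portion has finished before combining results
}
```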

Finally, from a parallel processing perspective, the number of FLOPS outlines the improvement from one implementation to another. For instance, the double-precision implementation of the 2-loop fusion in [13], on a P100 GPU, performs 1.13 × 10^8, 1.12 × 10^9, and 1.10 × 10^9 FLOPS for images of 256 × 256, 512 × 512, and 1,024 × 1,024 pixels, respectively, where the moment order is 60. On the other hand, the proposed double-precision implementation using the loop mitigation technique, on a P100 GPU, performs 5.26 × 10^9, 5.21 × 10^9, and 5.19 × 10^9 FLOPS for images of 256 × 256, 512 × 512, and 1,024 × 1,024 pixels, respectively, where the moment order is 60. These results show that the proposed implementation improves on the existing method by approximately five times.

These numbers of FLOPS are far from the theoretical computational peak of the P100 GPU, 4.3 × 10^12 FLOPS, for two reasons. First, the theoretical peak assumes a flow of ready operations without considering the memory latency factor. Second, the QPCET algorithm performs computations across a large volume of data elements. For instance, for an image of 256 × 256 pixels and moment order 20, one independent task (GPU thread) of the 2-loop fusion in [13] should access 3 × (256 × 256) double variables of the image pixels; 21 × (256 × 256) complex numbers of the Iq kernel, each consisting of two double variables; and 21 × 128 complex numbers, where 128 is the number of image rings. Fig. 4 explains the sizes of these data structures in more detail. Thus, in total one thread accesses 3,014,656 double variables, each of eight bytes. This huge volume of data accesses affects the GPU processing performance. On the other hand, the proposed method improves this performance by reducing the number of data accesses per thread, dividing them according to the mmitig value. For instance, using mmitig = 64 for the example discussed above, each thread

of the proposed method accesses 47,104 double variables only, as each thread should handle the data of two rings only. These values of data access increase as the image size and the moment order increase. 6.6. Color image watermarking 735

The efficiency of the proposed method is tested in watermarking of color

Three experiments are performed using three different well-known

rna

images.

datasets. The first experiment is performed with the high-resolution fundus (HRF) color image dataset [35]. This dataset contains 45 color images of the same size 3,504×2,336 and divided into 15 images of healthy patients, 15 images 740

of patients with diabetic retinopathy, and 15 images of glaucomatous patients. The dataset images are resized to be 2,048 × 2,048. Nine randomly selected

Jou

color images are displayed in Fig. 11a. The second experiment is performed with the MEDNOD dataset of skin

lesions [36]. This dataset contains 170 color images of different sizes and divided

745

into two classes. The first class contains 70 color images of malignant melanoma while the second class contains 100 color images of nevus lesion. For simplicity, all color images, of this dataset, are resized to be 1,024 × 1,024. Six randomly

selected color images are displayed in Fig. 11b. 38

Journal Pre-proof

datasets

Dataset

lP repro of

TABLE 10: Average color image watermarking exec. times in second and speedups of three

Sequential

GPU

4×GPUs

time

time

speedup

time

speedup

Soccer Team

394.22

7.31

53.93x

2.12

185.95x

MEDNOD

1573.71

29.51

53.33x

8.61

182.78x

HRF

6229.31

116.21

53.60x

34.24

181.93x

The third experiment is performed with the soccer teams color image dataset 750

[37]. This dataset contains 280 color images of different sizes. These images are

divided into 7 classes for 7 soccer teams, A.C. Milan, Barcelona, Chelsea, Juven-

tus, Liverpool, Real Madrid, and PSV Eindhoven where each class containing 40 images. For simplicity, all color images are resized to be 512 × 512. Eight color images are displayed in Fig. 11c. 755

Fig. 12 depicts the average logarithmic execution times of the color images

watermarking for the utilized three datasets. The reported execution times are

for the sequential, GPU, and 4×GPUs implementations of a single image of each dataset with moment order of 30. Table 10 lists the execution times in seconds

760

rna

with the obtained speedup for the same datasets and implementations.

7. CONCLUSION

This paper proposed parallel implementations for the QPCET moment and image reconstruction based on the computed moment for CPU-only, GPU-only,

Jou

and CPU-GPU architectures using OpenMP and CUDA. In this context, the implementations include different techniques to address the low level of paral-

765

lelism for low order moments and workload imbalances between heterogeneous PUs. The loop mitigation technique is proposed to boost the level of paral-

lelism, reduce the GPU space complexity and synchronization overhead, and balance the workloads. In addition, multiple regression is used for load balancing between CPUs and GPUs. Finally, the performance of the proposed 39

lP repro of

Journal Pre-proof

(a)HRF sample images

Jou

rna

(b)MEDNOD sample images

(c)Soccer teams dataset sample images

FIGURE 11: Samples of three different color image datasets for watermarking

40

Journal Pre-proof

lP repro of

Logarithmic values of execution times for watermarking time of different datasets Log10(exec. time)

4 3.5 3 2.5 2 1.5 1 0.5 0 Soccer teams(512×512)

MEDNOD(1,024×1,024)

Sequential

GPU

HRF(2,048×2,048)

4×GPUs

FIGURE 12: Logarithmic values of execution times for watermarking time of different datasets

770

implementation is evaluated and reported to underline the effectiveness of the proposed parallel implementations through using different applications. For all the experiments, the proposed CPU-GPU implementation achieves significant

average speedups of 257× and 277× for color image reconstruction and quaternion moment computation, respectively. For our future work, the proposed 775

implementations will be extended to the generic quaternion domain. Thus, the method could be adapted, with a simple modification, to deal with other quater-

rna

nion image moments, such as quaternion exponents, quaternion radial Harmonic Fourier moments, and quaternion ZMs.

8. Acknowledgments 780

This work was supported in part by the Program of China under Grant 2016YFB0201402, and in part by the National Natural Science Foundation of

Jou

China under Grant 61702170, Grant 61602350, Grant 61602170, and Grant 61750110531.

References

785

[1] L.-Q. Guo, M. Zhu, Quaternion Fourier–Mellin moments for color images, Pattern Recognition 44 (2) (2011) 187–195. 41

Journal Pre-proof

lP repro of

[2] H.-y. Yang, L.-l. Liang, Y.-w. Li, X.-y. Wang, Quaternion exponent moments and their invariants for color image, Fundamenta Informaticae 145 (2) (2016) 189–205. 790

[3] B. Kaur, S. Singh, J. Kumar, Iris recognition using Zernike moments and

polar harmonic transforms, Arabian Journal for Science and Engineering (2018) 1–10.

[4] B. Kaur, S. Singh, J. Kumar, Cross-sensor iris spoofing detection using orthogonal features, Computers & Electrical Engineering 73 (2019) 279– 795

288.

[5] Z. Xia, X. Wang, W. Zhou, R. Li, C. Wang, C. Zhang, Color medical image

lossless watermarking using chaotic system and accurate quaternion polar harmonic transforms, Signal Processing 157 (2019) 108–118.

[6] J. Zhong, Y. Gan, J. Young, P. Lin, Copy move forgery image detection via 800

discrete radon and polar complex exponential transform-based moment in-

variant features, International Journal of Pattern Recognition and Artificial Intelligence 31 (02) (2017) 1754005.

rna

[7] K. M. Hosny, M. M. Darwish, Highly accurate and numerically stable higher

order QPCET moments for color image representation, Pattern Recognition 805

Letters 97 (2017) 29–36.

[8] M. J. M. Requena, P. Moscato, M. Ujald´ on, Efficient data partitioning for the GPU computation of moment functions, Journal of Parallel and Distributed Computing 74 (1) (2014) 1994–2004.

Jou

[9] K. M. Hosny, A. Salah, H. I. Saleh, M. Sayed, Fast computation of 2D and

810

3D Legendre moments using multi-core CPUs and GPU parallel architectures, Journal of Real-Time Image Processing (2017) 1–15.

[10] Y. Xuan, D. Li, W. Han, Efficient optimization approach for fast GPU computation of Zernike moments, Journal of Parallel and Distributed Computing 111 (2018) 104–114. 42

Journal Pre-proof

[11] Y. Xin, M. Pawlak, S. Liao, Accurate computation of Zernike moments in

lP repro of

815

polar coordinates, IEEE Trans. on Image Processing 16 (2) (2007) 581–587.

[12] S. Farokhi, S. M. Shamsuddin, J. Flusser, U. U. Sheikh, M. Khansari,

K. Jafari-Khouzani, Rotation and noise invariant near-infrared face recognition by means of Zernike moments and spectral regression discriminant 820

analysis, Journal of Electronic Imaging 22 (1) (2013) 013030.

[13] K. M. Hosny, M. M. Darwish, K. Li, A. Salah, Parallel multi-core CPU and GPU for fast and robust medical image watermarking, IEEE Access 6 (2018) 77212–77225.

[14] L. Dagum, R. Menon, Openmp: an industry standard API for shared825

memory programming, IEEE computational science and engineering 5 (1) (1998) 46–55.

[15] C. Nvidia, Nvidia CUDA C programming guide, Nvidia Corporation 120 (18) (2011) 8.

[16] J. S. Vetter, S. Mittal, Opportunities for nonvolatile memory systems in 830

extreme-scale high-performance computing, Computing in Science & Engi-

rna

neering 17 (2) (2015) 73–82.

[17] W. R. Hamilton, Elements of quaternions, Longmans, Green, & Company, 1866.

[18] T. A. Ell, S. J. Sangwine, Hypercomplex Fourier transforms of color images, 835

IEEE Trans. on image processing 16 (1) (2007) 22–35.

Jou

[19] C. Wang, X. Wang, Y. Li, Z. Xia, C. Zhang, Quaternion polar harmonic Fourier moments for color images, Information Sciences 450 (2018) 141– 156.

[20] P.-T. Yap, X. Jiang, A. C. Kot, Two-dimensional polar harmonic trans-

840

forms for invariant image representation, IEEE Trans. on Pattern Analysis and Machine Intelligence 32 (7) (2010) 1259–1270. 43

Journal Pre-proof

lP repro of

[21] X.-y. Wang, W.-y. Li, H.-y. Yang, P. Wang, Y.-w. Li, Quaternion polar complex exponential transform for invariant color image description, Applied Math. and Computation 256 (2015) 951–967. 845

[22] K. M. Hosny, M. M. Darwish, Accurate computation of quaternion polar complex exponential transform for color images in different coordinate systems, Journal of Electronic Imaging 26 (2) (2017) 023021.

[23] J. Chen, K. Li, Z. Tang, K. Bilal, S. Yu, C. Weng, K. Li, A parallel random forest algorithm for big data in a Spark cloud computing environment, 850

IEEE Trans. on Parallel and Distributed Systems 28 (4) (2017) 919–933.

[24] K. Li, W. Yang, K. Li, Performance analysis and optimization for SpMV on GPU using probabilistic modeling, IEEE Trans. on Parallel and Distributed Systems 26 (1) (2015) 196–205.

[25] S. Mittal, J. S. Vetter, A survey of CPU-GPU heterogeneous computing 855

techniques, ACM Computing Surveys 47 (4) (2015) 69.

[26] Q. Liu, W. Luk, Heterogeneous systems for energy efficient scientific com-

puting, in: International Symposium on Applied Reconfigurable Comput-

rna

ing, Springer, 2012, pp. 64–75.

[27] Y. Ogata, T. Endo, N. Maruyama, S. Matsuoka, An efficient, model-based 860

CPU-GPU heterogeneous FFT library, in: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, IEEE, 2008, pp. 1–10.

Jou

[28] V. Podlozhnyuk, Image convolution with cuda, NVIDIA Corporation white paper, June 2097 (3).

865

[29] H. Jiang, N. Ganesan, Cudampf: a multi-tiered parallel framework for accelerating protein sequence search in hmmer on cuda-enabled gpu, BMC bioinformatics 17 (1) (2016) 106.

44

Journal Pre-proof

lP repro of

[30] A. Di Biagio, G. Agosta, Improved programming of gpu architectures

through automated data allocation and loop restructuring, in: 23th In870

ternational Conference on Architecture of Computing Systems 2010, VDE, 2010, pp. 1–8.

[31] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, M. Min, Nekbone

performance on gpus with openacc and cuda fortran implementations, The Journal of Supercomputing 72 (11) (2016) 4160–4180. 875

[32] R. Reyes, F. de Sande, Optimization strategies in different cuda architec-

tures using llcomp, Microprocessors and Microsystems 36 (2) (2012) 78–87.

[33] C.-K. Luk, S. Hong, H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, in: Proceedings of the 42nd

Annual IEEE/ACM International Symposium on Microarchitecture, ACM, 880

2009, pp. 45–55.

[34] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-UCSD birds 200.

[35] A. Budai, R. Bock, A. Maier, J. Hornegger, G. Michelson, Robust vessel

885

2013.

rna

segmentation in fundus images, International journal of biomedical imaging

[36] I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman, N. Petkov, Med-node: a computer-assisted melanoma diagnosis system using nondermoscopic images, Expert systems with applications 42 (19) (2015) 6578–

Jou

6585. 890

[37] J. Van De Weijer, C. Schmid, Coloring local feature extraction, in: European conference on computer vision, Springer, 2006, pp. 334–348.

45

Journal Pre-proof

Jou

rna

lP repro of

Highlights: 1. We proposed an accurate parallel implementation for the QPCET moment with zero ignored pixels. 2. We proposed a polar moment-based optimized prediction of the balanced workload of heterogeneous processing units (PUs). 3. We proposed a technique for boosting the level of parallelism, reducing the synchronization cost, and load balancing for loop iterations with unequal computational loads called loop mitigation. 4. We proposed a set of parallel implementations including multicore CPUonly, GPU-only and CPU-GPU implementations for the QPCET moment.

lP repro of

Journal Pre-proof

Jou

rna

The authors declares that there is no conflict of interest regarding the publication of this article.

Journal Pre-proof Ahmad Salah received the PhD degree in computer science from Hunan University, China. He received the Master degree in CS from Ain-shams University, Cairo, Egypt. His current research interests are parallel computing, computational biology and algorithms.

lP repro of

Kenli Li received the PhD degree in computer science from Huazhong University of Science and Technology, China, in 2003. He was a visiting scholar at the University of Illinois at Urbana-Champaign from 2004 to 2005. He is currently a full professor of computer science and technology at Hunan University and deputy director of the National Supercomputing Center in Changsha. His major research areas include parallel computing, high-performance computing, and grid and cloud computing. He has published more than 130 research papers in international conferences and journals, such as IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, Journal of Parallel and Distributed Computing, ICPP, CCGrid. He is an outstanding member of CCF. He is a member of the IEEE and serves on the editorial board of the IEEE Transactions on Computers.

Khalid M. Hosny is a professor of information technology, faculty of Computers and Informatics at Zagazig University. He received the B.Sc., M.Sc. and Ph.D. from Zagazig University, Zagazig, Egypt in 1988, 1994, and 2000 respectively. From 1997 to 1999 he was a visiting scholar, University of Michigan, Ann Arbor and University of Cincinnati, Cincinnati, USA. He is a senior member of ACM and a member of IEEE and the IEEE computer society. His research interests include image processing, pattern recognition and computer vision. Dr. Hosny published more than 50 papers in international journals. He is an editor and scientific reviewer for more than thirty international journals.

rna

Mohamed M. Darwish is assistant professor of Computer science, Faculty of Science, Assiut University, Assiut, Egypt. He received the B.Sc. (Hons.), M.Sc. and Ph.D. in Computer Science from Faculty of Science, Assiut University. His research area is Image Processing, Association Rule Mining and Medical Image Analysis.

Jou

Qi Tian received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the Ph.D. degree in ECE from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2002, and the M.S. degree in ECE from Drexel University, Philadelphia, PA, USA in 1996. He is currently a Full Professor with the Department of Computer Science, University of Texas at San Antonio (UTSA), San Antonio, TX, USA. Dr. Tian is the Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA (TMM), the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), Multimedia System Journal (MMSJ), and is on the Editorial Board of Journal of Multimedia (JMM), and Journal of Machine Vision and Applications (MVA). He is the Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and the Journal of Computer Vision and Image Understanding. His research projects are funded by ARO, NSF, DHS, Google, FXPAL, NEC, SALSI, CIAS, Akiira Media Systems, HP, Blippar, and UTSA. He was the recipient of the 2014 Research Achievement Award from College of Science, UTSA. He was the recipient of the 2010 ACM Service Award. He was the coauthor %of a Best Paper in ACM ICMR 2015, a Best Paper in PCM 2013, a Best Paper in MMM 2013, a Best Paper in ACM ICIMCS 2012, a Top 10 Paper Award in MMSP 2011, a Best Student Paper in ICASSP 2006, and coauthor of a Best Student Paper Candidate in ICME 2015, and a Best Paper Candidate in PCM 2007.

Journal Pre-proof

lP repro of

Ahmad Salah

Kenli Li

rna

Khalid M. Hosny

Jou

Mohamed Darwish

Qi Tian