Accelerated CPU-GPUs Implementations for Quaternion Polar Harmonic Transform of Color Images

Ahmad Salah(a,b), Kenli Li(a), Khalid M. Hosny(b), Mohamed Darwish(c), and Qi Tian(d)

(a) College of Computer Science and Electrical Engineering, Hunan University, Changsha 410082, China (e-mail: {ahmad, lkl}@hnu.edu.cn)
(b) Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt (e-mail: k_hosny@yahoo.com)
(c) Faculty of Science, Assiut University, Assiut, Egypt
(d) The University of Texas at San Antonio, USA (e-mail: [email protected] [email protected])

* Corresponding author: Kenli Li (email address: [email protected])
Abstract
Image moments are used to capture image features. Moments are successfully used in object description, recognition, and other applications. However, image representation and recognition using quaternion moments are compute-intensive processes. This obstacle makes this type of moment unsuitable for real-time or large-scale applications, despite their high accuracy. In this work, we expose the challenges of parallelizing quaternion moment computations and transform the sequential algorithm into a parallel-friendly counterpart. We propose a set of parallel implementations for quaternion moment image representation on different parallel architectures. The proposed implementations target multicore CPU-only, GPU-only, and hybrid CPU-GPU (multicore CPU with multiple GPUs) environments. The loop mitigation technique is proposed to boost the level of parallelism in massively parallel environments, balance the parallel workload, and reduce both the space complexity and the synchronization overhead of the proposed implementations. The loop mitigation technique could also be applicable to other applications with similar loop imbalances. Finally, experiments are performed to evaluate the proposed implementations. Applying a moment order of 60 to a color image of 1,024 × 1,024 pixels, the proposed implementation achieved, on four P100 GPUs and a CPU with 16 cores, speedups of 257× and 277× over the baseline performance on a single Intel Xeon E5-2609 CPU core for image reconstruction and quaternion moment computation, respectively. In addition, a 180× speedup is achieved for a color image watermarking application.
Keywords: Parallel Implementation, Speedup, CPU-GPU, Multi-GPU, Quaternion moments.
1. Introduction
Color images play an essential role in our lives, and the representation and recognition of color images are crucial tasks. Recently, quaternion moments have been used for extracting features of color images [1, 2]. Quaternion polar harmonic transform (QPHT) moments and their variants are powerful tools used in many image processing and pattern recognition applications, such as biometric recognition [3], cross-sensor iris spoofing detection [4], medical color image watermarking [5], and copy-move image forgery detection [6].

The computation of QPHT moments faces two major problems, namely, accuracy and computational demand. In [7], the authors proposed a novel recurrence method for the accurate computation of the quaternion polar complex exponential transform (QPCET) for high-order moments. Using these methods, attempts to accelerate the computational process resulted in much shorter elapsed CPU times than those of the state-of-the-art methods.

Despite the high accuracy of the aforementioned methods and the advances in CPU technology, the problem of long computational times for real-time applications remained unresolved. The computational process of color image moments is intensive, especially for higher orders and large images, so these methods are impractical for real-time applications. The use of parallel architectures in computing the quaternion moments of color images is the main
key to addressing this compute-intensive problem.
There have been several attempts to parallelize image moment computations. In [8], an efficient data partitioning that uses a cluster of GPUs is proposed for computing both Zernike and Legendre moments of gray-scale images in Cartesian coordinates. Manuel et al. used the conventional zeroth-order approximation (ZOA) method to compute both Zernike and Legendre moments [8]. These researchers employed symmetric pixel values to reduce the computational times of these inaccurate moments. Hosny et al. [9] proposed new parallel algorithms to accelerate the exact Legendre moments for 2D and 3D grayscale images and objects. These algorithms use multicore CPU and GPU parallel architectures, where each pixel/voxel of the digital image/object can be handled independently.

Recently, Xuan et al. [10] used the conventional approximated ZOA method to compute Zernike moments (ZMs). These researchers proposed a new image re-layout for their parallel implementation and used the symmetry property to reduce the computational times of ZMs for grayscale images. In this approximated method, the pixels of the input image are mapped over the unit circle. Since the pixels in the four corners lie outside of the unit circle, these pixels are ignored. For an input image of size 8 × 8, 12 pixels are ignored; approximately 20% of the pixels of the input image are discarded. This omission leads to a significant loss of image information and makes the computation inaccurate.

The original equation for ZMs contains a double integral over a unit circle [11]. In the approximated ZOA method, these double integrals are replaced by double summations, which are another source of numerical error. High-order moments are required for certain applications such as image reconstruction and image analysis and representation. In contrast, applications such as face recognition use low-order moments [12]. The accumulation of these errors limits the method of Xuan et al. [10] to low-order moments, while image reconstruction requires high-order moments. In their work [10], Xuan et al. did not report any experiments regarding image reconstruction.
All the aforementioned published parallel works are devoted to handling grayscale images, and the computational complexity of the quaternion moments of color images is much higher than that of grayscale images.

The first attempt at a parallel implementation of quaternion moments for color images was made by the authors in [13], where a straightforward parallel implementation for medical image watermarking using CPUs and GPUs was performed using a simple two-loop fusion. A speedup of less than 20 times was achieved, with the baseline implementation utilizing a single CPU. This relatively limited speedup motivates us to propose novel and more efficient parallel implementations to compute the quaternion polar harmonic transforms of color images.
The recent advances in parallel architectures, i.e., multicore CPUs and graphics processing units (GPUs), have made access to massively parallel resources affordable. Moreover, efforts have been made toward improving the programming support for efficient parallel implementations, i.e., OpenMP [14] and CUDA [15] for multicore CPUs and NVIDIA GPUs, respectively. In addition, many researchers have used hybrid CPU-GPU systems to exploit the otherwise idle CPU while waiting for the GPU(s) to finish the assigned workload(s) [16]. While these heterogeneous computing systems provide massively parallel environments, fully using the parallel resources is a challenging task, especially for computing QPCET moments at low orders.

Therefore, leveraging GPU implementations to significantly accelerate QPCET moment computations is a practically useful and promising direction. These implementations should speed up the sequential QPCET moment computation while maintaining the accuracy of the sequential methods. The main contributions of the proposed work are:
1. We propose an accurate parallel implementation of the QPCET moments with zero ignored pixels. The implementations are generic for any quaternion-based moments.

2. We propose a polar-moment-based optimized prediction of the balanced workloads of heterogeneous processing units (PUs).

3. We propose a technique, called loop mitigation, for boosting the level of parallelism, reducing the synchronization cost, and balancing the load of loop iterations with unequal computational loads.

4. We propose a set of parallel implementations, including multicore CPU-only, GPU-only, and CPU-GPU implementations, for the QPCET moments.

In this work, the 4-point and 8-point symmetries of the polar system are used to achieve shorter computational times. Finally, the results show significant speedups, over the baseline of a single CPU core, for the QPCET moment and image reconstruction operations for different image sizes and moment orders.

The rest of the paper is organized as follows. Section 2 presents preliminary concepts on the quaternion representation of color images and CPU-GPU heterogeneous computing. Section 3 explains the QPCET moments from a computational perspective. Based on this perspective, Section 4 lists the challenges of parallelizing the QPCET moment computation. The details of the proposed implementations are given in Section 5. The experimental results and discussions are reported in Section 6. Finally, we draw conclusions and discuss future directions of the proposed work in Section 7.
2. Background
2.1. Quaternion Representation of Color Images

In 1866, Hamilton generalized the definition of complex numbers and defined the "quaternion" [17]. He defined quaternions with one real component and three imaginary components as follows:

$$q = a + bi + cj + dk \tag{1}$$

The conjugate and the modulus of the quaternion q are defined as follows:

$$q^{*} = a - bi - cj - dk; \qquad |q| = \sqrt{a^{2} + b^{2} + c^{2} + d^{2}} \tag{2}$$

where a, b, c, and d are real numbers, while i, j, and k are imaginary units defined according to the following relationships:

$$i^{2} = j^{2} = k^{2} = ijk = -1; \quad ij = -ji = k; \quad jk = -kj = i; \quad ki = -ik = j \tag{3}$$

When the real part a = 0, the quaternion is called a pure quaternion. Ell and Sangwine [18] used quaternions to represent color images. A color image, f(r, θ), is represented using a pure quaternion-based model:

$$f(r, \theta) = f_{R}(r, \theta)\,i + f_{G}(r, \theta)\,j + f_{B}(r, \theta)\,k \tag{4}$$

where the red, green, and blue channels of the color image are represented by $f_{R}(r, \theta)$, $f_{G}(r, \theta)$, and $f_{B}(r, \theta)$, respectively.
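A pure quaternion maps naturally onto a small data structure. The following C++ sketch is illustrative only (the struct and function names are ours, not taken from the reference implementation); it stores one color pixel as a pure quaternion per Eq. (4) and evaluates the conjugate and modulus of Eq. (2):

```cpp
#include <cmath>

// A quaternion q = a + b*i + c*j + d*k with double-precision components.
struct Quaternion {
    double a, b, c, d;

    // Conjugate per Eq. (2): q* = a - b*i - c*j - d*k.
    Quaternion conjugate() const { return {a, -b, -c, -d}; }

    // Modulus per Eq. (2): |q| = sqrt(a^2 + b^2 + c^2 + d^2).
    double modulus() const { return std::sqrt(a * a + b * b + c * c + d * d); }
};

// A color pixel as a pure quaternion (a = 0), Eq. (4):
// f(r, theta) = fR*i + fG*j + fB*k.
inline Quaternion colorPixelToQuaternion(double fR, double fG, double fB) {
    return {0.0, fR, fG, fB};
}
```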
2.2. Quaternion Polar Complex Exponential Transform

The Polar Complex Exponential Transform (PCET) for gray-level images is defined as follows [19, 20]:

$$M_{pq} = \frac{1}{\pi}\int_{0}^{2\pi}\!\int_{0}^{1} e^{-\hat{i}\,2\pi p r^{2}}\, e^{-\hat{i}\,q\theta}\, f(r,\theta)\, r\, dr\, d\theta \tag{5}$$

where the order p and the repetition q are p = q = 0, 1, 2, ..., ∞ and $\hat{i} = \sqrt{-1}$.
Since the multiplication of quaternions is not commutative, the right-side and the left-side QPCET moments are different. The right-side QPCET moments of the RGB color image, f(r, θ), in polar coordinates are defined as [21]:

$$M^{R}_{pq} = \frac{1}{\pi}\int_{0}^{2\pi}\!\int_{0}^{1} f(r,\theta)\, e^{-\mu 2\pi p r^{2}}\, e^{-\mu q\theta}\, r\, dr\, d\theta \tag{6}$$
where $\mu = (i + j + k)/\sqrt{3}$ is called a unit pure quaternion.

Based on the orthogonality of the QPCET, the color image f(r, θ) can be reconstructed as follows:

$$\hat{f}(r,\theta) = \sum_{p=-\infty}^{\infty}\sum_{q=-\infty}^{\infty} M^{R}_{pq}\, e^{\mu 2\pi p r^{2}}\, e^{\mu q\theta} \tag{7}$$

In computing environments, summation to infinity is impossible. Thus, Eq. 7 is modified, where infinity is replaced by user-defined maximum orders pmax and qmax:

$$\hat{f}(r,\theta) \simeq \sum_{p=-pmax}^{pmax}\;\sum_{q=-qmax}^{qmax} M^{R}_{pq}\, e^{\mu 2\pi p r^{2}}\, e^{\mu q\theta} \tag{8}$$

Since $M^{R}_{-p,-q} = M^{R}_{p,q}$, the computational complexity of Eq. 8 can be reduced by replacing the initial limits of the double summation from −pmax and −qmax to 0.

The accuracy of the reconstructed image increases as the values of pmax and qmax increase. Digital images are defined in Cartesian coordinates, and the basis functions of the polar harmonic transforms are defined over a unit circle in polar coordinates. Thus, the traditional method for computing QPCETs requires square-to-circle mapping, as displayed in Fig. 1.
2.3. Accurate Computation of the QPCET in Polar Coordinates

QPCET moments can be computed in either Cartesian or polar coordinates. Computation in polar coordinates is preferable, since the basis functions are defined in polar coordinates over a unit disk. Recently, a highly accurate method for computing QPHTs in polar coordinates was proposed in [22]. These researchers used the polar raster, where the conventional square pixels are replaced by circular pixels, as displayed in Fig. 1. A detailed discussion of the polar raster can be found in [11].

For the QPCET symmetry properties, the symmetry points are displayed in Fig. 2. Each point of the first kind has 8-point symmetry, while each point of the second kind has 4-point symmetry. For the illustrated example of an image
FIGURE 1: (a) Image plane, (b) polar raster
FIGURE 2: QPCET symmetry property
sized 8 × 8 with a total number of 64 pixels, the first kind of symmetry consists of 6 × 8 = 48 pixels while the second kind consists of 4 × 4 = 16 pixels. In the remainder of this paper, we use the 4-point symmetry in QPCET moment computations.
The right-hand side QPCET moments for color images are represented by the function f(x, y), where x and y are the image Cartesian coordinates, and this function is exactly computed in polar coordinates as follows:

$$M^{R}_{pq} = \frac{1}{\pi}\sum_{i}\sum_{j} \hat{f}(r_{i},\theta_{i,j})\, I_{p}(r_{i})\, I_{q}(\theta_{i,j}) \tag{9}$$

with:
$$I_{p}(r_{i}) = \int_{U_{i}}^{U_{i+1}} R_{p}(r)\, r\, dr; \qquad I_{q}(\theta_{i,j}) = \int_{V_{i,j}}^{V_{i,j+1}} e^{-\mu q\theta}\, d\theta \tag{10}$$

where $\hat{f}(r_{i},\theta_{i,j})$ represents the interpolated color image function. Based on Eq. 4, this function is defined as follows:

$$\hat{f}(r_{i},\theta_{i,j}) = \hat{f}_{R}(r_{i},\theta_{i,j})\,i + \hat{f}_{G}(r_{i},\theta_{i,j})\,j + \hat{f}_{B}(r_{i},\theta_{i,j})\,k \tag{11}$$
The interpolated image function, $\hat{f}(r_{i},\theta_{i,j})$, and its RGB components, $\hat{f}_{R}(r_{i},\theta_{i,j})$, $\hat{f}_{G}(r_{i},\theta_{i,j})$, and $\hat{f}_{B}(r_{i},\theta_{i,j})$, are deduced from the original input color image by using cubic interpolation [11]. After a set of mathematical manipulations, the QPCET moments are represented using the RGB channels as follows:

$$M^{R}_{pq} = A^{R}_{pq} + iB^{R}_{pq} + jC^{R}_{pq} + kD^{R}_{pq} \tag{12}$$

where:

$$\begin{aligned}
A^{R}_{pq} &= -\tfrac{1}{\sqrt{3}}\left[\mathrm{imag}(M_{pq}(\hat{f}_{R})) + \mathrm{imag}(M_{pq}(\hat{f}_{G})) + \mathrm{imag}(M_{pq}(\hat{f}_{B}))\right] \\
B^{R}_{pq} &= \mathrm{real}(M_{pq}(\hat{f}_{R})) + \tfrac{1}{\sqrt{3}}\left[\mathrm{imag}(M_{pq}(\hat{f}_{G})) - \mathrm{imag}(M_{pq}(\hat{f}_{B}))\right] \\
C^{R}_{pq} &= \mathrm{real}(M_{pq}(\hat{f}_{G})) + \tfrac{1}{\sqrt{3}}\left[\mathrm{imag}(M_{pq}(\hat{f}_{B})) - \mathrm{imag}(M_{pq}(\hat{f}_{R}))\right] \\
D^{R}_{pq} &= \mathrm{real}(M_{pq}(\hat{f}_{B})) + \tfrac{1}{\sqrt{3}}\left[\mathrm{imag}(M_{pq}(\hat{f}_{R})) - \mathrm{imag}(M_{pq}(\hat{f}_{G}))\right]
\end{aligned} \tag{13}$$

Here $M_{pq}(\hat{f}_{R})$, $M_{pq}(\hat{f}_{G})$, and $M_{pq}(\hat{f}_{B})$ represent the conventional PCETs of the red, green, and blue channels, respectively. Eq. 13 shows that accurate computation of the PCET moments of the three single channels results in accurate computation of the QPCET moments of the input color image.
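The combination step of Eq. (13) amounts to a few arithmetic operations per moment element. The minimal C++ sketch below is illustrative (the function name and argument names are ours); it takes one conventional PCET moment per channel and returns the four real components A, B, C, D of Eq. (12):

```cpp
#include <array>
#include <cmath>
#include <complex>

// Combine the single-channel PCET moments M_pq(fR), M_pq(fG), M_pq(fB)
// into the components A, B, C, D of the right-side QPCET moment, Eq. (13).
std::array<double, 4> combineChannels(std::complex<double> mR,
                                      std::complex<double> mG,
                                      std::complex<double> mB) {
    const double s = 1.0 / std::sqrt(3.0);
    const double A = -s * (mR.imag() + mG.imag() + mB.imag());
    const double B = mR.real() + s * (mG.imag() - mB.imag());
    const double C = mG.real() + s * (mB.imag() - mR.imag());
    const double D = mB.real() + s * (mR.imag() - mG.imag());
    return {A, B, C, D};
}
```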
Color images are reconstructed with a finite number of QPCET moments using the following form:

$$\hat{f}(r,\theta) = \hat{f}_{A}(r,\theta) + \hat{f}_{B}(r,\theta)\,i + \hat{f}_{C}(r,\theta)\,j + \hat{f}_{D}(r,\theta)\,k \tag{14}$$

where $\hat{f}_{A}(r,\theta)$, $\hat{f}_{B}(r,\theta)$, $\hat{f}_{C}(r,\theta)$, and $\hat{f}_{D}(r,\theta)$ are computed as proposed in [7].
2.4. CPU-GPU Heterogeneous Computing
The use of GPU-only and multicore CPU-only implementations to accelerate sequential counterparts has been successfully reported for different scientific problems, as in [23, 24]. Considerable research has also been performed on hybrid CPU-GPU implementations to increase the speedups obtained. While the GPU is processing its workload, the CPU waits idle. Thus, assigning a portion of the GPU workload to the CPU should improve the parallel performance. The main issue of this hybrid implementation is maintaining load balance across the heterogeneous PUs.

Determining the workloads of heterogeneous PUs can be addressed through static or dynamic scheduling [25]. In static scheduling, each PU is assigned a fixed workload before executing the code. In dynamic scheduling, the workload is assigned to each PU during run-time and can be changed based on the PU performance. The overhead of dynamic scheduling is monitoring the PU performance and then assigning suitable portions of the workload based on each PU's performance. In static scheduling, this overhead vanishes.

Static scheduling measures the performance of the PUs prior to the execution of the code by benchmarking the different PUs. This evaluation includes scoring the performance of each PU on the benchmarks; based on these scores, each PU receives a suitable workload. A linear programming model is used in [26, 27] to distribute the workload between a CPU, a GPU, and an FPGA, leading to a higher level of acceleration.
2.5. Loop optimization techniques

Modern parallel systems have a low cost per giga floating-point operations per second (GFLOPS). Loops are the building blocks of any parallel library. Thus, optimizing loops is a key task for any parallel code that aims to fully utilize these inexpensive, massively parallel resources.

Loops can be optimized through several techniques. Loop unrolling reduces the loop-control overhead by increasing the loop body size; it can be seen as rewriting the loop body as repeated independent statements (iterations). Loop unrolling is utilized in several CUDA applications. In [28], the authors used the loop unrolling technique to parallelize image convolution with CUDA. This process includes an abundance of iterations, but each iteration performs only a few computations; unrolling therefore increased the computation per iteration and reduced the total number of iterations. Another recent method used loop unrolling to boost the performance of its proposed method [29].

The loop interchange technique may also boost parallel code performance. Nested loops can be interchanged when no dependencies appear in the body of the two nested loops. In [30], the authors utilized loop interchange to optimize parallel code and then evaluated it on several tasks, including applying a Laplace filter to a 256 × 256 pixels image.

Finally, loop fusion (collapsing) is used to fuse nested loops into a single loop. In [31], the authors proposed a parallel implementation for a computational fluid dynamics problem; they collapsed four nested loops into one loop and then mapped the collapsed outer-loop iterations to the parallel resources. In [32], the authors compared these three loop optimization techniques and provided recommendations on choosing the proper technique for several problems. Brief sketches of these techniques are given below.
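The following C++/OpenMP fragment is illustrative only (array names and sizes are placeholders of our choosing); it contrasts outer-loop parallelism with loop fusion (collapsing) on an independent per-element operation, and shows a manually unrolled loop:

```cpp
#include <omp.h>

// Illustrative only: 'in' and 'out' are placeholder arrays of rows*cols doubles.
void scaleImage(const double* in, double* out, int rows, int cols) {
    // Outer-loop parallelism only: at most 'rows' independent chunks.
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = 2.0 * in[i * cols + j];

    // Loop fusion (collapsing): both loops form one iteration space of
    // rows*cols independent iterations, exposing far more parallelism.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = 2.0 * in[i * cols + j];
}

// Loop unrolling: fewer loop-control operations and more work per iteration
// (n is assumed to be a multiple of 4 in this sketch).
void scaleUnrolled(const double* in, double* out, int n) {
    for (int j = 0; j < n; j += 4) {
        out[j]     = 2.0 * in[j];
        out[j + 1] = 2.0 * in[j + 1];
        out[j + 2] = 2.0 * in[j + 2];
        out[j + 3] = 2.0 * in[j + 3];
    }
}
```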
3. QPCET moments: A Computational Perspective

It is crucial to understand the mechanism of the QPCET moment computation from a computational perspective as a sequential process. This understanding is the entry point to implementing the parallel counterpart. In this section, the QPCET equations are explained in terms of memory and computational operations, and the method for converting the sequential version to a parallel version is illustrated by figures. For simplicity, the operations of these two versions are explained for a single image channel. The other two channels are treated in exactly the same way.
Since the quaternion moments are defined in polar coordinates over a unit circle [21], the input color images of size N × N are mapped to polar coordinates. The image channel in the polar raster domain consists of a set of rings (tracks), where each ring consists of a set of sectors (polar pixels). The number of rings is M = N/2. For the original image, the total numbers of pixels in the polar and Cartesian domains are identical. In Fig. 3 and the remainder of this paper, S is the number of sectors in the innermost ring. The number of sectors in ring i equals S × (2 × i − 1), where i ∈ [1, M]. For an image sized 8 × 8, the number of rings is 4. The first four rings consist of 4, 12, 20, and 28 sectors, respectively. The total number of sectors of these four rings equals 64, which is equivalent to the 8 × 8 pixels of the original image.

Fig. 3 shows the polar raster of an image sized 8 × 8 and its counterpart representation using a vector of vectors. This vector of vectors, shown in Fig. 3, represents only one image channel (red, green, or blue). The entire original image is stored as three vectors of vectors; a sketch of this layout is given below.
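Under the stated assumptions (M = N/2 rings, ring i holding S × (2i − 1) sectors), one channel of the polar image can be allocated as a vector of vectors. The C++ sketch below is illustrative only (the function name is ours; the reference code may use a different container):

```cpp
#include <vector>

// Allocate one channel of the polar raster: ring i (1-based) has
// S * (2*i - 1) sectors, so the total over M = N/2 rings is S * M^2.
// With S = 4 this equals N * N, matching the Cartesian pixel count.
std::vector<std::vector<double>> makePolarChannel(int N, int S = 4) {
    const int M = N / 2;                        // number of rings
    std::vector<std::vector<double>> channel(M);
    for (int i = 1; i <= M; ++i)
        channel[i - 1].assign(S * (2 * i - 1), 0.0);
    return channel;
}
```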
Fig. 4 depicts Eq. 9, which is used to compute the QPCET moments (Mpq), from a computational perspective. Fig. 4 includes, from left to right, the interpolated image, denoted f̂, the radial kernel of order p, denoted Ip, and the angular kernel of repetition q, denoted Iq. The interpolated image consists of N × N pixels spread over M rings. The computation of Mpq involves the two parameters p and q, as in Eq. 9. The higher the values of p and q, the higher the
FIGURE 3: Polar raster of the interpolated image and its representation by a vector of vectors for one channel
consumed memory, computational time, and accuracy of the extracted features and of the reconstructed polar image. In Eq. 10, the possible values of p belong to [0, pmax]. In the top part of Fig. 4, the computed QPCET moments are stored as a two-dimensional array of size (pmax + 1) × (qmax + 1), where pmax and qmax are user defined.

Ring number i has radius ri; ring i has (pmax + 1) values for p. These (pmax + 1) values are stored in a two-dimensional array representing the kernel Ip, of size M × (pmax + 1). Finally, for each sector θi,j there are (qmax + 1) different q values. These values are calculated by Eq. 10. Thus, the kernel Iq values for each sector are represented by a vector of two-dimensional arrays, which is a three-dimensional data structure. This data structure is illustrated in the rightmost part of Fig. 4. All of the two-dimensional arrays of kernel Iq have the same number of rows (qmax + 1), and the number of sectors equals S × (2 × i − 1), where i is the ring number.

Computing an element mpq of the moment matrix Mpq includes scanning all the elements of the vector of vectors of the interpolated image (the leftmost part of Fig. 4). This scan accumulates, for each sectori,j, the product of its value, the value of element p of ring i from the radial kernel Ip, and the value of element q of the corresponding sectori,j from the angular kernel Iq.

Similarly, image reconstruction from the QPCET moments is depicted in Fig. 5.
FIGURE 4: QPCET moment computation of a single image channel

FIGURE 5: QPCET image reconstruction of a single image channel
The process includes constructing a vector of vectors for the reconstructed image sectors. Each reconstructed sectori,j is computed as a summation of the product of two terms, namely, the moment element mpq and $e^{\mu 2\pi p r^{2}} e^{\mu q\theta}$.

4. Parallel QPCET moment challenges

In this section, we profile the QPCET running time to determine the computational bottleneck. Then, we discuss the challenges of QPCET moment and image reconstruction computations from a parallel perspective.
4.1. QPCET moment profiling

Profiling the sequential QPCET moment computation is the first step toward implementing the parallel counterpart. Table 1 lists the execution times, in seconds, of the main steps of the moment computation and image reconstruction for an image of 256 × 256 pixels at two moment orders, 20 and 60. Each experiment is repeated ten times, and the reported execution times are given with a confidence interval of 95%. The moment computing step strongly dominates the other steps, namely, converting the regular image to a polar image and computing the radial kernel Ip and the angular kernel Iq. On the other hand, image reconstruction needs no component to be prepared, and its execution time is, on average, approximately five times that of computing the QPCET moments. Table 1 shows that using the 4-point symmetry significantly reduces the runtime. Compared with the 4-point symmetry, the 8-point symmetry has a higher memory requirement, and the additional running time reduction it achieves is not significant. Thus, the 4-point symmetry version is considered the base sequential code to be parallelized in this work.

Finally, the computational times of the kernels Ip and Iq are image independent; thus, they are not counted in the moment computation, for two reasons. First, since the kernels Ip and Iq are image independent, they can be precomputed and saved to secondary memory; these values are then loaded in a trivial amount of time before computing the QPCET moments. Second, if the code handles a set of k images of the same size, the kernels Ip and Iq are computed or loaded from secondary memory only once, regardless of the number of images. The overhead of computing the kernels Ip and Iq for a single image is therefore the computing time of the kernels divided by k.

4.2. Challenges of parallel QPCET moment implementation

Moment computation consists of four nested loops, as listed in Algorithm 1. The iterations of the first two loops are computationally equal, while the third loop iterations have different computational loads. Thus, loop fusion of the
TABLE 1: Execution times (in seconds) of the sequential QPCET moment computation and image reconstruction steps (image size 256 × 256)

                           4-point symmetry                 No symmetry
Moment order               20             60                20             60
Image interpolation        1.772 ± 0.01   1.772 ± 0.01      1.851 ± 0.02   1.851 ± 0.02
Computing kernel Ip        0.001 ± 0.00   0.004 ± 0.00      0.003 ± 0.00   0.007 ± 0.00
Computing kernel Iq        0.526 ± 0.01   1.610 ± 0.01      1.411 ± 0.01   4.313 ± 0.02
Moment computing           7.675 ± 0.06   65.187 ± 0.52     31.621 ± 0.01  275.015 ± 0.34
Image reconstruction       38.432 ± 0.13  337.125 ± 1.05    108.314 ± 0.375  955.412 ± 7.44
outermost two loops is straightforward but provides a limited level of parallelism. For example, when computing the QPCET moments with an order of 20 (pmax = qmax = 20), the aforementioned two-loop fusion results in 21 × 21 = 441 independent iterations. This level of parallelism is suitable for multicore CPUs but is insufficient for a single modern GPU. The P100 GPU has 3,584 cores, and the number of threads on a P100 GPU should be about twice the number of GPU cores to hide the memory access latency. Moreover, this two-loop fusion is extremely insufficient for a multi-GPU environment.

In addition, within the iterations of the third and fourth loops, the p and q values of the outermost two loops are required in the two innermost loops. Thus, mapping the fused-loop iteration number to the original values of p and q must be addressed.

Fusing the first three loops of the QPCET moment computation yields M × (pmax + 1) × (qmax + 1) dependent iterations; each of these iterations handles one ring of the image. As shown in Fig. 1, the rings have different numbers of sectors. Thus, the iterations of the fused three loops are not computationally equal, which is not preferable for parallel architectures; unequal computational workloads lead to load imbalance.

Finally, fusing all four loops strongly increases code-level parallelism. For
example, for a small image of 128 × 128 pixels and a moment order of 20, fusing all four loops results in 128 × 128 × 21 × 21 = 7,225,344 dependent iterations. While fusing all four loops greatly increases the level of parallelism, the increase comes at the cost of increasing the amount of synchronization (lines 8 to 10 in Algorithm 1).

For an N × N image under three-loop fusion, each iteration of the fused loop computes the moment contribution of ring i. Thus, the synchronization is at the ring level, with M = N/2 synchronization steps. For four-loop fusion, an iteration of the fused loop computes the moment contribution of sector j of ring i, i.e., of one pixel of the N² pixels. Thus, N² synchronization steps are required, even though the iterations are computationally equal. In numbers, the synchronization steps for a small image of size 128 × 128 are 64 and 16,384 for the three-loop and four-loop fusions, respectively.

The main challenge of the QPCET moment computation is to choose the proper number of loops to fuse. From the above discussion, two-loop fusion is synchronization-free but has a low level of parallelism with independent iterations; this level of parallelism improves as the moment order increases. Three-loop fusion has a low synchronization cost and a higher level of parallelism than two-loop fusion. Finally, four-loop fusion has an extremely high synchronization cost and the highest level of parallelism. Thus, in the following, we exclude four-loop fusion due to its extremely high synchronization cost and consider only the two-loop and three-loop fusions.

4.3. Challenges of parallel image reconstruction
Algorithm 2 is the sequential interpolated image reconstruction from the QPCET moments. In Algorithm 2, the outermost loop iterations are not computationally equal: the number of sectors equals S × (2 × i − 1), where i is the ring number. A parallel implementation of Algorithm 2 with no loop fusion would assign each ring's sectors to one parallel resource, e.g., a CPU core. For instance, a 128 × 128 pixels image has 64 rings; given S = 4, the innermost ring includes four sectors (four independent tasks), while the last ring includes 512 sectors, i.e., 512
independent tasks; details of the rings' structure are depicted in Fig. 2. Thus, one parallel resource would handle 4 independent tasks while another would handle 512 independent tasks. This example shows the possible load imbalance if the first and second outer loops are not fused. Thus, this algorithm is inconvenient for multicore CPUs and GPUs without loop fusion. This inequality can easily be exposed by a straightforward parallelization of the sequential image reconstruction using the built-in OpenMP scheduling options [14].

Interchanging the two inner loops with the two outer loops would make Algorithm 2 similar in structure to Algorithm 1. This cannot be done because of the synchronization cost. In Algorithm 2, the two outer loops represent the pixels of the reconstructed image. Each pixel to be reconstructed must iterate over all the moment elements, i.e., the two inner loops. Thus, each iteration of the two outer loops uses the two inner loops to iterate over the moments and reconstruct one pixel. The reconstruction of a pixel is independent of the other pixels, so fusing the two outer loops requires no synchronization. On the other hand, interchanging the two outer loops with the two inner loops would result in dependent iterations and a synchronization cost: after the interchange, each iteration of the outer two loops reconstructs only a part of each pixel, and a pixel is complete only after all the iterations of the interchanged outer loops have finished. Thus, loop interchange would degrade the performance of Algorithm 2.

Intuitively, loop fusion is the most convenient approach to parallelize Algorithm 2; it requires a mapping function to convert the fused-loop iteration number to the original ring i and sector j values. Fusing the first two loops results in a number of independent iterations equal to the number of image pixels.

There are two main issues to be addressed. The first issue is avoiding computationally unequal iterations of the fused loops, to balance the workloads; otherwise, parallel resources are underused because certain threads/cores finish earlier and become idle. In addition, the implementation of these fused loops
Algorithm 1 Sequential QPCET moment computation for a polar image
1:  for p = 0 to pmax do
2:    for q = 0 to qmax do
3:      for i = 1 to M do
4:        for j = 1 to S × (2 × i − 1) do
5:          Red_Mpq[p][q]   += Red_polarImage[i][j]   × Ip[p][i] × Iq[i][q][j]
6:          Green_Mpq[p][q] += Green_polarImage[i][j] × Ip[p][i] × Iq[i][q][j]
7:          Blue_Mpq[p][q]  += Blue_polarImage[i][j]  × Ip[p][i] × Iq[i][q][j]
          end for
        end for
8:      Red_Mpq[p][q]   ×= 1.0/PI
9:      Green_Mpq[p][q] ×= 1.0/PI
10:     Blue_Mpq[p][q]  ×= 1.0/PI
      end for
    end for
should have minimal synchronization overhead and space complexity. The second issue is mapping the fused iteration number to the original iteration number of each fused loop in a constant amount of time, so that this mapping step has no effect on the overall computational time.

5. Proposed parallel implementation of QPCET moment

As discussed in Sec. 4, the main challenge is to find a compromise between a low level of parallelism with no synchronization on one side and high levels of both parallelism and synchronization on the other side. Three-loop fusion has a high level of parallelism and a low synchronization cost but has computationally unequal independent tasks. This paper proposes the novel concept of loop mitigation; it mitigates the load imbalance and the synchronization cost of the three-loop fusion of the QPCET moment computation. The loop mitigation produces computationally equal three-loop fused iterations with fewer than M syn-
Algorithm 2 Sequential polar image reconstruction using the QPCET moments
for i = 1 to M do
  for j = 1 to S × (2 × i − 1) do
    for p = 0 to pmax do
      for q = 0 to qmax do
        Red_recImage[i][j]   += Red_Mpq[p][q]   × e^(µ2πpr²) × e^(µqθ)
        Green_recImage[i][j] += Green_Mpq[p][q] × e^(µ2πpr²) × e^(µqθ)
        Blue_recImage[i][j]  += Blue_Mpq[p][q]  × e^(µ2πpr²) × e^(µqθ)
      end for
    end for
  end for
end for
chronization steps. In addition, for the moment computation, the image size and/or the moment order determines the portion of the workload assigned to each PU. Thus, based on the image size and the moment order, we use static scheduling with a moment-optimized prediction of the heterogeneous PU workloads to achieve load balance.
5.1. From a fused loop iteration to the original loop iterations

Mapping functions of Algorithm 1:

The mapping of the iteration number of the two-loop fusion of the QPCET moment computation to the original loop iteration numbers is calculated by:
$$p = \left\lfloor \frac{i_{fused}}{qmax + 1} \right\rfloor, \qquad q = i_{fused} \bmod (qmax + 1) \tag{15}$$

where $i_{fused}$ is the iteration number of the fused two loops, and $i_{fused} \in [0, (pmax + 1) \times (qmax + 1) - 1]$ for two-loop fusion.

Similarly, the mapping of the iteration number of the three-loop fusion of the QPCET moment computation to the original loop iteration numbers is calculated by:

$$\begin{aligned}
p &= \left\lfloor \frac{i_{fused} \bmod \left((pmax + 1)\times(qmax + 1)\right)}{qmax + 1} \right\rfloor \\
q &= \left(i_{fused} \bmod \left((pmax + 1)\times(qmax + 1)\right)\right) \bmod (qmax + 1) \\
i &= \left\lfloor \frac{i_{fused}}{(pmax + 1)\times(qmax + 1)} \right\rfloor
\end{aligned} \tag{16}$$

where $i_{fused} \in [0, (pmax + 1) \times (qmax + 1) \times M - 1]$.
where if used ∈ [0, (pmax + 1) × (qmax + 1) × M ]. 415
Mapping functions of Algorithm 2:

For two-loop fusion, the total number of fused-loop iterations equals the total number of image pixels, N². In Algorithm 2, the mapping of the fused iteration number to the counterpart ring i and sector j is calculated by:

$$i = \left\lfloor \frac{\sqrt{i_{fused}}}{2} \right\rfloor, \qquad j = i_{fused} - S \times i^{2} \tag{17}$$

where i ∈ [0, M − 1], j ∈ [0, (S × (2 × i + 1)) − 1], $i_{fused} \in [0, N^{2} - 1]$, and S = 4; S is the number of sectors in the innermost ring of the polar image, as shown on the left side of Fig. 1. Ring i of the polar image is defined to be in the range [1, M] in this paper; herein, the value of i is computed in the range [0, M − 1] for simpler manipulation, and the conversion is handled by simply adding one to the value obtained for i from Eq. 17.

5.2. The loop mitigation
For a low or moderate order QPCET moment in a massively parallel environment, e.g., a modern powerful single GPU, the number of parallel resources cannot be fully utilized using the two-loop fusion technique. Thus, a higher level of loop fusion is required to boost the level of parallelism of the QPCET moment computation. For the moment computation, the two innermost loops perform a summation that is stored in a variable of the next outer loop, lines 8
to 10 in Algorithm 1. Thus, fusing either or both of the innermost loops of these two operations incurs a synchronization overhead.

For example, the moment element mp,q is computed by completing the two innermost loops, lines 4 to 7 in Algorithm 1. Suppose the workload for computing the moment element mp,q is divided among n threads. Each thread then computes a portion of the workload, and the moment element mp,q is obtained by applying a reduction with a sum operator over the n memory cells. In contrast, only one memory cell is required when two-loop fusion is used. In addition, a synchronization step is executed after computing the moment for each of the M rings.

As a compromise, we propose the loop mitigation technique, which boosts the level of parallelism for computing the QPCET moments, balances the workloads of the iterations, and reduces the synchronization time overhead. There are M executions of the synchronization step for three-loop fusion. With loop mitigation, the number of synchronization step executions of three-loop fusion is reduced to between M/2 and log(M/2), depending on the value of the tunable parameter mmitig, as explained in the following section.
Boosting the level of parallelism and balancing the workload:
The main idea of the loop mitigation technique is to rearrange the computationally unequal iterations of the third loop of Algorithm 1 into computationally equal iterations. Thus, the inequality of the iterations is hidden, and the workloads are balanced without fusing the third loop. Next, the rearranged iterations are spread over a set of smaller loops: loop mitigation decomposes a loop of computationally unequal iterations into a set of smaller loops with equal computational loads, and these equal-load loops can then be distributed over the parallel resources.

Loop mitigation uses the fact that ring i has S × (2 × (i − 1) + 1) sectors, with i ∈ [1, M]; each sector represents an iteration. Thus, for a given balanced loop, the total number of sectors/iterations is the sum of the assigned rings' sectors. Each
cell of Fig. 6 represents S sectors.

FIGURE 6: Loop mitigation of polar image rings

From Fig. 6, loop mitigation divides the M rings into two halves, low and
high. Then, this method balances the load by pairing the first ring of the low half with the last ring of the high half, the second ring of the low half with the next-to-last ring of the high half, and so on. Thus, this method generates M/2 pairs of rings, which is the limit for boosting the level of parallelism of the third loop. The M/2 pairs of rings are computationally equal.

For the series of odd numbers in the range [1, (2 × M) − 1], loop mitigation balances the workloads of the smaller loops by rearranging the rings so that the total number of iterations in each smaller loop equals a multiple of S × (2 × M). For example, in Fig. 6, the polar image consists of six rings, M = 6. Thus, PU0 computes the moment for rings 1 and 6, PU1 computes the moment for rings 2 and 5, and PU2 computes the moment for the remaining rings. Each PU has S × 2M = S × 12 sectors/iterations. Loop mitigation delivers an equal workload to each PU (the different colors in Fig. 6 mark the different PUs). Thus, while the left part of Fig. 6 contains one loop, the right part contains three smaller loops. A minimal sketch of this pairing is given below.
Journal Pre-proof
lP repro of
Reducing synchronization time overhead:
After a pair of rings is computed, a synchronization step is required. In total, loop mitigation executes the synchronization step M/2 times because there is only one instance of the moment in the memory. 480
Another solution to reduce the number of synchronization step executions is to have M/2 memory instances of the computed moment. Thus, each equal-
load loop can independently store the calculated values in its own instance of
the moment. Then, a reduction sums the M/2 moment instances after the M/2 smaller loops are finished. Reduction of M/2 terms can be accomplished in 485
log(M/2) iterations. These instances can be within a single PU or spread over
several PUs. Thus, loop mitigation significantly reduces the time overhead for the obligatory synchronization step.
Loop mitigation has a memory overhead of (pmax+1)×(qmax+1)×mmitig , where mmitig is the number of moment instances, and 1 ≤ mmitig ≤ M/2. In 490
addition, mmitig is used to divide the M/2 equal-load loops over the parallel resources. Thus, mmitig should be a factor of M/2 to guarantee load balance between the PUs involved. For the loop mitigation, the number of executions ofthe synchronization step is calculated as
M (2×mmitig ) (log(2
× mmitig )).
495
rna
Loop mitigation requires synchronization at the PU level once all the PUs
have finished their workloads. In CUDA [15], a GPU kernel is used for the GPU level of synchronization. Synchronization is required once all the GPUs have finished their workloads.
5.3. Load balancing in a heterogeneous environment
Jou
The proposed implementation can run in heterogeneous parallel computing 500
environments. These environments include CPU-GPU, with a single or multiple GPUs. The loop mitigation enables partitioning the workload between PUs, i.e.,
the GPU(s) and the host multicore CPU. The workloads of each of the GPUs and CPU are tunable. The user can set the workload values of each involved PU in terms of the number of ring pairs, which is discussed in Section 5.2. Summing
24
Journal Pre-proof
the assigned ring pairs for all the PUs yields M/2 pairs. These workloads should
lP repro of
505
be proportional to the computational power of the processing units.
Several efforts have been proposed to predict the proper workload of each
PU; these efforts are summarized in [25]. The most suitable approach for the proposed implementation is that considering the relative performances of the 510
PUs. In this context, the proposed implementation includes a utility that estimates the optimal workloads of the heterogeneous PUs. This utility should run one time for every change in the number of PUs.
In [33], the authors proposed using linear regression to predict the workload
of a CPU and a GPU based on recorded execution times for a single CPU and 515
a single GPU. Similarly, the proposed utility runs the proposed parallel code on CPU-only and GPU-only for several test images with different sizes and
different moment order values. The execution times of these tests are recorded. Then, multiple regression is used to estimate the workload of the CPU and
GPU based on the recorded execution times. For this purpose, we use the QR 520
decomposition approach. The coefficients of the multiple regression equation are calculated using the recorded real execution times. The multiple regression
equation includes CPU-only and GPU-only execution times as the independent
value y. 525
rna
variables x1 and x2 and the GPU workload in terms of ring pairs as the predicted
To analyze the time for CPU-GPU implementation, we have to calculate the number of cores to determine the expected speedup. As GPU core power is different from CPU core power, we suggest using the aforementioned workload portions to express the overall number of CPU and GPU cores as CPU cores,
Jou
Ptotal . The proposed method to calculate Ptotal is to estimate the relative 530
performance of all the CPU and GPU cores for a given image size and moment order; Ptotal is calculated as follows:
Ptotal = N um of CP U cores(1 + (
where g is the number of GPUs.
25
g X CP U exec time )) GP Ui exec time i=1
(18)
Journal Pre-proof
lP repro of
Assuming a given image size and moment order, the execution time for 16 core CPUs performing moment computation is one second, and the execution 535
time with a GPU using the same code is 0.5 seconds. Given two identical GPUs and using Eq. 19 for this example, Ptotal equals 16 × (1 + 4) = 80 CPU cores. In this example, a single GPU is twice as fast as the 16 core CPU; thus, one GPU equals 32 CPU cores. As a result, Ptotal equals the summation of the actual 16
CPU cores plus an estimated 64 CPU cores from the two GPUs. This estimation 540
changes as either or both of the two factors of image size and moment order change.
5.4. Implementation Complexity Analysis
The analysis includes both the time and memory space costs of the proposed implementations. The time and memory costs of moment computations and 545
image reconstruction operations are dominated by two factors, the image size and moment order, pmax and qmax, respectively. These two operations include the same four loops but in a different order, as shown in Algorithms 1 and 2.
Thus, the two algorithms have the same computational cost. The following analysis is for one image channel; for the three image channels, the constant three is omitted. Thus, the final cost equals the cost of one image channel.
rna
550
For the time analysis, the two algorithms include four loops; two for iterating all possible p and q values of sizes (pmax + 1) and (qmax + 1), respectively. The other two loops iterate over all the sectors (pixels) of the polar image, with N × N pixels in total. Thus, the computational cost of the sequential algorithm 555
is O(N 2 × pmax × qmax) and is O((N 2 × pmax × qmax)/Ptotal ) for the parallel
Jou
version, where Ptotal is defined in Subsection 5.3. For the space analysis, Algorithm 1 accesses the three data structures of fˆ
(the interpolated image), kernel Ip , and kernel Iq , with space complexities of O(N 2 ), O(M ×pmax), O(N 2 ×qmax), respectively, as illustrated in Fig. 4. The
560
dominant data structure is the one representing Iq . Thus, the space costs of the CPU-only and GPU-only implementations are O(N 2 × qmax) for moment computation operations. For the CPU-GPU implementation, the required memory 26
Journal Pre-proof
lP repro of
for the CPU is different from that of the GPU, as the CPU accesses all three
of the aforementioned data structures while each GPU accesses parts of these 565
three data structures, due to loop mitigation technique, Section 5.2. The GPU space cost of the CPU-GPU implementation is O((N 2 × qmax)/g) for moment computational operations, where g is number of GPUs. The space cost of the
image reconstruction operation is O(N 2 + pmax × qmax) for the proposed implementations, as illustrated in Fig. 5.
570
6. Experimental Results 6.1. Setup
The experiments are performed on a computer with two Intel(R) Xeon(R) E5-2609 v4 1.70 GHz CPUs with 8 cores, four P100 NVIDIA GPUs with 16
GB memory, and 500 GB RAM. The OS is 64-bit Linux. The implementations 575
are written in the C++ programming language, GCC version 5.4.0. We used
the standard OpenMP threads library [14] for multicore CPUs and CUDA 10.1 [15] for the GPU programming. For the reported results, each experiment is
repeated ten times and the reported execution times are in seconds with a confidence interval of 95%. The test images are of birds dataset [34]. In the following, ’CPU’ refers to using CPU with 16 cores and ’GPU’ refers to using
rna
580
one P100 GPU. In the following, all the reported times are in seconds and the throughputs are the number of processed images per minute. The reported speedups are calculated based on the baseline performance on a single Intel Xeon E5-2609 CPU core. All reported speedups and execution times do not 585
include the data transfer time between the host and device memory, except for
Jou
the image watermarking experiments. The data transfer time overhead between CPU and GPU(s) is discussed in Section 6.4. 6.2. Accuracy
First, the accuracy of the sequential QPCET moment implementation is
590
compared with the proposed parallel implementation counterparts. The se-
27
lP repro of
Journal Pre-proof
FIGURE 7: Visual perception of the accuracy of the proposed method
quential and parallel implementations have identical accuracies for all the im-
ages tested. The normalized image reconstruction error (NIRE) and the visual perception of the reconstructed image are used in quantitative and qualitative evaluation of the reconstruction accuracy of the proposed parallel method. Fig. 595
7 shows the values of NIRE and the visual perception of the reconstructed image of Lena with different moment orders. 6.3. Speedup and time improvement
Second, we compare the speedup of loop mitigation with three-loop fusion
against three-loop fusion and two-loop fusion on a single GPU. Table 2 lists the 600
execution times in seconds alongside the speedups obtained for three-loop fusion and loop mitigation with mmitig = 1 and mmitig = M/2. The performance of
rna
the three-loop fusion with the loop mitigation surpasses that of the two-loop and three-loop fusions, especially for large mmitig . Next, Fig. 8 shows how increasing the value of mmitig affects the speedup 605
at different moment orders for an image of 1,024 × 1,024 pixels. Thus, the loop mitigation implementation is selected for further experiments due to its superior performance.
Jou
Then, we evaluate the speedup, sequential running time improvement (Time Impr.), and throughput of the proposed implementations. Tables 3 and 4 show
610
the speedups and execution time improvements for the proposed implementations of moment computation, respectively. Similarly, Tables 5 and 6 show the speedups and execution time improvements for the proposed implementations of the interpolated image reconstruction, respectively. The execution time im-
28
lP repro of
Journal Pre-proof
TABLE 2: Execution times in seconds & speedup of two-loop fusion vs. three-loop fusion vs. the loop mitigation on a single GPU
Image
Mom.
Sequent.
Size
Order
[7]
Parallel [13]
The proposed implementations
2-loop
3-loop
Speedup
fusion
20 256×
7.68 ± 0.06
256 40
29.47 ± 0.20
60
65.19 ± 0.52
20 512×
31.04 ± 0.19
512
117.61± 40
4.80 ±
4.99x
13.11x
4.85 ±
4.91 ±
0.01
124.07±
76.78± 0.01
478.82±
76.94±
40
Jou 13.35
0.01
0.00
9.87x 0.02
0.06
0.53 ±
1.98 ±
4.33 ±
1.98 ±
7.56 ±
59.45x
59.82x
62.61x
63.30x
16.92± 20.49x
0.08
58.89x
0.00
51.44± 12.93x
62.01x
0.00
48.51±
81.55± 13.26x
29
2.58x
6.19x
1.05 ±
0.01
48.10±
0.00
0.01
20.26x 0.02
77.37±
55.25x
0.01
12.79±
0.00
0.53 ±
0.00
9.72x
1.63x
42.82x
0.00
12.10±
76.09±
1054.32± 79.50±
60
20.54x
2.59x
13.19x
0.18 ±
0.00
0.01
0.09
0.01
9.72x
11.98±
19.65±
6.22x
2.72
3.17 ±
6.07x
1.62x
0.97
3.03 ±
1.65x
13.12x
Speedup
0.00
0.00
0.00
19.76±
20
13.28x
19.39±
0.01
2.55x
0.00
0.00
19.22±
0.04
6.07x
18.85±
1.62x
3.01 ± 0.00
0.02
19.15±
2.04
1.62x
0.00
rna 259.18±
Speedup
M/2
0.00
0.01
60
1024
4.73 ±
6.12x
0.72
1024×
1.60x
0.00
4.86 ±
m=1
fusion
0.00
4.81 ±
m=
Speedup
62.30x 0.01
FIGURE 8: Effect of changing the mmitig value on the obtained speedup (speedup vs. mmitig ∈ {2, 4, 8, 16, 32, 64, 128, 256} for moment orders 20, 40, and 60; 1,024 × 1,024 pixels image with M = 512)
provements for the proposed implementations are calculated by:

$$Time\ Impr. = \frac{(sequential\ time - parallel\ time) \times 100}{sequential\ time} \tag{19}$$
Figs. 9 and 10 depict the throughput of the different proposed implementations on different PUs for 60 seconds. These two figures show the throughputs for various image sizes and different moment order values of 20, 40, and 60.
The proposed implementations are evaluated using relatively large-sized im-
ages of 2,048 × 2,048 and 4,096 × 4,096 pixels. The execution times of this test is reported in Table 7 for multicore CPU, single GPU, and four GPUs with a
rna
620
CPU. For a 4,096 × 4,096 pixels image with a moment order 60, the required memory is more than the P100 GPU memory capacity, 16GB. This is due to the angular kernel Iq memory size. The Iq kernel is a three-dimensional array of size N × N × qmax; each element of this array is a complex number where a 625
complex number reserves 8 bytes for the real part and another 8 bytes for the
Jou
imaginary part. Thus, the total memory size of the Iq kernel of a 4,096 × 4,096
pixels image with a moment order 60 equals 4,096 × 4,096 × 61 × 16 = 16.3 × 108 bytes, 16.3GB.
In the same context, the GPU memory utilization of the GPU proposed im-
630
plementations of moment computations are listed in Table 8. For the CPU+4×GPU implementation, the memory utilization value is the same for the four GPUs, as the implementation evenly divides the computational loads over the four 30
lP repro of
Journal Pre-proof
TABLE 3: Execution times and speedups of moment computing for the proposed parallel implementations
CPU
GPU
CPU+4×GPUs
Image
Mom.
Size
Order
exec. time
speedup
exec. time
speedup
exec. time
speedup
20
0.64 ± 0.00
12.08x
0.18 ± 0.00
42.82x
0.10 ± 0.00
78.97x
40
1.90 ± 0.01
15.47x
0.53 ± 0.00
55.25x
0.18 ± 0.00
164.90x
60
4.28 ± 0.01
15.22x
1.05 ± 0.00
62.01x
0.27 ± 0.00
237.80x
20
2.57 ± 0.00
12.09x
0.53 ± 0.00
58.89x
0.19 ± 0.00
162.48x
40
7.63 ± 0.03
15.42x
1.98 ± 0.01
59.45x
0.52 ± 0.00
228.32x
60
17.23 ± 0.05
15.04x
4.33 ± 0.01
59.82x
1.05 ± 0.01
247.30x
20
10.30 ± 0.04
12.05x
1.98 ± 0.00
62.61x
0.66 ± 0.00
186.73x
40
30.28 ± 0.11
15.81x
7.56 ± 0.00
63.30x
1.73 ± 0.00
276.00x
60
69.51 ± 0.30
15.17x
16.92 ± 0.01
62.30x
3.81 ± 0.00
276.89x
256 × 256
512 × 512
1024×
rna
1024
TABLE 4: Execution times in seconds and improvement ratios for moment computation Image
The proposed implementations
Mom.
Sequential[7]
Order
Jou
Size
CPU
Time
CPU+GPU
Time
CPU+4×GPUs
Time
time
Impr.
time
Impr.
time
Impr.
20
124.11
10.30
91.70%
1.68
98.64%
0.66
99.46%
40
478.82
30.28
93.68%
6.13
98.72%
1.73
99.64%
60
1054.32
69.51
93.41%
13.28
98.74%
3.81
99.64%
1024× 1024
31
lP repro of
Journal Pre-proof
TABLE 5: Speedups of image reconstruction for parallel implementations Image
Mom. Order
CPU
GPU
CPU+4×GPUs
20
15.65x
51.71x
184.68x
40
15.75x
52.79x
199.82x
60
15.94x
52.68x
201.87x
20
15.91x
58.13x
251.57x
40
15.92x
58.42x
257.57x
60
15.99x
58.46x
260.22x
20
15.97x
61.79x
258.84x
40
15.56x
61.32x
258.87x
60
15.95x
61.63x
256.98x
Size
256 × 256
512 × 512
1024× 1024
TABLE 6: Execution times in seconds and time improvement ratios for image reconstruction (the proposed implementations vs. the sequential method [7])

Image size      Mom. order   Sequential [7] time   CPU time (Impr.)   GPU time (Impr.)   CPU+GPU time (Impr.)   CPU+4×GPUs time (Impr.)
256 × 256       20           38.43                 2.46 (93.61%)      0.74 (98.07%)      0.59 (98.47%)          0.21 (99.46%)
256 × 256       40           150.96                9.58 (93.65%)      2.88 (98.09%)      2.27 (98.49%)          0.76 (99.50%)
256 × 256       60           337.42                21.17 (93.73%)     6.41 (98.10%)      4.99 (98.52%)          1.67 (99.50%)
512 × 512       20           152.79                9.60 (93.72%)      2.63 (98.28%)      2.22 (98.55%)          0.61 (99.60%)
512 × 512       40           601.55                37.78 (93.72%)     10.29 (98.29%)     8.73 (98.55%)          2.34 (99.61%)
512 × 512       60           1348.98               84.39 (93.74%)     23.07 (98.29%)     19.52 (98.55%)         5.18 (99.62%)
1,024 × 1,024   20           614.51                38.44 (93.74%)     9.95 (98.38%)      8.41 (98.63%)          2.37 (99.61%)
1,024 × 1,024   40           2399.99               154.92 (93.57%)    39.14 (98.37%)     33.01 (98.62%)         9.27 (99.61%)
1,024 × 1,024   60           5393.58               338.22 (93.73%)    87.53 (98.38%)     73.90 (98.63%)         20.99 (99.61%)
TABLE 7: Execution times (in seconds) of moment computation and image reconstruction for relatively large-sized images

Size    Order   CPU moment       CPU reconst.      GPU moment       GPU reconst.      CPU+4×GPU moment   CPU+4×GPU reconst.
2,048   20      41.83 ± 0.18     153.00 ± 0.22     7.53 ± 0.00      40.22 ± 0.01      2.06 ± 0.00        9.67 ± 0.00
2,048   40      123.13 ± 0.51    602.53 ± 0.54     31.24 ± 0.02     157.90 ± 0.05     7.11 ± 0.01        37.94 ± 0.01
2,048   60      279.49 ± 0.72    1347.91 ± 0.81    69.06 ± 0.07     352.81 ± 0.19     16.00 ± 0.01       84.94 ± 0.03
4,096   20      170.12 ± 0.53    612.00 ± 0.28     31.65 ± 0.01     161.19 ± 0.03     7.73 ± 0.03        38.14 ± 0.01
4,096   40      494.24 ± 4.40    2407.84 ± 2.72    126.76 ± 0.15    632.57 ± 0.15     29.92 ± 0.01       149.68 ± 0.03
4,096   60      1113.64 ± 5.44   5390.93 ± 2.29    out of memory    out of memory     66.00 ± 0.01       335.50 ± 0.7
The proposed implementation of moment computing divides both the data and the processing between the parallel resources. On the other hand, the proposed implementation of image reconstruction divides only the processing, not the data, as the data size is very small for this task. For the image reconstruction task, the memory utilization is therefore low: it varies from 2.50% to 6.86% for 256 × 256 pixels and 4,096 × 4,096 pixels images, respectively, for both the single-GPU and CPU+4×GPU implementations.
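As an illustration of this even division of the moment-computation load, the following sketch (not the authors' code; the function name and the contiguous-block policy are our assumptions) partitions the M = N/2 concentric rings of an N × N image into equal blocks across the available GPUs, so every device receives roughly the same data and workload:

    #include <algorithm>
    #include <cuda_runtime.h>

    // Hypothetical per-device launcher (placeholder body) for the moment
    // computation of a contiguous block of rings.
    static void computeMomentsForRings(int /*device*/, int /*firstRing*/, int /*lastRing*/) {
        // ... copy the ring data to the current device and launch the kernels ...
    }

    // Evenly divide the M = N/2 rings of an N x N image across numGpus devices.
    void distributeRings(int N, int numGpus) {
        const int M = N / 2;                                  // number of rings
        const int ringsPerGpu = (M + numGpus - 1) / numGpus;  // ceiling division

        for (int dev = 0; dev < numGpus; ++dev) {
            const int firstRing = dev * ringsPerGpu;
            const int lastRing  = std::min(M, firstRing + ringsPerGpu);
            if (firstRing >= lastRing) break;

            cudaSetDevice(dev);  // route the following copies/launches to this GPU
            computeMomentsForRings(dev, firstRing, lastRing);
        }
    }

With four GPUs and a fixed image size, each device receives M/4 rings, which is why the four memory-utilization values coincide in Table 8.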
6.4. Data transfer times between the CPU and GPU

The data transfer between the CPU (host) and the GPU (device) is of two types: image independent and image dependent. The former consists of data that can be reused for any image of the same size, namely the radial kernel Ip and the angular kernel Iq values. The latter consists of the image pixels; thus, this data is image dependent. Both types of data transfer between the host and the device(s) incur a time overhead. Table 9 lists the required time of both types for the single-GPU and 4×GPU implementations. The general rule is that the larger the image size and the moment order, the longer the data transfer time. For the image-independent data, the transferred data volume is determined by the image size and the moment order. The angular kernel Iq is the dominating portion, with N × N × (qmax + 1) values, where N represents the number of image rows and columns and qmax represents the angular kernel order; each value is a complex number, i.e., two double variables.
TABLE 8: GPU(s) memory utilization of moment computations

Size    Order   GPU             CPU+4×GPU
256     20      1.79%           0.50%
256     40      2.27%           1.68%
256     60      2.46%           1.70%
512     20      2.51%           1.76%
512     40      3.13%           2.28%
512     60      3.77%           2.43%
1024    20      4.21%           2.06%
1024    40      6.41%           2.61%
1024    60      8.67%           2.46%
2,048   20      10.93%          4.08%
2,048   40      19.16%          6.07%
2,048   60      27.61%          8.10%
4,096   20      37.61%          10.46%
4,096   40      69.80%          18.19%
4,096   60      out of memory   25.90%
FIGURE 9: Different moment computing implementation throughputs (throughput per minute). [Bar chart omitted; the values recovered from the chart's data table, by image size and moment order, are:]

Implementation   256×256 (20 / 40 / 60)   512×512 (20 / 40 / 60)   1024×1024 (20 / 40 / 60)
CPU              94.4 / 31.5 / 14.0       23.4 / 7.9 / 3.5         5.8 / 2.0 / 0.9
GPU (P100)       334.7 / 112.5 / 57.1     113.8 / 30.3 / 13.8      30.3 / 7.9 / 3.5
CPU+P100         350.0 / 147.5 / 66.4     116.6 / 37.2 / 16.4      35.6 / 9.8 / 4.5
CPU+4×P100       617.3 / 335.8 / 218.9    314.1 / 116.5 / 57.3     90.3 / 34.6 / 15.8
FIGURE 10: Different image reconstruction implementation throughputs (interpolated image reconstruction, throughput per minute). [Bar chart omitted; the values recovered from the chart's data table, by image size and moment order, are:]

Implementation   256×256 (20 / 40 / 60)   512×512 (20 / 40 / 60)   1024×1024 (20 / 40 / 60)
CPU              24.4 / 6.3 / 2.8         6.2 / 1.6 / 0.7          1.6 / 0.4 / 0.2
GPU (P100)       80.7 / 20.8 / 9.4        22.8 / 5.8 / 2.6         6.0 / 1.5 / 0.7
CPU+P100         102.2 / 26.4 / 12.0      27.0 / 6.9 / 3.1         7.1 / 1.8 / 0.8
CPU+4×P100       288.3 / 79.4 / 35.9      98.6 / 25.7 / 11.6       25.3 / 6.5 / 2.9
In Table 9, the transfer times of the image-independent data increase as the image size and the moment order increase. The transfer time of the 4×GPU implementation is longer than that of a single GPU; this is because the CPU memory buses are used by four GPUs instead of a single GPU.
For the image-dependent data, the transfer goes in both directions. First, the image pixel values are transferred from the host to the device; here, the moment order has no effect on the data transfer time. Second, after computing the moments, the device transfers the moment instances (the loop mitigation's mmitig values) to the host memory; the higher the moment order, the larger the transferred data volume and thus the transfer time. Using 4×GPUs slightly improves this transfer time, as only certain portions (rings) of the image pixels need to be moved to each GPU. Finally, Table 9 lists the data transfer times for the moment computations
only. The ratio of the data transfer time to the computational time of the image reconstruction task is relatively tiny: it is 0.4% and 0.03% for images of 256 × 256 pixels and 1,024 × 1,024 pixels, respectively, where the moment order is 20.
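To make the two transfer types concrete, the following minimal sketch (buffer names and sizes are our assumptions, not the authors' code) shows where each copy falls: the image-independent kernels Ip and Iq are copied once per image size and moment order and then reused, while the pixel buffer and the resulting moments move for every image:

    #include <cstddef>
    #include <cuComplex.h>
    #include <cuda_runtime.h>

    // Sketch of the two transfer types; h_/d_ prefixes denote host/device buffers.
    void transferForOneImage(const cuDoubleComplex* h_Ip, std::size_t ipElems,
                             const cuDoubleComplex* h_Iq, std::size_t iqElems,
                             const double* h_pixels,       std::size_t pixelElems,
                             cuDoubleComplex* d_Ip, cuDoubleComplex* d_Iq,
                             double* d_pixels,
                             cuDoubleComplex* d_moments, cuDoubleComplex* h_moments,
                             std::size_t momentElems) {
        // Image-independent data: reusable for every image of the same size and
        // order, so in practice it is copied once and kept resident on the device.
        cudaMemcpy(d_Ip, h_Ip, ipElems * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
        cudaMemcpy(d_Iq, h_Iq, iqElems * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

        // Image-dependent data: the pixels of the current image (host -> device).
        cudaMemcpy(d_pixels, h_pixels, pixelElems * sizeof(double), cudaMemcpyHostToDevice);

        // ... launch the moment-computation kernels here ...

        // Image-dependent results: the computed moment instances (device -> host);
        // this volume grows with the moment order.
        cudaMemcpy(h_moments, d_moments, momentElems * sizeof(cuDoubleComplex),
                   cudaMemcpyDeviceToHost);
    }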
TABLE 9: Data transfer times between the CPU and GPU(s) for moment computations, in seconds

                              GPU                                   4×GPU
Size            Order   image independ.   image depend.   image independ.   image depend.
256 × 256       20      0.16 ± 0.00       0.02 ± 0.00     0.47 ± 0.01       0.01 ± 0.00
256 × 256       40      0.17 ± 0.00       0.06 ± 0.00     0.48 ± 0.01       0.05 ± 0.00
256 × 256       60      0.18 ± 0.01       0.12 ± 0.00     0.48 ± 0.01       0.11 ± 0.00
512 × 512       20      0.18 ± 0.01       0.04 ± 0.00     0.48 ± 0.01       0.03 ± 0.00
512 × 512       40      0.21 ± 0.01       0.12 ± 0.00     0.48 ± 0.01       0.10 ± 0.00
512 × 512       60      0.23 ± 0.01       0.25 ± 0.00     0.48 ± 0.01       0.23 ± 0.00
1,024 × 1,024   20      0.22 ± 0.01       0.09 ± 0.00     0.49 ± 0.01       0.07 ± 0.00
1,024 × 1,024   40      0.31 ± 0.01       0.25 ± 0.01     0.50 ± 0.02       0.22 ± 0.01
1,024 × 1,024   60      0.40 ± 0.02       0.52 ± 0.00     0.57 ± 0.20       0.45 ± 0.02
6.5. Performance Analysis
In Table 2, the three-loop fusion speedup almost doubles as the image size doubles because of the increase in computation time. For example, for an image of 256 × 256 pixels, the number of rings is M = 128; thus, the synchronization step is executed 128 times. Doubling the image size to 512 × 512 doubles both the number of rings and the number of times the synchronization step is executed. On the other hand, the number of pixels in the latter image is four times that of the former image. Thus, the computational time increases more than the synchronization time, which leads to a better speedup. Despite this improvement, the CPU-only implementation with 16 cores performs better than the GPU-only implementation using the three-loop fusion; the speedups obtained are 15.17× (last row of Table 3) and 12.93× (last row of Table 2), respectively.
The proposed implementation achieves 257× and 277× speedups for the
interpolated color image reconstruction and QPCET moment computation, re-
spectively, using one CPU with 16 cores and four P100 GPUs, with a moment order of 60 and an image of 1,024 × 1,024 pixels. These execution times of the CPU-GPU implementations are three and two orders of magnitude shorter than the sequential execution times for moment computing and image reconstruction, respectively. The reported speedups
and time improvements in Tables 3 and 4 show that the loop mitigation tech-
nique scales smoothly when using a CPU with multiple GPUs. For instance, for a low moment order of 20 with a limited level of parallelism, loop mitigation boosts the level of parallelism to achieve a speedup of 78.97× for a 256 × 256 pixels image using CPU-GPUs, as shown in Table 3. As the moment order and image size increase, the speedups increase, reaching 276.89× for a 1,024 × 1,024 pixels image with a moment order of 60. The image reconstruction results, in Tables 5 and 6, show the same behavior as the moment computations.
Comparing the throughput of the loop mitigation technique on the GPU-only configuration with its throughput on a CPU plus a single GPU, the latter has a slight advantage. To keep the CPU busy while the GPU is working, a portion of the GPU workload, namely some of the ring pairs, is assigned to the CPU. The more ring pairs that are assigned to the GPU, the higher the level of parallelism on the GPU side. Thus, sharing the load increases the overall throughput but decreases the GPU throughput.
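A minimal sketch of this CPU/GPU load sharing follows (the split point, function names, and scheduling policy are our assumptions, not the authors' code): the GPU is launched first on its share of the ring pairs, and the remaining pairs are processed concurrently on the CPU cores with OpenMP.

    #include <omp.h>

    // Hypothetical per-ring-pair worker and asynchronous GPU launcher; both are
    // placeholders for the moment computation described in the text.
    static void processRingPairOnCPU(int /*ringPair*/) { /* ... */ }
    static void launchRingPairsOnGPU(int /*firstPair*/, int /*lastPair*/) { /* ... */ }

    // Share the ring pairs between the multicore CPU (OpenMP) and the GPU so that
    // the CPU stays busy while the GPU is working.
    void computeMomentsHybrid(int numRingPairs, int cpuShare) {
        // GPU part first, so the device starts working while the CPU loop runs.
        launchRingPairsOnGPU(cpuShare, numRingPairs);

        // CPU part: ring pairs 0 .. cpuShare-1, spread over the CPU cores.
        #pragma omp parallel for schedule(dynamic)
        for (int pair = 0; pair < cpuShare; ++pair) {
            processRingPairOnCPU(pair);
        }

        // ... synchronize with the GPU (e.g., cudaDeviceSynchronize) and merge results ...
    }

In the paper's setting, the CPU share (cpuShare here) would come from the workload-balancing model (the multiple-regression predictor mentioned in the conclusion) rather than a hand-picked constant.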
Finally, from a parallel processing perspective, the number of FLOPS outlines the improvement from one implementation to another. For instance, the double-precision implementation of the 2-loop fusion in [13], on a P100 GPU, performs 1.13 × 10^8, 1.12 × 10^9, and 1.10 × 10^9 FLOPS for images of 256 × 256, 512 × 512, and 1,024 × 1,024 pixels, respectively, where the moment order is 60. On the other hand, the proposed double-precision implementation using the loop mitigation technique, on a P100 GPU, performs 5.26 × 10^9, 5.21 × 10^9, and 5.19 × 10^9 FLOPS for images of 256 × 256, 512 × 512, and 1,024 × 1,024 pixels, respectively, where the moment order is 60. These results show that the proposed implementation improved the existing method by approximately five times.
These FLOPS figures are far from the theoretical computational peak of the P100 GPU, 4.3 × 10^12 FLOPS, for two reasons. First, the theoretical peak assumes a flow of ready operations without considering the memory latency factor. Second, the QPCET algorithm performs computations across a large volume of data elements. For instance, for an image
of 256 × 256 pixels and moment order 20, one independent task (GPU thread) of the 2-loop fusion in [13] must access 3 × (256 × 256) double variables of the image pixels, 21 × (256 × 256) complex numbers (each holding two double variables) of the Iq kernel, and 21 × 128 complex numbers, where 128 is the number of image rings. Fig. 4 explains the size of these data structures in more detail. Thus, in total, one thread accesses 3,014,656 double variables, each of eight bytes. This huge number of data accesses affects the GPU processing performance. On the other hand, the proposed method improves this performance by reducing the number of data accesses, dividing them by the mmitig value. For instance, using mmitig = 64 for the example discussed above, each thread of the proposed method accesses only 47,104 double variables, as each thread handles the data of two rings only. These data-access counts increase as the image size and the moment order increase.

6.6. Color image watermarking
The efficiency of the proposed method is tested in watermarking of color images. Three experiments are performed using three different well-known datasets. The first experiment is performed with the high-resolution fundus (HRF) color image dataset [35]. This dataset contains 45 color images of the same size, 3,504 × 2,336 pixels, divided into 15 images of healthy patients, 15 images of patients with diabetic retinopathy, and 15 images of glaucomatous patients. The dataset images are resized to 2,048 × 2,048. Nine randomly selected color images are displayed in Fig. 11a.
The second experiment is performed with the MEDNOD dataset of skin lesions [36]. This dataset contains 170 color images of different sizes, divided into two classes. The first class contains 70 color images of malignant melanoma, while the second class contains 100 color images of nevus lesions. For simplicity, all color images of this dataset are resized to 1,024 × 1,024. Six randomly selected color images are displayed in Fig. 11b.
TABLE 10: Average color image watermarking execution times in seconds and speedups for the three datasets

Dataset       Sequential time   GPU time   GPU speedup   4×GPUs time   4×GPUs speedup
Soccer Team   394.22            7.31       53.93x        2.12          185.95x
MEDNOD        1573.71           29.51      53.33x        8.61          182.78x
HRF           6229.31           116.21     53.60x        34.24         181.93x
The third experiment is performed with the soccer teams color image dataset [37]. This dataset contains 280 color images of different sizes. These images are divided into 7 classes for 7 soccer teams: A.C. Milan, Barcelona, Chelsea, Juventus, Liverpool, Real Madrid, and PSV Eindhoven, where each class contains 40 images. For simplicity, all color images are resized to 512 × 512. Eight color images are displayed in Fig. 11c.
Fig. 12 depicts the average logarithmic execution times of the color image watermarking for the three utilized datasets. The reported execution times are for the sequential, GPU, and 4×GPUs implementations applied to a single image of each dataset with a moment order of 30. Table 10 lists the execution times in seconds with the obtained speedups for the same datasets and implementations.
7. CONCLUSION
This paper proposed parallel implementations of the QPCET moment computation and of image reconstruction from the computed moments, for CPU-only, GPU-only, and CPU-GPU architectures, using OpenMP and CUDA. In this context, the implementations include different techniques to address the low level of parallelism for low-order moments and the workload imbalances between heterogeneous PUs. The loop mitigation technique is proposed to boost the level of parallelism, reduce the GPU space complexity and synchronization overhead, and balance the workloads. In addition, multiple regression is used for load balancing between CPUs and GPUs.
FIGURE 11: Samples of three different color image datasets for watermarking: (a) HRF sample images, (b) MEDNOD sample images, (c) Soccer teams dataset sample images. [Image panels omitted.]
FIGURE 12: Logarithmic values (Log10 of execution time) of the watermarking execution times for the different datasets: Soccer teams (512×512), MEDNOD (1,024×1,024), and HRF (2,048×2,048), for the Sequential, GPU, and 4×GPUs implementations. [Bar chart omitted.]
Finally, the performance of the proposed implementations is evaluated and reported to underline their effectiveness across different applications. For all the experiments, the proposed CPU-GPU implementation achieves significant average speedups of 257× and 277× for color image reconstruction and quaternion moment computation, respectively. For our future work, the proposed implementations will be extended to the generic quaternion domain. Thus, the method could be adapted, with a simple modification, to deal with other quaternion image moments, such as quaternion exponent moments, quaternion radial harmonic Fourier moments, and quaternion ZMs.
8. Acknowledgments
This work was supported in part by the National Key R&D Program of China under Grant 2016YFB0201402, and in part by the National Natural Science Foundation of China under Grant 61702170, Grant 61602350, Grant 61602170, and Grant 61750110531.
References
[1] L.-Q. Guo, M. Zhu, Quaternion Fourier–Mellin moments for color images, Pattern Recognition 44 (2) (2011) 187–195.
[2] H.-y. Yang, L.-l. Liang, Y.-w. Li, X.-y. Wang, Quaternion exponent moments and their invariants for color image, Fundamenta Informaticae 145 (2) (2016) 189–205.
[3] B. Kaur, S. Singh, J. Kumar, Iris recognition using Zernike moments and polar harmonic transforms, Arabian Journal for Science and Engineering (2018) 1–10.
[4] B. Kaur, S. Singh, J. Kumar, Cross-sensor iris spoofing detection using orthogonal features, Computers & Electrical Engineering 73 (2019) 279–288.
[5] Z. Xia, X. Wang, W. Zhou, R. Li, C. Wang, C. Zhang, Color medical image lossless watermarking using chaotic system and accurate quaternion polar harmonic transforms, Signal Processing 157 (2019) 108–118.
[6] J. Zhong, Y. Gan, J. Young, P. Lin, Copy move forgery image detection via discrete radon and polar complex exponential transform-based moment invariant features, International Journal of Pattern Recognition and Artificial Intelligence 31 (02) (2017) 1754005.
[7] K. M. Hosny, M. M. Darwish, Highly accurate and numerically stable higher order QPCET moments for color image representation, Pattern Recognition Letters 97 (2017) 29–36.
[8] M. J. M. Requena, P. Moscato, M. Ujaldón, Efficient data partitioning for the GPU computation of moment functions, Journal of Parallel and Distributed Computing 74 (1) (2014) 1994–2004.
[9] K. M. Hosny, A. Salah, H. I. Saleh, M. Sayed, Fast computation of 2D and 3D Legendre moments using multi-core CPUs and GPU parallel architectures, Journal of Real-Time Image Processing (2017) 1–15.
[10] Y. Xuan, D. Li, W. Han, Efficient optimization approach for fast GPU computation of Zernike moments, Journal of Parallel and Distributed Computing 111 (2018) 104–114.
[11] Y. Xin, M. Pawlak, S. Liao, Accurate computation of Zernike moments in polar coordinates, IEEE Trans. on Image Processing 16 (2) (2007) 581–587.
[12] S. Farokhi, S. M. Shamsuddin, J. Flusser, U. U. Sheikh, M. Khansari, K. Jafari-Khouzani, Rotation and noise invariant near-infrared face recognition by means of Zernike moments and spectral regression discriminant analysis, Journal of Electronic Imaging 22 (1) (2013) 013030.
[13] K. M. Hosny, M. M. Darwish, K. Li, A. Salah, Parallel multi-core CPU and GPU for fast and robust medical image watermarking, IEEE Access 6 (2018) 77212–77225.
[14] L. Dagum, R. Menon, OpenMP: an industry standard API for shared-memory programming, IEEE Computational Science and Engineering 5 (1) (1998) 46–55.
[15] C. Nvidia, Nvidia CUDA C programming guide, Nvidia Corporation 120 (18) (2011) 8.
[16] J. S. Vetter, S. Mittal, Opportunities for nonvolatile memory systems in extreme-scale high-performance computing, Computing in Science & Engineering 17 (2) (2015) 73–82.
[17] W. R. Hamilton, Elements of quaternions, Longmans, Green, & Company, 1866.
[18] T. A. Ell, S. J. Sangwine, Hypercomplex Fourier transforms of color images, IEEE Trans. on Image Processing 16 (1) (2007) 22–35.
[19] C. Wang, X. Wang, Y. Li, Z. Xia, C. Zhang, Quaternion polar harmonic Fourier moments for color images, Information Sciences 450 (2018) 141–156.
[20] P.-T. Yap, X. Jiang, A. C. Kot, Two-dimensional polar harmonic transforms for invariant image representation, IEEE Trans. on Pattern Analysis and Machine Intelligence 32 (7) (2010) 1259–1270.
[21] X.-y. Wang, W.-y. Li, H.-y. Yang, P. Wang, Y.-w. Li, Quaternion polar complex exponential transform for invariant color image description, Applied Math. and Computation 256 (2015) 951–967.
[22] K. M. Hosny, M. M. Darwish, Accurate computation of quaternion polar complex exponential transform for color images in different coordinate systems, Journal of Electronic Imaging 26 (2) (2017) 023021.
[23] J. Chen, K. Li, Z. Tang, K. Bilal, S. Yu, C. Weng, K. Li, A parallel random forest algorithm for big data in a Spark cloud computing environment, IEEE Trans. on Parallel and Distributed Systems 28 (4) (2017) 919–933.
[24] K. Li, W. Yang, K. Li, Performance analysis and optimization for SpMV on GPU using probabilistic modeling, IEEE Trans. on Parallel and Distributed Systems 26 (1) (2015) 196–205.
[25] S. Mittal, J. S. Vetter, A survey of CPU-GPU heterogeneous computing techniques, ACM Computing Surveys 47 (4) (2015) 69.
[26] Q. Liu, W. Luk, Heterogeneous systems for energy efficient scientific computing, in: International Symposium on Applied Reconfigurable Computing, Springer, 2012, pp. 64–75.
[27] Y. Ogata, T. Endo, N. Maruyama, S. Matsuoka, An efficient, model-based CPU-GPU heterogeneous FFT library, in: Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, IEEE, 2008, pp. 1–10.
[28] V. Podlozhnyuk, Image convolution with CUDA, NVIDIA Corporation white paper, June 2007 (3).
[29] H. Jiang, N. Ganesan, CUDAMPF: a multi-tiered parallel framework for accelerating protein sequence search in HMMER on CUDA-enabled GPU, BMC Bioinformatics 17 (1) (2016) 106.
[30] A. Di Biagio, G. Agosta, Improved programming of GPU architectures through automated data allocation and loop restructuring, in: 23rd International Conference on Architecture of Computing Systems 2010, VDE, 2010, pp. 1–8.
[31] J. Gong, S. Markidis, E. Laure, M. Otten, P. Fischer, M. Min, Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations, The Journal of Supercomputing 72 (11) (2016) 4160–4180.
[32] R. Reyes, F. de Sande, Optimization strategies in different CUDA architectures using llcomp, Microprocessors and Microsystems 36 (2) (2012) 78–87.
[33] C.-K. Luk, S. Hong, H. Kim, Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, 2009, pp. 45–55.
[34] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-UCSD Birds 200.
[35] A. Budai, R. Bock, A. Maier, J. Hornegger, G. Michelson, Robust vessel segmentation in fundus images, International Journal of Biomedical Imaging 2013.
[36] I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman, N. Petkov, MED-NODE: a computer-assisted melanoma diagnosis system using non-dermoscopic images, Expert Systems with Applications 42 (19) (2015) 6578–6585.
[37] J. Van De Weijer, C. Schmid, Coloring local feature extraction, in: European Conference on Computer Vision, Springer, 2006, pp. 334–348.
Highlights:
1. We proposed an accurate parallel implementation for the QPCET moment with zero ignored pixels.
2. We proposed a polar moment-based optimized prediction of the balanced workload of heterogeneous processing units (PUs).
3. We proposed a technique, called loop mitigation, for boosting the level of parallelism, reducing the synchronization cost, and load balancing for loop iterations with unequal computational loads.
4. We proposed a set of parallel implementations, including multicore CPU-only, GPU-only, and CPU-GPU implementations, for the QPCET moment.
The authors declare that there is no conflict of interest regarding the publication of this article.
Ahmad Salah received the PhD degree in computer science from Hunan University, China. He received the Master degree in CS from Ain-Shams University, Cairo, Egypt. His current research interests are parallel computing, computational biology, and algorithms.
Kenli Li received the PhD degree in computer science from Huazhong University of Science and Technology, China, in 2003. He was a visiting scholar at the University of Illinois at Urbana-Champaign from 2004 to 2005. He is currently a full professor of computer science and technology at Hunan University and deputy director of the National Supercomputing Center in Changsha. His major research areas include parallel computing, high-performance computing, and grid and cloud computing. He has published more than 130 research papers in international conferences and journals, such as IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, Journal of Parallel and Distributed Computing, ICPP, CCGrid. He is an outstanding member of CCF. He is a member of the IEEE and serves on the editorial board of the IEEE Transactions on Computers.
Khalid M. Hosny is a professor of information technology, faculty of Computers and Informatics at Zagazig University. He received the B.Sc., M.Sc. and Ph.D. from Zagazig University, Zagazig, Egypt in 1988, 1994, and 2000 respectively. From 1997 to 1999 he was a visiting scholar, University of Michigan, Ann Arbor and University of Cincinnati, Cincinnati, USA. He is a senior member of ACM and a member of IEEE and the IEEE computer society. His research interests include image processing, pattern recognition and computer vision. Dr. Hosny published more than 50 papers in international journals. He is an editor and scientific reviewer for more than thirty international journals.
Mohamed M. Darwish is an assistant professor of Computer Science, Faculty of Science, Assiut University, Assiut, Egypt. He received the B.Sc. (Hons.), M.Sc., and Ph.D. in Computer Science from the Faculty of Science, Assiut University. His research areas are image processing, association rule mining, and medical image analysis.
Qi Tian received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the Ph.D. degree in ECE from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2002, and the M.S. degree in ECE from Drexel University, Philadelphia, PA, USA in 1996. He is currently a Full Professor with the Department of Computer Science, University of Texas at San Antonio (UTSA), San Antonio, TX, USA. Dr. Tian is the Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA (TMM), the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY (TCSVT), Multimedia System Journal (MMSJ), and is on the Editorial Board of Journal of Multimedia (JMM), and Journal of Machine Vision and Applications (MVA). He is the Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA and the Journal of Computer Vision and Image Understanding. His research projects are funded by ARO, NSF, DHS, Google, FXPAL, NEC, SALSI, CIAS, Akiira Media Systems, HP, Blippar, and UTSA. He was the recipient of the 2014 Research Achievement Award from College of Science, UTSA. He was the recipient of the 2010 ACM Service Award. He was the coauthor of a Best Paper in ACM ICMR 2015, a Best Paper in PCM 2013, a Best Paper in MMM 2013, a Best Paper in ACM ICIMCS 2012, a Top 10 Paper Award in MMSP 2011, a Best Student Paper in ICASSP 2006, and coauthor of a Best Student Paper Candidate in ICME 2015, and a Best Paper Candidate in PCM 2007.