Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging

Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging

ARTICLE IN PRESS JID: ADES [m5G;May 27, 2016;21:48] Advances in Engineering Software 0 0 0 (2016) 1–8 Contents lists available at ScienceDirect A...

2MB Sizes 0 Downloads 17 Views

ARTICLE IN PRESS

JID: ADES

[m5G;May 27, 2016;21:48]

Advances in Engineering Software 0 0 0 (2016) 1–8

Contents lists available at ScienceDirect

Advances in Engineering Software journal homepage: www.elsevier.com/locate/advengsoft

Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging a ˇ Milan Jaroš a,b, Petr Strakoš a,∗, Tomáš Karásek a, Lubomír Ríha , Alena Vašatová a,b, a,b a,b Marta Jarošová , Tomáš Kozubek a b

IT4Innovations, VŠB Technical University of Ostrava, Czech Republic Department of Applied Mathematics, VŠB Technical University of Ostrava, Czech Republic

a r t i c l e

i n f o

Article history: Received 24 July 2015 Revised 24 February 2016 Accepted 16 May 2016 Available online xxx Keywords: Image processing K-means Liver Computed tomography CT images GPU MIC Intel Xeon Phi Coprocessors

a b s t r a c t The paper presents speed up of the k-means algorithm for image segmentation. This speed up is achieved by effective parallelization. For parallel implementation we focus on Many Integrated Core (MIC) architecture with Intel Xeon Phi coprocessors. The MIC implementation is compared with GPU, CPU and sequential implementation. To demonstrate parallel capabilities of k-means algorithm, segmentation of CT images of human body are used. Results of this work will be used for development of the software application for automatic 3D model reconstruction of heart and liver.

1. Introduction The role of High Performance Computing (HPC) increases its importance in the human society. Technical computing became an integral part of research and development of new technologies in our civilization [1]. Computational precision and effectivity are together with the price of the computation more and more accentuated. Although CPU speed is still increasing, power consumption and thus computation price stays unchanged or decreases only a little. Here comes the room for coprocessors like Intel Xeon Phi [2,3]. In medicine technical advances are common and bring enormous increase in processed data. To solve this phenomenon effective tools for processing these data are necessary. Computed tomography (CT) is one of the particular areas of medicine where large amounts of data appear. CT uses x-rays to create an image of human body in sequential slices. With development of this technique new directions in diagnostic medicine were found [4,5]. It can validate therapeutic effectiveness in cancer diseases or help in



Corresponding author. E-mail address: [email protected] (P. Strakoš).

© 2016 Elsevier Ltd. All rights reserved.

surgical planning. These procedures can be especially valuable in treatment of liver diseases. To successfully go through the whole treatment of liver disease 3D model of the organ can be very useful. Creation of the 3D model from CT images requires image segmentation. The image segmentation is digital image processing tools which can automatically split digital image into the regions with the same properties [6–8]. Results of the image segmentation can then be directly used in 3D model reconstruction of the organ. There is numerous software tools for image segmentation available. They are both commercial and open source. Disadvantage of the most open-source software for image segmentation is that they do not support parallel image processing. This leads due to large scale data being processed to long computational times and to high computational power requirements. For example segmenting a thousands of medical images of human body and creating 3D models of segmented data could take hours or days based on the complexity of segmentation. HPC technologies can dramatically speed up those computations. There are several options from the branch of HPC implementations that can be used for speeding up the computations. The wellknown one is based on GPU architecture [9]. Although the GPU

http://dx.doi.org/10.1016/j.advengsoft.2016.05.008 0965-9978/© 2016 Elsevier Ltd. All rights reserved.

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

ARTICLE IN PRESS

JID: ADES 2

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

Fig. 2. Image segmentation by k-means method. Fig. 1. CT image in DICOM format.

architecture can improve the computational power dramatically, it has not brought as much attention as expected. It is mainly due to quite complicated implementation and restricted utilization. Another option for increasing the computational power is in the usage of Intel Xeon Phi coprocessors. Intel Xeon Phi can further extend the scaling capability of Intel Xeon processor-based systems. The Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth. Utilization of the Phi coprocessors is the new trend in hardware development for supercomputers. As an example the worlds fastest supercomputer Milky Way 2, which is using the MIC architecture, can be mentioned as a reference [10]. The new coprocessors Intel Xeon Phi are built on more than sixty low power consumption cores in one unit. By cooperating with CPU computational power is increased. If one compares solution based on CPUs only and CPU with Xeon Phi coprocessors, power consumption can be lowered significantly while offering the same computational power. Intel Xeon Phi uses Many Integrated Cores architecture (MIC), which can perform billions of operations per second (teraflops) in floating point numbers. MIC architecture is providing compatibility with standards like OpenMP and MPI. The paper is based on [11], where results of parallelization are presented on CPUs and MIC architecture. In this paper additional research in terms of GPU implementation and its comparison with aforementioned implementations is presented.

2. Pre-processing procedures The first step is to retrieve the measurement data stored in the DICOM format. DICOM (Digital Imaging and Communications in Medicine) is a standard for storing, displaying and distributing medical data obtained from CT, MRI or ultrasound. Example of CT image stored in DICOM format is depicted in Fig. 1. In our research we deal mainly with medical data obtained from CT. CT gives higher resolution in comparison with MRI. While acquiring images from CT that are stored as axial slices of a patient we tend to acquire as many slices of a body as possible. Our data contain images at mutual axial distance of 0.6 mm. This gives a good prerequisite for future 3D model reconstruction. Drawback of small distance between slices is a presence of noise. The image

noise is reduced by noise reduction filters (gauss blur, anisotropic filter, adaptive Wiener filter, etc.). Next step of pre-processing is to convert the original values of pixel intensities from CT to Hounsfield Units (HU). To convert from the normal units to HU we have to apply a linear transformation of the data. This is written in Eq. (1)

IHU = Io · s + i,

(1)

where IHU is pixel intensity in Hounsfield Units, Io is pixel intensity within an original image, s stands for rescale slope and i stands for rescale intercept. The rescale slope and the rescale intercept are stored in the DICOM files and typically have values of 1 for the slope and -1024 for the intercept. After converting to HU we can define the tissue type based on typical HU values. For example, the air has -10 0 0 HU, the water has 0 HU, the liver has value from +40 to +60 HU and the bones have typically +700 HU or greater. The last step before we can proceed to segmentation by kmeans algorithm is the transformation of the image data. Usually the image segmentation is performed on 2D image. In next step we need to segment all consecutive images in a series as one 3D image. To do this information from all the data has to be stored into one image vector. It means that intensity value of each pixel in an image and each image in a series is taken and inserted into one image vector. 3. K-means clustering method K-means clustering is method originating in signal processing and is often used in data mining. Generally this method can be employed in different areas including image processing, where it can be used as image segmentation method based on clustering. This method partitions n pixels into k clusters, where k is integer value that holds k < n. K-means algorithm classifies pixels in an image into k number of clusters according to some similarity feature like grey level intensity of pixels and distance of pixel intensities from centroid pixel intensity [8,12,13]. Example of a 2D image segmentation is in Fig. 2. The algorithm works in the following way: (i) Selection of k clusters (k is a user defined parameter) (ii) Calculation of the number of image pixels N (iii) Selection of k initial pixel intensity centroids μj (usually done randomly or using some prior knowledge)

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

ARTICLE IN PRESS

JID: ADES

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

3

Fig. 3. Flood algorithm (left), marching cubes method (right).

(iv) Calculation of distances Dij between pixel xi and each centroid μj as given in Eq. (2). Particular pixel xi is then classified to cluster cj to which centroid it has the smallest distance.

Di j = ( xi − μ j ) 2

(2)

where i = 1 ÷ N and j = 1 ÷ k (v) Recalculation of centroid positions μj as a mean value from all pixel intensities which belong to cluster cj as shown in Eq. (3).

μj =

lj 1  · xi lj

(3)

i=1

where lj is a number of pixels that belong to cluster cj . (vi) Steps (iv) and (v) are repeated until classification of the image pixels does not change or equivalently, centroids do not move.

4. Post-processing procedures After determining the individual segments we can proceed to the reconstruction of the surface. The surface of each segment in 3D is reconstructed based on the knowledge of its boundary curves. It is a delineation of the segment in each slice. Every boundary curve is represented by a set of pixels, which is obtained using the flood algorithm [14]. Explanation of this algorithm is depicted in Fig. 3 (left). This algorithm fills all connected areas represented by the segment and also marks the segment boundary. Resulting boundary curves are then used as input for the 3D reconstruction using Marching Cubes method [15]. Basic principle of the Marching Cubes method can be described with the help of Fig. 3 (right). Algorithm goes through every pixel of the 3D image. This step can be represented by marching an imaginary cube through a scalar field, where presence or absence of the boundary in each vertex of an imaginary cube is classified to one of the 15 generalized cases that assign particular surface polygons. After the whole scalar field (the whole 3D image) is walked through, 3D reconstruction composed of connected polygons is created. The final results showing 3D reconstructions of human body organs and the skeleton by using Marching Cubes method are shown in Fig. 4.

Fig. 4. Resulting 3D model using marching cubes method.

5. Sequential implementation The sequential version of the k-means algorithm can be directly implemented using several for loops embedded in one while loop. In for loops calculation of distances between each pixel and centroid is performed. Recalculation of the centroids position for the next iteration of the while loop is also done here. The while loop condition for another iteration is inequality between the centroids position from the previous iteration and the actual iteration. If the centroids position does not change, the loop is broken and algorithm stops. Possible interpretation of such sequential algorithm can be written as:

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

JID: ADES 4

ARTICLE IN PRESS

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

1. while previous centroids position differs from actual centroids position 1. for i = 1 ÷ N image pixels, do the classification 1. for j = 1 ÷ k clusters 1. Compute the distance ||xi − u j ||2 2. Classify pixel xi by the cluster cj if ||xi − u j ||2 is among others the smallest. 2. end 2. end 1. for j = 1 ÷ k clusters, do centroids position update 1. Count the number of pixels classified by cluster cj 2. Sum up all pixel intensities xi classified by cluster cj 3. Using 1. and 2. compute the mean value as new centroid position 2. end 2. end 6. Parallel implementation The sequential implementation is direct programmatic implementation of the k-means algorithm. Using the sequential implementation we can run into the problems with memory allocation and speed of convergence. This is visible from first two lines of the sequential implementation (the command while and for). We can optimize the basic algorithm by working exclusively with the histogram of image vector. The histogram reduces (compress) the amount of original data. The implementation of such modified algorithm is shown below. Shift the interval of pixel values to < 0, brightestvalue > Find the brightest pixel Calculate the size of histogram (= brightest + 1) Create the image histogram and find non-zero values in histogram (save to histogram indexes) 5. Initialize centroids 6. while previous centroids differs from actual centroids 1. for i = 1 ÷ H histogram indexes, do the classification 1. for j = 1 ÷ k clusters 1. Compute the distance ||hi − u j ||2 2. Classify histogram value hi by the cluster cj if ||hi − u j ||2 is among others the smallest. 2. end 2. end 1. 2. 3. 4.

1. for j = 1 ÷ k clusters, do centroids position update 1. Count the number of histogram values classified by cluster cj 2. Sum up all histogram values hi classified by cluster cj 3. Using 1. and 2. in this loop compute the mean value as new centroid position 2. end 7. end 8. for i = 1 ÷ N image pixels, do labelling 1. for j = 1 ÷ k clusters 1. Compute the distance ||xi − u j ||2 2. Label pixel xi by the cluster cj if ||xi − u j ||2 is among others the smallest. 2. end 9. end The main blocks of algorithm that are ideal for parallelization are: 1, 2, 5, 6 (only for loops), 8. The parallel implementation has been divided into following parts: • •

allocMem – memory allocation calculateHistogram – calculation of histograms for each image

• •





initializeCentroids – initialization of the clustering centroids computekMeans – recalculation of the cluster centroids until reaching the optimum doLabelling – classification of each pixel to the appropriate labelled cluster freeMem – de-allocation of the memory

Above mentioned parts were parallelized in same way for all versions of the code. Only difference is in used technology for CPU, MIC and GPU version. OpenMP technology was used for CPU. For MIC implementation offload OpenMP mode was selected and for GPU, CUDA technology was used. As an example of parallelization in Sections 6.1, 6.2 and 6.3 doLabelling is used. This method classifieds each pixel based on its intensity. 6.1. CPU OpenMP 4.0 (Open Multi-Processing) technology was used to achieve parallelization. OpenMP is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming. A code example for doLabelling part is shown below. Arguments passed into this method are: image - vector of input picture, imageSize - vector size, labelCountCPU - number of domains of final segmentation, mu - vector of mean values of segmented areas. Output of this method is vector mask, which for each pixel assign its appropriate segment number this pixel belong to. This method is parallelized in such a way that this vector is divided into parts based on number of cores and one thread per core is used. This is achieved by command (pragma omp parallel for), which for each pixel (image[i]) search for segment (mu[j]) with smallest difference of mean value (minDistance). void doLabellingCPU(IMAGE_TYPE *image, int64_t imageSize, int64_t labelCount, float *mu, char *mask) { #pragma omp parallel for for (int64_t i = 0; i < imageSize; i++) { float minDistance = FLT_MAX; for (int64_t j = 0; j < labelCount; j++) { float distance = fabs(image[i] - mu[j]); if (minDistance > distance) { minDistance = distance; mask[i] = j; } } } }

6.2. MIC Intel Xeon Phi 5110P (MIC - Many Integrated Cores architecture) can be utilized in various ways. It can work in: (1) native mode: only accelerator executes the computations; (2) hybrid offload OpenMP mode: application runs on CPU and selected parts of the program run on coprocessor; (3) MPI symmetrical mode: MPI processes run on both CPU and MIC and workload is divided using MPI processes. In our implementation the offload OpenMP mode was used. An example of the code for doLabellingMIC can be seen below. Arguments passed into this method are mic_image - vector of input picture, imageSize - vector size, labelCount - number of domains of final segmentation, mic_mu - vector of mean values of segmented areas. Output of this method is vector mic_mask which for each pixel assign its appropriate segment number this pixel belong to. This method is parallelized

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

JID: ADES

ARTICLE IN PRESS

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

in same way as a CPU version of code (see Section 6.1), i.e. input vector is divided into parts based on number of cores. In this case 61 cores per CPU and four threads per core are used. Each thread search for each pixel (mic_image[i]) its appropriate segment (mic_mu[j]) with smallest difference of mean value (minDistance). Memory on MIC for mic_image, mic_mu and mic_mask was already allocated so now we could write only alloc_if(0) and free_if(0) in the code. The speedup was achieved by using the SIMD (Single instruction, multiple data) vectorization (pragma omp simd collapse(2)). __declspec(target(mic)) void doLabellingMIC(int64_t imageSize, int64_t labelCount) { #pragma offload target(mic)\ out(mic_mask: length(imageSize) alloc_if(0) free_if(0))\ nocopy(mic_image: length(imageSize) alloc_if(0) free_if(0))\ out(mic_mu : length(labelCount) alloc_if(0) free_if(0)) { //labelling #pragma omp parallel for #pragma omp simd collapse(2) for (int64_t i = 0; i < imageSize; i++) { float minDistance = FLT_MAX; for (int j = 0; j < labelCount; j++) { float distance = fabs(mic_image[i] - mic_mu[j]); if (minDistance > distance) { minDistance = distance; mic_mask [i] = j; } } } } }

6.3. GPU The GPU implementation was performed on NVIDIA Kepler K20. In our implementation the CUDA 6.5 technology was used. CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. An example of code for doLabellingCUDA is shown below. In GPU version of our code we had to create additional function which is not used in CPU and MIC versions. It is a global function _doLabellingCUDA. Global functions are also called “kernels”. It’s the function that you may call from the host and it is executed in the device. Arguments passed into this method are: _d_image - vector of input image, imageSize vector size, labelCount - number of domains of final segmentation, _d_centers - vector of mean values (mean pixel intensity value) of segmented areas. Output of this method is vector _d_mask, which for each pixel assign its appropriate segment number this pixel belongs to. Memory allocation for _d_image, _d_centers and _d_mask was made using CUDA API for handling device memory cudaMalloc() (it is a similar to the C function malloc()) earlier in code. Each parallel invocation of _doLabellingCUDA is referred to as a block. The set of blocks is referred to as a grid. A block can be split into parallel threads. In our case we use 1024 threads. To access particular element of vector we have to calculate its position by: int64_tid = threadIdx.x + blockIdx.x ∗ blockDim.x; where threadIdx.x is the number associated with each thread within a block, the blockIdx.x refers to the label associated

5

with a block in a grid and blockDim.x is a next built-in variable for count of threads per block. Each thread search for each pixel _d_image[tid] its appropriate segment (centers[j]) with smallest difference of mean value (minDistance). The speed-up was achieved by using the shared memory. Within a block, threads share data via shared memory (__shared__ float centers). __global__ void _doLabellingCUDA(IMAGE_TYPE *_d_image, int64_t imageSize, int64_t labelCount, float *_d_centers, char *_d_mask) { __shared__ float centers [HISTOGRAM_SIZE]; if (threadIdx.x < labelCount) centers[threadIdx.x] = _d_centers[threadIdx.x]; __syncthreads(); //labelling int64_t tid = threadIdx.x + blockIdx.x * blockDim.x; if (tid < imageSize) { float minDistance = FLT_MAX; //_d_centers[j] IMAGE_TYPE tempImage = _d_image[tid]; IMAGE_TYPE tempJ = 0; for (int64_t j = 0; j < labelCount; j++) { float distance = fabs(tempImage - centers[j]); if (minDistance > distance) { minDistance = distance; tempJ = j; } } _d_mask[tid] = tempJ; } } void doLabellingCUDA(int64_t imageSize, int64_t labelCount) { int64_t nBlocks = (imageSize/ 1024 + 1); _doLabellingCUDA << > >(d_image, imageSize, labelCount, d_centers, d_mask); cudaDeviceSynchronize(); }

7. Example For testing up to 8191 of CT (DICOM) images have been used. The reason for maximally 8191 images was due to memory limitations of the GPU implementation. Nevertheless such amount of data was more than sufficient as approximately 30 0 0 images covers the length of adult person. Each CT image contained 512x512 pixels. Our application was compiled with Intel C++ Composer XE 2013 SP1 and NVIDIA CUDA 6.5. We have used one node of our supercomputer Anselm with 2x Intel Sandy Bridge E5-2470 2.3GHz CPUs. For accelerated nodes MIC accelerator Intel Xeon Phi 5110P or GPU accelerator NVIDIA Tesla Kepler K20 in addition to the previous CPU configuration have been used. For debugging in offload mode we used Allinea DDT 4.2.1. 8. Results In connection to [11] we present here tests obtained first on smaller data series (1-10 0 0 images) and then on the dataset up to 8191 images. We present also results for different number of labelling clusters as another parameter that influences the complexity of the solved problem. In Table 1 we can see the time dependencies (in seconds) of used algorithm and architecture for dataset between 1-10 0 0 images and 10 labelled clusters. In Table 2 we can see results for 32 clusters and dataset between 1-8191 images. The

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

ARTICLE IN PRESS

JID: ADES 6

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

Fig. 5. The comparison of the sequential CPU implementation and the parallel implementations on different architectures (CPU, MIC); 10 clusters.

Fig. 6. The comparison of different parallel implementations on different architectures (CPU, MIC, GPU); 32 clusters.

Table 1 The time dependencies of different algorithms and architectures; 10 clusters. Clusters = 10

Runtime [s]

No. of images [-]

CPU1

CPU16

Seq.

MIC+t

1 10 50 100 200 500 10 0 0

0.005 0.040 0.220 0.440 0.880 2.210 4.430

0.001 0.010 0.033 0.065 0.130 0.310 0.640

0.050 0.440 2.090 3.670 8.520 67.140 726.040

0.330 0.330 0.430 0.570 0.830 1.620 2.880

Table 3 shows results for 64 clusters and 1-8191 images. In all of the tables: Seq. is time of sequential implementation ( Section 5), CPUn is time of parallel implementation (Section 6) on CPU with n cores, MIC+t is time of parallel implementation on Intel Xeon Phi

with considering the time for data transfer and data allocation (total solution time). The same holds for GPU results. The GPU+t is total solution time. In Fig. 5 we can see the comparison between sequential and specific parallel implementations on smaller image series (1-10 0 0) and smaller number of clusters, which is 10. The runtime of parallel algorithm on the CPU is shorter than the total runtime on MIC. The original sequential implementation is much slower than optimized parallel algorithms. In Fig. 6 results for larger dataset (1-8191) and higher number of clusters (32 clusters). The effectivity of the both accelerated parallel versions (MIC, GPU) starts to be visible here. The higher speed-up of the accelerated parallel versions over CPU is visible in Fig. 7, where the number of clusters is 64. The accelerated parallel versions become beneficial from approximately 750 images in the series. In larger image series the sequential implementation of the algorithm has not been tested since the total runtime would be too long (approximately 11 days).

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

ARTICLE IN PRESS

JID: ADES

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

7

Fig. 7. The comparison of different parallel implementations on different architectures (CPU, MIC, GPU); 64 clusters.

Table 2 The time dependencies of different algorithms and architectures; 32 clusters. Clusters = 32

Table 3 The time dependencies of different algorithms and architectures; 64 clusters. Clusters = 64

Runtime [s]

Runtime [s]

No. of images [-]

CPU1

CPU16

GPU+t

MIC+t

No. of images [-]

CPU1

CPU16

GPU+t

MIC+t

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8191

0.017 0.038 0.079 0.150 0.305 0.588 1.161 2.303 4.605 9.166 18.272 36.512 72.985 145.864

0.171 0.133 0.085 0.152 0.221 0.231 0.326 0.323 0.447 0.676 1.354 2.692 5.371 10.715

1.376 1.433 1.436 1.430 1.431 1.448 1.459 1.476 1.519 1.617 1.880 2.340 2.869 4.345

0.255 0.247 0.259 0.270 0.302 0.299 0.361 0.443 0.612 0.997 1.698 3.157 6.038 11.985

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8191

0.029 0.074 0.143 0.289 0.555 1.096 2.151 4.273 8.531 17.015 33.983 67.883 136.615 267.472

0.152 0.166 0.194 0.215 0.103 0.286 0.316 0.732 1.036 1.255 2.514 5.211 10.057 19.625

1.438 1.415 1.404 1.418 1.425 1.430 1.449 1.487 1.395 1.735 1.877 2.398 3.252 5.627

0.244 0.265 0.249 0.287 0.277 0.303 0.355 0.456 0.634 1.079 1.854 3.447 6.695 12.659

Acknowledgment 9. Conclusion Effective parallelization of the k-means algorithm has been presented in this paper. This algorithm was used for image segmentation of medical data from computed tomography (CT). In the paper example comparison of typical sequential implementation and parallel implementations of the mentioned algorithm is presented. Beside the typical CPU parallel implementation the accelerator implementations on MIC and GPU have been tested. All tests ran always on one node of supercomputer Anselm located at IT4Innovations national supercomputing center, Ostrava, Czech Republic. Results show high effectivity of parallel implementations in terms of short solution times in comparison to sequential implementation. The accelerator implementations show possible further speed-up above the CPU solution especially in long datasets (750-8191 images) and high number of clusters. In our future work we will focus on development of software for fast image segmentation of human organs with further 3D model reconstruction and visualization.

This work was supported by The Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project IT4Innovations excellence in science - LQ1602 and from the Large Infrastructures for Research, Experimental Development and Innovations project IT4Innovations National Supercomputing Center LM2015070. References [1] Hager G, Wellein G. Introduction to high performance computing for scientists and engineers. Boca Raton, FL: CRC Press; 2011. [2] Reinders J. An overview of programming for Intel Xeon processors and Intel Xeon Phi coprocessors. Santa Clara, CA: Intel Corporation; 2012. [3] Jeffers J, Reinders J. Intel Xeon Phi Coprocessor high-performance programming. Elsevier; 2013. [4] Hsieh J. Computed tomography: principles, design, artifacts, and recent advances. SPIE and John Wiley & Sons, Inc.; 2002. [5] Kak AC, Slaney M. Principles of computerized tomographic imaging. SIAM Society of Industrial and Applied Mathematics; 2001. [6] Neri E, Caramella D, Bartolozzi C. Image processing in radiology. Berlin: Springer; 2008.

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008

JID: ADES 8

ARTICLE IN PRESS

[m5G;May 27, 2016;21:48]

M. Jaroš et al. / Advances in Engineering Software 000 (2016) 1–8

[7] Terry Y. Insight into images: principles and practice for segmentation, registration, and image analysis. Wellesley, MA: A K Peters/CRC Press; 2004. [8] Sonka M, Hlavac V, Boyle R. Image processing, analysis and machine vision. Toronto: Thomson; 2006. [9] Luebke D. GPU architecture: Implications & trends, ser. SIGGRAPH 2008. 2008. [Online]. Available from: http://s08.idav.ucdavis.edu/luebke- nvidia- gpuarchitecture.pdf. [10] Alba D. China’s tianhe-2 caps top 10 supercomputers. IEEE Spectrum2013-0617 Retrieved 2013-06-19. [11] Strakos P, Jaros M, Karasek T, Riha L, Jarosova M, Kozubek T, et al. Parallelization of the image segmentation algorithm for Intel Xeon Phi with application in medical imaging. In: Ivnyi P, Topping BHV, editors. Proceedings of the fourth international conference on parallel, distributed, grid and cloud computing for engineering. Civil-Comp Press, Stirlingshire, UK, Paper 7; 2015. 10.4203/ccp.107.7.

[12] Dass R, Priyanka, Devi S. Image segmentation techniques. IJECT 2012;3:2230–7109. [13] Luo S, Li X, Li J. Review on the methods of automatic liver segmentation from abdominal images. J Comput Commun 2014;2:1–7. doi:10.4236/jcc.2014.22001. [14] Flood fill. 2015. Available from: http://en.wikipedia.org/wiki/Flood_fill, [20151-24]. [15] Lorensen WE, Cline HE. Marching cubes: a high resolution 3d surface construction algorithm. Comput Graph 1987;21.

Please cite this article as: M. Jaroš et al., Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: Application in medical imaging, Advances in Engineering Software (2016), http://dx.doi.org/10.1016/j.advengsoft.2016.05.008