GPU accelerated dynamic functional connectivity analysis for functional MRI data

Computerized Medical Imaging and Graphics 43 (2015) 53–63


Devrim Akgün (a,*), Ünal Sakoğlu (b), Johnny Esquivel (b), Bryon Adinoff (c,d), Mutlu Mete (b)

a Department of Computer Engineering, Sakarya University, Sakarya, Turkey
b Department of Computer Science and Information Systems, Texas A&M University – Commerce, Commerce, TX, USA
c VA North Texas Health Care System, Dallas, TX, USA
d Department of Psychiatry, UT Southwestern Medical Center, Dallas, TX, USA

Article info

Article history: Received 15 November 2014; received in revised form 20 February 2015; accepted 25 February 2015.

Keywords: GPU computing; CUDA; OpenMP; Dynamic functional connectivity; Functional magnetic resonance imaging; fMRI

Abstract

Recent advances in multi-core processors and graphics-card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both the Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. The other approach developed in this study, to better utilize the CUDA architecture, is the block-based approach, in which the parallelization operates on smaller parts of the fMRI time-courses obtained by sliding windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. The multicore implementation using OpenMP on an 8-core processor provides up to a 7.7× speed-up, and the GPU implementation using CUDA yielded substantial accelerations, ranging from an 18.5× to a 157× speed-up once the thread-based and block-based approaches were combined. The proposed parallel programming solutions show that multi-core processor and CUDA-supported GPU implementations accelerate DFC analyses significantly, and the developed algorithms make DFC analyses more practical for multi-subject studies and for more dynamic analyses.

1. Introduction

The computational power of multicore processors and Graphical Processing Unit (GPU) supported devices, which are now widely available in commonly used computers, provides a practical way to implement computationally intensive algorithms in parallel. The Open Multi-Processing (OpenMP) application programming interface provides an efficient abstraction for implementing parallel algorithms on multicore processors [1,2], and software developers have taken advantage of OpenMP to speed up their applications on regular multicore CPUs. In addition to multicore processors, most modern graphics cards with GPUs serve as computing hubs for the acceleration of parallelizable algorithms that demand high computational power.


Numerous improvement attempts have greatly increased the flexibility of GPU hardware and allowed the GPU vendors to scale better horizontally, and over time the uneven work distribution in this aspect of the architecture has been refined further. Specialized processor types, each with its own special instructions, have been unified into one less specialized processor type that is able to handle more general tasks; these unified processors have become more complex as well as more flexible. Currently, the large number of lightweight threads in a GPU enables properly designed parallel algorithms to be executed much faster than their sequential implementations. The Compute Unified Device Architecture (CUDA) provides an exceptional foundation for programmers to utilize NVidia(R) GPUs efficiently [3–5]. Since their debut, OpenMP- and CUDA-based parallel applications have attracted researchers to use these tools in scientific and engineering applications, including bioinformatics, and utilizing parallel computing methods to boost computational throughput has gained importance.


For example, various image and signal processing applications, heuristic optimization methods such as genetic algorithms, simulated annealing and particle swarm optimization, and solvers for differential equations have been implemented successfully using CUDA. Parallel computing approaches are also of high interest in bioinformatics for accelerating time-consuming computations [6–10]. Dynamic functional connectivity (DFC) analysis is one of the computationally intensive methods in neuroscience for investigating dynamic interactions among different brain networks (it has also been applied to other organs such as the spinal cord [11]) from functional magnetic resonance imaging (fMRI) data, using dynamic sliding-window temporal correlation analysis [12–17]. DFC-based analyses identify pairs of brain areas that are activated or deactivated in tandem over the duration of an fMRI experiment. However, the workload for fMRI data from multiple subjects can be very large when interacting brain networks are examined with DFC: for example, 234 scans with 435 pairwise network combinations per scan require 234 × 435 = 101,790 DFC computations in human studies [13,15,18], and similar analyses are performed on small animals' brains (e.g., rodents [19]). Hence, DFC analysis can be computationally intensive depending on the imaging data size, resolution, and analysis parameters such as the sliding-window width and the window step size. The need for a more rapid computational approach for DFC analysis warranted parallelization of the DFC algorithm and utilization of GPUs. GPUs are now common and inexpensive; hence, utilizing parallel computations on GPUs would make DFC analyses feasible for many researchers. Moreover, with GPU parallelization, dynamic brain interactions can be assessed in near-real-time or real-time during fMRI experiments using DFC analysis. Another important application of the parallelization is the recently developed real-time neurofeedback fMRI experiments. It has recently been shown that the interactions occurring among different brain regions can be exploited as neurofeedback, which can potentially be used to learn voluntary self-regulation at the individual brain level [20–27] using dynamic fMRI activation analysis in real time. These novel approaches may lead to improved and more distinct sensory perception [28,29], faster motor responses [30], and therapeutic effects in chronic pain [24], Parkinson's disease [23], tinnitus [31], and depression [32]. Voluntary regulation of functional interactions between different brain regions [21] has also been proposed; however, it was based on static functional connectivity. Some recent work utilized GPUs for fMRI data analysis [6,33], and more recently for functional connectivity analysis [34], but none applied them to DFC analysis of fMRI data. Enabling near-real-time analysis of the brain's dynamic functional connectivity was the primary motivation for utilizing the parallel computing capabilities of OpenMP-supported multi-core CPUs and CUDA-supported GPUs to speed up the DFC analysis of a multi-subject fMRI dataset in this paper. In this work, 234 fMRI scan sessions from 83 de-identified subjects were used, since each subject had 2 or 3 separate fMRI scan sessions. We refer to an fMRI scan session as an "fMRI scan", or simply a "scan", throughout this text, as is common in the literature [35].
An identical set of 30 brain networks and their corresponding signature fMRI time-series, which we refer to as "time-courses" throughout the text, were extracted for each subject. These networks and their time-courses were refined by a spatial group independent component analysis (ICA), which we simply refer to as "ICA" throughout this text. ICA is a data-driven blind source-separation technique which provides independently "behaving" brain networks common to all the subjects within a group [36–39]. While a parallel implementation of DFC analysis based on OpenMP provides a practical way to reduce the computation time on multicore computers, CUDA-enabled GPU cards open the door to further acceleration of these intensive computations.

For this very reason, we developed parallel DFC algorithms based on the proposed thread-based and block-based approaches. The thread-based approach treats a single time-course pair (one DFC computation) as the unit of work and is implemented in both the OpenMP and CUDA programming environments. The block-based computing method is specific to GPU hardware and is proposed to better utilize the GPU architecture under certain analysis parameters. The experimental results were obtained using a multicore host with an AMD(R) FX 8320 processor and an NVidia(R) GeForce GTX 660 GPU, both widely available in the current market.

The article is organized as follows. In Section 2, the details of the DFC algorithm, the computing application interfaces, the algorithmic elements of OpenMP and CUDA, the implementation, and the proposed thread- and block-based approaches are explained. In Section 3, the details of the hardware we used and the experimental results are provided; the outcomes of our designs are also discussed there. Section 4 concludes the work with a summary and future directions.

2. Methods

In this work, a window-based DFC algorithm is accelerated with a multicore implementation using OpenMP and a GPU implementation using NVidia(R) CUDA. Before presenting the two newly proposed designs on the two different hardware platforms, we introduce DFC analysis and how it can be run as a sequential algorithm.

2.1. DFC analysis

Before addressing the parallel implementations, we first describe the underlying basics of the DFC algorithm. As shown in Fig. 1, DFC analysis involves windowed correlation operations over the time-courses of two brain signals. For each subject, pairs of regions are determined and their fMRI time-courses are stored in two one-dimensional arrays. There are two required parameters: the window size and the window step size. The correlation coefficients of the resulting DFC, i.e. the DFC time-courses, which are themselves time-series data, are calculated for the two windowed fMRI time-courses (TC1 and TC2 in Fig. 1) at every window step. Throughout this paper, the calculation of a DFC time-course (the dynamic correlation coefficients) from two fMRI time-courses is considered the smallest calculation unit of the DFC analysis.

Fig. 1. Visualization of dynamic functional connectivity (DFC) analysis on two brain regions.
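To make this smallest calculation unit concrete, the following minimal C++ sketch computes the windowed Pearson correlation at every window position for a pair of time-courses (cf. Fig. 1). The function and variable names (windowedCorrelation, dfcTimeCourse) are illustrative assumptions rather than the authors' code.

```cpp
#include <cmath>
#include <vector>

// Pearson correlation of two length-W windowed segments x and y.
float windowedCorrelation(const float* x, const float* y, int W)
{
    float sx = 0.f, sy = 0.f, sxx = 0.f, syy = 0.f, sxy = 0.f;
    for (int t = 0; t < W; ++t) {
        sx += x[t]; sy += y[t];
        sxx += x[t] * x[t]; syy += y[t] * y[t]; sxy += x[t] * y[t];
    }
    const float num = W * sxy - sx * sy;
    const float den = std::sqrt((W * sxx - sx * sx) * (W * syy - sy * sy));
    return (den > 0.f) ? num / den : 0.f;
}

// DFC time-course between two fMRI time-courses of length M,
// using window size W and step size Z (cf. Fig. 1).
std::vector<float> dfcTimeCourse(const std::vector<float>& tc1,
                                 const std::vector<float>& tc2,
                                 int W, int Z)
{
    std::vector<float> dfc;
    const int M = static_cast<int>(tc1.size());
    for (int s = 0; s + W <= M; s += Z)   // slide the window by the step size Z
        dfc.push_back(windowedCorrelation(tc1.data() + s, tc2.data() + s, W));
    return dfc;
}
```

Algorithm 1, described below, essentially iterates this unit over every scan and every paired combination of region time-courses.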


Once the brain signals from specific regions are extracted, DFC analysis investigates the interactions among the regions using sliding-window correlation. The outcome of a DFC analysis is another time-course that summarizes the dynamics between the regions of interest. In the end, this method helps identify pairs of brain areas that activate or deactivate together (either positively or negatively correlated). The DFC steps between two brain regions are summarized in Fig. 1. The implementation steps of the DFC analysis are shown in Algorithm 1. It starts with fMRI data in the form of the data structure shown in Fig. 2, together with the additional parameters: window length and window step size. fMRI data are usually considered 4D data, where the spatial x, y, and z values are repeated over time, t; however, the computational design benefits from a one-dimensional representation of the data, and index operations are used during the conversions. Before computing the DFC, all possible paired combinations of regions, corresponding to the paired combinations of the different region time-courses, are determined.

Fig. 2. Formatted fMRI data structure.

While the outer for-loop in Algorithm 1 sweeps the scans to be analyzed, the inner for-loop sweeps all paired combinations of time-courses for the selected scan. Once a pair of time-courses is selected, the DFC process starts for the specified window length and step size. Initially, the window position is set to the first item of the time signal by setting its index to zero. Then, a while-loop sweeps the windowed time-courses using a sliding window, which is simply shifted by the specified step size. In the pseudocode, k denotes an index variable for the arrays; it is also used for averaging the calculated correlations, if necessary. More computational power is demanded in the DFC calculations as the number of subjects or the number of time-courses increases. The ordered DFC coefficients resulting from two brain regions (networks) are the output of the analysis. The two DFC parameters (window and step sizes) and the data size for each subject are other factors that may affect the computational load significantly: a smaller window size or a smaller step size means a larger number of windowing operations. Therefore, the running time of the analysis grows with the fMRI data size and the choice of DFC parameters.

2.2. OpenMP implementation

OpenMP applications use libraries that implement the directive clauses and support a shared-memory model. OpenMP provides a very handy set of APIs for utilizing the computational power of multicore computers for a parallel implementation of the DFC analysis [1,2]. OpenMP directives make the parallelization process considerably more practical, particularly for loop-based sequential algorithms. The pseudocode given by Algorithm 1 can be implemented in a data-parallel style by iterating the loops independently. Choosing the correct loops to parallelize is important for preventing load imbalance and, as a result, for increasing efficiency. In practice, load imbalance can also be caused by the way the operating system manages threads. When the chunk size (the number of unit operations to be run by all threads) is not a multiple of the number of cores, which is the usual case, workload inequalities arise among the thread allocations. This causes some threads to join the main thread late, which results in a significant delay in the execution because of barrier synchronization. The degree of load imbalance can be determined using Eq. (1) [40], where P is the number of processors, and W_max and W_avg denote the maximum chunk size and the average workload distributed to the threads, respectively. By inspection of the equation, when the average workload is an integer, all workloads are equal and theoretically no load imbalance occurs.

\[
I = \frac{W_{\max} - W_{avg}}{W_{avg}}, \qquad
W_{avg} = \frac{1}{P}\sum_{i=1}^{P} W_i, \qquad
W_{\max} = \max\{W_1, \ldots, W_P\}
\tag{1}
\]
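As a quick worked check of Eq. (1) for the chunk counts used later in Table 1, assuming the chunks are dealt out as evenly as possible over P = 8 cores:

\[
W_{avg} = \tfrac{234}{8} = 29.25,\; W_{\max} = 30,\;
I = \tfrac{30 - 29.25}{29.25} \approx 2.56 \times 10^{-2};
\qquad
W_{avg} = \tfrac{234 \times 435}{8} = 12{,}723.75,\; W_{\max} = 12{,}724,\;
I \approx 1.96 \times 10^{-5}.
\]

These values match the first two columns of Table 1 below.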

Table 1 shows the effect of the parallel designs in terms of the load-imbalance calculation when one, two, or three loops are parallelized. A parallel design focusing only on the first for-loop may generate load imbalance, which may result in performance degradation because of uneven work distribution in the last round. Parallelizing both the first and the second loop increases the number of chunks to the product of the number of scans and the number of time-course combinations (nScans × nCombs). As a result, parallelizing the first and second loops together reduces the load imbalance from 2.56e−2 to 1.96e−5, an approximately 1300-fold reduction for the eight-core implementation. Therefore, a better load balance can be obtained by focusing on at least the first and second loops together. We followed this design, called the thread-based approach, as given by the parallel implementation in Algorithm 2: the two for-loops are reduced to one parallel for-loop, so that each iteration can be executed independently. In this method, the parallelizable work is defined in terms of a DFC operation, which means each thread computes a DFC operation for a selected time-course pair. A finer-grained parallel algorithm can be obtained by also splitting the while-loop, so that smaller amounts of work are performed per thread.


Table 1
Example load imbalances where nScans is 234, nCombs is 435, the step size is 2, the window size is 64, and the number of sliding windows is 126. The imbalance ratio is I = (W_max − W_avg)/W_avg for the 8-core processor case.

Parallelized loops               Number of chunks                    Chunks for our dataset   Imbalance ratio (8 cores)
First loop                       nScans                              234                      2.5641 × 10^−2
First and second loops           nScans × nCombs                     234 × 435                1.9648 × 10^−5
First, second, and third loops   nScans × nCombs × nSlidingWindows   234 × 435 × 126          3.1187 × 10^−7
However, this requires additional synchronization operations that introduce some overhead due to thread management. This finer-grained parallel algorithm is utilized for the implementation of our block-based approach on the GPU.

The parallel for-loop in Algorithm 2 can be implemented using the omp parallel for directive. This causes the main thread to instantiate a number of sub-threads, which, in the present case, is restricted to the number of CPU cores. In addition, the schedule(dynamic) clause is selected for dynamic load balancing of the parallel loop iterations among the available processors. Once all threads complete their assigned tasks, the partial results from each thread are collected and evaluated in the main thread. OpenMP clauses such as shared and private provide a mechanism to define variables shared among threads and variables encapsulated within threads, respectively. In this OpenMP-based parallel design, the selected time-courses are copied into each thread's local variables to improve data locality.
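A minimal sketch of this thread-based OpenMP design is given below; the data layout and the function and variable names (dfcOpenMP, windowedCorrelation from the earlier sequential sketch) are illustrative assumptions rather than the authors' exact code.

```cpp
#include <omp.h>
#include <utility>
#include <vector>

// Windowed Pearson correlation of two length-W segments; one possible
// definition is given in the sequential sketch in Section 2.1.
float windowedCorrelation(const float* x, const float* y, int W);

// Thread-based OpenMP parallelization: the two outer loops of Algorithm 1
// (scans and time-course pairs) are flattened into a single parallel loop,
// and schedule(dynamic) balances the load among the CPU cores.
void dfcOpenMP(const std::vector<float>& tcData,               // nScans*nTCS*M samples
               const std::vector<std::pair<int,int>>& pairs,   // nCombs region pairs
               std::vector<float>& dfcOut,                     // nScans*nCombs*nWins results
               int nScans, int nTCS, int M, int W, int Z, int nWins)
{
    const int nCombs = static_cast<int>(pairs.size());

    #pragma omp parallel for schedule(dynamic)
    for (int idx = 0; idx < nScans * nCombs; ++idx) {
        const int scan = idx / nCombs;
        const int pair = idx % nCombs;
        // Thread-local copies of the selected pair improve data locality.
        const float* p1 = &tcData[(scan * nTCS + pairs[pair].first)  * M];
        const float* p2 = &tcData[(scan * nTCS + pairs[pair].second) * M];
        std::vector<float> a(p1, p1 + M), b(p2, p2 + M);

        for (int w = 0; w < nWins; ++w)                        // slide by the step size Z
            dfcOut[idx * nWins + w] =
                windowedCorrelation(a.data() + w * Z, b.data() + w * Z, W);
    }
}
```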

2.3. CUDA implementation

CUDA was the other computing approach we utilized in this study; it provides efficient ways to use CUDA-enabled GPU devices. In comparison with the CPU architecture, a GPU has a large number of computing units, which can be used to run concurrent lightweight threads. GPU parallelization is initiated by a host CPU, where the parallel code is placed within the serial code executed on the host (the CPU and main memory of the system).

Fig. 3. Block diagram of thread- and block-based approaches implemented within CUDA library for GPU parallelization.


The set of instructions to be run on the GPU is called a kernel. Before running a kernel on the GPU, a memory area is allocated according to the requirements of the algorithm; the necessary arguments are then transferred from the host to the GPU, and the kernel is executed on threads running in parallel. Finally, the processed data are transferred from the GPU back to the host for further operations. Within the CUDA solution of DFC, we implemented both approaches, thread- and block-based, as shown in Fig. 3a and b, respectively.

The thread-based design uses a time-course pair as the smallest computational unit for each GPU thread: each thread is assigned the windowed correlations of one pair of time-courses, as in the OpenMP implementation. Caching the time-courses in each thread improves performance significantly, since accessing global memory takes much longer than accessing thread-local memory. A pair of 316-element float arrays requires about 2.5 kB of memory; since CUDA compute capability 3.x provides 64 kB for local variables, there is more than enough space for the time-courses, and the algorithm can also run on longer time-courses. The implemented kernel structures are given by Algorithms 1 and 2 for the thread- and block-based computing approaches, respectively. The implementation of the thread-based kernel involves accessing the time-courses using the thread and block identification numbers, which are hardware (GPU) designated variables. Based on the indexes determined in the kernel, the time-courses are selected from the data array and transferred to local thread memory; the remaining operations are the same as in the OpenMP implementation. Finally, each thread writes its result to an output array in the global memory of the GPU device according to its thread id.
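The following CUDA sketch illustrates such a thread-based kernel (one thread per scan/pair combination, with the time-courses cached in per-thread local arrays). The kernel name, data layout, and launch configuration are assumptions for illustration, not the authors' exact kernel.

```cuda
// Hypothetical thread-based DFC kernel: one thread computes the full DFC
// time-course (all sliding windows) for one (scan, pair) combination.
__global__ void dfcThreadKernel(const float* tcData,   // nScans*nTCS*M time-course samples
                                const int*   pairA,    // first region index of each pair
                                const int*   pairB,    // second region index of each pair
                                float*       dfcOut,   // nScans*nCombs*nWins outputs
                                int nScans, int nTCS, int nCombs,
                                int M, int W, int Z, int nWins)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per (scan, pair)
    if (idx >= nScans * nCombs) return;

    const int scan = idx / nCombs;
    const int pair = idx % nCombs;
    const float* tc1 = tcData + (scan * nTCS + pairA[pair]) * M;
    const float* tc2 = tcData + (scan * nTCS + pairB[pair]) * M;

    // Cache the two time-courses in per-thread local memory (assumes M <= 316).
    float a[316], b[316];
    for (int t = 0; t < M; ++t) { a[t] = tc1[t]; b[t] = tc2[t]; }

    for (int w = 0; w < nWins; ++w) {                         // slide the window by Z
        const int s = w * Z;
        float sx = 0.f, sy = 0.f, sxx = 0.f, syy = 0.f, sxy = 0.f;
        for (int t = 0; t < W; ++t) {
            const float x = a[s + t], y = b[s + t];
            sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
        }
        const float num = W * sxy - sx * sy;
        const float den = sqrtf((W * sxx - sx * sx) * (W * syy - sy * sy));
        dfcOut[idx * nWins + w] = (den > 0.f) ? num / den : 0.f;
    }
}
```

A launch such as dfcThreadKernel<<<(nScans * nCombs + 255) / 256, 256>>>(...) would assign one thread to every scan/pair combination.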

The block-based approach (Fig. 3b), in contrast, allocates a whole block of threads for each DFC computation between a time-course pair and partitions the time-courses into window-size chunks; the threads are therefore expected to run more independently and effectively than in the thread-based approach. Before the sliding windows are evaluated, the selected time-course pair is first copied, in parallel by the block's threads, into shared memory, which provides faster access for the threads; all the threads within a block then each perform a sliding-window operation at the same time, using the shared memory instead of global memory. The main difference between the thread-based and block-based implementations is thus that the former employs one thread, whereas the latter launches a block of threads, to compute the DFC per time-course pair. In the block-based design, each sliding window of a DFC operation is computed separately, as shown by the kernel code in Algorithm 2. On the other hand, the block-based approach involves synchronization operations within each block, both for the parallel copy and for gathering the results from the threads of that block. After computing its set of sliding windows, each thread writes the correlation values to a shared output array according to its thread identification number, and at the end the results of each block are written to global memory. As shown in the experimental results section, each of these kernels has advantages over the other for some computational parameters; according to the window and step size, our parallel framework can switch between the two approaches to reach the most efficient combined approach.
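A corresponding block-based kernel might look like the following sketch: one thread block per (scan, pair) combination, the pair staged in shared memory, and one sliding window per thread. The names, data layout, and launch configuration are illustrative assumptions, and for simplicity the sketch writes each window's correlation directly to global memory instead of staging it in a shared output array first.

```cuda
// Hypothetical block-based DFC kernel: one thread block per (scan, pair),
// with the two time-courses staged in dynamically sized shared memory and
// each thread computing the correlation of one (or more) sliding windows.
__global__ void dfcBlockKernel(const float* tcData, const int* pairA, const int* pairB,
                               float* dfcOut, int nTCS, int nCombs,
                               int M, int W, int Z, int nWins)
{
    extern __shared__ float sh[];           // 2*M floats, size supplied at launch
    float* a = sh;                          // shared copy of TC1
    float* b = sh + M;                      // shared copy of TC2

    const int blk  = blockIdx.x;            // one block per (scan, pair)
    const int scan = blk / nCombs;
    const int pair = blk % nCombs;
    const float* tc1 = tcData + (scan * nTCS + pairA[pair]) * M;
    const float* tc2 = tcData + (scan * nTCS + pairB[pair]) * M;

    // Parallel copy of the selected pair from global to shared memory.
    for (int t = threadIdx.x; t < M; t += blockDim.x) { a[t] = tc1[t]; b[t] = tc2[t]; }
    __syncthreads();                        // wait until the whole pair is staged

    // Each thread handles one sliding window (strided if nWins > blockDim.x).
    for (int w = threadIdx.x; w < nWins; w += blockDim.x) {
        const int s = w * Z;
        float sx = 0.f, sy = 0.f, sxx = 0.f, syy = 0.f, sxy = 0.f;
        for (int t = 0; t < W; ++t) {
            const float x = a[s + t], y = b[s + t];
            sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
        }
        const float num = W * sxy - sx * sy;
        const float den = sqrtf((W * sxx - sx * sx) * (W * syy - sy * sy));
        dfcOut[blk * nWins + w] = (den > 0.f) ? num / den : 0.f;
    }
}
```

A launch of the form dfcBlockKernel<<<nScans * nCombs, 128, 2 * M * sizeof(float)>>>(...) would supply the dynamically sized shared buffer; with W = 64 and Z = 2 there are 126 windows per pair, so a 128-thread block covers them in a single pass.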


The time complexity of the DFC algorithm and of the newly proposed GPU-based implementations is analyzed as follows. Let nSubjects be the number of subjects, nTCS the number of brain regions for each subject, and M the number of data points in each fMRI time-course; the dataset-specific values for these variables are given in Table 2. Also let W and Z be the two DFC parameters, the window and step sizes, respectively. Then the sequential version of the whole DFC analysis takes

\[
E = \text{nSubjects} \cdot \frac{\text{nTCS}\,(\text{nTCS} - 1)}{2} \cdot \frac{M - W}{Z}
\tag{2}
\]

correlation calculations, and it has a complexity of O(nSubjects × nTCS²), which is primarily due to the pairwise combinations of the nTCS brain regions. On the other hand, the GPU-based parallel version we proposed completes the same number of correlation calculations in

\[
\frac{E \times k}{P} + r
\tag{3}
\]

amount of time, where k is a scaling factor due to the frequency of the computation units, r is the overhead due to data transfers, and P is the number of streaming processors in the GPU. Therefore, the parallel version of DFC has a complexity of O(nSubjects × nTCS × log(nTCS)), since P is practically in the vicinity of nTCS.
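As a worked check of Eq. (2), using the values listed in Table 2 for one configuration (the 234 scans counted as nSubjects, nTCS = 30, M = 316 time points, W = 64, Z = 2):

\[
E = 234 \cdot \frac{30 \times 29}{2} \cdot \frac{316 - 64}{2}
  = 234 \times 435 \times 126 = 12{,}825{,}540,
\]

which matches the corresponding total number of sliding windows reported in Table 2.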

3. Results

The fMRI data were collected on a 3-T Philips MR scanner equipped with an eight-channel RF coil. Thirty-six 3 mm thick functional slice locations were obtained in the axial plane parallel to the AC–PC line. Functional BOLD signals were acquired using a single-shot gradient-echo planar imaging (EPI) sequence with a matrix of 64 × 64, an echo time (TE) of 25 ms, a repetition time (TR) of 1.7 s, a flip angle (FA) of 70°, and a field of view (FOV) of 208 mm × 208 mm. The in-plane resolution was 3.25 mm × 3.25 mm. Motion correction and slice-time correction were applied as preprocessing steps using the FSL software (www.fmrib.ox.ac.uk/fsl). The images were then spatially normalized to the MNI space, resulting in a voxel size of 2 mm × 2 mm × 2 mm. The study consisted of 83 human subjects, recruited from Administration Medical Center, Homeward Bound, Inc. and Nexus Recovery Center in Dallas, Texas, USA. The dataset used in the experiments consists of 234 de-identified fMRI scans from these 83 subjects. We isolated 30 brain regions and their corresponding time-courses by a group spatial independent component analysis (ICA), a data-driven blind source-separation technique that provides independently behaving brain networks [36–39]. Each time-course in the fMRI data contains 316 time points. As given in Table 2, the number of combinations used in the experiments is 435 for each scan, and therefore the length of the correlation array is 101,790. The analyses were repeated for window sizes of 32 and 64 and step sizes of 1, 2, 4, and 8, while the other analysis parameters remained fixed. Although DFC analysis is not yet used in clinical studies, many researchers have studied this tool with various step sizes, for example 8 in [15] and 15–60 in [41]. The shorter the step size, the more dynamic the DFC analysis [42], and ideally one would like to set it to 1; however, this comes at the expense of computational intensity. Our results show that the best speed-ups are obtained for the most intensive DFC, i.e. with a step size of 1, which makes the case that using GPUs for DFC is most beneficial when one wants the most dynamic DFC analysis (see Table 4, Figs. 4 and 6).

3.1. Experimental setup

The algorithms on the host were coded in Microsoft Visual C++ using the OpenMP and CUDA libraries [1,2,8]. The sequential DFC algorithm was implemented in C++ and compiled within Microsoft Visual Studio 2012 to obtain the reference runtimes.


Speed-up measurements were conducted on a host machine with an AMD(R) FX 8320 eight-core CPU and a GeForce GTX 660 GPU; the same host was also used for the multicore (OpenMP) measurements. Key specifications of the tested processor and GPU are summarized in Table 3. The memory size was 2 × 4 GB running in unganged mode, and the operating system was 64-bit Windows 7. The memory requirements of the sequential and parallel implementations are the same and are quite low with respect to the full fMRI dataset, because only one time-course is kept for each brain region. For 30 brain regions from 234 scans, where each time-course has 316 time points stored as 4-byte floats, the whole dataset kept in host or GPU memory occupies 8.46 MB; for 60 brain regions, this size doubles to 16.92 MB. As can be seen from Table 3, both datasets are small enough to be kept in and processed through host memory or the global memory of the GPU.

Table 2
Properties of the experimental data and the DFC settings.

                                                Window size 32                              Window size 64
Number of subjects (nSubjects)                  234                                         234
Number of time-courses per subject (nTCS)       30                                          30
Number of time-course combinations (nCombs)     nTCS × (nTCS − 1)/2 = (30 × 29)/2 = 435     435
Total number of combinations                    nCombs × nSubjects = 234 × 435 = 101,790    101,790
Step size: total number of sliding windows      1: 28,908,360                               1: 25,651,080
                                                2: 14,454,180                               2: 12,825,540
                                                4: 7,227,090                                4: 6,412,770
                                                8: 3,562,650                                8: 3,155,490

Table 3
Hardware specifications of the CPU and GPU used in the experiments.

Device                Feature                           Value
Multicore processor   Frequency                         3.5 GHz (Turbo mode disabled)
                      Number of cores                   8
                      Level 1 cache size                4 × 64 kB shared instruction caches; 8 × 16 kB data caches
                      Level 2 cache size                4 × 2 MB shared exclusive caches
                      Level 3 cache size                8 MB shared cache
GPU                   Multiprocessor count              5
                      CUDA cores                        960
                      Memory bus width                  192 bits
                      Warp size                         32
                      Total global memory               2,147,483,648 bytes
                      Shared memory per block           49,152 bytes
                      Max threads per block             1024
                      Clock rate                        1033 MHz
                      Max thread dimensions             1024 × 1024 × 64
                      Max threads per multiprocessor    2048
                      Memory clock rate                 3004 MHz

3.2. Performance measurements

The success of each parallel implementation was evaluated by comparing its performance with that of the sequential equivalent. For this purpose, the sequential runtimes measured on the host processor with single-core execution were used as the reference, and the multi-core and GPU-based computation durations for the same test parameters were compared with these single-core timings. The GPU performance evaluations also include the time spent on memory allocation on the device and on data transfers between the device and the host. The computation times (running durations in seconds) of the sequential, thread-based, and block-based implementations are shown in Table 4 for the test cases where the window size is 32 or 64, the step size is 1, 2, 4, or 8, the number of scans is 234 or 468, and the number of time-courses per scan is 30 or 60.

To obtain 468 scans, we added 234 synthetically generated fMRI scans in order to analyze the scaling effect. The measured durations for sequential execution varied from 0.6 to 83 s. Increasing the number of scans from 234 to 468 or the number of time-courses from 30 to 60 approximately doubles the sequential execution durations. However, increasing the window length from 32 to 64 or doubling the step size does not double the computation time in the parallel implementations. Table 4 shows that, for the sequential DFC implementation, increasing the window size from 32 to 64 results in a 2.08× to 2.19× increase in computation time, whereas doubling the step size reduces the computation time by a factor of 0.50 to 0.52. In addition, the actual output (i.e. the output DFC signal) obtained from the parallel executions was compared with that produced by the sequential implementation by checking the mean absolute error of the produced DFC outputs; the mean absolute error was zero, so there were no differences in the actual DFC outputs.

Based on the measured durations in Table 4, Fig. 4 demonstrates the success of the implemented methods in terms of speed-up, i.e. the ratio of sequential execution time to parallel execution time, for the different experimental parameters. The performance of the thread-based solution deteriorates more regularly than that of the block-based approach as the parameters change. The speed-ups of OpenMP are not affected much by changes in the step size; the OpenMP execution durations show approximately the same behavior for all tests. This stagnant performance is mainly due to the size of the experimental data, which, as pointed out above, is at most approximately 16.92 MB. During the runs, all cores were observed to operate at 100% utilization, mainly due to the low communication-to-computation ratio, which also keeps the page faults at a low level. When the results summarized in Table 4 are closely investigated, the speed-up changes only slightly, between 7.0× and 7.2× for the window size of 32 and between 7.4× and 7.7× for the window size of 64. The main reason for the slight performance decrease for the window size of 32 compared with the window size of 64 is the change in the intensity of the operations: computing the correlations for the sliding windows involves averaging, squaring, and dot-product operations whose computational intensities depend directly on the window size. Therefore, a reduction in the window size also reduces the ratio of computation time to communication time, which amplifies the overhead of memory operations and thread management, decreasing the overall performance. Another factor affecting the computational performance is the type of CPU used in the experiments. The AMD(R) FX 8320 has eight cores with four L2 caches, each of which is shared between two of the cores; the six-core AMD(R) Phenom 1055T, for example, has a separate L2 cache for each core. When the algorithm is executed on the latter processor, the best speed-up is approximately 5.96×. An investigation of the parallel efficiency, using the ratio of speed-up to the number of cores, shows that while the parallel efficiency of the eight-core processor with shared L2 caches is 0.962, the parallel efficiency of the six-core processor is 0.993.


Fig. 4. Speed-up vs. step size for the three implementations, for various numbers of scans (nScans), numbers of time-courses (nTCs), and window sizes (W. Size).

For the GPU with the block-based approach, reducing the window size from 64 to 32 considerably reduces the performance, because this approach is very sensitive to memory operations. The performance of the thread-based and block-based approaches reaches up to 33× and 79×, respectively, for the window size of 32. As the step size is increased, the performance of the thread-based approach falls to 18.5×, while that of the block-based approach falls to 13.5×, mainly because of the synchronization overhead (Fig. 4a, c, e and g). For the window size of 64, the results show a considerable increase in speed-up.


Table 4
Running times in seconds for the sequential and parallel implementations under various parameters. (In the original table the fastest time in each row is shown in bold; across all rows this is the CUDA block-based time for step sizes 1, 2 and 4, and the CUDA thread-based time for step size 8.)

Num. of     Num. brain   Window   Step   Sequential      OpenMP     CUDA           CUDA
fMRI scans  regions      size     size   (single core)   (8-core)   thread-based   block-based
234         30           32       1      4.677           0.651      0.146          0.070
234         30           32       2      2.352           0.328      0.084          0.037
234         30           32       4      1.188           0.167      0.048          0.031
234         30           32       8      0.601           0.085      0.032          0.044
234         30           64       1      10.167          1.321      0.246          0.073
234         30           64       2      5.305           0.671      0.133          0.045
234         30           64       4      2.625           0.339      0.073          0.043
234         30           64       8      1.294           0.169      0.045          0.061
234         60           32       1      19.032          2.650      0.577          0.240
234         60           32       2      9.572           1.333      0.314          0.134
234         60           32       4      4.840           0.688      0.179          0.110
234         60           32       8      2.446           0.345      0.113          0.152
234         60           64       1      40.734          5.459      0.990          0.263
234         60           64       2      20.540          2.708      0.521          0.152
234         60           64       4      10.332          1.369      0.282          0.163
234         60           64       8      5.154           0.696      0.162          0.209
468         30           32       1      9.353           1.302      0.287          0.124
468         30           32       2      4.710           0.657      0.157          0.071
468         30           32       4      2.384           0.333      0.096          0.065
468         30           32       8      1.199           0.180      0.059          0.081
468         30           64       1      20.416          2.662      0.489          0.135
468         30           64       2      10.025          1.338      0.259          0.087
468         30           64       4      5.252           0.700      0.143          0.092
468         30           64       8      2.561           0.340      0.084          0.116
468         60           32       1      38.065          5.294      1.144          0.478
468         60           32       2      19.144          2.666      0.617          0.260
468         60           32       4      9.671           1.352      0.348          0.218
468         60           32       8      4.883           0.715      0.217          0.301
468         60           64       1      82.344          10.824     1.969          0.523
468         60           64       2      41.780          5.415      1.034          0.302
468         60           64       4      20.974          2.723      0.554          0.321
468         60           64       8      10.595          1.359      0.314          0.414
While the thread-based approach demonstrates speed-ups from 28× to 42×, the block-based measurements vary from 21× to 157×. This is mainly due to the high number of correlation operations per sliding window, which boosts the performance of the thread-based approach significantly. Once the number of scans is increased from 234 to 468 or the number of time-courses is increased from 30 to 60, the speed-up characteristics remain approximately the same for both the OpenMP- and CUDA-based implementations, because increasing the number of time-course pairs does not significantly change the ratio of computation time to communication time.

In Fig. 5, the comparison of speed-up vs. the number of sliding windows shows the effect of the computational load on each parallel implementation. The performance of the OpenMP implementation does not change much, as expected. For the window size of 32 (Fig. 5a), the CUDA thread-based approach shows better performance for small numbers of sliding-window operations; while the block-based approach performs poorly for relatively small numbers of window operations, its performance exceeds that of the thread-based approach as the load increases. The same holds for the window size of 64 (Fig. 5b): the block-based approach does not perform well for small numbers of window operations, but it outperforms the thread-based implementation for the last three parameter settings. In general, the block-based approach performs better than the thread-based approach, especially for larger numbers of window operations; although it involves many more synchronization operations, it takes advantage of the block-shared memory. Considering the strengths of both approaches for different parameters, either the thread-based or the block-based kernel can be executed depending on the window size and step size. Fig. 6a shows the combined response of both kernels, where the results vary from 18.5× to 66× for the window size of 32 and from 28× to 139× for the window size of 64. As shown in Fig. 6b, the combined utilization of the kernels provides a 2.6× to 9.3× speed-up for the window size of 32 and a 3.7× to 18× speed-up for the window size of 64 over the multicore implementation with OpenMP. In the CUDA implementation, this hybrid method, in which the thread- or block-based approach is chosen dynamically, takes less time because the better method is selected at run time.


Fig. 5. Speed-up measurements vs. the number of sliding window operations with respect to three parallel implementations.


Fig. 6. Combined speed-up for the GPU, and speed-up over the multicore implementation.

4. Conclusions

The computational demands of DFC analyses of fMRI data from multi-subject experiments are significantly high. Based on our implementations of the presented DFC algorithm, current multicore processor and GPU technologies were shown to accelerate this computationally intensive algorithm. The GPU architecture, which exploits the parallelism in the proposed implementation of the DFC algorithm, made the DFC analysis more practical, as it enabled real-time or near-real-time implementations. Our results showed that the OpenMP and CUDA implementations significantly reduced the DFC computation time, and that utilizing the parallel computing capabilities of GPU devices is very useful for reducing the computation time of DFC analysis of fMRI data. The results also indicated that increasing the intensity of the computations significantly boosts the speed-up and efficiency of the GPU device over the CPU, which amplifies the benefit of using GPUs for large-scale multi-subject and multi-center fMRI studies. The speed-ups achieved in our study were obtained with an NVidia(R) GeForce GTX 660, which is a widely available medium-end GPU device; further performance improvements can be obtained using high-end NVidia(R) GPU devices. For future studies, we plan to utilize GPU clusters to reduce the execution time below one second for larger analyses, which could make real-time DFC analysis feasible and allow neurofeedback based on DFC analysis of fMRI data. Our future work will also involve seeking algorithmic improvements for reducing the execution time, such as using a pseudo-correlation method [43] instead of the regular correlation function. Furthermore, the computation time of the correlation function can be improved in the GPU's local memory in both the thread- and block-based approaches that we utilize. It will also be important to investigate the effect of reducing the data precision in order to speed up the data transfers from host to GPU, which are the bottleneck for the whole computation. Lastly, scalability as well as energy efficiency should be investigated from a big-data perspective, comparing GPU and Hadoop [44] platforms.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Dr. Sakoglu and Dr. Mete supervised the project. Mr. Esquivel conducted the preprocessing and analysis of the fMRI dataset. Dr. Akgün implemented the OpenMP and CUDA functions. Mr. Esquivel contributed the initial setup of the implementations. Dr. Adinoff provided the MRI dataset and motivated the study. Comparison and testing were done by Dr. Akgün. All authors wrote, revised, and approved the manuscript.

Acknowledgements

This work was supported by NIDA 1R03DA031292-01, the Office of Graduate Studies at Texas A&M University at Commerce, the Turkish Higher Education Institution (YÖK), and the Research Fund of Sakarya University.

References

[1] A. OpenMP. OpenMP Application Program Interface, vol. 3.1; 2008, May.
[2] Chapman B, Jost G, Van Der Pas R. Using OpenMP: portable shared memory parallel programming, vol. 10. MIT Press; 2008.
[3] Nvidia C. Compute unified device architecture programming guide; 2014.
[4] Sanders J, Kandrot E. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional; 2010.
[5] Gembris D, Neeb M, Gipp M, Kugel A, Männer R. Correlation analysis on GPU systems using NVIDIA's CUDA. J Real-Time Image Process 2011;6:275–80.
[6] Eklund A, Dufort P, Villani M, LaConte S. BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs. Front Neuroinform 2014;8.
[7] Hernández M, Guerrero GD, Cecilia JM, García JM, Inuggi A, Jbabdi S, et al. Accelerating fibre orientation estimation from diffusion weighted magnetic resonance imaging using GPUs. PLOS ONE 2013;8:e61892.
[8] Hoang RV, Tanna D, Bray LCJ, Dascalu SM, Harris Jr FC. A novel CPU/GPU simulation environment for large-scale biologically realistic neural modeling. Front Neuroinform 2013;7.
[9] Huang T-Y, Tang Y-W, Ju S-Y. Accelerating image registration of MRI by GPU-based parallel computation. Magn Reson Imaging 2011;29:712–6.
[10] Jeong I-K, Hong M-G, Hahn K-S, Choi J, Kim C. Performance study of satellite image processing on graphics processors unit using CUDA. Korean J Remote Sens 2012;28.
[11] Fallani FDV, Astolfi L, Cincotti F, Mattia D, Marciani MG, Salinari S, et al. Cortical functional connectivity networks in normal and spinal cord injured patients: evaluation by graph analysis. Hum Brain Mapp 2007;28:1334–46.
[12] Sakoğlu Ü, Calhoun VD. Dynamic windowing reveals task-modulation of functional connectivity in schizophrenia patients vs healthy controls. In: Proceedings of the 17th annual meeting of the International Society for Magnetic Resonance in Medicine (ISMRM). 2009.
[13] Sakoğlu Ü, Calhoun VD. Temporal dynamics of functional network connectivity at rest: a comparison of schizophrenia patients and healthy controls. In: Proceedings of the 15th annual meeting of the Organization for Human Brain Mapping (OHBM). 2009. p. S169.
[14] Sakoğlu Ü, Michael AM, Calhoun VD. Classification of schizophrenia patients vs healthy controls with dynamic functional network connectivity. In: Proceedings of the 15th annual meeting of the Organization for Human Brain Mapping (OHBM). 2009. p. S57.
[15] Sakoğlu U, Pearlson GD, Kiehl KA, Wang YM, Michael AM, Calhoun VD. A method for evaluating dynamic functional network connectivity and task-modulation: application to schizophrenia. MAGMA 2010;23(December):351–66.
[16] Chang C, Glover GH. Time-frequency dynamics of resting-state brain connectivity measured with fMRI. Neuroimage 2010;50(March):81–98.
[17] Akgun D, Sakoglu U, Mete M, Esquivel J, Adinoff B. GPU-accelerated dynamic functional connectivity analysis for functional MRI data using OpenCL. In: 2014 IEEE international conference on electro/information technology (EIT). 2014.
[18] Sakoğlu U, Upadhyay J, Chin CL, Chandran P, Baker SJ, Cole TB, et al. Paradigm shift in translational neuroimaging of CNS disorders. Biochem Pharmacol 2011;81(June):1374–87.
[19] Upadhyay J, Baker SJ, Chandran P, Miller L, Lee Y, Marek GJ, et al. Default-mode-like network activation in awake rodents. PLoS ONE 2011;6:e27839.
[20] Johnson KA, Hartwell K, LeMatty T, Borckardt J, Morgan PS, Govindarajan K, et al. Intermittent "real-time" fMRI feedback is superior to continuous presentation for a motor imagery task: a pilot study. J Neuroimaging 2012;22:58–66.
[21] Johnston S, Linden D, Healy D, Goebel R, Habes I, Boehm S. Upregulation of emotion areas through neurofeedback with a focus on positive mood. Cogn Affect Behav Neurosci 2011;11:44–51.
[22] Hampson M, Scheinost D, Qiu M, Bhawnani J, Lacadie CM, Leckman JF, et al. Biofeedback of real-time functional magnetic resonance imaging data from the supplementary motor area reduces functional connectivity to subcortical regions. Brain Connect 2011;1:91–8.
[23] Haller S, Badoud S, Nguyen D, Garibotto V, Lovblad KO, Burkhard PR. Individual detection of patients with Parkinson disease using support vector machine analysis of diffusion tensor imaging data: initial results. Am J Neuroradiol 2012;33(December):2123–8.
[24] Christopher deCharms R, Maeda F, Glover GH, Ludlow D, Pauly JM, Soneji D, et al. Control over brain activation and pain learned by using real-time functional MRI. Proc Natl Acad Sci U S A 2005;102:18626–31.
[25] Chiew M, LaConte SM, Graham SJ. Investigation of fMRI neurofeedback of differential primary motor cortex activity using kinesthetic motor imagery. NeuroImage 2012;61:21–31.
[26] Caria A, Veit R, Sitaram R, Lotze M, Weiskopf N, Grodd W, et al. Regulation of anterior insular cortex activity using real-time fMRI. Neuroimage 2007;35:1238–46.
[27] LaConte SM, Peltier SJ, Hu XP. Real-time fMRI using brain-state classification. Hum Brain Mapp 2007;28:1033–44.
[28] Shibata K, Watanabe T, Sasaki Y, Kawato M. Perceptual learning incepted by decoded fMRI neurofeedback without stimulus presentation. Science 2011;334:1413–5.
[29] Scharnowski F, Hutton C, Josephs O, Weiskopf N, Rees G. Improving visual perception through neurofeedback. J Neurosci 2012;32:17830–41.
[30] Bray S, Shimojo S, O'Doherty JP. Direct instrumental conditioning of neural activity using functional magnetic resonance imaging-derived reward feedback. J Neurosci 2007;27:7498–507.
[31] Haller S, Birbaumer N, Veit R. Real-time fMRI feedback training may improve chronic tinnitus. Eur Radiol 2010;20(March):696–703.
[32] Linden DE, Habes I, Johnston SJ, Linden S, Tatineni R, Subramanian L, et al. Real-time self-regulation of emotion networks in patients with depression. PLoS ONE 2012;7:e38115.
[33] Eklund A, Andersson M, Knutsson H. fMRI analysis on the GPU - possibilities and challenges. Comput Methods Progr Biomed 2012;105(February):145–61.
[34] Eklund A, Friman O, Andersson M, Knutsson H. A GPU accelerated interactive interface for exploratory functional connectivity analysis of fMRI data. In: 18th IEEE international conference on image processing (ICIP). 2011. p. 1589–92.
[35] Michael AM, King MD, Ehrlich S, Pearlson G, White T, Holt DJ, et al. A data-driven investigation of gray matter-function correlations in schizophrenia during a working memory task. Front Hum Neurosci 2011;5:71.
[36] McKeown MJ, Makeig S, Brown GG, Jung TP, Kindermann SS, Bell AJ, et al. Analysis of fMRI data by blind separation into independent spatial components. Hum Brain Mapp 1998;6:160–88.
[37] Calhoun VD, Adali T, Pearlson GD, Pekar JJ. A method for making group inferences from functional MRI data using independent component analysis. Hum Brain Mapp 2001;14(November):140–51.
[38] Calhoun VD, Adali T, McGinty VB, Pekar JJ, Watson TD, Pearlson GD. fMRI activation in a visual-perception task: network of areas detected using the general linear model and independent components analysis. Neuroimage 2001;14(November):1080–8.
[39] Calhoun VD, Adali T, Pearlson GD, Pekar JJ. Spatial and temporal independent component analysis of functional MRI data containing a pair of task-related waveforms. Hum Brain Mapp 2001;13(May):43–53.
[40] Deng Y. Applied parallel computing. World Scientific; 2012.
[41] Hutchison RM, Womelsdorf T, Gati JS, Everling S, Menon RS. Resting-state networks show dynamic functional connectivity in awake humans and anesthetized macaques. Hum Brain Mapp 2013;34(September):2154–77.
[42] Hutchison RM, Womelsdorf T, Allen EA, Bandettini PA, Calhoun VD, Corbetta M, et al. Dynamic functional connectivity: promise, issues, and interpretations. NeuroImage 2013;80:360–78.
[43] Zheng Y, Xiaowei C, Mingquan L, Zhenming F, Jun Y. Pseudo-correlation-function-based unambiguous tracking technique for sine-BOC signals. IEEE Trans Aerospace Electr Syst 2010;46:1782–96.
[44] Shvachko K, Hairong K, Radia S, Chansler R. The Hadoop distributed file system. In: 2010 IEEE 26th symposium on mass storage systems and technologies (MSST). 2010. p. 1–10.