GPU-based parallel group ICA for functional magnetic resonance data


Yanshan Jing, Weiming Zeng∗, Nizhuan Wang, Tianlong Ren, Yingchao Shi, Jun Yin, Qi Xu

Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China

Article info

Article history: Received 1 August 2014; Received in revised form 16 December 2014; Accepted 2 February 2015

Keywords: fMRI; GPGPU; Parallel computing; Group ICA

Abstract

The goal of our study is to develop a fast parallel implementation of group independent component analysis (ICA) for functional magnetic resonance imaging (fMRI) data using graphics processing units (GPUs). Though ICA has become a standard method for identifying brain functional connectivity in fMRI data, it is computationally intensive, especially for group data analysis. GPUs, with their high parallel computation power and low cost, are now used for general-purpose computing and can contribute significantly to fMRI data analysis. In this study, a parallel group ICA (PGICA) on GPU, mainly consisting of GPU-based PCA using SVD and Infomax-ICA, is presented. In comparison to the serial group ICA, the proposed method demonstrated both a significant speedup of 6–11 times and comparable accuracy of the detected functional networks in our experiments. The proposed method is expected to enable real-time post-processing for fMRI data analysis.

© 2015 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Blood oxygen level-dependent (BOLD) contrast functional magnetic resonance imaging (fMRI) is a non-invasive technique for detecting human brain activity. Nowadays, fMRI has become increasingly used in the neuroimaging community and even in the neurosurgery field [1]. However, fMRI data processing is usually time-consuming, since the data are four-dimensional (three spatial dimensions and one temporal dimension). Recently, some GPU-based parallel algorithms have been applied to fMRI analysis, where the data naturally fit GPU computation, because these algorithms mostly perform exactly the same calculations for each voxel. The first GPU application for fMRI was correlation analysis, which calculates the correlation coefficients between the time course of each voxel and a reference time series [2]. In 2012, several fMRI preprocessing steps, such as detrending, motion correction and spatial smoothing, were implemented on the GPU [3], achieving more than 200 times acceleration compared with the statistical parametric mapping (SPM) software on the CPU (http://www.fil.ion.ucl.ac.uk/spm/). Meanwhile, Eklund et al. demonstrated clear GPU acceleration of the general linear model (GLM) and canonical correlation analysis (CCA) for fMRI analysis [3]. Furthermore, Ferreira da Silva [4] developed an R/CUDA implementation of a parallel Markov chain Monte Carlo (MCMC) simulation-based Bayesian multilevel model for fMRI data analysis, obtaining about a 60-fold speedup compared with the CPU. Recently, Scheinost et al. [5] proposed a GPU-accelerated motion correction algorithm and modular system that can satisfy real-time processing requirements.

∗ Corresponding author at: College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China. Tel.: +86 021 38282873; fax: +86 021 38282800. E-mail address: [email protected] (W. Zeng).

http://dx.doi.org/10.1016/j.cmpb.2015.02.002


Some basic medical signal analysis algorithms have also been optimized using GPUs, such as fast box-counting [6] and fast ray-tracing of human eye optics [7]. Unfortunately, none of the above methods can meet the requirement of real-time, group-level processing of multi-subject data, especially for group functional network detection, owing to the very large amount of fMRI data. In this study, we developed a GPU-based group-level functional network detection method for multiple subjects, named PGICA (parallel group independent component analysis), which mainly consists of GPU-based PCA using SVD and Infomax-ICA. The general-purpose graphics processing unit (GPGPU) is a massively parallel processor, introduced in 2006 [8], that has already been applied to a great variety of research and engineering fields, achieving significant speed improvements over optimized CPU implementations. For more information on GPU computing, please refer to Appendix 1. The remainder of this paper is organized as follows: an introduction to and analysis of ICA and group ICA is presented first, followed by the implementation of PGICA and the testing system. Finally, results and analysis are presented, together with interpretations and conclusions concerning the advantages and limitations of this new computing model.

2. Material and methods

2.1. Independent component analysis

Independent component analysis is used to study the spatio-temporal structure of the signal in fMRI, and it can discover both spatially and temporally independent components. ICA was introduced by Comon [9] in 1994. The idea of ICA can be seen as an extension of PCA, in which the linear transformation minimizes the statistical dependence among its components. The following statistical model is assumed [10]:

$$x = My + v, \qquad (1)$$

where x, y, and v are random vectors with real or complex values, zero mean and finite covariance, M is a rectangular matrix with at most as many columns as rows, and the vector y has statistically independent components. The problem posed by ICA can be summarized as follows: given N samples of the vector x, an estimate of the matrix M and of the corresponding samples of the vector y is desired. However, because of the presence of the noise v, it is in general impossible to reconstruct the exact vector y. Since the noise v is assumed here to have an unknown distribution, it can only be treated as a nuisance, and ICA cannot be devised for the noisy model above. Instead, it is assumed that

$$x = AS, \qquad (2)$$

where S is a random vector whose components are maximally statistically independent.

The Group ICA of fMRI Toolbox (GIFT, http://mialab.mrn.org/software/gift/) uses the Infomax algorithm [11] to estimate the independent components. Infomax-ICA is based on a neural network with three columns of neurons, representing: (1) the original data (X); (2) the registered data (r); and (3) the approximated independent data (Y). The columns of neurons are combined linearly by the matrix W. Fig. 1 shows the structure of the neural network.

Fig. 1 – Structure of the neural network for solving ICA, where X is the original data, r is the registered data, Y is the approximated independent data, and W is the un-mixing matrix.

The principle used by Infomax-ICA is to maximize the mutual information that the output Y of the neural network contains about its input X. This is defined as

$$I(Y, X) = H(Y) - H(Y|X), \qquad (3)$$

where H(Y) is the entropy of Y and H(Y|X) is the entropy of Y conditioned on X. In fact, H(Y) is the differential entropy of Y with respect to some reference, such as the noise level or the accuracy of the discretization of the variables in X and Y. Thus, only the gradient of information-theoretic quantities with respect to some parameter w is considered [12]. Eq. (3) can then be differentiated with respect to the parameter w as

$$\frac{\partial}{\partial w} I(X, Y) = \frac{\partial}{\partial w} H(Y), \qquad (4)$$

because H(Y|X) does not depend on w. In Eq. (1), H(Y|X) = H(v). No matter what the level of the additive noise is, maximization of the mutual information is equivalent to maximization of the output entropy, because (∂/∂w)H(v) = 0. As a result, for any invertible continuous deterministic mapping, the mutual information between inputs and outputs can be maximized simply by maximizing the entropy of the outputs. The natural (or relative) gradient considerably simplifies the optimization: the natural gradient principle [12] is based on the geometric structure of the parameter space and is related to the relative gradient principle [13] that ICA uses. With this approach, Infomax-ICA uses the following gradient iteration to estimate the matrix W:

$$W \propto W - \tanh\left(\frac{Wx}{2}\right)(Wx)^{T}\, W. \qquad (5)$$


In summary, Infomax-ICA consists of the following steps:

(1) U = W × perm(x), where perm is a random permutation;
(2) Y = −tanh(U/2);
(3) YU = Y × U^T;
(4) YU = YU + I;
(5) W = lrate × YU × W + W;

where lrate is the learning rate for each iteration of the method, generally lower than 10^−2. A higher lrate may increase computing speed, but a bad choice can destroy convergence, so a selection criterion for lrate is needed.
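As a concrete reading of these five steps, the following minimal sketch (ours, not the authors' released code; the function and buffer names are hypothetical) performs one Infomax iteration on the GPU, using cuBLAS for the two matrix products and small ad-hoc kernels for the element-wise steps. Matrices are column-major, as in cuBLAS and CULA:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Step (2): Y = -tanh(U / 2), element-wise.
__global__ void neg_tanh_half(const float* U, float* Y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) Y[i] = -tanhf(0.5f * U[i]);
}

// Step (4): add the identity to the Nic x Nic matrix YU (column-major).
__global__ void add_identity(float* YU, int nic) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nic) YU[i * nic + i] += 1.0f;
}

// One Infomax iteration on an Nic x B permuted data block d_xperm.
void infomax_step(cublasHandle_t h, float* d_W, const float* d_xperm,
                  float* d_U, float* d_Y, float* d_YU, float* d_dW,
                  int nic, int B, float lrate) {
    const float one = 1.0f, zero = 0.0f;
    const int tpb = 256;
    // Step (1): U = W * perm(x), an Nic x B product.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, nic, B, nic,
                &one, d_W, nic, d_xperm, nic, &zero, d_U, nic);
    int n = nic * B;
    neg_tanh_half<<<(n + tpb - 1) / tpb, tpb>>>(d_U, d_Y, n);
    // Step (3): YU = Y * U^T, an Nic x Nic product.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_T, nic, nic, B,
                &one, d_Y, nic, d_U, nic, &zero, d_YU, nic);
    add_identity<<<(nic + tpb - 1) / tpb, tpb>>>(d_YU, nic);
    // Step (5): W = lrate * YU * W + W. A GEMM cannot alias W as both
    // input and output, so compute dW = lrate * YU * W, then W += dW.
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, nic, nic, nic,
                &lrate, d_YU, nic, d_W, nic, &zero, d_dW, nic);
    cublasSaxpy(h, nic * nic, &one, d_dW, 1, d_W, 1);
}
```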

2.2. Theory of group ICA

Though ICA on single-subject fMRI data has achieved great success, researchers seek to discover the similarities and differences of brain functional connectivity within one group or across multiple groups of subjects. To investigate the commonalities across a group of subjects, on the one hand, scientists have developed group data analysis methods such as the classical multi-session temporally concatenated group ICA (GICA) method of Calhoun et al. [14], in which single-subject data are concatenated in the temporal domain. On the other hand, group analysis methods based on single-subject ICA decompositions, such as FENICA (fully exploratory network ICA) [15] and Fast-FENICA [16], have obtained good results. These methods rest on the assumption that the spatial distribution of the same functional signal changes is similar across the subjects of a group. Since GICA was published, it has become a widely used and classical method for group analysis of fMRI data.

The procedure of GICA can be described as follows. First, PCA is applied to the group data for dimensionality reduction, at the single-subject level and the group level separately [17] (PCA is a dimensionality-reduction tool; an introduction is given in Appendix 2). The single-subject-level PCA reduces the temporal dimension of each subject's fMRI data to a certain level. The reduced data of all subjects are then concatenated across subjects in the temporal direction, and another (group-level) PCA is performed to further reduce the temporal dimension of the group data. Once the mixing matrix is estimated, the component maps for each subject can be computed by projecting the single-subject data onto the inverse of the partition of the mixing matrix that corresponds to that subject. In the end, this provides subject-specific time courses and images that can be used to make within-group and between-group inferences [17]. The whole procedure is shown in Fig. 2. Mathematically, the procedure is as follows: let Xi = Fi^−1 Yi be the Npcs-by-V reduced data matrix from subject i, where Yi is the K-by-V data matrix (containing the preprocessed and spatially normalized data), Fi^−1 is the Npcs-by-K reducing matrix (determined by the PCA decomposition), V is the number of voxels, K is the number of fMRI time points, and Npcs is the size of the time dimension following reduction [17]. Group-level PCA reduces the concatenated reduced data matrix to Nic (the number of components to be estimated).

Fig. 2 – The flow chart of group ICA of fMRI data. PCA is performed on each subject; the reduced data are then concatenated in the temporal domain and a second PCA is performed on the concatenated data; finally, ICA is performed on the reduced concatenated data.

The Nic-by-V reduced, concatenated matrix for the Nsubs subjects is

$$X = G^{-1}\begin{bmatrix} F_1^{-1} Y_1 \\ \vdots \\ F_{N_{subs}}^{-1} Y_{N_{subs}} \end{bmatrix}, \qquad (6)$$

where G^−1 is an Nic-by-(Nsubs × Npcs) reducing matrix and is multiplied on the right by the (Nsubs × Npcs)-by-V concatenated data matrix for the Nsubs subjects. Following ICA estimation, we can write X = \hat{A}\hat{S}, where \hat{A} is the Nic-by-Nic mixing matrix and \hat{S} is the Nic-by-V component map. Substituting this expression for X into Eq. (6) and multiplying both sides by G, we have





$$G \hat{A} \hat{S} = \begin{bmatrix} F_1^{-1} Y_1 \\ \vdots \\ F_{N_{subs}}^{-1} Y_{N_{subs}} \end{bmatrix}. \qquad (7)$$

Partitioning the matrix G\hat{A} by subject provides the following expression:

$$\begin{bmatrix} G_1 \\ \vdots \\ G_{N_{subs}} \end{bmatrix} \hat{A} \hat{S} = \begin{bmatrix} F_1^{-1} Y_1 \\ \vdots \\ F_{N_{subs}}^{-1} Y_{N_{subs}} \end{bmatrix}. \qquad (8)$$

We then write the equation for subject i by working only with the elements in partition i of the above matrices, such that

$$G_i \hat{A} \hat{S}_i = F_i^{-1} Y_i. \qquad (9)$$

The matrix \hat{S}_i in Eq. (9) contains the single-subject maps for subject i and is calculated from

$$\hat{S}_i = (G_i \hat{A})^{-1} F_i^{-1} Y_i. \qquad (10)$$


We now multiply both sides of Eq. (9) by F_i and get

$$Y_i \approx F_i G_i \hat{A} \hat{S}_i, \qquad (11)$$

which provides the ICA decomposition of the data from subject i, contained in the matrix Y_i. The Nic-by-V matrix \hat{S}_i contains the Nic source maps, and the K-by-Nic matrix F_i G_i \hat{A} is the single-subject mixing matrix, containing the time course for each of the Nic components [17].
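As an illustration of the back-reconstruction in Eq. (10), the following minimal sketch (ours; the paper relies on CULA for the equivalent LAPACK routines, whereas this sketch uses cuBLAS and cuSOLVER, and all names are hypothetical) computes one subject's map on the GPU. It assumes the square case Npcs = Nic, so that G_i\hat{A} can be inverted via an LU solve; with Npcs > Nic, a least-squares solve would take its place. Error checking is omitted:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Back-reconstruct one subject's map, Eq. (10): S_i = (G_i * A)^(-1) * X_i,
// where X_i = F_i^{-1} Y_i is the subject's reduced data (Nic x V here,
// assuming Npcs = Nic). All matrices are column-major.
void backproject_subject(cublasHandle_t cb, cusolverDnHandle_t cs,
                         const float* d_Gi, const float* d_A,
                         float* d_Xi,  // in: X_i; out: overwritten with S_i
                         int nic, int V) {
    const float one = 1.0f, zero = 0.0f;
    float* d_M;
    cudaMalloc(&d_M, sizeof(float) * nic * nic);
    // M = G_i * A
    cublasSgemm(cb, CUBLAS_OP_N, CUBLAS_OP_N, nic, nic, nic,
                &one, d_Gi, nic, d_A, nic, &zero, d_M, nic);
    // LU-factorize M, then solve M * S_i = X_i in place of X_i.
    int lwork = 0;
    cusolverDnSgetrf_bufferSize(cs, nic, nic, d_M, nic, &lwork);
    float* d_work; int *d_ipiv, *d_info;
    cudaMalloc(&d_work, sizeof(float) * lwork);
    cudaMalloc(&d_ipiv, sizeof(int) * nic);
    cudaMalloc(&d_info, sizeof(int));
    cusolverDnSgetrf(cs, nic, nic, d_M, nic, d_work, d_ipiv, d_info);
    cusolverDnSgetrs(cs, CUBLAS_OP_N, nic, V, d_M, nic, d_ipiv,
                     d_Xi, nic, d_info);
    cudaFree(d_M); cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
}
```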

2.3. Time-consumption analysis of group ICA

To evaluate the time consumption of GICA, a test was performed involving 11 subjects' fMRI datasets, the dimensions of each subject's data being 91 × 109 × 91 × 123. In the GICA, Npcs and Nic were set to 50 and 30, respectively. A serial implementation of GICA on the CPU using Matlab [18] was run. In fact, the serial Matlab GICA is itself multithreaded, because most matrix operations in Matlab, including SVD, have been multithreaded since MATLAB 7.14 (R2012a). Profiling the serial CPU version showed that Infomax-ICA took 57.8% of the computing time, the group-level PCA on the concatenated data took 20.1%, and PCA in total consumed 41.6%. According to Amdahl's law [19], the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program, so the ICA and PCA steps should be our targets for optimization on the GPU.
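To make this concrete, a back-of-the-envelope application of Amdahl's law to the profile above (our illustration) treats the ICA and PCA steps as the parallelizable fraction p and the rest as serial:

$$S(s) = \frac{1}{(1 - p) + p/s}, \qquad p \approx 0.578 + 0.416 = 0.994, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p} \approx 167.$$

The ≈167× figure is only an idealized upper bound: it ignores host–device transfers and any serial overhead inside the optimized steps, which is one reason the overall speedups observed in Section 3 (6–11 times) are far below it.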

2.4. Implementation of PGICA

The PGICA mainly consists of GPU-based PCA and ICA. As mentioned above, the most time-consuming part of PCA is the SVD, so a GPU SVD solver is needed; we use the CULA [20] Sgesvd function in our PCA program. CULA is an implementation of the Linear Algebra Package (LAPACK) [21] interface for CUDA-enabled GPUs; it also provides the other matrix operations needed in our implementation, such as matrix–matrix multiplication, matrix–vector multiplication, and matrix transpose. The remaining operations are handled by ad-hoc functions written against the CUDA runtime API [22]. We took advantage of CULA's device-interface functions, which require the user to allocate and populate device memory, to minimize data transfers between host and device, so the performance of the device functions is better than that of the standard-interface functions. All matrix operations in the PCA by SVD are executed on the GPU. Most kernels in the PCA are launched with 256 threads per block, and the grid size is set to (N + ThreadsPerBlock − 1)/ThreadsPerBlock blocks, where N is the number of data points. The performance analysis is shown in Fig. 3; clearly, more speedup is obtained with more time points. As for Infomax-ICA, Federico Raimondo et al. introduced a package called CUDAICA [23] for EEG data analysis, which obtained a 25-fold acceleration compared with the CPU-based ICA implementation binica [24]. Following the analysis of GICA for fMRI, and combining the GPU-based PCA with CUDAICA, we developed a GPU-based parallel version of GICA for fMRI, named PGICA in this study.
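As an illustration of the GPU PCA step above, a minimal sketch (ours, not the released code) of the per-subject reduction follows. It assumes CULA's LAPACK-style device interface for culaDeviceSgesvd (as in CULA Dense R16a; the header name and exact signature should be checked against the CULA documentation), a K × V column-major data matrix that has already been centered, and K ≤ V:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cula_lapack_device.h>   // CULA dense device interface (assumed header name)

// Reduce a K x V centered data matrix to Npcs x V via truncated SVD/PCA.
void pca_reduce(cublasHandle_t cb, const float* d_X, int K, int V,
                int npcs, float* d_Xred) {
    // gesvd destroys its input, so factor a scratch copy of X.
    float *d_Xc, *d_S, *d_U;
    cudaMalloc(&d_Xc, sizeof(float) * K * V);
    cudaMemcpy(d_Xc, d_X, sizeof(float) * K * V, cudaMemcpyDeviceToDevice);
    cudaMalloc(&d_S, sizeof(float) * K);       // singular values (K <= V)
    cudaMalloc(&d_U, sizeof(float) * K * K);   // left singular vectors

    // Economy SVD: X = U * S * V^T; right singular vectors are not needed.
    culaDeviceSgesvd('S', 'N', K, V, d_Xc, K, d_S, d_U, K, NULL, 1);

    // Project onto the leading Npcs components: Xred = U(:,1:npcs)^T * X.
    const float one = 1.0f, zero = 0.0f;
    cublasSgemm(cb, CUBLAS_OP_T, CUBLAS_OP_N, npcs, V, K,
                &one, d_U, K, d_X, K, &zero, d_Xred, npcs);

    cudaFree(d_Xc); cudaFree(d_S); cudaFree(d_U);
}
```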

Fig. 3 – PCA executing-time comparison between GPU and CPU. The CPU executing time is much longer than the GPU executing time.

The optimizations in PGICA can be divided into two levels: (1) intra-device optimizations of memory access and kernel execution within particular matrix operations, and (2) inter-device optimization, letting the GPU and CPU execute their respective tasks at the same time. At the intra-device level, the vector indexed by the permutation was copied to shared memory to reduce memory-access latency; the calculations of U and Y were combined into one kernel, significantly reducing the number of kernel launches; YU = YU + I was computed by an ad-hoc matrix-sum kernel (one operand being the identity matrix); and the kernel execution parameters (the dimensions of the thread grid and blocks) were tuned to the data points and time points of the fMRI data, for dynamic partitioning of SM resources. At the inter-device level, the random permutations were performed by generating an index vector of random indices on the CPU while GPU computation was ongoing. We applied CUDAICA to fMRI data, though some parameters needed to be adjusted to achieve better performance and make the best use of the GPU resources; details of the parameter choices are discussed in Section 4.
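Our reading of the inter-device optimization can be sketched as follows (hypothetical names; not the authors' kernels). Each GPU block stages its permutation index through shared memory, the data are gathered and processed in blocks of B columns (the 'ICA block' of Section 4), and the CPU generates the next epoch's permutation while the GPU is still busy:

```cuda
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>
#include <cuda_runtime.h>

// Gather B permuted columns of a rows x V column-major matrix:
// xperm(:, j) = x(:, perm[j]); each block stages its index in shared memory.
__global__ void gather_perm(const float* x, const int* perm,
                            float* xperm, int rows) {
    __shared__ int src;
    if (threadIdx.x == 0) src = perm[blockIdx.x];
    __syncthreads();
    for (int r = threadIdx.x; r < rows; r += blockDim.x)
        xperm[blockIdx.x * rows + r] = x[src * rows + r];
}

// Each epoch the GPU works through the permuted data in blocks of B
// columns while the CPU prepares the next epoch's permutation.
void run_epochs(const float* d_x, float* d_xperm, int rows, int V, int B,
                int n_epochs, cudaStream_t stream) {
    std::vector<int> perm(V);
    std::iota(perm.begin(), perm.end(), 0);
    std::mt19937 rng(1234);

    int *h_perm, *d_perm;
    cudaMallocHost(&h_perm, sizeof(int) * V);    // pinned => truly async copies
    cudaMalloc(&d_perm, sizeof(int) * V);

    std::shuffle(perm.begin(), perm.end(), rng);
    for (int e = 0; e < n_epochs; ++e) {
        std::copy(perm.begin(), perm.end(), h_perm);
        cudaMemcpyAsync(d_perm, h_perm, sizeof(int) * V,
                        cudaMemcpyHostToDevice, stream);
        for (int off = 0; off + B <= V; off += B) {
            gather_perm<<<B, 128, 0, stream>>>(d_x, d_perm + off, d_xperm, rows);
            // ... enqueue the Infomax step of Section 2.1 on d_xperm here ...
        }
        // Launches are asynchronous, so the CPU can already generate the
        // next epoch's permutation while the GPU is still computing.
        std::shuffle(perm.begin(), perm.end(), rng);
        cudaStreamSynchronize(stream);           // h_perm safe to rewrite
    }
    cudaFreeHost(h_perm);
    cudaFree(d_perm);
}
```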

2.5. Testing system

Performance tests of PGICA were done with an Intel Xeon E5-2650 2.0 GHz CPU and an Nvidia Tesla C2075 GPU [25]. The C2075 is a Fermi GPU with 448 CUDA cores, 6 GB of GDDR5 device memory, and a peak single-precision floating-point performance of 1.03 Tflops. OS: Ubuntu 12.04.3 LTS (Precise Pangolin). Matlab version: R2012b (UNIX). SDK: CUDA 5.5. IDE: Nsight Eclipse Edition. CULA: CULA Dense R16a for Linux. The precision type used is single-precision float.
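For reference, the device characteristics quoted above can be queried programmatically with a standard CUDA snippet (our example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("Device:             %s\n", prop.name);
    printf("Multiprocessors:    %d\n", prop.multiProcessorCount);
    printf("Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```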

3. Results and analysis

We tested PGICA on both resting-state and task-related fMRI data to assess its effectiveness and speedup. For the task-related data, three subjects' datasets were analyzed. The visual paradigm was OFF-ON-OFF-ON-OFF-ON-OFF in 20-s blocks, except that the last OFF block was 10 s. During the "ON" state, the visual stimulus was a radial blue/yellow checkerboard reversing at 7 Hz; during the "OFF" state, the participants were required to focus on a cross at the center of the screen. The resultant group maps of the visual functional network, the corresponding time course, and the time-cost comparison are shown in Fig. 4(a)–(c), respectively. The group map detected by GPU is almost the same as that detected by CPU. The detected time course matches the designed block accurately, with a correlation coefficient of 0.51, and the speedup is 8.26 times.

Fig. 4 – Result of task-related data testing. (a) Component maps of visual cortex; (b) time courses of the visual cortex; (c) execution times of the CPU and GPU implementations.

For the resting-state data, Nsubs ranged from 2 to 11. All tests generated acceptable results, and some detected functional networks are shown in Fig. 5. The resting-state functional networks are the products of 11 subjects' data, with Npcs = 50 and Nic = 30. Fig. 5a shows three typical detected functional network maps by GPU (upper) and CPU (lower): the left one is a predominantly occipital network (VIN); the middle one is a network involving auditory regions (AUN); the right one is the so-called default mode network (DMN), a midline network involving primarily the precuneus, prefrontal lobe, and parietal regions. The network maps produced by the GPU are almost the same as the corresponding maps produced by the CPU, but some minor differences can be observed. The reason may be the rounding modes: NVIDIA GPUs differ from the x86 architecture in that rounding modes are encoded within each floating-point instruction rather than set dynamically via a floating-point control word. Trap handlers for floating-point exceptions are not supported, and on the GPU there is no status flag to indicate when calculations have overflowed, underflowed, or involved inexact arithmetic [26]. From Fig. 5b we can conclude that the time cost of the CPU is significantly higher than that of the GPU, the GPU implementation obtaining a 6–11 times speedup. The slope of the CPU executing-time curve is also larger than that of the GPU curve, indicating that the GPU gains more of an advantage with more subjects.

4. Discussion and conclusion

Group ICA of fMRI is able to detect the similarities and differences of brain functional connectivity across one or several groups of subjects, but it is computationally intensive and time-consuming. To address this problem, and to make the best use of the GPU as a new and cost-effective computing device, we developed a GPU implementation of group ICA for fMRI (readers interested in the source code may contact the corresponding author). Compared with the serial CPU implementation, our PGICA obtained a 6–11 times speedup. This work can contribute to real-time group functional network detection, making neuroscience information processing or clinical treatment based on fMRI data more efficient and reliable.

Though CUDAICA is an off-the-shelf package, it is not easy to make good use of it. It was originally designed for EEG data, so we mapped fMRI concepts onto their EEG counterparts. In applying CUDAICA to fMRI data, two aspects should be emphasized. The first is the parameter 'ICA block', a key factor in the performance of CUDAICA: the value providing the best performance depends on the amount of device global memory and on the number of time points of the dataset being processed, and it was set heuristically in our experiments. The second aspect is that CUDAICA is a standalone application, which requires the input data to be stored in binary format. As most fMRI researchers use Matlab as their development environment, the output data of CUDAICA must then be read back into the Matlab workspace. These two steps involve a large number of read and write operations, resulting in a high time cost. We therefore used the MEX-function technique, an interface that lets Matlab call C/C++ code, to call CUDAICA directly; this also makes CUDAICA more user-friendly and efficient. Our PCA on the GPU uses the same MEX-function technique. It is noteworthy that matrices in Matlab and CULA are stored in column-major order, whereas C/C++ uses row-major order.
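A minimal sketch of that MEX gateway idea (ours; the file name pgica_mex.cu and the pass-through body are hypothetical) shows how GPU code can be invoked directly from the Matlab workspace; note that Matlab's column-major layout matches CULA/cuBLAS, so no transpose is needed on entry:

```cuda
#include "mex.h"
#include <cuda_runtime.h>

// prhs[0]: K x V single-precision data matrix; plhs[0]: result, same size.
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    if (nlhs > 1) mexErrMsgTxt("At most one output.");
    if (nrhs < 1 || !mxIsSingle(prhs[0]))
        mexErrMsgTxt("Expected one single-precision matrix.");
    mwSize K = mxGetM(prhs[0]), V = mxGetN(prhs[0]);
    const float* h_X = (const float*)mxGetData(prhs[0]);

    float* d_X;
    cudaMalloc(&d_X, sizeof(float) * K * V);
    cudaMemcpy(d_X, h_X, sizeof(float) * K * V, cudaMemcpyHostToDevice);

    // ... launch the PGICA kernels / CULA calls on d_X here ...

    plhs[0] = mxCreateNumericMatrix(K, V, mxSINGLE_CLASS, mxREAL);
    cudaMemcpy(mxGetData(plhs[0]), d_X, sizeof(float) * K * V,
               cudaMemcpyDeviceToHost);
    cudaFree(d_X);
}
```

Once compiled (for example, by linking nvcc-built object files through mex), such a gateway is callable from Matlab as Y = pgica_mex(single(X)), eliminating the binary-file round trip.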


Fig. 5 – (a) Three group component maps obtained using PGICA (upper) and, for comparison, using the CPU (lower). The sagittal, coronal, and axial views are displayed, with the coordinates (x, y, z) shown at the lower right corner; (b) GICA executing-time comparison between GPU and CPU.

Furthermore, some potential performance issues of PGICA should be discussed. On the one hand, the performance of GPU applications is closely tied to the device. The GPU we used for testing is not the most advanced: the latest-generation Nvidia GPU is the Kepler GK110, and applying it to our application would likely yield even more significant acceleration. On the other hand, one limit of our PGICA is the GPU device memory, which cannot be expanded like host memory. The most memory-consuming step is the group-level PCA on the temporally concatenated data. Since device memory is limited, the only workaround is to limit the number of subject datasets or to use fewer time points per subject, which degrades the effectiveness of PGICA. Thus, a distributed, multi-GPU group ICA implementation is needed for processing larger numbers of subjects' data; this is our future work to enhance the robustness and speed performance of PGICA.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant nos. 31170952 and 31470954), the Research Fund for the Doctoral Program of Higher Education of China (no. 20113121120004), the Shanghai Municipal Natural Science Foundation of China (Grant no. 13ZR1455600) and the Research Foundation from Shanghai Science and Technology project (no. 14590501700).

Appendix 1.

The most common GPGPU programming models are CUDA [1] (Compute Unified Device Architecture), OpenCL [2]


(Open Computing Language), and C++ AMP [3] (C++ Accelerated Massive Parallelism). Since CUDA has a variety of well-optimized APIs and libraries (providing ready-made functionality such as parallel reductions, CUBLAS, CURAND, etc.) and multi-language support (CUDA C/C++, Fortran, Python, R), we chose Nvidia CUDA for developing our PGICA, with the aim of accessing these existing well-optimized modules and functions and obtaining the fastest possible algorithm.

CUDA is a platform consisting of hardware and software that enables Nvidia GPUs to execute programs written in C/C++, Fortran, Python, and other high-level languages. The CUDA programming model [6] makes it easy for the programmer to launch thousands of GPU threads simultaneously. Each thread executes the same operation, called a kernel function, on different data. To manage and make full use of the GPU resources, it is only necessary to program in a supported traditional language (C/C++ or Fortran) using the CUDA API functions [28]. Besides the software model, CUDA provides a hardware specification that is present in all CUDA-capable GPUs. A CUDA device has several Multiprocessors (MPs), each including a number of Streaming Processors (SPs) with a SIMD (Single Instruction Multiple Data) architecture [6]. These processors are responsible for executing all the threads in parallel. The CUDA memory architecture is explicitly managed by the programmer: each device has a large amount of slow Global Memory (readable and writable by both the CPU and GPU), a Constant Memory that is read-only on the GPU side, and a special Texture Memory. Small, fast-access memory modules, called Shared Memory, are present in each MP, and the SPs also have access to a set of 32-bit registers. In addition, since the arrival of the GF100 architecture [5], a real cache hierarchy has been introduced.

All threads are organized into several levels. Individual threads are grouped into 1-, 2- or 3-dimensional thread-blocks, and thread-blocks are grouped into a one-, two- or three-dimensional grid. Each thread-block has a limit of 1024 threads (on GF100 GPUs), and each grid supports up to 65,535 blocks in each dimension; therefore, thousands of threads can be launched and executed simultaneously on the GPU. Each thread-block is assigned to an MP, so its shared memory is only accessible to threads within that thread-block. Resident threads are executed simultaneously on the SPs in groups of 32, called warps. A warp is the minimum scheduling unit: the device executes an instruction for all threads within a warp before launching the next instruction. This is known as Single Instruction Multiple Threads (SIMT). Threads within a warp physically execute the same instruction in each cycle, so the best performance is achieved when the number of threads in a thread-block is a multiple of the warp size. This organization, together with the CUDA scheduler, results in a high level of scalability, because thread-blocks are automatically assigned to idle MPs, independently of the number of cores existing in the device; the same compiled program can therefore be executed on different, heterogeneous CUDA-capable devices. CUDA offers special barrier instructions to establish synchronization points between threads in the same block. However, global memory is needed to share data among thread-blocks or to set a


synchronization point for threads located in different thread-blocks.
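As a concrete illustration of this execution model (our example, not from the paper), the following self-contained program launches a SAXPY kernel over a one-dimensional grid, with the block size chosen as a multiple of the 32-thread warp:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles one element: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) y[i] = a * x[i] + y[i];              // guard against overrun
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);
    float *h_x = new float[N], *h_y = new float[N];
    for (int i = 0; i < N; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes); cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int tpb = 256;                                  // 8 warps per block
    saxpy<<<(N + tpb - 1) / tpb, tpb>>>(N, 3.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);                  // prints 5.000000
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```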

[1] Nvidia, CUDA Programming Guide, 2008.
[2] A. Munshi, The OpenCL Specification, Khronos OpenCL Working Group, 2009, 1: 11–15.
[3] K. Gregory, A. Miller, C++ AMP: Accelerated Massive Parallelism with Microsoft® Visual C++®, O'Reilly Media, Inc., 2012.
[4] C.M. Wittenbrink, E. Kilgariff, A. Prabhu, Fermi GF100 GPU architecture, IEEE Micro 31 (2) (2011) 50–59.

Appendix 2.

Principal component analysis (PCA) [1] is a standard tool in modern data analysis in fields ranging from neuroscience to computer vision, since it is a simple, non-parametric method for extracting relevant information from confusing datasets. The targets of PCA are: (1) extracting the most important information from the dataset; (2) compressing the dataset by keeping only its important information; (3) simplifying the description of the dataset; and (4) analyzing the structure of the observations and the variables. In this paper, PCA is used as a means of reducing the dimension of the fMRI data in the temporal domain, so as to reduce the amount of computation. PCA is based on the eigenvalue decomposition (EVD) of the covariance or correlation matrix, or on the SVD of the original data matrix. Compared with EVD, SVD is the more robust, reliable, and precise method, and it does not require computing the covariance or correlation matrix of the input. SVD decomposes an arbitrary matrix A (n × p) into three matrices:

$$A = U S V^{T}, \qquad (B.1)$$

where U (n × n) and V^T (p × p) are orthogonal, normalized matrices, i.e. U^T U = I and V^T V = I, and S (n × p) is a diagonal matrix with the singular values in decreasing order. The columns of U are the left singular vectors, and the rows of V^T are the right singular vectors. Computing the SVD involves finding the eigenvalues and eigenvectors of A^T A and A A^T. The effect of the SVD is to compress the information contained in A into the leading singular vectors, which are mutually orthogonal and whose importance rapidly decreases after the first few columns or rows; the importance of each singular vector is given by the corresponding singular value in the matrix S [2]. Thus the SVD gives us the eigenvectors and eigenvalues that we need for PCA [3].
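The connection between Eq. (B.1) and PCA can be written out explicitly (a standard derivation, assuming the columns of A have been mean-centered): the sample covariance matrix diagonalizes in the right singular vectors,

$$\frac{1}{n-1} A^{T} A = \frac{1}{n-1} V S^{T} S V^{T} = V \Lambda V^{T}, \qquad \lambda_i = \frac{s_i^{2}}{n-1},$$

so the columns of V are the principal directions, the scaled squared singular values λi are the variances of the principal components, and no covariance matrix ever needs to be formed explicitly.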

[1] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometr. Intell. Lab. Syst. 2 (1) (1987) 37–52.
[2] J. Shlens, A Tutorial on Principal Component Analysis, Systems Neurobiology Laboratory, University of California at San Diego, 2005.
[3] P. Praus, SVD-based principal component analysis of geochemical data, Cent. Eur. J. Chem. 3 (4) (2005) 731–741.


References

[1] C. Yao, Y. Liu, J. Yao, et al., Augment low-field intra-operative MRI with preoperative MRI using a hybrid non-rigid registration method, Comput. Methods Programs Biomed. 117 (2) (2014) 114–124.
[2] D. Gembris, M. Neeb, M. Gipp, et al., Correlation analysis on GPU systems using NVIDIA's CUDA, J. Real-Time Image Process. 6 (4) (2011) 275–280.
[3] A. Eklund, M. Andersson, H. Knutsson, fMRI analysis on the GPU—possibilities and challenges, Comput. Methods Programs Biomed. 105 (2) (2012) 145–161.
[4] A.R.F. da Silva, cudaBayesreg: parallel implementation of a Bayesian multilevel model for fMRI data analysis, J. Stat. Softw. 44 (i04) (2011).
[5] D. Scheinost, M. Hampson, M. Qiu, et al., A graphics processing unit accelerated motion correction algorithm and modular system for real-time fMRI, Neuroinformatics (2013) 1–10.
[6] J. Jiménez, J. Ruiz de Miras, Fast box-counting algorithm on GPU, Comput. Methods Programs Biomed. 108 (3) (2012) 1229–1242.
[7] Q. Wei, S. Patkar, D.K. Pai, Fast ray-tracing of human eye optics on Graphics Processing Units, Comput. Methods Programs Biomed. 114 (3) (2014) 302–314.
[8] J. Nickolls, W.J. Dally, The GPU computing era, IEEE Micro 30 (2) (2010) 56–69.
[9] P. Comon, Independent component analysis: a new concept? Signal Process. 36 (3) (1994) 287–314.
[10] A. Hyvärinen, E. Oja, Independent component analysis: algorithms and applications, Neural Netw. 13 (4) (2000) 411–430.
[11] A.J. Bell, T.J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Comput. 7 (6) (1995) 1129–1159.
[12] S. Amari, A. Cichocki, H.H. Yang, A new learning algorithm for blind signal separation, Adv. Neural Inf. Process. Syst. (1996) 757–763.
[13] J.F. Cardoso, B.H. Laheld, Equivariant adaptive source separation, IEEE Trans. Signal Process. 44 (12) (1996) 3017–3030.
[14] V.D. Calhoun, T. Adali, G.D. Pearlson, et al., A method for making group inferences from functional MRI data using independent component analysis, Hum. Brain Mapp. 14 (3) (2001) 140–151.
[15] V. Schöpf, C.H. Kasess, R. Lanzenberger, et al., Fully exploratory network ICA (FENICA) on resting-state fMRI data, J. Neurosci. Methods 192 (2) (2010) 207–213.
[16] N. Wang, W. Zeng, L. Chen, A fast-FENICA method on resting state fMRI data, J. Neurosci. Methods 209 (1) (2012) 1–12.
[17] V.D. Calhoun, J. Liu, T. Adalı, A review of group ICA for fMRI data and ICA for joint inference of imaging, genetic, and ERP data, NeuroImage 45 (1) (2009) S163–S172.
[18] MATLAB®, The MathWorks Inc., http://www.mathworks.cn/products/matlab/
[19] J.L. Gustafson, Reevaluating Amdahl's law, Commun. ACM 31 (5) (1988) 532–533.
[20] J.R. Humphrey, D.K. Price, K.E. Spagnoli, et al., CULA: hybrid GPU accelerated linear algebra routines, in: SPIE Defense, Security, and Sensing, Int. Soc. Opt. Photonics (2010) 770502.
[21] SIAM, LAPACK Users' Guide, 1999.
[22] CUDA C Programming Guide, NVIDIA Corporation, July 2012.
[23] F. Raimondo, J.E. Kamienkowski, M. Sigman, et al., CUDAICA: GPU optimization of Infomax-ICA EEG analysis, Comput. Intell. Neurosci. (2012) 2.
[24] Binary compiled version of runica(), 2013, http://sccn.ucsd.edu/eeglab/binica/
[25] Tesla C2075 GPU – Nvidia, http://www.nvidia.cn/object/workstation-solutions-tesla-cn.html
[26] N. Whitehead, A. Fit-Florea, Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs, Tech. rep., NVIDIA, 2012, https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf (accessed 28.04.12).