Available online at www.sciencedirect.com
Magnetic Resonance Imaging 29 (2011) 712 – 716
Technical notes
Accelerating image registration of MRI by GPU-based parallel computation☆,☆☆ Teng-Yi Huang⁎, Yu-Wei Tang, Shiun-Ying Ju Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C. Received 26 October 2010; revised 26 January 2011; accepted 20 February 2011
Abstract Automatic image registration for MRI applications generally requires many iteration loops and is, therefore, a time-consuming task. This drawback prolongs data analysis and delays the workflow of clinical routines. Recent advances in the massively parallel computation of graphic processing units (GPUs) may be a solution to this problem. This study proposes a method to accelerate registration calculations, especially for the popular statistical parametric mapping (SPM) system. This study reimplemented the image registration of SPM system to achieve an approximately 14-fold increase in speed in registering single-modality intrasubject data sets. The proposed program is fully compatible with SPM, allowing the user to simply replace the original image registration library of SPM to gain the benefit of the computation power provided by commodity graphic processors. In conclusion, the GPU computation method is a practical way to accelerate automatic image registration. This technology promises a broader scope of application in the field of image registration. © 2011 Elsevier Inc. All rights reserved. Keywords: Image registration; GPU; Parallel computing
1. Introduction Image registration is an important topic in MRI applications, such as registering slice positions for longitudinal studies and motion corrections for functional MRI (fMRI) studies. The main concept of image registration is to estimate a mapping between a pair of images. This mapping involves a set of parameters that spatially transform the “source” image to match the “reference” image. In MRI applications, registration methods can be roughly classified into two categories: single-modality and multimodality. Examples of singlemodality MRI registration include reducing motion artifacts in fMRI based on the alignment of all dynamic images [1–6] and the automatic slice positioning (ASP) system of MRI [7,8]. An example of multimodality MRI
☆ Supported in part by the National Science Council under grant NSC99-2628-E-011-003. ☆☆ Presented in part at the Joint Annual Meeting ISMRM-ESMRMB, Stockholm, Sweden, 2010. ⁎ Corresponding author. E-mail address:
[email protected] (T.-Y. Huang).
0730-725X/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.mri.2011.02.027
registration is registering the functional images acquired by T2⁎-weighted echo planar imaging and T1-weighted structure images. Automatic image registration algorithms generally consist of three blocks: transformation, similarity measurement and optimization. First, the transformation, which is either collinear (e.g., rigid, affine) or deformable (e.g., Bspline), adjusts the coordinates of the source image. Next, the similarity measurement step, which includes the mutual information (MI) [9,10] or correlation coefficient (CC) method, measures the similarity of the image pair. Finally, an optimizer (e.g., Powell's method) [11] efficiently searches for the best transformation parameters using the similarity measurements as the cost function. Researchers have developed many different image registration algorithms. However, these optimization algorithms generally require many iterations to find all the parameters of the selected transformation. For example, the rigid transformation of three-dimensional (3D) images has six parameters, including displacements and rotations. In each iteration in the optimization loop, the transformation described by a 4×4 matrix [i.e., Eq. (1)] is applied to vectors representing all 3D voxels. Therefore, automatic image registration is generally a time-consuming problem
T.-Y. Huang et al. / Magnetic Resonance Imaging 29 (2011) 712–716
even with state-of-the-art workstations. This drawback delays the postprocessing of medical images (e.g., fMRI analysis), narrows the application scope of this technology and even slows the workflow of clinical routines. For example, the ASP [7,8] provided in modern MRI machines utilizes image registration to calculate the slice parameters for a specific brain orientation (e.g., the anterior–posterior commissure line of the brain). Subsequent MR exams have to be held on until finishing the registration. Modern graphics processing units (GPUs) equipped with parallel computing are capable of accelerating scientific computing. The modern GPU, which was originally designed for 3D games, is also useful in general computing tasks. GPU parallel computing is a practical way to accelerate the image processing time for MRI applications, including real-time applications [12,13] and postprocessing [14–16]. This study attempted to implement a GPU-accelerated image registration compatible to the statistical parametric mapping (SPM) system provided by the Wellcome Trust Centre for Neuroimaging at London's Global University. The SPM system is popular worldwide for its neuroimage applications and especially fMRI.
2. Materials and methods This study attempts to implement an accelerated version of “spm_coreg” of SPM, which performs
713
automatic image registration based on the information theory [9]. This program can obtain the voxel-to-voxel affine matrix that transforms the source volume to optimally match the reference volume. The cost function that the program uses to find the optimal image registration is the MI [17] or entropy CC (ECC) [18]. To calculate MI or ECC, a 256×256-bin joint histogram that is a discrete representation of the joint probability distribution [10] of the two image volumes is generated by a dynamic-link library “spm_hist2.dll.” 2.1. Implementation of GPU program compatible to SPM This study reimplemented the library (i.e., spm_hist2. dll) with Compute Unified Device Architecture (CUDA) programming model, a developer tool provided by NVIDIA (Santa Clara, CA) to execute computations on GPU hardware. The SPM system works in the MATLAB (Mathworks, Natick, MA) environment. Hence, it is necessary to design a communication mechanism between CPU-based MATLAB to GPUbased CUDA. The two volumes were transferred from the MATLAB to CUDA environment through the “mexfunction,” a built-in function in MATLAB. Second, the image volumes and auxiliary affine-transformation parameters were copied to the DRAM of the graphic card. The coordinate transformations on the pixels of the target volume were then calculated with massive parallel threads of the GPU, as Eq. (1) shows, where (x, y, z)
Fig. 1. Block diagram of the GPU-accelerated image registration used in this study. The solid black arrows represent an iteration loop to search for the optimal registration parameters. Since the CUDA program is written in C-language environment, all the MATLAB parameters were “translated” by the MEX interface. The rigid transformation was then calculated by GPU parallel computing. The resulting joint histogram was transferred back to MATLAB environment to calculate the MI, which served as the cost function of the optimization algorithm.
714
T.-Y. Huang et al. / Magnetic Resonance Imaging 29 (2011) 712–716
and (x′, y′, z′) are the coordinates before and after transformation. 2
3 2 x0 r11 6 y0 7 6 r21 6 07=6 4 z 5 4 r31 1 0
r12 r22 r32 0
r13 r23 r33 0
32 3 tx x 6 7 ty 7 76 y 7 tz 54 z 5 1 1
ð1Þ
In the transformation matrix, r parameters represent the rotation factors determined by three Euclidean angles, while t parameters represent the translation factors. Since the transformation of each pixel is independent, it is appropriate to parallel computing. Bilinear interpolations were used to estimate the appropriate intensity values for the pixels whose coordinates were not on the regular grid. The transformed 3D volumes were then transferred back to CPU memory to calculate the joint histogram required by SPM. Finally, the joint histogram of both volumes was transferred back to the SPM program in the MATLAB workspace to compute MI
and further Powell optimization loops. Since two image volumes were required for every iteration loop, their memory spaces were set to be persistent (i.e., not deleted after execution). This approach ensures that data duplication between CPU-DRAM and GPU-DRAM is only performed once to save the data communication loading. Fig. 1 illustrates the experimental procedures. The function block linked by the solid black arrows in this figure represents the optimization loop, which efficiently determines the transformation parameters for optimal registration. 2.2. Collecting images for 3D image registration The data sets used in this study were obtained from three healthy adults (aged 22–24 years) on a 3-T whole-body MR system (Trio, Siemens Medical Solution, Erlangen, Germany). An eight-channel RF coil was used to receive the signal. The volunteers underwent two sessions of 3D T1-MPRAGE [19] (repetition time/echo time, 2530/3.42 ms; inversion
Fig. 2. Joint histograms calculated by the implemented GPU-based “spm_hist2.dll.” (A,B) Partially surface-rendered 3D volume pair of a subject, obtained in different days. (C) Before image registration, the joint histogram of the volume pair is more blurred and scattered. According to the information theory, the MI of this histogram is 0.66. (D) After registration, the histogram becomes sharp and concentrated. In this case, the MI achieves a higher value of 1.36.
T.-Y. Huang et al. / Magnetic Resonance Imaging 29 (2011) 712–716
715
time: 1100 ms;matrix, 256×256×176; resolution, 1 mm3) scans on different days. The data sets were used for evaluating GPU performance for single-modality intrasubject image registration.
Table 1 Calculation times of image registration: CPU vs. GPU
2.3. Performance evaluation of simulated data and 3D T1 images
Calculation times (s) CPU-based GPU-based Acceleration factor
The original SPM program (i.e., spm_hist2.dll) and its GPU-accelerated version were first tested on simulated data sets with different matrix sizes (643, 1283, and 2563) to evaluate their performances. On the other hand, the resulting T1 data sets were transferred to a computer workstation equipped with GeForce GTX295 (NVIDIA, USA) with 240 CUDA units per GPU and Core i7 920 (Intel, USA) to test the efficiency of the GPU-accelerated SPM image registration method. The 3D volumes were first converted from DICOM format to the Analyze format used in the SPM system. The volume pair of each subject was then loaded into the MATLAB environment and both CPU- and GPU-based SPM image registration methods were performed. The optimization cost function was MI. The SPM system used a two-step registration with two matrix sizes (i.e., Step 1, down-sampled data; Step 2, full samples) to reduce the calculation time. The sampling steps were 2 and 1 mm in Steps 1 and 2, respectively. The total calculation times were recorded to compare their performances. 3. Results
Subject 1
2
3
419.9±0.3 29.8±0.2 14.0±0.1
383.3±0.1 27.2±0.6 14.0±0.3
377.3±0.1 28.6±0.9 13.2±0.4
rendered from the two 3D volumes of one of the subjects. The joint histogram calculated from the volume pair obtained in different days is blurred (see Fig. 2C) due to mis-registration. SPM calculated the MI and ECC according to the histogram and then used these results as the cost function to search for the optimal registration parameters. Automatic image registration by SPM made the joint histogram sharper and concentrated (see Fig. 2D). Fig. 3 compares the speed of the matrix sizes in the simulated image volumes for registration. The curves in this figure show the calculation time of a single execution on “spm_hist2.dll” (both CPU and GPU versions) with different matrix sizes. The acceleration factor reached 23.6 when the image volume was 2563. Considering that “spm_hist2.dll” only computed the transformations and the joint histogram, this comparison does not reflect the acceleration of the total image registration process. Table 1 shows the total calculation times of the “spm_coreg” programs (both CPU and GPU versions) applied to the test data sets. Each program was performed four times on every volume pair, and the average acceleration factor of three volume pairs was 13.8±0.5.
Fig. 2 shows the joint histograms calculated by the proposed GPU program. Fig. 2A,B shows the surfaces 4. Discussions and conclusions
Fig. 3. These curves, shown on a logarithmic scale, compare the computation times of CPU-based (dashed line) and GPU-based (solid line) implementations of “spm_hist2.dll”. The computation time of the program includes a single execution of the rigid transformation and the joint histogram. For the common matrix sizes used in MR experiments, the acceleration factors of GPU are 19.6–23.5.
This study proposes using the recently advanced GPU parallel computing technique to speed up image registration calculation process. Since SPM is widely used in the MR and brain imaging research communities, a fully compatible GPU-accelerated program is desirable. Therefore, the original SPM library “spm_hist2.dll” written in the CPUbased C language was reprogrammed into the GPU-based CUDA parallel computation model. This allows SPM users to simply replace the registration library with the proposed library to gain the benefit of GPU parallel computing. Commercially available GPU hardware achieved a ∼14-fold speed up (or 300–400 s) in the total computation time in this experimental study. This significant time reduction can be beneficial to clinical practice, which requires as fast a response as possible, or to scientific computing, which generally sacrifices precision to decrease computing time. There are three major computations in the “spm_hist2.dll” program, which are transformation, interpolation and joint histogram. Transformation and interpolation are well suited for GPU parallel computation because these computations are independent in each pixel. However, computing the joint histogram is not a trivial task for parallel processing.
716
T.-Y. Huang et al. / Magnetic Resonance Imaging 29 (2011) 712–716
Different parallel threads that try to increment concurrently the same memory space leads to invalid operations. Thus, to perform concurrent processes, parallel programming generally performs concurrent processes with atomic operations that lock the memory before modifications. However, these atomic operations greatly reduce the overall performance of GPU-based image registration. Therefore, the proposed approach uses the CPU to compute the joint histogram. Shams et al. recently proposed a sort-and-count algorithm that can speed up computing joint histogram with high bins [20]. Their algorithm may also benefit the GPU-based SPM image registration and is worthy of further research. The joint histogram estimated by “spm_hist2.dll” is mainly used on multimodality coregistration. Although the test data pair was acquired with the same image modality, the program could be used for multimodality image registration as well. The GPU hardware architecture requires less power consumption and storage space than conventional highperformance computing clusters. Therefore, it is feasible to build GPU system in the MR console room to speed up clinical routines. We applied this method to the application of SPMbased ASP [21].Time reduction achieved by GPU computation greatly increases the practicality of the ASP system. In the future, fast GPU-based image registration may be applied to real-time image fusion for image-guided therapy systems. However, this topic requires further investigations. Although the proposed program is only compatible with NVIDA GPUs, it can be reimplemented on the platforms of other vendors with minimal difficulty. As GPU computing continues to improve, open standards of GPU computation are developing. Thus, a GPU-accelerated image registration program with cross-platform compatibility could be feasible in the near future. In conclusion, GPU-accelerated computation is a promising method to speed up automatic image registration. Results show that GPU computing can significantly accelerate the SPM registration procedure and has the potential to increase the efficiency of clinical routines in brain imaging and MR data analysis. The GPU program can be downloaded from the Web page (http://mri.ee.ntust.edu.tw/cuda/).
References [1] Friston KJ, Williams S, Howard R, Frackowiak RS, Turner R. Movement-related effects in fMRI time-series. Magn Reson Med 1996;35(3):346–55. [2] Speck O, Hennig J. Motion correction of parametric fMRI data from multi-slice single-shot multi-echo acquisitions. Magn Reson Med 2001;46(5):1023–7. [3] Thesen S, Heid O, Mueller E, Schad LR. Prospective acquisition correction for head motion with image-based tracking for real-time fMRI. Magn Reson Med 2000;44(3):457–65.
[4] Glover GH, Li TQ, Ress D. Image-based method for retrospective correction of physiological motion effects in fMRI: RETROICOR. Magn Reson Med 2000;44(1):162–7. [5] Kim B, Boes JL, Bland PH, Chenevert TL, Meyer CR. Motion correction in fMRI via registration of individual slices into an anatomical volume. Magn Reson Med 1999;41(5):964–72. [6] Glover GH, Lee AT. Motion artifacts in fMRI: comparison of 2DFT with PR and spiral scan methods. Magn Reson Med 1995;33(5): 624–35. [7] Gedat E, Braun J, Sack I, Bernarding J. Prospective registration of human head magnetic resonance images for reproducible slice positioning using localizer images. J Magn Reson Imaging 2004;20 (4):581–7. [8] van der Kouwe AJ, Benner T, Fischl B, Schmitt F, Salat DH, Harder M, Sorensen AG, Dale AM. On-line automatic slice positioning for brain MR imaging. Neuroimage 2005;27(1):222–30. [9] Collignon A, Maes F, Delaere D, Vandermeulen D, Suetens P, Marchal G. Automated multi-modality image registration based on information theory. In: Bizais Y, Barillot C Paola RD, editors. Information Processing in Medical Imaging. Dordrecht: Kluwer Academic Publishers; 1995. p. 263–74. [10] Studholme C, Hill DL, Hawkes DJ. Automated three-dimensional registration of magnetic resonance and positron emission tomography brain images by multiresolution optimization of voxel similarity measures. Med Phys 1997;24(1):25–35. [11] Press WH, Vetterling WT, Teukolsky SA, Flannery BP. Numerical Recipes in C. 2nd ed. Cambridge (UK): Cambridge University Press; 1992. [12] Knoll F, Unger M, Diwoky C, Clason C, Pock T, Stollberger R. Fast reduction of undersampling artifacts in radial MR angiography with 3D total variation on graphics hardware. MAGMA 2010;23(2):103–14. [13] Roujol S, de Senneville BD, Vahala E, Sorensen TS, Moonen C, Ries M. Online real-time reconstruction of adaptive TSENSE with commodity CPU/GPU hardware. Magn Reson Med 2009;62(6): 1658–64. [14] Hansen MS, Atkinson D, Sorensen TS. Cartesian SENSE and k–t SENSE reconstruction using commodity graphics hardware. Magn Reson Med 2008;59(3):463–8. [15] McGraw T, Nadar M. Stochastic DT–MRI connectivity mapping on the GPU. IEEE Trans Vis Comput Graph 2007;13(6):1504–11. [16] Sorensen TS, Schaeffter T, Noe KO, Hansen MS. Accelerating the nonequispaced fast Fourier transform on commodity graphics hardware. IEEE Trans Med Imaging 2008;27(4):538–47. [17] Wells 3rd WM, Viola P, Atsumi H, Nakajima S, Kikinis R. Multimodal volume registration by maximization of mutual information. Med Image Anal 1996;1(1):35–51. [18] Maes F, Collignon A, Vandermeulen D, Marchal G, Suetens P. Multimodality image registration by maximization of mutual information. IEEE Trans Med Imaging 1997;16(2):187–98. [19] Mugler III JP, Brookeman JR. Rapid three-dimensional T1-weighted MR imaging with the MP-RAGE sequence. J Magn Reson Imaging 1991;1(5):561–7. [20] Ramtin S, Parastoo S, Rodney K, Richard H. Parallel computation of mutual information on the GPU with application to real-time registration of 3D medical images. Comput Meth Prog Bio 2010;99 (2):133–46. [21] Ju SY, Tang YW, Huang TY, Peng HH. The inter-scan variations of flow quantifications on human basilar artery: a study controlled the scan conditions with automatic slice positioning and the automatic lumen-area segmentation. Proceedings of Joint Annual Meeting ISMRM-ESMRMB 2010. Sweden): Stockholm; 2010. p. 2334.