Computer Physics Communications 184 (2013) 1364–1371
Efficient GPU-accelerated molecular dynamics simulation of solid covalent crystals

Chaofeng Hou a,∗, Ji Xu a, Peng Wang b, Wenlai Huang a, Xiaowei Wang a

a State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, P.O. Box 353, Beijing 100190, China
b NVIDIA Corporation, China
∗ Corresponding author. Tel.: +86 10 8254 4840; fax: +86 10 6255 8065. E-mail address: [email protected] (C. Hou).
Article history: Received 6 August 2012; received in revised form 19 December 2012; accepted 7 January 2013; available online 11 January 2013.
Keywords: Graphics processing unit; Molecular dynamics simulation; Many-body; Covalent crystal; Reordering; Performance
Abstract
In contrast to previous molecular dynamics (MD) simulations on GPUs with pair potentials or simple many-body potentials, this paper describes in detail an efficient and highly scalable algorithm for GPU-accelerated MD simulation of solid covalent crystals using sophisticated many-body potentials, such as the Tersoff potential for silicon crystals. The algorithm takes full advantage of the reordering and sorting of atoms and of the hierarchical memory of a GPU. The results indicate that about 30.5% of the peak performance of a single GPU can be achieved, with a speedup of about 650 over a contemporary CPU core, and that more than 15 million atoms can be processed by a single GPU at a speed of around 2 ns/day. Furthermore, the proposed algorithm is scalable and transferable, and can be applied to other many-body interactions and to related large-scale parallel computation.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction

In recent years, graphics processing units (GPUs) have become a powerful computational tool in science and engineering. Accompanied by simplified development environments and improved programmability [1–4], GPUs are attracting increasing attention for their high performance-to-cost ratio and their suitability for data-parallel, compute-intensive processing. Nowadays, GPUs are widely applied to the intensive computation and huge data volumes involved in MD simulation [4–10], which promises to substantially increase the scale and resolution of MD simulations. Currently, most MD simulations on GPUs focus on pair potentials [4–8]. Although some algorithms for MD simulations with many-body potentials [4,5,9,10] have been reported, all of them are restricted to angle and dihedral interactions in macromolecules and polymers, and they have their respective limitations, such as being suited only to small or medium-sized proteins [9] or incurring a 3-fold or 4-fold computational cost [5,10]. Accordingly, a specific improvement was devised in Refs. [11,12] to surmount this problem. However, that work did not pursue further optimization, such as effective utilization of the hierarchical memory of a GPU or the design of an efficient
algorithm. In this paper, the atom reordering and sorting methods [13,14] and the effective utilization of shared, constant and texture memory are integrated into the presented algorithm, which achieves excellent performance.

2. MD simulations with many-body potentials

Herein, crystalline silicon is taken as a model system to illustrate the algorithm and its implementation. Crystalline silicon is one of the most important materials for the information technology and solar cell industries. To describe the many-body interactions between the atoms in crystalline silicon as accurately as possible, the Tersoff potential [15,16], one of the most successful models for tetravalent atoms, is adopted here. Its detailed formulation and model parameters can be found in Refs. [12,16]. Initially, the atomic positions of the simulated crystalline silicon are assigned according to a regular diamond-cubic lattice whose unit cell contains eight atoms, and the atomic velocities are drawn from the Maxwellian distribution for a specified temperature. For the canonical ensemble (NVT) mentioned in the following sections, the velocity rectification method is applied to regulate the system temperature [12,17]. In the simulations, the leapfrog algorithm is chosen to integrate the Newtonian equations of atomic motion. Moreover, the mean potential energy per atom is calculated as a benchmark to check the validity and correctness of the presented algorithm [12,18].
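For reference, a minimal sketch of a leapfrog update step is given below, written in CUDA C with one thread per atom; the array names, the use of float3 and the variable dt are illustrative assumptions rather than the actual implementation.

#include <cuda_runtime.h>

// Leapfrog update: v(t+dt/2) = v(t-dt/2) + F(t)/m * dt,  r(t+dt) = r(t) + v(t+dt/2) * dt
__global__ void leapfrog_step(float3 *pos, float3 *vel, const float3 *force,
                              float inv_mass, float dt, int n_atom)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atom) return;
    vel[i].x += force[i].x * inv_mass * dt;   // half-step velocity update
    vel[i].y += force[i].y * inv_mass * dt;
    vel[i].z += force[i].z * inv_mass * dt;
    pos[i].x += vel[i].x * dt;                // full-step position update
    pos[i].y += vel[i].y * dt;
    pos[i].z += vel[i].z * dt;
}

In the algorithm described in Section 3, this update is fused into the two force kernels rather than launched as a separate kernel.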
Fig. 1. (a) Initial reordering of all the calculated atoms in the system, shown as a plane of ‘‘supercells’’; (b) mapping of the threads in the blocks to the calculated atoms when implementing the related kernel functions. The number of atoms in each ‘‘supercell’’ is exemplified by Nb = 3.
3. GPU implementation of MD simulation

As mentioned in Ref. [12], programming many-body interactions such as the Tersoff model on GPUs is fairly intractable. Since the interaction between two adjacent atoms depends on all of their other neighbors, writing the forces exerted on these neighbor atoms is prone to write conflicts and frequent memory accesses when the ‘‘one thread, one atom’’ method [4–8] is applied. To avoid the redundant floating-point operations present in previously reported simulations with angle and dihedral interactions in biological systems [5,10], we devise the following solution, in which the force computation is split into two kernels. In fact, owing to the limited temperature range of this study, the structure of solid crystalline silicon is relatively stable, so an unchanged neighbor list, with four fixed neighbors per atom, can be used throughout the simulations. Of course, if higher temperatures or molten crystals need to be considered, the neighbor range of each atom can be extended to its second neighbors or farther. This feature allows us to bypass the time-consuming and inefficient neighbor-indexing process of traditional MD simulations. Furthermore, to improve the cache hit rate and the computational efficiency, the algorithm follows the spatial locality principle, which requires that related data be arranged in successive memory locations [13,14]. Since the paper focuses on solid crystals, the static reordering method is applied. Different from the previous report, the concept of the ‘‘supercell’’ used here replaces the unit cell of Ref. [13], and the size of the ‘‘supercell’’ in each direction is several times that of the unit cell of solid silicon crystals. When the algorithm is initially set up on a GPU, the atoms in the system are placed sequentially into the array allocated in device memory in the order of the ‘‘supercells’’ (see Fig. 1(a) and the data-layout sketch below), and these memory locations are kept unchanged in the subsequent iterative computation.
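A possible device-side data layout consistent with this description is sketched below; the structure and array names are assumptions made for illustration only. The key points are that the atom arrays are stored in ‘‘supercell’’ order and that the four neighbor indices per atom are built once and never rebuilt.

#include <cuda_runtime.h>

#define NNEIGH 4   // fixed number of covalent neighbors per silicon atom

// Device arrays, of length n_atom (or NNEIGH * n_atom for the neighbor table),
// stored contiguously in "supercell" order so that one thread block reads one
// "supercell" with coalesced accesses.
struct DeviceArrays {
    float3 *pos;     // atomic positions
    float3 *vel;     // atomic velocities
    float3 *force;   // global-memory force list for neighbors outside a supercell
    int    *neigh;   // neigh[NNEIGH * i + k] = index of the k-th neighbor of atom i
};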
Then, the atoms lying in a ‘‘supercell’’ are assigned to a corresponding thread block. If Nb denotes the number of atoms in each ‘‘supercell’’ and m thread blocks process all Natom atoms in the system, each block launches at least Natom/m threads on average to process Nb sequentially arranged atoms (see Fig. 1(b)). To avoid idle threads in the blocks, Natom = m × Nb is satisfied in the algorithm [11,12]. Ultimately, a mapping between the atoms in the simulation system and the threads on the GPU is established, and each atom is handled by its corresponding thread. In pursuit of higher performance, reducing global memory (GM) accesses and avoiding write conflicts are also taken into account. Buffers for the computed atomic positions and forces are created in the shared memory (SM) of the GPU, which has much faster access. Nevertheless, because of the limited size of shared memory, its effective utilization is extremely important. For this purpose, the size and aspect ratio of the assigned ‘‘supercells’’ are selected and optimized carefully, and the atoms contained in each ‘‘supercell’’ are sorted into three groups (see Fig. 2): atoms in the first group have one neighbor inside the ‘‘supercell’’, those in the second group have two neighbors inside, and those in the third group have all four neighbors inside. The number of atoms in the third group should be maximized so as to reduce GM accesses. Finally, as seen in Fig. 2, the atoms inside each ‘‘supercell’’ are aligned according to their number of in-‘‘supercell’’ neighbors, with atoms having the same number of neighbors placed together, which reduces branching and ensures coalesced memory access. Coupled with the ‘‘supercell’’ order given in Fig. 1, this sorting completes the initial assignment of all the atoms in the system; a skeleton of the resulting block-to-‘‘supercell’’ mapping is sketched below. In the force computation part of the algorithm, as mentioned above, the solution with two separate kernel functions is presented in detail in the following. A brief description of the related implementation can also be found in Ref. [24].
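As an illustration of this mapping, the kernel skeleton below stages the positions of one ‘‘supercell’’ into shared memory before the force loop; the block with index blockIdx.x handles the Nb consecutive atoms of ‘‘supercell’’ blockIdx.x. The value NB = 256 anticipates the optimum found in Section 4.1, and all names are assumptions of this sketch rather than the production code.

#include <cuda_runtime.h>

#define NB 256   // atoms (and threads) per "supercell"

__global__ void supercell_skeleton(const float3 *pos, float3 *force, int n_atom)
{
    __shared__ float3 s_pos[NB];     // position buffer for this supercell
    __shared__ float3 s_force[NB];   // force buffer, avoids global-memory write conflicts

    int lid = threadIdx.x;              // index of the atom inside the supercell
    int gid = blockIdx.x * NB + lid;    // index in the reordered global atom array

    s_pos[lid]   = pos[gid];            // coalesced load: atoms of a supercell are contiguous
    s_force[lid] = make_float3(0.0f, 0.0f, 0.0f);
    __syncthreads();

    // ... many-body (Tersoff) force evaluation using s_pos[]; reaction forces on
    //     neighbors inside the supercell go to s_force[], those on outside
    //     neighbors to a force list in global memory, as described in the text ...

    __syncthreads();
    force[gid] = s_force[lid];          // coalesced write-back
}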
Fig. 2. Sorting of the calculated atoms in each ‘‘supercell’’. The red, green and blue atoms belong to the first, second and third groups, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.) Source: Modified and reprinted from Ref. [24]. © 2012, with permission from SAGE.
Fig. 3. Temporal variation of mean potential energy per atom calculated on GPU and CPU, respectively. SP represents single precision, DP represents double precision.
Table 1
Summary of the running environments.
Environment 1: hardware configuration: two Intel Xeon E5670 2.93 GHz CPUs, 48 GB DDR3-1333 ECC memory, four 1 TB SATA hard disks; operating system: CentOS 5.4 (Linux kernel version 2.6.18-164.el5); compiler: icc 11.1.
Environment 2: hardware configuration: NVIDIA Tesla C2050 GPU card; compiler: CUDA 3.2.

Table 2
Specifications of the NVIDIA Tesla C2050 GPU card.
Number of multiprocessors: 14
Number of cores in each multiprocessor: 32
Core clock frequency: 1.15 GHz
Global memory size: 2.82 GB
Constant memory: 64 KB
Shared memory per multiprocessor: 48 KB
Registers per multiprocessor: 32,768
Bandwidth of global memory access: ~144 GB/s

Kernel 1. Collision
In this kernel, the forces that each atom experiences from its adjacent atoms are evaluated by its corresponding thread in the blocks. At the same time, according to the group of each atom, the reaction forces experienced by its four adjacent atoms are written either to the force buffers created in shared memory or, for neighbors outside the ‘‘supercell’’, to the force list in global memory. The complete resultant forces acting on the atoms of the third group in each ‘‘supercell’’ are thus obtained in this step, and the positions and velocities of those atoms are updated.

Kernel 2. Updating
The forces saved in the global-memory force list in Kernel 1 are read back and accumulated onto the atoms of the first and second groups in each ‘‘supercell’’. The complete resultant forces on the atoms of these two groups are then obtained, and their positions and velocities are updated.

To describe the implementation of the force computation in more detail, the relevant pseudocode is written as follows:
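The original listing cannot be reproduced legibly from this copy, so the host-side sketch below only reflects the description above; the kernel and variable names are assumptions, and the numbered comments mirror the steps of the pseudocode, steps 4 and 5 being the two force kernels.

// Per-run setup and time-stepping loop (illustrative sketch).
// 1. Reorder the atoms by "supercell" and sort each "supercell" into the three
//    neighbor groups; copy positions, velocities and the fixed neighbor list to the GPU.
// 2. Bind the atomic position array to texture memory.
// 3. For each time step, clear the global-memory force list used for neighbors
//    outside the "supercells".
// 4. Launch Kernel 1 (collision) to evaluate the forces and update the third-group atoms.
// 5. Launch Kernel 2 (updating) to complete and apply the forces on the first- and
//    second-group atoms.
for (int step = 0; step < n_steps; ++step) {
    clear_force_list<<<m, tn>>>(d_force_list, n_atom);             // step 3
    collision_kernel<<<m, tn>>>(d_pos, d_vel, d_force_list,
                                d_neigh, n_atom);                  // step 4 (Kernel 1)
    updating_kernel<<<m, tn>>>(d_pos, d_vel, d_force_list,
                               d_neigh, n_atom);                   // step 5 (Kernel 2)
}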
Here the kernel functions in all the GPU implementations are launched with one-dimensional blocks and one-dimensional thread indices: tn is the number of threads in each block, threadIdx.x ranges from 0 to tn − 1, and blockIdx.x ranges from 0 to m − 1. Steps 4 and 5 above correspond to Kernels 1 and 2 of the force computation, respectively, and these two kernels occupy most of the running time on the GPU. Analysis of the instruction-to-byte ratio provided by the CUDA visual profiler, together with manual tests, shows that the two kernels are neither compute-bound nor memory-bound. In terms of memory optimization, the atomic position array is bound to texture memory to reduce redundant global memory accesses and the bandwidth they waste, and enough active warps are kept on every multiprocessor to hide global memory latency. Accordingly, a large number of measurements were made to achieve the best tradeoff among the various resources (registers, shared memory, block size, etc.). In terms of instruction throughput optimization, all transcendental functions and floating-point divisions use native GPU math functions, which was found to produce only negligible differences in the statistical results compared with the more precise but more costly versions; an illustrative example is given below.
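As an indicative example of these native math functions (the exact expressions in the production code are not specified here, so the terms below are placeholders), a transcendental term evaluated with the standard library calls and its intrinsic counterpart can be written as:

#include <cuda_runtime.h>

// Accurate library form of a typical transcendental term.
__device__ float term_accurate(float r, float lambda, float a, float b)
{
    return expf(-lambda * r) * cosf(r) + a / b;
}

// Equivalent form using native GPU intrinsics for higher instruction throughput.
__device__ float term_fast(float r, float lambda, float a, float b)
{
    return __expf(-lambda * r) * __cosf(r) + __fdividef(a, b);
}

The same effect can also be obtained globally by compiling with the nvcc option -use_fast_math, at the cost of applying the reduced-precision versions everywhere.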
4. Performance evaluation

In this section, the performance of the algorithm is evaluated in the environments given in Table 1. Specifications of the adopted GPU card, an NVIDIA Tesla C2050 (Fermi), are listed in Table 2. For consistency, all real values are represented in single precision in the CPU and GPU implementations used for the performance comparisons. Firstly, a simple case (Case 1) is run to check the effectiveness and correctness of the presented algorithm. Table 3 lists the detailed parameters of Case 1, including the reduced values favored by the numerical treatment, which correspond to a real physical system of a certain density and temperature. In our simulations, most cases are microcanonical (NVE) ensembles whose initial temperatures are set to 300 K unless otherwise specified. Herein, we compare the computational results obtained on the GPU and on the CPU. The CPU implementation used for this check is a general MD simulation in double precision employing the neighbor-list and link-cell methods, in which the neighbor atoms around each atom must be searched and determined at run time; the GPU implementation adopts the algorithm presented in this paper and uses single precision. Examining the two simulations in Fig. 3, the statistical results computed on the CPU and on the GPU agree in terms of the mean potential energy per atom. The captured snapshots also show that the crystal structures remain essentially diamond-like when thermodynamic equilibrium is reached. Comparison with the literature [15,16,19] demonstrates that the results of the two implementations are correct and reliable.

4.1. Performance optimization
Owing to the sorting of the atoms in each ‘‘supercell’’, the number of atoms whose neighbors all lie inside the ‘‘supercell’’ is expected to be maximized. Furthermore, considering the limits on shared memory and registers in the multiprocessors of the GPU card, the dimensions of the ‘‘supercell’’ can be neither too large nor too small and therefore need to be carefully selected and optimized. Table 4 gives the numbers and fractions of atoms belonging to the different groups for ‘‘supercells’’ of different sizes. The measurements in Fig. 4 indicate that the choice of ‘‘supercell’’ has a significant influence on the total implementation performance and on the two main computational kernels, namely the collision kernel and the updating kernel. It is found that the ‘‘supercell’’ containing 256 atoms, corresponding to 2 × 4 × 4 unit cells in the X, Y and Z directions, is optimal. The analysis of cases with different block sizes shows that the case with a block of 64 atoms has the best L1 cache hit rate (99.93%) and texture cache hit rate (39.16%), so its collision kernel has the highest performance; however, its updating kernel performs worst, which results in poor total performance. In fact, the L1 hit rate decreases monotonically with block size.
Table 3
Major parameters for Case 1.
Parameter       Symbol   Dimensionless value   Dimensional value       Remark
Atom diameter   σ        1                     0.235 nm                Typical diameter of silicon atom
Atom mass       m        1                     4.65 × 10^-26 kg        Mass of silicon atom
Time step       Δt       0.01                  1.27 fs
Density         ρ        0.65                  2330 kg/m^3             2365 kg/m^3 for silicon at 1 atm, 300 K
Atom number     N        2048                  2048                    Varying in different cases
Space height    H        18.48                 4.34 nm                 Varying in different cases
Space width     W        9.24                  2.17 nm                 Varying in different cases
Space depth     L        18.48                 4.34 nm                 Varying in different cases
Table 4
Assignment of the atoms to the three groups in ‘‘supercells’’ of different sizes.
Atoms per ‘‘supercell’’   Dimensions (unit cells)   First group   Second group   Third group   Fraction of third group (%)
64                        2 × 2 × 2                 10            27             27            42.19
128                       2 × 2 × 4                 14            51             63            49.22
192                       2 × 3 × 4                 16            71             105           54.69
216                       3 × 3 × 3                 16            75             125           57.87
256                       2 × 4 × 4                 18            91             147           57.42
288                       3 × 3 × 4                 18            95             175           60.76
384                       3 × 4 × 4                 20            119            245           63.80
512                       4 × 4 × 4                 22            147            343           66.99
Fig. 4. Variation of the implementation performance with different blocks. All the cases contain 1,728,000 atoms with a size of 32.6 nm × 32.6 nm × 32.6 nm. The running times are measured over 1000 time steps.

The case with a block of 288 atoms has the worst total performance; its long collision-kernel time is caused by a large amount of register spilling. An anomalous case is that with a block of 216 atoms: although its fraction of atoms in the third group is high, its occupancy (29%) is lower than that of all the other cases (33%), which should be the major reason for its lower performance. For the case with a block of 128 atoms, the collision-kernel time is excellent, but the updating time is relatively long. Evidently, the computational performance does not increase monotonically with the fraction of the third group, even though the ratio of shared memory accesses to global memory accesses does. In some cases, a larger block size and a higher ratio actually decrease the occupancy. Accordingly, the optimal block size is a tradeoff between the two factors and turns out to be the block of 256 atoms. It should be pointed out that the performance measurements given in the current and following sections are obtained with CUDA 3.2. Through the series of memory and instruction optimizations described in the previous section, the effective memory and instruction throughput in these simulations reaches around 80% of the peak. As a whole, the floating-point operations per second (FLOPS) reach about 30.5% of the theoretical peak of a single GPU; in particular, the case containing up to 15,360,000 atoms, with a size of 217.1 nm × 26.1 nm × 54.3 nm, can be simulated at a rate of about 2 ns/day.

4.2. Performance evaluation of GPU over CPU
In this section, all the GPU implementations use the optimal ‘‘supercell’’ of 256 atoms. To guarantee comparability with the GPU implementation, the same simulation systems are run on one CPU core under Environment 1 in Table 1, where the CPU implementation also uses the unchanged neighbor list in single precision. In these simulations, the atom numbers and the corresponding simulation boxes differ from case to case, but the other parameters and controlling conditions are kept identical to those of Case 1. As seen from Fig. 5, with increasing atom number the speedup of the GPU over the CPU rises rapidly at first and then grows more and more moderately after a critical system size is approached. Ultimately, for the whole computation and for the force computation, the GPU implementation is about 650 and 730 times faster than the CPU implementation, respectively. Compared with the fastest algorithm provided in Ref. [12], the current algorithm runs around ten times faster in both the force computation and the whole computation. The key to the current algorithm is the effective and efficient utilization of the hierarchical GPU memory, especially shared memory, whereas the old algorithm mainly used global memory with little consideration of shared memory. Algorithm optimization is, of course, an open-ended task, but the performance improvement reported here is substantial. Moreover, the current algorithm is transferable and scalable and can be extended to other many-body potentials, such as the Tersoff–Brenner [20], SW [21], DOD, PTHT and BH potentials [22]. In fact, interactions involving more than two atoms are common in solid materials.
Fig. 5. Performance comparison of GPU over CPU. The atom numbers are 2048, 16,384, 131,072, 256,000, 512,000 and 1,024,000, respectively. Other parameters are listed in Table 3.
Fig. 6. Variation of mean potential energy per atom with the size of the simulation systems.
5. Effects of numerical precision

In GPU-based MD simulation, one issue cannot be ignored: as the simulation system becomes larger and the number of iterations grows, the accumulated round-off error becomes more conspicuous. As demonstrated in Fig. 6, the error of the statistical result gradually becomes more evident as the atom number of the system increases. Nevertheless, for the case with 1,024,000 atoms, the deviation from the statistical mean value stays below 0.005% over 10^6 time steps, and the cases with fewer than 10^5 atoms show almost no significant deviation from the statistical mean value over 10^6 time steps. Even for the case with 15,360,000 atoms, the deviation over 10^6 time steps is merely around 0.08% (see Fig. 7). In practice, the total system energy is frequently employed to check and assess the precision of MD simulation software, and since our simulated cases are microcanonical ensembles, the total system energy should be conserved; an example of such a check is sketched below.
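A generic sketch of such a check (not the authors' code; the array names are placeholders) accumulates the per-atom potential and kinetic energies in double precision on the host and monitors the relative drift of the total energy:

#include <cmath>
#include <cstdio>

// Total energy of an NVE run, accumulated in double precision from single-precision
// per-atom contributions copied back from the GPU.
double total_energy(const float *epot, const float *ekin, int n_atom)
{
    double e = 0.0;
    for (int i = 0; i < n_atom; ++i)
        e += (double)epot[i] + (double)ekin[i];
    return e;
}

// Usage: record e0 = total_energy(...) at the first sampling step and report the
// relative drift fabs(e - e0) / fabs(e0) at every subsequent sampling step.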
Fig. 7. Variation of the two statistical mean values with the iterating time step. The atom number of the simulated system is 15,360,000.
Fig. 8. Temporal variation of the two statistical mean values under temperature controlling. The simulated system contains 15,360,000 atoms, and the system temperature is kept at 300 K all the time.
For the case with 15,360,000 atoms in particular, owing to the additional deviation of the mean kinetic energy, the deviation of the mean total energy per atom is larger than that of the mean potential energy per atom, about 0.16%. Nevertheless, these numerical precision effects are reasonably acceptable for the computation of such a huge system [9,25]. To further weaken or even eliminate the effects of floating-point precision, several solutions have been conceived, and temperature control exerted on the simulation systems proves to be a good one (a minimal sketch is given at the end of this section). As seen from Fig. 8, under temperature control the two statistical mean values fluctuate around stable values and no longer show an obvious drift. In practice, the canonical ensemble is frequently encountered and studied in general MD simulations. Admittedly, the current algorithm is restricted to solid crystalline or amorphous materials at temperatures no higher than the melting point. For more general MD simulations with strong diffusion of the system atoms, two solutions have been devised: one couples the neighbor-list and link-cell methods and can be realized on the basis of the first algorithm provided in Ref. [12]; the other couples the GPU and the CPU, with the GPU computing the weakly diffusive part and the CPU the strongly diffusive part [23,24]. This work is in continuous progress.
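Returning to the temperature control mentioned above, one simple realization is velocity rescaling, sketched below as a generic illustration (not necessarily identical to the velocity rectification of Refs. [12,17]); the instantaneous temperature T_current is assumed to have been obtained beforehand from the total kinetic energy.

#include <cuda_runtime.h>

// Rescale all velocities so that the instantaneous temperature matches T_target.
__global__ void rescale_velocities(float3 *vel, int n_atom,
                                   float T_current, float T_target)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atom) return;
    float s = sqrtf(T_target / T_current);   // rescaling factor
    vel[i].x *= s;
    vel[i].y *= s;
    vel[i].z *= s;
}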
In addition, the realization of large-scale parallel computation reported in Ref. [24] demonstrates the scalability of the algorithm.

6. Conclusions

In this paper, an efficient and highly scalable algorithm implemented on a single GPU is presented to speed up the MD simulation of solid covalent crystals with sophisticated many-body potentials. The algorithm involves the reordering and sorting of atoms and makes full and efficient use of the memory hierarchy of the GPU. The results indicate that the algorithm achieves excellent performance: about 30.5% of the peak FLOPS of a single GPU is reached, the speedup relative to a contemporary CPU core approaches 650, and more than 15 million atoms can be processed at a speed of around 2 ns/day. Moreover, the algorithm is transferable and scalable for MD simulation of solid materials and can be extended to other materials and systems with many-body interactions, such as carbon, germanium and silicon carbide.

Acknowledgments

This study is financially supported by the Ministry of Finance under grant No. ZDYZ2008-2, the Ministry of Science and Technology
under grants Nos. 2008BAF33B01 and 2007DFA41320, the National Natural Science Foundation of China under grants Nos. 20821092 and 21106147, the CAS under grants Nos. KGCX2-YW-124, KGCX2-YW-222 and KGCX2-YW-362, and the State Key Laboratory of Multiphase Complex Systems under grant No. MPCS-2012-A-09. The authors also thank Prof. Wei Ge, Dr. Chenxiang Li and Dr. Guofei Shen for valuable help.

References

[1] R.G. Belleman, J. Bédorf, S.F.P. Zwart, New Astron. 13 (2008) 103–112.
[2] W. Liu, B. Schmidt, G. Voss, W. Muller-Wittig, HiPC 2007, in: LNCS, vol. 4873, 2007, pp. 185–196.
[3] NVIDIA, NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1, 2007.
[4] J.E. Stone, J.C. Phillips, P.L. Freddolino, D.J. Hardy, L.G. Trabuco, K. Schulten, J. Comput. Chem. 28 (2007) 2618–2640.
[5] J.A. Anderson, C.D. Lorenz, A. Travesset, J. Comput. Phys. 227 (2008) 5342–5359.
[6] P. Eastman, V.S. Pande, J. Comput. Chem. 31 (2010) 1268–1272.
[7] J.A. van Meel, A. Arnold, D. Frenkel, S.F.P. Zwart, R.G. Belleman, Mol. Simul. 34 (2008) 259–266.
[8] J. Yang, Y. Wang, Y. Chen, J. Comput. Phys. 221 (2007) 799–804.
[9] M.S. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A.L. Beberg, D.L. Ensign, C.M. Bruns, V.S. Pande, J. Comput. Chem. 30 (2009) 864–872.
[10] J. Xu, Y. Ren, W. Ge, X. Yu, X. Yang, J. Li, Mol. Simul. 36 (2010) 1131–1140.
[11] F. Chen, W. Ge, C. Hou, B. Li, J. Li, X. Li, F. Meng, W. Wang, X. Wang, J. Xu, Q. Xiong, Y. Zhang, GPU-Based Multi-Scale Discrete Simulation and Parallel Computing, Chinese Science Press, Beijing, 2009.
[12] C. Hou, W. Ge, Mol. Simul. 38 (2012) 8–15.
[13] S. Meloni, M. Rosati, L. Colombo, J. Chem. Phys. 126 (2007) 121102.
[14] S. Meloni, M. Rosati, A. Federico, L. Ferraro, A. Mattoni, L. Colombo, Comput. Phys. Comm. 169 (2005) 462–466.
[15] J. Tersoff, Phys. Rev. B 37 (1988) 6991–7000.
[16] J. Tersoff, Phys. Rev. B 38 (1988) 9902–9905.
[17] K.H. Hofmann, M. Schreiber, Computational Physics, Springer-Verlag, Berlin, Heidelberg, 1996.
[18] D.C. Rapaport, The Art of Molecular Dynamics Simulation, Cambridge University Press, Cambridge, 2004.
[19] J. Tersoff, Phys. Rev. B 39 (1989) 5566–5568.
[20] D.W. Brenner, Phys. Rev. B 42 (1990).
[21] F.H. Stillinger, T.A. Weber, Phys. Rev. B 31 (1985) 5262–5270.
[22] H. Balamane, T. Halicioglu, W.A. Tiller, Phys. Rev. B 46 (1992) 2250–2279.
[23] C. Hou, W. Ge, Int. J. Mod. Phys. C 23 (2012) 1250015.
[24] C. Hou, J. Xu, P. Wang, W. Huang, X. Wang, W. Ge, X. He, L. Guo, J. Li, Int. J. High Perform. Comput. Appl. (2012) http://dx.doi.org/10.1177/1094342012456047.
[25] B.A. Bauer, J.E. Davis, M. Taufer, S. Patel, J. Comput. Chem. 32 (2011) 375–385.