Technical advances in molecular simulation since the 1980s


Review

Martin J. Field

Dynamo Team, DYNAMOP Group, UMR 5075, Université Grenoble 1, CNRS, CEA, Institut de Biologie Structurale, 71 Avenue des Martyrs, CS 10090, 38044 Grenoble Cedex 9, France

Article history: Received 3 February 2015 and in revised form 5 March 2015.

Keywords: Algorithms; Hardware; Molecular simulation; Software; The last 30 years

Abstract

This review describes how the theory and practice of molecular simulation have evolved since the beginning of the 1980s, when the author started his career in this field. The account is of necessity brief and subjective, and highlights the changes that the author considers have had significant impact on his research and mode of working.

© 2015 Elsevier Inc. All rights reserved.

Abbreviations used: AMD, Advanced Micro Devices; CDC, Control Data Corporation; CI, configuration interaction; CM, Connection Machine; CPU, central processing unit; CRT, cathode ray tube; DEC, Digital Equipment Corporation; DFT, density functional theory; DIIS, direct inversion in the iterative subspace; Fortran, formula translation; FLOP, floating point operations per second (in the text double precision values are used where possible); FMM, fast multipole method; GB, generalized Born (method); GB, gigabyte (unit); GPU, graphics processing unit; HF, Hartree–Fock; HP, Hewlett–Packard; IBM, International Business Machines; IMSL, International Mathematics and Statistics Library; MB, megabyte; MD, molecular dynamics; MIT, Massachusetts Institute of Technology; MM, molecular mechanical; MO, molecular orbital; MPI, message passing interface; NAG, Numerical Algorithms Group; NIH, National Institutes of Health; OS, operating system; PB, Poisson–Boltzmann; PC, personal computer; PME, particle mesh Ewald; PS, picture system; QC, quantum chemical; SCF, self consistent field; TPS, transition path sampling; US, umbrella sampling; VAX, virtual address extension; VMS, virtual memory system.

Introduction

The first electronic computers were developed during and immediately after the Second World War. Their capability in tackling complex numerical problems was obvious, and they rapidly gained an important foothold in scientific research. One of the areas that blossomed due to the advent of computing was molecular simulation, in which the behavior of molecular systems is modeled at the atomic level. Previously, calculations had to be done by hand or with primitive mechanical or electromechanical devices, but the new machines permitted the use of either novel or hitherto impractical techniques. Examples from the 1950s and 1960s include the use of Monte Carlo and molecular dynamics (MD) methods to study gases and liquids [1], and the extension of molecular orbital (MO) Hartree–Fock (HF) and configuration interaction (CI) quantum chemical (QC) techniques to molecules containing more than a handful of atoms [2].

Progress since these early days has been rapid, and computer simulation is now an integral part of the scientific process. The most public indication of this in the area of molecular science has been the award of two recent Nobel Prizes in Chemistry. The first, in 1998, was given to Walter Kohn and John Pople for a mixture of theoretical and computational work, the former for his development of density functional theory (DFT) and the latter for his development of computational methods in quantum chemistry [3]. By contrast, the 2013 Chemistry Prize was given to three recipients, Martin Karplus, Michael Levitt and Arieh Warshel, all of whom are primarily computationalists. They were cited for their development of multiscale models for complex chemical systems in the late 1960s and 1970s [4].

This review provides an account of the changes in molecular simulation since the early 1980s, when the author first started working in this area. The items that have been chosen are necessarily highly subjective, and include technical advances that have impacted the practice of molecular simulation in addition to purely scientific developments.

Hardware

An autobiography

The world of scientific computing at the beginning of the 1980s was very different from that of today.


Electronic hand-held calculators had only just become inexpensive enough to be widely used, and personal computing was in its infancy, with the release of the Apple II at the end of the 1970s and that of the first IBM personal computer (PC) at the beginning of the 1980s. At this time, computational chemistry seemed to rely on large mainframe computers located in centralized centers, although it was being realized, in the US at least, that smaller so-called mini-supercomputers within a research group could be just as productive and more cost-effective [5].

All the author's PhD work, in quantum chemistry, was done at the University of Manchester Regional Computer Center in the UK. The bulk of the calculations was done on the two CDC 7600 computers at the center, although a more powerful CDC Cyber 205 computer became available towards the end of the author's thesis. Interaction with these mainframes was done via intermediate front-end computers, upon which files for job submission could be prepared and back to which results files would be transmitted once a job had finished. Initially, communication with these front ends was done via teleprinter, which had to be periodically reloaded with rolls of paper (!), but these were later superseded by CRT video display terminals that the author's host research group managed to acquire. However, this did not mean an end to the use of paper, as it was customary to print out all job output for perusal and for subsequent storage, leading to many boxes of results stacked around the office. Electronic backups, when performed, were carried out on 10.5 inch magnetic reel tapes, which meant physically visiting the computer center where the tape drives were to be found.

The CDC 7600s were notable machines. They were designed by Seymour Cray at CDC before he left to found his own company, Cray Research, which dominated supercomputer design in much of the 1970s and 1980s. The 7600s were actually quite old in the 1980s, as they had appeared on the market at the beginning of the 1970s and had been overtaken in terms of computing power by the Cray-1, which was released in the mid-1970s. The CDC 7600 used a word length of 60 bits, different from the 32 or 64 bits that are commonly used today, had an achievable top performance of approximately 10 MFLOP, and had a dual memory system consisting of 64 Kword of small core memory and 192 Kword of large core memory (very roughly, 0.5 MB and 1.5 MB, respectively).

This small memory presented quite a challenge when writing QC algorithms, as it limited the size of problems that could be handled and meant that there was much reading and writing of intermediate results to and from disk. As an example, an important step in HF calculations is the diagonalization of the Fock matrix to give the orbitals and their energies. If the number of basis functions in the calculation is N, the Fock matrix, which is symmetric, requires N(N+1)/2 words of memory, whereas the matrix of orbitals, which is square, requires N^2 words of memory. This implies that the maximum number of basis functions that could be handled in small core memory was approximately 200, given the assumption that these are the two principal arrays required for diagonalization.
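As a check on this memory arithmetic, the estimate of roughly 200 basis functions is easily reproduced; the short sketch below (in Python, the language the author later adopted, so an anachronism used purely for illustration) counts the words needed by the two arrays against the small core memory quoted above.

    def fock_diagonalization_words(n_basis):
        """Words for the two principal arrays in a Fock diagonalization: the
        symmetric Fock matrix stored as a triangle, N(N+1)/2, plus the square
        matrix of orbital coefficients, N^2."""
        return n_basis * (n_basis + 1) // 2 + n_basis ** 2

    small_core_words = 64 * 1024  # the 64 Kword of CDC 7600 small core memory

    for n in (190, 200, 210):
        words = fock_diagonalization_words(n)
        verdict = "fits" if words <= small_core_words else "does not fit"
        print(n, "basis functions ->", words, "words:", verdict)
    # 200 basis functions need 60,100 words, just under the 65,536 available,
    # whereas 210 already exceed it, in line with the limit of about 200.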
After his thesis the author went as a postdoctoral fellow to the Department of Chemistry at Harvard University in Cambridge, Massachusetts. Here the principal computers for research were organized at a departmental level, in contrast to Manchester where they were in a multi-university regional center. The departmental machines were made up of mini-supercomputers, namely two DEC VAX 11/780s, supplemented later by a Convex C2. Although the VAXes were slower machines than the CDC 7600s that the author had used previously, with approximate speeds of 1 MFLOP, they were much more intuitive to use.

The Boston area was an exciting place to be in the mid to late 1980s, as it was one of the centers of the computer industry. There were established companies, such as DEC, but also younger start-ups, including Alliant, Apollo and Thinking Machines.

The latter was located at MIT and was one of the first manufacturers to produce computers with a massively parallel architecture. The author had the privilege of working on one of their earlier machines, the CM-2 Connection Machine, for a short while. The CM-2 consisted of a cube approximately 1.5 m per side and could house up to 65,536 (2^16) very simple single-bit processors. Numerical calculations were accelerated by adding floating-point units, typically one per group of 32 of the simpler processors. Although a full machine had, in principle, a performance in excess of 1 GFLOP, this was difficult to achieve for the type of molecular simulation algorithm that the author was interested in.

The trend to more localized computer power continued in the author's next major appointment, at the National Institutes of Health (NIH) in Bethesda, Maryland. The host group, in the Division of Computer Research and Technologies, relied on their own Apollo workstations for most of their calculations, although there was also some access to the NIH's centralized IBM 3090 mainframes. When the author left to set up his own group at the beginning of the 1990s in the Institute of Structural Biology in Grenoble, France, it was clearly preferable to purchase personal workstations rather than employ external resources. At the time, the most cost-efficient machines for calculation were HP 9000 700 series workstations, and it was on five of these that the group relied for the next several years. However, when it came time to replace these machines towards the end of the 1990s, the choice was made to switch to commodity PCs, using Intel chips, as these had increased significantly in computational power during the decade and were unbeatable on cost. Today the situation is very similar, and Intel-based PCs, or their AMD equivalents, dominate computational research. Even the majority of the most powerful supercomputers consist of clusters of tightly-coupled nodes with Intel or AMD chips, although some other companies, including Fujitsu, IBM and ShenWei, also manufacture their own processors.

To terminate this section, it is worth reflecting on the progress made in general-purpose computational hardware in the last 30 years. The machines the author first used were room-size, with approximate computational speeds and memory sizes of 1 MFLOP and 1 MB, respectively. Since then there has been at least four orders of magnitude improvement in performance as, currently, a high-end single-processor PC that comfortably fits onto a desktop will have a speed of several GFLOPs and a memory of several GBs.

Specialized computational hardware

The previous section gave a brief and rather simplified autobiographical overview of the changes in general computing hardware over the last 30 years. This section describes some more specialized topics. One of these is the design of processors that are specific to certain types of calculation. In the molecular domain most work appears to have been done on chips for accelerating MD simulations with molecular mechanical (MM) force fields. Examples include the FASTRUN accelerator of Fine et al. [6], the Gemmstar project of Brooks et al. [7] that was being undertaken by the author's host group when he was at the NIH, and the MDGRAPE series of processors that are currently in their fourth incarnation [8,9]. The problem with these efforts is that the production of specialized chips is extremely demanding and so can be very slow in a niche market such as that for MD simulations unless substantial resources are provided. This has meant that MD-specific processors are often rendered obsolete by the newest generation of general-purpose processors because they cannot be developed rapidly enough.


Probably the most successful example of an MD-specific machine is ANTON of D.E. Shaw Research in New York. Only a small number of ANTONs have been built, mostly for internal use by the company, but they have permitted ground-breaking MD simulations of unprecedented length to be performed [10]. The current version of the machine, ANTON-2, is claimed to allow simulation rates of multiple microseconds of physical time per day for systems with millions of atoms, and to be at least two orders of magnitude faster than alternative hardware for standard protein benchmark systems [11].

Another area where specialized hardware is required, and that has seen rapid advances, is that of visualization. This is a domain in which the author has done little development, although he has used it as a day-to-day research tool. The first graphics machines the author remembers were the Evans and Sutherland PS vector graphics displays that were driven by a host computer, such as a VAX, but these were supplanted in the late 1980s and early 1990s by cheaper, stand-alone raster graphics workstations, most notably those manufactured by Silicon Graphics. It is sobering to note that even at this time the most reliable way of obtaining publication quality images of a molecule, such as a protein, was to take an analog photograph of the graphics display's screen! In the late 1990s, driven in large part by the competitive computer games market, commodity PCs began to have graphics performances approaching those of the more expensive specialized machines, which ultimately led to the demise of Silicon Graphics's graphics workstation activity.

The handling of images in molecular graphics or other computer graphics applications is very computationally intensive, and it has now become common to have specialized graphics processing units (GPUs) in PCs, games consoles and other devices that speed up image generation and manipulation. Many of the operations required for rendering graphics are simple, but they are expensive because a very large amount of data needs to be processed. Modern GPUs take advantage of this by having hundreds or even thousands of relatively simple processors that operate in parallel on a given set of data. Many, although not all, of the most costly parts of other calculations, including those in many molecular simulations, are of a similar type to those that GPUs can accelerate, and so it was rapidly realized that GPUs might be useful in non-graphical applications as well. This has led, since the mid-2000s, to an increasing use of GPUs for general-purpose computing and to the design by manufacturers of GPUs that are better adapted for such tasks. To illustrate this, some of the most powerful recent general-purpose GPUs from the AMD FirePro and NVidia Tesla series are, in principle, capable of computations in excess of 1 TFLOP, many times faster than the fastest desktop CPUs (it should be noted that the latest Intel Xeon Phi coprocessors, with 60 cores, have comparable speeds, although they are not GPUs).
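To give a flavor of the data-parallel style of work that GPUs reward, the sketch below evaluates all pairwise squared distances of a set of toy coordinates as a single large array operation. It assumes the third-party CuPy library, which is not mentioned in the article; CuPy mirrors the NumPy interface but executes the array operations on the GPU.

    import cupy as cp  # NumPy-like arrays whose operations run on the GPU

    n = 2000
    coords = cp.random.random((n, 3)).astype(cp.float32)  # toy coordinates on the device

    # All-pairs squared distances as one regular, massively parallel array
    # operation: exactly the kind of work that maps well onto thousands of
    # simple GPU cores.
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = cp.sum(diff * diff, axis=-1)

    print(float(dist2.mean()))  # pull a single reduced number back to the host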

Software

This section briefly reviews how software has evolved in response to the rapid improvements in hardware that have occurred. Clearly much of this evolution has been at the system level and so is not directly visible to the majority of users. Many of the machines that the author worked on in his first years as a scientist had operating systems that (to put it mildly) were not very user friendly. However, two exceptions were Apollo's Domain/OS and DEC's VMS, which were a pleasure to use, with many of the capabilities that would be expected in a modern operating system. Since the early 1990s the author's group has relied on Unix, initially various proprietary versions, followed in the late 1990s by different flavors of Linux.


For scientific programming there was only one choice at the beginning of the 1980s, and this was Fortran. Although the Fortran 77 standard had appeared some years earlier, adoption was slow, and the programs the author first worked on were in Fortran 66. With care it was possible to write reasonably clear, as well as efficient, code, but obscure, unreadable code was unfortunately all too common [12], an experience that has marked the author to this day. Fortran 77 was slightly nicer to use than Fortran 66, but it also had numerous limitations. Nevertheless, due to a lack of viable alternatives, it remained the author's programming language until a switch was made to the newer Fortran 90 in the early 1990s. The latter was a significant improvement but, like Fortran 77, it again suffered from slow uptake, especially from many workstation manufacturers who did not offer Fortran 90 compilers on their machines. This meant that users had to resort to third-party commercial vendors, such as NAG, as free compilers were not available at the time. This situation persisted until the early 2000s and it, combined with a growing dissatisfaction with Fortran itself and a greater experience with other languages, led the author to abandon Fortran altogether for his future projects. Several different replacements were considered, but the author currently relies on the scripting language Python [13], with extensions written in the compiled language C for the computationally most intensive parts of a calculation. This trend away from Fortran is a general one, although there is still much legacy Fortran code and new versions of the language have been standardized. As anecdotal evidence for this assertion, the three third-party programs most used in the author's group, for MD simulation, QC calculation and visualization, are all written in C++.

One of the challenges for computer languages, their compilers and their users is to adapt to changes in the underlying hardware. An example of this occurred in the 1970s and 1980s when vector processing computers, such as the Cray-1, appeared. It took several years for programmers to learn the best way to take advantage of these machines and to modify their codes accordingly. This problem is even more acute for computers with multiple, possibly heterogeneous, processing elements that work in parallel. As contemporary examples, the computational workstations in the author's group have two multicore CPUs and one GPU, whereas the majority of the most powerful supercomputers consist of tightly-coupled clusters of nodes of similar type. Programming for such machines requires special approaches and is an area of intense on-going activity; however, tools that currently have significant impact include CUDA [14], MPI [15], OpenCL [16] and OpenMP [17].

An important way in which the developers of scientific software can benefit from the experience of others, and be more efficient, is by employing third-party libraries that are specialized for certain tasks. For numerical computing, efforts in this area started in the 1960s and 1970s, giving rise to a number of libraries written in Fortran, the successors to some of which are still in use today. Two that the author remembers using in the 1980s are the commercial libraries from IMSL and from NAG. Notable developments since then include BLAS [18] and LAPACK [19], both in Fortran and both now de facto standards for linear algebra, and the GNU scientific library [20], in C.
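As an illustration of the scripting-plus-compiled-libraries style just described, the snippet below diagonalizes a random symmetric "Fock-like" matrix from a script; NumPy is assumed here as the wrapper (it is not named in the article), and the actual work is carried out by the compiled LAPACK and BLAS routines underneath.

    import numpy as np

    n = 1000
    a = np.random.rand(n, n)
    h = 0.5 * (a + a.T)  # a random symmetric matrix standing in for a Fock matrix

    # eigh dispatches to LAPACK's symmetric eigensolvers, so the O(N^3) work
    # runs in optimized compiled code while the driving logic stays in the
    # scripting language.
    eigenvalues, eigenvectors = np.linalg.eigh(h)

    print(eigenvalues[:5])  # the five lowest eigenvalues ("orbital energies")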
Although they are not strictly speaking software, this section will terminate with mention of two other changes in the last 30 years that have, directly or indirectly, revolutionized the way scientists work and interact. The first is the world wide web, accessed via the internet, which has facilitated beyond recognition the dissemination and exchange of information. The second is the rise of the free software and open source movements, which has led to a plethora of free, high-quality and modifiable software.


Algorithms

In looking back over the past three decades for this article, the author was surprised to recall how much of the basic infrastructure of molecular simulation was already in place by the mid-1980s. Thus, in ab initio quantum chemistry with Gaussian basis sets, one could calculate closed- and open-shell HF wave functions for molecules of a few tens of atoms, a process greatly facilitated by the then recent introduction of the direct inversion in the iterative subspace (DIIS) method for self consistent field (SCF) convergence [21]. In addition, it was possible to estimate correlation corrections for smaller molecules using a variety of techniques, including complete active space (CAS) SCF methods [22], CI and Møller–Plesset perturbation theory. The analytic evaluation of first and second derivatives was also possible for HF (and some of the other) methods, and so geometry optimizations and saddle point searches could be performed, and vibrational frequencies determined. For larger molecules, one could resort to a variety of semi-empirical QC techniques, most notably the MNDO [23] and AM1 [24] methods of Dewar and co-workers.

The principles for the simulation of liquids and biomacromolecules were also established [1]. Thus, a variety of empirical potentials existed, including some for proteins [25–27], and it was possible to carry out geometry optimizations, and both Monte Carlo and molecular dynamics simulations, in a variety of thermodynamic ensembles.

Many of the advances in the last 30 years have arisen by adapting the techniques just mentioned to the evolving hardware and benefitting from the larger systems that could be studied and the more extensive simulations that could be performed. There have, however, been many algorithmic advances, some of the more significant of which, at least to the author, are described below.
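As a concrete illustration of one of the ingredients mentioned above, the DIIS extrapolation itself is compact: given a history of trial Fock matrices and their error vectors, a small augmented linear system yields the mixing coefficients. The sketch below, with random matrices standing in for real SCF quantities, is a minimal illustration of the idea rather than any particular program's implementation.

    import numpy as np

    def diis_extrapolate(fock_history, error_history):
        """Return the DIIS linear combination of the stored Fock matrices.

        The coefficients c_i minimize the norm of sum_i c_i e_i subject to
        sum_i c_i = 1, which leads to a small augmented linear system built
        from the overlaps of the error vectors."""
        m = len(fock_history)
        b = np.zeros((m + 1, m + 1))
        b[:m, :m] = [[np.vdot(ei, ej) for ej in error_history] for ei in error_history]
        b[m, :m] = b[:m, m] = -1.0
        rhs = np.zeros(m + 1)
        rhs[m] = -1.0
        coeffs = np.linalg.solve(b, rhs)[:m]
        return sum(c * f for c, f in zip(coeffs, fock_history))

    # Toy usage with random symmetric matrices in place of real Fock/error matrices.
    rng = np.random.default_rng(0)
    focks = [0.5 * (m + m.T) for m in rng.random((4, 10, 10))]
    errors = [rng.random((10, 10)) - 0.5 for _ in range(4)]
    print(diis_extrapolate(focks, errors).shape)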

Density functional theory

Undoubtedly the major change in QC that has occurred since the 1980s has been the introduction and subsequent ubiquity of DFT methods. Modern DFT dates from the mid-1960s, with the publication of papers outlining the Hohenberg–Kohn theorem [28] and the Kohn–Sham equations [29]. Subsequently, the approach was widely adopted in the physics community for electronic structure calculations of systems such as materials. Chemists were less ready to take up these techniques, although this does not mean that they were ignored completely. One such technique was the multiple scattering-Xα approach, which was shown in the mid-1970s to be an approximation to exact DFT [30]. However, it was not extensively employed, due to numerical shortcomings (for example, in obtaining geometries) and a perceived lack of accuracy with respect to MO methods.

The situation changed markedly during the 1980s and early 1990s. First, there were a number of technical improvements that enabled MS-Xα and later DFT calculations to be performed with atom-centered basis functions, such as Gaussians. These included density-fitting approaches using additional auxiliary basis sets [31] and more accurate methods of numerical integration [32]. Second, newer density functionals were developed, initially by adding gradient corrections [33–35] and then by including a fraction of HF exchange [36], which were much more accurate for chemical problems. Finally, there was the incorporation of DFT methods into common QC programs, notably Gaussian 92 [37], so that they became readily accessible to the QC community.

Since the mid-1990s, DFT has become the predominant method of QC calculation due to its ease of application and its relatively low cost [38,39]. It has also undergone continued development.

A whole range of functionals now exists, including those that account for effects such as dispersion [40]. Likewise, via its time-dependent formalism [41], DFT can provide a good description of the electronic excited states of many systems. Despite these successes, the results given by DFT calculations are not completely reliable, and higher accuracy methods are required for some problems and for benchmark calculations. Examples of the latter include correlated MO approaches, such as multireference CI and perturbation theory [42], together with alternative techniques such as quantum Monte Carlo [43] and the more recent density matrix renormalization group methods [44].

To terminate this section, it would be remiss not to mention the Car–Parrinello method, introduced in 1985 [45], which made tractable molecular dynamics simulations using DFT methods in combination with plane-wave basis sets. This algorithm has had substantial impact for calculations on periodic systems, such as crystals and small boxes of molecular liquids.
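To illustrate just how routine a Kohn–Sham calculation with a hybrid functional has become, the sketch below runs one for a water molecule. It assumes the open-source PySCF package; neither the package nor the particular functional and basis set are taken from the article.

    from pyscf import gto, dft

    # A water molecule defined directly in Cartesian coordinates (Angstrom).
    mol = gto.M(
        atom="O 0.000 0.000 0.000; H 0.000 0.757 0.587; H 0.000 -0.757 0.587",
        basis="def2-svp",
    )

    # Restricted Kohn-Sham DFT with a hybrid functional that mixes in a
    # fraction of HF exchange.
    mf = dft.RKS(mol)
    mf.xc = "b3lyp"
    total_energy = mf.kernel()  # runs the SCF and returns the energy in hartree

    print(total_energy)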

Including the environment

A significant amount of effort in the last 30 years has gone into ways of representing the environment in molecular simulations as, no matter how fast the computer or how efficient the algorithm, it is always possible to overwhelm the available resources by choosing a system that is large enough. There are three different strategies that will be considered here. The first is to treat all the particles in an infinite, condensed phase system explicitly but to assume that there is periodic or translational symmetry. These techniques will be discussed in more detail in the next section.

A second strategy is to forgo an atomistic description of the environment and to model it as some sort of homogeneous medium. These approaches are particularly appropriate for the modeling of systems in solvent or weak ionic solutions, and are called implicit solvent models. The principles underlying these types of model were developed at the beginning of the 20th century through the work of Born, Debye, Kirkwood, Onsager and others [46]. One of the characteristics of these models is that they assume that the solute occupies a cavity in the medium. The resulting equations are difficult to solve for molecule-shaped cavities, although it is possible for simple shapes such as spheres and ellipsoids. Implicit solvent models were not yet mainstream in QC at the beginning of the 1980s but gradually became so during the decade that followed, as consistent ways of including the solvation terms in the QC formalism were found, and more realistic and tractable treatments of the cavity were implemented. Today implicit solvent models are standard parts of most QC programs, although efforts continue to improve their accuracy and robustness [47,48].

A related development that occurred in the 1980s and the early 1990s was the application of realistic implicit solvent models to large molecules, such as proteins, using MM rather than QC energy functions. These methods numerically solved the Poisson–Boltzmann (PB) equation for the electrostatic free energy of the combined solute/solvent system and its components, from which a variety of useful quantities could be estimated. These included binding energies, such as that between a ligand and a protein, and the changes in pKa or redox potential that a group undergoes in different environments [49–51]. PB methods are now standard analysis tools in macromolecular modeling, but they have not been used extensively for geometry optimizations or molecular dynamics simulations due to the expense of calculating accurate energy derivatives. To overcome this, an approximation to the PB formalism, called the generalized Born (GB) model, was introduced by Still and co-workers in the early 1990s [52]. GB-like methods are now the most widely used implicit solvent approach for macromolecular MD simulation.
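The generalized Born idea can be stated compactly. In the form introduced by Still and co-workers [52], the polarization free energy is a double sum over atom pairs screened by an effective function of the interatomic distances and so-called effective Born radii. A minimal sketch follows, in atomic units, with the effective radii assumed to be precomputed by some separate procedure, as they would be in a real implementation.

    import math

    def gb_polarization_energy(coords, charges, born_radii, eps_in=1.0, eps_out=78.5):
        """Still-type generalized Born polarization energy (hartree).

        coords     : list of (x, y, z) positions in bohr
        charges    : list of atomic partial charges in units of e
        born_radii : list of effective Born radii in bohr (assumed precomputed)
        """
        prefactor = -0.5 * (1.0 / eps_in - 1.0 / eps_out)
        energy = 0.0
        n = len(charges)
        for i in range(n):
            for j in range(n):  # the i == j terms give the Born self-energies
                r2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
                rirj = born_radii[i] * born_radii[j]
                f_gb = math.sqrt(r2 + rirj * math.exp(-r2 / (4.0 * rirj)))
                energy += charges[i] * charges[j] / f_gb
        return prefactor * energy

    # Toy two-site example: an ion pair with guessed radii.
    print(gb_polarization_energy([(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
                                 [1.0, -1.0], [3.0, 3.5]))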


The third strategy that will be mentioned is the hybrid potential QC/MM approach (the abbreviation QM/MM is more commonly used, but QC/MM is more precise). This method is appropriate for problems, such as enzymatic reactions, that require an explicit atomistic representation of the environment, but for which the environment can be treated with a more approximate, and cheaper, level of theory. The QC/MM method is the prototypical example of a multiscale method, and was first introduced by Warshel and Levitt in 1976 in their landmark study of the enzymatic reaction in lysozyme [53]. However, due to limited computational power, and a number of problems in the original formulation of the method, QC/MM potentials were not widely used until the 1990s, after significant methodological developments by Singh and Kollman [54], and by Bash, Field and Karplus [55,56]. A recent review highlighting the history of QC/MM methods until the mid-1990s can be found in Liu et al. [57]. Since then, they have become indispensable for studying many problems, including reactions and other localized quantum processes, such as electronic excitation, in the heterogeneous condensed phase.
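Schematically, an additive QC/MM energy of the kind described above is a sum of three terms: a QC energy of the reactive region polarized by the surrounding MM point charges, the ordinary MM energy of the environment, and a van der Waals coupling between the two regions. The sketch below shows only this bookkeeping; the functions that would call a real QC code and a real force field are left as clearly marked placeholders, and the treatment of covalent bonds crossing the boundary is ignored.

    def qc_energy(qc_atoms, mm_point_charges):
        """Placeholder: in practice this would call a QC program, passing the MM
        partial charges as an external field (electrostatic embedding)."""
        return 0.0

    def mm_energy(mm_atoms):
        """Placeholder: in practice this would call an MM force field for the
        environment atoms (bonds, angles, torsions, electrostatics, etc.)."""
        return 0.0

    def qc_mm_vdw_energy(qc_atoms, mm_atoms):
        """Placeholder: Lennard-Jones interactions between the QC and MM regions,
        which are not captured by the electrostatic embedding."""
        return 0.0

    def qcmm_total_energy(qc_atoms, mm_atoms, mm_point_charges):
        """Additive QC/MM energy with electrostatic embedding (a schematic sketch)."""
        return (qc_energy(qc_atoms, mm_point_charges)
                + mm_energy(mm_atoms)
                + qc_mm_vdw_energy(qc_atoms, mm_atoms))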

Linear scaling

An important property of any algorithm is how its cost scales as the problem size increases. Many of the most common operations in molecular modeling have computational costs that scale polynomially with the system size, N. Examples include the evaluation of the pairwise electrostatic interactions in a molecule, the diagonalization of a matrix, and the determination of the two-electron integrals using an atom-centered basis set in a HF or DFT calculation. These scale (formally) as N^2, N^3 and N^4, respectively. To take full advantage of increases in computer power, it is clearly advantageous to have algorithms with scalings that are as low as possible, ideally linear or less.

Much effort has gone into developing linear-scaling algorithms. An area in which they have had significant impact is the evaluation of the pairwise electrostatic interactions in MM/MD simulations of systems with periodic symmetry, such as crystals, liquids and solutions. Ewald summation is a technique for the evaluation of these interactions to high accuracy that has been known since the early 1920s, and whose cost scales as N^(3/2) with system size (with the caveat that optimal parameters for the summation are chosen [58]). Although the Ewald method was used in molecular simulations during the 1980s and before, it was probably more common to employ cheaper truncation techniques in which only those interactions within a given inter-particle cutoff distance were calculated. These latter schemes are linear-scaling, but they are also approximate because they neglect potentially important long-range interactions. The situation changed in the early to mid-1990s with the development of the approximate, but highly accurate and efficient, particle mesh Ewald (PME) algorithm, which scales as N ln(N) [59,60]. Methods of the PME type have since become standard for MD simulations with MM potentials.

The fast multipole method (FMM) is a distinct algorithm for electrostatic calculations that was developed by Greengard and Rokhlin in the late 1980s and exhibits true linear scaling, N [61]. In contrast to the Ewald methods, it is applicable to both periodic and vacuum systems. The FMM is a very elegant technique that was subsequently selected as one of the top 10 algorithms of the 20th century [62]. In practice, it turns out that the PME method is more efficient for the MM/MD simulation of periodic systems [58], but the FMM is used in other circumstances, such as determining the Coulombic interactions between continuous charge distributions in large QC calculations.

QC approaches would benefit greatly from linear-scaling algorithms. Apart from the application of the FMM just mentioned, there has been intense effort in finding such methods for the determination of the electron density or wavefunction in DFT and HF calculations. Despite this, it is probably fair to say that the majority of QC calculations are still done with standard non-linear-scaling methods, because the latter are more robust and efficient for systems of the size that are typically studied (a few hundred atoms or less). Nevertheless, it is clear that benchmark calculations on very large systems are only feasible with the lower scaling methods [63]. A large variety of QC linear-scaling algorithms have been published and it is still an area of active research [64–66]. Notable examples include the divide and conquer algorithm of Yang and co-workers in the early 1990s [67], density matrix minimization, which attains linear scaling through the sparsity of the density matrix, and methods that rely on localized electronic orbitals.
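The reason simple truncation schemes scale linearly is that, with a cell (or neighbor-list) construction, each particle only ever looks at particles in its own and adjacent cells. A minimal, non-periodic sketch of the idea is given below; production codes add periodic boundary conditions, minimum-image conventions and proper force evaluation.

    from collections import defaultdict
    from itertools import product
    import math

    def truncated_coulomb_energy(coords, charges, cutoff):
        """O(N) pairwise Coulomb energy with a distance cutoff, via a cell list."""
        cells = defaultdict(list)
        for index, (x, y, z) in enumerate(coords):
            cells[(int(x // cutoff), int(y // cutoff), int(z // cutoff))].append(index)

        energy = 0.0
        for cell, members in cells.items():
            # Gather the particles in this cell and its (up to) 26 neighbors.
            neighbors = []
            for offset in product((-1, 0, 1), repeat=3):
                neighbors.extend(cells.get(tuple(c + o for c, o in zip(cell, offset)), []))
            for i in members:
                for j in neighbors:
                    if j <= i:  # count each pair once
                        continue
                    r = math.dist(coords[i], coords[j])
                    if r < cutoff:
                        energy += charges[i] * charges[j] / r
        return energy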

Sampling

The ability to determine rapidly the energy of a configuration, together with properties such as the atomic forces, is necessary, but not sufficient, for molecular simulation. It is also important to use the information obtained from each calculation in as efficient a way as possible, so as to explore the parts of the system's phase space that are relevant for the problem being studied. There has been a veritable explosion of algorithms that enhance sampling in this way in the last 30 years, and so only a handful of those that the author has found most useful in his work will be discussed in what follows.

As mentioned earlier, at the beginning of the 1980s the basics of the simulation of liquids and of other condensed-phase systems using Monte Carlo and MD techniques were established, and attention was turning to the computation of thermodynamic and kinetic quantities that could be compared to experiment. A major focus of activity was the development of methods for the calculation of free energies. This is an extensive topic that can only be treated cursorily here and so, for further information, readers are referred to the recent, comprehensive reviews in the volume by Chipot and Pohorille [68].

In the late 1970s umbrella sampling (US) was introduced by Torrie and Valleau [69] for the evaluation of free energy differences between two different states of a system that were characterized by having distinct values of one or more order parameters. The method worked by performing simulations, either Monte Carlo or MD, in the presence of a series of biasing, umbrella, potentials that concentrated sampling around particular values of the order parameters between the two end states. The effect of the biasing potentials was then removed to give an unbiased probability distribution from which properties, such as the free energy difference, were obtained.

US was supplemented in the 1980s by additional approaches for the determination of free energy differences as a function of a set of order parameters [70–73]. Notable amongst these were free energy perturbation theory and thermodynamic integration, the fundamentals of both of which had been derived many years before by Kirkwood and by Zwanzig. It is worth remarking that both physical and unphysical, or alchemical, free energies were calculated using these techniques with an appropriate choice of order parameters, and that they were employed with QC/MM [55] as well as MM potentials. Since these early applications, a much greater understanding has developed of how these methods need to be applied to obtain reliable results.
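As a concrete example of the perturbation route just mentioned, Zwanzig's formula estimates a free energy difference from samples of one state alone, ΔA = -kT ln <exp(-ΔU/kT)>, with the average taken in the reference state. The toy below applies it to two harmonic potentials, for which the exact answer, (kT/2) ln(k1/k0), is known; it is only a sketch, since the reference state is sampled exactly rather than by Monte Carlo or MD.

    import math
    import random

    k0, k1 = 1.0, 2.0          # force constants of the two harmonic states
    kT = 1.0
    n_samples = 200000

    # Exact Boltzmann sampling of state 0: x is normally distributed with
    # variance kT / k0.
    sigma0 = math.sqrt(kT / k0)
    acc = 0.0
    for _ in range(n_samples):
        x = random.gauss(0.0, sigma0)
        delta_u = 0.5 * (k1 - k0) * x * x      # U1(x) - U0(x)
        acc += math.exp(-delta_u / kT)

    delta_a_fep = -kT * math.log(acc / n_samples)   # Zwanzig estimator
    delta_a_exact = 0.5 * kT * math.log(k1 / k0)

    print(delta_a_fep, delta_a_exact)  # the two numbers should agree closely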


There have also been a host of additional developments by a multitude of workers. These include algorithms, such as the weighted histogram analysis method [74], that optimally use the data generated by simulations with different values of the order parameters, and techniques, such as metadynamics [75], that adaptively search the phase space of a system rather than using pre-imposed strategies. Mention should also be made of the recent class of non-equilibrium techniques that arose from the theoretical advances of Jarzynski and of Crooks in the late 1990s, who showed how the work done during the irreversible transformation between two states could be used to determine their (equilibrium) free energy difference [76].

The determination of the free energy is central to transition state theory and the reactive flux formalism [77,78], which are two of the most important routes for determining the rate constants of chemical reactions and other transitions. An important problem in the application of these theories is the identification of paths by which the transition can occur. In small systems, with few degrees of freedom, a feasible strategy is to: (i) locate the minima on the potential energy surface corresponding to the configurations of interest; (ii) find the first-order saddle points lying between minima; and (iii) trace out the reaction or transition paths that pass through each set of stationary points. For larger systems, this approach becomes more difficult due to the increasing number of stationary points and the fact that many of the most efficient algorithms for locating saddle points require the second coordinate derivatives of the potential energy, which are expensive to calculate and to manipulate.

Elber and co-workers were some of the first to tackle this problem, in the late 1980s and early 1990s, by developing methods that represented a path between two configurations as a trajectory of discrete intermediate structures and then optimized the trajectory as a whole, rather than individual structures in isolation [79,80]. Subsequently, these chain-of-states methods underwent substantial development, leading, notably, to the nudged-elastic-band method of Jónsson and co-workers [81] and the string method of Vanden-Eijnden and co-workers [82]. It is also possible to extend the chain-of-states idea so that ensembles of transition paths between configurations are sampled, instead of just a limited number. This forms the basis of the transition-path-sampling (TPS) methods that were introduced in the late 1990s and early 2000s by Chandler and co-workers [83,84], based on an earlier method by Pratt [85]. TPS methods are computationally expensive, but they allow, in principle, the computation of both the free energies and rate constants of a process. Other algorithms of a similar type include milestoning [86,87], weighted ensemble path sampling [88], and generalizations of the string method [89].

The sampling algorithms outlined in the preceding few paragraphs are but a small selection of the many that have appeared over the last 30 years; there is no space to consider the others in detail. Important areas that have been omitted, but nevertheless deserve mention, include global optimization, typified by simulated annealing, which appeared at the beginning of the 1980s [90]; approaches for examining the dynamics on and between electronic excited state surfaces; path-integral algorithms; additional enhanced sampling methods that rely on modification of the potential energy surface or on non-standard thermodynamic ensembles; and techniques, such as coarse-graining and elastic network modeling, that simplify the representation of a system, thereby rendering accessible to simulation larger length and time scales.
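To make the chain-of-states idea discussed above concrete, the sketch below applies a basic nudged-elastic-band scheme (the original central-difference tangent and plain steepest-descent steps, so an illustration rather than a production algorithm) to a simple two-dimensional double-well potential whose saddle point lies at the origin.

    import numpy as np

    def potential(r):
        x, y = r
        return (x**2 - 1.0)**2 + y**2          # minima at (+/-1, 0), saddle at (0, 0)

    def gradient(r):
        x, y = r
        return np.array([4.0 * x * (x**2 - 1.0), 2.0 * y])

    n_images, k_spring, step, n_steps = 11, 1.0, 0.02, 2000

    # Straight-line path between the two minima, with the interior images
    # displaced off the minimum energy path so that they have to relax onto it.
    path = np.linspace([-1.0, 0.0], [1.0, 0.0], n_images)
    path[1:-1, 1] += 0.3

    for _ in range(n_steps):
        new_path = path.copy()
        for i in range(1, n_images - 1):
            tangent = path[i + 1] - path[i - 1]
            tangent /= np.linalg.norm(tangent)
            g = gradient(path[i])
            # True force with its component along the path removed ("nudging") ...
            perpendicular_force = -(g - np.dot(g, tangent) * tangent)
            # ... plus a spring force that acts only along the path tangent.
            spring_force = k_spring * (np.linalg.norm(path[i + 1] - path[i])
                                       - np.linalg.norm(path[i] - path[i - 1])) * tangent
            new_path[i] = path[i] + step * (perpendicular_force + spring_force)
        path = new_path

    barrier = max(potential(r) for r in path) - potential(path[0])
    print(round(barrier, 3))   # should approach the exact barrier of 1.0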

Concluding remarks

The last three decades have witnessed a vast increase in computer power that has revolutionized computational science and spurred innovation in molecular simulation and neighboring fields, such as bioinformatics [91] and chemoinformatics [92].

The impact of these developments has been huge, so much so that modeling and simulation are now an essential complement to experiment and theory in the scientific endeavor. Much progress has been made in the computation of the behavior of molecular systems, and many properties can be determined to high accuracy within reasonable time. However, much remains to be done to increase the reliability and precision of molecular simulation techniques, and to extend their range of applicability to problems of current interest.

Acknowledgements

The author would like to thank Alexander MacKerell, Andrew Miranker and Jeremy Smith for helping to jog his memory, and the Commissariat à l'Énergie Atomique et aux Énergies Alternatives for financial support.

References

[1] M.P. Allen, D.J. Tildesley, The Computer Simulation of Liquids, Clarendon Press, Oxford, 1987.
[2] A. Szabo, N.S. Ostlund, Introduction to Advanced Electronic Structure Theory, first revised ed., McGraw-Hill, New York, 1989.
[3] Nobel Foundation, The Nobel Prize in Chemistry 1998, 2015. (Online; accessed 25 January 2015).
[4] Nobel Foundation, The Nobel Prize in Chemistry 2013, 2015. (Online; accessed 25 January 2015).
[5] H.F. Schaefer III, W.H. Miller, Comput. Chem. 1 (1976) 85–90.
[6] R. Fine, G. Dimmler, C. Levinthal, Proteins Struct. Funct. Genet. 11 (1991) 242–253.
[7] B. Borrell, Nature 451 (2008) 240–243.
[8] M.J. Field, in: T. Ebisuzaki, J. Makino (Eds.), New Horizons of Computational Science, Springer, Netherlands, 1997, pp. 133–142.
[9] I. Ohmura, G. Morimoto, Y. Ohno, A. Hasegawa, M. Taiji, Philos. Trans. A 372 (2014) 20130387.
[10] S. Piana, J.L. Klepeis, D.E. Shaw, Curr. Opin. Struct. Biol. 24 (2014) 98–105.
[11] D.E. Shaw, J.P. Grossman, J.A. Bank, B. Batson, J.A. Butts, J.C. Chao, M.M. Deneroff, R.O. Dror, A. Even, C.H. Fenton, A. Forte, J. Gagliardo, G. Gill, B. Greskamp, C.R. Ho, D.J. Ierardi, L. Iserovich, J.S. Kuskin, R.H. Larson, T. Layman, L.-S. Lee, A.K. Lerer, C. Li, D. Killebrew, K.M. Mackenzie, S.Y.-H. Mok, M.A. Moraes, R. Mueller, L.J. Nociolo, J.L. Peticolas, T. Quan, D. Ramot, J.K. Salmon, D.P. Scarpazza, U.B. Schafer, N. Siddique, C.W. Snyder, J. Spengler, P.T.P. Tang, M. Theobald, H. Toma, B. Towles, B. Vitale, S.C. Wang, C. Young, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, IEEE Press, Piscataway, NJ, 2014, pp. 41–53.
[12] G. Lindahl, Real programmers don't use pascal, 2015. (Online; accessed 28 January 2015).
[13] Python Software Foundation, Python, 2015. (Online; accessed 29 January 2015).
[14] NVIDIA Corporation, CUDA, 2015. (Online; accessed 29 January 2015).
[15] MPI Forum, Message passing interface forum, 2015. (Online; accessed 29 January 2015).
[16] Khronos Group, The open standard for parallel programming of heterogeneous systems, 2015. (Online; accessed 29 January 2015).
[17] OpenMP Architecture Review Board, The OpenMP API specification for parallel programming, 2015. (Online; accessed 29 January 2015).
[18] Netlib, BLAS (Basic Linear Algebra Subprograms), 2015. (Online; accessed 29 January 2015).
[19] Netlib, LAPACK – Linear Algebra PACKage, 2015. (Online; accessed 29 January 2015).
[20] Free Software Foundation, GSL – GNU Scientific Library, 2015. (Online; accessed 29 January 2015).
[21] P. Pulay, Chem. Phys. Lett. 73 (1980) 393–398.
[22] P.E.M. Siegbahn, J. Almlöf, A. Heiberg, B.O. Roos, J. Chem. Phys. 74 (1981) 2384–2396.
[23] M.J.S. Dewar, W. Thiel, J. Am. Chem. Soc. 99 (1977) 4899–4907.
[24] M.J.S. Dewar, E.G. Zoebisch, E.F. Healy, J.J.P. Stewart, J. Am. Chem. Soc. 107 (1985) 3902–3909.
[25] P.K. Weiner, P.A. Kollman, J. Comput. Chem. 2 (1981) 287–303.
[26] B.R. Brooks, R.E. Bruccoleri, B.D. Olafson, D.J. States, S. Swaminathan, M. Karplus, J. Comput. Chem. 4 (1983) 187–217.
[27] U.C. Singh, P.A. Kollman, J. Comput. Chem. 5 (1984) 129–145.
[28] P. Hohenberg, W. Kohn, Phys. Rev. 136 (1964) 864–871.
[29] W. Kohn, L.J. Sham, Phys. Rev. 140 (1965) 1133–1138.
[30] C. Corminboeuf, F. Tran, J. Weber, J. Mol. Struct. THEOCHEM 762 (2006) 1–7.
[31] B.I. Dunlap, J.W.D. Connolly, J.R. Sabin, J. Chem. Phys. 71 (1979) 3396–3402.
[32] A.D. Becke, J. Chem. Phys. 88 (1988) 2547–2553.
[33] A.D. Becke, Phys. Rev. A 38 (1988) 3098–3100.


[34] C. Lee, W. Yang, R.G. Parr, Phys. Rev. B 37 (1988) 785–789.
[35] J.P. Perdew, K. Burke, M. Ernzerhof, Phys. Rev. Lett. 77 (1996) 3865–3868.
[36] A.D. Becke, J. Chem. Phys. 98 (1993) 1372–1377.
[37] J.A. Pople, P.M.W. Gill, B.G. Johnson, Chem. Phys. Lett. 199 (1992) 557–560.
[38] W. Koch, M.C. Holthausen, A Chemist's Guide to Density Functional Theory, Wiley-VCH, Weinheim, 2000.
[39] A.D. Becke, J. Chem. Phys. 140 (2014) 18A301.
[40] S. Grimme, WIREs Comput. Mol. Sci. 1 (2011) 211–228.
[41] E. Runge, E.K.U. Gross, Phys. Rev. Lett. 52 (1984) 997–1000.
[42] P.G. Szalay, T. Müller, G. Gidofalvi, H. Lischka, R. Shepard, Chem. Rev. 112 (2012) 108–181.
[43] M. Towler, Quantum Monte Carlo, 2015. (Online; accessed 29 January 2015).
[44] U. Schollwöck, Rev. Mod. Phys. 77 (2005) 259–315.
[45] R. Car, M. Parrinello, Phys. Rev. Lett. 55 (1985) 2471–2474.
[46] C.J. Cramer, D.G. Truhlar, in: K.B. Lipkowitz, D.B. Boyd (Eds.), Reviews in Computational Chemistry, VI, VCH Publishers, New York, 1995, pp. 1–72.
[47] J. Tomasi, M. Persico, Chem. Rev. 94 (1994) 2027–2094.
[48] J. Tomasi, B. Mennucci, R. Cammi, Chem. Rev. 105 (2005) 2999–3094.
[49] K.A. Sharp, B. Honig, Annu. Rev. Biophys. Biophys. Chem. 19 (1990) 301–332.
[50] N.A. Baker, in: K.B. Lipkowitz, R. Larter, T.R. Cundari (Eds.), Reviews in Computational Chemistry, 21, John Wiley & Sons, Hoboken, NJ, 2005, pp. 349–379.
[51] J.M. Antosiewicz, D. Shugar, Mol. Biosyst. 7 (2011) 2923–2949.
[52] W.C. Still, A. Tempczyk, R.C. Hawley, T. Hendrickson, J. Am. Chem. Soc. 112 (1990) 6127–6129.
[53] A. Warshel, M. Levitt, J. Mol. Biol. 103 (1976) 227–249.
[54] U.C. Singh, P.A. Kollman, J. Comput. Chem. 7 (1986) 718–730.
[55] P.A. Bash, M.J. Field, M. Karplus, J. Am. Chem. Soc. 109 (1987) 8092–8094.
[56] M.J. Field, P.A. Bash, M. Karplus, J. Comput. Chem. 11 (1990) 700–733.
[57] M. Liu, Y. Wang, Y. Chen, M.J. Field, J. Gao, Isr. J. Chem. 54 (2014) 1250–1263.
[58] H.G. Petersen, J. Chem. Phys. 103 (1995) 3668–3679.
[59] T. Darden, D. York, L. Pedersen, J. Chem. Phys. 98 (1993) 10089–10092.
[60] U. Essmann, L. Perera, M.L. Berkowitz, T. Darden, H. Lee, L.G. Pedersen, J. Chem. Phys. 103 (1995) 8577–8593.
[61] L. Greengard, V. Rokhlin, J. Comput. Phys. 73 (1987) 325–348.
[62] J. Dongarra, F. Sullivan, Comput. Sci. Eng. 2 (2000) 22–23.
[63] J. VandeVondele, U. Borštnik, J. Hutter, J. Chem. Theory Comput. 8 (2012) 3565–3573.
[64] S. Goedecker, Rev. Mod. Phys. 71 (1999) 1085–1123.
[65] C. Ochsenfeld, J. Kussmann, D.S. Lambrecht, in: K.B. Lipkowitz, T.R. Cundari (Eds.), Reviews in Computational Chemistry, 23, Wiley-VCH, Hoboken, NJ, 2007, pp. 1–82.


[66] R. Zaleśny, M.G. Papadopoulos, P.G. Mezey, J. Leszczynski, Linear Scaling Techniques in Computational Chemistry and Physics: Methods and Applications, Springer, Dordrecht, 2011.
[67] W. Yang, Phys. Rev. Lett. 66 (1991) 1438–1441.
[68] C. Chipot, A. Pohorille, Free Energy Calculations: Theory and Applications in Chemistry and Biology, Springer, Berlin, 2007.
[69] G.M. Torrie, J.P. Valleau, J. Comput. Phys. 23 (1977) 187–199.
[70] W.L. Jorgensen, Acc. Chem. Res. 22 (1989) 184–189.
[71] D.L. Beveridge, F.M. Di Capua, Annu. Rev. Biophys. Biophys. Chem. 18 (1989) 431–492.
[72] T.P. Straatsma, J.A. McCammon, Annu. Rev. Phys. Chem. 43 (1992) 407–435.
[73] P.A. Kollman, Chem. Rev. 93 (1993) 2395–2417.
[74] S. Kumar, J.M. Rosenberg, D. Bouzida, R.H. Swendsen, P.A. Kollman, J. Comput. Chem. 13 (1992) 1011–1021.
[75] A. Laio, M. Parrinello, Proc. Natl. Acad. Sci. 99 (2002) 12562–12566.
[76] G. Hummer, A. Szabo, Proc. Natl. Acad. Sci. 98 (2001) 3658–3661.
[77] D. Chandler, J. Chem. Phys. 68 (1978) 2959–2970.
[78] M.J. Field, in: C. Le Bris (Ed.), Special Volume, Computational Chemistry, Handbook of Numerical Analysis, vol. 10, Elsevier, 2003, pp. 667–697.
[79] R. Elber, M. Karplus, Chem. Phys. Lett. 139 (1987) 375–380.
[80] R. Czermiński, R. Elber, Chem. Phys. Lett. 24 (1990) 167–186.
[81] H. Jónsson, G. Mills, K.W. Jacobsen, in: B.J. Berne, G. Ciccotti, D.F. Coker (Eds.), Classical and Quantum Dynamics in Condensed Phase Simulations, World Scientific, Singapore, 1998, pp. 385–404.
[82] W. E, W. Ren, E. Vanden-Eijnden, Phys. Rev. B 66 (2002) 052301.
[83] C. Dellago, P.G. Bolhuis, D. Chandler, J. Chem. Phys. 108 (1998) 9236–9245.
[84] C. Dellago, P.G. Bolhuis, P.L. Geissler, in: I. Prigogine, S.A. Rice (Eds.), Advances in Chemical Physics, 123, John Wiley & Sons, Hoboken, NJ, 2002, pp. 291–318.
[85] L. Pratt, J. Chem. Phys. 85 (1986) 5045–5048.
[86] P. Májek, R. Elber, J. Chem. Theory Comput. 6 (2010) 1805–1817.
[87] S. Viswanath, S.M. Kreuzer, A.E. Cardenas, R. Elber, J. Chem. Phys. 139 (2013) 174105.
[88] E. Suárez, S. Lettieri, M.C. Zwier, C.A. Stringer, S.R. Subramanian, L.T. Chong, D.M. Zuckerman, J. Chem. Theory Comput. 10 (2014) 2658–2667.
[89] L. Maragliano, B. Roux, E. Vanden-Eijnden, J. Chem. Theory Comput. 10 (2014) 524–533.
[90] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Science 220 (1983) 671–680.
[91] J.B. Hagen, Nat. Rev. Genet. 1 (2000) 231–236.
[92] P. Willett, WIREs Comput. Mol. Sci. 1 (2011) 46–56.
