Force-field coefficient optimization of coarse-grained molecular dynamics models with a small computational budget

Force-field coefficient optimization of coarse-grained molecular dynamics models with a small computational budget

Computational Materials Science 176 (2020) 109518 Contents lists available at ScienceDirect Computational Materials Science journal homepage: www.el...

2MB Sizes 0 Downloads 21 Views

Computational Materials Science 176 (2020) 109518

Contents lists available at ScienceDirect

Computational Materials Science journal homepage: www.elsevier.com/locate/commatsci

Force-field coefficient optimization of coarse-grained molecular dynamics models with a small computational budget

T

M. Razia, , A. Narayana, R.M. Kirbya, D. Bedrovb ⁎

a b

Scientific Computing and Imaging Institute, University of Utah, United States Department of Materials Science and Engineering, University of Utah, United States

ARTICLE INFO

ABSTRACT

Keywords: Force-field coefficient optimization Model-order reduction Molecular dynamics Monatomic water model Multi-fidelity models Surrogate modeling TIP4P/2005 water model

Coarse-grained models in molecular dynamics are low-fidelity models developed to study the properties and behavior of materials. These force-field models are popular due to their simpler implementation and significant computational benefit of their use as compared to complex high-fidelity atomistic models. On the other hand, these advantages often come at the expense of accuracy. One effective way to resolve this issue is tuning their corresponding force-field coefficients such that the estimation error of atomistic properties is minimal with respect to high-fidelity reference data. However, acquisition/computation of these data is often costly, and in most practical cases in materials sciences only a small experimental/computational budget is available to obtain high-fidelity reference data. In this work, a multi-fidelity surrogate modeling strategy for resolving the trade-off between accuracy of a full of set of quantities of interest for an atomistic system and computational expenses is proposed. The proposed procedure takes advantage of the multi-fidelity upper error bound calculated using lowfidelity data by (i) employing an efficient point selection mechanism in the design space and obtaining a set of coarse-grained models, (ii) building stochastic collocation models corresponding to each selected low-fidelity model and obtaining the optimal small group of physical input parameter sets at which a high-fidelity model is evaluated, (iii) constructing a non-intrusive spectral polynomial-based surrogate based on the error of the lowfidelity emulators corresponding to the selected coarse-grained models, and (iv) optimizing force-field coefficients using the resultant surrogate with the objective of error minimization. Moreover, in the context of using a multi-fidelity framework and in order to construct a more accurate emulator, an approach for optimal kernel function selection is implemented in this work. Also, for the purpose of obtaining a set of optimal force-field coefficients using the constructed surrogate model, we used a heuristic population-based optimization approach. For the proof of concept, we applied the proposed approach for the force-field coefficient optimization of the coarse-grained monatomic water (mW) model based on the data obtained from the high-fidelity TIP4P/2005 water model simulations. The outcome of this study is an optimal coarse-grained monatomic water model that along with the TIP4P/2005 water model, provides an accurate emulator for exploration of the properties of water in the designated pressure-temperature input parameter space. Furthermore, this approach provides a more accurate way of sampling important points in the pressure-temperature plane. Finally, the cross-validation results indicate that a certain combination of quantities of interest can produce a more accurate multi-fidelity emulator for the TIP4P/2005 water model.

1. Introduction Despite all the advancements in the development of rigorous molecular and atomistic models for the accurate representation of the atomistic behavior of materials, the exploration of the potential energy landscape in a reasonable computational time remains a major challenge in molecular dynamic simulations. This challenge becomes more pronounced when running hundreds or even thousands of



computationally costly high-fidelity simulations is necessary to capture the atomistic behavior within a large parameter space. Hence, using coarse-grained force-field models, which are less accurate than highfidelity models for the estimation of some quantities of interest such as entropy and diffusion coefficient (as they have been designed to preserve the prediction accuracy for another subset of quantities of interest)s but computationally faster, appears to be an appropriate alternative. Such models have become popular tools for faster exploration

Corresponding author at: Scientific Computing and Imaging Institute, University of Utah, 72 Central Campus Dr, Salt Lake City, UT 84112, United States. E-mail address: [email protected] (M. Razi).

https://doi.org/10.1016/j.commatsci.2020.109518 Received 13 November 2019; Accepted 5 January 2020 0927-0256/ © 2020 Elsevier B.V. All rights reserved.

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

of the energy landscape of materials. Although the significant drawback of these models is their inaccuracy, it is possible to reduce the error induced by tuning their corresponding force-field coefficients. Such an objective can be achieved by optimizing these coefficients such that the overall error of their corresponding coarse-grained model predictions over the designated input parameter space is minimized. However, accomplishing this task demands the exploration of the parameter space, which is a computationally intensive task even when a coarse-grained model is used. In order to address this issue, researchers have resorted to surrogate model construction from a set of coarsegrained simulation results. Different methods for surrogate modeling described in the literature include configuration-sampling-based surrogate (multi-state Bennett acceptance ratio (MBAR) and pair correlation function rescaling (PCFR)) [1], polynomial chaos expansion or non-intrusive spectral projection [2–5], transitional Markov chain Monte Carlo in combination with the empirical interpolation method [6], a combination of Bayesian inference with polynomial chaos expansion [7], cost-efficient kriging [8], and the approximate Bayesian computational algorithm [9]. Most of these recent research studies have been implemented successfully for coarse graining or tuning coarsegrained model force-field coefficients. However, all these approaches demand a substantial volume of high-fidelity/experimental data for the construction of an accurate surrogate model. Furthermore, they work only when the discrepancies between high-fidelity and low-fidelity predictions stay within reasonable bounds. Also, some of these approaches (for example, non-intrusive spectral projection methods and Gaussian process machine learning) are built on assumptions such as response surface smoothness. Gaussian process machine learning approaches can suffer from the computational infeasibility of kernel matrix inversion for large amounts of data and an often arbitrary process of choosing the kernel function based on the prior belief/knowledge on the input parameters’ (such as temperature and pressure) correlations, which might not be realistic/feasible for molecular systems. Moreover, any deviation of the underlying physics from this type of assumption as well as the limited availability of data can lead to a poor force-field coefficient estimation. In the present work, a stochastic collocation methodology with multi-fidelity [10,11] is applied in order to address the practical situation in which the availability of high-fidelity data is limited due to insufficient computational budget. This surrogate modeling approach uses the low-fidelity data for important sampling of high-fidelity data and does not require any a priori assumption about the probability measure of the underlying physics. The reasoning behind the usage of low-fidelity data in the diagnostic context of important sampling points as well as calculation of the error metric for the corresponding multifidelity emulator, without performing even one-high fidelity model evaluation, stems from the existence of an expression for the upper bound of the multi-fidelity emulator error, which is a function of the low-fidelity stochastic collocation model error. However, from our numerical experiments, it appears that when this approach is used, the discrepancies between low- and high-fidelity models can be large. Hence, this upper bound for the emulator’s error does not dictate any requirement for the quality of the low-fidelity model. This multi-fidelity stochastic collocation methodology has been successfully applied to different problems in science and engineering, including frequencymodulated trigonometric functions [10], heat-driven cavity flows [12], irradiated particle-laden turbulence [13], acoustic horn problems [11], molecular dynamics simulation [14], parametric study of NACA airfoils [15], solution of the discrete-space evolution probability equation [16], and plasmonic nano-particle arrays efficiency [16]. Hence, such an approach can be used both for exploration of coarse-grained models as well as accurate estimation of materials properties using a few highfidelity samples and by leveraging the information obtained from coarse-grained simulation data. On the other hand, the computational feasibility of this approach depends on the number of low-fidelity simulations as well. The cost of low-fidelity simulations should not

outweigh the cost of the exploration of the whole input parameter space. Therefore, a problem-dependent computational feasibility limit with respect to the number of low-fidelity simulations should be considered prior to the application of the proposed multi-fidelity force-field coefficient optimization approach. To provide a proof of concept, we apply the proposed approach to optimize coarse-grained model force-field coefficients and to construct high-fidelity molecular dynamics emulator for the purpose of studying bulk water properties using low-fidelity, coarse-grained mW and highfidelity, atomistic TIP4P/2005 water models, respectively. Consequently, an optimal set of force-field coefficients for the mW model is obtained using only with a few high-fidelity model evaluations. These optimal coefficients are determined through a procedure that greedily reduces the error of the corresponding multi-fidelity emulator. The final outcome of this procedure is a high-fidelity emulator that can accurately predict the atomistic behavior of water (with respect to the TIP4P/2005 model) within the designated pressuretemperature space while using only a few high-fidelity samples. Furthermore, cross validation results of the models provide insight about the force-field coefficient bounds for admissible mW models in the framework of the stochastic collocation methodology with multifidelity. These results as well as those in our previous study [14] can pave the way for the construction of a high-fidelity emulator by leveraging the low-fidelity data. This paper is organized as follows. Section 2 includes a brief description of a detailed theoretical foundation of the proposed multifidelity force-field coefficient optimization approach. Section 3 presents the results and discussion for the canonical test problems on force-field coefficient optimization of the mW model, both of which are presented in the subsequent section. Finally, Section 4 contains concluding remarks. 2. Multi-fidelity coarse-grained model force-field coefficient optimization approach In science and engineering, several models with differing levels of accuracy are commonly available for modeling a physical phenomenon or behavior. These levels of accuracy can be interpreted as fidelity levels. High-fidelity models are often accurate representations of the underlying physics. However, they are computationally costly, and obtaining even a moderate number of high-fidelity simulation samples for complex systems is often not practical. Hence, in materials sciences, for instance, obtaining a sufficient number of data points from highfidelity models or experiments to construct a metamodel that can accurately represents the behavior of a desired material for a wide range of input parameters, is difficult in practice. On the other hand, low-fidelity models such as coarse-grained models in materials sciences are significantly less accurate for the estimation of some quantities of interest such as entropy and diffusion coefficient, but are being computationally faster. Hence, it is possible to use these models in a multi-fidelity framework such that an optimal set of force-field coefficients for the coarse-grained model is found, and consequently a reasonably accurate metamodel for the emulation of material properties variations within a designated input parameter space is obtained. Note that our objective in this study is not to find the coarse-grained model that provides the most accurate numerical predictions using high-fidelity model or experiments. Instead we want to find a coarse-grained metamodel that correctly reflects all desired quantitative features of the molecular system and that can be “corrected” or “anchored” to the true values using a limited amount of highfidelity samples. In the following sections, our proposed procedure for the construction of this multi-fidelity emulator as well as its application for the coarse-grained model force-field coefficient optimization are discussed. 2

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

2.1. Naive multi-fidelity model construction

factorization of the GL matrix. Finally, after completing the process of high-fidelity data acquisition, our desired quantities can be estimated using the following series:

Consider two models with different levels of accuracy: lowm ) and high-fidelity ( g : p m ) models. In this context, ( gL : p H a common model input parameter space is assumed for both models, and p and m denote the dimension of both models’ input parameter space and low- and high-fidelity output space, respectively. When it comes to molecular dynamics, pressure, temperature, and the number of molecules in the simulation box are among the common choices of input parameters. Here, given a set of common input parameters , we also make no explicit assumption about the physical meaning or interpretation of gL ( ) versus gH ( ) . In the other words, the only requirement for the low-fidelity model to be applied in this multi-fidelity framework is to have similar parametric variations in the common p ; not the physical space). input parameter space (D When choosing the low-fidelity model according to these assumptions, a common set of quantities of interest for both models is selected as the first step. Next, the bounds of the subspace D are chosen, followed by uniform sampling of the set of input parameters in this space and running the low-fidelity model at those selected N data points. Vector gL ( i) , for which i denotes the ith set of input parameters (at ith sampling point), can then be constructed based on the concatenation of the calculated quantities of interest from the low-fidelity simulations at the ith data point, where i = 1, …, N . The mapping between our desired low-rank feature space and the current input parameter space can be naively described as a linear mapping formulated by the inner product of gL ( ) vectors. Consequently, the Gramian matrix, which is constructed as a result of this mapping, can be written as

n

gH (x )

fobj = G Linear

c1 (x ) =

(GL)in, i1

(GL )in, in

cn (x )

gL (x ), gL (

i1)

gL (x ), gL (

in)

GL

+ F

×

GL GL

2 F

,

(5)

where GLinear and GL denote the Gramian matrices constructed using linear kernel function and any other desired kernel function, for which we are looking to optimized its hyperparameters, respectively. In Eq. (5), subscript F represents Frobenius norm. Also, the value of in Eq. (5) is set to be 0.1 for limiting the resultant Gramian matrix condition number and thus preserving the numerical stability of the multi-fidelity surrogate modeling procedure. In this work, in order to minimize fobj, we apply a heuristic optimization algorithm called the particle swarm algorithm [21,22] with only low-fidelity data. In the next step, all kernel functions listed in Table 1 with optimized hyperparameters are used to build low-fidelity emulators (stochastic collocation models) with low-fidelity simulation data alone by replacing gH with gL in Eq. (4). It Table 1, h (i) denotes the ith hyperparameter of the corresponding kernel function. Next, the resultant metamodels are tested on all low-fidelity data points to determine their overall accuracy. Consequently, the best function is selected based on this accuracy metric and then applied to 1) identifying appropriate sets of input parameters to run the high-fidelity simulation model on, and 2) constructing a multi-fidelity emulator by the fusion of low- and highfidelity data as formulated in Eqs. (3) and (4). To use Eq. (3) in this framework, the terms with inner product must be replaced within a kernel function.

(2)

(GL )i1, in

(4)

Numerical experiments indicate that the naive multi-fidelity approach will fail if the resultant Gramian matrix becomes ill-conditioned [16,20]. One approach to address this issue is to engineer new features out of the available features [16]. However, this approach is ad hoc and very problem specific. An alternative approach, which can be automated, is to use a nonlinear kernel function instead of a linear kernel function. Not only does this approach reduce the Gramian matrix condition number significantly, but it also may increase the accuracy of the resultant multi-fidelity emulator by providing a nonlinear mapping between the desired low-rank feature space and current input parameter space. The ill-conditioned Gramian matrix situation occurs in the presence of a relatively small-sized gL vector and particularly when the number and size of quantities of interest is small relative to the necessary number of samples for obtaining a certain level of accuracy. This is often not the case for problems in materials sciences and molecular dynamics since in addition to thermodynamic observables such as density, energy, and pressure, there is a wide range of molecular correlations, such as radial distribution functions, mean, squared displacement, etc., which can be concatenated into the vector gL . However, in order to take advantage of the potential added benefit of nonlinear kernel functions for increasing the resultant emulator accuracy, we chose to apply the optimal kernel selection method introduced in [20] for this work. The process of optimal kernel function selection demands the optimization of hyperparameters associated with a list of nonlinear kernel functions, including those listed in Table 1. Relying on the accuracy of the multi-fidelity emulator constructed using the linear kernel function (see for examples see [10,11,14]) and with the goal of avoiding overfitting and increasing the numerical stability of the resultant emulator, the following objective function is used for this hyperparameter optimization:

where matrices L , U , and R are upper and lower triangular factorization matrices and permutation matrices, respectively, that reorder rows of the Gramian matrix. An algorithm for this procedure is provided in [20]. After obtaining the permutation matrix C, the first n indices including i1, …, in are selected accordingly, and the high-fidelity simulation data are obtained at these sampling points. Next, the coefficients of the multi-fidelity emulator ({ck (x )}nk = 1) for estimation of the quantities of interest at a input parameter space data point, x, can be obtained from the following least-squares projection based upon the low-fidelity model data alone:

(GL)i1, i1

ik ) ck (x ).

2.2. Multi-fidelity model construction with optimal kernel function selection

Using this N × N symmetric matrix constructed using low-fidelity data N ) the best sampling points for the only, one can obtain the n (n evaluation of the high-fidelity model and consequently construction of an optimal surrogate model. For this purpose, different approaches including orthogonal-triangular (QR) decomposition [10], the Cholesky decomposition [11], and the LU factorization [17]; statistical strategies such as leverage-score sampling methods [18]; and sparsity-promoting group-matching methods [19] have been introduced in the literature. Among these approaches, LU factorization is used in this work for finding the column pivots of the Gramian matrix GL and consequently indices of the important sampling points. We selected this approach because it is available in most common numerical libraries/computational environments, and it is simple to understanding and implement. In this approach, the permutation matrix C is obtained:

RGL C = LU,

gH ( k=1

(1)

(GL)l, j = gL ( l), gL ( j) l, j = 1, …, N .

gH (x ) =

. (3)

After obtaining the coefficients cj s ( j = 1, 2, ···,n ) from solving Eq. (3), we can start performing the high-fidelity simulation evaluations for the first time in this multi-fidelity procedure. These high-fidelity evaluations are performed at the previously designated sampling points by LU 3

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

Table 1 Kernel Function Library. No. 1

2

Function Name

Corresponding Gramian Matrix

Linear

(GL)i, j = gL ( i), gL ( j)

Exponential

3

Squared Exponential

4

Rational Quadratic

5

Matern 3/2

(GL )i,j = h (2)2

1+

(GL )i, j = h (2)2 × exp

Compact Radial Basis

gL ( i)

gL ( j), gL ( i) 2h (1)2

2

uLF

uLF

gL ( j), gL ( i) (2(h (1)2) × h (3))

gL ( i)

3

gL ( j), gL ( i) h (1) g L ( i)

1+

5

5

gL ( i)

2,

gL ( j), gL ( i) h (1)

ik ) ck (x ),

k=1

gL ( i)

gL ( j), gL ( i)

cj j (X ),

+

5 g L ( i)

gL ( j), gL ( i) h (1)

gL ( j), gL ( i) 3(h (1)2)

gL ( j)

gL ( j)

gL ( j)

h (3)

× exp

gL ( i)

gL ( j), gL ( i) 2h (1)2

gL ( j)

space, for which d

=

q

+1

k+1 , (9)

q=1

the orthonormal polynomial bases ({ can be chosen to be the Legendre polynomials defined in D = [ 1, 1]d with µ being a uniform probability measure. Hence, in Eq. (8), j is the jth member of a L2 -orthonormal polynomial basis set, V, such that M j } j = 1)

(6)

D

i (x ) j (x ) µ (x )

V = span{ 1, …,

=

i, j,

M },

(10)

where is the Kronecker delta function, and µ is the continuous probability measure corresponding to the orthonormal polynomial basis that is replaced with a discrete measure for computational purposes. One can easily map the independent force-field coefficients to D using an affine transformation. Moreover, Eq. (8) is a linear system of equations, and coefficients cj can be determined using the least-squares method. Hence, without the requirement of running the low-fidelity model with a large number of force-field coefficient sets and construction of a separate low-fidelity stochastic collocation model for each, one can estimate the accuracy metric of the emulator for a desired set of force-field coefficients by building the spectral metamodel around only a few low-fidelity stochastic collocation models.

(7)

2.3. Model selection approach Any change in the values of the force-field coefficient set for a coarse-grained model may result in a significant change in the model’s responses. The sampling process for obtaining these sets of coefficients affects the resultant metamodel discussed in the previous section. It is well-known that reducing the condition number of the corresponding Vandermonde matrix (V) results in a more accurate surrogate model. For the purpose of sampling point selection in this context, we follow a three step procedure. First, using a numerical root-finding algorithm and inverse transform sampling [23], a set of random samples is drawn from the induced orthogonal polynomial measure µ . This sampling strategy which comprehensively has been described in [23] has been shown to be optimal with regard to both numerical stability and accuracy with a minimal number of samples [24]. The result is a function

M j=1

gL ( j)

gL ( i)

gL ( j )

h (1)

in the similar manner as discussed in Section 2.1. Hence, the assumption of proximity of the error variation for both low- and multi-fidelity stochastic collocation model constructs appears to be reasonable. Considering the limited availability of high-fidelity data, this assumption is made; otherwise construction of multi-fidelity emulators and calculation of their corresponding error with respect to high-fidelity data at all test points is a more precise way of quantifying the emulators’ accuracy. In this research work, a metric is needed for quantification of each emulator’s accuracy. This metric can be defined as the mean of the emulator’s error for estimation of quantities of interest. In order to apply such a metric for a wide range of low-fidelity models in a continuous design space, we propose to construct a polynomial surrogate model in the space of force-field coefficients of the coarse-grained model in order to evaluate this metric for any desired set of force-field coefficients that determines the low-fidelity model. Here, we chose to use a non-intrusive spectral approximation of the error metric. In this context, given a function f : D with the expansion

f (X )

3

× exp

h (1)

n

uLF (

gL ( j)

gL ( j), gL ( i)

where subscripts MF, HF and LF denote multi-, high- and low-fidelity data, and uLF is the output of the low-fidelity stochastic collocation model for which

uLF (x ) =

gL ( j) ) ( h (3))

(1 + gL ( i)

Having found the optimal kernel function for each set of low-fidelity data corresponding to each sampled coarse-grained model and also taking the desired number of high-fidelity model evaluations into account, a set of corresponding low-fidelity stochastic collocation models based on low-fidelity data alone is constructed. This is done by the notion of the upper error bound for the multi-fidelity model prediction error. Assuming the output function to be denoted by u, one can show that [10]

uHF

gL ( j)

(GL )i, j = max 0, 1

uMF

gL ( j)

(GL)i,j =

Matern 5/2

7

gL ( j), gL ( i) h (1)

(GL )i,j = h (2)2 exp

h (2) 2 6

g L ( i)

(GL )i,j = h (2)2 exp

(8)

we are interested in computing or approximating the vector c (containing entries cj ) using evaluations (measurements) of f. Considering M is the size of the polynomial index set ( ) in the hyperbolic cross index 4

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

Fig. 1. Schematic view of the proposed approach of force-field coefficient optimization for coarse-grained molecular dynamics models with an illustration of its procedural steps.

of the type of polynomial index set and dimension and degree of the spectral polynomial approximation model. In the next step, after generation of a large number of samples in the force-field coefficient space, the well-known orthogonal-triangular (QR) decomposition is used to obtain M optimal samples such that the

condition number of the corresponding Vandermonde matrix becomes reasonably small in a greedy procedure. Here M is equivalent to the size of the corresponding polynomial index set . We use the QR approach in this step because it is easy to implement and computationally faster than other standard subset selection approaches. 5

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

The final step of this procedure involves adding more optimal samples on top of the previously found M samples. The M samples are often not sufficient to guarantee the accuracy and numerical stability of least squares solution to obtain vector c from equation Vc = U , where U is the vector of low-fidelity stochastic collocation model error metric values for sampled coarse-grained models. In this step, our goal is to 1 maximize the determinant of matrix G (where G is L (V T × V ) ) in a greedy fashion. For this greedy process, the following objective function is minimized:

1 (vG 1vT ), Kv

F=

the objective function (Eq. (14)) further in the direction of incrementation of all design variables from their old to new values. Then, this greedy algorithm is repeated again with the newly found values of the design variables until the calculated minimum is equal to the previous step minimum value within a specified tolerance (here 10 3 ). The resultant set of force-field coefficients minimizes Fobj so that the error metric predicted using the metamodel is reduced considerably. The resultant optimal coarse-grained model then is constructed using this optimal set of force-field coefficients. This model is expected to be optimal for multi-fidelity emulator construction as well. For the sake of brevity, the whole proposed procedure for force-field coefficient optimization of coarse-grained models is illustrated in Fig. 1 and outlined as Algorithm 1.

(11)

where

Kv =

M M

,

Algorithm 1 Force-field coefficient optimization of coarse-grained models

vi2

1. Set the degree of the surrogate polynomial model (k) for low-fidelity stochastic collocation model error (Section 2.3) 2. Obtain the discretized force-field coefficient space (Section 2.4) 3. Evaluate the selected low-fidelity (coarse-grained) models on a discretized input parameter space

(12)

i=1

and L and v denote the number of samples and the candidate vector for concatenation to the previously determined Vandermonde matrix with L 1 samples. The outcome of this process is a set of samples for a given µ , k and d such that the corresponding Vandermonde matrix (V) condition number becomes reasonably small.

4. Construct low-fidelity stochastic collocation models: gL (x ) =

5. Determine the best set of input parameters using LU factorization; decide by averaged error of quantities of interest 6. Calculate the error metric for low-fidelity stochastic models; Eq. (13) 7. Construct the polynomial surrogate for the error metric (Section 2.3) 8. Minimize the error metric within the optimization bounds (Section 2.5) 9. Do a few designated high-fidelity model evaluations at important sampling points that have been previously selected in Step 5 of this algorithm 10. Construct corresponding multi-fidelity emulators corresponding to each newly

2.4. Force-filed coefficients optimization approach With the determination of Vandermonde matrix with a reasonably small condition number, sets of model force-field coefficients that can produce different coarse-grained models are obtained. Next, by running these models at designated points in the input parameter space, corresponding low-fidelity stochastic collocation models ( gL ) can be constructed according to the details discussed in Sections 2.1 and 2.2. The corresponding error metric for each low-fidelity emulator then is computed as

error =

median

i = 1, … , p

l = 1, … , N

gL(i) (xl ) gL(i )

g L(i) (xl ) (xl )

,

found locally optimal model: gH (x )

0

n k = 1 gH ( ik ) ck (x )

3. Benchmark problem: force-field coefficient optimization of the mW model for a multi-fidelity emulator After almost five decades of research, many models have been developed for studying the thermodynamic, structural, and dynamical properties of water across different phases. Two major categories among them are atomistic and coarse-grained water models. Both groups of models have been developed based on analytical potential energy functions that aim at correctly representing many-body effects in water. The first group of these models includes well-known rigid nonpolarizable models such as TIP3P [26], SPC [27], SPCE [28], TIP4PEw [29], and TIP4P/2005 [30], and TIP5P [31], which use long-ranged interactions (electrostatics) to produce short-ranged tetrahedral structure (hydrogen bonds). The second set includes models with a decreased number of degrees of freedom and often less accuracy for a full set of quantities of interest for an atomistic system but a higher computational speed. The short water SPC/E model [32], the monatomic water (mW) model [33], and the Mercedes-Benz model [34] are examples of coarse-grained models of water. These models have extended molecular simulations to explore larger length and time scales with their faster computational speed. However, in order to properly achieve this objective, their corresponding accuracy needs to be enhanced. One way to address this issue is the “top-down” approach to choose the force-field coefficients of a coarse-grained model of water. The proposed approach in this study can be used toward the goal of accuracy enhancement of coarse-grained models. In this study, for the sake of the proof of concept, we use the TIP4P/2005 and mW models as high- and low-fidelity models, respectively. As a result, an optimal set of force-field coefficients for the mW model is obtained such that its corresponding multi-fidelity emulator provides an accurate prediction of quantities of interest. In other words, we are looking for an mW model with the accuracy that can be enhanced to an optimal level using the data from a small number of TIP4P/2005 water model simulations.

(13)

(14)

c 1,

gH (x ) =

11. Find the best model by comparing the calculated error metric for the corresponding multi-fidelity emulator (Sections 2.1 and 2.2)

for p quantities of interest, which contribute to the construction of the resultant multi-fidelity models. Here, represents the statistical expectation or mean. These error measures then are concatenated into vector U. The vector c , the metamodel for the estimation of the lowfidelity stochastic collocation model error metric, is then obtained via the least squares solution of Vc = U . The resultant surrogate model then can be used for finding a set of coarse-grained model force-field coefficients, which, in combination with a few high-fidelity model evaluations, produces a multi-fidelity emulator with the minimum value of the error metric defined in Eq. (13). For this purpose, we define the following objective function:

Fobj = Vc +

n k = 1 gL ( ik ) ck (x )

where 0 is set to 10 and c 1 is the 1 norm of the spectral metamodel vector of coefficients. The term 0 c 1 is added to the objective function (Fobj) as a measure of the model’s complexity for the purpose of smoothing the objective function variation over the coarse-grained force-field coefficient space. To minimize this objective function and consequently the error metric, we use the cyclic coordinate optimization approach [25]. In this optimization method, the cost of the computation scales with the dimension of the force-field coefficient space. Here, the force-field coefficients, which uniquely specify the coarse-grained model, are the design variables for this optimization problem. The cyclic coordinate optimization that has been applied in this research work uses one-dimensional particle-swarm optimization for each dimension (with all other variables being held fixed in each step) to find the minimum value of Fobj for all design variables in a step-by-step process. After exploring all dimensions one by one, a pattern search strategy is used to minimize 2

6

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

In the sections that follow, first both models are briefly discussed and then the results of the application of the proposed multi-fidelity forcefield coefficient optimization approach are studied.

particle-mesh solver (PPPM) method [36] is used. To validate the applied computational models in the LAMMPS environment [37], the water molecular interactions were simulated using both mW and TIP4P/2005 models at T = 298 K and P = 1 atm for 10000000 timesteps (with t = 1 fs) in the simulation boxes of size 49.6 by 49.6 by 49.6 Å3 with 4096 molecules and 62 by 62 by 62 Å3 with 8000 molecules for mW and TIP4P/2005 models, respectively. Using the simulation with the mW model, we obtained the values of 0.99786 g/cm3 and 10.672 kcal/mol for density and enthalpy of vaporization, respectively, which agree very well with the data reported in the work of Molinero and Moore [33]. Using the TIP4P model validation simulation, we obtained the results for potential energy, enthalpy of vaporization, and density as −11.401 kcal/mol, 11.9936 kcal/mol, and 0.9974 g/cm3, respectively. These results are also in a close agreement with the data reported by Orsi [38]. After the model validation step, and through the adaptation of hyperbolic polynomial space on the five-dimensional mW model forcefield coefficient space, 43 sets of points in the designated space were chosen using the approach discussed in Section 2.3. In the context of fitting a third-degree polynomial defined with the hyperbolic cross index space, one needs to find the values for 26 coefficients. Hence, for the construction of a well-posed system of equations for this purpose, selection of 26 sets of force-field coefficients appears to be sufficient. However, in order to avoid overfitting and enhance the accuracy of the resultant surrogate, we chose to include a larger number of sets of forcefield coefficients in the process of the construction of a polynomial surrogate for the defined error metric. Moreover, this limit for the number of low-fidelity models to be tested was set to 43 based on our available computational recourses and budget. Here, as a result, the condition number of the corresponding Vandermonde matrix becomes 5.0211, which is reasonably low. The bounds of the designated hyperrectangular force-field coefficient space for mW models are set to be at ± 20 percent of the standard force-field coefficient values mentioned in Section 3.1, as an optimal set of values for these coefficients has been found by Jacobson et al. [3] in this region using the reference experimental data. For each selected mW model, simulations are performed at 42 points in the P-T (pressure–temperature) plane with 273 T 373K and 1 P 300 atm. The criterion for choosing this number of points in the pressure–temperature space is based on our available computational budget and the absence of phase transition or occurrence of any expected sudden change in the molecular properties of water within the designated region, which makes a fine discretization of the input parameter space unnecessary. These sets of 42 simulations have been run in the LAMMPS environment [37] for the TIP4P/2005 model (once for the standard model and for error estimation purpose) and mW models (for each of the selected 43 models) in a 62 by 62 by 62 Å3 simulation box with periodic boundary conditions and 8000 and 24000 atoms for mW and TIP4P/2005 models, respectively. Each simulation was performed for 2500000 time-steps with t = 2 fs. After obtaining the low-fidelity data, the construction of the emulator starts with the construction of its building block, vector gL . Here, the corresponding vector for the ith sample point becomes

3.1. Coarse-grained monatomic water model The monatomic water (mW) model is a single particle model with anisotropic interactions. The potential energy in this model explicitly includes three-body interactions. Using the Stillinger-Weber potential function [35], this coarse-grained model is parameterized with five main force-field coefficients: , , a , , and . Hence, the potential energy for a system of atoms in this case can be written as

U=

2 (rij ) i

+

j>i

3 (rij, ril, i

ijl ),

(15)

j i l >j

where 4 2 (rij ) = A

B

1 exp

rij

rij

,

a

(16)

and 3 (rij, ril,

ijl )

=

[cos

cos 0 ]2 exp

ijl

rij

exp

a

ril

.

a

(17)

In this model, 2 and 3 are two- and three-body interaction terms, respectively. In Eqs. (16) and (17), rij and ij denote the intermolecular distance and angle between ith and jth water molecules, respectively. Also, 0 is the initial and preferred tetrahedral angle. Moreover, in this model, coefficients A, B, 0 are held constant at values 7.049556277, 0.6022245584 and 109.47°, respectively. The values of the design variables for the standard mW model have been defined as = 6.189 kcal/mol, = 2.3925 Å, a = 1.8, = 23.15, and = 1.2 [3]. 3.2. TIP4P/2005 water model The TIP4P models are four interaction site water models. In these force-field models, four charges are placed; one on oxygen, two on two hydrogen atoms, and an extra charge on a dummy atom. The potential energy formulation in this model can be outlined as the following equation [9]: 12

U=

4 i

j

TP

6

TP

TP

rij

rij

qi rij

+ i

j

qj

,

(18)

where TP , TP , and q denote the distance at which the inter-particle potential is zero, the value of the minimum energy, and the partial charges of each molecule, respectively. Here, a center of force is located on the oxygen atom. TIP4P is the extension of the TIP3P model with the addition of a negative charge on the oxygen atom. This model originally was designed to reproduce experimental data for enthalpy of vaporization and density of water. The TIP4P/2005 variant of this model has been optimized further to better reproduce the phase diagram of water. 3.3. Model construction and cross-validation results

gL, i =

The objective of the implementation of the proposed multi-fidelity force-field coefficient optimization approach for the mW model is to obtain a set of optimal force-field coefficient values that minimizes the error of mW model compared to the reference data. Here quantities of interest are various molecular and thermodynamic properties extracted from TIP4P/2005 and mW simulations. For the case of TIP4P/2005 model, the distance from the oxygen atom to massless charges and the cutoff radius are considered to be 0.1546 and 13.0 Angstrom, respectively. Moreover, the long-range Van der Waals tail “correction” to the energy and pressure in each time-step is considered. Also, in order to compute the long-range Coulomb interactions, the particle–particle

g (r ) i NR g (r )

,

MSDi NM MSD

,

Hi NH H

,

Ei NE E

,

i

N

,

Di ND D

T

, (19)

where NR , NM , NH , NE , N , and ND denote the number of elements in the corresponding vector of RDF ( g (r ) ), MSD (for 2500 time-steps of 2 fs), enthalpy of evaporation (H), total energy (E), density ( ), and selfdiffusion coefficient (D), respectively. Also, in order to have equal impact on the estimation made by the resultant multi-fidelity model, the contribution of all elements for each quantity, normalization coefficients g (r ), MSD, H , E , , and D for I sample points are defined as

7

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

1

g (r ) = I

1

MSD= I

1

H =I

1

E =I

1

=I

1

D =I

g (r ) i

,

i=1

As the next step, 10 sample points, which are deemed important by LU factorization approach, in the P T parameter space are identified for each of the 43 mW models. Afterwards, the error metric from Eq. (13) is obtained for each of the 43 constructed low-fidelity emulators (stochastic collocation models) with a fixed permutation of 10 important samples. The permutation with the lowest error across all emulators is then selected. With the selected permutation of sampling points, 43 emulators are constructed. Next, the vector U is populated with the calculated errors. The solution of Vc = U provides a thirdorder five-dimensional polynomial surrogate for the low-fidelity stochastic collocation model error. This polynomial surrogate with the hyperbolic cross index set (M = 26) is then used for the exploration of the mW force-field coefficient space (using the cyclic coordinate optimization approach [25]) to find the optimal set of force-field coefficients that minimizes the error (error ). As a global minima cannot be determined solely by the applied surrogate model-based approach, a set of local optima is obtained. The corresponding mW models are then used to construct the multi-fidelity emulator. Comparison of error metric value for these emulators indicates that the optimal = 5.90962, = mW model force-field coefficients are 2.39455, a = 1.99590, = 20.16154 , and =1.27683 (see Fig. 4a). The corresponding multi-fidelity emulator error for this model provides estimation of the quantities of interest with the same level of accuracy (about 9 × 10 3) as the emulators constructed based upon mW models, which are found to give the most accurate predictions (when compared to TIP4P data; see Fig. 4b). Moreover, as shown in Fig. 4b, without even using the proposed multi-fidelity procedure, this optimal model is found to give comparable estimation of the quantities of interest to that of those mW models with the highest prediction accuracy. It should be noted that in order to find this optimal model, no high-fidelity model evaluation was performed up until the time of choosing the best model among a set of locally optimal mW models, when the resultant multifidelity emulator was constructed. As expected, the accuracy of the optimal emulator increases if the high-fidelity sampling points are chosen based upon its corresponding

2 2

I

MSDi i=1

, 2

2

I

Hi i=1

, 2 2

I

,

Ei i=1

2 2

I i i=1

, 2 2

I

.

Di i=1

3.4. force-field coefficient optimization and optimal multi-fidelity emulator

2

I

2

(20)

Using Eqs. (19) and (20) as well as Table 1, the Gramian matrices for different kernel functions can be constructed, and the whole process of data-driven kernel function optimization (as discussed in Section 2.2) can be fully implemented to obtain the low-fidelity emulator. The error of this emulator is computed for the molecular system’s total energy, enthalpy of vaporization, density, self-diffusion coefficient, RDF, and mean squared distance (MSD) for the oxygen atoms. Cross-validation (leaving one out) results for 43 mW models indicate that construction of multi-fidelity models separately for scalar and vector quantities yields more accurate predictions (compare Fig. 2a and Figs. 2d). Here, the cross-validation error computed when constructed with only 10 important samples. Furthermore, as shown in Fig. 3, when force-field coefficient values are chosen near the mW force-field coefficient bounds, the highest cross-validation error for these models occur. Thus, we consider the region with the bounds to be at least 15 percent away from the previously designated mW force-field coefficient bounds, so that the outcome for the optimal set of model’s force-field coefficients can be trusted.

Fig. 2. Cross-validation error for 43 sampled mW models with different gL s.

8

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

Fig. 3. Cross-validation error for 43 sampled mW models at different mW force-field coefficient planes; the highest cross-validation error occurs at the designated mW force-field coefficient bounds; the sizes of the circles are scaled with the corresponding mW model accuracy: large particles represent models with the highest error and small particles represent models with the smallest error.

Fig. 5. Error of the multi-fidelity emulator constructed with the optimal mW model in comparison to 43 sampled mW models for different quantities of interest with preset and non-preset permutation of samples.

points and non-preset sampling points is found to be two. In Fig. 5, the resultant accuracy enhancement is shown compared to the accuracy spectrum of all 43 designated mW models with preset sampled highfidelity data points for six quantities of interest. Moreover, assuming that we had access to the high-fidelity at all 42 sampling points in the pressure–temperature space, we ran the same procedure to obtain the optimal mW model. As a result an optimal model ( = 6.94062, = 2.13023, a = 1.8, = 25.75229, and = 1.2 ) with 31 percent increase in the accuracy was found. However, the values of the relative error for both optimal models are within the same order of magnitude (compare the relative error of 0.00923 with 0.00637). This proximity of accuracy for different quantities of interest is illustrated in Fig. 6. As can be seen in this figure, the relative error for both optimal models falls at the end of the error spectrum of the tested models for most quantities of interest. Also, as expected, our experiments indicate that the minimization of the multi-fidelity emulator error does not necessarily guarantee the minimization of the low-fidelity model inaccuracy. For instance, in this study, the first optimal mW model that was obtained without a high-fidelity data evaluation was found to be more accurate. However, this enhanced accuracy does not translate into the minimum corresponding multi-fidelity emulator error, as the objective of the proposed multi-fidelity force-field coefficient optimization

Fig. 4. The optimal model found in the proposed force-field coefficient optimization process and its corresponding multi-fidelity emulator error in comparison to (a) local optimal mW models found in the optimization process (size and color represent the resultant multi-fidelity error) and (b) mW models that are found to produce predictions closest to those of the TIP4P/2005 model (size and color represent multi-fidelity error and model relative error with reference to TIP4P/2005 data, respectively).

Gramian matrix. This accuracy enhancement is found to be about 10 percent. For this purpose, performing eight more high-fidelity data evaluations is necessary as only the overlap between preset sampling 9

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

Fig. 6. Median error of the multi-fidelity emulator constructed based upon the optimal mW model for different quantities of interest (represented by solid green diamonds); comparison with sampled 43 mW models (represented by red squares) as well as the optimal mW model obtained with the unlimited availability of highfidelity data (represented by solid dark circles).

procedure is to find an mW model that can be best “corrected” with a small amount of high-fidelity data and not necessarily the best mW model on its own. As shown in Fig. 6, the optimal multi-fidelity emulator (which is found with the limited availability of high-fidelity data) is accurate for even only 10 high-fidelity samples, and its error for each quantity of interest contributing to gL (except for D) lies near the end of the error spectrum of tested emulators in this study. As an average, this optimal emulator is highly accurate. However, as shown in Fig. 6f, this accuracy comes at the cost of a slight inaccuracy of the self diffusion coefficient estimation when compared to the best tested model. This is the price we pay for optimizing the coarse-grained model force-field coefficients with a limited number of high-fidelity model evaluations. However, the resultant emulator is significantly more accurate than the low-fidelity model, and, as illustrated in Fig. 7, its estimation of quantities of interest agrees well with the actual TIP4P/2005 model evaluations, including the accurate prediction of H, for which TIP4P model has been designed to estimate well. Another aspect of this investigation involves studying the impact of each force-field coefficient on the performance of the corresponding multi-fidelity emulator. For this purpose, the first-order Sobol global indices [39] as well as total sensitivity indices associated with the coefficients of the polynomial metamodel for the emulator error metric (Eq. (13)) are calculated. Here, sensitivity analysis is performed by

computing only first-order terms for global sensitivity analysis. Hence, if the error metric can be defined as

error =

ci i , 0 i M

(21)

the first-order global and total sensitivity indices are obtained as follows:

ci2 [

Si = i

D

i

ci2 [

Ti = i

2 i ]

i

2 i ]

D

,

,

(22)

(23)

where

ci2 [

D= 0
2 i ],

(24)

and is a multivariate polynomial and an element of the polynomial span used for the construction of the Vandermonde matrix. Also, discrete sets i and i , for which first-order global (S) and total (T) sensitivity indices are obtained on in Eqs. (23) and (22), are defined as 0 for i} , re{ 0for i , = 0, otherwise} and { spectively. Here, the total contribution of force-field coefficients can be 10

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

Fig. 7. Comparison of the properties predicted for water as obtained using high-fidelity model (TIP4P/2005), low-fidelity model (optimal mW model), and multifidelity approach based on low-fidelity data and 10 samples from the high-fidelity data.

measured by total sensitivity indices, and, as shown in Fig. 8a, is found to have the highest impact on the multi-fidelity defined error metric among all force-field coefficients. On the other hand, no significant effect on the accuracy of the multi-fidelity model accuracy is found as a result of the variation of . Moreover, in this case, the firstorder Sobol indices, which are provided in Fig. 8b, sum to about 0.63, which shows that the contribution of higher order interactions is around 37 percent. Therefore, the first-order force-field parametric

interactions dominate the uncertainty of the emulator’s accuracy. Finally, our experiments with the newly found optimal multi-fidelity emulator for the water molecule indicate that the points at the both extreme ends of the temperature spectrum have the highest probability of being selected for the high-fidelity simulation model evaluation (see Fig. 9a; 50 percent of first 10 points). As depicted in Fig. 9b, this deduction is confirmed with the 60 percent selection rate of these points among the top 10 samples when the optimal emulator with the full

Fig. 8. Global and total sensitivity analysis on mW model force-field coefficients (including (1) , (2) metamodel for the multi-fidelity emulator error. 11

, (3) a, (4)

, and (5) ) using the constructed spectral

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

the operation should be kept below that of 43 TIP4P/2005 model simulations (the cost of running the high-fidelity model on the 42 designated samples in the pressure-temperature space plus the optimization cost). Therefore, in conjunction with the computational costs associated with each part of the proposed multi-fidelity process, the feasible number of low-fidelity model simulations becomes 715. Therefore, the proposed approach is suggested for the scenarios for which the computational cost of the required number of low-fidelity simulations is below this limit. Otherwise, using the high-fidelity alone for force-field coefficient optimization of the mW model is more computationally justifiable. Such a consideration can be taken into account when designing the experiments with regard to the number of samples in the input parameter space as well as the number of candidate lowfidelity models to be tested. In this research work, for the sake of demonstration and proof of concept, we performed 1806 mW simulations. However, using techniques such as 1-minimization [24,40,41], this number can be reduced to fewer than 715. Moreover, if the required number of high-fidelity simulations to cover the pressure–temperature space increases beyond only 42, both the accuracy of the proposed approach and its computational feasibility are enhanced (since the limit for a computationally justifiable number of low-fidelity model evaluations increases accordingly). Consequently, even without utilizing techniques such as 1-minimization, this approach becomes a computationally attractive choice for practitioners in the area of materials sciences because the limit for an allowable number of low-fidelity simulations for gaining computational feasibility increases significantly. 4. Concluding remarks Fig. 9. Importance sampling based on ordered column pivots from the LU factorization of GL ; 15 most critical points in the designated P-T plane for the benchmark problem.

Force-field coefficient optimization of coarse-grained models is one of the most popular approaches in the materials sciences community for developing high-fidelity molecular dynamics models that are computationally efficient and fast. Exploration of the often moderately highdimensional space of force-field coefficient space of high-fidelity models, which is computationally costly, is necessary to achieve this goal. Therefore, different emulator and surrogate model construction approaches, including configuration-sampling-based surrogate, nonintrusive spectral projection, transitional Markov chain Monte Carlo, Bayesian inference, kriging, and the approximate Bayesian computational algorithm, have been successfully applied in the literature. In this work, a novel approach for force-field coefficient optimization is proposed, which (i) uses a novel strategy for sampling low-fidelity data in the input parameter space; (ii) limits the number high-fidelity data evaluations by identifying the most important column pivots through LU decomposition of the Gramian matrix; (iii) employs an optimal kernel function selection procedure to promote numerical stability and as a result enhances the accuracy of resultant multi-fidelity emulator; (iv) applies a non-intrusive spectral surrogate model to evaluate the defined error metric within the force-field coefficient space and significantly reduces the sampling computational load; and (v) takes advantage of the cyclic coordinate optimization approach application on a spectral metamodel of the emulator error to accelerate the exploration of the design variables space and the process of finding a set of forcefield coefficients that minimizes the error. To demonstrate the proposed approach, we applied it to optimize mW model force-field coefficients with respect to the multi-fidelity emulator, which is constructed by the fusion of TIP4P/2005 and mW models’ simulation data. The results indicate (i) the effectiveness of this approach in obtaining the optimal multi-fidelity emulator for the water molecule with limited high-fidelity data (or limited computational budget), (ii) the accuracy of the resultant optimal multi-fidelity emulator for the estimation of molecular and thermodynamic properties, (iii) the possibility of using a certain arrangement of the contributions of the quantities of interest to achieve a higher accuracy, (iv) the insensitivity of the emulator to one of the mW force-field coefficients, and

availability of high-fidelity data is tested. The selection of these points, which are near the locations of the phase transition in the water molecule phase diagram, makes sense from the materials sciences perspective since sharp variations in the atomistic properties of water are expected at their whereabouts. Hence, the higher rate of the selection of these points can be considered as another sanity check for the optimal multi-fidelity emulator. 3.5. Computational performance of the proposed force-field coefficient optimization Computational performance is a major feature of an algorithm and the feasibility of its application. In the performance analysis of the proposed multi-fidelity force-field coefficient optimization method, the computational elapsed times of different pieces of the process should be considered. As an average, the mW simulation model in LAMMPS is computationally 17.8 times faster than the TIP4P/2005 simulation. Moreover, the computational effort for the process of generating samples in both two-dimensional input parameter space and five-dimensional force-field coefficient space as well as building a polynomial surrogate for the defined error metric is negligible even when compared to one mW simulation run. The process of finding the important sampling set takes as an average two times the computational cost of running TIP4P/2005 as one has to compute the averaged error induced by selecting any of 43 sets of samples on all designated 43 low-fidelity stochastic collocation models. Also, the computational cost of the construction of each stochastic collocation model is about 1/30th of the TIP4P/2005 simulation cost. Finally, the optimization process for the multi-fidelity model error reduction costs about half of that of one TIP4P/2005 simulation. Hence, among all parts, running mW model is the dominant factor in the resultant computational cost. Consequently, in order to justify the application of the proposed approach, the cost of 12

Computational Materials Science 176 (2020) 109518

M. Razi, et al.

(v) the significance of atomistic properties and behavior at the onset of phase transition for the water molecule in construction of a high-fidelity surrogate model. Hence, this approach can be used as an alternative to standard force-field coefficients tuning approaches for coarse-grained models because it provides optimal accuracy with a limited computational/experimental budget for high-fidelity data acquisition. However, in order to become a computationally feasible choice, the number of low-fidelity model evaluations used for the proposed approach should not exceed the problem-dependent computational feasibility limit, which is discussed in Section 3.5. Moreover, as long as the models of different fidelity levels collectively capture the variations of quantities of interest in the input parameter space, the proposed approach is expected to produce results with reasonable accuracy. Finally, the important sampling strategy employed in this approach can provide some crucial insight about the underlying physics of molecular dynamics systems.

[9] [10] [11] [12] [13] [14] [15] [16] [17]

Data availability statement

[18]

The authors confirm that the data required to reproduce the findings of this study are available within the article and can be reproduced by molecular dynamic simulations.

[19]

CRediT authorship contribution statement

[20]

M. Razi: Conceptualization, Formal analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing - original draft, Writing - review & editing. A. Narayan: Methodology, Writing - review & editing, Project administration. R.M. Kirby: Writing - review & editing, Project administration, Funding acquisition. D. Bedrov: Methodology, Writing - review & editing, Project administration.

[21] [22] [23] [24] [25]

Acknowledgments

[26]

M. Razi, D. Bedrov, A. Narayan and R. M. Kirby acknowledge that their part of this research was sponsored by ARL under Cooperative Agreement Number W911NF-12–2-0023. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of ARL or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

[27] [28] [29] [30] [31]

References

[32] [33]

[1] R.A. Messerly, S.M. Razavi, M.R. Shirts, Configuration-sampling-based surrogate models for rapid parameterization of non-bonded interactions, J. Chem. Theory Comput. 14 (2018) 3144–3162. [2] F. Rizzi, H.N. Najm, B.J. Debusschere, K. Sargsyan, M. Salloum, H. Adalsteinsson, O.M. Knio, Uncertainty quantification in MD simulations. part i: Forward propagation, Multiscale Model. Simul. 10 (2012) 1428–1459. [3] L.C. Jacobson, R.M. Kirby, V. Molinero, How short is too short for the interactions of a water potential? exploring the parameter space of a coarse-grained water model using uncertainty quantification, J. Phys. Chem. B 118 (2014) 8190–8202. [4] M. Zimoń, R. Sawko, D. Emerson, C. Thompson, Uncertainty quantification at the molecular–continuum model interface, Fluids 2 (2017) 12. [5] H. Meidani, J.B. Hooper, D. Bedrov, R.M. Kirby, Calibration and ranking of coarsegrained models in molecular simulations using Bayesian formalism, Int. J. Uncertainty Quantif. 7 (2017). [6] S. Wu, P. Angelikopoulos, G. Tauriello, C. Papadimitriou, P. Koumoutsakos, Fusing heterogeneous data for the calibration of molecular dynamics force fields using hierarchical Bayesian models, J. Chem. Phys. 145 (2016) 244112 . [7] F. Rizzi, H.N. Najm, B.J. Debusschere, K. Sargsyan, M. Salloum, H. Adalsteinsson, O.M. Knio, Uncertainty quantification in MD simulations. Part ii: Bayesian inference of force-field parameters, Multiscale Model. Simul. 10 (2012) 1460–1492. [8] F. Cailliez, A. Bourasseau, P. Pernot, Calibration of forcefields for molecular

[34] [35] [36] [37] [38] [39] [40] [41]

13

simulation: Sequential design of computer experiments for building cost-efficient kriging metamodels, J. Comput. Chem. 35 (2014) 130–149. R. Dutta, Z.F. Brotzakis, A. Mira, Bayesian calibration of force-fields from experimental data: TIP4P water, J. Chem. Phys. 149 (2018) 154110 . A. Narayan, C. Gittelson, D. Xiu, A stochastic collocation algorithm with multifidelity models, SIAM J. Sci. Comput. 36 (2014) A495–A521. X. Zhu, A. Narayan, D. Xiu, Computational aspects of stochastic collocation with multifidelity models, SIAM/ASA J. Uncertainty Quantif. 2 (2014) 444–463. J. Hampton, H. Fairbanks, A. Narayan, A. Doostan, Parametric/stochastic model reduction: low-rank representation, non-intrusive bi-fidelity approximation, and convergence analysis, 2017. arXiv preprint arXiv:1709.03661. L. Jofre, G. Geraci, H. Fairbanks, A. Doostan, G. Iaccarino, Multi-fidelity uncertainty quantification of irradiated particle-laden turbulence, 2018. arXiv preprint arXiv:1801.06062. M. Razi, A. Narayan, R.M. Kirby, D. Bedrov, Fast predictive models based on multifidelity sampling of properties in molecular dynamics simulations, Comput. Mater. Sci. 152 (2018) 125–133. R. Skinner, A. Doostan, E. Peters, J. Evans, K.E. Jansen, An evaluation of bi-fidelity modeling efficiency on a general family of NACA airfoils, in: 35th AIAA Applied Aerodynamics Conference, p. 3260. M. Razi, R.M. Kirby, A. Narayan, Fast predictive multi-fidelity prediction with models of quantized fidelity levels, J. Comput. Phys. 376 (2019) 992–1008. D. Anderson, M. Gu, An efficient, sparsity-preserving, online algorithm for low-rank approximation, in: International Conference on Machine Learning, pp. 156–165. D.J. Perry, R.M. Kirby, A. Narayan, R.T. Whitaker, Allocation strategies for high fidelity models in the multifidelity regime, SIAM/ASA J. Uncertainty Quantif. 7 (2019) 203–231. A. Lozano, G. Swirszcz, N. Abe, Group orthogonal matching pursuit for logistic regression, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 452–460. M. Razi, R.M. Kirby, A. Narayan, An optimal kernel function selection approach for low-rank multi-fidelity approximation, Int. J. Uncertainty Quantif. (under review) (2019). J. Kennedy, Particle swarm optimization, Encyclopedia of machine learning, Springer, 2011, pp. 760–766. M. Clerc, Particle Swarm Optimization vol. 93, John Wiley & Sons, 2010. A. Narayan, Computation of induced orthogonal polynomial distributions, 2017, arXiv preprint arXiv:1704.08465. J.D. Jakeman, A. Narayan, T. Zhou, A generalized sampling and preconditioning scheme for sparse approximation of polynomial chaos expansions, SIAM J. Sci. Comput. 39 (2017) A1114–A1144. M. Razi, R. Wang, Y. He, R.M. Kirby, L. Dal Negro, Optimization of large-scale Vogel spiral arrays of plasmonic nanoparticles, Plasmonics 14 (2019) 253–261. W.L. Jorgensen, J. Chandrasekhar, J.D. Madura, R.W. Impey, M.L. Klein, Comparison of simple potential functions for simulating liquid water, J. Chem. Phys. 79 (1983) 926–935. H. Berendsen, J. Postma, W. Van Gunsteren, A.J. Hermans, Intermolecular forces, in: Pullman, B. (Ed.), Reidel Publishing Company, Dordrecht, 1981, pp. 331–342. H. Berendsen, J. Grigera, T. Straatsma, The missing term in effective pair potentials, J. Phys. Chem. 91 (1987) 6269–6271. H.W. Horn, W.C. Swope, J.W. Pitera, J.D. Madura, T.J. Dick, G.L. Hura, T. HeadGordon, Development of an improved four-site water model for biomolecular simulations: TIP4P-Ew, J. Chem. Phys. 120 (2004) 9665–9678. J.L. Abascal, C. Vega, A general purpose model for the condensed phases of water: TIP4P/2005, J. Chem. Phys. 123 (2005) 234505 . M.W. Mahoney, W.L. Jorgensen, A five-site model for liquid water and the reproduction of the density anomaly by rigid, nonpolarizable potential functions, J. Chem. Phys. 112 (2000) 8910–8922. R.C. Remsing, J.M. Rodgers, J.D. Weeks, Deconstructing classical water models at interfaces and in bulk, J. Stat. Phys. 145 (2011) 313–334. V. Molinero, E.B. Moore, Water modeled as an intermediate element between carbon and silicon, J. Phys. Chem. B 113 (2008) 4008–4016. C.L. Dias, T. Ala-Nissila, M. Grant, M. Karttunen, Three-dimensional Mercedes-Benz model for water, J. Chem. Phys. 131 (2009) 054505 . F.H. Stillinger, T.A. Weber, Computer simulation of local order in condensed phases of silicon, Phys. Rev. B 31 (1985) 5262. R.W. Hockney, J.W. Eastwood, Computer Simulation Using Particles, CRC Press, 1988. S. Plimpton, Fast parallel algorithms for short-range molecular dynamics, J. Comput. Phys. 117 (1995) 1–19. M. Orsi, Comparative assessment of the elba coarse-grained model for water, Mol. Phys. 112 (2014) 1566–1576. Y. He, M. Razi, C. Forestiere, L. Dal Negro, R.M. Kirby, Uncertainty quantification guided robust design for nanoparticles’ morphology, Comput. Methods Appl. Mech. Eng. 336 (2018) 578–593. L. Guo, A. Narayan, T. Zhou, Y. Chen, Stochastic collocation methods via ell_1 minimization using randomized quadratures, SIAM J. Sci. Comput. 39 (2017) A333–A359. L. Guo, A. Narayan, T. Zhou, A gradient enhanced 1-minimization for sparse approximation of polynomial chaos expansions, J. Comput. Phys. 367 (2018) 49–64.