A new method for model validation with multivariate output

A new method for model validation with multivariate output

Accepted Manuscript A new method for model validation with multivariate output Luyi Li , Zhenzhou Lu PII: DOI: Reference: S0951-8320(16)30214-9 10.1...

949KB Sizes 0 Downloads 56 Views

Accepted Manuscript

A new method for model validation with multivariate output Luyi Li , Zhenzhou Lu PII: DOI: Reference:

S0951-8320(16)30214-9 10.1016/j.ress.2017.10.005 RESS 5973

To appear in:

Reliability Engineering and System Safety

Received date: Revised date: Accepted date:

6 July 2016 6 October 2017 7 October 2017

Please cite this article as: Luyi Li , Zhenzhou Lu , A new method for model validation with multivariate output, Reliability Engineering and System Safety (2017), doi: 10.1016/j.ress.2017.10.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Highlights 

A new method for model validation with multivariate output is presented



The new method can consider both uncertainty and correlation of multivariate output



The new method can be used for validation at both single and multi-validation



CR IP T

site The advantages of the new method with respect to the existing ones are

AC

CE

PT

ED

M

AN US

highlighted

1

ACCEPTED MANUSCRIPT

A new method for model validation with multivariate output Luyi Lia, Zhenzhou Lu a a

School of Aeronautics, Northwestern Polytechnical University, Xi‟an, 710072, P. R. China

Abstract: Traditional methods for model validation assessment mainly focus on validating a single response. However, for many applications joint predictions of the multiple responses are needed. It is thereby not sufficient to validate the individual responses separately, which ignores correlation among multiple responses. Validation assessment for multiple responses involves

CR IP T

comparison with multiple experimental measurements, which makes it much more complicated than that for single response. With considering both the uncertainty and correlation of multiple responses, this paper presents a new method for validation assessment of models with multivariate output. The new method is based on principal component analysis and the concept of area metric. The method is innovative in that it can eliminate the redundant part of multiple responses while reserving their main variability information in the assessment process. This avoids directly

AN US

comparing the joint distributions of computational and experimental responses. It not only can be used for validating multiple responses at a single validation site, but also is capable of dealing with the case where observations of multiple responses are collected at multiple validation sites. The new method is examined and compared with the existing u-pooling and t-pooling methods through numerical and engineering examples to illustrate its validity and potential benefits. Key words: model validation; multivariate output; principal component analysis; area metric

M

1 Introduction

With the rapid development of computers‟ abilities, an increasing number of computational

ED

models have been established for simulating the system behavior and supporting decision instead of the expensive physical experiments in many research fields, such as risk analysis [1-3], engineering design and performance estimation. The increased dependence on using

PT

computational simulation models in engineering presents a critical issue of confidence in modeling and simulation accuracy. Verification and validation (V&V) are the primary means to assess accuracy and reliability of computational simulations in engineering. Verification is the process of

CE

determining that a model implementation accurately represents the developer‟s conceptual description of the model and the solution to the model. Validation is the process of determining

AC

the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model [4]. That is, verification addresses the accuracy of the numerical solution produced by the computer code as compared to the exact solution of the conceptual model. In the verification, how the conceptual model relates to the „„real world‟‟ is not an issue. Validation addresses the accuracy of the conceptual model as compared to the „„real world‟‟, i.e., experimental measurements [4-7]. As Roache [8] stated, “verification deals with mathematics, while validation deals with physics‟‟. In general, code verification and numerical error estimation which are the two types of model verification activities should be completed before model 

Corresponding author P.O. Box 120, School of Aeronautics, Northwestern Polytechnical University, Xian, 710072, PR China E-mail: [email protected], Tel: 0086-29-88460480 2

ACCEPTED MANUSCRIPT validation activities are conducted or at least before actual comparison of computational results are made with experimental results [4-7]. The reason is clear. We should have convincing evidence that the computational results obtained from the code reflect the physics assumed in the models implemented in the code and that these results are not distorted or polluted due to coding errors or large numerical solution errors. As pointed out in literatures [7,9], if a researcher/analyst does not provide adequate evidence about code verification and numerical error estimation in a validation activity, the conclusions presented are of dubious merit. If conclusions from a defective simulation are used in high consequence system decision-making, disastrous results may occur. After

CR IP T

decades-long development, while verification is well established, rigorous model validation assessment is not. In the present work, we also restrict our attention to model validation assessment.

Various model validation methods have presently been developed for assessing model validity. These methods are generally classified into four categories [10]: classical hypothesis testing [11], Bayesian factor [12,13], frequentist's metric [7,14] and area metric [9]. Classical hypothesis testing mainly focuses on determining which of the two alternative propositions (where

AN US

the null hypothesis usually is that the model is correct and the alterative one is that the model is not correct.) is correct. It cannot assesse the quantitative accuracy of a model. The Bayesian factor approach is primarily interested in evaluating the probability (i.e., the belief) that the model is correct by incorporating an analyst‟s prior belief of model validity. Although it has some advantages over the classical hypothesis testing method, the Bayesian approach still cannot provide quantitative accuracy of the model, but gives a “yes” or “no” answer [9]. Different from

M

the classical hypothesis testing and Bayesian factor, the frequentist‟s metric can help to quantitatively assess the adequacy of the given model for an application of interest. However, the

ED

main limitation of this method is that it is based on comparing means or other summary statistics (e.g. the maximum values) of the experimental data and model predictions. Thus, it considers only the central tendencies or other specific behaviors of data and predictions rather than their entire

PT

distributions. With this consideration, Ferson et al. [9] proposed an area metric method which assesses the accuracy of the given model by comparing the difference between the cumulative distribution function (CDF) of the model predictions and the empirical CDF of experimental data.

CE

This method can provide an objective assessment of the model accuracy by considering the entire distributions of the model predictions and experimental data. It generalizes the frequentist‟s metric

AC

method to validation assessment that focuses only on the mean behavior or other summary statistics of predictions and observations. More importantly, by employing the u-pooling technique to incorporate the experimental data collected at multiple validation sites into a single metric, the method in [9] can be used to assess the overall performance of a model against all the experimental data in the validation domain. Here, a validation site is a specific setting of the model input variables at which the accuracy of a model is validated against the measured quantities. In spite of numerous advantages, Ferson‟s area metric method is based on comparison of the marginal distributions of the model predictions and experimental data. It, thus, is only suitable for validation assessment of models with single response or with multiple uncorrelated responses [15]. In practice, however, correlated multiple responses are often of interest. For instance, multiple 3

ACCEPTED MANUSCRIPT response quantities, like stress, strain and displacement etc., often are predicted simultaneously from the same experiment at a single location. These different quantities are correlated, since they are based on the same inputs. On the other hand, the interested model responses from the same experiment may change with location (in space or time coordinate). In this case, there is a strong correlation between any pair of response quantities from the same experiment. Model validation assessment for multiple correlated response quantities is much more complicated than that for single response, as it needs to consider both uncertainty and correlation of the multiple responses in the validation assessment process.

CR IP T

Validation assessment with “multivariate data” has drawn attention of the scientists in meteorological and climate community for decades [16-19], where verification and validation are combined in one verification process. Many methods, such as the minimum spanning tree (MST) histograms, multivariate rank (MVR) histograms and bounding boxes (BBs) [16,17], have been developed in this community to assess the multidimensional ensemble reliability of their forecasts within the context of high-dimensional multivariate data. The bounding boxes in BBs approach are defined by the minimum and maximum values of an ensemble forecast. They, thus, are unduly

AN US

affected by outliers and may fail in characterizing the bulk ensemble properties appropriately [16]. Both the MST histograms and MVR histograms are based on the reliability criterion that the ensemble members should be statistically indistinguishable from the observations. This is similar to that underlying the u-pooling technique. From that, insights can be gained into the average dispersion characteristics of the multidimensional ensemble forecasts.

In the context of engineering community, classical hypothesis testing and Bayesian factor

M

have been extended to validation assessment of models with multivariate output [20, 21]. These methods, however, inevitably inherit their disadvantages in univariate case. That is, they cannot

ED

provide quantitative accuracy of the model, but give a “yes” or “no” answer. One intuitive and straightforward method which can account for both uncertainty and correlation of the multiple responses is directly comparing the joint CDF of the model predictions and the multivariate

PT

empirical CDF of experimental data. Nevertheless, in many engineering cases, experimental observations are usually very sparse due to the expense of full-scale physical experiments. This means that the correlation structure of experimental data is poorly known. Thus, the multivariate

CE

empirical CDF used for capturing correlation information in the data will be a very poor representation of what actually exists in the real physical system. Even if the experimental data are

AC

sufficiently collected, the method would still suffer severely from the “curse of dimensionality” when computing the high-dimensional integration in the metric. To provide a quantitative assessment for the accuracy of the computational models with

correlated multivariate output with considering both the uncertainty and correlation, Li et al [15] proposed two validation metrics (A validation metric is a formal measure of the mismatch between predictions and experimental data.) by extending the idea of area metric and u-pooling method. They are probability integral transformation (PIT) area metric for a single validation site and t-pooling metric for multiple validation sites. Both the metrics are based on the multivariate PIT theorem. Although this method can take into account both the uncertainty and correlation of the multivariate output in the process of model validation assessment, it still requires estimating the joint CDF of model responses to transform the multivariate experimental observations. This is 4

ACCEPTED MANUSCRIPT often unavailable for high-dimensional response space. In this paper, a new method is proposed for the validation assessment of models with multiple correlated responses based on the principal component analysis (PCA) and the idea of area metric. By the PCA, the correlated multiple output responses are transformed into a set of orthogonal principal components (PCs) with the first few PCs containing nearly all the variability of the multiple outputs. Then the area metric can be applied to each PC to obtain the corresponding validation metric value. The total model validation metric is obtained by aggregating these validation metric values of PCs with their corresponding PCA weights. With the proposed method,

CR IP T

validation assessment of models with correlated multiple high-dimensional responses can be decomposed into a series of model validation assessments in independent one-dimensional space. This allows considering both the uncertainty and correlation of the multiple responses without estimating their joint CDF. Thus, it is more feasible for the practical applications.

The rest of the paper is organized as follows. Section 2 briefly reviews the existing area metric and u-pooling technique. The theory of the proposed method is detailed in Section 3. In Sections 4 and 5, an illustrative mathematical example and a simple risk analysis model are

AN US

respectively used to show the advantages of the proposed method by comparing with the existing ones. The proposed method is also applied to an engineering model where the experimental data are assumed to be sparse in Section 6. Section 7 discusses some practical issues in the application of the proposed method and the future works to be done. Finally, the conclusions come at the end of the paper.

M

2 Review of the area metric and u-pooling based metric

For the relevant system response quantity (SRQ) y, let Fm(y) denote the CDF of y predicted

ED

by the computational model at a single validation site, and Sne ( y) represent the empirical CDF of the corresponding experimental data of y. In Fm(y) and Sne ( y) , the superscripts m and e represent “model” and “experiment”, respectively and the subscript n is the sample size of the data set (Note

PT

that since the experimental data are often sparse, empirical CDF is usually used to represent their uncertainty which is an exact representation of the data regardless of the amount of data, and does

CE

not require any assumptions [9]. In the case where predictions are also sparse, empirical CDF is also used for describing the uncertainty of predictions). As shown in Fig. 1, area metric uses the area between the prediction distribution Fm(y) and the data distribution Sne ( y) as the measure of

AC

the mismatch between the model and the data. Mathematically, it can be written as follows,

d  F m , Sne   





F m  y   Sne  y  dy

(1)

where d ( F m , Sne ) is the Minkowski L1 distance between the two distributions. As pointed out in Ref. [9], when the predictions and experimental data are exhaustively collected so that there is essentially no sampling uncertainty about their distributions, a larger value of area metric would indicate stronger evidence for disagreement between the model predictions and the experimental measurements at the specified validation site. Otherwise, a large area metric may also result from lack of experimental data. Area metric measures the mismatch between the model and the experiment at a single 5

ACCEPTED MANUSCRIPT validation site. It cannot be directly used when the model predictions and experimental data are collected at multiple validation sites. With this consideration, Ferson et al. [9] proposed another metric, i.e. u-pooling based metric to measure the overall disagreement between the model‟s predictions and the experimental data at multiple validation sites. By transforming the experimental data at different validation sites into a universal probability scale through the corresponding predictive distributions, the u-pooling base metric can pool all physical observations over the intended prediction domain into a single aggregated metric. Taking three validation sites as an example, Fig. 2 illustrates the process of the u-pooling method for model validation assessment at multiple validation sites.

CR IP T

As shown in Fig.2, at each validation site i (i  1, 2,3) , assume that only one experimental observation yie is collected for the purpose of illustration. Because of the existence of various uncertainties (which can be seen in the next section), such as model parameter uncertainty, the corresponding prediction distributions at the three validation sites can be obtained as

F1m ( y), F2m ( y), F3m ( y) , respectively. Then each experimental observation yie is transformed

AN US

according to its corresponding prediction distribution Fi m to obtain the corresponding probability value ui  Fi m ( yie ) which ranges on the unit interval [0,1]. Under the assumption that

yie (i  1, 2,3)

really are distributed according to their respective distributions

Fi m (i  1, 2,3) [9], these ui  Fi m ( yie )(i  1, 2,3) will have a uniform distribution on [0,1]

M

according to the probability integral transform theorem [22]. Therefore, the area differences (shaded areas in the left half part of Fig.2) between the empirical CDF of these u-values and the standard uniform distribution can be used as overall summary metric assessing the accuracy of the

ED

model‟s predictions at multiple validation sites. Similar to the area metric, when the predictions and experimental data are exhaustively collected so that there is essentially no sampling uncertainty about their distributions, a larger

PT

value of u-pooling based metric would indicate stronger evidence for disagreement between the model predictions and the experimental measurements over the intended prediction domain (i.e. at

CE

the multiple validation sites). Because both the standard uniform distribution and an empirical distribution function of the transformed u-values are constrained to the unit square, the value of the u-pooling based metric is between 0 and 0.5, with 0 indicating a perfect match between the

AC

model and experiment and 0.5 representing the worst possible match.

3 The proposed method A reasonable model validation metric should consider various uncertainties in the

computational model and the physical experiment. The uncertainties involved in the V&V process generally include uncertainty of the input data (e.g. uncontrollable or controllable geometry and material parameters of the model), model parameter uncertainty (e.g. boundary condition of the partial differential equation), numerical solution error in the prediction of the SRQ and experimental uncertainty in the form of measurement error, systematic error and random error [27]. Among them, numerical solution error in the prediction of the SRQ is the uncertainty concerned with the numerical solution process, while input data uncertainty, model parameter uncertainty and 6

ACCEPTED MANUSCRIPT the experimental uncertainty are all considered in the model validation assessment process. A validation site is a specific setting of the model controllable input variables at which the accuracy of a model is validated against the measured quantities. At the specific validation site, because of the existence of various other uncertainties, the distribution of the SRQ always can be obtained if enough output samples can be collected. In some cases of the engineering, multiple response quantities which are based on the same inputs may be predicted at a single location [9]. For instance, various quantities, such as stress, strain and displacement, etc., may be calculated simultaneously from the derivatives or the

CR IP T

integration of the finite element field combined with the structural parameters. In other cases, single or multiple correlated model responses may change with location (in space or time coordinate). For example, a heating model might be used to predict the time-dependent temperature at a given location on an object. In both cases, there is a strong correlation between any pair of response quantities from the same experiment. The uncertainties in the input parameters propagate to these derived quantities and multivariate distribution can be used to represent such correlated random response quantities. In order to consider both the uncertainty and

AN US

the correlation of the multiple responses, an intuitive and straightforward method for validation assessment of a model in these cases requires comparing the joint distributions of model responses and the observed data, i.e.,

d  F m , Sne   











F m  y1 ,

, yd   Sne  y1 ,

, yd  dy1

dyd

(2)

where Fm(y1,…, yd) is the joint CDF of d-dimensional multiple responses (y1,…, yd) predicted by

M

the computational model at a single site or multiple sites, and Sne ( y1 ,

, yd ) is the multivariate

empirical CDF of the corresponding experimental data of (y1,…, yd). The comparison in Eq. (2)

ED

requires (i) estimating the joint CDF of predictions and multivariate empirical CDF of the corresponding experimental data; (ii) estimating the high-dimensional integration. As mentioned earlier, estimating the joint CDF of the multivariate response is never an easy task. Furthermore,

PT

when the experimental data are very sparse, the multivariate empirical CDF used for capturing correlation information in the data can be a very poor representation of what actually exists in the real physical system [15]. Even if the experimental data are sufficiently collected and the joint

CE

CDF of the predictions can be obtained, the method would still suffer severely from the “curse of dimensionality” for computing the high-dimensional integration in the metric. These problems

AC

make it impossible to directly compare the predictions and the experimental data in the case of the correlated multiple responses. Nevertheless, for correlated multiple responses, a few principle components often contain

nearly all their uncertainty information with the majority making only a negligible contribution. If the “main uncertainty information” (i.e. the important principle components) have been correctly identified, adding further information to the data may add to its completeness and defensibility without adversely increasing the uncertainty in the data. Therefore, if we can eliminate the redundant part of the multiple responses, i.e., those components that make only negligible contributions to the whole uncertainty of the multiple responses, while reserving their main uncertainty information in the validation assessment process, we can considerably simplify the problem without losing much uncertainty information. Based on the decomposition of the 7

ACCEPTED MANUSCRIPT correlation matrix of the multiple variables, principal component analysis (PCA) provides an excellent statistical tool for extracting the main variability information of these correlated multiple variables. Therefore, we combine the PCA with the idea of area metric to assess the predictive capability of models with multivariate output in this section. For the sake of completeness, we first review the PCA briefly. 3.1. Principal component analysis Consider the correlated multivariate output yi(x), i=1,…,d, where x is a vector of the inputs and d is the number of output responses. PCA allows converting these correlated multivariate output yi(x), i=1,…,d into a set of linearly uncorrelated ones called principal components (PCs), so converting process of the PCA is as follows.

CR IP T

that most stochastic information is concentrated in the first few components [23-26]. The Assume that we have a set of N observations on d-dimensional variables y={y1, y2 ,…, yd}, and Y is the corresponding N×d observation matrix with columns representing the d variables and rows representing N different repetition of the observations. Let Σ denote either the variance–

AN US

covariance matrix or the correlation matrix of Y. Thus

Σ

1 T Yc Yc N

(3)

where Yc is the matrix obtained by centering and possibly normalizing each column of Y. The PCA is based on the decomposition of Σ,

d

Σ   k φk φkT

(4)

where

M

k 1

1  ...  d are the eigenvalues of Σ and 1 ,...,d are the corresponding eigenvectors

ED

which are normalized and mutually orthogonal. Then the PCs hk , k  1,2,..., d of the original correlated variables y can be obtained by

(5)

PT

hk  Yc φk

With this transformation, the resultant PCs are linearly uncorrelated with each other. Each PC

hk

2

 k , i.e. the variance of the kth PC is k . The sum of the variances of the PCs is

CE

satisfies

AC

equal to the variance of the original variables, i.e. d

 k 1

k

 trace  Σ 

(6)

where trace(Σ) is the trace of Σ. If Σ is a correlation matrix, trace(Σ)=d. Otherwise, trace(Σ) is the total dispersion or variability among the rows of Y. By construction, the first principal component h1 has the largest possible variance (that is,

accounting for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Thus, the first few PCs are sufficient to represent the whole variability of the original variables, which can reduce the number of dimensions while retaining the data‟s variation as much as possible. Instead of investigating thousands of original variables, the first few components containing the majority of the data‟s variation are explored. 8

ACCEPTED MANUSCRIPT The visualization and statistical analysis of these new variables (i.e., the principal components) can help to find similarities and differences between samples. These considerations motivate the use of PCA for validation assessment of models with multivariate output. 3.2 New validation assessment method for models with multivariate output In this paper, we mainly consider the input data uncertainty and model parameter uncertainty, with experimental uncertainty being represented by zero mean Gaussian random variable as have been done in many literatures [9,15]. With this consideration, the physical experiment and computational model in this work can be described as y e ( x, z ) and y m ( x, z, θ ) , respectively.

CR IP T

Here x are uncontrollable random input variables of the experiment and computational model, z are controllable input variables, i.e., deterministic space or time domain variables representing the sites to validate the model and θ are uncertain model parameters. Assume

y ( x, z), i  1, e i

that

the

physical

experiment

considered

has

multiple

responses

, d , where d is the number of responses at each validation site. The

corresponding observation matrix for these responses is represented by the matrix Ye of size n×

AN US

(dp). Here, n is the number of the random observations of each response and the product dp is the total number of different responses at the multiple intended validation sites z= (z1, z2,…, zp). These different responses of the experiment are determined by the uncontrollable inputs x. Y e is the vector of mean values of each column of Ye and S is the corresponding covariance matrix of Ye. The predictions of the candidate computational model are denoted by yi ( x, z, θ ), i  1, m

,d

M

with θ being a vector of model parameters. The predictions of the candidate computational model at the same validation sites as the experiment are represented by the matrix Ym of size N×(dp), where N is the number of the predictions of each response. In the general cases, N≥n holds. Let

ED

μ0 be the mean vector and Σ be the covariance matrix of Ym. The main procedure of the proposed method is illustrated in Fig. 3 and explained as follows.

PT

(1) Collect observations from physical experiment at specified validation sites z=(z1, z2,…, zp), and represent these observed data by the matrix Ye of size n×(dp). Here the rows of Ye represent n different observations for the responses from n repeat of experiment and the

CE

columns of Ye represent dp different responses at p validation sites. (2) Obtain the predicted data from computational model at the same validation sites as the experiment, and represent these predictions by the matrix Ym of size N×(dp). Here the rows

AC

of Ym represent N different realizations of the responses by N different runs of computational model and the columns of Ym represent dp different responses at p validation sites.

(3) Perform PCA for prediction matrix Ym based on the theory in subsection 3.1 to obtain the corresponding PCs

hkm , k  1, 2,..., dp (This is because that the predictions from

computational model are usually much more than the observations from the physical experiment, and the uncertainty and correlation of the predictions can be better captured by their correlation matrix). The eigenvalues

1  ...  dp and the corresponding eigenvectors

1 ,..., dp of Σ are also obtained in this step. The contribution rates of all the PCs then can be 9

ACCEPTED MANUSCRIPT given by Rk  k (1 +2 +...+dp ), k  1, 2,..., dp . (4) Transform the matrix Ye of the physical observations by the eigenvectors

1 ,..., dp obtained

in step (3) according to Eq. (5) to get the corresponding PCs hke , k  1, 2,..., dp . Under the assumption that the predicted responses really are the same as the responses of the physical experiment, the PCs hke , k  1, 2,..., dp of the responses of the physical experiment will have the same distributions as the corresponding PCs hkm , k  1, 2,..., dp of the predicted

CR IP T

responses. If, however, there are prevailing differences between them that cannot be eliminated by adding more data, we can infer that there is a disagreement between the computational model and the physical experiment.

(5) Determine the number K of considered PCs by R1 +R2 +...RK  c , where c is the percentage of total variability of the multiple responses that one would like to reserve. Generally c≥85%. m

(6) Compare the CDF Fk

h

of each PC hkm of the predicted responses with the e

the area metric, i.e.

dk  F , S   





AN US

corresponding empirical CDF Sk  h  of the PC hke of the experimental observations by

Fkm  h   Ske  h  dh, k  1,2,..., K

(7)

(7) Estimate the overall validation metric for multivariate response by aggregating these

M

validating metrics of PCs with their corresponding contribution rates, that is K

d  F , S    d k  F , S   Rk

(8)

k 1

ED

From the theory and procedure of the proposed method, it can be seen that by incorporating PCA in the validation assessment process, both the correlation and uncertainty of the multiple responses have been considered in the proposed method. Unlike the previous methods for

PT

validation assessment of the models with multiple responses, such as hypothesis testing or PIT based method, the proposed method only needs to consider the highest variability information

CE

contained in the multiple responses. This avoids directly comparing the joint distributions of the computational and experimental responses and considerably simplifies the model validation assessment problem with multiple responses. Compared with the PIT based method in [15], the

AC

proposed method does not require estimating the joint CDF of the multiple responses which is an essential step in the PIT and t-pooling method. It, thus, has lower computational cost and can be applied to validation assessment of more complex models with high-dimensional responses. Furthermore, by PCA, validation assessment of the model with correlated multivariate output

is reduced into a series of independent validation assessments of the models with single response. This makes many model validation assessment methods developed for single response applicable for multiple cases. In this paper we choose to use the excellent area metric proposed by Ferson et al.[9] which allows considering the distribution of the main variability information contained in the multiple responses. In the case of model validation assessment with single response, the proposed method degenerates into the area metric method. 10

ACCEPTED MANUSCRIPT 4 Numerical examples We first use the numerical studies designed in [15] to validate the effectiveness of the proposed method. Assume that the experimental observations are generated by the following two responses:

y1e  z,   sin  2 z  0.5    1 y2e  z,   cos  0.25 z    0.2 z   2 e

(9)

e

where y1 and y2 represent experimental SRQs, z(0≤z≤6) is a deterministic control variable

CR IP T

and =1.5 is a constant parameter in the experiment. ε1 and ε2 are the measurement errors of the two responses, respectively. Their distributions are the same zero mean Gaussian distribution N(0,σ2) with standard deviation σ=0.2 and correlation coefficient 1 ,2  0.5 . Since the function forms and measurement errors of the two responses are all correlated, the experimental responses are correlated random variables.

Two test scenarios, as shown in Table 1, are designed in [15] by employing different

AN US

predictive models. Test 1 is designed to test whether the method can identify more accurate model and less accurate model with respect to the available experimental observations. Test 2 is created to study whether the method can distinguish model with larger uncertainty from that with less uncertainty when the correlation between the model responses varies from site to site. The measurement errors of the predictive models have different correlation coefficients in different cases.

M

4.1 Test 1

In Test 1, three predictive models are validated against the physical observations produced by Eq. (9). Model 1 is a predictive model where both the model parameter θ and the correlation

ED

coefficient 1 , 2 of the measurement errors are the same as those of the experimental data. Model 2 has an incorrect model parameter θ=1.2. Model 3 has an incorrect model parameter θ=1.2

PT

as well as an incorrect correlation coefficient 1 ,2  0.6 for the measurement errors. Therefore, model 1 should be the best predictive model, followed by model 2, with model 3 being the worst

CE

one.

The three computational models in Test 1 are firstly validated against the physical

observations at the single validation site z=2.0. In this case, 1000 observations are produced by

AC

Eq.(9) and 10000 simulated responses are generated by each of the three predictive models at z=2.0. Fig.4 shows that when applying the proposed method, the metric provides a comparison between the CDFs of the PCs of the predictions and the corresponding empirical CDFs of the transformed observations. It can be seen from Fig.4 that no matter for the PC1 (the first PC) or the PC2 (the second PC), the difference between the prediction and the observation is largest for model 3 (0.151 for PC1 and 0.153 for PC2), and smallest for model 1 (0.002 for both PC1 and PC2). This is well consistent with the model setting. Actually, with only PC1, the proposed method can distinguish more accurate model with less accurate model with respect to the available experimental observations. The aggregated validation metrics of the three models by using all the PCs are given in the 11

ACCEPTED MANUSCRIPT upper part of Table 2, where the model validation assessment results obtained by the PIT method (PIT area metric) in Ref.[15] are also listed for comparison. For model validation assessment at the single validation site, the PIT method in Ref. [15] first estimates the joint CDF of the 2 predictive responses to obtain the multivariate PIT distribution of the prediction. And then it transforms all the experimental observations according to the estimated joint CDF to obtain the empirical CDF of the transformed observations. The PIT area metric is then given by the area difference between the multivariate PIT distribution of the prediction and the empirical CDF of the transformed observations. The validation assessment results in Table 2 show that the proposed

CR IP T

method can identify the model whose predictive SRQs have better match with the available experimental data as effectively as the PIT area metric at the single validation site. However, by using PCA, the proposed method does not need to estimate the joint CDF of the predictions. It, thus, can correctly distinguish the candidate computational models with less computational efforts. To assess the overall predictive capability of the three models in the validation domain, the three computational models in Test 1 are then validated against the physical observations at 100 validation sites uniformly distributed in the interval [0,6]. In this case, 100 sets of observations are

AN US

collected at 100 validation sites with 10 observations at each site, and 100 sets of simulated responses are produced from each model with 1000 responses at each site. For model validation assessment at 100 validation sites, there are totally 200 responses with 2 responses at each site. In this case, directly comparing the joint CDFs of the predictions and the observations is obviously impossible. Fortunately, the proposed method can extract the key variability information while removing the redundant one for the multiple responses, which can simplify the problem

M

considerably. By keeping the 85% variability of the whole multiple responses, the proposed method can transform the original 200 correlated responses into 120 independent PCs which are

ED

accessible to the area metric method. Fig.5 shows the comparison between the CDFs of the first three PCs of the predictions and the corresponding empirical CDFs of the transformed observations. It can be seen that although the contributions to the output variability by the PCs of

PT

the 200 responses are not significantly different, we still can reduce 200-dementional output space into 120 one by the PCA.

The aggregated validation metrics of the three models at 100 validation sites are shown in the

CE

lower part of Table 2, compared with the results by the PIT method (t-pooling metric) in [15]. The validation assessment results correctly identify that in the validation domain, the overall predictive

AC

capability of model 1 is better than that of model 2 which outperforms model 3. Additionally, it can be seen that although model 1 is the same as the physical model, the validation metric of model 1 is not equal to zero. This is because only 10 observations are collected at each validation site which have a large experimental uncertainty. We have tested that as the number of observations increases (i.e. the uncertainty of experiment decreases), the metric of model 1 converges to zero. The results show that the proposed method also can correctly identify more accurate model and less accurate model with respect to the available experimental observations even when the observations of the physical experiment are sparse. Besides, for model validation assessment at 100 validation sites, the PIT method in [15] needs to estimate 100 joint CDFs of the 2 predictive responses to obtain 100 multivariate PIT distributions of the predictions. And then it transforms the experimental observations at each validation site according to their corresponding 12

ACCEPTED MANUSCRIPT predictive joint CDFs to obtain the first transformed observations. Since the first transformed observations calculated from different validation sites correspond to different PIT distributions, a second transformation is further required in [15]. In the second transformation, the 100 multivariate PIT distributions at different validation sites are transformed into identical PIT distribution based on multivariate PIT theory. The first transformed observations are also transformed by the corresponding PIT distributions at each validation site to obtain the final transformed observations. The t-pooling metric is then given by the area difference between the PIT distribution of the predictions and the empirical CDF of the observations obtained at the

CR IP T

second transformation. Compared with the method in [15], the proposed method does not require estimating 100 joint CDFs of the multiple responses, and it does not need a second treatment (transformation) for the predictions and observations at the multiple validation sites. This can considerably reduce the computational effort of model validation assessment with multiple responses. 4.2 Test 2

AN US

In Test 2, model 4 and model 5 with uncertain model parameters are validated against the physical observations produced by Eq.(9). The purpose is to test whether the proposed method can distinguish model with greater uncertainty from that with less one. As shown in Table 1, parameter

 of model 4 has a smaller uncertainty, whereas that of model 5 possesses a larger one. Therefore, it is expected that the method can conclude that model 4 is better than model 5. Similar to Test 1, The two computational models in Test 2 are firstly validated against the

M

physical observations at the single validation site z=2.0, and then at the multiple validation sites uniformly distributed in the interval [0,6]. For validation assessment at the single site z=2.0, 1000

ED

observations are produced by Eq.(9), while 10000 simulated responses are generated by each of the two predictive models. Fig.6 illustrates comparison between the CDFs of the PCs of the predictions and the corresponding empirical CDFs of the transformed observations. It can be seen

PT

from Fig.6 that with the most important PC1, the proposed method can differentiate between models of greater and less uncertainty. The aggregated validation metrics of the two models by using all the PCs are given in the upper part of Table 3. The results correctly show that at the

CE

single validation site z=2.0, model 4‟s predictive SRQs have better match with the corresponding experimental data than those of model 5.

AC

For validation assessment at the multiple validation sites uniformly distributed in the interval

[0,6], the same sets of observations collected in subsection 4.1 are employed to validate the two models in a overall sense. By retaining the 85% variability of the whole multiple responses, the original 200 correlated responses are reduced into 120 independent PCs in the proposed method. These independent PCs are then compared by using the area metric. The CDFs of the first three PCs of the predictions are compared with the corresponding empirical CDFs of the transformed observations in Fig.7. The corresponding aggregated validation metrics are summarized in the lower half of Table 3. The results show that the proposed method is capable of differentiating more accurate model with less accurate one with respect to the available experimental data even though the correlation information of the models varies from site to site. Tests 1 and 2 show that no matter for the model validation assessment at the single validation 13

ACCEPTED MANUSCRIPT site or at the multiple validation sites, the proposed method can correctly distinguish between more accurate model and less accurate one with respect to the available experimental data. Compared with the existing method in [15], the proposed method does not require the joint CDF of the multiple responses, and it can integrate the validation assessment of multiple responses at the single validation site and the multiple ones into a unified metric. Therefore, the proposed method is more computationally efficient, and can be applied to more complex model validation problems with correlated multiple responses.

5 A simple risk analysis model

CR IP T

In this section, we consider a simple stochastic and dynamic growth model that simulates the concentration of Listeria monocytogenes in a cheese at an hourly time step. This academic model is based on a simplification of the growth model proposed in papers of Augustin and Carlier and Augustin et al.,[28,29] through some assumptions [30]. The simplified model includes three main periods: ripening, transport and retail consumption. The temperature of each period and the initial concentration are constants of unknown values with uncertainty described in Table 4. To the initial

AN US

concentration of L.monocytogenes in a cheese, a random shift parameter N(0, 0.25) is added to account for the variability in sampling cheeses and to make the model stochastic. The model consists of equations that involve the characteristics of the bacteria and the cheese. These equations are as follows [30].

The log concentration of L.monocytogenes (CFU/g) in a cheese at time t (Y(t)) is given by: (10)

M

  109   Y (t )  9  log10 1   C0  1  exp     t   t   N  0,0.25    10 

where the growth rate μ(t) is a multiple of the optimal growth rate (μopt=0.25), the factor of

ED

temperature (γ(Temp(t))), pH (γ(pH(t))) and water activity (γ(aw)). μ(t) is expressed as:

  t   opt   Temp  t      pH  t      a 

(11)

CE

PT

Temp  t   45.5 Temp  t   1.72   Temp  t    38.72  38.72  Temp t   37   8.5  35.28  2  Temp t    2

(12)

with Temp(t) = Temp1 from days 0 to 7, Temp2 from days 7 to 14 and Temp3 from days 14 to 28.

AC

  pH  t   9.61  pH  t   4.71    pH  t     2.59   pH  t   7.1  2.51  4.71  pH  t    0

if 4.71  pH  t   9.61 otherwise

(13) with

pH  t   1.14  109  t 3  3.04  106  t 2  5.2  104  t  4.58

(14)

  a   0.916

(15)

The experimental observations in this example are generated by the Eqs. (10)-(15). In these 14

ACCEPTED MANUSCRIPT equations, the temperature of each period and the initial concentration shown in Table 4 are uncontrollable inputs and t=1,2,…,28 are validation sites. Two candidate predictive models are considered here. Model 1 has the same form as the experimental model shown above but without the random shift parameter N(0, 0.25). The model form of model 2 is also the same as the experimental model but without the random shift parameter N(0, 0.25). Furthermore, the model parameters of model 2, i.e.   a  =0.906 and μopt=0.22, are different from those of the experimental model. The model validation assessment results by the different methods are given in Table 5. In this

CR IP T

example, only 5 observations are produced by the experimental model and 1000 simulated responses are generated by each of the two predictive models at each validation site t. For the L.monocytogenes model in this example, there are total 28 responses with one response at each validation site t. In this case, the PIT used in [15] at each validation site is the univariate PIT which is the same transformation employed in the u-pooling method. Therefore, the t-pooling metric in [15] is actually the same as the u-pooling based metric in [9] in this case which needs to

AN US

transform the observations at each validation site according to the corresponding CDF of the predictions. However, with the proposed method, the 28 correlated responses are reduced into 2 uncorrelated PCs which nearly account for all the variability of the outputs. As shown in Fig. 8, these two PCs can be easily dealt with by the area metric, respectively, with the aggregated metrics shown in Table 5. The aggregated metrics in Table 5 can correctly distinguish the more accurate model from the less accurate one with respect to the available experimental data. If one

M

needs to keep the 85% variability of the whole multiple responses, only one PC1 is required and it also can correctly identify that model 1‟s predictive SRQs have better match with the corresponding experimental data than those of model 2. Therefore, compared with the u-pooling

ED

based metric, the proposed method is more efficient. Additionally, the results in this example also show that the proposed method can be applicable to model validation assessment problems with non-normal random input/parameter variables. This is easy to understand considering that the

PT

propagation of the input/parameter uncertainties to the model/experimental outputs is achieved by the Monte Carlo method in this paper which is suitable for any type of distributions.

CE

6 An engineering model

A simply supported beam with I-shaped cross section is considered in this section to

AC

demonstrate the engineering applicability of the proposed method. As shown in Fig. 9, one end of the beam is fixed (no vertical or horizontal displacement permitted), while the other end is supported by a roller (no vertical displacement permitted). A static force F is exerted at the midpoint of the beam to induce various responses. The length of the beam is denoted as L, while the width, height and thickness of its I-shaped cross section are denoted as w, h and t, respectively. The geometric parameters L, w, h and t are all considered as uncontrollable input variables of the model with distribution information given in Table 6. The magnitude of the applied force F is chosen as the controllable input and the elastic modulus E is treated as model parameter. Five responses with their descriptions shown in Table 7 are considered in this example. The material of the beam is 2117-T4 aluminum alloy whose elastic modulus is 71.7GPa and Poisson‟s ratio is 0.33. For the purpose of illustration, the responses from a high fidelity finite 15

ACCEPTED MANUSCRIPT element analysis (FEA) model of the supported beam are considered as the “experimental responses”. In the high fidelity model, the FEA considers the real plastic deformation of the beam and the material law used is based on the plastic stress-strain curve shown in Fig. 10 [31]. Model parameter E is set to the true value E*=71.7GPa in the high fidelity model to produce experimental responses. Each of the experimental responses is set to have a measurement error that follows a Gaussian distribution with mean and coefficient of variation equal to 0 and 0.1, respectively. To test the proposed method, two computational models are validated against the above

CR IP T

“experimental responses” at different validation sites given by the applied force F. The first one is a FEA model of the supported beam which also considers the plastic deformation of the beam, but has an uncertain model parameter E following normal distribution N (E*, (0.01E*)2). The second computational model is a FEA model of the supported beam which not only has an uncertain model parameter E following normal distribution N (E*, (0.01E*)2), but also represents the supported beam using a simplified elastic material law. The five responses shown in Table 7 are correlated due to the shared model input variables and parameters. If the proposed model

AN US

validation assessment method is effective, it should identify that model 1 matches better with the “experiment” than model 2 which uses simplified elastic material law to represent the beam. The validation assessment is conducted at 11 different validation sites F= [1500, 1600, … ,2500]. At each of these validation sites, 5 experimental data are observed by following the randomness of the uncontrollable model inputs. The two computational models are simulated 1000 times respectively by following the randomness of the uncontrollable model inputs as well as

M

the uncertain model parameter E. The model validation assessment result of the proposed method is given in Table 8. The result indicates that the proposed method can correctly differentiate more

ED

accurate model (model 1) from less accurate one (model 2) with respect to the available experimental data for the simply supported beam in engineering. For the engineering model in this example, there are total 55 responses with 5 responses at

PT

each validation site F. In this case, the t-pooling metric in [15] seems to be powerless. This is because that accurately estimating the joint CDF of 5 responses at each validation site is difficult, let alone transforming the observation data according to the resultant joint CDF. However, with

CE

the proposed method, the 55 correlated responses are reduced into 10 uncorrelated PCs which nearly account for all the variability of the outputs. The first three PCs of the observations and

AC

predictions are compared in Fig.11. As shown in Fig.11, the transformed PCs can be easily dealt with by the area metric, respectively. Therefore, compared with the t-pooling metric, the proposed method which does not require the joint CDF of the responses is more applicable for the real engineering models.

7 Some practical issues in the application of the proposed method and future works After several decades‟ development, the formal definition for the terminology validation is moving toward convergence within lots of communities. Many organizations, such as the American Institute of Aeronautics and Astronautics (AIAA), the US Department of Defense (DoD) and American Society of Mechanical Engineers (ASME), now use the same formal definition for 16

ACCEPTED MANUSCRIPT validation. However, there are still important differences in interpretation and implications of the term [32]. Generally, there are two viewpoints underlying interpretation of the term validation [32]. One interpretation is what is called the encompassing view of validation. This interpretation includes three aspects to the term validation: (i) quantification of the accuracy of the computational model results by comparing the computed SRQs of interest with experimentally measured SRQs, (ii) use of the computational model to make predictions, in the sense of interpolation or extrapolation of the model, for conditions corresponding to the model‟s domain of intended use, and (iii) determination of whether the estimated accuracy of the computational

CR IP T

model results satisfies the accuracy requirements specified for the SRQs of interest. Another interpretation is what is called restricted view of validation. It considers each aspect of validation separately. That is, aspect 1 is referred to as validation assessment, model accuracy assessment or model validation. Aspect 2 is referred to as model prediction, predictive capability or model extrapolation. Aspect 3 is referred to as model adequacy assessment or adequacy assessment for intended use. Either interpretation can be used in validation activities related to scientific computing. However, as pointed out in [32] and proved by the experience of many, an

AN US

encompassing view of validation commonly leads to misunderstandings and confusion in discussions and in communication of computational results. The primary reason for this confusion is the dissimilarity between each of the three aspects discussed above.

Misunderstandings and confusion can be particularly risky and damaging, for example, in communication of computational results to system designers, project managers, decision makers and individuals not trained in science or engineering. Besides, different decision makers and

M

different users have widely different demands on a model‟s accuracy. A model could be consistently inaccurate and yet close enough for its purpose. Likewise, even a highly accurate

ED

model might not be good enough if it is needed for some delicate, high-consequence decisions. Analysts, thus, should only determine the accuracy of the model and estimate the uncertainty about a prediction, which may be good or poor given the needs of the decision maker. That is,

PT

analysts should not specify the requirements on accuracy; such requirements are the province of decision makers [9,32]. Therefore, this paper also uses the restricted view of the term validation as is done by the AIAA Guide [4] and many literatures [9,15,32]. That is, model validation refers to

CE

aspect 1 shown earlier: assessment of model accuracy by comparison with experimental data. In response to this quantification of a model‟s accuracy, a decision maker or intended user can make

AC

the adequacy decision, i.e., determination of whether the model is adequate for the intended use. This is a complex issue of subjective judgment. For example, a decision maker might reject the model‟s use for a particular application, or alternatively the decision maker might infer from the model‟s estimated inaccuracy that an engineered system must be designed more conservatively with larger margins of tolerance to account for the poor model performance. How the result of a validation metric relates to an application of interest is a separate and more complex issue, especially if there is significant extrapolation of the model. Although this issue is not addressed in this paper, it is critical to the estimation of computational modeling uncertainty for complex engineering systems. Therefore, when relating the model accuracy assessment result of the proposed method to the real applications, all these issues discussed above should be carefully addressed. 17

ACCEPTED MANUSCRIPT PCA has also been extensively used in Geophysical fields to explore, or to express succinctly, the joint space/time variations of the many variables, such as temperature and pressure, in the data set. The results indicate the power of PCA to uncover meaningful joint relationships among atmospheric (and other) fields in an exploratory setting, where clues to possibly unknown underlying physical mechanisms may be hidden in the complex relationships among several fields [17]. This kind of power of PCA is also employed in this paper to explore the joint relationships among multiple responses in one or more validation sites. Since the number of model predictions in engineering is usually much larger than that of the experimental observations (which means that

CR IP T

the uncertainty and correlation structures of the multiple responses are better captured by the predictions than by the sparse experimental data), the PCA is applied to model predictions in this paper. And then to determine whether the experimental data are generally drawn from the joint distribution that has the same underlying uncertainty and correlation structures as their corresponding joint prediction distribution, each experimental data is transformed by the eigenvectors of the corresponding correlation matrix of predictions. This avoids constructing an impoverished picture about the correlation information that actually exists in the real physical

AN US

system using sparse data. From this point of view, the proposed method is applicable even when the amount of experimental observations is small. In this case, however, the validation results may be dramatically affected by the experimental uncertainty and should be carefully examined. When the predictions of the computational models are also sparse, e.g., a computational model may be extremely expensive to run to extract the uncertainty and correlation information, the analysts can construct only an impoverished picture of what the model is actually predicting. The predictions,

M

and thus the results of PCA, will be accompanied by substantial epistemic uncertainty arising from sampling error. In this case, the risk of underestimating or overestimating the real discrepancy of

ED

the model also should be carefully considered. Future work will address the issue of expressing this and other forms of epistemic uncertainty in the prediction and experimental distribution and accounting for it with the proposed validation metric.

PT

In addition, when the validation metric is estimated based on exhaustively collected experimental observations and model predictions so that there is essentially no sampling uncertainty about the data and prediction distributions, it is clear that the disparity between the

CE

predictions and the data is statistically significant. This disparity cannot be explained by randomness arising from the sampling uncertainty (because there is none), but must rather be due

AC

to inaccuracies in the model. However, when the validation metric is estimated based on a very small sample size of experimental observations, or a small number of model predictions, or both, it is not clear that the disagreement between the predictions and the experimental data is significant, even with a larger validation metric value. The computed discrepancy might be entirely due to the vagaries of random chance that were at play when the observations and model predictions were made. In this case, Ferson and Oberkampf et.al [9] suggested some kinds of statistical tests, such as traditional chi squared test and Neyman‟s smooth tests [33,34] and references therein, to be used to give a context for these validation metric values in order to understand when a value is big and when it is not really so big (but due to sampling uncertainty). Transforming the experimental data at different validation sites by the eigenvectors of the correlation matrix of predictions and pooling all the transformed observations together can 18

ACCEPTED MANUSCRIPT substantially increase the power of the statistical test because the sample size is larger in a single, synthetic analysis. These tests will allow the analysts to formally justify an impression or conclusion that the experimental observations disagree with the model‟s predictions and to identify significant overall failure of the model‟s predictive capability. Finally, for models with correlated multiple responses, although using a single (aggregated) metric for the multiple responses can provide an overall assessment of a model, it will inevitably lose some useful information (such as how accurate is a model in predicting individual response) in the aggregate process. This point has also been recognized in Ref. [15]. Therefore, model

CR IP T

validation assessment for multiple responses should not be used as a substitute, but as a complement of the model validation assessment for individual response. With the individual validation assessment, one can assess the diverse model accuracy across a set of multivariate data. When individual validation indicates that not all the responses match well with the data, it certainly exposes the deficiencies in the model and one must improve the model at that stage. At the same time, it is also important to incorporate the correlation among the model responses in the validation process. Aggregate or multiple responses validation helps to assess the overall „quality‟

AN US

of the computational model by comparing all the model output variables simultaneously accounting for the model and data correlations. Therefore, Validation assessment for individual and multiple responses can complement each other for the real model validation assessment problems. The decision maker can use both types of validation metrics, individual and aggregate, to detect different types of weaknesses in the model.

M

8 Conclusions

The combination of principal component analysis and the concept of area metric leads to a

ED

metric that provides valuable information on model validation assessment in the case of multiple responses. In this paper, we show that the whole model validation metric of the correlated multiple responses can be decomposed into a series of model validation metrics of uncorrelated PCs. These

PT

validation metrics of PCs then can be easily estimated by the existing area metric method. In the numerical and engineering case studies, the new method is compared with the PIT method and the u-pooling method in the existing literatures. The results show that the proposed method can well

CE

consider both the uncertainty and the correlation among multiple responses in the validation assessment process. It could capture the disagreement between predictions and experiments as effectively as the existing PIT method. However, by using PCA, the proposed method does not

AC

require estimating the joint CDF of the multiple responses, which is an essential step in the PIT method. It, thus, has lower computational cost and can be applied to more complex models. Furthermore, with PCA, the proposed method also avoids directly comparing the joint distributions of the computational and experimental responses. This makes it especially suitable for validation assessment of models with high-dimensional responses. The results in the case studies also show that the proposed method not only can be used for validating multiple responses at a single validation site, but also is capable of dealing with the case where observations of multiple responses are collected at multiple validation sites. This allows assessing the overall performance of the model against all the experimental observations in the validation domain. For model validation assessment of single response at multiple validation sites, 19

ACCEPTED MANUSCRIPT the validity of the proposed method is comparable with the existing u-pooling based metric, but it has a more efficient validation assessment process. These features make the proposed method well suited for the validation assessment of models with correlated multiple responses. Finally, it is worth mentioning that this paper uses the restricted view of the term validation, i.e., focusing on assessment of model accuracy by comparison with experimental data. When relating the model accuracy assessment results of the proposed method to the real applications, many issues, such as model extrapolation and adequacy assessment for intended use, need to be carefully addressed.

CR IP T

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. NSFC 51505382) and Natural Science Foundation of Shaan Xi Province (Grant No. 2016JQ1034).

References

[1] Iman RL. A matrix-based approach to uncertainty and sensitivity analysis for fault tree. Risk

AN US

Analysis, 1987; 7:21–33.

[2] Saltelli A. Sensitivity analysis for importance assessment. Risk Analysis, 2002; 22:579–590. [3] Borgonovo E. Measuring uncertainty importance: Investigation and comparison of alternative approach. Risk Analysis, 2006; 26:1349–1361.

[4] AIAA, Guide for the verification and validation of computational fluid dynamics Simulations, American Institute of Aeronautics and Astronautics, 1998. AIAA-G-077-1998.

M

[5] ASME, Guide for verification and validation in computational solid mechanics. American Society of Mechanical Engineers, 2006, ASME Standard V&V 10-2006, New York, NY. [6] Oberkampf WL, Trucano TG, Hirsch C. Verification, validation, and predictive capability in

ED

computational engineering and physics. Applied Mechanics Reviews. 2004; 57:345-84. [7] Oberkampf WL, Barone MF. Measures of agreement between computation and experiment: validation metrics. Journal of Computational Physics 2006; 217(1): 5–36.

PT

[8] Roache PJ. Verification and validation in computational science and engineering, Hermosa Publishers, Albuquerque, NM, 1998.

CE

[9] Ferson S, Oberkampf WL, Ginzburg L. Model validation and predictive capability for the thermal challenge problem. Computer Methods in Applied Mechanics and Engineering 2008; 197: 2408–30.

AC

[10] Liu Y, Chen W, Arendt P, Huang H-Z. Toward a better understanding of model validation metrics. Journal of Heat Transfer-Transactions of the ASME 2011; 133: 071005.

[11] Buranathiti T, Cao J, Chen W, Baghdasaryan L, Xia ZC. Approaches for model validation: methodology and illustration on a sheet metal flanging process. Journal of Manufacturing Science and Engineering-Transactions of the ASME 2006; 128(2): 588–97. [12] Rebba R, Mahadevan S. Model predictive capability assessment under uncertainty. AIAA Journal 2006; 44(10): 2376–84. [13] Mahadevan S, Rebba R. Validation of reliability computational models using Bayes networks. Reliability Engineering & System Safety 2005; 87(1): 223–32. [14] Oberkampf WL, Trucano TG. Verification and validation in computational fluid dynamics. 20

ACCEPTED MANUSCRIPT Progress in Aerospace Sciences 2002; 38(2): 209–72. [15] Li W, Chen W, Jiang Z, Lu ZZ, Liu Yu. New validation metrics for models with multiple correlated responses. Reliability Engineering & System Safety, 2014, 127, 1–11. [16] Jollife JT, Stephenson DB. Forecast verification: A practitioner‟s guide in atmospheric sciences, Wiley, 2003. [17] Wilks DS. Statistical methods in the atmospheric science, 3rd Edition, Academic Press, 2011. [18] Murphy AH, Winkler RL. A general framework for forecast verification, Monthly Weather Review, 1987, 115: 1330-1338. Physics, 2000, 63: 71-116.

CR IP T

[19] Palmer TN. Predicting uncertainty in forecasts of weather and climate, Reports on Progress in [20] Rebba R, Mahadevan S. Validation of models with multivariate output. Reliability Engineering & System Safety. 2006; 91: 861-71.

[21] Jiang X, Mahadevan S. Bayesian validation assessment of multivariate computational models. Journal of Applied Statistics. 2008; 35: 49-65.

[22] Angus JE. The probability integral transform and related results, SIAM Review 1994; 36 (4):

AN US

652–654.

[23] Jolliffe IT. Principal component analysis. Berlin: Springer-Verlag; 2002. [24] Anderson TW. An introduction to multivariate statistical analysis. 3rd ed. New York: Wiley & Sons; 2003.

[25] Saporta G. Probabilite ´, Analyse des Donne ´es et Statistique, 2nd ed. Paris: Technip; 2006. [26] Besse P. PCA stability and choice of dimensionality. Statistics & Probability Letters 1992; 13:

M

405–10.

[27] Kennedy MC, O‟Hagan A. Bayesian calibration of computer models. Journal of the Royal

ED

Statistical Society Series B-Statistical Methodology 2001; 63(3): 425–64. [28] Augustin JC, Carlier V. Modelling the growth rate of Listeria monocytogenes with a multiplicative type model including interactions between environmental factors. International

PT

Journal of Food Microbiology 2000; 56: 53–70. [29] Augustin JC, Zuliani V, Cornu M, Guillier L. Growth rate and growth probability of Listeria monocytogenes in dairy, meat and seafood products in suboptimal conditions. Journal of

CE

Applied Microbiology 2005; 99: 1019–1042.

[30] Lamboni M, Sanaa M, Tenenhaus-Aziza F. Sensitivity analysis for critical control points

AC

determination and uncertainty analysis to link FSO and process criteria: Application to Listeria monocytogenes in soft cheese made from pasteurized milk. Risk Analysis 2014, 34(4): 751-764.

[31] Song DL, Li Y, Luo B. An effective mathematical modeling for and simulation analysis of flush rivet pressing force of CFRP /Al components. Journal of Northwestern Polytechnical University, 2012, 30(4): 558-564 (in Chinese). [32] Oberkampf WL, Roy CJ. Verification and validation in scientific computing. Cambridge University Press, 2010. [33] Stephens MA. EDF statistics for goodness of fit and some comparisons, J. Am. Statist. Associat. 1974, 69: 730-737. [34] Rayner GD, Rayner JCW. Power of the Neymansmooth tests for the uniform distribution, J. 21

ACCEPTED MANUSCRIPT

AC

CE

PT

ED

M

AN US

CR IP T

Appl. Math. Decision Sci. 2001, 5 (3): 81-191.

22

1

1

F m  y

S ne  y 

y

AN US

F m  y

0

CR IP T

ACCEPTED MANUSCRIPT

S ne  y 

0

AC

CE

PT

ED

M

Fig.1 Example area metric at single validation site, with n = 3 on the left and n = 1 on the right

23

y

AN US

CR IP T

ACCEPTED MANUSCRIPT

1

1

u3

u3

ED

1

u2

u2

0

0

1/3

2/3

CDF of uniform distribution on [0,1] ECDF of transformed data ui

y1e

y2e

y3e

y

CDFs of model response at 3 validation sites

PT

yie:

Experimental data at 3 validation sites

Fig. 2 illustration of the u-pooling based metric

AC

CE

F1m  y

u1

M

u1

F3m  y 

F2m  y

24

Computational Model yim ( x , z , θ ), i  1,

CR IP T

ACCEPTED MANUSCRIPT

Physical Experiment yie ( x , z ), i  1,

,d

Simulated Responses Matrix Y m with Covariance Matrix Σ



Observed Data Matrix Y e

Σ  k 1 k φk φkT PCA PCs hkm  Y m φk , k  1,..., dp dp Contribution Rates Rk  k / j1j , k 1,..., dp d

Data Transformation

AN US

hke  Y e φk , k  1,..., dp

Number of Considered PCs K

hkm , k  1, 2,..., K Fkm  h  , k  1, 2,..., K

dk  F , S   





ED

Area Metric for PCs

Rk  85%

hke , k  1, 2,..., K Ske  h  , k  1, 2,..., K

Fkm  h   Ske  h  dh, k  1, 2,..., K K

Aggregative Metric for Multiple Responses

d  F , S    d k  F , S   Rk k 1

Fig. 3 illustration of the proposed method

AC

CE

PT

K

k 1

ECDF of

M

CDF of



25

,d

CR IP T

ACCEPTED MANUSCRIPT

PC1(75.1%)

PC2(24.9%)

1

1 A = 0.002

0.8

0.8

0.6

0.6

CDF

CDF

A = 0.002

0.4 0.2 0 -2

-1.5

-1

h1

-0.5

0.4 0.2

PC1 of Model 1 Transformed observation 0 0.5

0 -1

-0.8

-0.6

PC1(75.0%) 1

1 A = 0.139

AN US

A = 0.139

-0.2

0.8

CDF

0.6 0.4 0.2 0 -2

-1.5

-1

h1

-0.5

PC1(79.8%)

0.6 0.4 0.2

PC1 of Model 2 Transformed observation 0 0.5

0 -1

-0.8

-0.6

-0.4

h2

PC2 of Model 2 Transformed observation 0 0.2 0.4

-0.2

PC2(20.2%)

1

1

A = 0.151

A = 0.153

0.8

0.8

M

0.6 0.4 0.2 -1

-0.5

h1

0

0.6 0.4 0.2

0 -0.5

ED

0 -1.5

PC1 of Model 3 Transformed observation 0.5 1

CDF

CDF

h2

PC2(25.0%)

0.8

CDF

-0.4

PC2 of Model 1 Transformed observation 0 0.2 0.4

0

0.5

h2

1

PC2 of Model 3 Transformed observation 1.5 2

Fig. 4 Comparison between the PCs of the predictions and the observations at single validation site z=2.0 for

each of the three predictive models.

AC

CE

PT

Test 1; 1000 experimental observations are produced by Eq. (9) and 10000 simulated responses are generated by

26

PC1 of Model 1 Transformed observation

PC1(1.3%)

1

A = 0.092 0.8

0.6

0.6

0.4

0.4 0.2

0 -2

-1.5

-1

-2.5

1

0 -2.5

-1.5 -1 h2 -0.5 PC2 of Model 2 Transformed observation

1

0.4 0.2

0.4 0.2

PC1(1.3%)

0 0.5 h1 1 PC1 of Model 3 Transformed observation

-1

-0.5

PC2(1.3%)

1

0 -3

0 0.5 h2 1 PC2 of Model 3 Transformed observation

PC3(1.3%)

ED

CDF

0.6

CDF

0.8

0.6 0.4 0.2

-0.5

0

0.5

h1

1

0

-1.5

-1 -0.5 h3 0 PC3 of Model 3 Transformed observation

0.4 0.2

0.5

1

1.5

2

2.5

h2

3

0 0

PT

-1

0

-2

A = 0.432

0.8

0 -1.5

-2.5

1

0.6

0.2

A = 0.441

0.4

0.8

0.4

-1 -0.5 h3 0 PC3 of Model 2 Transformed observation

0.6

A = 0.337

A = 0.272

-1.5

0.2

0 -1.5

M

-0.5

-2 PC3(1.3%)

0.8

CDF

0.6

CDF

0.6

1

0.4

A = 0.214 0.8

-1

-2

PC2(1.3%)

A = 0.371 0.8

0 -1.5

0.6

0.2

0 -3

-0.5 0 h1 0.5 PC1 of Model 2 Transformed observation

PC1(1.3%)

1

CDF

A = 0.112

0.8

AN US

0.2

PC3 of Model 1 Transformed observation

PC3(1.2%)

1

CDF

0.8

CDF

CDF

A = 0.199

CDF

PC2 of Model 1 Transformed observation

PC2(1.3%)

1

CR IP T

ACCEPTED MANUSCRIPT

0.5

1

1.5

h3 2

Fig. 5 Comparison between the PCs of the predictions and the observations at 100 validation sites for Test 1;

are generated by each of the three predictive models.

AC

CE

at each validation site, 10 experimental observations are produced by Eq. (9) and 1000 simulated responses

27

CR IP T

ACCEPTED MANUSCRIPT

PC1(72.6%)

PC2(27.4%)

1

1 A = 0.015 0.8

0.6

0.6

CDF

0.8

0.4 0.2 0 -2

-1.5

-1 h1 PC1(76.8%)

0 -0.8

-0.6

-0.4

-0.2

0 h2 PC2(23.2%)

PC2 of Model 4 Transformed observation 0.2 0.4 0.6

1

A = 0.011

A = 0.091

0.8

CDF

0.6 0.4 0.2 0 -2.5

-2

-1.5

-1 h1

0.6 0.4 0.2

PC1 of Model 5 Transformed observation -0.5 0 0.5

M

CDF

0.2

PC1 of Model 4 Transformed observation -0.5 0

1 0.8

0.4

AN US

CDF

A = 0.019

0 -0.6

-0.4

-0.2

0

0.2 h2

PC2 of Model 5 Transformed observation 0.4 0.6 0.8 1

ED

Fig. 6 Comparison between the PCs of the predictions and the observations at single validation site z=2.0 for Test 2; 1000 experimental observations are produced by Eq. (9) and 10000 simulated responses are generated by

AC

CE

PT

each of the two predictive models.

28

PC1 of Model 4 Transformed observation

PC1(1.4%) 1

PC2 of Model 4 Transformed observation

PC2(1.4%) 1

A = 0.165

0.6

CDF

0.6

0.8

CDF

0.6 0.4

0.4

0.2

0.2

-1

0 h1

1

0 -3

2

PC1 of Model 5 Transformed observation

PC1(1.9%) 1

-2.5

-2

-1.5 h2

A = 0.430

-1

-0.5

0

0.2

1

2

3

0 -4

2.5

PC3 of Model 5 Transformed observation

PC3(1.8%) A = 0.347

-3

-2

-1

0

1

h2

0

0

1

2

3

4

h3

Fig. 7 Comparison between the PCs of the predictions and the observations at 100 validation sites for Test 2; at each validation site, 10 experimental observations are produced by Eq.(9) and 1000 simulated responses are generated by each of the two predictive models.

AC

CE

3

0.2

PT

h1

2

0.4

0.2

0

1.5 h3

0.6

0.4

ED

0.4

1

0.8

M

CDF

0.6

CDF

0.6

0.5

1

A = 0.400 0.8

-1

0

0

PC2 of Model 5 Transformed observation

PC2(1.8%) 1

0.8

0 -2

0.2

CDF

0 -2

0.4

AN US

CDF

0.8

PC3 of Model 4 Transformed observation

PC3(1.3%)

1

A = 0.205

A = 0.152

0.8

CR IP T

ACCEPTED MANUSCRIPT

29

5

CR IP T

ACCEPTED MANUSCRIPT

PC1(86.0%)

PC2(14.0%)

1

1 A = 3.050

A = 0.919

0.6

0.6 CDF

0.8

CDF

0.8

0.4

AN US

0.4 0.2

0.2

0 -20

-10

0 h1

PC1 of Model 1 Transformed observation 10 20

PC1(87.5%)

0

5

10 h2

PC2 of Model 1 Transformed observation 15 20

PC2(12.5%)

1

1

A = 0.577

M

0.8

0.4 0.2

-10

-5

0.2

PC1 of Model 2 Transformed observation 5 10 15 20

0

0.6 0.4

ED

CDF

0.6

0.8

CDF

A = 4.166

0 -15

0

0

0

PT

h1

2

4

PC2 of Model 2 Transformed observation 8 10 12 14

6 h2

L.monocytogenes model

AC

CE

Fig. 8 Comparison between the PCs of the predictions and the observations at multiple validation sites for

30

CR IP T

ACCEPTED MANUSCRIPT

t

AN US

F

L/2

L/2

w

AC

CE

PT

ED

M

Fig.9 The simply supported beam

31

h

ED

M

Stress σ /MPa

AN US

CR IP T

ACCEPTED MANUSCRIPT

Plastic strain /ԑ

AC

CE

PT

Fig. 10 Plastic stress-strain curve of 2117-T4 aluminum alloy [31]

32

PC1(9.9%)

PC2(9.6%)

1

PC3(9.2%)

1 A = 1.045

1

A = 0.844

0.6

0.6

0.4

CDF

0.6

A = 1.223

0.8

CDF

0.8

CDF

0.8

0.4

0.4

0.2

0 -10

PC1 of Model 1 Transformed observation 0 5

-5

0.2

0 -10

-5

h1

PC1(11.0%)

PC2 of Model 1 Transformed observation 5 10

AN US

0.2

0 h2

1 A = 1.114

-5

0.6

0.4

0.2

0.4

M

0.4

A = 1.288

CDF

0.6

0.2

h1

0 -10

-5

ED

PC1 of Model 2 Transformed observation 0 5

PC3(9.1%)

0.8

CDF

0.6

CDF

0.8

0 h3

PC3 of Model 1 Transformed observation 5 10

1

A = 0.888

0.8

-5

0 -10

PC2(9.6%)

1

0 -10

CR IP T

ACCEPTED MANUSCRIPT

0.2

0 h2

PC2 of Model 2 Transformed observation 5 10

0 -10

-5

0 h3

PC3 of Model 2 Transformed observation 5 10

Fig. 11 Comparison of the PCs of the predictions and the observations at multiple validation sites for the

AC

CE

PT

simply supported beam

33

ACCEPTED MANUSCRIPT Table 1 Formulas of the predictive (computational) models in two test cases Models

Model 1

Test 1

Model 2

Model 3

Formulas

Model descriptions

 y1m1 ( x )  y1e ( x,  1.5) , 1 , 2  0.5  m1 e  y2 ( x )  y2 ( x,  1.5)

Exactly as the experimental data source

 y1m2 ( x )  y1e ( x,  1.2) , 1 , 2  0.5  m2 e  y2 ( x )  y2 ( x,  1.2) m e  y1 3 ( x )  y1 ( x,  1.2) , 1 , 2  0.6  m3 e  y2 ( x )  y2 ( x,  1.2)

Model parameter is incorrect

Both model parameter and

CR IP T

Tests

correlation

coefficient

are

incorrect

Model 4

 y1m4 ( x )  y1e ( x, ~ N (1.5,0.22 )) , 1 ,2  0.5  m4 e 2  y2 ( x )  y2 ( x, ~ N (1.5,0.2 ))

Model 5

is exact, uncertainty is smaller

AN US

Test 2

Mean of the model parameter

m e 2  y1 5 ( x )  y1 ( x, ~ N (1.5,0.4 )) , 1 ,2  0.5  m5 e 2  y2 ( x )  y2 ( x, ~ N (1.5,0.4 ))

Mean of the model parameter is exact, uncertainty is larger

Table 2 Model validation results by different methods for different models in Test 1 Methods/Models

z=2.0 Aggregated results of Fig 4

Model 2

Model 3

Proposed method

0.0021

0.1387

0.1514

PIT area metric

0.0089

0.1052

0.1837

Proposed method

0.0856

0.1647

0.1851

t-pooling metric

0.0231

0.0850

0.1458

ED

PT

z~U(0,6) Aggregated results of Fig 5

Model 1

M

Validation sites

CE

Table 3 Model Validation results by different methods for different models in Test 2

Validation sites

AC

z=2.0 Aggregated results of Fig 6 z~U(0,6) Aggregated results of Fig 7

Methods/Models

Model 4

Model 5

Proposed method

0.0175

0.0728

PIT area metric

0.0284

0.0381

Proposed method

0.1079

0.2246

t-pooling metric

0.0517

0.0812

34

ACCEPTED MANUSCRIPT Table 4 Input factors of the simple risk model and their probability distributions Description

Unit

Probability Distribution

Temp 1

Temperature of first period (days 1-7)

o

C

Truncated normal N(10,2) on [-1.72,45.5]

Temp 2

Temperature of second period (days 7-14)

o

C

Truncated normal N(4,2) on [-1.72,45.5]

Temp 3

Temperature of third period (days 14-28)

o

C

Uniform U(4,17)

C0

Initial concentration

log CFU/g

Uniform U(-4,1)

CR IP T

Input

Table 5 Model Validation results by different methods for different L.monocytogenes model Methods/Models

Model 1

Model 2

t= (1~28)

Proposed method

2.7520

3.7185

u-pooling metric

0.0593

0.0658

AN US

Validation sites

Table 6 Distribution information of the geometric parameters of the simply supported beam Distributions

L

Normal

w

Normal

h

Normal

t

Normal

Means (mm)

Coefficients of variation

1600

0.01

40

0.01

50

0.01

2

0.01

ED

M

Parameters

Table 7 Responses measured in the simply supported beam.

AC

CE

PT

Responses

Description

y1

Stress at the midpoint of the beam (MPa)

y2

Strain at the midpoint of the beam (mm)

y3

Displacement at the middle of the beam (mm)

y4

Angle of deflection at the end of the beam (radians)

y5

Internal energy of the beam (Joules)

Table 8 Model Validation result by the proposed method for different simply supported beam models Methods/Models Proposed Method

35

Model 1

Model 2

0.8024

0.9593