Comput. Methods Appl. Mech. Engrg. 197 (2008) 2390–2407

Thermal problem solution using a surrogate model clustering technique

Mark D. Brandyberry

University of Illinois at Urbana-Champaign, Computational Science and Engineering Department, Urbana, IL 61801, United States

Received 14 May 2007; received in revised form 25 May 2007; accepted 25 May 2007. Available online 23 December 2007.
Abstract

The thermal problem defined for the validation challenge workshop involves a simple one-dimensional slab geometry with a defined heat flux at the front face, adiabatic conditions at the rear face, and a provided baseline predictive simulation model to be used to simulate the time-dependent heatup of the slab. This paper will discuss a clustering methodology using a surrogate heat transfer algorithm that allows propagation of the uncertainties in the model parameters using a very limited series of full simulations. This clustering methodology can be used when the predictive model to be run is very expensive and only a few simulation runs are possible. A series of time-dependent statistical comparisons designed to validate the model against experimental data provided in the problem formulation will also be presented, and limitations of the approach discussed. The purpose of this paper is to present methods of propagation of uncertainty with limited computer runs, validation with uncertain data, and decision-making under uncertainty. The final results of the analysis indicate that there is approximately 95% confidence that the regulatory criteria under consideration would be failed, given the high level of physical data provided.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Uncertainty; Validation; Clustering; Decision analysis; Hypothesis testing
1. Introduction

The thermal challenge problem presented in [1] describes a simple one-dimensional slab of material that will be used in an application where it is exposed to a constant heat flux on its front face, and is insulated (adiabatic) at the rear face. See discussions in Ref. [1] for a detailed description of the problem and experimental data provided. A time-dependent simulation model for the temperature of the slab is also provided along with the physical description of the problem in order to establish a baseline model for all participants in the workshop to use. The model is a simple truncated series solution that assumes a constant thermal conductivity (k) and volumetric heat capacity (ρCp, where ρ is density and Cp is the material specific heat). This model is intended to represent some long-running simulation code that would be used for this application. It is a function of the two variables above, as well
as time, the incident heat flux, the thickness of the slab, and the perpendicular distance into the slab. The goal for the use of this simulation model is to assess, for some given surface heat flux, whether the slab of material will pass a provided "regulatory criteria". It is to be assumed that the slab of material models some safety-critical device in a technological system, and it is the job of the analyst, using the model and data given in the problem, to assess whether or not the device meets the regulatory requirement. The requirement is simply that 99% of the time, the front surface temperature, 1000 s after application of the heat flux, will not be greater than 900 °C. This can be expressed as Eq. (1), with Ts being the front surface temperature and t being the time after initial application of the surface heat flux:

$$P\big(T_s(t = 1000\ \mathrm{s}) > 900\ ^{\circ}\mathrm{C}\big) < 0.01 \qquad (1)$$
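As a concrete illustration of Eq. (1), the sketch below estimates the exceedance probability from an array of sampled 1000 s front-surface temperatures and compares it with the 0.01 limit. The helper name and the sample values are hypothetical; only the 900 °C threshold and the 0.01 probability limit come from the problem statement, and a handful of samples obviously cannot resolve the 0.01 level, which is the motivation for the sampling strategies discussed later in the paper.

```python
import numpy as np

def regulatory_check(ts_1000s, threshold_c=900.0, prob_limit=0.01):
    """Estimate P(Ts(1000 s) > threshold) from sampled temperatures and test Eq. (1)."""
    ts_1000s = np.asarray(ts_1000s, dtype=float)
    p_exceed = float(np.mean(ts_1000s > threshold_c))   # empirical exceedance probability
    return p_exceed, p_exceed < prob_limit

# Hypothetical sampled 1000 s front-surface temperatures (deg C)
samples = [862.0, 901.5, 874.3, 888.0, 915.2, 879.9, 869.4, 905.1]
p, passes = regulatory_check(samples)
print(f"P(Ts > 900 C) ~ {p:.3f}; criterion satisfied: {passes}")
```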
A table of verification data is provided in Ref. [1] with which to check the implementation of the provided simulation model. This data has been used to verify the implemented model by coding the supplied simulation
equation provided in Ref. [1], and running it with the verification case data. The implementation is thus considered verified for the purposes of this paper. The focus of this work will be on uncertainty modeling and propagation, validation, and use of the computational model for decision-making in the regulatory application. Section 2 describes analysis of the physical property data and the process of assigning probability distributions to the thermal conductivity and volumetric heat capacity variables. In Section 3, the experimental data labeled "ensemble validation data" in Ref. [1] is analyzed, using the required computational model and the property data from Section 2. Section 4 analyzes the "accreditation validation data" from Ref. [1] against the computational model. Section 5 provides a summary of the validation activities from Sections 2–4 to set up use of the computational model in the regulatory situation. In Section 6, the surrogate-clustering method for propagating uncertainty distributions of the
Fig. 1. Thermal conductivity and volumetric heat capacity versus temperature for the low (top), medium (center) and high (bottom) levels of available experimental data. Fit lines on volumetric heat capacity graphs are linear and second order polynomial.
physical parameters through the computational model using only 10 evaluations of the computational model is presented, and the regulatory results using this method are discussed. Section 7 discusses a simplistic method for factoring in the apparent temperature-dependent model bias that is seen during the validation activities in Sections 2–4, and Section 8 describes the conclusions of the work.

2. Analysis of physical property data

The thermal challenge problem provides data on the measured thermal conductivity and volumetric heat capacity of samples of the slab material. Three levels of data are provided: six (low), 20 (medium), or 30 (high) samples, at either three (low) or five (medium, high) different temperatures. All three levels of data are shown plotted in Fig. 1. The reference simulation model assumes temperature-independent values of the physical parameters. There are several goals for an analysis of the physical property data: (1) to ascertain if the data is, in fact, independent of temperature, (2) to define the uncertainty in the data, and (3) to decide, if the data is not independent of temperature, what will be done about it, since the reference model cannot accept temperature-dependent data. This section will attempt to answer the first two of these questions.

2.1. Thermal conductivity

As discussed in Section 1, the reference simulation model for this project requires the use of a constant value of the thermal conductivity (k) of the slab material. As can be seen from Fig. 1, the data is not obviously temperature independent for any of the three levels of data. The solid lines in the figures show a linear regression through the data for each level of experimental data. The problem statement directs the analyst to assume that "diagnostic variability or uncertainty" is negligible, and thus the variation shown must be purely from differences in the samples themselves. All three of the levels of available experimental data would result in the conclusion that the thermal conductivity is temperature dependent, based on the significant slope of the fit lines. Without changing the provided simulation model, however, the temperature dependence of k cannot be modeled directly. For this work, the supplied simulation model is to be considered a complex computer model with an inaccessible implementation. Thus, even though the thermal conductivity shows an obvious linear trend with temperature, it will need to be assumed to be temperature independent for use in the reference simulation model. This leads to the need to define the uncertainty in the thermal conductivity in terms of the available reference simulation model, as well as the experimental data. The goal of an uncertainty analysis should be to estimate the uncertainty in the dependent variable (in this case, the front surface temperature of the slab at 1000 s) as a best-estimate [2]. Conservatisms or non-conservatisms are both inappropriate for a best-estimate
uncertainty assessment. A case could be made for using all the thermal conductivity data at each level of available experimental data, since at some point during any simulation, the temperature passes through all temperatures between 25 °C and somewhere near 1000 °C. However, this would lead to a very large variance for the assessed probability distribution for the thermal conductivity, which is obviously incorrect at any specific temperature. Using data at 25 °C (i.e., low values of k) would lead to faster heat-up of the face of the sample than should happen (since the heat could not be transferred into the slab as quickly), and would thus be conservative. Using high values of k (at 1000 °C) would lead to faster than realistic heat transfer through the slab, and a resulting non-conservatism. In order to develop an uncertainty representation that, to first order is neither conservative nor non-conservative, data at 500 °C for each level of available experimental data will be used to assess the mean of a probability distribution for the thermal conductivity. Table 1 shows the calculated means at 500 °C for each level of data. To assess the variability of k, the variance of the provided experimental data is calculated at each temperature at which data was provided. Table 1 also shows the variance of k for each level of data, at each temperature for which data were available. The maximum variance for all temperature levels for a specific level of available data is being coupled with the mean k at 500 °C by making the assumption that the variance of the data at any specific temperature is a constant, and that the maximum observed value is the most representative. For the medium and high levels, this appears to be an assumption that is warranted by the data (the variability for different temperatures is fairly consistent, at least in the upper half of the temperature range), and thus for each of these available sets of experimental data, an analyst might reasonably make this assumption. For the low level of data (taken by itself), the evidence would support a conclusion that the variance of k is growing with temperature. There is so little data, however, that taking the maximum observed variance to describe the data would also be a supportable conclusion. Even for the high level of data, the higher temperatures appear to have a larger variance. However, since we are, for the most part, interested in temperatures above 500 °C, and the variance data for both medium and high data levels are both non-monotonically increasing above this temperature, then again, taking the maximum value is a reasonable model. Thus, for all three levels of available
Table 1
Derived mean and variance for thermal conductivity data distributions

Level of data   Mean k at 500 °C   Var at 20 °C   Var at 250 °C   Var at 500 °C   Var at 750 °C   Var at 1000 °C
Low             0.0574             5.8E-06        N/A             1.6E-06         N/A             1.4E-04
Medium          0.060              9.4E-06        2.0E-05         1.5E-05         1.0E-06         4.9E-05
High            0.0626             5.8E-06        1.5E-05         3.1E-05         8.4E-06         5.3E-05
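The Table 1 quantities follow mechanically from the raw property measurements. The sketch below assumes a hypothetical data layout (`k_data`, mapping each measurement temperature to the measured conductivities for one level of data); only the recipe itself — mean taken at 500 °C, variance taken as the maximum sample variance over the measurement temperatures — comes from the text.

```python
import numpy as np

# Hypothetical measured thermal conductivities, keyed by measurement temperature (deg C)
k_data = {
    20:   [0.0500, 0.0520, 0.0490, 0.0515, 0.0505, 0.0495],
    250:  [0.0550, 0.0570, 0.0540, 0.0585, 0.0560, 0.0575],
    500:  [0.0610, 0.0640, 0.0600, 0.0655, 0.0620, 0.0631],
    750:  [0.0660, 0.0700, 0.0680, 0.0640, 0.0690, 0.0672],
    1000: [0.0720, 0.0760, 0.0700, 0.0780, 0.0745, 0.0710],
}

# Mean taken at 500 C (neither conservative nor non-conservative, per Section 2.1)
k_mean = np.mean(k_data[500])

# Sample variance at each temperature; the maximum over temperatures is adopted
variances = {T: np.var(vals, ddof=1) for T, vals in k_data.items()}
k_var = max(variances.values())

print(f"mean k at 500 C = {k_mean:.4f}")
for T, v in sorted(variances.items()):
    print(f"  var at {T:4d} C = {v:.2e}")
print(f"adopted variance = {k_var:.2e}")
```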
physical data, Table 1 describes the moments of the distributions of k that will be used in this analysis (the 500 °C k as the mean, and the variance of k at 1000 °C as the variance). A truncated normal distribution (truncated at the 0.001 and 0.999 percentiles of the full normal distribution) will be assumed in each case. This truncation is forced by the software that is being used to generate the Latin Hypercube sample, as well as the fact that the physical parameters would be represented by non-physical values if the full normal distribution were used (ranging to plus and minus infinity).

2.2. Volumetric heat capacity

Fig. 1 also shows plots for the available experimental data to be used to characterize the volumetric heat capacity (ρCp) of the slab material. For the low level of data, from just visual observation, there appears to be a temperature dependence to the data. In fact, a non-linear response could be inferred, since the data peaks at 500 °C, although that effect is not as supportable given the small amount of data available. The mean heat capacity changes approximately 20% between the 20 °C data and either the 500 or 1000 °C data. The 500 and 1000 °C data have approximately the same mean, so it could be concluded that the 20 °C data is not as representative in the temperature range of interest. For the low level of data, a model similar to that taken for the thermal conductivity will be used: the mean of the data at 500 °C will be used, as well as the maximum variance seen for any of the temperature levels (which is also the 500 °C data). The medium and high levels of data are not as supportive of a temperature-dependent ρCp, however. The medium level shows data that is still slightly peaked near 500 °C, but the spread in the data is large at all levels, and a linear fit through the data across temperatures is almost horizontal. This lack of temperature dependence is even slightly more obvious at the high level of data. Thus, for either the medium or high levels, the model used will be to consider the data to be temperature independent and, since the samples are defined to be independent, to use all the data in each case as an empirical sample from the distribution of ρCp. By using the data directly to build the cumulative distribution function (CDF), there will be no external distributional assumptions introduced. Table 2 shows the mean and maximum variance for the low level of data, and the mean and variance of the overall set of data for the medium and high levels of data.
Table 2
Derived mean and variance for volumetric heat capacity data distributions

Level of data   Mean^a      Variance    Distribution
Low             431,000     8.82E+08    Truncated normal
Medium          402,000     1.56E+09    Empirical
High            394,000     1.31E+09    Empirical

^a At 500 °C for the low level; across all temperatures for medium and high.
A normal distribution will be used for the low level, but the raw data will be used to form an empirical CDF for the medium and high levels of data.

3. Benchmark ensemble validation

Now that distributions for the problem input parameters have been derived, a brute-force approach to estimating the distributions of the model output would be to use Monte Carlo sampling to propagate the input distributions through the model and thus generate a detailed sample of the predicted temperatures in the slab. Under the assumption that the estimation model is a long-running, computationally intensive simulation tool, however, it is unlikely that enough samples could be run to generate an adequate output distribution, especially in the distribution tails. Stratified sampling methods such as Latin Hypercube Sampling (LHS) [3] could be used to decrease the actual number of samples required to adequately assess the temperature probability distributions. If the computational model will only allow for a few (e.g., a maximum of a few tens of) runs, however, even LHS might prove too costly. Uncertainty estimation characterizing the results from computer code simulations is not new. A significant body of work has been published based on work performed for risk assessments in the nuclear power industry [4–8]. Those uncertainty analysis techniques have been used in other industries and government venues as well [9,10]. The basic LHS propagation techniques discussed in this section follow methodologies presented in that previous work. To provide a benchmark for the rest of this paper, a 500-member LHS sample has been produced for each of the low, medium, and high levels of physical data (using the distributions derived in Section 2) and then run through the benchmark model for each level of data. These runs will serve as the comparison benchmark for lower levels of data and for the run-minimizing technique discussed in Section 6. A 500-member LHS sample provides good coverage out to probabilities of 0.002 in the tails of the distributions. This resolution is potentially not fine enough, given that the regulatory criteria is defined at the 0.01 level, but as will be seen in Section 6, this number can be easily increased with minimal computational cost.

3.1. High data levels

The definition of the thermal challenge problem provides four different sets of experimental data at four different physical conditions, representing experiments performed for slab/heat flux combinations for which it was postulated that it would be possible to do multiple experiments. These are termed the ensemble validation experiments. There are again three levels of data available – no replicates for the low level, two replicates for the medium level, and four replicates for the high level for each of the four conditions.
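A 500-member LHS sample of the two uncertain inputs can be generated along the following lines. This is a sketch using scipy, not the sampling software actually used for the paper; the truncated-normal parameters are the high-level values from Tables 1 and 2 (mean k = 0.0626, variance 5.3E−05), the truncation is at the 0.001/0.999 percentiles described in Section 2.1, and `rho_cp_samples` is a hypothetical stand-in for the 30 pooled volumetric heat capacity measurements that define the empirical distribution.

```python
import numpy as np
from scipy.stats import norm, qmc, truncnorm

K_MEAN, K_VAR = 0.0626, 5.3e-05                    # high level of data (Table 1)
k_std = np.sqrt(K_VAR)

# Truncation at the 0.001 and 0.999 percentiles of the full normal (Section 2.1)
z_lo, z_hi = norm.ppf(0.001), norm.ppf(0.999)
k_dist = truncnorm(a=z_lo, b=z_hi, loc=K_MEAN, scale=k_std)

# Hypothetical stand-in for the pooled volumetric heat capacity samples (J/m^3-C)
rho_cp_samples = np.array([3.5e5, 3.7e5, 3.8e5, 3.9e5, 4.0e5, 4.1e5, 4.3e5, 4.5e5])

# 500-member Latin Hypercube sample over the two uncertain inputs
sampler = qmc.LatinHypercube(d=2, seed=1)
u = sampler.random(n=500)                          # stratified uniforms on (0, 1)

k_sample = k_dist.ppf(u[:, 0])                     # invert the truncated normal CDF
rho_cp_sample = np.quantile(rho_cp_samples, u[:, 1])   # invert the empirical CDF

inputs = np.column_stack([k_sample, rho_cp_sample])    # one (k, rho*Cp) pair per model run
print(inputs[:3])
```

Each row of `inputs` would then be supplied to one run of the (expensive) reference simulation model.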
Fig. 2 shows time-temperature results for the four experimental conditions for the ensemble validation experiments (termed configurations 1–4), with all four replicate experiments (high level of data) plotted against the benchmark model result envelopes produced by running the high level of physical data distributions with 500 LHS sample members. The ninety-ninth and first percentile curves are plotted only to represent the extrema of the model's variability based on the input. Percentile curves such as the fifth and ninety-fifth percentiles might also be plotted, but were not needed for the display in Fig. 2. From a "viewgraph norm" visual perspective, the experimental data shown on the plots (Exp 1–4 for each configuration) all lie well within the result bands produced by the physical data distributions run through the benchmark model. Configurations 3 and 4 (the high heat flux samples) appear to be modeled more closely than the lower heat flux samples (configurations 1 and 2), in terms of the experimental data being closer to the mean curve of the simulation distribution. This is easily understood by considering the fact that the lower heat flux samples never get much above 250–300 °C, and thus the thermal conductivities that we are using are too high for this range. Thus, it is expected that the simulation results will be low compared to the experimental results, which, upon examination of the graphs for configurations 1 and 2 in Fig. 2, appears to be the case, albeit not for all the data. A better test of "agreement" than simple visual comparison of curves on a graph is desirable for making decisions of whether the simulation distributions "match" (in some
sense) the experimental results. For this work, a simple time-sequenced means test is employed. At each time during the simulation/experiment, as shown in the graphs in Fig. 2, there is a set of four experimental results and a set of 500 simulated results. Each of these sets of results, at each time, can be considered a sample from the distribution of the simulation/experiment at that time. As such, we can use these samples to ascertain whether the mean values for each of these distributions (simulation and experiment), at each of the times (100–1000 s), can be considered to be statistically equivalent or not. If the mean of the simulation distribution is statistically the same as the mean of the experimental distribution, then the two sets of results may be considered to have indistinguishable means (given the available data). The hypothesis testing results discussed below really imply that there is insufficient data with which to conclude that the means are different. However, if the data at hand is all the data that you are going to get, and if a conclusion of validity or invalidity is required given that data, then this result can be considered to show agreement between the model and experiment, and not just "lack of disagreement". Thus, for the purposes of this work, the model will be considered to be providing valid results. Given the null hypothesis that the means of the two distributions (simulation/experiment) are the same, a simple 2-tailed t-test comparison has been performed, with results as shown in Table 3 for configuration 4 in the graph shown in Fig. 2. The analysis uses an α value of 0.05, which represents the probability of rejecting the hypothesis that the
Fig. 2. Ensemble validation calculations using high levels of physical and ensemble data, with 500 model runs.

Table 3
Simulation and experiment distribution means comparison for ensemble configuration four, high level of data (t-test for equality of means)

Time        t        df    Significance level (2-tailed)   Mean difference   Std. error difference   95% CI lower   95% CI upper
Time_100    -1.862   502   .063                            -10.04966         5.39647                 -20.65211      .55278
Time_200    -1.757   502   .080                            -13.40515         7.63129                 -28.39834      1.58805
Time_300    -1.584   502   .114                            -14.80037         9.34601                 -33.16247      3.56174
Time_400    -1.427   502   .154                            -15.39717         10.79132                -36.59889      5.80455
Time_500    -1.279   502   .202                            -15.42474         12.06322                -39.12536      8.27588
Time_600    -1.144   502   .253                            -15.11581         13.20997                -41.06944      10.83782
Time_700    -1.015   502   .310                            -14.47958         14.25926                -42.49477      13.53561
Time_800    -.889    502   .374                            -13.54325         15.23001                -43.46567      16.37917
Time_900    -.768    502   .443                            -12.39180         16.13581                -44.09385      19.31024
Time_1000   -.642    502   .521                            -10.91375         16.98846                -44.29100      22.46349
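The Table 3 entries are the kind of output produced by an independent-samples t-test applied separately at each output time. The sketch below uses scipy rather than the SPSS package cited in the text, and the `sim_temps`/`exp_temps` arrays are hypothetical stand-ins for the 500 simulation runs and four experimental replicates; with those sample sizes the test has 502 degrees of freedom, as in the table.

```python
import numpy as np
from scipy import stats

times = np.arange(100, 1100, 100)                  # 100 s ... 1000 s output times
rng = np.random.default_rng(2)

# Hypothetical stand-ins: 500 simulation runs and 4 experimental replicates per time
sim_temps = rng.normal(500.0, 40.0, size=(500, times.size))
exp_temps = rng.normal(512.0, 35.0, size=(4, times.size))

for j, t_out in enumerate(times):
    sim, exp = sim_temps[:, j], exp_temps[:, j]
    t_stat, p_two_sided = stats.ttest_ind(sim, exp, equal_var=True)  # df = 500 + 4 - 2 = 502
    lev_stat, lev_p = stats.levene(sim, exp)       # equality-of-variance check noted in the text
    mean_diff = sim.mean() - exp.mean()            # simulation minus experiment
    print(f"t = {t_out:5d} s: t-stat = {t_stat:6.3f}, 2-tailed sig. = {p_two_sided:.3f}, "
          f"mean diff = {mean_diff:7.2f} C, Levene p = {lev_p:.2f}")
```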
means are equivalent, even though they really are equivalent (i.e., a Type 1 error). The value of 0.05 is typical when comparing two sets of experimental samples, but other values can be used as well (see below). The table reports significance values that assume that the variance of the experimental data and the variance of the computational distribution are equal. Given that both distributions derive from variability in the physical data (the problem statement assumes no experimental errors), this is a physically supportable assumption. In addition, Levene’s test for equality of variance [11] reports values that indicate that the variances can be considered equal, albeit the four sample members in the experimental distribution results in only moderate confidence from this statistic. In Table 3, the first value to note is the 2-tailed significance column. Values in this column that are greater than 0.05 indicate that there is no statistical reason to conclude that the differences in the means between the computational and experimental distributions are due to anything more than chance (a significance of less than 0.05 in this analysis would indicate a low probability of obtaining the observed results if the means were actually the same). At early times, this conclusion is weak (since the significance value is just over 0.05), but the conclusion becomes more robust at later times. The differences in the mean values for the distributions across times from 100 to 1000 s (Mean difference column) show that the simulation mean is consistently 10–15° lower than the experimental mean. This could be used later to inform a bias correction factor if desired. Given the distribution data used, at later times there is 95% confidence that the difference in the means wouldn’t be more than about 44° low, or 22° high. This information is not as applicable to the rest of the analysis, but gives an idea of the confidence in the accuracy of the simulation versus the experimental data. Another measure that can be important in a hypothesis test is the power of the test. The power of a hypothesis test is the probability of rejecting the hypothesis when it is, in fact false. So this would be the probability of concluding that the simulation and experiment give different results, when, in fact, they do give different results. Or said another
way, if the computational model is not valid (given the experimental data), the power of the test is the probability of correctly concluding this. The power of a hypothesis test is dependent on several factors. The alpha value used for the test affects the power, the number of samples used in the test affects the power, and the difference in the mean values (computational versus experiment) that you need to be able to detect affects the power. Unfortunately, the power of this hypothesis test on the means reported in this paper is very weak due to only having a maximum of four experimental samples, and using an alpha value of 0.05. We have weak experimental data, and are asking for a high confidence (95%) of not rejecting the hypothesis that the computational and experimental means are the same, if they are, in fact, the same. The power of the test is not available from the SPSS software used for this work [11], but a separate module from SPSS that is specifically designed to calculate test power [12] has been used to estimate that, at 900 s, given the computational and experimental means as measured (a difference of 12°), the sample standard deviations, and the given number of samples, the power is only 0.12. This result indicates that if a difference between the computed and experimental means of 12° at 900 s implied that we would reject the hypothesis that the means were equivalent, then the power to reject the hypothesis that the two are the same is only 12%. The software estimates that given the means and standard deviations seen for the 900 s data, it would require 55 experimental data points to have moderate power (0.8) to discern that the model is invalid, if in fact it were invalid. The tests of the computational distributions against the experimental data in this paper suggest that the experimental data does not contradict the computational results, but gives low confidence that we would be able to reject the hypothesis that they are they same, if in fact they are not. The plots in Fig. 3 provide further insight into the power of the tests used in this work versus possible changes in the test (different a), different numbers of experimental samples, or different mean differences (between computed and experimental). The plot on the left in Fig. 3 shows three
Fig. 3. Hypothesis test power for α = 0.05 (left) and α = 0.20 (right), showing power variation with the number of experimental samples and the computational versus experimental mean difference (diff).
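Power curves of the kind plotted in Fig. 3 can be produced with a standard two-sample t-test power calculation. The sketch below uses statsmodels in place of the SPSS power module cited in the text; the 900 s standard deviation is a placeholder value (an assumption, not a number taken from the paper), and the effect size is simply the mean difference divided by that standard deviation.

```python
from statsmodels.stats.power import TTestIndPower

power_solver = TTestIndPower()
sigma = 35.0          # assumed common standard deviation at 900 s (placeholder)
n_sim = 500           # computational sample size is fixed at 500

for alpha in (0.05, 0.20):
    for diff in (10.0, 20.0, 30.0):               # mean differences to detect, deg C
        effect_size = diff / sigma                # Cohen's d
        for n_exp in (4, 10, 25, 55, 100):
            power = power_solver.power(effect_size=effect_size,
                                       nobs1=n_exp,
                                       alpha=alpha,
                                       ratio=n_sim / n_exp,   # second group has 500 members
                                       alternative='two-sided')
            print(f"alpha={alpha:.2f}  diff={diff:4.1f} C  n_exp={n_exp:3d}  power={power:.2f}")
```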
different curves (all using the 900 s computational mean and standard deviation); one curve for each mean difference of 10°, 20°, or 30°. The left plot assumes that a test using alpha of 0.05 is being used, and shows, for the three different levels of mean difference, what the power of the test would be for different numbers of experimental samples (assuming that there are always 500 computational samples). The figure shows that if we need to be able to differentiate means that are within 10° as different, then we will need around 100 experimental samples to be able to do that with 80% confidence. If we only need to discern mean differences of 20°, we need only 20 experimental samples to have an 80% confidence of being able to reject the hypothesis that the means are the same if they are not. And if we only need to be able to discern the difference within 30°, we only need seven experimental samples to do the same. The right graph in Fig. 3 shows the same three curves, but assumes that we use an alpha of 0.20 instead of 0.05. We might want to do this if we don’t mind having a higher probability of rejecting the model even though it is valid (a), in order to obtain a higher probability of deciding that the model is invalid, when in fact, it is invalid (power). From the right side plot, it can be seen that fewer experimental samples are now needed in order to gain the same power for the test, or, for the same number of samples as for an alpha of 0.05, the power of the test will be higher. These discussions reflect the fact that in reality, the experimental design should be performed before the experiments are run, in order to design the appropriate probabilities of accepting validity of the model or not, and knowing how many samples will be required to achieve required confidence. So how can the results of this test be used? We have calculated relatively how probable it is that we would obtain the experimental results presented if, in fact, the computational model is assumed to provide the same results as experiment. The results indicate that it is not unlikely that we would see the results obtained if our model is equivalent to experimental reality. Keeping in mind that this test also does not give us much statistical information on whether the observed difference would also be seen if the computational model was doing a poor job, we now need to make a decision concerning the validity of the model. This is the
situation that an analyst would be in given the data provided, if it were not possible to obtain any further data. In order to make a decision on model validity, we need to consider together the statistics, the physics, and the implications of our decision. The experimental data is consistent with the model results, both from a visual perspective, and from the results of the hypothesis test. We know the physics of the model is approximate, and can see the trends in the mean difference (between computation and experiment) vary essentially as we would expect. We understand that the model does not consider the timedependent thermal conductivity, and at the end of this paper will consider the implications of this. Given all of this information, a reasonable conclusion is that the model is consistently predicting the mean experimental temperature within the 10–15 °C deviation shown in Table 3. Given the requirement to make a decision on the model one way or the other, the evidence indicates that we have no objective reason to reject that the model is making valid predictions in the Ensemble range of conditions, and will thus accept that the model is valid. If making a wrong decision on this would potentially cause life-threatening conditions, we might reevaluate this decision given the weakness that the experimental evidence gives us in being able to discern invalid predictions in the ranges of values seen. Thus, given the discussion above, we will consider the computational and experimental means to be equivalent, and thus at least for the case of the ensemble configuration four experiment, with the high level of experimental data, the model is considered validated. This implies that an accuracy of 10–15 degrees compared to the computational model is acceptable, given that the difference between the experimental and computational means vary in that range. By combining the results of the statistical tests, physical knowledge of the situation, and implications of the decision in making this assessment, we cannot make any probabilistic statements about the confidence of the validation other than those from the hypothesis test, which essentially say that with 95% confidence, the experimental data that we have does not contradict that the computational model is providing results that are statistically equivalent to the available experimental data. The same analysis has also been performed for configurations one–three for the high
Table 4
Statistical comparison of simulation and experimental distribution means for ensemble configurations 1–4: high data level

Config   Min/max significance   Min/max mean diff (°C)   95% Range at 1000 s (°C)   Notes
C1       0.003/0.293            -8.0/-13.9               -30/+9                     Fails means test until 700 s
C2       0.202/0.398            -3.4/-7.5                -24/+10
C3       0.066/0.771            +5.8/-13.8               -30/+40
C4       0.060/0.521            -10.1/-15.4              -44/+22
levels of physical and experimental data. Table 4 provides a synopsis of the results of these calculations. For configurations two–four, the means comparison test is passed for all times. So for these configurations, the computational model can be considered to be validated given the available data (given the same discussion as above on the meaning of "validated"). For configuration one, however, it is not as clear that the model is producing correct temperatures for the physical situation. At later times (beyond 700 s), the means comparison does indicate that the experiment does not contradict the model results. However, up to 700 s, the comparison indicates that there may be a non-random bias involved. Configuration one is the low-heat flux, thin sample case. The temperature at the surface never rises above 350°. The experimental results are almost all higher than the computational results for the first 300 s. Even after 300 s, only one of the experiments has values below the mean of the simulation. It is difficult to absolutely assign a reason for this discrepancy, but it is either due to an unfortunate set of samples in the experiment, or due to the choice of a thermal conductivity distribution that is valid for higher temperatures, coupled with the thin sample. In this work, later times are of more interest, and configuration one is one of the furthest from the regulatory configuration. So if the high levels of data are available, then the computational model could be considered to be valid for conditions of interest closest to the regulatory conditions, given the physical and ensemble data. It would not be valid for lower temperatures.

3.2. Low levels of data
Section 3.1 discussed the ensemble validation when high levels of data are available. This section will consider only the low level of analysis data, both for physical data and for ensemble experimental data. A difference in this section is that for the low level of data, distributions of thermal conductivity and volumetric heat capacity have been derived (see Section 2), but there is only a single experimental result available, with no replicates. Thus, no distribution of the experimental results can be formed. Since no distribution of the experimental results is available, a simple visual comparison against the computational distribution will be used. Fig. 4 shows the single experiment against the simulated distribution envelope for all four ensemble configurations, using the physical parameter distributions derived from the low levels of data. Considering the configuration four graph, the visual comparison of the experimental curve against the simulated mean curve is excellent (i.e., within a few degrees along the entire curve). Configuration two also shows excellent visual agreement, configuration three shows close but uniformly low agreement, and configuration one is significantly low, as was the case for the high levels of data.
Fig. 4. Ensemble validation calculations using low levels of physical and ensemble data, with 500 model runs.
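The mean, 1st percentile, and 99th percentile envelopes plotted in Figs. 2 and 4 can be read directly off the matrix of LHS results. A minimal sketch, assuming a hypothetical `temps` array with one row per model run and one column per output time:

```python
import numpy as np

# Hypothetical results: 500 LHS model runs x 10 output times (100 s ... 1000 s)
rng = np.random.default_rng(4)
times = np.arange(100, 1100, 100)
temps = np.cumsum(rng.normal(60.0, 8.0, size=(500, times.size)), axis=1)

envelopes = {
    "1st":  np.percentile(temps, 1, axis=0),
    "mean": temps.mean(axis=0),
    "99th": np.percentile(temps, 99, axis=0),
}
for name, curve in envelopes.items():
    print(name, np.round(curve, 1))
```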
For configuration four, the differences between the mean curve from the simulation and the experimental curve range from −2 °C at low times to +13 °C at 1000 s. This is better agreement than for configuration four with the high levels of data. With this very low level of experimental data, it is a reasonable, if not wholly satisfying, decision to use the visual comparison (with the numerical mean versus experimental point data to back the visual comparisons) to decide that for the thicker samples, the given model and physical data simulate the experimental situation with an accuracy that is sufficient. For the provided data, "sufficient" would have to be assumed to be within minus two or plus 13°. With no ability to estimate dispersion in the experiment, this is all that can be concluded. For the thinner samples, as long as the heat flux is high,
the model is adequate, but for thin, low-heat flux applications, the model has a significant low bias. So again, the reference simulation model, in combination with the assessed physical parameter distributions, will be considered valid for conditions closer to the regulatory situation, but less valid (or invalid) for lower temperature scenarios. For this level of data, there is no confidence assessment available due to the lack of experimental replicates. Note also that the agreement, and the "plus or minus" significance above, is predicated on the distributions of the physical parameters used. Different physical parameter distributions might result in different conclusions.

3.3. Medium levels of data

The analysis for the medium levels of data is done in a very similar fashion to the analysis for the high level of
Fig. 5. Accreditation simulations versus experiment for sample front face, center, and back face for high levels of physical and experimental data.
data, using an independent sample t-test, even though there are only two samples in the experimental data. The results are similar to those that were produced for the high levels of data. Configurations three and four produced means agreement at approximately the same levels of confidence as with the high levels of data, configuration two shows that the simulation was atypically low (as compared to the low and high levels of data), and configuration one, as with all of the other levels showed poor agreement, indicating a significant low bias for the simulation distribution. Given the medium level of data by itself, without the benefit of insights from either the low or high levels, the same considerations of the meaning and power of the statistical tests would need to be considered here. The lack of agreement with configuration two would give an analyst more to think about in terms of whether the overall results are valid, but configuration two is also a low temperature case,
far from the regulatory situation of interest. It is likely that the model would again be considered valid for higher temperature situations such as configurations three or four, but invalid at lower temperatures. The statistical evidence is very weak due to only having two experimental samples, but the evidence available does not contradict the conclusion that the computational model is producing valid results at higher temperatures. If it is necessary to conclude with high confidence that the model is invalid (if it is in fact invalid), i.e., to test with high power, then more experiments would be needed.

4. Accreditation validation

A set of one or two experiments (low/medium = 1, high level = 2) at conditions much closer to the regulatory condition were provided as a second level of experimental
Fig. 6. Accreditation simulations versus experiment for sample front face, center, and back face for low levels of physical and experimental data.
validation. These were termed accreditation experiments. These data were provided not only at the surface of the sample, but also at the center and back sides. These conditions were also required to be analyzed at low, medium, and high levels of physical and accreditation data. Fig. 5 provides the results of simulations using the high levels of physical and experimental data, and Fig. 6 provides the results using the low level of physical data and only the first accreditation experiment. The front face data is the most pertinent to the regulatory problem, but the results for the center and back faces of the sample may provide information useful in extrapolating away from the experimental conditions up to the regulatory conditions. Using the independent samples t-test described in Section 3 for the surface temperature simulation and data, the results for the high levels of data indicate statistical evidence that the means of the simulation and experiment differ only by random chance. Visual examination of the graph provides the same conclusion. The simulation results do appear to deviate slightly at later times on the high side. The statistical results bear this out as well – the significance statistic is at its lowest at 1000 s, indicating the highest likelihood of non-random bias at higher times and temperatures. Thus, we may have to consider that there is a slight high bias in the model at late times. This effect is predictable given that the thermal conductivities being used are based on 500 °C temperatures, and thus at 1000 s, when surface temperatures are significantly higher than this, the simulation model is not able to conduct heat away from the surface as fast as the "real" situation requires – since the actual temperature-dependent thermal conductivity would then have a higher value. For the high level of data and the center of the sample, the t-test again arrives at the conclusion that the experimental evidence does not contradict the results from the computation. Visual analysis of the graph in Fig. 5 again confirms the close fit of experiment to mean of the simulation, but the tendency of the simulation to be high at later times is again evident. Finally, for the back-face data, the comparison again shows that the simulation and experimental means are statistically indistinguishable based on the limited data, but it is obvious from visual analysis of the graph in Fig. 5 that the simulation is overpredicting the back-side temperature. This is again understandable in the context of the thermal conductivities that are being used. Since the back-face temperatures are in the region of 250 °C, the thermal conductivities being used (from the 500 °C data) are too high for the back portion of the sample. Thus, the simulation is able to conduct more heat to the back face than will really occur in the experiment. The low levels of experimental data are again not amenable to the t-test statistical treatment since there is only one experimental result, and we are left with visual comparisons. In Fig. 6, the single experiment matches the mean of the simulation distribution almost exactly for the front face. For the center and back-face temperatures, the
simulation is now seen to underpredict the experiment, versus the high-data case where it tends to overpredict. The conclusion would again be that for cases closer to the regulatory case, the simulation model is valid, but further away (lower temperatures), it is not.

5. Summary of pre-regulatory validation

This section summarizes the work performed in the physical data characterization, ensemble validation calculations, and accreditation calculations.

5.1. Physical data summary

In the physical data analysis described in Section 2, the three levels of possible experimental data are all used to generate probability distributions for the thermal conductivity and volumetric heat capacity variables. We want to form probability distributions that describe our full state of knowledge concerning the uncertainty in the parameters, and thus it is desired to use as much of each level of data as possible. In the low data case, if no further experiments were available, then the conclusion would likely be that both variables are temperature dependent. Given that the regulatory simulation model cannot be changed to use temperature-dependent data, assumptions such as those made in Section 2 would be required to generate best-estimate distributions to describe the uncertainty in the parameters. Later portions of the analysis would need to decide how to use the knowledge that the data is temperature-dependent, but that the simulation model is not. For the medium level of experimental data, it still appears that the thermal conductivity is temperature-dependent, but it no longer appears that the volumetric heat capacity is. In the absence of physical information as to why the volumetric heat capacity data would be "modal" (i.e., peaked in the middle of the temperature range), it is reasonable to conclude that this effect is sampling based, and not physical. This leads to using all the data as a random sample for a non-temperature-dependent volumetric heat capacity. We are left with the problem of the obvious thermal conductivity temperature-dependence, however. The conclusions for the high level of experimental data are essentially the same as for the medium level. It is more obvious that the volumetric heat capacity is not temperature-dependent, and the thermal conductivity is still obviously temperature-dependent. Thus, the techniques for forming distributions for medium and high levels are the same. In a situation where small levels of experimental data can be used to drive the performance of further experiments, it is likely that the low level of data would generate too many questions, and more experiments would be performed. If this then leads to the medium level of data, that probably would confirm the temperature dependence of the thermal conductivity, but leave some questions about the
volumetric heat capacity in the analysts' minds. If the low level of data drove the production of further experiments to get to the medium level, then that overall set of data could be treated as the base data (as is done in this paper), or a Bayesian updating of the low level of data with the new data could be performed. In either case, if experiments were very expensive, this would probably be seen as sufficient. If further experiments were not costly, then the high level of data might be generated to answer any lingering questions about the volumetric heat capacity.

5.2. Summary of ensemble validation calculations

The ensemble validation calculations are performed at conditions that are far enough from the regulatory conditions that they would only provide moderate evidence of validation of the simulation model for the regulatory conditions. The analysis in Section 3 shows that for all levels of data, the agreement between the ensemble data and the simulation is close for conditions that are closer to the regulatory situation, and is suspect for conditions that are further from the regulatory situation. This gives some confidence that the simulation model will perform adequately as conditions approach the regulatory conditions. However, it must be remembered that this comparison is predicated on the physical parameter distributions that were derived in Section 2. Different input distributions would result in different simulation distributions, which might result in different conclusions. If low temperature, moderate temperature, and high temperature physical parameter distributions were derived, and each used in regimes appropriate to the temperatures seen in the different ensemble calculations, it is likely that the results would be even better.

5.3. Summary of accreditation validation calculations

The accreditation experiments, being at conditions closer to the regulatory conditions, provide evidence of agreement of the simulation model with the experiment, but with evidence of a slight potential systematic bias between the model and experiment at higher temperatures. If the high level of data were available, it would appear that the simulation model tends to produce higher estimates than the experiment at later times, but with the low level of data, the conclusion would be the opposite. Section 7 shows a method for using the trend data from this accreditation comparison to estimate the effects of the apparent model bias on the results of the regulatory analysis. It is interesting to note the lack of spread in the experimental data for both the accreditation and ensemble data. If the spread in the physical data provided is "real", then one would expect that a single random experimental sample would generate a result that, while less likely to be at the extreme outer edges of the simulation space, is also unlikely to lie directly on the mean curve of the simulation. The experimental data for this exercise is fabricated, and as
such can be anything that the authors wished, but it is somewhat fortuitous that the experimental values almost all lie very close to, if not right on, the mean simulation curves. In a real-world case, where only one experiment could be performed, it is likely that validation decisions would have to be made based on a comparison that is not as close as those seen in this analysis.

6. Surrogate-based regulatory clustering solution

In the previous sections, a 500-element Latin hypercube sample was used to propagate physical parameter uncertainties through the target simulation model to generate dependent distributions of temperature under various conditions. To actually do this for a single level of data, 2500 simulation model runs would be required (four ensemble conditions at 500 sample members each, plus one accreditation condition at 500 sample members). To simulate the regulatory situation, 500 more runs would be required, for a total of 3000 runs. If the goal of this analysis were only to estimate the measures of central tendency for the dependent temperature distributions, it would be possible to use far fewer LHS sample members, and to drastically reduce the number of simulation runs required. However, the ultimate goal is to fully characterize the uncertainty distribution of the sample surface temperature, including the distribution tails for regulatory conditions. In fact, the tails of the temperature distribution are very important to comparing the distribution against the one percent probability of exceedance threshold for the 1000 s temperature. Thus, for basic Monte Carlo or LHS sampling methods, many sample members would be required, and subsequently many runs of the (expensive) simulation model. Similarly, techniques such as spectral stochastic methods [13] or sparse-grid collocation schemes [14] require many evaluations of the simulation model as compared to the technique discussed in this work, although they have broader applicability as compared to generating problem-specific surrogate physics models. This section describes a surrogate-model-based method for greatly decreasing the number of computer runs required for characterizing the uncertainty in the results of physical simulations that have reasonably monotonic response functions. The regulatory situation will be characterized using this method. However, it would be equally applicable to the ensemble and accreditation characterizations performed in Sections 3 and 4.

6.1. Base-line regulatory results

This section will present the results of propagating the full five-hundred-member LHS samples (low, medium, and high) through the simulation model for the regulatory conditions in order to form a base-line for the run-limiting technique described in Section 6.2. The result of interest is the surface temperature of the sample at
Fig. 7. Regulatory configuration sample surface temperature at 1000 s for low, medium, and high levels of physical data.
1000 s. The regulatory criterion is that this temperature is not to exceed 900 °C, 99% of the time. The graph in Fig. 7 shows the results as cumulative probability distribution functions (CDFs). These curves are for propagating the full 500 sample members for each of the low, medium, and high levels of physical data (LHS distribution samples) through the regulatory model. In Fig. 7, the medium and high levels of data lie virtually on top of each other until approximately 850 °C, at which point the high level of data curve rises slightly more quickly than the medium level curve. The low level of data curve reflects the wider physical parameter distributions, and while crossing the 900 °C line at approximately the same probability as the other levels of data, it has much longer tails. From the raw data used to plot Fig. 7, the probabilities of exceeding 900 °C at 1000 s are calculated as 0.24 for the high level of physical data, 0.26 for the medium level, and 0.22 for the low level. Thus, based on just the raw LHS propagation through the model, not taking into account the known problem with the temperature-dependent thermal conductivity, all three levels of data decisively show that the regulatory criteria would not be met given the current data and simulation model. However, even this calculation required 500 simulations to be performed with the regulatory simulation model in order to adequately characterize the tails of the distributions (which control the results of the regulatory criteria comparison). If the simulation model were a long-running three-dimensional physics model, this would not be possible. The following sections present a methodology for addressing this problem, as well as considering the known model deficiencies.

6.2. A surrogate heat transfer model for the simulation

As noted in Section 6.1, if the regulatory simulation model is long-running, it will not be possible to do sufficient simulations to adequately characterize the tails of the dependent distributions. For the simulations discussed in this section, we will assume that only ten actual
simulation runs of the full reference model will be possible due to time and/or resource constraints. Ten simulation runs may be enough to characterize the central tendencies of the dependent distributions, but it would not be sufficient to well-characterize the tails. The methodology presented here is loosely based on the source term clustering approach used in the NUREG-1150 nuclear risk assessment work [8], which was codified in a utility program written for that analysis [15]. There are competing requirements for this analysis: to characterize the tails of the dependent variable distribution well, but with a quite limited number of simulations (ten). For this regulatory problem, there is a single variable of interest: front surface temperature at 1000 s. The time-temperature plots discussed in the ensemble and accreditation validation sections (Figs. 2, 4–6) show results that have smooth, monotonic temperature responses with time. Given the physics involved, this is to be expected. What this also means, however, is that within the original 500-element LHS samples used throughout the previous sections of this paper, many of the sample members in the LHS sample generate temperature profiles that will be very similar. If we could somehow predict which of the 500 sample members in an LHS sample would produce results that would be similar to each other, we could then group those sample members together, and choose one of the sample members in each group to act as a surrogate for the entire group. This is the technique used here to limit the number of simulation runs required of the full, computationally intensive model. The (long-running) reference model simulates the temperature response of the surface of the sample given the thermal conductivity and volumetric heat capacity. A model that is designed to provide similar kinds of results is shown in Eq. (2) [16].

$$T(0,\,1000) = 25 + \frac{2\,q''\,\sqrt{\dfrac{1000\,k}{\pi\,\rho C_p}}}{k} \qquad (2)$$

This is a simple semi-infinite slab model, which will predict the 1000 s front surface temperature of the slab, given the incident heat flux (q''), thermal conductivity (k), and volumetric heat capacity (ρCp), and assuming the 25 °C initial temperature for the regulatory situation. This model would not be sufficient to actually predict the accurate front surface temperature for the regulatory configuration. However, it does model physics that are somewhat similar to the regulatory simulation model, and uses the same variables that are being modeled as uncertain in the regulatory simulation model. It just happens to run in milliseconds on a spreadsheet instead of in hours or days, as the regulatory model is assumed to do. So if we use this simple surrogate model to predict the surface temperature at 1000 s for each of the 500 LHS sample members, we should have magnitude-inaccurate, but trend-accurate results for all 500 sample members. It is this set of
It is this set of surrogate-generated surface temperatures that we are going to use to group the LHS sample members into sets that will ostensibly generate similar surface temperature results. Note that the trend-accuracy of this surrogate model is important to the clustering results. Later in this paper, this accuracy will be investigated against the full simulation model, but in general this will not be possible, and the surrogate model will have to be chosen on the basis of knowledge of the relevant physics. Fig. 8 illustrates the process of taking the 500 sample members, using the surrogate model to estimate a front-side temperature for each, and then grouping the sample members by estimated temperature into a set of groups that should generate similar results. The grouping is performed using the SPSS statistics package [11], using a technique termed "K-means clustering". This clustering technique takes a predefined number of desired clusters (groups), considers the clustering variable (the predicted surrogate-temperature), and groups the 500 temperatures into the tightest ten groups that it can. It uses an iterative technique that attempts to form the groups in such a way that the Euclidean distance from the group centers to the group edges is minimized across all groups. Thus, the spread of predicted temperatures for the sample members in any specific group is as small as it can be, given minimization of the spread in the other nine groups. Once this procedure is completed, there are ten groups of sample members. Each group is then ordered by predicted surrogate-temperature, and the median sample member is selected to represent that group. When this process is completed, ten sample members have been selected from the 500 that are the most representative of the ten groups that have been defined. Remember that forming ten groups was arbitrarily decided due to assumed resource constraints, so some other number of groups could just as easily be chosen, and the procedure would be the same. If there were time and resources to do thirty simulations, then thirty groups could be defined.
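To make the surrogate-plus-clustering step concrete, the sketch below is a minimal illustration (not the SPSS workflow used for this paper): it evaluates the Eq. (2) surrogate for a 500-member LHS sample, forms ten K-means clusters on the predicted temperature, and picks the median member of each cluster as the case to run in the full reference model. The array layout, parameter distributions, and function names are assumptions made only for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def surrogate_T(k, rho_cp, q_flux, t=1000.0, T0=25.0):
    """Semi-infinite slab front-surface temperature at time t, Eq. (2)."""
    return T0 + 2.0 * q_flux * np.sqrt(t * k / (np.pi * rho_cp)) / k

# Hypothetical 500-member LHS sample; columns are k, rho*Cp, and q''.
# The distributions and values are placeholders, not the challenge-problem data.
rng = np.random.default_rng(0)
lhs = np.column_stack([
    rng.normal(0.06, 0.005, 500),     # k, W/m C (illustrative)
    rng.normal(4.0e5, 3.0e4, 500),    # rho*Cp, J/m^3 C (illustrative)
    np.full(500, 3500.0),             # q'', W/m^2 (illustrative)
])

# Surrogate prediction of the 1000 s front-surface temperature for each member
T_sur = surrogate_T(lhs[:, 0], lhs[:, 1], lhs[:, 2])

# K-means clustering on the single surrogate-temperature variable (ten clusters)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(
    T_sur.reshape(-1, 1))

# Median (by surrogate temperature) member of each cluster: these ten indices
# identify the sample members that would be run through the full reference model.
centers = []
for c in range(10):
    members = np.where(labels == c)[0]
    ordered = members[np.argsort(T_sur[members])]
    centers.append(ordered[len(ordered) // 2])
```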
Fig. 8. Using the surrogate heat transfer model results to cluster LHS sample members.

Fig. 9. Clustering results on surrogate model prediction for high level of physical data (surrogate model temperature, °C, versus cluster number).

The surrogate model in Eq. (2) was applied to the three different LHS samples (for the low, medium, and high levels of physical data), and ten clusters were formed from each. Fig. 9 shows how the clustering algorithm groups sample members that have similar temperatures into the ten different clusters. Each group of horizontal dashes in Fig. 9 indicates a set of sample members that have been grouped together due to their proximity in surrogate-temperature space; the cluster numbers are along the horizontal axis. A median sample member was chosen from each of these clusters, and the regulatory simulation model was run for each of the ten cluster centers, for each of the three levels of physical data available. This results in ten simulation results from the full reference simulation model per level of physical data, which are then mapped back to the original 500 LHS samples for each level through their cluster numbers. At this point, the surrogate model results are no longer needed, and all further work is done with the ten results from the full reference simulation model. Fig. 10 shows the results of the reference calculations mapped back to the LHS samples. Contrast this figure with Fig. 7. Each curve in Fig. 10 is stair-stepped, since there are actually only ten unique results. Each of the ten results is also weighted by the number of sample members in the cluster that produced it. The mapped simulation-sample data now provide probabilities of exceeding the regulatory 900 °C level at 1000 s of 0.19 for the high level of data, 0.27 for the medium level, and 0.17 for the low level. If these are compared to the probabilities from the full 500-sample-member run, they are similar, but not identical. The reason is obvious from the stair-stepped nature of the clustered results in Fig. 10: regulatory cutoff temperatures of 875 or 925 °C might give quite different results because they would hit different "steps" on each curve.
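A hedged sketch of this mapping-back step follows, reusing the lhs, labels, centers, and surrogate_T names from the previous sketch. Because the true reference model is not available here, its ten results are faked with an offset surrogate value purely so the example runs; in practice full_model_T would hold the ten long-running simulation results.

```python
import numpy as np

# Stand-in for the ten reference-model results at the cluster-center members;
# in a real analysis these are the only full simulations that get run.
full_model_T = surrogate_T(lhs[centers, 0], lhs[centers, 1], lhs[centers, 2]) + 40.0

# Map every LHS member to its cluster's reference result, so each of the ten
# values is weighted by its cluster population.
mapped_T = full_model_T[labels]

# Stair-stepped empirical CDF and the regulatory exceedance probability
T_sorted = np.sort(mapped_T)
cdf = np.arange(1, len(T_sorted) + 1) / len(T_sorted)
p_exceed_900 = np.mean(mapped_T > 900.0)
```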
Fig. 10. Regulatory configuration sample surface temperature at 1000 s for low, medium, and high levels of physical data using clustered results (cumulative probability versus surface temperature, °C).

Fig. 11. Regulatory configuration sample surface temperature at 1000 s for low, medium, and high levels of physical data using interpolated cluster results (cumulative probability versus surface temperature, °C).

Table 5
Probabilities of exceeding regulatory thresholds without model correction

Level of physical data   Baseline (500 runs)   Cluster centers (10 runs)   Cluster edges with interpolation (11 runs)
Low                      0.22                  0.17                        0.23
Medium                   0.26                  0.27                        0.26
High                     0.24                  0.19                        0.16
One way to solve this stair-stepped limit problem would be to interpolate the data within a cluster. Fig. 11 shows results from a different algorithm used with the clustering data. Here, the sample members on the edges of the clusters are used to run the reference simulation model instead of the sample member at the center of each cluster, and the 1000 s reference results are interpolated between the edges of each cluster, based on the surrogate predictions. Performing this interpolation actually requires eleven runs of the regulatory simulation model: one for the lowest surrogate-temperature sample member in the lowest-temperature cluster, and then one run for the highest surrogate-temperature sample member from each of the ten clusters. The high-temperature sample member for each cluster is then considered to be equivalent to the low-temperature sample member for the next highest-temperature cluster. This is effective because the clusters are almost continuous edge-to-edge. If they were not (i.e., if there were large gaps between clusters), it might be necessary to run cases from the top and bottom of each cluster, for a total of twenty reference simulation cases. Alternatively, five clusters could be formed and the top and bottom samples run for each, to maintain the original limit of ten reference simulation runs. As can be seen in Fig. 11, the interpolation removes the stair-stepped nature from having only ten actual reference
simulation model runs, but there are still resolution discontinuities in the cumulative distribution functions as compared with the full, 500-sample-member distributions shown in Fig. 7. So the clustering algorithm introduces uncertainties of its own. Table 5 summarizes the probabilities of exceeding the regulatory threshold for the baseline calculation (requiring 500 simulation runs) and the two uses of clustering (cluster centers, and cluster edges with interpolation). The cluster-center model without interpolation biases both the high and low cases away from the baseline results. The cluster-edge run with interpolation biases only the high level of data away from the baseline. The question that these results raise is whether this bias is acceptable, especially considering that in the normal case, where 500 runs are not possible (otherwise, why cluster?), the analyst would not even know about the bias. Consider the plot in Fig. 12. This plot shows the surrogate model prediction for all 500 sample members (which you would always have) against the interpolated high-physical-data case (which you could always have), and then against the full 500-element regulatory simulation model case (which, by assumption, you would never have, since you could not run the 500 cases). The diagonal corner-to-corner line in the figure indicates where the surrogate model and the other two predictions would give the same results. The figure shows that the surrogate model always underpredicts the surface temperature at 1000 s. However, this underprediction is very consistent:
Surrogate Predicted Temperature (C)
Cumulative Probability
1 0.9 0.8 0.7 0.6
950 900 850 800 750 Surrogate Versus Interpolated
700
Surrogate Versus Simulation 650 650
700
750
800
850
900
950
1000
Temperature (C) Fig. 12. Comparison of surrogate-temperature predictions at 1000 s against full regulatory simulation results and interpolated clustered results.
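The edge-based interpolation scheme described above can be sketched as follows, again reusing names from the earlier sketches and faking the eleven reference-model results so the example runs; treat it as an illustration of the bookkeeping, not as the paper's implementation.

```python
import numpy as np

# Order the clusters by their surrogate-temperature ranges
cluster_order = np.argsort([T_sur[labels == c].min() for c in range(10)])

# Edge members: the lowest member of the coolest cluster, then the highest
# member of every cluster (eleven members, hence eleven reference-model runs)
coolest = np.where(labels == cluster_order[0])[0]
edges = [coolest[np.argmin(T_sur[coolest])]]
for c in cluster_order:
    members = np.where(labels == c)[0]
    edges.append(members[np.argmax(T_sur[members])])
edges = np.array(edges)

# Stand-in for the eleven full reference-model results at the edge members
T_ref_edges = surrogate_T(lhs[edges, 0], lhs[edges, 1], lhs[edges, 2]) + 40.0

# Interpolate every member's reference estimate from its surrogate temperature
order = np.argsort(T_sur[edges])
mapped_T_interp = np.interp(T_sur, T_sur[edges][order], T_ref_edges[order])
p_exceed_900_interp = np.mean(mapped_T_interp > 900.0)
```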
7. Reference simulation model-form bias

The simulations using the reference simulation model itself have uncertainty due to the inability of the model to incorporate the temperature-dependent thermal conductivity values. In order to fully characterize the uncertainty in the front surface temperature prediction, a way to estimate the effects of this model bias on the predictions is needed. The "best" characterizations of the reference simulation model versus experimental reality are the accreditation validation experiments and simulations. For the high level of data with the accreditation tests, a t-test comparison of the means was performed, which yields results similar to those shown in Table 3 for the configuration four, ensemble validation comparison. Similarly to Table 3, a difference in the mean value of the experimental results versus the simulation results is calculated for each time. At each time, the experiments have a mean temperature. The mean experimental temperature at each time in the accreditation experiment has been plotted in Fig. 13 against the difference in the mean values between the
experiment and the simulation at that time/temperature. As can be seen from Fig. 13, there is a distinct bias between the simulation and experiment at higher temperatures (500–700 °C). For temperatures encountered during the accreditation experiments, the bias is only up to 20 °C or so at 700 °C. At temperatures of 900–1000 °C, the extrapolated bias could be as much as 45–55 °C if the deviation remained linear. The simulation with the assigned physical parameter distributions is overpredicting the temperature at the higher temperatures that are closer to the regulatory condition. Having identified this systematic bias, we should now account for it in the model. The existence of this bias was predictable given the known, unmodeled temperature dependence in the thermal conductivity. The conclusion is, however, based on very limited accreditation experiments, and both the modified and original results might be considered when using the analysis results in a decision-making setting. It is possible that the existence of this obvious bias, supported by physical intuition about the limitations of the simulation model, might form the basis for supporting further accreditation experiments or model refinement. To account for this model-form bias, a linear correction relation has been derived based on the dotted extrapolation line shown in Fig. 13. This correction relation takes the form shown in Eq. (3), where D is the difference to be subtracted from the predicted simulation temperature, and T_sim is the original simulation temperature predicted by the regulatory simulation model:

D = 0.122\, T_{\mathrm{sim}} - 67.1 \qquad (3)

Fig. 13. Mean experimental temperature versus simulation/experiment mean difference for accreditation experiment with high levels of physical data.
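A minimal sketch of applying this correction to the cluster-interpolated results (the mapped_T_interp array from the earlier sketch) is shown below. The slope and intercept are taken directly from Eq. (3); in practice they would be estimated from the accreditation mean differences, for example with a least-squares fit such as numpy.polyfit.

```python
import numpy as np

def apply_bias_correction(T_sim, slope=0.122, intercept=-67.1):
    """Subtract the Eq. (3)-style model-form bias, D = slope*T_sim + intercept,
    from the regulatory simulation temperature."""
    return T_sim - (slope * T_sim + intercept)

# Corrected cluster-interpolated results and the Table 6-style probability
T_corrected = apply_bias_correction(mapped_T_interp)
p_fail_corrected = np.mean(T_corrected > 900.0)
```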
Fig. 14 shows the low, medium, and high baseline results, as well as the interpolated clustered results, after applying Eq. (3) to each of the results for all three levels of data. The model-form bias correction does not change the conclusions of the analysis, but it does change the probabilities of failing the regulatory criteria, as shown in Table 6.
Fig. 14. Baseline simulation results (left) and interpolated cluster results (right) after application of model-form bias correction (cumulative probability versus surface temperature, °C, for the high, medium, and low levels of physical data).

Table 6
Probabilities of exceeding regulatory thresholds with model correction

Level of physical data   Baseline (500 runs)   Cluster edges with interpolation (11 runs)
Low                      0.122                 0.128
Medium                   0.084                 0.098
High                     0.060                 0.052
After correction, the 500-run baseline simulation and the eleven-run clustering analysis with interpolation both have probabilities of not meeting the regulatory criteria at 1000 s, at all levels of data, that are consistent and within approximately 0.01 of each other. Table 6 and Fig. 14 assume that Eq. (3) is applicable to all levels of data (high, medium, low). It is likely that the corrections would actually be somewhat different for each level; however, corrections have not been estimated separately for the low and medium levels of physical data in this work.

8. Conclusions

As shown in Table 6, the regulatory criteria are failed for all levels of physical data. The closest is a probability of 0.05 for the high level of physical data, using the interpolated cluster results. The low level of physical data provides higher probabilities of failing the regulatory criteria due to the wider input distributions assessed in Section 2, and subsequently wider distributions on the dependent variable (front surface temperature). The distributions for the medium level of physical data have approximately the same variance as the high level, but the lower mean thermal conductivity tends to shift the temperatures upward compared to the high level of data. As with any result based on a propagation of input uncertainty distributions, the results here are dependent upon the distributions that were assessed for the input parameters. The final distributions shown in Fig. 14 indicate our state of knowledge about the front surface temperature in terms of our knowledge about the physical parameters and the accuracy of the simulation model. Another uncertainty, discussed in Section 6 but not included numerically in the results, is the fact that the clustering methodology used to generate the results in Fig. 14 introduces some measure of uncertainty itself, estimated in Section 6 at around five percent. This information could be used in conjunction with the data in Fig. 14 to generate "families" of cumulative distribution function curves to more fully represent the uncertainty in the results. If the CDF curve in Fig. 14 representing the high level of data were spread by five percent to the high and low side of the mean curve, it would produce a family of CDFs. Some of the curves at the left tail of this family would show that the regulatory criterion is passed. Effectively, we would be placing a probability distribution on the probability of passing the criteria. This sort of "probability of frequency" interpretation of uncertainty is very typical of full-scale quantitative risk assessment result presentations. Fig. 15 shows an example of this kind of assessment, applying the ±5% variability due to the clustering itself (discussed in Section 6), assumed to be normally distributed, to the high-level-of-data cluster interpolation curve from Fig. 14. Each point along the original curve is treated as the mean of a normal distribution with 99.9% bounds of ±5%. The choice of a normal distribution is arbitrary here and is used only as an example. This results in a "space" of result curves.
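The family-of-curves construction can be sketched as follows, continuing from the T_corrected array above: each corrected temperature is treated as the mean of a normal distribution whose 99.9% two-sided bounds are ±5%, a set of perturbed result curves is sampled, and a vertical cut at 900 °C yields a distribution on the exceedance probability. The number of sampled curves and the random seed are arbitrary choices for the illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Standard deviation such that the 99.9% two-sided bounds are +/-5% of the value
z999 = norm.ppf(0.9995)                   # ~3.29
sigma = 0.05 * T_corrected / z999

# Sample a family of perturbed result curves and cut each one at 900 C
n_curves = 1000
p_fail = np.array([
    np.mean(rng.normal(T_corrected, sigma) > 900.0)
    for _ in range(n_curves)
])

# Distribution of the probability of exceeding the regulatory limit
p05, p50, p95 = np.percentile(p_fail, [5, 50, 95])
```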
Fig. 15. Distribution family for high level of physical data and interpolated cluster results (cumulative probability versus surface temperature, °C).

The circled area in Fig. 15 could be taken as a vertical cut across this family of curves to generate a distribution of the probability of exceeding the
900 °C regulatory limit. Taking this vertical cut at 900 °C in Fig. 15, the resulting distribution has a minimum that would indicate that the regulatory criteria would be passed, with a small probability (approximately 5%). The median shows the nominal 0.06 probability of failing the criteria, and the upper bound indicates a 0.16 probability of failure. Thus, we could be 50% confident that the failure probability is 0.06 or less, 100% confident that it is no more than 0.16, and only 5% confident that it is less than the 0.01 regulatory limit. These are the types of results that should be generated in order to fully assess the confidence in simulation results for use in decision-making. The arbitrary normal distribution and the ±5% variability estimated for the uncertainty introduced by the clustering method (Section 6) have been used here only as an example.

The results of the clustering surrogate model approach in this paper depend on having a surrogate model for estimating the dependent variable for the LHS sample members that accurately predicts dependent variable trends, but that does not have to be magnitude-accurate. If the surrogate does not produce values that vary with the same trends that the full simulation will produce, then the clustering is likely to be invalid as well. Also, the extrapolation used to provide the model-form uncertainty correction to the base results depends on there being no non-linear effects that would cause deviation from the linear extrapolation. This should be assessed on the basis of the physics of the problem, and a non-linear extrapolated correction could be used if indicated. The clustering methodology presented here is currently being investigated for sensitivity to the number of clusters used, which sample member(s) per cluster should be used, better ways of estimating the uncertainty introduced by clustering, and different ways of formulating the surrogate model. The methodology has been applied to solid rocket motor problems [17,18], and is being reviewed for application to aircraft structural design and reliability issues.

Acknowledgement

This work was performed at the University of Illinois Center for Simulation of Advanced Rockets, which is supported by the US Department of Energy through the University of California under subcontract number B523819.
References

[1] K.J. Dowding, M. Pilch, R.G. Hills, Formulation of the thermal problem, Comput. Methods Appl. Mech. Engrg. 197 (29–32) (2008) 2385–2389.
[2] E. Paté-Cornell, Risk and uncertainty analysis in government safety decisions, Risk Analysis 22 (3) (2002) 633–646.
[3] J.C. Helton, F.J. Davis, Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems, Reliability Engineering and System Safety 81 (2003) 23–69.
[4] S. Kaplan, J. Garrick, On the quantitative definition of risk, Risk Analysis 1 (1) (1981).
[5] R.L. Iman, J.C. Helton, A comparison of uncertainty and sensitivity analysis techniques for computer models, Risk Analysis 8 (1) (1988).
[6] A. Mosleh et al. (Eds.), Model Uncertainty: Its Characterization and Quantification, in: Proceedings of Workshop I in Advanced Topics in Risk and Reliability Analysis, sponsored by the Nuclear Regulatory Commission, NUREG/CP-0138, 10 (1993) 20–22.
[7] Office of Nuclear Regulatory Research (ONRR), PRA Procedures Guide, NUREG/CR-2300, US Nuclear Regulatory Commission, Washington, DC, 1983.
[8] US Nuclear Regulatory Commission, Severe Accident Risks: An Assessment for Five US Nuclear Power Plants, NUREG-1150, Office of Nuclear Regulatory Research, Washington, DC, December 1990.
[9] M.D. Brandyberry, D.P. Kearnaghan, J.M. East, PSA uncertainties introduced by use of the PARTITION computer code for source term clustering, in: Proceedings of the Probabilistic Safety Assessment International Topical Meeting, Clearwater Beach, FL, January 26–29, 1993, pp. 383–388.
[10] US Army Program Manager for Chemical Demilitarization, Tooele Chemical Agent Disposal Facility Quantitative Risk Assessment, Science Applications International Corporation, SAIC-96/2600, December 1996.
[11] SPSS for Windows, Release 14.0.0, 2005, SPSS Inc., Chicago.
[12] SamplePower 2.0, developed by Michael Borenstein, December 20, 2000, distributed by SPSS Inc., Chicago, IL.
[13] R.G. Ghanem, P.D. Spanos, Stochastic Finite Elements: A Spectral Approach, Dover Publications, 1991.
[14] B. Ganapathysubramanian, N. Zabaras, Sparse grid collocation methods for stochastic natural convection problems, Journal of Computational Physics, in press.
[15] R.L. Iman et al., PARTITION: A Program for Defining the Source Term/Consequence Analysis Interface in the NUREG-1150 Probabilistic Risk Assessments, Sandia National Laboratories, NUREG/CR-5253, SAND88-2940, May 1990.
[16] F.P. Incropera, D.P. DeWitt, Fundamentals of Heat Transfer, John Wiley & Sons, New York, 1981.
[17] M.D. Brandyberry, Uncertainty quantification in 3D rocket simulation, AIAA-2006-4586, in: American Institute of Aeronautics and Astronautics JPC Meeting, July 2006.
[18] M.D. Brandyberry, E. Egejuru, Uncertainty quantification for multiphysics solid rocket motor simulations, in: 54th JANNAF Propulsion Meeting, Denver, CO, May 14–18, 2007.