Fluid Phase Equilibria 222–223 (2004) 95–106
Large scale data regression for process calculations—theory and description of the virtual database夽 M.A. Satyro a,∗ , A. Liu a , R.K. Agarwal a , Y.-K. Li a , O.J. Santollani a , W.Y. Svrcek b a
Virtual Materials Group, Inc., 657 Hawkside Mews NW, Calgary, Alta., Canada b The University of Calgary, Calgary, Alta., Canada
Abstract Optimized thermodynamic model parameter databases are the cornerstone for the creation of models for process calculations. In this paper we describe the statistical tools and the general characteristics of a new database that collects the necessary model parameters for process calculations and simulations. This new database was created based on evaluated data only, with special considerations towards consistency of pure component and thermodynamic equilibrium data, extensive empirical and semi-empirical quality tests of derived models and comprehensive documentation and retrieval facilities for every piece of information stored in the system. The new thermodynamic model parameter database, VIRTUAL, provides users of chemical process calculations with high quality, extensively tested and consistent model parameters based on carefully screened experimental data. © 2004 Elsevier B.V. All rights reserved. Keywords: Data; Equation of state; Activity coefficient; Vapor pressure; Model parameter determination
1. Introduction The most fundamental aspect of process simulation calculations can be traced back to the related thermodynamic model used for the mixture phase behavior and physical properties. Underneath this thermodynamic model we invariably find pure component properties such as critical properties and vapor pressures, as well as interaction parameters necessary to guarantee adequate performance of mixture models such as UNIQUAC and Peng–Robinson. Common engineering practice stresses the convenience of computer calculations. However, more often than not the accuracy of the underlining thermodynamic method used for simulations is not accessed, or, at best, accessed in an incomplete manner since most of the data collections currently available do not provide complete statistical information necessary for the proper performance evaluation of a model. For example, good quality engineering databases such as DIPPR 801 [1] and Yaws [2] provide only fragmentary statistical information about correlation errors in the form of average and maximum errors for temperature de夽 Paper to be presented at the 15th Symposium of Thermophysical Properties, Boulder, CO, June 22–27, 2003. ∗ Corresponding author. Tel.: +1 403 241 3929; fax: +1 403 241 8018.
0378-3812/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.fluid.2004.06.026
pendent properties such as vapor pressure. Standard deviations for important temperature independent physical properties such as critical properties and acentric factors are not provided. Temperature dependent properties correlations are also incomplete in a sense that the variances for regressed parameters are not available, thus precluding the use of such data for a comprehensive error propagation analysis of calculated physical properties for mixtures. For mixture interaction parameters, the situation is even more confusing. Interaction parameters for VLE models are indirect functions of pure component properties, and have to be uniquely identified with these pure component parameters. In the databases currently used to support VLE calculations there is no explicit guarantee that the vapor pressure and other pure component parameters are consistent with the interaction parameters being used in the simulations [3–8] with the exception of one property data package system [9] that assures automatic trace back capabilities to all its interaction parameters and pure component properties. This lack of quality control and traceability creates the potential for inconsistent simulations that “look” reasonable, but may have basic errors that are not obvious, and may cause significant economic losses related to the equipment performance, product specification failure and plant operability problems. These are known issues in chemical process simulation, but
96
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
since their proper resolution depend on significant resources related to data management, expert analysis, and processing capabilities, so far, these problems have never been resolved in a comprehensive manner. Our goal is to create a consistent data source for chemical process simulation using the most accurate and well maintained data source available, combined with extensive quality control, consistent use of thermodynamic quality tools such as thermodynamic consistency tests, extensive use of process simulation tools such as multiphase flash calculations, and complete statistical data analysis for the final estimation of uncertainties of the calculated values. It is our belief that a numerical result provided by a process simulation software tools without quantifiers of their uncertainties are incomplete from a scientific point of view, and potentially dangerous when used by engineers without appreciation for error estimates in the underlining thermodynamic models as well as errors in the underlining property data used to power the thermodynamic model such as vapor pressures and interaction parameters. This quality-driven solution for the calculation of physical properties used in process simulation is possible only if a robust data management system of pure component and mixture data with evaluated uncertainties is available [10,11], and the necessary thermodynamic framework of models and multiphase flash calculations is developed [9] to be used for extensive data regression and statistical analysis within the structure of rigorous flash calculations used for process simulation. These concepts were explored in detail by the creation of a large prototype to test the necessary mathematical and computational tools [12]. In this paper we will describe the details related to the statistical tools applied to the determination of optimal parameters for mathematical models used for physical property calculations and the overall structure of the parameters collected. This extensive set of optimized thermodynamic model parameters is stored in a modern relational database named currently named VIRTUAL.
2. Problem description The problem chemical and process engineers try to solve can be summarized as finding the best process operating conditions for a given process under certain constraints. The analysis of chemical processes is done using some type of process simulation tool that allows the engineers to mimic in the computer the mass and energy balances that govern the mathematical models of process equipment. At the heart of the process simulator we find a property package system, responsible for solving the multiphase flash equations and calculating fluid physical properties such as viscosity and density. The development of purely theoretical models based on quantum chemistry calculations is progressing [13], but the need for empirical and semi-empirical models is a constant
given our current technology, and these simpler models for thermodynamic properties are going to be used and enhanced for a significant part of this century. Therefore, since we cannot predict physical properties important for process calculations without experimental data, it relies on us to find the most appropriate path to use the existing data as intelligently as possible to maximize experimental knowledge of Nature, and combine this data with carefully crafted models to allow engineers to take full advantage of the data. Closing the circle, the models created based on experimental data have to be fully validated to provide engineers with a quantifiable degree of precision to the calculations being performed in simulations. We can then express the problem we want to solve as finding a model architecture m that maximizes that provides the best agreement between the data d and a set of adjustable parameters ϕ [14]. Mathematically we express this objective as: m = max p(θ|d, m)
(1)
The expression above tells us that the model m is the one that provides the best agreement between the measured data from all available (and supposed reasonable) mathematical models m using the adjustable parameters ϕ. 2.1. Basis for selection of a “best model” We have to select some criteria in order to decide how to define the “best” set of model parameters. There is a vast literature about how to proceed with respect to the selection of proper probabilistic models [15–17]. In our case we have to accept the fact that we will be working most of the time using data reported in the open literature for which limited information about the actual experiment is available. Therefore, sophisticated statistical models that could be useful for the detailed analysis of a specific experiment cannot be used. Rather, we have to use to our best advantage uncertainty information provided by NIST in the SOURCE database combined with the most appropriate statistical model taking into account this data and uncertainty information used to qualify the data by NIST. In other words, we seek a balance between mathematical rigor and the realities of the data—and data qualification—actually available for the creation of mathematical models for the prediction and correlation of thermodynamic data. We also have to keep in mind that the solution of Eq. (1) is dependent on the universe of mathematical models we have available for the solution of a given problem. Sometimes we will be able to find a “definitive” answer (from a practical point of view) for a certain modeling problem, while sometimes we will look for a small set of models m used to solve a modeling problem and let the user select the appropriate model based on his or hers personal experience and taste. For example, we will demonstrate that a good general vapor pressure correlation can be used to model a large amount of
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
experimental vapor pressure data and therefore this model can be chosen as a reasonable representation of this physical property. On the other hand, models for vapor–liquid equilibrium abound and one has to be flexible enough in providing enough models and respective parameters to allow individual modelers to proceed with their simulation work. Using Bayes’ rule, Eq. (1) can be transformed into: max p(θ|d, m) = max θ
θ
p(d|θ, m)p(θ|m) p(θ|m)
max p(θ|d, m) = max θ
max = θ
θ
p(d|θ, m)p(θ|m) ϕ p(d|θ)p(θ|m) dϕ
likelihood × prior evidence
(2) (3)
(4)
Bayes’ theorem allows us to factor the problem into three parts. The likelihood measures the match between the data and the model predictions with the model parameters based on an error model. The prior defines some advanced beliefs about which values of the coefficients are reasonable and the evidence measures how well the model describes the data [14]. Eq. (4) describes a formal Bayesian perspective, and strict adherence to this point of view is impractical for the class of problems we wish to solve. We note that that we would have to associate priors to each parameter we would like to use in the model—a formidable task since the parameters in most of chemical engineering models are empirical—and that the integral in the evidence term would have to be evaluated using some kind of Monte Carlo technique, and it would be valid only for one of the models in m. We therefore simplify the problem by assuming a “reasonable” model for the physical property calculations. This does not preclude us to investigate several different models for a given physical property and recommend what is the “best” from a statistical point of view, but we are accepting the fact that we have not performed an exhaustive search of all possible mathematical models available to describe the physical property we are interested in. Thus when we talk about the “best” model for a physical property the meaning of “best” has to be taken into context. Sometimes we will be content to recommend a small subset of “best” models such as models for vapor–liquid equilibrium calculations. It is important to keep in mind the limitations of any statistical analysis and the limitations of the theoretical tools used to model physical properties. For example, one could recommend a vapor–liquid equilibrium model as the “best” (from a statistical point of view). Now, this “best” model would emerge as the model that provided the best fit for a large number of binary systems, and there is no guarantee that this model is still the best when modeling multicomponent systems, nor that for some classes of systems another model is more accurate. We cannot address the former unless we have more theoretically based models, but we can address the latter by providing the necessary
97
statistical information for each binary system analyzed and therefore allowing the user to make a more educated selection of models. With the simplification described above Eq. (2) can be rewritten as p(d|θ)p(θ) max p(θ|d) = max (5) ϕ θ p(d) There are many standard textbooks on how to apply Eq. (5) to actual problems, such as Bard’s [14]. One of the clearest expositions available on data regression applied to chemical engineering problems is provided by Caracotsios [16], and the following discussion is based on his exposition. We start by stating that a model will be defined by a set of algebraic and/or differential equations. The values in these equations are classified into dependent and independent variables and parameters. The variables can be measured independently while performing an experiment while parameters are properties belonging to the proposed model that may or may not be estimated from the set of experimental variables. The following vector equation is used to describe a model: .
= 0 F (t, x, u (t, x), u (t, x), u (t, x), u (t, x) : θ)
(6)
F is a non-linear vector operator which is assumed to be continuously differentiable with respect to all of its parameters, t is the time variable, x are all the independent variables except time, u (t, x) is an n-dimensional vector of state variables and θ is an m-dimensional vector of parameters. The “.” operator denotes the time derivative operator while and denote the first and second derivative operators, respectively, for the other independent variables. The state variables u (t, x) and all their derivatives are also assumed dif ferentiable with respect to the model parameter vector θ. The data matrix is defined as Y (t, x) where the uth row contains the data vector for the uth experiment. A matrix of predicted values according to the model x, u Z(t, (t, x) : θ) is associated with Y (t, x): = Z(t, x, u Z (t, x) : θ)
(7)
is a non-linear function. The deviation between Usually Z function is an indicator the experimental data and the Z on how the model fits the data. This deviation defines the residual or error matrix (Eq. (8)): = Y (t, x) − Z(t, x, u x, u R(t, (t, x) : θ) (t, x) : θ)
(8)
into its column vectors, Caracotsios partition the matrix R Eq. (9): x, u R(t, (t, x) : θ) · · · eM (t, x, u (t, x) : θ)| (t, x) : θ)] = [e1 (t, x, u
(9)
Using the notation provided by Eq. (9) we can define a matrix of residuals, Eq. (10): (t, x) : θ) Vij (t, x, u · ej (t, x, u = ei (t, x, u (t, x) : θ) (t, x) : θ)
(10)
98
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
where the symbol · is the dot product operator. In Eq. (9) M is defined using Eq. (11): M = max Mu 1≤u≤N
(11)
where N is the total number of experiments and Mu the number of observations recorded in the uth experiment. The number of responses observed is usually different from experiment to experiment. For convenience we adopt the following notation, Eqs. (12)–(16): Y = Y (t, x)
(12)
= Z(t, u, θ) x, u Z( (t, x) : θ)
(13)
= R(t, u, θ) x, u R( (t, x) : θ)
(14)
= V (t, x, u V ( u, θ) (t, x) : θ)
(15)
= ei (t, x, u ei ( u, θ) (t, x) : θ)
(16)
It is now necessary to define a function to measure how well the model fits the experimental data, Eq. (17): φ) S = S(θ,
(17)
Eq. (17) states that the quality of how well our model fits the data is dependent on the model parameters. Given the wide availability of inexpensive high-speed computers, it is worthwhile to proceed with a rigorous statistical analysis dictated by our technical and scientific objectives instead of mathematical simplicity. Caracotsios suggests the use of a Bayesian approach to the statistical analysis. In is introduced to state this approach, a prior distribution p(θ) our knowledge—or ignorance—about the parameters of the model before the experimental data is collected. Given the the data probability distribuprior probability function p(θ), tion p(Y |θ) and the mathematical model one can calculate the probability distribution of the parameter vector p(Y |θ) after the data is measured. This is called the posterior probability distribution and is the basis for determining the function S and for making statistical studies about the model parameters. Assuming that the experimental data is known, then Y ) can be assumed to be a function of θ only, and this p(θ| is usually known as the likelihood function. Using Bayes’ theorem we have the following expression for the posterior probability distribution, Eq. (18): Y )αp(θ)l( θ| Y ) p(θ|
(18)
The likelihood function is fundamental for the parameter estimation it is the function used by the data matrix Y to To modify the prior knowledge of the parameter vector θ. define the likelihood function we have to define the structure of the data matrix and the distribution of errors for each experiment. It is also convenient now to properly define the data structure matrix Y to accommodate the structure we usually experience in thermodynamic related problems. First we partition the data matrix Y into Nb blocks where each block corresponds to all observations with the same
variance–covariance matrix. In practical terms, each data block corresponds to an experiment. In the same way we and the moment of the u, θ) partition the error matrix R( as described in Eqs. (19)–(21): residual matrix V ( u, θ) Y = |Y1 |Y2 | · · · |YNb |
(19)
= |R R · · · |R u, θ) 1 ( 2 ( Nb ( R( u, θ)| u, θ)| u, θ)|
(20)
= |V1 ( V2 ( · · · |VNb ( V ( u, θ) u, θ)| u, θ)| u, θ)|
(21)
The number of observed experimental values can change from data block to data block, but this number is assumed constant for a given data block. We also note that there are Ni experiments in the ith data block and Mj observed experimental values in the jth experiment. N denotes the total number of experiments and the following relations are defined: N=
Nb
Ni
(22)
i=1
M = max Mu
(23)
1≤u≤N
We now assume that the data does not contain systematic errors and that there are no dominant factors contributing to the errors in the data. If the above applies, then we can assume that the experimental errors are normally distributed with zero mean. This is expressed mathematically as the following relationships: i (u, j)) = 0 E(R
(24)
i (u, j)R k (v, l)) = 0, E(R
where u = v or i = k
i (u, j)R i (u, l)) = ( E(R σi )jl
(25) (26)
Eq. (24) states that the mean error is zero, while Eq. (25) states that the errors amongst different experiments are uncorrelated and finally Eq. (26) defines the variance– covariance matrix. If one imposes on the likelihood function the condition of adding to the fitting procedure the smallest amount of information (or maximum entropy), variational calculus can be used to derive the following relationship [15,16]: σ1 , σ2 , . . . , σNn |Y ) l(θ, N b −(1/2)Mk Nk −(1/2)Nk = (2π) det( σk ) k=1 N b
×
1 exp − tr( σk−1 Vk ( u, θ)) 2
(27)
k=1
With the posterior probability distribution worked out in Eq. (27), we now have to define the prior probability distribution. This can be expressed mathematically by assuming that the model parameters are distributed independently from the variances: σ1 , σ2 , . . . , σNb )αp( p(θ, σ1 )p( σ2 ) · · · p( σNb )
(28)
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
Eq. (28) can be transformed into Eq. (29): "
"
After the optimization procedure is complete, we have the optimal parameter vector θ∗ and the posterior probability function has its maximum value as shown in Eq. (34):
"
σ 1 , σ 2 , . . . , σ Nb ) p(θ, ∝ det( σ1 )−(1/2)(M1 +1) det( σ2 )−(1/2)(M2 +1) · · · det( σNb )−(1/2)(MNb +1)
"∗
(29)
The combination of Eqs. (18)–(27) and (29) is expressed as Eq. (30), which tells us the probability distribution of the parameter vector θ after all the experiments of interest are performed: σ1 , σ2 , . . . , σNn |Y ) l(θ, N b −(1/2)(Mk Nk ) −(1/2)(Mk +Nk +1) = det( σk ) (2π) k=1
×
N b k=1
1 exp − tr( u, θ)) σk−1 Vk ( 2
(30)
Nb Mi Mi
"
"
( σ i )−1 jk (Vi )kj ( u , θ)
(31)
In our specific case, we will not usually have covariance information, but variances are always known. In other words, the variance–covariance matrix is diagonal and Eq. (31) can be simplified into Eq. (32): = S(θ)
" " ( σ i )kk (Vi )kk ( u , θ)
(32)
i=1 k=1
Eq. (32) is the basic equation used for the data fitting procedures used in this work. 2.2. Statistical information on the parameter vector θ Caracotsios demonstrated that the posterior probability function of parameter θ can be expressed as Eq. (33): "
p( θ |Y ) ∝
Nb i=1
−Ni /2 det Vi ( u, θ)
−Ni /2 det Vi ( u, θ∗ )
(34)
If one takes the ratio of Eqs. (33) and (34) and the logarithm we have the following expression for the posterior probability function, Eq. (35):
p(θ " |Y ) log pmax (θ " |Y ) Nb Ni
2
u, θ)))−log(det( Vi ( u, θ∗ )))] [log(det(Vi (
(35)
Eq. (35) can be expanded in series as Eq. (36): Y ) ≈ log pmax (θ| Y ) − 1 (θ − θ∗ )T H( θ∗ )(θ − θ∗ ) log p(θ| 2
i=1 j=1 k=1
Nb Mi
Nb i=1
i=1
Eq. (30) is the starting point for the determination of the parameter vector θ and the variance–covariance matrices σi ; one can determine the parameter vector θ and/or the variance–covariance matrices σi in such a way as to maximize the posterior probability distribution as defined by Eq. (30). Given the type of data we have to work with from the SOURCE database, the most important special case of Eq. (30) is the one commonly known as weighted least squares. We assume that the variance–covariance matrix is known, the maximization of the posterior probability distribution is equivalent to the minimization of Eq. (31): = S(θ)
pmax ( θ |Y ) ∝
=
99
(33)
(36) The Hessian matrix H is given in Eq. (37): = H(θ)
Nb Ni ∂2 log(det(Vi ( u, θ))) i=1
2
θ ∂θ∂
(37)
Using the Taylor expansion above, Eq. (35) can be rearranged to provide the approximate probability distribution of the parameter vector θ described in Eq. (38): Y ) ≈ det(H( θ∗ ))1/2 exp(− 1 (θ − θ∗ )T H( θ∗ ) log p(θ| 2 × (θ − θ∗ ))
(38)
Eq. (38) is simply the expression of a multivariate normal distribution with expected value θ∗ with a covariance matrix given by Eq. (39): =H −1 (θ∗ ) Cov(θ)
(39)
The approximate probability contours for a given confidence level are provided by Eq. (40): θ∗ )(θ − θ∗ ) = χ2 (m, α) (θ − θ∗ )T H(
(40)
χ2 is the chi-squared distribution with m degrees of freedom and confidence level α.
3. Specific models for data fitting After the definition of the statistical approach for data regression, we are left with the task of casting the general mathematical framework into specific procedures for the fitting of physical properties important for thermodynamic calculations. In a broad sense, we have three classes of physical properties that have to be addressed—fixed point physical properties such as critical points, standard liquid densities
100
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
and normal boiling points; pure component temperature dependent physical properties such as vapor pressures, gas and liquid densities, ideal gas heat capacities; finally we have mixture dependent properties such as parameters for equation of state mixing rules and activity coefficient models interaction parameters.
Fixed point properties are important for thermodynamic models for two reasons. First, they provide the least amount of information necessary to characterize a pure compound, and a subset of these fixed point properties are used in generalized correlations that in turn can estimate fixed or temperature dependent physical properties. For example, when characterizing hydrocarbons it is common to correlate properties such as standard liquid density and molecular weight with normal boiling point. In addition, acentric factors can be used to estimate critical temperatures and pressures and from this information rough vapor pressure estimates can be constructed. Second, some important models for phase equilibria calculations have been parameterized using basic pure component properties. Notable models are simple cubic equations of state such as Soave–Redlich–Kwong (S–R–K) and Peng–Robinson (PR) which are widely used in process simulation calculations and are parameterized using critical temperatures and pressures and the acentric factor. Starting with Eq. (32), optimal values for point properties is given by Eq. (41), while the variance is given by Eq. (42): Nb Yi /σi2 θ ∗ = i=1 (41) Nb 2 i=1 1/σi (42)
3.2. Regression model—temperature dependent physical properties Temperature dependent physical properties are usually of the type expressed by Eq. (43): (43) F = F(T, θ∗ ) The condition of maximum probability can be expressed by a gradient of Eq. (32) being zero, Eq. (44): Nb M
i Fik − Yik ∂Fik ∂S = =0 ∂θj ∂θj σik2
3.3. Regression model—vapor/liquid equilibrium and excess properties Multicomponent thermodynamic equilibrium problems can be expressed as Eq. (45):
3.1. Regression model—fixed point physical properties
−1
Nb 1 Cov(θ ∗ ) = σ2 i=1 i
In the event of non-linear functions, Eq. (44) can be efficiently solved using the Marquart–Levenberg method.
(44)
i=1 k=1
Eq. (44) provide us with a set of linear equations if function F is linear in the parameter vector θ or a set of non-linear equations if function F is non-linear in the parameter vector If function F is linear then the system of Eq. (44) can θ. be solved without iteration and robust elimination methods using singular value decomposition can be used [15,17].
F = F ( ϕ, θ∗ )
(45)
where F is a vector of thermodynamic variables at their equilibrium, and ϕ a vector of intensive variables that define the state of the system we are interested. For example, ϕ could represent a vapor fraction, a pressure and the bulk composition of the system. Unless the models under study are very simple, Eq. (45) is always non-linear and frequently it has to be solved using robust multiphase flash algorithms [18] to avoid unrealistic parameters coming from unstable solutions of the material balance equations. This problem is rather common with published interaction parameters where regressions are done without proper attention to phase stability and unphysical behavior is predicted for systems with partial miscibility. For problems related to multicomponent thermodynamic equilibrium, Eq. (32) can be written as Eq. (46): S=
Nk Mi Nb (Fiku − Yiku )2 i=1 k=1 u=1
2 σiku
(46)
And the condition of maximum probability can be expressed as the point where the gradient of Eq. (46) is equal to zero as expressed in Eq. (47). Note that Nk represents the number of response variables collected during a thermodynamic experiment. For example, if the experiment was a TXY type measurement the Nk variables are the bubble temperature, dew temperature, the vector of liquid compositions and the vector of vapor compositions: Nb M N
k i ∂S Fiku − Yiku ∂Fiku = =0 2 ∂θj ∂θj σiku
(47)
i=1 k=1 u=1
It is interesting to note that popular non-linear least squares computer codes available in the literature, such as the one published by Anderson et al. [19] provide only a partial implementation of Eq. (47). In this specific case, Anderson et al.’s implementation considers only bubble pressure or temperature data when processing TXY or PXY datasets thus using an incomplete set of available experimental data. This was corrected by Satyro [20] by using a simultaneous calculation of bubble and dew point conditions for each data point. At this point, it is also convenient to note that there are alternatives to the solution of Eq. (32). Britt and Luecke [21] proposed that the measured data vector Y should also be adjusted depending on the variance–covariance matrix and internal constraints available due to the development of the
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
101
mathematical model, thus performing a simultaneous determination of the optimal model parameter vector θ∗ and a “reconciled” set of experimental values within some limits dictated by the uncertainty of the data. The basis of this idea can be traced to Deming’s work [22,23]. An alternative approach was suggested by Vasquez and Whiting [24] where the variance–covariance matrix is reevaluated at each optimal parameter vector θ∗ until convergence is achieved. Both Britt and Luecke and Vasquez and Whiting methods are more sophisticated from a statistical point of view and should be considered when the modeler has extensive statistical information about the experiment being performed. This sophistication is not however without a price. The Britt and Luecke method suffers from bad convergence behavior when the initial guesses for the parameter vector are far away from the converged values. In this work we will always be creating models based on published literature data and uncertainty information coming from careful data revision from the TRC specialists, and therefore the additional sophistication that methods such as Britt and Luecke bring is not justified.
correlations used in the refining industry for the estimation of thermodynamic properties of oil cuts. Since fixed point properties are directly used in equations of state used for thermophysical calculations and/or used in the estimation of thermodynamic properties, it is important to collect not only common statistics such as errors and deviations, but also the variance and standard deviation to allow us to perform error propagation.
4. Statistical information
At the end of the modeling scale we have parameters used for the representation of phase equilibria. These parameters usually appear in the mathematical models as interaction parameters, and they are used to subsidize the quality of the predictions coming from theoretically based models. The most famous of such models are activity coefficient models such as NRTL and UNIQUAC, or equations of state models such as Peng–Robinson. In addition to standard statistical information such as errors and deviations, the variance–covariance matrix for each binary pair is collected.
With the mathematical model for the regression work defined in the previous section, the next step is to define the information that will be collected during the data regression work. The information being collected is defined by the following requirements imposed on the mathematical models to be used for chemical process calculations: • Basic quality descriptors have to be easily accessed— average absolute percent error, average absolute deviation, percent bias, deviation bias, maximum absolute percent error, maximum absolute deviation, S at optimum, variance of the fit and 95% confidence intervals for optimized variables. • Error propagation descriptors have to be easily accessed— variance–covariance matrix for the optimal model parameters. • Complete backtracking of information have to be easily accessed—all references used for regression have to be available for inspection. • Visual description of fit, including data fit plots, error and deviation plots. This information is collected for each type of thermodynamic property we are processing. 4.1. Collected information—fixed point properties Fixed point properties are basic for process calculations. They may be used as part of the definition of simple equations of state such as Soave–Redlich–Kwong and Peng–Robinson or as parameters used in generalized correlations for the estimation of physical properties, such as
4.2. Collected information—temperature dependent physical properties Temperature dependent physical properties are also basic for chemical process calculations; for example process calculations without reliable vapor pressures are meaningless. In addition to standard statistical information such as errors and deviations, the variance–covariance matrix for each temperature dependent physical property is collected for each compound. 4.3. Collected information—phase equilibria and excess properties
4.4. Summary of collected statistical information It is convenient to summarize all the collected statistical information for each of the data types discussed above (Table 1). Table 1 Statistics collected for different physical properties Data regression parameters
Fixed point properties
Temperature dependent properties
Phase equilibria and excess properties
Average absolute error Average absolute deviation Maximum absolute error Maximum absolute deviation Maximum deviation position Bias Least squares function S Variance of fit Variance–covariance matrix
Yes Yes
Yes Yes
Yes Yes
Yes Yes
Yes Yes
Yes Yes
Yes
Yes
Yes
Yes Yes Yes Yes
Yes Yes Yes Yes
Yes Yes Yes Yes
102
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
5. Quality and consistency information Just as important as the statistical information, quality information is collected for each physical property stored in the database. Usually the quality information is collected taking advantage of interrelationships existing between physical properties given by classical thermodynamics. This in turn allow the user to better exercise his or hers judgment on the applicability of the thermodynamic models for specific process conditions unavailable to the database. Quality information is divided into two classes, pure component and mixture. 5.1. Pure component quality information Quality information for pure compounds is usually based in the comparison between a fixed point physical property and a value derived from a temperature dependent physical property. It is important to realize that the boundary between fixed point physical properties and temperature dependent physical properties is sometimes dictated by convenience. For example, normal boiling points and critical pressures and classified as fixed point properties, but naturally they are also in principle available from a “good” vapor pressure correlation. From a practical point of view, some fixed point physical properties represent extensively studied, narrow regions of a temperature dependent property and they provide a convenient way of spot checking the accuracy of temperature dependent physical properties as well as the accuracy of derived physical properties to ensure that the models are behaving in a physically sound manner. From a process simulation point of view, vapor pressures, densities, heat capacities and enthalpies of vaporization are the most important physical properties and special attention was devoted in the collection of quality and consistency for these physical properties. 5.1.1. Vapor pressure The quality of the vapor pressure correlation is accessed by calculating the critical pressure and the normal boiling point pressure based on the vapor pressure correlation and the normal boiling point and the critical temperature fixed point properties. In addition, the enthalpy of vaporization is estimated using the vapor pressure correlation using the Clausius–Clapeyron equation at the compound normal boiling point, and the isobaric heat capacity as predicted by the second derivative of the vapor pressure correlation and the ideal gas heat capacity is compared against the isobaric heat capacity at 25 ◦ C and the isobaric heat capacity at the normal boiling point. Therefore, vapor pressures allow us to spot check the quality of a vapor pressure correlation, ideal gas heat capacity correlation, enthalpy of vaporization correlation, normal boiling point, critical temperature and critical pressure simultaneously.
5.1.2. Liquid density The quality of the liquid density correlation is accessed by comparing the values predicted by the correlation and the liquid densities at 25 ◦ C and at the normal boiling point. 5.1.3. Ideal gas enthalpy of formation and ideal gas Gibbs free energy of formation Ideal gas enthalpies of formation and ideal gas Gibbs free energies of formation are fundamental quantities for the proper modeling of reactive systems. Usually only values at 25 ◦ C are reported [1], or simple polynomials with limited temperature range [2]. These two properties are modeled using the same formulation used for ideal gas heat capacities and available over a wide range of temperatures. As a quality check, the values calculated from the correlations are compared against high quality enthalpies of formation and Gibbs free energies at 25 ◦ C. 5.2. Phase equilibrium quality information Phase equilibrium quality information is fundamental for quality process simulations, but usually this information is obtained in an ad hoc manner by doing costly audits using process simulators. We believe that it is much more effective to provide this quality information a priori to process simulation users in order to allow them to intelligently decide if they need additional information and in the event that the model is not applicable, where they should invest in order to acquire quality experimental information. In addition, the landscape of phase equilibrium models is filled with incorrect phase splits, incorrect azeotrope predictions and inconsistent models. Quality and consistency information goes a long way to help us weed out bad models or at least advise the user against potentially bad results when using a model. 5.2.1. Thermodynamic consistency Thermodynamic consistency tests are useful to provide a measure of quality of vapor–liquid equilibrium data. The tests are based in intrinsic relationships that exist between thermodynamic quantities, such as the Gibbs–Duhem equation. In this work thermodynamic consistency is accessed using three models, Herrington [25], Fredenslund [26] and van Ness [27]. These tests are performed for each TXY or PXY dataset. 5.2.2. Azeotrope prediction Each dataset is examined for the existence of azeotropes by inspecting the XY curve (for TXY and PXY datasets) or the bubble pressure (for PX) or the bubble temperature (TX) datasets. If XY curves are available, then the azeotrope point is calculated by finding the point where the vapor and liquid compositions are identical. If only PX or TX data is available, maximum or minimum points are searched. If azeotropes are found, the azeotrope thermodynamic coordinates are stored in the VIRTUAL database and then are used
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
to check the quality of the prediction of VLE calculations when using the optimized model parameters determined for equations of state or activity coefficient models. 5.2.3. End points for pure components One of the unique points of the VIRTUAL database is its attention to detail and consistency in the use of thermodynamic properties. Since VIRTUAL provides a consistent set of vapor pressure parameters as well as the possibility of estimating the accuracy of vapor pressure predictions, this allows a rigorous quality test of vapor pressures measured when original VLE data was collected. When pure component data is available in the original VLE data the quality of vapor pressures for the components making up the mixture is accessed. 5.2.4. Proper phase split test After optimal interaction parameters are determined, it is important to check the accuracy of the predictions against the experimental data as well as the physical correctness of the predicted separation space. Each VLE model is analyzed with respect to the correctness of the predicted thermodynamic equilibrium, and quality information is collected to warn the user of model defects such as prediction of false phase splitting for single liquid phase systems or absence of phase splitting for heterogeneous systems. 5.2.5. Asymmetry information Each dataset is compared against a symmetric Gibbs excess model based on a two suffix Margules equation to provide information about the nature of the binary system as symmetric or asymmetric and allow a qualitative classification about the non-ideality of the system [28]. 5.2.6. Composition momentum Each dataset is screened to determine if the data is reasonably spread across the composition space and help the user to access the applicability of the optimized thermodynamic model parameters to a particular process.
103
Table 2 Fixed point descriptive properties Property
Description
Name CASN
Common name used for compound access Chemical abstracts service number, also used for compound access Chemical family type as defined by DIPPR (1) Chemical formula Unifac structure Common synonyms for compound Compound chemical structure in ChemDraw format
Family Formula UNIFAC Synonyms Structure
5.2.7. Infinite dilution activity coefficients Infinite dilution activity coefficients are collected for each optimized thermodynamic model parameters. 6. Fixed point physical properties Fixed point fixed properties collected in this work are summarized in this section. For convenience, the properties are divided into descriptive properties, basic properties, energetic properties and vapor–liquid equilibrium properties. 7. Temperature dependent physical properties Temperature dependent properties collected in this work are summarized in this section (Tables 2–6). It is important to note that the VIRTUAL database stores empirical parameters for phase specific properties such as liquid density. Properties such as viscosity and thermal conductivity are based on corresponding states methods (22, 23) and the stored parameters can be used to model the gas and liquid phases simultaneously. 8. Phase equilibrium and excess properties Phase equilibrium model parameters for a wide varied of thermodynamic models are stored in the VIRTUAL
Table 3 Fixed point basic properties Property
Unit
Description
Molecular weight Normal boiling point Standard liquid density at reference temperature
kg/kmol K kg/m3
Reference temperature for liquid density Critical temperature Critical pressure Critical compressibility Acentric factor Dipole moment Radius of gyration Temperature of fusion Triple point temperature Enthalpy of fusion
K K kPa – – Cm M K K kJ/kmol
Chemical compound molecular weight Boiling temperature at 101.325 kPa Liquid density at reference temperature. Usually 25 ◦ C or normal boiling point for low boiling compounds Reference temperature for liquid density Critical temperature Critical pressure Compressibility factor at critical point Acentric factor as defined by Pitzer Dipole moment Radius of gyration Temperature of fusion Coexistence temperature for gas, liquid and solid Enthalpy of fusion at temperature of fusion
104
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
Table 4 Fixed point energetic properties Property
Unit
Description
Ideal gas enthalpy of formation Ideal gas Gibbs free energy of formation Enthalpy of combustion
kJ/kmol kJ/kmol kJ/kmol
Ideal gas enthalpy of formation at 25 ◦ C and 101.325 kPa Ideal gas Gibbs free energy of formation at 25 ◦ C and 101.325 kPa Enthalpy of combustion at 25 ◦ C
Table 5 Fixed point vapor–liquid equilibrium properties Property
Unit
Description
Activity coefficient normalization
–
Electric charge
–
Wilson molar volume UNIQUAC van der Waals area UNIQUAC van der Waals volume Hayden–O’Connell association parameter
kmol/m3 – – –
Variable used to define the normalization used for activity coefficients as symmetric or asymmetric Electric charge of a compound. Normal compounds have zero charge while cations and anions have electric charges Molar volume at 25 ◦ C used by the Wilson activity coefficient model Molecular area normalized by the CH2 mer area Molecular volume normalized by the CH2 mer volume Self association parameter as defined by Hayden and O’Connell
Table 6 Temperature dependent physical properties Property
Unit
Description
Vapor pressure Sublimation pressure Liquid density Second virial coefficient Solid density Ideal gas heat capacity Ideal gas enthalpy of formation Ideal gas Gibbs free energy of formation Enthalpy of vaporization Viscosity Thermal conductivity Solid thermal conductivity Gas–liquid surface tension
kPa kPa kmol/m3 M3 /kmol kmol/m3 kJ/kmol K kJ/kmol kJ/kmol kJ/kmol Pa s W/m K W/m K M/m
Vapor–liquid vapor pressure Vapor–solid vapor pressure Liquid density Second virial coefficient Solid density Isobaric heat capacity at ideal gas state Ideal gas enthalpy of formation Ideal gas Gibbs free energy of formation Liquid–gas phase change enthalpy Gas or liquid viscosity Gas or liquid thermal conductivity Solid thermal conductivity Surface tension for gas–liquid interface
database. The models are summarized in Tables 7 and 8, and detailed descriptions of the models can be found in Ref. [16]. For convenience we divide the models into equation of state based models and activity coefficient based models.
and excess liquid phase volumes. Excess liquid phase enthalpies are fitted using the NRTL model, while excess liquid phase volumes are correlated using equations which are liquid phase model dependent. The details on the excess
8.1. Equation of state based models
Table 7 Equations of state for vapor–liquid equilibrium
Equations of state for vapor–liquid equilibrium are shown in Table 7.
Model
Description
RK SRK Refinery-SRK
Redlich–Kwong equation of state Soave–Redlich–Kwong equation of state Soave–Redlich–Kwong equation of state as specified by the American Petroleum Institute Soave–Redlich–Kwong equation of state modified for accuracy natural gas compressibility factor estimates Peng–Robinson equation of state Advanced Peng–Robinson equation of state Peng–Robinson equation of state as specified by the Gas Processors Association Peng–Robinson equation of state with Huron–Vidal mixing rules Benedict–Webb–Rubin equation of state as specified by Starling
8.2. Activity coefficient based models Activity coefficient models are a function of the vapor phase model selected for the data fitting. This fact is usually ignored and inconsistent use of interaction parameters abound in the process simulation practice. In this work, separate interaction parameters are collected for each combination of vapor phase models and activity coefficient models (Table 9). Currently only two excess property parameters are stored in the VIRTUAL database, excess liquid phase enthalpies
ZSRK
PR APR NGL-PR GE-PR BWRS
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106 Table 8 Vapor/liquid models available in the VIRTUAL database
A B C D E F G
105
Acknowledgements
0
1
2
3
4
5
6
7
8
9
A/0 B/0 C/0 D/0 E/0 F/0 G/0
A/1 B/1 C/1 D/1 E/1 F/1 G/1
A/2 B/2 C/2 D/2 E/2 F/2 G/2
A/3 B/3 C/3 D/3 E/3 F/3 G/3
A/4 B/4 C/4 D/4 E/4 F/4 G/4
A/5 B/5 C/5 D/5 E/5 F/5 G/5
A/6 B/6 C/6 D/6 E/6 F/6 G/6
A/7 B/7 C/7 D/7 E/7 F/7 G/7
A/8 B/8 C/8 D/8 E/8 F/8 G/8
A/9 B/9 C/9 D/9 E/9 F/9 G/9
For example, F/8 should be read as NRTL with the Hayden–O’Connell virial equation of state for the gas phase.
Table 9 Model codes for Table 8 Code
Phase model
Phase
A B C D E F G 0 1 2 3 4 5 6 7 8 9
Margules 3 Suffix Margules 4 Suffix Monsanto van Laar Wilson Tsuboka–Katayama–Wilson NRTL UNIQUAC RK SRK Refinery-SRK ZSRK PR APR NGL-PR BWRS Virial (Hayden–O’Connell) Ideal gas
Liquid Liquid Liquid Liquid Liquid Liquid Liquid Vapor Vapor Vapor Vapor Vapor Vapor Vapor Vapor Vapor Vapor
volume correlation will be presented in subsequent articles on VIRTUAL.
9. Conclusion The VIRTUAL database is the culmination of two technologies, the extensive and methodically evaluated databases created by NIST and the extensive physical property and phase equilibrium thermodynamics created by Virtual Materials Group. These two technologies allow the creation of fully documented, quality-driven and consistent databases of optimized model parameters for the calculation of mixture properties. This new optimized thermodynamic model parameters database for process simulation calculations, VIRTUAL, allows engineers to focus on the scientific and process problems at hand with extensive background information thus shortening the time spent in auditing the results of process simulators as well as allowing the engineer to look at the data critically and intelligently thus avoiding costly design mistakes and over design.
The authors are grateful to Virtual Materials Group for its permission to publish this work. We gratefully recognize Mr. Craig Morris from RedTree Development for his help in setting up PostgreSQL and the staff from NIST’s TRC group.
References [1] J.L. Oscarson, R.L. Rowley, W.V. Wilding, Design institute for physical properties project 801, AIChE (2002). [2] C.L. Yaws, Chemical Properties Handbook, McGraw-Hill, New York, 1999. [3] R. Agarwal, Y.-K. Li, O.J. Santollani, M.A. Satyro, A. Vieler, Chem. Eng. Prog. 97 (5) (2001) 42; R. Agarwal, Y.-K. Li, O.J. Santollani, M.A. Satyro, A. Vieler, Chem. Eng. Prog. 97 (6) (2001) 64. [4] N.S. Yang, Z.P. Xu, K.T. Chuang, Hydrocarb. Process. 5 (5) (1999) 109. [5] J. Sadeq, H.A. Duarte, R.W. Serth, Chem. Eng. Educ. 4 (4) (1997) 47. [6] V. Benso, A. Bertucco, RICHMAC Mag. 46 (1995). [7] C.A.D. Moura, H.P. Carneiro, B. Tech., Petrobras, Rio de Janeiro, 34, 3/4, Jul/Dez. 1991, p. 17. [8] D. Zudkevitch, Phase equilibria and fluid property, in: Proceedings of the Chemical Industry Conference, Berlin, 1980. [9] Virtual Materials Property Package System Users Manual, Virtual Materials Group, Inc., Calgary, Alta., Canada, 2001. [10] M. Frenkel, Q. Dong, R.C. Wilhoit, K.C. Hall, Int. J. Thermophys. 22 (2001) 215–226 [11] X. Yan, Q. Dong, M. Frenkel, K.R. Hall, Int. J. Thermophys. 22 (2001) 227. [12] R.K. Agarwal, Y.-K. Li, O.J. Santollani, M.A. Satyro, Paper Presented at the International Conference on Distillation and Absorption, September 30–October 2, 2002, Kongresshaus Baden-Baden, Germany. [13] A. Klamt, COSMO and COSMO-RS, Invited Contribution to the Encyclopedia of Computational Chemistry, P. v. R. Schleyer Hrsg., Wiley, 1998. [14] N. Gershenfeld, The Nature of Mathematical Modeling, Cambridge University Press, 1999. [15] Y. Bard, Non-linear Parameter Estimation, Academic Press, New York, 1974. [16] M. Caracotsios, Model parametric sensitivity analysis and non-linear parameter estimation theory and applications, Ph.D. Dissertation, 1986. [17] W.H. Press, S.A. Teukolsky, W.T Vetterling, B.P. Flannery, Numerical Recipes in C, 2nd ed., Cambridge University Press, 1992. [18] VMGThermo Version 3.85, Virtual Materials Group, Calgary, Alta., Canada, 2003. [19] J.M. Prausnitz, T. Anderson, E. Grens, C. Eckert, R. Hsieh, J. O’Connell, Computer Calculations for Multicomponent Vapor– Liquid and Liquid–Liquid Equilibria, Prentice-Hall, 1980. [20] HYPROP User’s Manual, Hyprotech Ltd., Calgary, Alta., Canada, 1991. [21] H.I. Britt, R.H. Luecke, Technometrics 15 (1973) 233. [22] W.E. Deming, Statistical Adjustment of Data, Dover, 1943. [23] J. Mandel, The Statistical Analysis of Experimental Data, Dover, 1964. [24] V.R. Vasquez, W.B. Whitting, Comput. Chem. Eng. 23 (2000) 1825– 1838.
106
M.A. Satyro et al. / Fluid Phase Equilibria 222–223 (2004) 95–106
[25] E.F. Herington, J. Inst. Petrol. 37 (1951) 457. [26] A. Fredenslund, J. Gmehling, P. Rasmussen, Vapor–Liquid Equilibria Using UNIFAC, Elsevier, 1977. [27] H.C. van Ness, S.M. Byer, R.E. Gibbs, AIChE J. 19 (1973).
[28] M.A. Gess, R.P. Danner, M. Nagvekar, Thermodynamic analysis of vapor–liquid equilibria: recommended models and a standard data base, AIChE (1991).