Chemometricsand intelligent laboratory systems Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
Nonlinear regression methods for modeling of heteroscedastic retention data in reversed-phase high-performance liquid chromatography Margriet M.W.B. Hendriks *, Pierre M.J. Coenegracht,
Durk A. Doornbos
Research Group Chemometrics, University Center for Pharmacy, A. Deusinglaan 2, 9713 AW Groningen, The Netherlands
Received 14 March 1994; accepted 11 May 1994
Abstract New models have been developed that accurately describe the response surfaces of capacity factors that are a function of changes in the pH and the fraction of organic modifier in reversed-phase high-performance liquid
chromatography (RP-HPLC). The purpose of this article is to illustrate one of the problems of nonlinear modeling of capacity factors; it is an extension of the articles on this nonlinear modeling in RP-HPLC that have appeared during the last three years. The main subject of this paper is a description of the methods that can be used in building correct models for the capacity factor as a function of both pH and modifier content of the mobile phase. The heterogeneity of the variance across all capacity factors causes the ordinary nonlinear least squares method to be less appropriate for regression. Two methods are described that take account of the effect of the heterogenic error structure of capacity factors in the modeling procedure. These two methods are then compared with the standard nonlinear modeling procedure.
1. Introduction The resolution R, and the selectivity cx are two important criteria in the optimization of chromatographic separations. An approach to the optimization of these criteria is to vary the mobile phase composition and search for a composition that results in optimal separations. To study the performance of the resolution or selectivity, knowledge on the behavior of the retention as a
* Corresponding
author.
function of the mobile phase composition is one of the first things that should be available. In most cases the mobile phase is composed of water and one or more organic solvents, the so-called modifiers, of which the fraction can be varied. However, for ionic or ionizable solutes, changing the pH or adding an ion-pairing reagent to the mobile phase, may result in better separations. In the research on the systematic optimization of chromatographic separations, recently much attention is paid to the combined use of the mobile phase composition and its pH as variables in the optimization process [l-3]. Models have been developed that accurately describe the rela-
0169-7439/94/$07.00 Q 1994 Elsevier Science B.V. All rights reserved SSDIO169-7439(94)00046-L
228
M.M. W.B. Hendriks et al. / Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
-_ k
i
3
4
5
6
7
PH Fig. 1. Plot of the capacity factor versus the pH of the mobile phase for 3,4-dihydroxyphenylacetic acid.
tion between the retention, or the capacity factor, and the modifier content of the mobile phase and the mobile phase pH. The complicating aspect of using the pH as an independent variable in mobile phase optimization is the nonlinear character of the response (the capacity factor) as a function of this pH. Looking at the response surface of ionizable solutes, the capacity factor varies considerably when the pH is changed (Fig. 1). Variations in the response are the most pronounced within a certain region around the acid constant, K,, of a certain solute. A distinct sigmoidal shaped response curve can be seen when the pH is varied over a reasonable range, for instance 2 to 3 units on both sides of the pK,. It is obvious that polynomial models are not suitable to model this kind of nonlinear behavior. In the literature different approaches are presented to tackle the problem of modeling the nonlinear dependence of the capacity factor as a function of the pH. The first and most simple solution is to vary the pH only over a very limited range, so that the resulting changes in the response values can be described by a linear equation [4]. The second approach, which was proposed by Bourguignon et al. [5], is based on a limited number of experiments, collected accord-
ing to a Doehlert design, and uses a quadratic polynomial function to fit the response. The same research group suggested the use of logistic transformations of the capacity factor before applying linear regression [6]. These transforms perform well if only the pH is considered as the independent variable, but difficulties arise when transforms are applied to the situation where both pH and modifier are varied. The third approach uses nonlinear models, and was first used by Haddad et al. [7] and further developed by Schoenmakers et al. [l]. Although more experiments are required, the third approach is preferred when the concern is not only solving the optimization problem but also the description of the overall separation over a relatively large experimental range, aiming for a good fit to the experimental data. A problem that still has to be solved and may stand in the way of using these models is whether or not’the number of experiments needed for the construction of nonlinear models is too large for the practical solution of this complex optimization. The gain in robustness and ruggedness of models and predictions is a major point that pleads for doing a larger number of experiments. An alternative for the simultaneous optimization strategies could be using the multicriteria steepest ascent method [8]. The first problem that has to be faced is how to model the response surface, and which model selection criteria are to be used. The evaluation of model selection criteria will be described in a following paper. In fact the overall model is based on a combination of two univariate models, but the exact form needs to be estimated, as will be explained in the next section. In this paper, a basic issue in building a nonlinear model for the pH dependence of the capacity factor is discussed. A problem is the nonconstant error variance of the capacity factors, which has to be dealt with in the modeling of the response. Two ways to take the non-constant error structure into account during the modeling process, without a priori knowledge of the sample variance structure, were examined. The first way is to modify the model by a logarithmic transformation, while at the same time the capacity fac-
MM. W.B. Hendriks et al. / Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
tors are logarithmically transformed. The second way is to use generalized nonlinear regression which contains a variance function estimation method.
2. Nonlinearity and hierarchical model building The modeling problem may be divided in finding a model for the modifier ratio dependence, and a model for the dependence of the capacity factor on the pH. The combination of these two models in such a way that also the effects caused by interaction between pH and modifier are included results in the overall model. An exponential curve can be used to represent the dependence of the capacity factor on the modifier content of the mobile phase, while an S-shaped curve is needed for the representation of the relation between the capacity factor and the mobile phase pH. Combining a sigmoidal pH model and an exponential modifier model may result in a suitable overall pH/modifier model [21. The core of the nonlinear model is formed on physicochemical and empirical knowledge of the relation between retention and pH. This knowledge contains simple relationships for acid-base equilibria [l]. Both states in which the solute can appear (protonated or deprotonated state) will give different retention times. For pH values sufficiently different from the pK,, the solute is completely in its protonated or deprotonated form. This form then dominates the retention process. Between the extreme pH values both forms of the solute appear and the overall retention time (and consequently the capacity factor) will be a weighted average of these two extreme values in retention time. Eq. 1 shows this dependence of the capacity factor on the pH [4].
kR= =
[A-l v-w [A-] + [HA] * kA-+ [A-] + [HA] * k~ &, * k,-+
[H+] * k,
K, + [H’l
(1)
in which k, is the overall capacity factor, K, is the acid constant, k,- is the capacity factor of
229
the capacity factor of HA. HA is A- and k, the protonated form, A- is the deprotonated form of the molecule. This model seems to be restricted to acidic solutes, but equivalent models can be constructed for other ionic solutes [31. Three unknown parameters (k,-, k, and K,) are included in the model. In Eq. 2 these terms are represented by (Y, p and y, where (Y and p can be seen as, respectively k,- and k,, the capacity factors belonging to the two states in which the solute can appear, and y represents the acid constant. The parameters (Y and y are not combined into one parameter (Y*= (Y* y, because k=
a*y+P*[H+] Y + [H+]
(2)
it would result in a strong correlation between this new parameter (Y* and the parameter y, and thus influence the modeling procedure. These three parameters (a, p, y) are all dependent on the modifier ratio (4) of the mobile phase. Besides the obvious dependence of the capacity factors, k,- and k, on the modifier ratio 4, one should also take into account the dependence of the acid constant K, on 4. By definition, the acid constant is specified for a solution in 100% water. Therefore the inclusion of a modifier dependent term in the model instead of just the acid constant K, is correct. Leitold and Vigh [9] found that the dependence of the acid constant on the modifier content resulted in an exponential form. As can be seen in Fig. 2 an exponential shape appears for the variation of the capacity factor with the modifier content. For the dependence of the acid constant, i.e. the y term, on the percentage modifier, the same exponential model was chosen as for the (Y and p. Thus for each term the dependence on the modifier content can be represented in the same way: cy=F(4,
ai) =a0*e(‘Y1*++‘Y2*+2)
p =F(4,
pi) =pO *e(PI*++&*@)
y =F(4,
ri) = yO *e(Yl*++Y2*+*)
(3)
230
MM. W.B. Hendriks et al. / Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
3. Heteroscedasticity
olo .
Oil
.
012
.
013 * 014
MeOH (modifier) Fig. 2. Plot of the capacity factor versus the methanol content (MeOH) of the mobile phase for 3,4-dihydroxyphenylacetic acid.
Linear or polynomial models that are used to describe the relation between the capacity factor and modifier ratios, are usually constructed in a hierarchical way. Similar to this procedure also in the nonlinear modeling process hierarchical models should be built and systematically stripped of superfluous model terms. Especially the generation of a large amount of, not closely related, models, should be avoided, to prevent data mining [lo] and chance correlations. As a starting point a relatively large model is chosen and this should be reduced by removal of non-significant parameters. Each new model contains a subset of the parameters of the overall model so that they form a set of hierarchical models. The inclusion of a new parameter will lower the sum of squared errors (SSE) and thus increase the modeling ability. The SSE gives an indication of the amount of variation that is explained by adding a parameter or term. To decide whether or not a certain parameter or set of parameters should be included in the model, and, if the model performs well, whether the expected prediction error of the model and the significance of parameters are the important validation criteria, is the purpose of ongoing investigation.
Heteroscedasticity, which means nonconstant variance [ill, is an important phenomenon that should be taken into account in modeling capacity factors, because of its influence on the model validation and the construction of (optimal) designs. Weijland [121 describes possible sources of error which cause uncertainty in the capacity factor k. Fluctuations in the flow-rate and the temperature are the main sources of variation in the estimation of large capacity factors. The variation in the measurements caused by these sources are relative errors, i.e. are related to the measured value of the capacity factors and increase with growing values of the capacity factors. Considerable errors occur at temperature fluctuations larger than O.OS’C [13]. The fluctuations in the flow-rate that appear in practice have not as large an effect as the fluctuations in temperature. A second source of error are inaccuracies in the measurement of the retention time t, and t,. Since the capacity factor is defined as k = (tR to>/to (to is the hold-up time), it follows that for small retention times, i.e. capacity factors, variations will be most pronounced owing to this cause. Other possible sources of variation are small variations in the mobile phase compositions, which influence the modifier content as well as the pH. Not all observations have the same influence on the estimation of model parameters. Observations with a large influence on the model are expected to give small residuals. Given a homogeneous error structure (uu~(E) = a*> it can be proven [14,15] that the resulting residuals may be non-homogeneous (uar(e) = (I - H>a*), in which I is the identity matrix and H is the Hat matrix containing the leverage values. Thus to detect variance heterogeneity, Cook and Weisberg [16] suggested the use of leverage corrected residuals, or the so-called studentized residuals. Using replicates would show the heterogeneous error structure immediately from the untransformed residuals. It was pointed out that capacity factors usually do not show a homogeneous error variance structure. The leverage corrected residuals (residuals
MM W.B. Hendriks et al. /Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
0.8-
231
a generalized version of WLS, generalized least squares, or apply a logarithmic transformation on the capacity factors.
.
0.60.40.20.0-
.
nmmlym.‘. .
I
-0.2,
.
.
.
. .
4. Logarithmic transformation
.
. . .
.
.
n I
Ij ,,,,,,,., ,,,,,,,,, ,.,,,,,,, l 0.01
0.1
1
10 Predicted k
Fig. 3. Leverage corrected residuals as a function of the predicted values of the capacity factors for 3,4-dihydroxyphenylacetic acid. Model constructed by NLS method.
ei divided by i(1) that are depicted in Fig. 3 as a function of predicted response values reveal a funnel-shaped structure, which implicates that the variance is not constant over the entire response range. However, if variance is non-constant, the quality of information about the response in regions where the variance is large is inferior to the quality of information in regions where the variance is small [11,191. Thus the heterogeneity of the variance is an influential factor in the modeling process. The most obvious method that can be used to correct for a heterogeneous variance structure is weighted least squares (WLS) [17]. One way to perform WLS is using replicate experiments to estimate the variance structure. The inverses of the, possibly smoothed, error variances are the weights to be used in weighted least squares. To get a reasonably accurate estimation of these variances the number of replicates should be large. In the case of high-performance liquid chromatography (HPLC) this would cost too much experimental effort. Fortunately there are ways to overcome this problem. In the case of modeling capacity factors it is more appropriate to use
For the optimization of mobile phases consisting only of organic modifiers, at constant pH, the capacity factors are usually logarithmically transformed [l&191. This logarithmic transformation of the capacity factor has two effects on the relation between the capacity factor and the modifier ratios: (1) it enables one to use a polynomial model and, (2) it transforms the heterogeneous variance into a homogeneous variance. When a logarithmic transformation is used in nonlinear models, the transformation offers no advantage with regard to the complexity of the model and thus the modeling process itself, but it changes the variance structure from heterogeneous to homogeneous. The resulting model then is “*y+P*[H+] In(k) = In Y + [H+l
(4)
The logarithmic transformation of the capacity factors is completed by a transformation of the original model with the natural logarithm to keep the original structure of the model. The (Y,/3 and y terms are the same as in the original model, and the hierarchical model selection process is performed in the same way. The logarithmic transformation is a transform both sides model [171. The transform both sides method is applied when the relation between predictor and response variables is assumed to be correct. The relation between these variables is preserved while the assumption is made that following transformation the residuals will have a constant variance. After transformation of the response and the model the data can be fit according to normal unweighted (nonlinear) least squares [17]. The logarithmic transformation is appropriate for a special kind of heteroscedasticity, namely a constant relative variance structure.
232
M.M. WB. Hendriks et al. /Chemometrics
and Intelligent L,aboratory Systems 25 (1994) 227-239
5. Generalized least squares For responses with a heterogeneous error structure a general model can be constructed [11,19]: Y =f(x,
B) + W(C1,Z, @)E
(5)
in which E is random, with a normal distribution with mean 0 and a standard deviation 1, and u is a scaling constant. f(x, /3) describes the functional relation between the regressors x and the dependent variable y, B being the set of parameters defining this relation. In the case of modeling capacity factors, y is equal to the capacity factor k, and x consists of the variables pH and the modifier content of the mobile phase. g(p., z, 8) is the variance function, where z is a set of independent variables, which may contain (part of) x, I_Lthe mean of the functional response value f(x, /3> for certain values of x, and 8 is a set of parameters for the variance function. The variance function may be known completely, when parameter values are known or partially unknown, but at least the overall shape of the function needs to be defined. When the variance function is partially unknown, the values of 8 have to be estimated. In the case where y is the capacity factor, it can be assumed that the variance function looks like g(ll,
z, @> =f(x,
Bje
(6)
This function is now defined as related to the theoretical model for the capacity factor, represented by f(n, /3X The corresponding errors are ~f(x, /?)‘E. If 8 = 0, the variance of y is homogeneous, while if 8 = 1 the variance is proportional to the values of y. In the latter case, a logarithmic transformation of the response y, before modeling, would be appropriate. To model this variance function both /3 and 8 need to be estimated, while for modeling the functional relation also /3 needs to be estimated. The estimation of the parameter values of the functional relationship and the estimation of the variance function can be implemented as an iterative process using the following algorithm [11,19]:
- First /? of f is estimated using unweighted nonlinear regression. - Subsequently 8 of g(p, z, 8) is estimated using nonlinear regression on the residuals that result from the first step. In the case of modeling capacity factors, 0 is estimated. - Then again /3 needs to be estimated using weighted nonlinear regression, for which the weights are given by w = l/g2(p, z, @) = l/f2e(.r, /3X The /I used to calculate these weights result from the previous iteration step. These three steps are usually repeated until convergence, or for a predefined number of times assuming that after this number of times convergence is reached. A discussion on the choice of the number of cycles in the iteration can be found in Carroll and Ruppert [17].
6. Experimental 6.1. Description of the data set and methodology All experiments and simulations described in the following section are based on data collected by Van de Venne [20]. In this data set capacity factors of carboxylic acids were collected at six different values of the pH (pH = 2-7) and five different values of the methanol content of the mobile phase (O-40%). The experiments were performed on a LiChrosorb RP-18 column. The data of the Van de Venne data set show the nonlinear behavior as described in the paragraph on model building. Most solutes that are used in this data set have their pK, values within the pH design space. In Fig. 4 a three-dimensional plot of the capacity factor versus the pH and the methanol content of the mobile phase is shown for 3,4-dihydroxyphenylacetic acid. Notice that the dependence of the capacity factor on the methanol content has an exponential form, while the dependence of the capacity factor on the pH has an S-shape or sigmoidal form. Fig. 4 was constructed by a seven-parameter model based on the model defined by Eqs. 2 and 3. Only significant parameters were used, which were selected on the basis of their asymptotic standard
M.M.WB. Hendriks et al./Chemometrics
and Intelligent Laboratory Systems 25 (1994) 227-239
k 26
233
formation of capacity factors before modeling is a standard procedure. 6.2. Simulation studies
Fig. 4. Response surface of the capacity factor of 3,4-dihydroxyphenylacetic acid versus the methanol content and the PH.
deviations which is standard output of the nonlinear regression procedure which is part of the SAS statistical software package [21]. The appropriate starting values for the nonlinear regression method were selected by visual inspection of the data as proposed by Ratkowsky [22]. A simulation study was performed, with the intention to investigate whether or not the logarithmic transformation and the generalized least squares method perform equally well in the nonlinear modeling of capacity factors and to compare them with the ordinary nonlinear least squares procedure. In these so-called real simulation experiments the data collected by Van de Venne were used as the basis for simulated data sets. In the experiment three different parameter estimation methods were compared, e.g., unweighted nonlinear least squares (NLS), generalized least squares (GLS) and nonlinear least squares after logarithmic transformation (LOG) of the capacity factors and the model. A more complete approach in comparing a weighted regression procedure with a transformation procedure would for example be to compare generalized least squares with a procedure in which an appropriate Box-Gox type transformation was estimated on the given data set and applied to the response as well as to the model [23]. The decision to choose for a logarithmic transformation originates from the fact that in mobile phase optimization the logarithmic trans-
For the simulation studies parameter values were chosen on the outcome of fitting the nonlinear model (Eqs. 2 and 3) to the untransformed capacity factor values belonging to a certain solute (mandelic acid) of the Van de Venne data set. Again the nonlinear regression procedure which is part of the SAS software package was used to fit the data to the model and only significant parameter values were selected, resulting in a seven-parameter model: eao+al *Q*eYo+Yl*~+ePo+BI*~+P2+9'*[H+] k=
eYo+Yl*+
+[H+] (7)
In this case the data set was used to construct simulated data sets that are based on models with realistic parameter values (a, /3, r>, but this is no strict necessity as also more or less arbitrarily chosen parameter values might have been used. At a certain value of the capacity factor, y = f
M.M.WB. Hendriks et al. /Chemometrics
234
and Intelligent Laboratory Systems 25 (1994) 227-239
and test sets, containing 30 and 20 equally spread design points, respectively, were generated using the same seven-parameter model (7) for each simulated data set. The only difference between the four simulation studies is the choice of the value of 8, which determines the degree of heteroscedasticity. For each value of 8 (0.5, 0.75, 1.0 and 1.29, 100 training sets and test sets were generated.
For each simulation (generation of one training set and one test set) the mean squared error of prediction, MSEP of the test set, was calculated. The predicted response values were calculated on the basis of a model that was estimated on the training set. Also the mean of the prediction error sum of squares (MPRESS) of the training set was calculated. The MPRESS is calculated on the ground of leave-one-out predictions, were
a)
.
0.2 -
n
0.0 -
n nnn
n n
.
I’m
n
mm
I
.
n n
-0.2 -
n n
.
-0.4 -
. n
.
I
-0.6 -
Predicted k
o’4] (d) !j
”
n
u, n
0
n
+
b:
nf
.
0.3-
!!
0.5-
0.2-
n
0.1-
0.0 -
mm n
nn
.
n
n
n
n
n
B
.
o.o-
m
n
n
n
-0.5 .
-0.1-0.2 -
-1.0 -
. srn
n
. n
n
. n
n
n
-0.3 -
m
Fig. 5. (a) Leverage corrected residuals plotted again!3t t he predicted values of capacity factor after using a logarithmically transformed model and capacity factors. The residuals are based on capacity factor values. (b) Leverage corrected residuals versus predicted capacity factors for the GLS procedure. The residuals are based on capacity factor values. (c) Leverage corrected residuals plotted against the predicted values of the capacity factor after using a logarithmically transformed model and capacity factors. The residuals are based on logarithmically transformed capacity factors. Cd) Weighted leverage corrected residuals versus capacity factor values for the GLS procedure. Weights are an outcome of the modeling procedure.
M.M. W.B. Hendriks et al. /Chemometrics
and intelligent Laboratory Systems 25 (1994) 227-239
the predicted value at observation i is calculated on a model that was estimated without the use of observation i. The MSEP and the MPRESS were corrected for the assumed (i.e. simulated) variance structure: each squared error (difference between experimental and predicted response value) is corrected by a weighting factor l/f(x, /3j2’, where /3 and 19 are given by the input values for the simulated data sets. The weighting of the individual components of the MSEP and the MPRESS is necessary; an error that goes with a large value of the capacity factor must not be as influential in these criteria as an error of the same size that belongs to a small capacity factor value. The Van de Venne data set was also used to evaluate the performance of the three modeling methods on real data. MPRESS values were calculated on all the observations, and MSEP values by splitting the data set into a training set, containing 18 observations, and a test set, containing the remaining 12 observations. MPRESS values in these calculations were corrected by weighting them with the weights found in the procedures (NLS squared errors have weight 1, and LOG squared errors have weights l/f(x, /3j2>. 7. Results The seven-parameter model (71, without a transformation of the capacity factor or the model, was used first for the modeling of capacity factors of solutes of the Van de Venne data, by an ordinary nonlinear least squares procedure. As an example of the result of such a procedure, the leverage corrected residuals for 3,4-dihydroxyphenylacetic acid, are plotted versus the predicted response values and are depicted in Fig. 3. Clearly a funnel-shaped structure can be seen, indicating a heterogeneous variance structure. After a logarithmic transformation of the response and the model, the variance of residuals calculated on the differences between predicted and experimental values of the transformed responses is expected to become homogeneous, which can be observed through the absence of a funnel shape in the residual plots. On the other
235
hand the residuals can also be calculated on the differences between the predicted and experimental values of the untransformed responses. In this case it is expected that the funnel shape in the residual plot will remain or become more prominent. Thus as with generalized least squares the funnel shape in these residuals indicates that the method takes account of the existing heterogeneous variance structure of the data. The plot of the residuals versus the response, obtained by the ordinary nonlinear least squares procedure, may show a less distinct heteroscedastic structure than the corresponding plot for the residuals of a model constructed by the GLS or logarithmic method when residuals are based on the untransformed response values. The reason is that an ordinary nonlinear least squares method will try to force the model through the higher capacity factor values as much as through the lower values. In Figs. 5a and 5b the leverage corrected residuals versus the capacity factors of 3,4-dihydroxyphenylacetic acid are plotted for the two methods that do take the heteroscedastic error structure into account. In both figures the heteroscedastic error structure can be observed. If after the logarithmic transformation of both the capacity factor and the model, the transformed response, i.e. the In(k) value, instead of the capacity factor, is used to calculate the residuals, the funnel shape is expected to disappear. Thus the residuals for the transformed responses are expected to show a homogeneous variance structure, assuming a constant relative error. This is easily illustrated by examining Eq. 5, assuming that the variance is proportional to the response (0 = l), since then a logarithmic transformation is the most appropriate. From this equation can be derived that a logarithmic transformation will result in a constant error. The leverage corrected residuals of the transformed response that are the result of nonlinear regression using the transformed model and response are depicted in Fig. 5c. The funnel shape has disappeared, but it seems that, in contrast with the expectations, the heterogeneous error structure was overcorrected for this solute, as the funnel shape appears to be more or less reversed. In Fig. 5b the leverage corrected residuals of
,236
M.M. WB. Hendriks et al. / Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
Table 1 Predictive performance criteria for modeling procedures from simulated data
9 = 0.50 e = 0.75 e = 1.00 0 = 1.25
MSEP,,
MS=',,,
MSEP,,
MPRESS,,s
MPRESS,,,
MPRESS,,
0.00318 0.00354 0.00396 0.00709
0.00293 0.00296 0.00303 0.00298
0.00285 0.00297 0.00309 0.00297
0.00375 0.00463 0.00611 0.01539
0.00387 0.00342 0.00357 0.00330
0.00342 0.00347 0.00363 0.00332
the generalized
nonlinear
regression
procedure
were depicted.
If this is compared with the residuals belonging to the unweighted procedure (Fig. 3), the great similarity is apparent. The GLS method takes the heteroscedasticity into account, and thus the residual plot shows a heterogeneous error structure. The weighted residuals for the same method, using the weights that were estimated in the GLS procedure, are plotted in Fig. 5d. This figure shows a random distribution of the error variance over the whole response range. This means it is independent of the values of the capacity factors, as was expected. The results of the simulation studies for the evaluation of the performance of the three modeling methods can be found in Table 1. Of first interest was if the two methods that take account of the heterogeneous variance structure perform better than the ordinary least squares procedure. Of further interest was whether or not the error structure, i.e. the value of 8, would affect the performance of the three different methods. All results should be compared to U* = 0.0025, which was used to generate the simulated data, and can be used as a reference value. The MSEP and MPRESS values for the GLS and the logarithmic transformation methods for 8 = 0.5 slightly differ from the untransformed
nonlinear least squares method. For higher values of 0 these differences become more pronounced. In conclusion, a trend could be detected in MSEP and MPRESS going from small values of 0 to larger values, which shows that the larger
the
heteroscedasticity
the
less ordinary
nonlinear least squares is able to fit the data and predict accurately.
No difference could be detected between the GLS and logarithmic transformation method. This indicates that given the proposed variance structure (f(x, /3)‘) and CT*= 0.05, no advantage is obtained by using GLS instead of using a logarithmic transformation, not even when 8 differs much from 1, given the conditions used in these simulation experiments. An additional observation that can be made from the contents of Table 1 is the good performance of MPRESS as a predictor for the MSEP, in absolute values as well as in tendency. The MPRESS values for each of the methods and the corresponding models, estimated on the real data, can be found in Table 2. The MPRESS values for the GLS method are weighted, the weights being the result of the GLS procedure. The MPRESS values for the LOG procedure are the sum of squared differences between predicted and original In(k) values. It is open to
Table 2 MPRESS values for different modeling procedures
Mandelic acid 3,4-Dihydroxyphenylacetic acid 3,5-Dihydroxybenzoic acid Vanillilmandelic acid 4-Hydroxymandelic acid
NLS
GLS
e
LOG
0.4214 0.1147 0.9033 0.0729 0.0808
0.0137 0.0328 0.0880 0.0956 0.0652
1.08 0.47 0.69 0.62 0.45
0.0120 0.1658 0.1038 0.0709 0.0412
M.M.U?B.
discussion whether these methods can be compared to each other and the standard nonlinear least squares procedure, because of the difference in the weighting of residuals. The expectation that in cases where 0 deviates much from 1, the GLS method performs better than the method that consists of the logarithmic transformation of capacity factors and the model, is not proven. It is common knowledge that capacity factors approximately between 2 and 10 have a constant relative variance structure. Depending on the range of values of capacity factors in the data set, a 8 equal to 1 or a 8 that deviates from 1 will be found. The application of cross-validation techniques
residuals 0.3
1
8. a 5
. ..
0.1
.
.
.
. .
.
fl
-0.1
.
.
.
0.2
. I
.
.
. .
.
:.
.
00
on the real data is not straightforward, because of differences in variance scale after transformations or weighting and the necessity of weighting of the individual squared errors. It is difficult to draw firm conclusions from the results as presented in Table 2, also because the data set containing real observations is probably too small to rely completely on the results of the crossvalidation procedure. The value of 8 has a great influence on the shape and the size of the variance structure. Typical residual plots resulting from generalized least squares procedures for different predefined values of 8, thus in fact weighted least squares procedures, can be found in Fig. 6. In this figure
residuals 0.4
.
I
0.2
e ~-0.75
. . .
.
. . .
.
-0.2
.
. 0.10
0.01
1.00
.
..
g . . .
. .
-0.3
1,:
lm
.
. .
.
0.0
.
.
-0.2 I
237
Hendriks et al. /Chemometrics and htelligent Laboratory Systems 25 (1994) 227-239
.
_0.41[ 10.00
lW.W
0.01
0.10
Pmdbctedk
1.00
10.00
100.00
Predicted k
fSSidlMlS
residuals
1.2 I
0.7 . 0.5
8
e=i
I
.
0.8
:
a=125
. l=
l =% .
.. -0.3
. .
.
.
5 . I
+’
.. .
-0.5\ 0.01
0.10
1.00 Prediiedk
10.w
1OO.w
0.10
1.00
10.00
Pmdictedk
Fig. 6. Typical residual plots resulting from weighted least squares procedures for different predefined values of 0.
lw.w
238
M.M. W.B. Hendriks et al. / Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
the weighted residuals of one compound (3,4-dihydroxyphenylacetic acid) is plotted against the predicted response values. The generalized least squares procedure found 0 = 0.46. The figure shows plots of the weighted residuals after applying a WIS procedure under the assumption of known variance structures for which 0 is 0.5,0.75, 1 and 1.25 respectively. In Fig. 3, the residuals after applying the NLS method were depicted, which corresponds to 0 = 0, the residual plot has a clear funnel shape. The plot has no structure for 0 is 0.5, which deviates not too much from 0.46. For values of 8 equal to or greater than 0.75 the residual plots show a structure which resembles more clearly a reversed funnel shape, the greater the difference from 8 = 0.46. In all the experiments the nonlinear models have to be correct. If not they will have a great influence on the estimation of 8, which will then be used as an extra modeling term.
8. Discussion The modeling procedures that take account of the heteroscedasticity, GLS and nonlinear least squares after logarithmic transformation, perform almost equally well, but in general perform better than the ordinary nonlinear least squares procedure. In this specific case the performance of the logarithmic transformation does not become much worse when the variance function deviates from the variance function that demands for a logarithmic transformation (0 = 11, as was shown in the simulation experiment. Because of the somewhat less elaborate implementation of the logarithmic transformation procedure, one could choose for this method. There is, however, an argument in favor for choosing the generalized least squares procedure, which is its flexibility. In any way it is important to keep in mind that not taking account of the heteroscedasticity results in worse predictions. Instead of using a power of the mean type variance function other kinds of functions may be considered. The error for low capacity factors (smaller than 1) can probably be considered constant. The error in capacity factors between 1 and
10 may be considered more or less proportional to the response values, while above a value of 10 it again may deviate. This complex variance structure is an argument to use a more elaborate kind of variance function in the GLS procedure. References [ll P.J. Schoenmakers, S. van Molle, C.M.G. Hayes and L.G.M. Uunk, Effects of pH in reversed-phase liquid chromatography, Anaiytica Chimica Acta, 250 (1991) l19. 121R.M. Lopes Marques and P.J. Schoenmakers, Modelling retention in reversed-phase liquid chromatography as a function of pH and solvent composition, Journal of Chromatography, 592 (1992) 157-182. [31 R.M. Lopes Marques, P.J. Schoenmakers, C.B. Lucasius and L. Buydens, Modelling chromatographic behaviour as a function of pH and solvent composition in RPLC, Chromatographia, 36 (1993) 83-95. [41 Cs. HorvLth, W. Melander and I. Molnar, Combined optimization of mobile phase pH and organic modifier content in the separation of some aromatic acids by reversed-phase high-performance liquid chromatography, Analytical Chemistry, 49 (1977) 142-154. [Sl B. Bourguignon, F. Marcenac, H.R. Keller, P.F. de Aguiar and D.L. Massart, Simultaneous optimization of pH and organic modifier content of the mobile phase for the separation of chlorophenols using a Doehlert design, Journal of Chromatography, 628 (1993) 171-189. El P.F. de Aguiar, B. Bourguignon, M.S. Khots, W. Pennincks and D.L. Massart, The use of logistic transformation in HPLC optimization, Quimica Analitica, 12 (1993) 177-182. [71 P.R. Haddad, A.C.J.H. Drouen, H.A.H. Billiet and L. de Gaian, Combined optimization of mobile phase pH and organic modifier content in the separation of some aromatic acids by reversed-phase high-performance liquid chromatography, Journal of Chromatography, 282 (1983) 71-81. Bl C.A.A. Duineveld, C.H.P. Bruins, A.K. Smilde, G.K. Bolhuis, K. Zuurman and D.A. Doornbos, Multicriteria steepest ascent, Chemometrics and Intelligent Laboratory Systems, 25 (1994) 183-201. 191A. Leitold and Gy. Vigh, pK* Values of phosphate buffer components in methanol-water reversed-phase eluents, Journal of Chromatography, 257 (1983) 384-386. m M.C. Lowel, Data mining, The Review of Economics and Statistics, 65 (1983) 1-12. [ill M. Davidian and P.D. Haaland, Regression and calibration with nonconstant error variance, Chemometrics and Intelligent Laboratory Systems, 9 (1990) 231-248. 1121J.W. Weijland, Strategies for Mobile Phase Optimization in Chromatography; a Chemometrical Approach, Thesis, Groningen, The Netherlands, 1986. [131 E. Grushka and I. Zamir, Precision in HPLC, in P.R.
M.M. W.B. Hendriks et al. / Chemometrics and Intelligent Laboratory Systems 25 (1994) 227-239
Brown and R.A. Hartwick, (Editors), High Performance Liquid Chromatography, Wiley, New York, 1988. [14] A.C. Atkinson, Plots, Transformations and Regression. An Introduction to Graphical Methods of Diagnostic Regression Analysts, Oxford University Press, &ford, 1985. [15] G.A.F. Seber and C.J. Wild, Nonlinear Regression, Wiley, New York, 1989. [16] R.D. Cook and S. Weisberg, Residuals and Influence in Regression, Chapman and Hall, New York, 1982. [17] R.J. Carroll and D. Ruppert, Transformation and Weighting in Regression, Chapman and Hall, New York, 1988. 1181 P.M.J. Coenegracht, N. van Tuyen, H.J. Metting and P.J.M. Coenegracht-Lamers, Application of a mixture design technique to the simultaneous optimization of analysis time and resolution in reversed-phase ion-pair
liquid chromatography
239
using a minimal resolution plot,
Journal of Chromatography, 389 (1987) 351-367. [19] S.T. Balke, Quantitative Column Liquid Chromatography; a Survey of Chemometric Methods, Elsevier, Amsterdam,
1984. [20] J.L.M. van de Venne, Nonpolar Chemically Bonded Stationary Phases in Liquid Chromatography, Thesis, Wibro, Helmond, 1979. [21] SAS Software Release 6.08, SAS Institute Inc., Cary, NC, USA. [22] D.A. Ratkowsky, Nonlinear Regression Modeling: a Unified Practical Approach, Dekker, New York, 1983. [23] G.E.P. Box and D.R. Cox, An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26 (1964) 211-246.