Developing models for infinite dilution activity coefficients using factor analysis methods




Analytica Chimica Acta, 277 (1993) 223-238


Elsevier Science Publishers B.V., Amsterdam

Russell B. Poe and Sarah C. Rutan
Department of Chemistry, Virginia Commonwealth University, Richmond, VA 23284-2006 (USA)

Mitchell J. Hait and Charles A. Eckert
School of Chemical Engineering and Chemistry, Specialty Separations Center, Georgia Institute of Technology, Atlanta, GA 30332 (USA)

Peter W. Carr
Department of Chemistry, University of Minnesota, Minneapolis, MN 55455 (USA)

(Received 3rd July 1992)

Abstract

This paper illustrates a novel use of factor analysis to characterize models for infinite dilution activity coefficients. Different forms of a modified cohesive energy density (MOSCED) equation were used to create synthetic data matrices to be analyzed by factor analysis. The abstract factors from these matrices were used to target test the different model parameters. By using the synthetic data, the factors contributing to the different forms of the MOSCED equation that were most susceptible to noise were identified. Factor analysis and target testing were then used to characterize a database of experimental infinite dilution activity coefficients for 29 solvents and 31 solutes. The factor analysis results for one form of the MOSCED equation were used to evaluate different physicochemical scales for dispersion, dipolarity and hydrogen bonding interactions.

Keywords: Principal component analysis; Database; Factor analysis; Infinite dilution activity coefficients

Correspondence to: S.C. Rutan, Department of Chemistry, Virginia Commonwealth University, Richmond, VA 23284-2006 (USA).

0003-2670/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

In recent years, there has been a great deal of attention given to measuring and predicting infinite dilution activity coefficients. This fundamental area of chemistry is extremely important in developing new production processes and refining existing ones in the chemical industry [1]. Infinite dilution activity coefficients can be used in predicting the results of liquid-liquid extractions [2], chromatographic retention times [3], and multicomponent vapor-liquid equilibria for completely miscible systems [4]. However, accurately measuring infinite dilution activity coefficients is an extremely difficult and tedious task. There are many methods for measuring infinite dilution activity coefficients, including head space gas chromatography [5], ebulliometry [6] and dynamic gas chromatography [7]; each method has its own advantages and limitations. Most of these methods suffer from several potential errors, and it is difficult to obtain reliable results without very painstaking experimental techniques. It is neither practical nor possible to measure infinite dilution activity coefficients for all possible combinations of solvents and solutes; therefore, it would be advantageous to be able to accurately predict infinite dilution activity coefficients from pure component properties.

There are several methods available for predicting infinite dilution activity coefficients (γ∞), the leading methods being the UNIFAC (UNIQUAC functional group activity coefficient) [8,9] and MOSCED (modified separation of cohesive energy density) [9-11] models. UNIFAC is a method based on summing group contributions of functional groups present in the solute and solvent, and MOSCED is an extension of regular solution theory where the mixture properties are determined from the different physical interactions: dispersion, induction, dipolar interactions, and hydrogen bonding. One of the advantages of MOSCED is that the predictions of infinite dilution activity coefficients are made based only on pure component properties. That is, binary parameters specific to the system are not needed. However, there is a need to determine appropriate scales to describe these pure component properties for representing dispersion, induction, dipolar, and hydrogen bonding interactions.

Over the past few years, factor analysis methods have been widely used to characterize data for a variety of linear systems. Matrices consisting of data as a function of chromatographic retention time and UV-visible absorption wavelength have been successfully analyzed using factor analysis; however, data that may be modeled by more complicated equations are normally not studied using factor analysis. (In this study, the terms factor analysis, eigenanalysis and principal component analysis (PCA) will be used interchangeably.) In our work, we illustrate the use of factor analysis to characterize a database of infinite dilution activity coefficients.
In addition, using PCA and target testing, we are able to evaluate different models for predicting infinite dilution activity coefficients. The PCA procedure yields a set of principal components which describe the major sources of variation in a data set. Target testing allows potentially significant pure component properties to be evaluated, and allows comparisons of different model formulations. We have generated synthetic data based on three different formulations of a MOSCED type model. Analysis of the synthetic data enables us to determine the number of linearly independent factors that are required to describe the different types of MOSCED models. The relative importance of each of the terms, and their sensitivity to the presence of experimental errors, is also studied.

THEORY

The List of symbols contains a summary of the notation used in this paper and a listing of all parameters used. The database used in this work consisted of a data matrix of infinite dilution activity coefficients for 29 solvents and 31 solutes, along with a set of physical properties for these solvents and solutes. A more detailed description of the database is presented in the Experimental section of the paper.

Data pretreatment

The database of infinite dilution activity coefficients was examined and any row or column that contained more than 30% missing values was not included in the matrix to be analyzed. The data matrix that remained had 29 rows (solvents) and 31 columns (solutes). Any missing infinite dilution activity coefficients were then replaced with estimated values obtained from UNIFAC or MOSCED to complete the matrix. These values were thought to be better approximations of the missing values than the mean of the column or row and should contribute some chemical information. However, the average error of these estimated infinite dilution activity coefficients is higher than that of the experimentally determined values, and it is certainly desirable to replace these initial estimates with values that are more consistent with the experimentally determined data.

Missing data routine
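As a preliminary to this routine, the 30% screening rule from the data pretreatment step can be sketched in a few lines. This is only a minimal Python illustration, not the authors' program. The function name `screen_matrix`, the use of `None` for a missing entry, and the choice to evaluate rows and columns together on the unscreened matrix are all assumptions made for the example.

```python
def screen_matrix(matrix, max_missing=0.30):
    """Drop rows and columns whose fraction of missing entries exceeds max_missing.

    Assumption: missing entries are None, and the row and column fractions are
    both computed on the original (unscreened) matrix in a single pass.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    keep_rows = [i for i in range(n_rows)
                 if sum(v is None for v in matrix[i]) / n_cols <= max_missing]
    keep_cols = [j for j in range(n_cols)
                 if sum(matrix[i][j] is None for i in range(n_rows)) / n_rows <= max_missing]
    reduced = [[matrix[i][j] for j in keep_cols] for i in keep_rows]
    return reduced, keep_rows, keep_cols


# Hypothetical 4 x 4 example: row 2 and column 2 are each 75% missing.
demo = [
    [0.1, 0.2, None, 0.4],
    [0.5, 0.6, None, 0.8],
    [None, None, None, 1.2],
    [1.3, 1.4, 1.5, 1.6],
]
reduced, rows, cols = screen_matrix(demo)
```

In this toy case the third row and third column are removed and the remaining 3 x 3 block has no gaps. In the real database, gaps would remain after screening and would then be filled with UNIFAC or MOSCED estimates.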

After the original missing values are estimated by UNIFAC or MOSCED, the values are improved using a factor analysis approach for compensating for missing values [12]. The logarithms of the activity coefficients are used to form a matrix D. The matrix D can be expressed in terms of its principal components

$$D = U S V^{T} \tag{1}$$

where U is the normalized row-factor matrix, S is a diagonal matrix composed of the square roots of the eigenvalues and V^T is the normalized column-factor space. Next, the number of primary eigenvalues (rank) is estimated using the IND function [13], the residual standard deviation (RSD) [14] and the F-test [15]. The primary eigenvalues are assumed to describe the variance resulting from physical and chemical interactions contained within the data matrix D, and the secondary eigenvalues are assumed to describe the variance of the noise. The IND function, developed by Malinowski [13], is an empirical calculation using the eigenvalues that has been shown to reach a minimum value when the correct number of factors is chosen. The RSD and the F-test methods are based on the fact that the secondary eigenvalues should reflect the variance of the noise affecting the measurements. Once the rank of the data matrix is determined, the data matrix is regenerated using only the primary factors

$$\hat{D} = \bar{U} \bar{S} \bar{V}^{T} \tag{2}$$

where the bars denote the matrices truncated to the primary factors. The reduced matrix D̂ contains less noise than the original matrix D. The missing values regenerated from the reduced matrix should be more consistent with the experimental values than the original MOSCED or UNIFAC values. These estimated values are then substituted for the UNIFAC and MOSCED values in the original matrix, giving a new matrix D₂. Principal component analysis (PCA) is now performed on this matrix, using the same number of primary eigenvectors as in the first iteration. The missing values are again replaced with the reduced factor regenerated values. This process proceeds until the sum of squares of the deviations in the adjusted values in succeeding iterations decreases to a value below a preset tolerance.

The final matrix contains the original experimental data and the iteratively modified MOSCED and UNIFAC values. Figure 1 is a flow chart for this missing data routine. The solid triangles represent experimental values, the circles represent the missing values that were estimated from UNIFAC or MOSCED, and the hollow triangles are estimates for the missing values obtained after repeated iterations.
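The rank-estimation step can be illustrated with a short calculation. The paper cites the IND function and the RSD without reproducing their formulas, so the sketch below assumes the commonly quoted Malinowski definitions: for r rows, c columns (r >= c) and eigenvalues sorted in decreasing order, the real error for n factors is RE(n) = sqrt(sum of the c - n secondary eigenvalues / (r (c - n))) and IND(n) = RE(n) / (c - n)^2. The function names and the eigenvalues are hypothetical.

```python
import math

def real_error(eigenvalues, n, r):
    """RE(n): residual standard deviation left after keeping n primary factors."""
    c = len(eigenvalues)
    return math.sqrt(sum(eigenvalues[n:]) / (r * (c - n)))

def indicator(eigenvalues, n, r):
    """IND(n) = RE(n) / (c - n)^2; its minimum suggests the number of factors."""
    c = len(eigenvalues)
    return real_error(eigenvalues, n, r) / (c - n) ** 2

def estimate_rank(eigenvalues, r):
    """Return the n (1 <= n < c) that minimizes IND, plus all IND values."""
    c = len(eigenvalues)
    ind = {n: indicator(eigenvalues, n, r) for n in range(1, c)}
    return min(ind, key=ind.get), ind

# Hypothetical eigenvalues: two large (signal) and three small (noise).
eigs = [100.0, 10.0, 0.04, 0.03, 0.02]
rank, ind_values = estimate_rank(eigs, r=10)
```

For these made-up eigenvalues the IND values fall to a minimum at two factors and rise again, which is the behaviour the rank test relies on.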

[Figure 1: flow chart. Box labels: Data; Data and Semiempirical Values; D = USV^T; Replace Semiempirical Values (n iterations); Data and Missing Data Generated Values.]

Fig. 1. Flow chart for missing data factor analysis program. (▲) Experimental value, (○) missing values that were estimated from UNIFAC or MOSCED, and (△) estimate for the missing value obtained after repeated iterations.
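The loop in Fig. 1 can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the authors' Turbo Pascal program: the truncated decomposition is obtained by power iteration with deflation rather than a full eigenanalysis, the power iteration starts from a uniform vector (which fails only if that vector is orthogonal to the wanted factor), and the names `leading_triplet`, `reduced_matrix` and `impute_missing` are hypothetical.

```python
import math

def leading_triplet(M, iters=200):
    """Largest singular value s and unit vectors u, v of M by power iteration."""
    n_rows, n_cols = len(M), len(M[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(M[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(M[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    s = math.sqrt(sum(x * x for x in u))
    if s > 0.0:
        u = [x / s for x in u]
    return s, u, v

def reduced_matrix(D, k):
    """Regenerate D from its first k factors (Eqn. 2), one factor at a time."""
    n_rows, n_cols = len(D), len(D[0])
    resid = [row[:] for row in D]
    D_hat = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(k):
        s, u, v = leading_triplet(resid)
        for i in range(n_rows):
            for j in range(n_cols):
                term = s * u[i] * v[j]
                D_hat[i][j] += term
                resid[i][j] -= term
    return D_hat

def impute_missing(D, missing, k, tol=1e-12, max_iter=500):
    """Replace the entries listed in `missing` until they stop changing."""
    D = [row[:] for row in D]
    for _ in range(max_iter):
        D_hat = reduced_matrix(D, k)
        change = sum((D_hat[i][j] - D[i][j]) ** 2 for i, j in missing)
        for i, j in missing:
            D[i][j] = D_hat[i][j]  # experimental entries are never touched
        if change < tol:
            break
    return D

# Hypothetical rank-1 matrix of ln(gamma) values; entry (0, 0) should be 1.0
# but starts from a deliberately poor "semiempirical" estimate of 3.0.
start = [[3.0, 2.0, 4.0], [2.0, 4.0, 8.0], [3.0, 6.0, 12.0]]
filled = impute_missing(start, missing=[(0, 0)], k=1)
```

In this toy case the estimated entry is pulled toward the value implied by the experimental entries, while the experimental entries themselves are left unchanged.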




Principal component analysis is performed one final time on the complete data matrix, including the values returned by the missing data routine. The normalized row-factors and column-factors are used to construct cluster diagrams and to correlate with the solute and solvent properties.

TABLE 1
Linear factors for alkane model

Solute                      Solvent
V₂                          δ₁²
V₂δ₂                        δ₁
V₂δ₂²/RT + 0.3 ln V₂        Offset₁
Offset₂                     ln V₁

Simulated alkane data

Before using principal component analysis on the data matrix of experimentally determined infinite dilution activity coefficients, several synthetic data matrices were constructed to evaluate the use of PCA in studying this type of data, since most of the studies reported on PCA have been for systems with relatively simple linear models [16]. The model used to construct the first synthetic data matrix follows regular solution theory [17], which is expected to hold for non-polar solutes dissolved in non-polar solvents, and is expressed as follows

$$\ln \gamma_2^{\infty} = \frac{V_2}{RT}(\delta_1 - \delta_2)^2 + 0.3 \ln(V_2/V_1) \tag{3}$$

where V is the volume, δ is the Hildebrand solubility parameter, R is the gas constant and T is the absolute temperature. The subscript 1 denotes a solvent parameter and the subscript 2 denotes a solute parameter. The first term comes from regular solution theory and the second term is an approximation of a Flory-Huggins type term to account for differences in molecular size [10]. The 0.3 coefficient was derived from fits of an equation with this form to data for alkane solutes and solvents with moderate variation in size [18]. This format was found to be much better at describing the correct trends and signs in the ln γ∞ data than the original Flory-Huggins term used for MOSCED, 1 - (V₂/V₁)^aa + ln (V₂/V₁)^aa [10].

For PCA, it is assumed that the data in a matrix D can be represented in a bilinear form

$$D = \sum_{k=1}^{n} s_k u_k v_k \tag{4}$$

where n is the number of linear factors contributing to the variance in the data, the u_k are solvent dependent factors, the v_k are solute dependent factors, and the s_k are scaling factors. To find the rank of the synthetic data, Eqn. 3 must be factored into a form consistent with Eqn. 4, as shown in the following equation

$$\ln \gamma_2^{\infty} = V_2\delta_1^2/RT - 2V_2\delta_2\delta_1/RT + V_2\delta_2^2/RT + 0.3 \ln V_2 - 0.3 \ln V_1 \tag{5}$$

Analysis of this equation indicates that there are four solvent dependent terms (u₁-u₄) and four solute dependent terms (v₁-v₄). Table 1 lists the four terms for the solutes and for the solvents. This analysis suggests, for example, that a target test (i.e., linear regression) of ln V₁ to the four solvent factors (u₁-u₄) should yield a good fit.

Simulated MOSCED data

In an attempt to more accurately model activity coefficients, a more versatile and comprehensive model must be used than that described above. The more complex model should consider non-polar, polar, and hydrogen bonding interactions; in this case a simplified form of the original MOSCED model was used

$$\ln \gamma_2^{\infty} = \frac{V_2}{RT}\left[(\lambda_1 - \lambda_2)^2 + q_1^2 q_2^2 (\tau_1 - \tau_2)^2 + (\alpha_1 - \alpha_2)(\beta_1 - \beta_2)\right] + 0.3 \ln(V_2/V_1) \tag{6}$$

where λ is a dispersion parameter, q is an induction parameter, τ is a dipolarity parameter, α is a hydrogen bond acidity parameter, β is a hydrogen bond basicity parameter, and the remaining parameters are the same as for Eqn. 3. This equation is similar to the equation described by Thomas and Eckert [10]; however, the Flory-Huggins term was simplified based on fits of alkane solute and solvent data, as mentioned above, and the asymmetry correction terms were left out of the model. The asymmetry correction terms were included in the original MOSCED model to account for the fact that the γ∞ for solute A in solvent B is not normally the same as the γ∞ for solute B in solvent A, whereas the regular solution theory formalism predicts that the two are equal (except for differences in molar volume, which are usually small). Current modifications to the MOSCED model have defined separate scales for solute and solvent parameters that remove the need for these asymmetry correction factors [19]. When the MOSCED equation is factored and expressed in the form of Eqn. 4, there are 9 terms contributing to the variance of the data. Table 2 lists the terms for both the solvents and solutes.

TABLE 2
Linear factors for MOSCED model

Solute                              Solvent
V₂                                  λ₁² + α₁β₁
V₂λ₂                                λ₁
V₂q₂²                               q₁²τ₁²
V₂q₂²τ₂                             q₁²τ₁
V₂q₂²τ₂²                            q₁²
V₂α₂                                β₁
V₂β₂                                α₁
V₂(λ₂² + α₂β₂)/RT + 0.3 ln V₂       Offset₁
Offset₂                             ln V₁

Simulated modified MOSCED data

An alternative model for MOSCED was tested that did not use the q² parameters for modeling inductive interactions but instead used the polarizability parameter used for modeling dispersion interactions to describe the induced dipole interactions. This alternative model is

$$\ln \gamma_2^{\infty} = \frac{V_2}{RT}\left[(\lambda_1 - \lambda_2)^2 + (\lambda_1 - \lambda_2)(\tau_1 - \tau_2) + (\tau_1 - \tau_2)^2 + (\alpha_1 - \alpha_2)(\beta_1 - \beta_2)\right] + 0.3 \ln(V_2/V_1) \tag{7}$$

This model has a rank of seven; the terms are listed in Table 3.

TABLE 3
Linear factors for modified MOSCED model

Solute                                          Solvent
V₂                                              λ₁² + λ₁τ₁ + τ₁² + α₁β₁
V₂(2λ₂ + τ₂)                                    λ₁
V₂(λ₂² + λ₂τ₂ + τ₂² + α₂β₂)/RT + 0.3 ln V₂      Offset₁
V₂(λ₂ + 2τ₂)                                    τ₁
V₂β₂                                            α₁
V₂α₂                                            β₁
Offset₂                                         ln V₁

EXPERIMENTAL

Principal component analysis was used in the analysis of a matrix that contained 685 experimentally determined infinite dilution activity coefficients. The values for this matrix were obtained from a large database compiled by Hait et al. [19], which contained infinite dilution activity coefficients obtained using several different methods, each with a different level of experimental uncertainty and accuracy. Where no experimental values were available, the database contained several estimates from different versions of UNIFAC and MOSCED. The data matrix that was analyzed here contained γ∞ values for 29 solvents and 31 solutes; therefore, 899 values are required to complete the matrix to use principal component analysis. The final data matrix of infinite dilution activity coefficients contained 75% experimentally determined values, 20% of the values estimated by UNIFAC and 5% of the values estimated by MOSCED. A PCA program which predicts values for the missing entries (estimated initially by the UNIFAC or MOSCED values) was used to generate a complete data matrix for subsequent data analysis, as described in the Theory section (see Fig. 1). In addition to infinite dilution activity coefficients, the database contained various solvatochromic and physical parameters for the pure components [20-26].

Simulated data were created by generating a vector of random numbers for each parameter, over a physically realistic range of values. For all simulated models, data were generated for 29 solvents and 31 solutes, as these were the dimensions of the experimental ln γ∞ matrix. The ranges for the parameters used to simulate data according to the models given in Eqns. 3, 6, and 7 are listed in Table 4. These parameters were saved for target testing of the resulting synthetic data matrices.

TABLE 4
Ranges of synthetic parameters

Solute parameter    Range       Solvent parameter    Range
V₂                  50-120      V₁                   50-200
δ₂                  7-15        δ₁                   7-15
λ₂                  7-10        λ₁                   7-10
q₂                  0.9-1.0     q₁                   0.9-1.0
τ₂                  0-6         τ₁                   0-6
α₂                  0-10        α₁                   0-10
β₂                  0-10        β₁                   0-10

All computations were performed on a 16-MHz 80386 computer equipped with an 80387 coprocessor, and running under the MS-DOS operating system. The principal components analysis, missing data and target testing programs were written using Turbo Pascal, version 6.0 (Borland).
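The construction of the synthetic alkane matrix can be sketched as below. This is an illustration only: the paper does not state the random distribution, the units or the temperature, so uniform sampling, R = 1.987 cal/(mol K) and T = 298.15 K are assumptions, as are the function names. The sketch also checks numerically that the four products of solute and solvent terms from Table 1 (Eqn. 5) reproduce Eqn. 3.

```python
import math
import random

R = 1.987    # gas constant in cal/(mol K); the unit system is an assumption
T = 298.15   # temperature in K; an assumed value

def ln_gamma_alkane(V1, d1, V2, d2):
    """Eqn. 3: regular solution term plus the 0.3 ln(V2/V1) size term."""
    return V2 / (R * T) * (d1 - d2) ** 2 + 0.3 * math.log(V2 / V1)

def ln_gamma_factored(V1, d1, V2, d2):
    """Eqn. 5: the same quantity as four solute x solvent products (Table 1)."""
    solute = [V2, V2 * d2, V2 * d2 ** 2 / (R * T) + 0.3 * math.log(V2), 1.0]
    solvent = [d1 ** 2, d1, 1.0, math.log(V1)]
    scale = [1.0 / (R * T), -2.0 / (R * T), 1.0, -0.3]
    return sum(s * a * b for s, a, b in zip(scale, solute, solvent))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# (volume, solubility parameter) pairs drawn from the ranges in Table 4
solvents = [(rng.uniform(50, 200), rng.uniform(7, 15)) for _ in range(29)]
solutes = [(rng.uniform(50, 120), rng.uniform(7, 15)) for _ in range(31)]

# 29 x 31 matrix of simulated ln(gamma) values: rows are solvents, columns solutes
D = [[ln_gamma_alkane(V1, d1, V2, d2) for (V2, d2) in solutes]
     for (V1, d1) in solvents]

max_diff = max(abs(D[i][j] - ln_gamma_factored(*solvents[i], *solutes[j]))
               for i in range(29) for j in range(31))
```

Agreement between the two functions to within rounding error confirms only the algebra of the factorization. Whether the four factors remain distinguishable in practice is what the eigenanalysis of the matrix addresses.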

RESULTS AND DISCUSSION

Effect of normally distributed noise on synthetic data

Using the alkane model described by Eqn. 3, three matrices were generated with identical values used as input parameters; however, Gaussian distributed noise with a standard deviation of 0.05 or 0.20 ln γ∞ units was added to the activity coefficient values for two of the matrices. This resulted in three matrices of simulated ln γ∞ values with approximately 0, 5 and 20% error in γ∞. The three matrices were analyzed by PCA and the abstract factors saved for target testing. The terms in Table 1 were used as test vectors to probe the effects of normally distributed noise on the performance of PCA for this type of model.

After PCA of the data matrices containing 0 and 5% noise, the IND function, RSD and F-test indicated that there were 4 significant factors that contributed to the data; however, these tests showed only three significant factors for the matrix containing 20% noise. This indicated that some of the model variations were indistinguishable from the noise, or that the four factors were no longer linearly independent, within the variance of the noise. This result is important because it indicates that important chemical interactions in the real data may be obscured by moderate noise levels, and that to adequately understand these interactions, the noise contributions to the real data should be substantially less than 20%.

We explored several means for evaluating the results or quality of fit obtained from the target testing methodology. The SPOIL function is one method of determining the quality of fit [16]; however, we found this parameter not as useful as the normalized standard error (NSE), defined as

$$\mathrm{NSE} = \sqrt{\frac{\sum_i (x_i - \hat{x}_i)^2 / (n - p)}{\sum_i (x_i - \bar{x})^2 / (n - 1)}} \tag{8}$$

where the xᵢ are the individual elements (physical properties) contained within the test vector, x̄ is the mean property value for all solutes or solvents, the x̂ᵢ represent the individual estimates obtained for the solute or solvent physical properties through the least-squares fit of the abstract factors to the test vector (i.e., the target test results), p is the number of abstract factors used in the target testing and n is the number of solvents or solutes in the test vector. We found by using this approach that we could compare similar scales. However, this approach still does not take into account the relative success in fitting scales with widely different inherent precision, or in fitting to an offset. When this normalized standard error is large, we cannot determine if the proposed term is important in describing the variance in the data set. In fact, if a test vector is highly correlated with a constant value (an offset), we would expect the NSE to be fairly large. This is not to say that such terms do not contribute to the variations in the data, but it is more difficult to determine their validity by target testing.

When PCA was performed on the data matrix with no noise, all 8 test vectors in Table 1 showed an excellent quality of fit with the abstract factors, as evidenced by very small NSE values. However, when noise was added to the data, different test vectors showed different fit qualities when fit to the abstract factors. This is illustrated in Table 5, where all of the NSE values are listed for the two synthetic alkane matrices with added noise contributions. This indicates that target testing of certain terms of Eqn. 4 was more susceptible to the presence of noise than others. For example, among the solute terms, the V₂ and V₂δ₂ terms give a relatively poor fit quality, while for the solvent terms, the ln V₁ term is very sensitive to the presence of noise. The specific reasons for these discrepancies in fit quality are not clear.

TABLE 5
Normalized standard errors (%) for alkane model

Test vector                  Noise 5%     Noise 20%
Solute
V₂                           5.5          24
V₂δ₂                         4.2          18
V₂δ₂²/RT + 0.3 ln V₂         2.5          13
1₂ (offset)                  0.071 a      0.26 a
Solvent
δ₁²                          3.4          13
δ₁                           6.7          25
1₁ (offset)                  0.014 a      0.051 a
ln V₁                        34           130

a Standard error.

To determine if the fits for the 8 test vectors were significantly better than fits to a totally random vector, six additional random vectors were generated for each parameter (V₁, V₂, δ₁ and δ₂) over the same ranges that were used in the original simulation. These random vectors were used to construct six alternate test vectors for each linear term listed in Table 1, excluding the offsets. These alternative test vectors were fit to the abstract factors obtained from analysis of the synthetic data that contained 5 and 20% noise. The average normalized standard error and its 95% confidence interval were calculated for each of these test vectors. This information is shown in Table 6. The 95% confidence interval indicates the quality of fit that would be obtained if a real test factor could not be distinguished from a totally random parameter. Of the six factors evaluated in this manner, the ln V₁ term has a normalized standard error that is larger than the lower limit of the 95% confidence interval. Using this criterion, the ln V₁ term is indistinguishable from an inappropriate test vector for the data matrix containing 20% noise, and the corresponding term for the solute, the offset 1₂, is also probably indistinguishable from an inappropriate test vector. Note that the SPOIL function indicated that the ln V₁ term was a valid test vector, even though the random vector tests indicated that ln V₁ was not distinguishable from a totally random vector.

TABLE 6
Normalized standard errors (%) for random vector target tests

Factor                     Mean          Confidence interval a    Mean           Confidence interval a
                           (5% noise)    (5% noise)               (20% noise)    (20% noise)
V₂                         103           76-130                   128            87-169
V₂δ₂                       91            43-139                   109            67-151
V₂δ₂²/RT + 0.3 ln V₂       90            48-132                   98             65-131
δ₁²                        108           88-128                   102            81-123
δ₁                         104           83-125                   103            72-134
ln V₁                      97            85-109                   113            84-142

a 95% confidence interval for six random vectors.

Using the MOSCED equation (Eqn. 6) to simulate the logarithms of limiting activity coefficients, three matrices were constructed containing 0, 5 and 20% normally distributed noise, as was done for the alkane model described above. These three matrices were analyzed with PCA and the abstract factors saved for target testing. The terms in Table 2 were used as test vectors. After eigenanalysis of the three MOSCED matrices with 0, 5 and 20% noise, the IND function, RSD and the F-test consistently indicated 9, 7 and 6 significant factors, respectively, for the different noise levels. The NSE values for the test vectors are listed in Table 7. After inspection of the NSE values for the MOSCED data, it appeared that any term that included a q² term gave a poor fit when compared to the other test vectors from the same matrix. This indicates that development of an appropriate dipolarity scale within this particular model framework may be very difficult, since the τ and q parameters always occur as a product. In addition, the test vector ln V₁ has a relatively poor fit; this would indicate that identifying an appropriate V₁ scale may also be difficult.
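The target test and the NSE of Eqn. 8 can be sketched together. The code below is a minimal Python illustration under stated assumptions: the least-squares fit is done through the normal equations with Gaussian elimination (adequate for a handful of well-conditioned factors, though not necessarily what the original programs used), the result is multiplied by 100 to match the percent units of the tables, and the function names are hypothetical.

```python
import math

def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][c] * z[c] for c in range(i + 1, n))) / M[i][i]
    return z

def target_test(factors, x):
    """Least-squares estimate of test vector x from the abstract factors.

    `factors` has one row per solute (or solvent) and one column per factor.
    """
    n, p = len(x), len(factors[0])
    AtA = [[sum(factors[i][a] * factors[i][b] for i in range(n))
            for b in range(p)] for a in range(p)]
    Atx = [sum(factors[i][a] * x[i] for i in range(n)) for a in range(p)]
    coef = solve(AtA, Atx)
    return [sum(factors[i][a] * coef[a] for a in range(p)) for i in range(n)]

def nse_percent(x, x_hat, p):
    """Eqn. 8, expressed in percent."""
    n = len(x)
    mean = sum(x) / n
    fit_var = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat)) / (n - p)
    tot_var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return 100.0 * math.sqrt(fit_var / tot_var)

# Hypothetical example: two factors (a constant and a ramp) and a test vector
# that lies exactly in their span, so the fit is perfect and the NSE is zero.
factors = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
x = [1.0, 3.0, 5.0, 7.0, 9.0]
x_hat = target_test(factors, x)
nse_exact = nse_percent(x, x_hat, p=2)
```

A test vector unrelated to the factors would instead give an NSE near 100%, which is the behaviour summarized for the random vectors in Table 6.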

TABLE 7
Normalized standard errors (%) for MOSCED model

Test vector                          Noise 5%      Noise 20%
Solute
V₂                                   5.2           17
V₂λ₂                                 5.5           17
V₂q₂²                                17            29
V₂q₂²τ₂                              6.7           14
V₂q₂²τ₂²                             18            20
V₂α₂                                 1.8           6.0
V₂β₂                                 2.1           6.3
V₂(λ₂² + α₂β₂)/RT + 0.3 ln V₂        6.4           17
1₂ (offset)                          0.11 a        0.24 a
Solvent
λ₁² + α₁β₁                           8.5           16
λ₁                                   6.8           27
q₁²τ₁²                               26            27
q₁²τ₁                                12            15
q₁²                                  100           110
β₁                                   1.2           5.7
α₁                                   2.0           4.8
1₁ (offset)                          0.0064 a      0.026 a
ln V₁                                44            75

a Standard error.

The modified MOSCED model given by Eqn. 7 in the Theory section was also evaluated using 0, 5 and 20% noise added to the simulated activity coefficient values. The IND function, RSD and the F-test all indicated the presence of 7, 7 and 6 significant factors, respectively. The NSE values for the test vectors are listed in Table 8. In this case, the NSE values are much more consistent for these test vectors, indicating that it should be much easier to identify meaningful scales for the solvent and solute dipolarity for this type of model.

TABLE 8
Normalized standard errors (%) for modified MOSCED model

Test vector                                     Noise 5%      Noise 20%
Solute
V₂                                              3.5           16
V₂(2λ₂ + τ₂)                                    3.1           13
V₂(λ₂² + λ₂τ₂ + τ₂² + α₂β₂)/RT + 0.3 ln V₂      3.3           11
V₂(λ₂ + 2τ₂)                                    2.7           12
V₂β₂                                            1.7           6.0
V₂α₂                                            1.3           4.8
1₂ (offset)                                     0.086 a       0.23 a
V₂λ₂                                            4.0           15
V₂τ₂                                            3.7           14
V₂(λ₂ + τ₂)                                     2.7           12
Solvent
λ₁² + λ₁τ₁ + τ₁² + α₁β₁                         3.1           10
λ₁                                              11            36
1₁ (offset)                                     0.010 a       0.031 a
τ₁                                              3.1           11
α₁                                              2.4           7.0
β₁                                              1.6           5.9
ln V₁                                           24            86
2λ₁ + τ₁                                        7.3           23
λ₁ + 2τ₁                                        3.7           12
λ₁ + τ₁                                         5.1           16

a Standard error.
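The seven-term factorization behind the modified MOSCED model can be checked numerically. The sketch below evaluates Eqn. 7 directly and as a sum of seven products of solute and solvent terms, using random parameters drawn from the ranges in Table 4. The signs and scale factors were obtained here by expanding Eqn. 7 and are not quoted from the paper. The values of R and T, the uniform sampling and the function names are assumptions.

```python
import math
import random

R, T = 1.987, 298.15  # assumed unit system (cal/(mol K)) and temperature (K)

def ln_gamma_modified(solvent, solute):
    """Eqn. 7 evaluated directly."""
    V1, l1, t1, a1, b1 = solvent
    V2, l2, t2, a2, b2 = solute
    bracket = ((l1 - l2) ** 2 + (l1 - l2) * (t1 - t2)
               + (t1 - t2) ** 2 + (a1 - a2) * (b1 - b2))
    return V2 / (R * T) * bracket + 0.3 * math.log(V2 / V1)

def ln_gamma_seven_terms(solvent, solute):
    """Eqn. 7 as seven solute x solvent products (seven-term factorization)."""
    V1, l1, t1, a1, b1 = solvent
    V2, l2, t2, a2, b2 = solute
    rt = R * T
    solute_terms = [
        V2,
        V2 * (2 * l2 + t2),
        V2 * (l2 ** 2 + l2 * t2 + t2 ** 2 + a2 * b2) / rt + 0.3 * math.log(V2),
        V2 * (l2 + 2 * t2),
        V2 * b2,
        V2 * a2,
        1.0,
    ]
    solvent_terms = [l1 ** 2 + l1 * t1 + t1 ** 2 + a1 * b1,
                     l1, 1.0, t1, a1, b1, math.log(V1)]
    scale = [1 / rt, -1 / rt, 1.0, -1 / rt, -1 / rt, -1 / rt, -0.3]
    return sum(s * x * y for s, x, y in zip(scale, solute_terms, solvent_terms))

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def draw(v_max):
    """One parameter set (V, lambda, tau, alpha, beta) within the Table 4 ranges."""
    return (rng.uniform(50, v_max), rng.uniform(7, 10), rng.uniform(0, 6),
            rng.uniform(0, 10), rng.uniform(0, 10))

solvents = [draw(200) for _ in range(29)]
solutes = [draw(120) for _ in range(31)]
max_diff = max(abs(ln_gamma_modified(s1, s2) - ln_gamma_seven_terms(s1, s2))
               for s1 in solvents for s2 in solutes)
```

The check shows that seven bilinear terms are sufficient to reproduce Eqn. 7. That all seven are linearly independent, so that the rank is exactly seven, is what the eigenanalysis of the noise-free matrix indicates.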

Real data


PCA was used to analyze the 29 x 31 matrix containing 899 ln γ∞ values described in the Experimental section. Approximately 25% of these values were originally estimated by UNIFAC or MOSCED; these values were improved using the factor analysis approach for compensating for missing values, as described in the Theory section. During this process, all ln γ∞ values estimated from the factor analysis were examined and compared to the original experimental values contained in the database. Residuals which showed deviations greater than 20% were flagged; this allowed us to identify several typographical errors and suspicious values in the database. These errors were corrected before proceeding with the data analysis. The matrix returned by the missing data routine was analyzed by PCA and the abstract factors saved for target testing and for the cluster diagrams.

The results of principal component analysis of this data set show that the first 4 principal components describe 99.6% of the variation in the data. The first principal component describes 81.0% of the variance in the data set. The second, third and fourth principal components describe 15.7, 1.9 and 1.1% of the variance, respectively. The residual standard deviation should estimate the experimental uncertainty in the data set, if the correct number of factors has been chosen. PCA performed on the data set during the first iteration of the missing data routine (assuming four principal components) yielded an RSD of 0.188 ln units, or approximately a 20% error in the γ∞ values. This is consistent with the known error characteristics of the UNIFAC and MOSCED models. After the final iteration, the RSD was 0.102, reflecting an 11% error in the γ∞ values, which is more consistent with the uncertainties estimated for the experimental data contained in the matrix [19].

The correlations between each abstract factor and the ln γ∞ values for each solute were determined. Figures 2-5 show these correlations. Correlation coefficients in the range of 0.5-0.7 indicate a moderate correlation and correlation coefficients in the range 0.7-1.0 indicate a strong correlation. Figure 2 shows the correlations for the first abstract factor with the solute ln γ∞ values. The correlation coefficients on this figure fall into 2 groups: those with correlation coefficients above 0.8 and those with correlation coefficients below 0.4. This indicates that the pattern in the most important principal component follows the behavior of the non-polar aliphatic, aromatic and halogenated aliphatic solutes. A majority of the solutes in this data set are non-polar and the first principal component reflects this bias. This is due to the fact that the first principal component always describes the greatest amount of variance in the data set.

[Figure 2: horizontal bar chart. Vertical axis: SOLUTE (CYCLOHEXANE, 2,3-DIMETHYL BUTANE, n-PENTANE, 2,4-DIMETHYL PENTANE, 2,5-DIMETHYL HEXANE, n-HEXANE, n-HEPTANE, 2-METHYL PENTANE, n-OCTANE, n-NONANE, ETHYL CYCLOHEXANE, I-PENTENE, 1-PENTENE, ISOPRENE, TOLUENE, BENZENE, 1,4-DIOXANE, 2-BUTANONE, ACETONE, CARBON TETRACHLORIDE, n-PROPYL CHLORIDE, t-BUTYL CHLORIDE, ETHYL BROMIDE, ETHYL IODIDE, CHLOROFORM, METHYL IODIDE, METHYLENE CHLORIDE, CARBON DISULFIDE, ACETONITRILE, NITROMETHANE, ETHANOL). Horizontal axis: CORRELATION, 0 to 1.]

Fig. 2. Solute correlations with the first principal component (v₁). Solid bars represent positive correlation and hatched bars represent a negative correlation.

Figure 3 shows the correlations for the second abstract factor with the solute ln γ∞ values. The solutes that give a strong correlation with the ln γ∞ values are 1,4-dioxane, 2-butanone, acetone, acetonitrile, nitromethane and ethanol. These solutes are polar and are relatively strong hydrogen bonding bases. Figure 4 shows the correlations for the third abstract factor with the solute ln γ∞ values. This figure indicates a moderate correlation with methylene chloride and chloroform; however, benzene and toluene also show moderate correlation coefficients. This result could be due to the polarizability of the solute contributing to this abstract factor.
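The percent errors quoted above for the RSD values can be related to ln units by a simple conversion. The paper does not state how the conversion was made, so the sketch below assumes that an error of σ in ln γ∞ corresponds to a relative error of exp(σ) - 1 in γ∞. The function name is hypothetical.

```python
import math

def ln_error_to_percent(sigma_ln):
    """Percent error in gamma corresponding to an error of sigma_ln in ln(gamma)."""
    return 100.0 * (math.exp(sigma_ln) - 1.0)

# RSD values from the first and final iterations of the missing data routine,
# followed by the two noise levels used for the synthetic matrices.
for sigma in (0.188, 0.102, 0.05, 0.20):
    print(f"{sigma:.3f} ln units -> {ln_error_to_percent(sigma):.1f}% in gamma")
```

Under this assumption, 0.188 and 0.102 ln units correspond to roughly 21% and 11%, in line with the approximate figures given in the text. For small σ the conversion reduces to about 100σ, which is why 0.05 ln units is described as approximately 5% noise.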


Figure 5 shows the correlations for the fourth abstract factor with the solute In y" values . This abstract factor is moderately correlated with chloroform and methylene chloride, while showing very little correlation with any of the other solutes. This suggests that chloroform and methylene chloride are unique in their behavior as compared to the other solutes in the data set . This could be because chloroform and methylene chloride are slightly acidic as hydrogen bond donors, but are only very weak hydrogen bond donors. Figures 6 and 7 show the correlations with the first two abstract factors and the In y °° values for each solvent . Figure 6 shows the results for the correlation of the first principal component with the solvent In y' values. This principal component shows a strong correlation with the polar compounds, while it shows a weak correlation

Fig. 4. Solute correlations with the third principal component (v3). Axis: correlation, 0 to 1, for each solute (see Fig. 2 for explanation).

Fig. 3. Solute correlations with the second principal component (v2). Axis: correlation, 0 to 1, for each solute (see Fig. 2 for explanation).

with the non-polar solvents. Since there are more polar solvents present in the data matrix, this result is reasonable. Figure 7 shows the correlation of the second solute principal component with the solvent ln γ∞ values. This principal component has a strong correlation with the non-polar solvents while it shows a weak correlation with the polar compounds. The correlations for the third and fourth principal components are not given, as they showed essentially no correlation with any of the solvents. Another means for presenting these results is to plot each of the solutes (or solvents) in the coordinate system generated by the principal components analysis. Figure 8 is a cluster diagram of the solutes plotted as a function of the first and second principal components. The non-polar solutes lie in a line parallel to the first principal component. The chlorinated aliphatic


compounds lie in a cluster centered around 0.1 on the first principal component axis. Chloroform and methylene chloride lie to the left of the chlorinated group. The two aromatic compounds included as solutes lie in the center of the chlorinated alkane cluster. The oxygen containing bases and the polar compounds form groups of their own. All of these groupings are consistent with the known chemistry of these compounds. The principal components were used for target testing to evaluate the validity of the MOSCED model, and to explore the use of various pure component parameters within this model format. Since the MOSCED model expressed by Eqn. 6 yielded target test terms (shown in Table 2) that either confounded the q and τ terms with one another, or generated additive terms where the relative weighting of the terms would be difficult to ascertain for real physical parameters (i.e.,

Fig. 6. Solvent correlations with the first principal component (u1). Axis: correlation, 0 to 1, for each solvent (see Fig. 2 for explanation).

+ αβ), we chose to evaluate the modified MOSCED model expressed by Eqn. 7. The formats for the target test terms for this model are given in Table 3. There are several experimental scales that might be used to represent the parameters that appear in this model. For the volume scale, the molar volume (V) and the hard sphere volume (V_B), as estimated according to Bondi [20], were tested. For representing the polarizability, the refractive index (n_D) and Onsager's function of the refractive index [f(n_D)] were examined [21]. For the polarity parameter, the solvatochromic π*_KT scale, as developed by Kamlet et al. [22], which characterizes the dipolarity and polarizability of a solvent, was examined. An alternate π* scale (π2*^C), which describes the dipolarity-polarizability of solutes, has been developed by Li et al. [23].

Fig. 5. Solute correlations with the fourth principal component (v4). Axis: correlation, 0 to 1, for each solute (see Fig. 2 for explanation).

Hydrogen bond acidity and basicity of the solvents were modeled by the α_KT and β_KT scales of Taft et al. [24]. Hydrogen bond acidity and basicity of the solutes were modeled by the α2^h and β2^h scales of Abraham et al. [25], or by the chromatographically-based α2^C and β2^C scales of Li et al. [23,26]. The results of the target testing are listed in Tables 9 and 10. A normalized standard error (NSE) of 40% or less was considered a good fit. Fits with NSE values between 40% and 80% still appeared to be correlated with the test vectors when examined visually, and NSE values over 80% indicated a poor quality of fit. Test parameters with a small range compared to their offsets tend to have higher NSEs; however, a high NSE does not necessarily imply that the factor does not contribute to the overall model, only that it is more difficult to identify using target testing. The NSE values obtained were compared to NSE values


Fig. 8. Cluster diagram of the solutes plotted as a function of the first principal component (horizontal axis, 0 to 0.3) and the second principal component (vertical axis, -0.1 to 0.6). (*) Non-polar compounds (cyclohexane, 2,3-dimethylbutane, n-pentane, 2,4-dimethylpentane, 2,5-dimethylhexane, n-hexane, n-heptane, 2-methylpentane, n-octane, n-nonane, ethylcyclohexane, isopentene, 1-pentene, isoprene and carbon disulfide). (+) Aromatic compounds (toluene and benzene). (•) Oxygen containing base compounds (1,4-dioxane, 2-butanone and acetone). (o) Halogenated compounds (carbon tetrachloride, n-propyl chloride, tert-butyl chloride, ethyl bromide, ethyl iodide, chloroform, methyl iodide and methylene chloride). (X) Polar compounds (acetonitrile, nitromethane and ethanol).

Fig. 7. Solvent correlations with the second principal component (u2). Axis: correlation, 0 to 1, for each solvent (see Fig. 2 for explanation).

obtained from fits to random test vectors with the same offset and standard deviation as the test vector . The results listed are the 95% confidence intervals for the NSE values for 6 random test vectors . These results indicate that while many of the NSE values are large, the test vectors can account for more of the variation in the data than a purely random vector. The standard errors of the test vectors are also given in Table 9 . Another approach for evaluating the success of target testing is to compare the standard error of fit to the experimental error in the target test vector parameters . In general, the factors for the solutes fit slightly better than the corresponding terms for the solvents. This may be related to the observation that the third and fourth solvent factors do not yield significant correlations with any of the individual solvents that were studied . There does not ap-




TABLE 9
Target tests for modified MOSCED model: solutes (a)

Parameter                     NSE (%)   SE (b)    Range (c)          95% C.I. (d) (%)

Model term: V2
V_B                           60        12        28 to 99           86-164
V                             60        21        52 to 180          86-158

Model term: V2 λ2
V f(n_D)                      60        4.0       9.3 to 35          118-138

Model terms: V2(2λ2 + τ2), V2(λ2 + 2τ2), V2 τ2
V π2*^C                       39        8.2       -29 to 36          98-110
V_B π2*^C                     39        4.4       -16 to 20          97-105
π2*^C                         21        0.050     -0.18 to 0.67      86-110
V^1/2 π2*^C                   30        0.65      -2.1 to 4.9        93-111
V π*_KT                       51        13        -10 to 63          99-109
V_B π*_KT                     51        6.8       -5.5 to 33         95-105
π*_KT                         37        0.12      -0.08 to 0.85      96-108

Model term: V2(λ2 + λ2τ2 + α2β2) + 0.3 ln V2
V δ_H²/RT + 0.3 ln V          109       2.6       9.4 to 20          124-196
V_B δ_H²/RT + 0.3 ln V_B      101       1.5       5.3 to 12          129-221
V δ_H²                        101       1400      4800 to 11000      122-214
δ_H²                          48        15        45 to 170          100-124
V^1/2 δ_H²                    75        140       460 to 1200        105-179

Model term: V2 β2
V β_KT                        81        9.3       0 to 43            96-114
V_B β_KT                      81        5.1       0 to 24            90-112
β_KT                          64        0.099     0 to 0.48          94-110
V^1/2 β_KT                    73        0.97      0 to 4.5           90-116
V β2^h                        78        9.0       0 to 43            85-117
V_B β2^h                      77        4.8       0 to 24            96-112
β2^h                          61        0.096     0 to 0.50          88-116
V^1/2 β2^h                    70        0.93      0 to 4.5           93-115
V β2^C                        87        13        0 to 67            93-107
V_B β2^C                      88        7.5       0 to 38            96-110
β2^C                          77        0.15      0 to 0.79          85-113
V^1/2 β2^C                    80        1.5       0 to 7.5           81-111

Model term: V2 α2
V α_KT                        34        3.7       0 to 48            90-108
V_B α_KT                      33        1.9       0 to 25            80-112
α_KT                          25        0.044     0 to 0.82          83-113
V α2^h                        36        1.7       0 to 19            90-110
V_B α2^h                      35        0.87      0 to 10            80-116
α2^h                          28        0.021     0 to 0.33          86-114
V α2^C                        34        1.4       0 to 19            95-109
V_B α2^C                      33        0.70      0 to 10            83-113
α2^C                          26        0.017     0 to 0.33          84-114
V^1/2 α2^C                    27        0.15      0 to 2.5           85-111

Offset                        12        0.23      -                  -

(a) See List of symbols. (b) Standard error. (c) Minimum and maximum parameter values for the compounds in the data set. (d) 95% confidence interval of the NSE values for 6 random vectors.




TABLE 10
Target tests for modified MOSCED model: solvents (a)

Parameter                     NSE (%)   SE (b)    Range (c)          95% C.I. (d) (%)

Model term: (λ1² + λ1τ1 + τ1² + α1β1)
δ_H²                          74        28        47 to 190          88-152
δ_H²/V                        50        0.38      0.13 to 3.0        77-127
δ_H²/V^1/2                    54        2.9       2.9 to 22          83-125

Model term: λ1
n_D                           430       0.31      1.35 to 1.63       282-572
f(n_D)                        220       0.051     0.18 to 0.26       167-271

Model terms: (2λ1 + τ1), (λ1 + 2τ1), τ1
π2*^C                         47        0.16      -0.16 to 1.0       79-133
π2*^C/V                       29        0.0013    -0.001 to 0.014    89-123
π2*^C/V^1/2                   36        0.013     -0.01 to 0.12      92-126
π*_KT                         55        0.19      -0.04 to 1.0       84-126
π*_KT/V                       36        0.0017    -0.003 to 0.016    84-128
π*_KT/V^1/2                   42        0.016     -0.004 to 0.12     97-115

Offset                        11        0.20      -                  -

Model term: α1
α2^h                          72        0.080     0 to 0.33          94-110
α2^h/V                        65        0.0009    0 to 0.0057        84-114
α2^h/V^1/2                    67        0.0080    0 to 0.043         77-125
α2^C                          73        0.078     0 to 0.33          88-108
α2^C/V                        68        0.0009    0 to 0.0057        76-116
α2^C/V^1/2                    70        0.0079    0 to 0.043         85-115
α_KT                          72        0.18      0 to 0.82          97-111
α_KT/V                        66        0.0022    0 to 0.014         88-116
α_KT/V^1/2                    68        0.019     0 to 0.11          92-108

Model term: β1
β2^h                          50        0.12      0 to 0.78          81-125
β2^h/V                        35        0.0011    0 to 0.011         93-115
β2^h/V^1/2                    39        0.010     0 to 0.093         84-126
β2^C                          37        0.13      0 to 1.42          85-117
β2^C/V                        31        0.0014    0 to 0.020         77-123
β2^C/V^1/2                    31        0.013     0 to 0.17          92-116
β_KT                          64        0.19      0 to 0.76          83-119
β_KT/V                        42        0.0015    0 to 0.011         99-119
β_KT/V^1/2                    50        0.016     0 to 0.090         90-114

Model term: ln V1
ln V_B                        200       0.92      3.3 to 5.7         159-255
ln V                          230       1.03      4.0 to 6.3         193-285

(a-d) See footnotes to Table 9.

pear to be a significant difference in the quality of fits when using molar volume or the Bondi volume scale . This should be expected since these scales are very collinear . The parameters used for

modeling the polarizability of the solvent gave particularly poor NSE values, but this is largely due to the limited range of these parameters. In the case of the solutes, the Vπ* terms did not fit as well as π*, and for the solvents, the π* terms did not fit as well as π*/V, indicating that the π* scales may require normalization by the volume or the square root of the volume, as might be suspected from a dimensional analysis. Terms for the hydrogen bonding acidity fit well for the solutes in the data set but did not fit as well for the solvents. Terms for the hydrogen bonding basicity fit well for the solvents in the data set but did not fit as well for the solutes. These two results are consistent with each other, since the solute acidity parameter and the solvent basicity parameter appear in the same term, as seen in Table 3. In addition, for the solutes, the Vα and Vβ terms did not fit as well as α and β, and for the solvents, the α and β terms did not fit as well as α/V and β/V, indicating that the hydrogen bonding terms may also require normalization by the volume. In theory, the solute-based scales (C and h) should better represent solute properties, while the solvent-based scales (KT) should better represent the solvent properties; however, the NSE values indicate that the solute-based scales (C and h) fit as well as the solvent-based scales (KT) for the solvents.
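The comparison between a test parameter and random test vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes that a test vector is fitted by least squares to the retained abstract factors, that the standard error is the root-mean-square residual, and that the NSE is this standard error expressed as a percentage of the standard deviation of the test vector. The function names `target_test` and `random_nse` and the synthetic data are hypothetical.

```python
import numpy as np

def target_test(factors, test_vector):
    """Fit `test_vector` by least squares to the columns of `factors`.
    Returns (standard error, normalized standard error in percent)."""
    coef, *_ = np.linalg.lstsq(factors, test_vector, rcond=None)
    resid = test_vector - factors @ coef
    se = np.sqrt(np.mean(resid ** 2))           # assumed definition of SE
    nse = 100.0 * se / np.std(test_vector)      # assumed definition of NSE
    return se, nse

def random_nse(factors, test_vector, n_random=6, seed=0):
    """NSE values for random vectors rescaled to have the same offset
    (mean) and standard deviation as `test_vector`."""
    rng = np.random.default_rng(seed)
    mean, sd = test_vector.mean(), test_vector.std()
    values = []
    for _ in range(n_random):
        z = rng.normal(size=test_vector.size)
        z = (z - z.mean()) / z.std()             # exact mean 0, std 1
        values.append(target_test(factors, mean + sd * z)[1])
    return values

# Hypothetical two-factor data: 29 solvents x 31 solutes.
rng = np.random.default_rng(1)
p = rng.normal(size=29)                          # "true" solvent property
q = rng.normal(size=29)                          # second solvent property
data = np.outer(p, rng.normal(size=31)) + np.outer(q, rng.normal(size=31))
data += 0.01 * rng.normal(size=data.shape)       # small measurement noise

u, s, vt = np.linalg.svd(data, full_matrices=False)
factors = u[:, :2]                               # retained abstract factors

se_true, nse_true = target_test(factors, p)
nse_random = random_nse(factors, p)
```

Under these assumptions a parameter that truly underlies the data gives an NSE far below that of the random vectors, whose NSE values scatter around 100%. Because the fit has no separate intercept, a parameter whose offset is large relative to its range can give an NSE well above 100%, which is consistent with the large values reported for the refractive index scales in Table 10.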

Conclusions

This paper illustrates a novel use of factor analysis to characterize infinite dilution activity coefficients. Factor analysis aided in the detection of outliers and typographical errors in the database. In addition, by using factor analysis with synthetic data matrices generated to study the different models for infinite dilution activity coefficients, we were able to determine the number of factors for each model and identify factors which may be more susceptible to noise. This work with the synthetic data matrices also indicated which model for infinite dilution activity coefficients may be the most desirable to use to develop different scales for the physicochemical parameters (i.e., volume, dipolarity, dispersion and hydrogen bonding interactions). Factor analysis and target testing provide a basis to study different model forms of the MOSCED equation and different physicochemical scales by using a database of real infinite dilution activity coefficients and physical properties. The quality of fit from the test parameters suggests that some scales may need to be normalized by the volume, as indicated by the lower normalized standard errors. The scope of the target testing described here was limited by the noise in the original data set. However, this paper demonstrates that factor analysis methods can be used successfully in the evaluation of very complex models.

SCR and RBP would like to acknowledge the National Science Foundation (Grant CHE-8921315) for supporting this research. MJH and CAE would like to acknowledge the financial support of DuPont Chemical Co., Amoco Chemical Co. and the National Science Foundation. PWC would like to acknowledge support from the National Science Foundation.

LIST OF SYMBOLS

Subscripts
1         Solvent
2         Solute

Symbols
V         Volume or molar volume.
V_B       Hard sphere volume according to Bondi [20].
δ         Hildebrand's solubility parameter for synthetic alkane model.
δ_H       Hildebrand's solubility parameter (experimentally determined).
n_D       Refractive index.
f(n_D)    Onsager function of refractive index [21], f(n_D) = (n_D² - 1)/(2n_D² + 1).
λ         Dispersion parameter for synthetic MOSCED models.
q         Induction parameter for synthetic MOSCED model.
τ         Polar parameter for synthetic MOSCED models.
π2*^C     Combination dipolarity-polarizability parameter experimentally determined by Li et al. [23].
π*_KT     Combination dipolarity-polarizability parameter experimentally determined by Kamlet et al. [22].
α         Hydrogen bonding acidity parameter for synthetic MOSCED models.
α2^C      Hydrogen bonding acidity parameter experimentally determined by Li et al. [23].
α_KT      Hydrogen bonding acidity parameter experimentally determined by Taft et al. [24].
α2^h      Hydrogen bonding acidity parameter experimentally determined by Abraham et al. [25].
β         Hydrogen bonding basicity parameter for synthetic MOSCED models.
β2^C      Hydrogen bonding basicity parameter experimentally determined by Li et al. [26].
β_KT      Hydrogen bonding basicity parameter experimentally determined by Taft et al. [24].
β2^h      Hydrogen bonding basicity parameter experimentally determined by Abraham et al. [25].
1         Offset vector [1 1 1 ... 1 1].

Notation for linear algebra Scalars are represented by lower case italics . Column vectors are represented by bold, lower case letters. Row vectors are represented by bold, lower case letters with the superscript T for transpose. Matrices are represented by bold, upper case letters.

REFERENCES

1 S. Zeck, Fluid Phase Equilib., 70 (1991) 125.
2 J.M. Prausnitz, R.N. Lichtenthaler and E.G. De Azevedo, Molecular Thermodynamics of Fluid-phase Equilibria, Prentice-Hall, Englewood Cliffs, NJ, 2nd edn., 1986.
3 M. Roth and J. Novak, J. Chromatogr., 258 (1983) 23.
4 L.B. Schreiber and C.A. Eckert, Ind. Eng. Chem. Process Des. Dev., 10 (1971) 572.
5 A. Hussam and P.W. Carr, Anal. Chem., 57 (1985) 793.
6 E.R. Thomas, B.A. Newman, G.L. Nicolaides and C.A. Eckert, J. Chem. Eng. Data, 27 (1982) 233.
7 J.M. Miller, Chromatography: Concepts and Contrasts, Wiley, New York, 1988, p. 150.
8 J.C. Bastos, M.E. Soares and A.G. Medina, Ind. Eng. Chem. Res., 27 (1988) 1269.
9 J.H. Park and P.W. Carr, Anal. Chem., 59 (1987) 2596.
10 E.R. Thomas and C.A. Eckert, Ind. Eng. Chem. Process Des. Dev., 23 (1984) 194.
11 W.J. Howell, A.M. Karachewski, K.M. Stephenson, C.A. Eckert, J.H. Park, P.W. Carr and S.C. Rutan, Fluid Phase Equilib., 52 (1989) 151.
12 A.P. Dempster, N.M. Laird and D.B. Rubin, J. Roy. Stat. Soc. B, 39 (1977) 1.
13 E.R. Malinowski, Anal. Chem., 49 (1977) 612.
14 E.R. Malinowski, Anal. Chem., 49 (1977) 606.
15 E.R. Malinowski, J. Chemom., 3 (1988) 49.
16 E.R. Malinowski, Factor Analysis in Chemistry, Wiley, New York, 2nd edn., 1991.
17 J.H. Hildebrand, J.M. Prausnitz and R.L. Scott, Regular and Related Solutions, Van Nostrand Reinhold, New York, 1970.
18 S.C. Rutan and P.W. Carr, (1988) unpublished results.
19 M.J. Hait, C.L. Liotta, C.A. Eckert, D.L. Bergmann, A.M. Karachewski, A.J. Dallas, D.I. Eikens, J.J. Li, P.W. Carr, R.B. Poe and S.C. Rutan, Ind. Eng. Chem. Res., submitted for publication.
20 A. Bondi, Physical Properties of Molecular Crystals, Liquids, and Glasses, Wiley, New York, 1968.
21 L. Onsager, J. Am. Chem. Soc., 58 (1936) 1486.
22 M.J. Kamlet, J.L. Abboud and R.W. Taft, J. Am. Chem. Soc., 99 (1977) 6027.
23 J. Li, Y. Zhang, A.J. Dallas and P.W. Carr, J. Chromatogr., 550 (1991) 101.
24 R.W. Taft, J.M. Abboud, M.J. Kamlet and M.H. Abraham, J. Sol. Chem., 14 (1985) 153.
25 M.H. Abraham, P.L. Grellier, D.V. Prior, P.P. Duce, J.J. Morris and P.J. Taylor, J. Chem. Soc., Perkin Trans. II, (1989) 699.
26 J. Li, Y. Zhang, H. Ouyang and P.W. Carr, J. Am. Chem. Soc., (1992) in press.