Atmospheric Environment 36 (2002) 3629–3641
Source apportionment of exposures to volatile organic compounds. I. Evaluation of receptor models using simulated exposure data Shelly L. Miller*, Melissa J. Anderson, Eileen P. Daly, Jana B. Milford Department of Mechanical Engineering, University of Colorado, 427 UCB, Boulder, CO 80309-0427, USA Received 2 October 2001; accepted 21 March 2002
Abstract Four receptor-oriented source apportionment models were evaluated by applying them to simulated personal exposure data for select volatile organic compounds (VOCs) that were generated by Monte Carlo sampling from known source contributions and profiles. The exposure sources modeled are environmental tobacco smoke, paint emissions, cleaning and/or pesticide products, gasoline vapors, automobile exhaust, and wastewater treatment plant emissions. The receptor models analyzed are chemical mass balance, principal component analysis/absolute principal component scores, positive matrix factorization (PMF), and graphical ratio analysis for composition estimates/source apportionment by factors with explicit restriction, incorporated in the UNMIX model. All models identified only the major contributors to total exposure concentrations. PMF extracted factor profiles that most closely represented the major sources used to generate the simulated data. None of the models were able to distinguish between sources with similar chemical profiles. Sources that contributed <5% to the average total VOC exposure were not identified. © 2002 Elsevier Science Ltd. All rights reserved. Keywords: Hazardous air pollutants; Air toxics; Source attribution
*Corresponding author. Tel.: +1-303-492-0587; fax: +1-303-492-3498. E-mail address: [email protected] (S.L. Miller).
1352-2310/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S1352-2310(02)00279-0
1. Introduction Source-oriented dispersion models have traditionally been used to predict impacts from outdoor sources on air pollution concentrations. Indoor sources of air pollution and exposures due to personal activities, however, tend to be more erratic than outdoor sources, due to variability in activities, indoor environments, and source strength. Thus, attempting to estimate source contributions to personal exposure using source-oriented models can be difficult. Receptor-oriented models may be better suited to this type of analysis because only the receptor concentration and, for some models, the chemical composition of emissions (the source profiles) need to be known. Nevertheless, while receptor modeling has been widely used to estimate source contributions to outdoor concentrations, few studies have analyzed personal exposure using receptor modeling (Anderson et al., 2001; Yakovleva et al., 1999). The primary goals of this study were to develop a simulated data set of personal exposures and to evaluate the ability of four currently available receptor models to estimate source contributions to the simulated personal exposures. The receptor models applied are the chemical mass balance (CMB), principal component analysis (PCA)/absolute principal component scores (APCS), positive matrix factorization (PMF), and graphical ratio analysis for composition estimates (GRACE)/source apportionment by factors with explicit restriction (SAFER), incorporated in the UNMIX model. CMB is a receptor model that has been widely applied to ambient air pollution problems (Watson et al., 1990a, 2001). Recently, for example, Zappoli et al. (1998) used
CMB to determine organic components of fine aerosol in Europe. Vega et al. (2000) applied CMB to hydrocarbons in Mexico City using emissions profiles measured at local sources. Numerous studies have demonstrated the use of PCA for identifying and estimating pollutant source profiles (Daisey et al., 1994; Sweet and Vermette, 1992; Thurston and Spengler, 1985). Yu and Chang (2000) and Jeon et al. (2001) used PCA to analyze sources of ambient particulate matter and ozone, respectively. The APCS method, which quantifies source profiles and contributions from PCA factors, was first applied by Thurston and Spengler (1985) to determine sources of particulate matter in Boston. PMF is an advanced algorithm in receptor modeling (Paatero and Tapper, 1994). Previous PMF applications include identifying sources of bulk wet deposition concentrations of strong acids (Anttila et al., 1995), personal exposures to particulate matter (Yakovleva et al., 1999), particulate matter in urban aerosol (Ramadan et al., 2000) and personal exposure and outdoor volatile organic compounds (VOCs) (Anderson et al., 2001). GRACE/SAFER, incorporated in the UNMIX model, is another advanced technique (Henry, 2000). Henry et al. (1994) proposed the GRACE/SAFER method to overcome limitations of PCA, upon which the model is based, and used it to determine the composition of mobile VOC sources. Kim and Henry (2000) applied the SAFER model to ambient particulate matter data. While these models have been used for source apportionment, it is not obvious which model or combination of models is most suitable to solve a specific problem. One method of testing the utility of receptor models involves creating a simulated data set with known contributions from a variety of sources. 
For ambient applications, Quail Roost II, a simulated data set of outdoor particulate matter concentrations, was created to test the abilities of various receptor models in apportioning particulate matter to its sources (Currie et al., 1984). In this study, to assess the limitations and abilities of each receptor model to extract sources of personal exposure to select VOCs, a simulated data set of realistic exposure concentrations was created. Generated by Monte Carlo sampling, this data set includes simulated exposures from personal activities, as well as contributions from outdoor sources. CMB, PCA/APCS, PMF, and GRACE/SAFER with UNMIX were applied to the simulated data set to test their ability to estimate source contributions and, in the latter three cases, source profiles. Results were compared across models and to the known contributions. In Part II of this paper, Anderson et al. (2002) applied the models to personal exposure data from studies conducted in California and New Jersey from 1984–1990.
2. Simulated data set The simulated data set, consisting of 829 observations, reflects exposures to 13 VOCs from both activity-related and outdoor sources: environmental tobacco smoke (ETS), paint emissions, cleaning and/or pesticide products, pumping gasoline, gasoline vapors, automobile exhaust, and wastewater treatment plant emissions. The suite of VOCs was chosen to reflect the compounds that were measured in the 1984–1990 US Environmental Protection Agency (EPA) Total Exposure Assessment Methodology (TEAM) studies of exposure to VOCs. To create the simulated data set, 24-h averaged VOC exposures were computed by summing over all sources the product of two terms: (1) the time the subject was exposed to a source over the course of a day and (2) the VOC exposure concentration for that source. The Activity Patterns of California Residents (APCR) database was used to simulate exposure of adults to each source (Jenkins et al., 1992; Wiley et al., 1991). The APCR data were generated from a random sampling of California residents. In addition to answering general interview questions, the respondents recounted a 24-h diary of their activities from the previous day. These diary data were used in this study to determine a subset of respondents who had been exposed to ETS or had conducted any of the activities of interest (painting, pumping gasoline, and cleaning) and the duration of time for each. Table 1 lists, for each source, the number of adults simulated and the mean and standard deviation of the time exposed. The number of adults who reported painting or pumping gasoline in the APCR database was increased to augment the contribution of these sources to total exposure. VOC exposure concentrations from published measurements were used to simulate exposure to ETS (Coultas et al., 1990; Miller et al., 1998; Proctor et al., 1991), gasoline vapors from pumping gasoline (Rappaport et al., 1987), and paint emissions (Norbäck et al., 1995).
Published indoor air concentrations were used to simulate exposure from cleaning and/or pesticide products (Wallace et al., 1987), since no exposure concentrations for this source were available. The exposure concentrations used to simulate exposures to gasoline and painting were altered slightly from the original published data. Decane was omitted for gasoline vapors, due to its small concentration. 1,1,1-TCA and benzene were added to painting because these two compounds were prevalent in painted sheet-rock emissions (Wallace et al., 1987). Because the concentrations were sometimes reported as arithmetic means (AM) and standard deviations (ASD), geometric means (GM) and standard deviations (GSD) had to be estimated for those concentrations. Assuming that the true distribution was lognormal, a best fit to the lognormal distribution with the given AM
Table 1
Concentrations and other parameters for activity-related sources used to create simulated data setᵃ

Parameter/compoundᵇ     ETS              Pumping gas      Paint            Cleaning and/or pesticides
Reference(s)            (Proctor et al., 1991; Coultas et al., 1990)ᶜ   (Rappaport et al., 1987)   (Norbäck et al., 1995)   (Wallace et al., 1987)
No. exposedᵈ            580              69               12               627
Exposure timeᵉ (min)    94 (127)         9.7 (5.1)        125 (139)        56 (64)
TCA                     –                –                4.5 (1.6)        641 (1.5)
BNZ                     16 (3.4)         475 (2.2)        17 (1.3)         –
TET                     –                –                –                1331 (1.1)
CFM                     –                –                –                283 (1.0)
EBZ                     12 (3.4)         44 (2.3)         52 (1.5)         –
MPX                     44 (3.4)         544 (1.9)        97 (1.7)         –
DEC                     –                –                9.0 (6.0)        3.2 (1.2)
UND                     –                –                13 (4.7)         21 (0.50)ᶠ
OXY                     13 (3.4)         65 (2.3)         –                –
PDB                     –                –                –                8.4 (1.0)
STR                     2.5 (3.4)        –                –                –
TCE                     –                –                –                6.6 (1.4)
TOL                     94 (3.4)         604 (1.9)        54 (3.9)         –

ᵃ GM (µg m⁻³) and GSD (in parentheses) given for individual compound concentrations except where indicated.
ᵇ TCA = 1,1,1-Trichloroethane; BNZ = Benzene; TET = Carbon Tetrachloride; CFM = Chloroform; EBZ = Ethylbenzene; MPX = m,p-Xylene; DEC = n-Decane; UND = n-Undecane; OXY = o-Xylene; PDB = p-Dichlorobenzene; STR = Styrene; TCE = Trichloroethylene; TOL = Toluene.
ᶜ GM from (Proctor et al., 1991), GSD from (Coultas et al., 1990).
ᵈ Number of participants exposed to this source.
ᵉ AM (ASD) for individuals exposed to VOC from this activity.
ᶠ AM and ASD given.
and ASD was iteratively constructed. Table 1 details the GM and GSD used to create the activity-related exposures in the simulated data. Exposure concentrations of each VOC from each source were simulated as shown below:

$$C_{ijk} = \exp\left[\ln(\mathrm{GM}_{ij}) + \ln(\mathrm{GSD}_{ij})\,N_{ijk}\right], \quad j = 1, \ldots, 4, \tag{1}$$

where $C_{ijk}$ is the concentration of compound $i$ from source $j$ to which participant $k$ was exposed (µg m⁻³). ETS is denoted as source 1, paint emissions as source 2, cleaning and/or pesticide products as source 3, and gasoline vapors from pumping gasoline as source 4. $\mathrm{GM}_{ij}$ is the GM of the concentration of compound $i$ from source $j$ (µg m⁻³) and $\mathrm{GSD}_{ij}$ is the GSD of the concentration of compound $i$ from source $j$ (unitless). $N_{ijk}$ is a random number normally distributed with a mean of 0 and standard deviation of 1, denoted as $N(0, 1)$, included to add noise to the data, with a different value drawn for each source, compound and participant (Miller et al., 1998). The 24-h average exposure to compound $i$ from source $j$ for participant $k$, $E_{ijk}$ (µg m⁻³), is

$$E_{ijk} = \frac{C_{ijk}\,t_{jk}}{60 \times 24}, \quad j = 1, \ldots, 4, \tag{2}$$

where $t_{jk}$ is the time (min) participant $k$ was exposed to source $j$ according to the diary. To create a realistic data set, contributions from outdoor sources were also included in each simulated exposure. Emissions from three typical outdoor sources, gasoline vapors (Rappaport et al., 1987), automobile exhaust (Lin and Milford, 1994), and wastewater treatment plants (Harkov et al., 1987), created the outdoor contributions. The gasoline vapor profile used to create outdoor air concentrations was the same profile used in simulating exposure to pumping gasoline, with the modification that decane was included for the outdoor profile. It was assumed that the total VOC concentration from these three sources, including emissions of VOCs not modeled, summed to 140 µg m⁻³, of which 60% came from automobile exhaust, 25% from gasoline, and 15% from wastewater. Because mass fractions of the compounds modeled were lower for gasoline vapors than for wastewater, the actual percent contributions were 60% from automobile exhaust, 5% from gasoline, and 35% from wastewater. From the assumed concentrations, the arithmetic mean emission rates were calculated as follows:

$$\mathrm{ER}_j = C_t\, f_j\, W\, \bar{U}\, \mathrm{MH}, \quad j = 5, \ldots, 7, \tag{3}$$
where $\mathrm{ER}_j$ is the AM of the emission rate for source $j$ (µg s⁻¹). Gasoline vapors are denoted as source 5, automobile exhaust as source 6, and wastewater emissions as source 7. $C_t$ is the total ambient air concentration (140 µg m⁻³) and $f_j$ is the fraction of $C_t$ from source $j$. $W$ is the width of the affected airshed, assumed to be 10 km. $\bar{U}$ is the average daily windspeed (5 m s⁻¹) and MH is the average daily mixing height (1247 m) from an annual time series of historical meteorological data from Denver, CO, 1981 (National Climatic Data Center, 2001). Using an assumed arithmetic standard deviation for the total emissions rate for each source, corresponding geometric means and standard deviations were estimated. The emission rate of outdoor source $j$, $\mathrm{ER}_{jk}$ (µg s⁻¹), contributing to exposure of participant $k$ was then randomized as follows:

$$\mathrm{ER}_{jk} = \exp\left[\ln(\mathrm{GM}_{\mathrm{ER},j}) + \ln(\mathrm{GSD}_{\mathrm{ER},j})\,N_{jk}\right], \quad j = 5, \ldots, 7, \tag{4}$$

where $\mathrm{GM}_{\mathrm{ER},j}$ is the GM of the emission rate of source $j$ (µg s⁻¹), $\mathrm{GSD}_{\mathrm{ER},j}$ is the GSD of the emission rate of source $j$ (unitless), and $N_{jk}$ is a $N(0, 1)$ random number with a different value drawn for each source and participant. Emission rates of individual compounds within each source were then estimated as

$$\mathrm{ER}_{ijk} = \mathrm{ER}_{jk} \exp\left[\ln(\mathrm{GM}_{\mathrm{MF},ij}) + \ln(\mathrm{GSD}_{\mathrm{MF},ij})\,N_{ijk}\right], \quad j = 5, \ldots, 7, \tag{5}$$

where $\mathrm{GM}_{\mathrm{MF},ij}$ is the GM of the mass fraction of compound $i$ in emissions from source $j$ (unitless), $\mathrm{GSD}_{\mathrm{MF},ij}$ is the GSD of the mass fraction of compound $i$ in emissions from source $j$, and $N_{ijk}$ is a $N(0, 1)$ random number with a different value drawn for each compound, source, and participant. Table 2 details the parameter values used to create the exposures from outdoor sources. From the emissions rates, exposure concentrations of each VOC from the outdoor sources were calculated for each participant as shown below:

$$C_{ijk} = \frac{\mathrm{ER}_{ijk}}{W\,U_k\,\mathrm{MH}_k}, \quad j = 5, \ldots, 7, \tag{6}$$

where $C_{ijk}$ is the concentration of compound $i$ from outdoor source $j$ for participant $k$ (µg m⁻³). For each participant, a distinct value of the windspeed, $U_k$, and mixing height, $\mathrm{MH}_k$, was used from the time series of meteorological data from Denver, CO, 1981 (National Climatic Data Center, 2001). It was assumed that the participants were exposed to outdoor air for 24 h, so $E_{ijk} = C_{ijk}$ for $j = 5, \ldots, 7$. Finally, the 24-h average exposure of each selected APCR participant $k$ to compound $i$, $E_{ik}$, from all sources was computed:

$$E_{ik} = \sum_{j=1}^{7} E_{ijk}. \tag{7}$$
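The sampling procedure of Eqs. (1)–(7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: it follows benzene only, for one hypothetical participant exposed to ETS (Table 1 entries) and to outdoor automobile exhaust (Table 2 entries); it uses the annual-average windspeed and mixing height rather than a day-specific draw; it uses the arithmetic mean from Eq. (3) in place of the geometric mean in Eq. (4); and the function names and the random seed are invented for the example.

```python
import math
import random

# Constants quoted in Section 2.
C_TOTAL = 140.0      # total outdoor VOC concentration, ug m^-3
WIDTH = 10_000.0     # airshed width W, m
U_MEAN = 5.0         # average daily windspeed, m s^-1
MH_MEAN = 1247.0     # average daily mixing height, m


def lognormal_draw(gm, gsd, rng):
    """Eqs. (1), (4), (5): exp[ln(GM) + ln(GSD) * N(0, 1)]."""
    return math.exp(math.log(gm) + math.log(gsd) * rng.gauss(0.0, 1.0))


def activity_exposure(gm, gsd, minutes, rng):
    """Eq. (2): 24-h average exposure from an activity-related source."""
    return lognormal_draw(gm, gsd, rng) * minutes / (60 * 24)


def mean_emission_rate(fraction):
    """Eq. (3): arithmetic-mean emission rate in ug s^-1."""
    return C_TOTAL * fraction * WIDTH * U_MEAN * MH_MEAN


def outdoor_exposure(er_jk, gm_mf, gsd_mf, wind, mixing_height, rng):
    """Eqs. (5)-(6): compound emission rate divided by W * U_k * MH_k."""
    er_ijk = er_jk * lognormal_draw(gm_mf, gsd_mf, rng)
    return er_ijk / (WIDTH * wind * mixing_height)


rng = random.Random(0)  # hypothetical seed, for repeatability only

# Eq. (3) for automobile exhaust (60% of the total), converted to kg h^-1.
er_auto_kg_h = mean_emission_rate(0.60) * 3600 / 1e9

# Hypothetical participant: 94 min of ETS (benzene GM 16 ug m^-3, GSD 3.4).
e_ets = activity_exposure(16.0, 3.4, 94.0, rng)

# 24 h of outdoor auto exhaust (benzene mass fraction GM 0.0268, GSD 1.07).
# Assumption: the Eq. (3) mean stands in for GM_ER in Eq. (4).
er_jk = lognormal_draw(mean_emission_rate(0.60), 1.85, rng)
e_auto = outdoor_exposure(er_jk, 0.0268, 1.07, U_MEAN, MH_MEAN, rng)

# Eq. (7): total exposure is the sum over the sources considered.
e_total = e_ets + e_auto
print(f"ER_auto = {er_auto_kg_h:.0f} kg/h, benzene exposure = {e_total:.2f} ug/m3")
```

With these inputs Eq. (3) gives about 18,855 kg h⁻¹ for automobile exhaust, which is within roughly 0.2% of the emission-rate entry for that source in Table 2.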
3. Models CMB predicts the contribution of different sources to measured receptor concentrations using an inverse
Table 2
Compound mass fractions and other parameters for outdoor sources used to create simulated data setᵃ

Parameter/compound        Gas vapors                 Auto exhaust              Wastewater
Reference                 (Rappaport et al., 1987)   (Lin and Milford, 1994)   (Harkov et al., 1987)
No. exposedᵇ              829                        829                       829
Exposure time (min)       1440                       1440                      1440
Emission rate (kg h⁻¹)ᶜ   7847 (1.78)                18,830 (1.85)             4708 (1.78)
TCA                       –                          –                         0.1482 (1.28)
BNZ                       0.00923 (1.28)             0.0268 (1.07)             0.0115 (1.28)
TET                       –                          –                         0.0456 (1.28)
CFM                       –                          –                         0.0973 (1.28)
EBZ                       0.00090 (1.28)             0.0128 (1.02)             0.0078 (1.28)
MPX                       0.00474 (1.28)             0.0434 (1.03)             –
DEC                       0.00104 (1.28)             –                         –
UND                       –                          –                         –
OXY                       0.00130 (1.28)             0.0187 (1.03)             –
PDB                       –                          –                         –
STR                       –                          0.0042 (1.12)             –
TCE                       –                          –                         0.0114 (1.28)
TOL                       0.01071 (1.28)             0.0684 (1.06)             0.0465 (1.28)

ᵃ GM (unitless) and GSD given for individual compound mass fractions.
ᵇ Number of participants exposed to this source.
ᶜ GM (kg h⁻¹) and GSD given.
variance weighted least-squares linear regression (Watson et al., 1990a, b). Inputs to the model are chemical composition profiles for likely sources and the chemical composition of receptor concentrations. The fundamental equation in the CMB model is

$$E_{ik} = \sum_{j=1}^{p} F_{ij} S_{jk}, \quad i = 1, 2, \ldots, m; \; k = 1, 2, \ldots, n, \tag{8}$$

where $F_{ij}$, the known source profile matrix, is the fraction of emissions from source $j$ that is composed of compound $i$, and $S_{jk}$ is the contribution of source $j$ to the total measured VOC concentrations for participant $k$. CMB equations for the $m$ compounds in an observation are solved simultaneously by an effective variance least-squares estimate, in which the weighted sum of squared differences ($\chi_k^2$) between the measured and modeled concentrations is minimized (Britt and Luecke, 1973; Watson et al., 1984):

$$\chi_k^2 = \frac{1}{m - p} \sum_{i=1}^{m} \left[ \frac{\left( E_{ik} - \sum_{j=1}^{p} F_{ij} S_{jk} \right)^2}{V_{ik}} \right], \tag{9}$$

where

$$V_{ik} = \sigma_{E_{ik}}^2 + \sum_{j=1}^{p} (S_{jk})^2\, \sigma_{F_{ij}}^2,$$

where $\sigma_{E_{ik}}$ is one standard deviation of the measured exposure concentration of compound $i$ for participant $k$ and $\sigma_{F_{ij}}$ is one standard deviation of the measured fraction of compound $i$ in emissions from source $j$. The effective variance, $V_{ik}$, is adjusted iteratively as estimates of the $S_{jk}$ values are refined. This method gives a higher weight to compounds with higher precision in both the source and receptor measurements than to compounds with lower precision. In contrast to other receptor models, which extract source compositions from the data, CMB requires the user to supply source profiles. Also in contrast to the other models, CMB is applied separately to each observation, rather than operating on the data set as a whole. Assumptions made in using CMB have been discussed elsewhere (Watson et al., 1990a). Apportionment was performed on 100 randomly selected observations from the simulated data using CMB (version 7.0). The data were analyzed in two ways: using all source profiles creating the simulated data (Solution 1) and using only the source profiles that resulted in the best fit of the model (Solution 2). For Solution 2, apportionment was started using the six known source profiles (one profile was used for gasoline vapors even though it was used to simulate exposure to pumping gasoline and to an outdoor source) and proceeded by eliminating those with consistently negative source contribution estimates (SCE). Sources that did not improve the fit of the model were dropped as
well. The subset of source profiles reported best fit the greatest number of observations, as evidenced by $R^2$ and $\chi^2$ values, and the percentage ratio of calculated to actual total VOC exposure concentrations (% mass) (Anderson, 2001). Note that this application represents a best case for CMB because the source profiles used to generate the simulated data are known. PCA/APCS, PMF and UNMIX are all multivariate receptor models, which analyze a series of $k = 1, \ldots, n$ observations simultaneously in an attempt to determine the number of sources, $p$, their chemical composition, $F_{ij}$, and their contributions to each observation, $S_{jk}$, such that

$$E_{ik} = \sum_{j=1}^{p} F_{ij} S_{jk} + e_{ik}, \quad i = 1, 2, \ldots, m; \; k = 1, 2, \ldots, n, \tag{10}$$
where $e_{ik}$, the residual error, is minimized. Differences among the three multivariate models considered here include their treatment of errors or uncertainty in the exposure data, and their ability to incorporate physical constraints on factor profiles and source contribution estimates. Principal Component Analysis is a well-established tool for analyzing structure in multivariate data sets. It starts with a large number of correlated variables (e.g., exposure concentrations) and seeks to identify a smaller number of independent factors that can be used to explain the variance in the data (Hopke, 1985). The variables are assumed to be linearly related to some number of underlying factors. First, the data are transformed into a dimensionless standardized form:

$$Z_{ik} = \frac{E_{ik} - \bar{E}_i}{\sigma_i}, \quad i = 1, 2, \ldots, m; \; k = 1, 2, \ldots, n, \tag{11}$$

where $\bar{E}_i$ is the arithmetic mean exposure concentration for compound $i$ over all participants and $\sigma_i$ is its standard deviation. The PCA model is

$$Z_{ik} = \sum_{j=1}^{p} G_{ij} H_{jk}, \quad i = 1, 2, \ldots, m; \; k = 1, 2, \ldots, n, \tag{12}$$

or in matrix notation: $Z = GH$, where $G$ is the matrix of factor loadings and $H$ is a matrix of factor scores. $G_{ij}$ represents the correlation of compound $i$ with factor $j$ and $H_{jk}$ the relative impact of the $j$th factor for the total VOC exposure of the $k$th participant. The factor loadings in Eq. (12) are determined from an eigenvector decomposition of the matrix of pairwise correlation coefficients for the $m$ compounds:

$$R = \frac{1}{m - 1} Z Z^{T}. \tag{13}$$
Letting $\Lambda$ represent an $m \times m$ matrix whose columns are the eigenvectors of $R$ and $\lambda$ a diagonal matrix of its eigenvalues, the full factor-loading matrix $G$ is

$$G = \Lambda \lambda^{1/2}. \tag{14}$$
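The standardization and eigenvector steps of Eqs. (11)–(14) can be sketched with NumPy as follows. This is an illustration on synthetic data, not the software used in the study. Assumptions that are not in the text: the exposure matrix is stored as compounds by participants, the correlation matrix is normalized by the number of observations minus one (which is what makes its diagonal equal to one when the sample standard deviation is used in Eq. (11)), factors are retained when their eigenvalue is at least 1.0, and no varimax rotation is applied. The function name, profiles and seed are hypothetical.

```python
import numpy as np


def pca_loadings(E, min_eigenvalue=1.0):
    """Return standardized data Z, retained loadings G and their eigenvalues.

    E is an (m compounds) x (n participants) array of exposures.
    """
    m, n = E.shape
    # Eq. (11): center and scale each compound (row) to unit variance.
    mean = E.mean(axis=1, keepdims=True)
    std = E.std(axis=1, ddof=1, keepdims=True)
    Z = (E - mean) / std
    # Eq. (13): m x m correlation matrix; n - 1 matches ddof=1 above.
    R = Z @ Z.T / (n - 1)
    # Eigen-decomposition of the symmetric matrix R, largest eigenvalue first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eq. (14): eigenvectors scaled by the square root of their eigenvalues.
    G = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    keep = eigvals >= min_eigenvalue
    return Z, G[:, keep], eigvals[keep]


# Synthetic example (hypothetical numbers): two sources, five compounds.
rng = np.random.default_rng(0)
profiles = np.array([[0.6, 0.0], [0.3, 0.1], [0.1, 0.0], [0.0, 0.5], [0.0, 0.4]])
contributions = rng.lognormal(mean=1.0, sigma=1.0, size=(2, 200))
E = profiles @ contributions * rng.lognormal(0.0, 0.05, size=(5, 200))

Z, G, eigvals = pca_loadings(E)
print(G.shape, eigvals)
```

Because each eigenvector has unit length, the squared loadings of a retained factor sum to its eigenvalue, which is a convenient check on the scaling in Eq. (14).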
With $G$ determined, $H$ is calculated by inverting Eq. (12). PCA is most useful if a relatively small number of factors account for most of the variability in the data, thus reducing the dimensionality of the problem. Consequently, the eigenvectors obtained by decomposing $R$ are sorted by descending order of the corresponding eigenvalues and low-ranking eigenvectors are discarded. In this study, the factors retained had corresponding eigenvalues close to or >1.0. Varimax rotation (Henry, 1987) was used to redistribute the variance to give a more interpretable structure to the factors. Because PCA factor scores are derived from standardized data, they are correlated with, but not proportional to, the factor contributions to the exposure concentrations. A factor score of $H_{jk} = 0$ means the factor has an average influence on observation $k$, not that it has no influence.

APCS (Thurston and Spengler, 1985) are used to derive quantitative estimates of factor contributions and profiles associated with each factor. This method calculates factor scores for absolute zero exposure concentrations, and uses them to re-scale the factor scores from PCA. The rescaled scores are known as APCS. Subsequently, exposure concentrations are regressed on the APCS to derive values of $S_{jk}$ (Eq. (10)). Once the source contributions are known, the factor profiles can be estimated from nonstandardized exposure concentration data using Eq. (10). In PCA/APCS, factor profiles and source contribution estimates are not constrained to be nonnegative. For this study, the number of factors was chosen such that negative values in APCS source profiles, if any, were relatively small.

PMF incorporates error estimates of the data to solve matrix factorization of the linear model as a constrained, weighted least-squares problem (Paatero and Tapper, 1994). PMF directly factors the data, as compared to PCA, which factors the standardized data. PMF seeks to solve the matrix equation:

$$E_{ik} = \sum_{j=1}^{p} A_{ij} B_{jk} + e_{ik}, \quad i = 1, 2, \ldots, m; \; k = 1, 2, \ldots, n, \tag{15}$$

where $A_{ij}$ represents the loading of compound $i$ on factor $j$ and $B_{jk}$ the $j$th factor score for the total VOC exposure of the $k$th participant. Because the factorization is done directly on the exposure concentration matrix $E$, the factor loading or profile matrix $A$ and factor score matrix $B$ are directly interpretable as source profiles and normalized source contribution estimates, without the need for rotations or re-centering about zero, as required with PCA. Subsequently, total exposure concentrations are regressed on the factor scores to un-normalize the source contribution estimates. In PMF, the user chooses the number of factors, $p$. Given $p$, PMF solves Eq. (15) by minimizing the weighted sum of squares

$$Q = \sum_{i=1}^{m} \sum_{k=1}^{n} \left( \frac{e_{ik}}{\sigma_{ik}} \right)^2, \tag{16}$$

where $\sigma_{ik}$ is the standard deviation representing the uncertainty in the measurement $E_{ik}$. PMF uses error estimates of the data to downweight those observations that are compromised by sampling errors, detection limits, missing data or outliers. Factors are constrained to be nonnegative. Further details on the least-squares minimization algorithm can be found in Paatero and Tapper (1994). For this study, error estimates ($T_{ik}$) used to calculate $\sigma_{ik}$ values were determined for each observation using the following formula (Anderson et al., 2001):

$$T_{ik} = \sqrt{(0.5\, \mathrm{DL}_i)^2 + (0.1\, E_{ik})^2}, \tag{17}$$

where $\mathrm{DL}_i$ is the detection limit for the compound of interest across all observations. $\mathrm{DL}_i$ values for all compounds except toluene are from Clayton and Perritt (1993). The toluene DL was set to 0.1 µg m⁻³, which is similar to the DL for the other compounds. The recommendations of Paatero and Tapper (1994) were followed for the rotational freedom (FPEAK = 0), outlier threshold distance ($\alpha = 4.0$) and choice of the robust solution mode. PMF solutions for the simulated data were attempted with 3 to 7 factors. Results are reported for 3 factors, based on their interpretability along with the $Q$ value, the distribution of scaled residuals for individual compounds, and a linear regression analysis of the factor scores versus the sum of the measured VOC concentrations.
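The error model of Eq. (17) and the objective of Eq. (16) are simple to compute, and a small stand-in factorization shows how a nonnegative solution drives $Q$ down. The sketch below is illustrative only and does not reproduce the PMF algorithm of Paatero and Tapper (1994): it swaps in weighted multiplicative updates (a standard weighted nonnegative matrix factorization scheme), omits the robust mode and FPEAK rotation, takes $\sigma_{ik} = T_{ik}$ directly, and assumes a detection limit of 0.1 µg m⁻³ for every compound. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def error_estimates(E, detection_limits):
    """Eq. (17): T_ik = sqrt[(0.5 * DL_i)^2 + (0.1 * E_ik)^2]."""
    dl = np.asarray(detection_limits, dtype=float)[:, None]
    return np.sqrt((0.5 * dl) ** 2 + (0.1 * E) ** 2)


def q_value(E, A, B, sigma):
    """Eq. (16): sum of squared residuals scaled by their uncertainties."""
    return float((((E - A @ B) / sigma) ** 2).sum())


def weighted_nmf(E, sigma, p, n_iter=500, seed=0):
    """Stand-in for PMF: nonnegative A (m x p) and B (p x n) that reduce Q.

    Uses weighted multiplicative updates, NOT the PMF algorithm itself.
    """
    rng = np.random.default_rng(seed)
    m, n = E.shape
    W = 1.0 / sigma**2                      # weights 1 / sigma_ik^2
    A = rng.uniform(0.1, 1.0, size=(m, p))
    B = rng.uniform(0.1, 1.0, size=(p, n))
    eps = 1e-12                             # guards against division by zero
    for _ in range(n_iter):
        A *= ((W * E) @ B.T) / ((W * (A @ B)) @ B.T + eps)
        B *= (A.T @ (W * E)) / (A.T @ (W * (A @ B)) + eps)
    return A, B


# Synthetic check (hypothetical profiles, 5 compounds x 100 observations).
rng = np.random.default_rng(1)
true_A = np.array([[0.6, 0.0], [0.3, 0.1], [0.1, 0.0], [0.0, 0.5], [0.0, 0.4]])
true_B = rng.lognormal(mean=1.0, sigma=1.0, size=(2, 100))
E = true_A @ true_B * rng.lognormal(0.0, 0.1, size=(5, 100))
sigma = error_estimates(E, detection_limits=[0.1] * 5)

A0, B0 = weighted_nmf(E, sigma, p=2, n_iter=0)   # starting point only
A, B = weighted_nmf(E, sigma, p=2)
q_start, q_end = q_value(E, A0, B0, sigma), q_value(E, A, B, sigma)
print(f"Q before: {q_start:.1f}, Q after: {q_end:.1f}")
```

The multiplicative form keeps every entry of the two factor matrices nonnegative as long as the starting values and the data are nonnegative, which mirrors the nonnegativity constraint described above without claiming to match PMF's own minimization.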
In the linear regression, an attempt was made to avoid negative regression coefficients and to optimize $R^2$ and % mass values. More details on the error model and model performance parameters can be found in Anderson et al. (2001). The GRACE/SAFER method estimates factor profiles by a combination of graphical analysis and multivariate receptor modeling. This method is incorporated in the UNMIX model, which runs under Matlab® (Mathworks, 2001). We ran UNMIX version 2.2 under Matlab 5.0. GRACE is a mathematical procedure used to generate pairwise scatter plots of measured receptor concentrations, which provide physical constraints for input to SAFER (Henry, 2000). SAFER, a multivariate receptor model based on PCA, differs from PCA in that it uses a new transformation method based on the
self-modeling curve resolution (SMCR) technique to derive meaningful factors (Kim and Henry, 2000). The SMCR technique imposes physical constraints, such as nonnegativity of factor profiles and contributions, as an integral part of the profile derivation procedure (Lawton and Sylvestre, 1971). UNMIX assumes that factor compositions are approximately constant and that not all observations are affected by all factors. The user specifies a tracer, which is a compound contributed predominantly by only one source, and the number of factors. Further details on the model can be found in Henry and Kim (1990) and Henry (2000). UNMIX solutions for the simulated data were attempted for a variety of compounds, different tracers and from 3 to 6 factors until a feasible solution was found that optimized model performance. Results are reported for 3 factors with carbon tetrachloride as the tracer, based on the minimum $R^2$ value (Min. $R^2$), minimum signal-to-noise ratio (Min. S/N), and optimized % mass (Ramadan et al., 2000). The minimum $R^2$ refers to the smallest $R^2$ value for any compound in the model. The Min. S/N is based on the regression sum of squares and the error sum of squares of the eigenvectors (Henry et al., 1999). The % mass value was calculated by comparing the total average modeled concentration with the total average measured concentration, across all observations. Two positivity constraints were implemented in the final three-factor UNMIX model. The first constraint was to reject a factor composition if the sum of the negative values for any compound in a composition was >15% of its mean. The second constraint was to allow up to 25% of the mass of a factor to be negative. Outliers were also downweighted, with the lower 15% of the data having reduced influence on the calculations. No high values were downweighted. Applying UNMIX to the full simulated data set did not produce meaningful factors due to outliers.
Thus, concentrations of each compound were plotted against concentrations of toluene, which was one of the most dominant and variable compounds in the data, to identify possible outliers. In this fashion, one outlier was removed from the simulated data.
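Returning to the CMB solution of Eqs. (8) and (9), the effective variance iteration for a single observation can be sketched as follows. This is a minimal illustration under stated assumptions and is not the CMB 7.0 program: the starting estimate of zero contributions, the convergence tolerance, the uncertainty values and the function name are all choices made for the example. The two profiles are the automobile exhaust and wastewater mass fractions from Table 2 for six compounds, and the observation is built without noise so that the known contributions should be recovered.

```python
import numpy as np


def cmb_effective_variance(E, F, sigma_E, sigma_F, n_iter=50, tol=1e-10):
    """Estimate source contributions S for one observation.

    E: (m,) measured concentrations; F: (m, p) source profiles;
    sigma_E: (m,) and sigma_F: (m, p) one-standard-deviation uncertainties.
    Returns S and the reduced chi-square of Eq. (9).
    """
    S = np.zeros(F.shape[1])
    for _ in range(n_iter):
        # Effective variance V_i = sigma_E_i^2 + sum_j S_j^2 * sigma_F_ij^2.
        V = sigma_E**2 + (sigma_F**2) @ (S**2)
        Fw = F / V[:, None]                  # rows of F weighted by 1 / V_i
        S_new = np.linalg.solve(F.T @ Fw, Fw.T @ E)
        converged = np.max(np.abs(S_new - S)) < tol
        S = S_new
        if converged:
            break
    V = sigma_E**2 + (sigma_F**2) @ (S**2)
    dof = F.shape[0] - F.shape[1]            # m - p degrees of freedom
    chi2 = float(np.sum((E - F @ S) ** 2 / V) / dof)
    return S, chi2


# Rows: TCA, BNZ, TET, CFM, EBZ, TOL; columns: auto exhaust, wastewater.
F = np.array([
    [0.0,    0.1482],
    [0.0268, 0.0115],
    [0.0,    0.0456],
    [0.0,    0.0973],
    [0.0128, 0.0078],
    [0.0684, 0.0465],
])
S_true = np.array([84.0, 21.0])   # ug m^-3, i.e. 60% and 15% of 140
E = F @ S_true                    # noise-free observation, Eq. (8)
sigma_E = 0.1 * E + 0.01          # assumed measurement uncertainty
sigma_F = 0.1 * F                 # assumed profile uncertainty

S_hat, chi2 = cmb_effective_variance(E, F, sigma_E, sigma_F)
print(S_hat, chi2)
```

Each pass re-weights the regression by the current effective variance, which is the sense in which compounds measured more precisely in both the source and receptor data carry more weight.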
4. Results Selected statistics for the simulated exposures are presented in Table 3. Log-probability plots indicated that exposures for all compounds were approximately log-normally distributed. For most compounds, the geometric means of the simulated exposures analyzed in this study are comparable within a factor of two to actual exposures measured during the TEAM studies conducted in New Jersey and California in the late 1980s (Anderson et al., 2002, Table 1). Five compounds in the simulated data had lower concentrations by almost an
order of magnitude compared to the measured exposures: TET, DEC, UND, PDB, and TCE. This is expected, due to the limited number of sources included in the simulated data set. Toluene was not reported in the TEAM studies. For the simulated exposures, the geometric standard deviations were slightly higher than those from the TEAM studies, 2.6–8.4 compared to 1.7–6.2 (Anderson et al., 2002, Table 1). Several of the sources included in generating the synthetic data set have similar profiles, so collinearity is expected to limit the ability of least-squares estimation to resolve them. To examine this issue, the singular value decomposition method described in Hopke (1985) was applied to the matrix of source compositions. Variance decomposition proportions greater than 0.5 corresponding to singular values with condition indexes above 10 were found for ETS, pumping gasoline, and auto exhaust. As discussed below, the PMF and CMB models were unable to resolve these sources. Source apportionment was performed on the simulated data using CMB, PCA/APCS, PMF, and UNMIX. The actual source contributions for the simulated personal exposures are known, allowing the receptor model results to be compared with the true values. Table 4 compares source contribution estimates for the simulated data, for the profiles estimated or used by the four models. In the table as well as in the corresponding figure, factors are identified by numbers and letters ("C" for CMB; "PA" for PCA/APCS; "P" for PMF; "U" for UNMIX). The average and standard deviation of source contributions across participants are shown. In the table, the major compounds present in the profiles are indicated in the first column and results for similar profiles are lined up across from each other. Fig. 1 shows the source profiles used to create the data set and also used as input to the CMB model, and the factor profiles extracted by the other three models. The sum of chemical contributions in each factor is normalized to one.
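The collinearity check described earlier in this section can be illustrated with a short NumPy sketch. It is an assumption that the diagnostics below (condition indexes and variance-decomposition proportions computed from the singular value decomposition of the unit-length-scaled profile matrix) match the procedure of Hopke (1985) in detail; they follow the commonly used form of such diagnostics. Only four profiles and nine compounds are entered, using the Table 1 concentrations for ETS, pumping gasoline and cleaning/pesticides and the Table 2 mass fractions for automobile exhaust, so the numbers will not reproduce the full analysis. The function name is hypothetical.

```python
import numpy as np


def collinearity_diagnostics(F):
    """Condition indexes and variance-decomposition proportions.

    F is an (m compounds) x (p sources) matrix of source compositions.
    Returns (cond_index, vdp), where vdp[k, j] is the share of the variance
    of source k's coefficient associated with singular value j.
    """
    X = F / np.linalg.norm(F, axis=0)        # scale each profile to unit length
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    cond_index = d.max() / d                 # one index per singular value
    phi = (Vt.T ** 2) / d**2                 # phi[k, j] = v_kj^2 / d_j^2
    vdp = phi / phi.sum(axis=1, keepdims=True)
    return cond_index, vdp


# Rows: TCA, BNZ, TET, CFM, EBZ, MPX, OXY, STR, TOL.
# Columns: ETS, pumping gas, auto exhaust, cleaning/pesticides.
F = np.array([
    [0.0,   0.0,   0.0,    641.0],
    [16.0,  475.0, 0.0268, 0.0],
    [0.0,   0.0,   0.0,    1331.0],
    [0.0,   0.0,   0.0,    283.0],
    [12.0,  44.0,  0.0128, 0.0],
    [44.0,  544.0, 0.0434, 0.0],
    [13.0,  65.0,  0.0187, 0.0],
    [2.5,   0.0,   0.0042, 0.0],
    [94.0,  604.0, 0.0684, 0.0],
])

cond_index, vdp = collinearity_diagnostics(F)
print(np.round(cond_index, 1))
print(np.round(vdp, 2))
```

Because every column is rescaled to unit length, the mixture of concentration and mass-fraction units in the input does not affect the result. Sources whose rows in `vdp` load heavily on the same high-condition-index column are the ones a least-squares fit will struggle to separate.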
The error bars shown for the PMF and UNMIX factors represent ±1σ (standard deviation) error in the factor compositions, as output by each model. Error bars for the CMB profiles represent ±1σ of variability in the measured input profiles, as reported in the references. Error estimates for the APCS factor profiles were not calculated. Figs. 1a–f show the source profiles that were used to create the simulated data and included in CMB. CMB was applied first using all of the source profiles (Solution 1) and second attempting to find a subset of the profiles that would result in the best fit of the data (Solution 2). A limitation of the first solution was that with all six source profiles included, the model only converged for 75 of the 100 randomly chosen observations to which it was applied. For Solution 1, the average and standard deviation of the CMB model performance statistics for those 75 observations are: $R^2 = 0.97 \pm 0.06$;
Table 3
Integrated 24-h exposure concentrations for individuals exposed to each source in the simulated data. Entries are GM (µg m⁻³) with GSD in parentheses; "-" indicates no contribution.

| Compound | ETS (n = 580) | Pumping gas (n = 69) | Paint (n = 12) | Cleaning and/or pesticides (n = 627) | Auto exhaust (n = 829) | Gas vapors (n = 829) | Wastewater (n = 829) | Total (n = 829) |
|---|---|---|---|---|---|---|---|---|
| TCA | - | - | 0.28 (2.6) | 28 (2.8) | - | - | 4.6 (3.3) | 9.7 (4.5) |
| BNZ | 1.4 (5.7) | 2.9 (2.5) | 1.2 (2.5) | - | 3.4 (3.4) | 0.50 (3.3) | 0.37 (3.3) | 8.0 (3.0) |
| TET | - | - | - | 58 (2.7) | - | - | 1.5 (3.3) | 5.3 (8.4) |
| CFM | - | - | - | 12 (2.6) | - | - | 3.1 (3.3) | 5.7 (4.1) |
| EBZ | 1.1 (6.1) | 0.28 (2.8) | 3.7 (2.6) | - | 1.6 (3.4) | 0.048 (3.3) | 0.25 (3.3) | 4.3 (3.2) |
| MPX | 4.0 (5.8) | 2.0 (2.4) | 2.9 (3.0) | - | 5.5 (3.4) | 0.26 (3.3) | - | 13 (3.2) |
| DEC | - | - | 0.59 (6.6) | 0.14 (2.7) | - | 0.060 (3.3) | - | 0.13 (4.7) |
| UND | - | - | 0.57 (5.8) | 0.91 (2.6) | - | - | - | 0.90 (3.4) |
| OXY | 1.2 (5.7) | 0.36 (2.4) | - | - | 2.4 (3.4) | 0.070 (3.3) | - | 4.5 (3.2) |
| PDB | - | - | - | 0.36 (2.6) | - | - | - | 0.36 (2.6) |
| STR | 0.21 (5.3) | - | - | - | 0.54 (3.4) | - | - | 0.87 (3.2) |
| TCE | - | - | - | 0.29 (2.9) | - | - | 0.36 (3.3) | 0.48 (3.3) |
| TOL | 9.1 (5.3) | 3.9 (2.4) | 2.9 (4.9) | - | 8.7 (3.4) | 0.56 (3.3) | 1.5 (3.3) | 25 (3.2) |
Table 4
Actual and estimated average source contributions (SCE, %) for the simulated personal exposure data. Standard deviations representing variability across participants are shown. Each cell gives the profile followed by its SCE (mean ± standard deviation).

| Major compounds | Actual | CMB Solution 1 | CMB Solution 2 | PMF | UNMIX | PCA/APCS |
|---|---|---|---|---|---|---|
| TET, TCA, CFM | Cleaning/pesticides: 21 ± 33 | 1C (Cleaning/pesticides): 23 ± 34 | 1C: 24 ± 33 | 1P: 23 ± 33 | 1U: 21 ± 35 | 1PA: 29 ± 42 |
| BNZ, EBZ, MPX, OXY, STR, TOL | Auto exhaust: 27 ± 8; ETS: 29 ± 31 | 2C (Auto exhaust): 30 ± 26; 5C (ETS): 26 ± 31 | 2C: 56 ± 30 | 2P: 59 ± 30 | | |
| TCA, BNZ, CFM, EBZ, MPX, OXY, TOL | | | | | 2U: 54 ± 30 | 2PA: 71 ± 42 |
| TCA, BNZ, TET, CFM, EBZ, TCE | Wastewater: 16 ± 7 | 3C (Wastewater): 12 ± 16 | 3C: 20 ± 16 | 3P: 18 ± 13 | | |
| TOL | | | | | 3U: 25 ± 20 | |
| BNZ, EBZ, MPX, DEC, OXY, TOL | Pumping gas: 2 ± 8; Gas vapor: 2 ± 1 | 4C (Gas vapor): 6 ± 11 | | | | |
| EBZ, MPX, DEC, UND, OXY, TOL | Paint: 3 ± 10 | 6C (Paint): 3 ± 8 | | | | |
Fig. 1. Source profiles and extracted factor profiles associated with sources of personal exposure to VOCs from simulated data: (a) cleaning/pesticides, (b) outdoor sources, (c) aromatics-dominated sources, (d) wastewater, (e) toluene, and (f) painting.
χ² = 0.20 ± 0.27; % mass = 105% ± 71%. Table 4 indicates that for this subset of observations, the average source contribution estimates agree well with the actual values for the simulated data set. However, for more than 90% of the participants a large negative SCE was obtained for at least one of the three aromatics-dominated sources (auto exhaust, gasoline vapor or ETS).

With Solution 2, the subset of profiles judged to best fit the data were those for cleaning/pesticides, auto exhaust and wastewater. With these three profiles, the CMB model converged for all but 2 of 100 randomly chosen observations and only small negative values were obtained for any source contribution estimates. The average and standard deviation of performance statistics across participants for Solution 2 are: R² = 0.87 ± 0.16; χ² = 6.9 ± 9.18; % mass = 82% ± 29%.

Three factors were chosen as the optimal number for the PMF model (n = 829; Fig. 1a, c and d). The R² value for the PMF model fit to total simulated exposure concentrations was 0.90 and % mass equaled 85% ± 25% for total VOC concentrations; Q was 50079 for individual compound concentrations. Profiles 1P, 2P, and 3P closely match those for cleaning/pesticides, auto exhaust, and wastewater, respectively. The average source contribution estimates given by PMF for the cleaning/pesticides and wastewater profiles are within 15% of the actual contributions for these sources. As discussed in the next section, the source contribution estimate for profile 2P appears to combine the actual contributions from all the aromatics-dominated sources, as is also the case with CMB Solution 2.

Three factors were also chosen as the optimal number for the UNMIX model (n = 828; Fig. 1a, b and e). For UNMIX, the minimum R² of 0.73 is less than the target of at least 0.8 that is recommended for an adequate model (Henry, 2000). The highest value of S/N for any compound was 4.02 and the % mass was 98% ± 10% for total VOC concentrations.
The first profile extracted by UNMIX (1U, Fig. 1a) matches the cleaning/pesticides profile. The second UNMIX profile (2U, Fig. 1b) includes TCA and CFM in addition to the aromatic compounds. As discussed below, this profile resembles a composite outdoor source comprised of the automobile exhaust, gasoline vapor and wastewater profiles used to simulate the outdoor sources of personal exposure to VOCs. The average source contribution for profile 2U of 54% reasonably matches the sum of the average contributions for auto exhaust, (outdoor) gasoline vapor, and wastewater (45%) that were used to generate simulated data. The third UNMIX profile is dominated by toluene (3U, Fig. 1e), and contributes 25% of the total exposure concentrations, on average.

Finally, two factors were carried through the PCA/APCS analysis of the simulated data (n = 829; Fig. 1a and b), accounting for 66% of the variance in the data for exposure concentrations of individual compounds.
Inclusion of additional factors resulted in unacceptably large negative values in the factor profiles determined from APCS. The R² value for the regression of total VOC concentrations to absolute principal component scores was 0.93. The first factor profile extracted by PCA/APCS is similar to the cleaning/pesticides profile used to generate the data, but includes MPX and TOL in addition to TET, TCA, and CFM (Fig. 1a). The average source contribution estimate for profile 1PA is somewhat high compared to the true cleaning/pesticides contribution. Profile 2PA is similar to profile 2U from UNMIX (Fig. 1b), and again appears to be a composite of the outdoor sources. Compared to the actual outdoor source contribution, the SCE for profile 2PA is overestimated by about 60%.
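The PCA/APCS steps referred to in this section (principal components of the standardized concentrations, conversion to absolute scores, and regression of total VOC concentration on those scores) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `apcs_apportion`, the two hypothetical source profiles, and the synthetic lognormal source strengths are all invented for the example, and the factor rotation that is commonly applied before computing scores is omitted for brevity.

```python
import numpy as np

def apcs_apportion(X, n_factors):
    """Absolute principal component scores (APCS) sketch.

    X is an (n_samples, n_species) concentration matrix. Returns the
    per-sample factor contributions to total concentration, the regression
    intercept, and the fraction of variance explained by the kept factors.
    """
    mean, std = X.mean(axis=0), X.std(axis=0)
    Z = (X - mean) / std                          # standardized concentrations
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    keep = np.argsort(eigval)[::-1][:n_factors]   # largest eigenvalues first
    B = eigvec[:, keep] / np.sqrt(eigval[keep])   # score coefficients
    scores = Z @ B
    zero_score = ((0.0 - mean) / std) @ B         # score of an all-zero sample
    apcs = scores - zero_score                    # absolute scores
    total = X.sum(axis=1)
    design = np.column_stack([np.ones(len(total)), apcs])
    coef, *_ = np.linalg.lstsq(design, total, rcond=None)
    contributions = apcs * coef[1:]               # mass attributed to each factor
    explained = eigval[keep].sum() / eigval.sum()
    return contributions, coef[0], explained

# Hypothetical two-source data set (profiles and strengths are assumptions).
rng = np.random.default_rng(0)
profiles = np.array([[0.6, 0.3, 0.1, 0.0],
                     [0.0, 0.2, 0.3, 0.5]])
strengths = rng.lognormal(mean=1.0, sigma=0.8, size=(200, 2))
X = strengths @ profiles
X = X * (1 + 0.01 * rng.standard_normal(X.shape))   # 1% multiplicative noise
contributions, intercept, explained = apcs_apportion(X, n_factors=2)
fitted_total = contributions.sum(axis=1) + intercept
```

Averaging each column of `contributions` and dividing by the mean total concentration would give a percentage SCE per factor, comparable in form to the PCA/APCS column of Table 4.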
5. Discussion

Of the source profiles used to generate the simulated data, only the cleaning/pesticides profile (1C) was identified by all three multivariate models (factors 1PA, 1P, and 1U). At the other extreme, painting, a relatively minor contributor to the simulated exposures, did not show up as a factor in any of the multivariate models. A direct comparison of the cleaning/pesticides profiles is shown in Fig. 1a. Relatively good agreement was obtained between the actual cleaning/pesticide source contribution and the estimates from each of the models, including the two CMB solutions. The percentage error between actual and estimated average contributions ranged from 10% to 38%.

The three outdoor sources were not extracted separately by either PCA/APCS or UNMIX, but rather appear to be extracted as one outdoor source profile. Both PCA and UNMIX extracted TCA and CFM in addition to the aromatics for a profile (2U, 2PA) that resembled the composite of automobile exhaust, gasoline vapor, and wastewater shown in Fig. 1b. In the composite profile, the mass fractions for each compound in the original sources were weighted by each source's contribution to the total VOC concentration. This result suggests that these models may not be able to separate sources that are strongly correlated, as occurs in the simulated data set due to the common influence of meteorological variables, nor sources to which everyone is exposed. The fact that both models extracted a similar outdoor source is not surprising, as the UNMIX algorithm is based on PCA and both models operate on correlations in the data. The average source contribution for the composite profile is estimated to be 45%, which is overpredicted by UNMIX by 20% and by PCA/APCS by about 60%.

PMF was the only model to extract a source profile resembling that of wastewater emissions. Fig. 1d
compares PMF profile 3P with the wastewater source profile used to generate the simulated data. Reasonable agreement was obtained between the actual wastewater contribution and the estimates from PMF and from the two CMB solutions. The percentage errors between actual and estimated average contributions for wastewater ranged from 13% to 25%. Fig. 1c compares the aromatics-dominated profile extracted by PMF (2P) with the auto exhaust, gasoline vapor, and ETS source profiles. Factor 2P matches the auto exhaust profile more closely than it matches the gasoline vapor or ETS profiles, which have relatively high mass fractions of benzene and toluene, respectively. The SCE for 2P is higher than the actual average source contributions of either auto exhaust or ETS, 59% compared to 27% (auto exhaust) or 29% (ETS), but closely matches the sum of the contributions from auto exhaust, ETS and gasoline vapor (58%). Similarly, CMB Solution 2 performed best with the auto exhaust profile as input. The SCE from CMB Solution 2 was 56%, again similar to the sum of the actual contributions from the aromatics sources. The auto exhaust, gasoline vapor and ETS profiles thus appear to be too similar for the PMF and CMB models to be able to distinguish. Though not shown here, the diagnostic parameters indicated that for UNMIX and PMF, respectively, four and six factors were the optimal solutions to the problem. Including one or two additional factors served only to extract individual compounds from the data set. In each case, the highly variable and ubiquitous compounds (toluene, benzene, xylenes) were extracted individually as the number of factors was increased. This result highlights the critical role of user judgment in determining the best number of factors to fit a data set, in addition to using diagnostics.
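The similarity between the aromatics-dominated profiles discussed above can be examined with the kind of singular value decomposition diagnostic mentioned in the results: condition indexes and variance-decomposition proportions of the source composition matrix. The sketch below is illustrative only. The function names, the hypothetical three-source profile matrix `F`, and the choice to scale columns to unit length are assumptions made for the example; the thresholds of 10 and 0.5 are the ones quoted in the text.

```python
import numpy as np

def collinearity_diagnostics(F):
    """Condition indexes and variance-decomposition proportions for the
    columns (sources) of a profile matrix F with shape (n_species, n_sources)."""
    F_scaled = F / np.linalg.norm(F, axis=0)       # unit-length columns
    _, d, Vt = np.linalg.svd(F_scaled, full_matrices=False)
    cond_index = d.max() / d                       # one index per singular value
    phi = (Vt.T ** 2) / d ** 2                     # phi[j, k] = v_jk^2 / d_k^2
    proportions = phi / phi.sum(axis=1, keepdims=True)
    return cond_index, proportions

def flag_collinear(F, index_cut=10.0, prop_cut=0.5):
    """Indices of sources whose estimate variance is dominated by a weak
    singular value (condition index above index_cut)."""
    cond_index, proportions = collinearity_diagnostics(F)
    weak = cond_index > index_cut
    return [j for j in range(F.shape[1])
            if np.any(proportions[j, weak] > prop_cut)]

# Hypothetical profiles: sources 0 and 1 are nearly identical, source 2 is distinct.
F = np.array([[0.50, 0.48, 0.0],
              [0.30, 0.31, 0.0],
              [0.20, 0.21, 0.1],
              [0.00, 0.00, 0.9]])
cond_index, proportions = collinearity_diagnostics(F)
flagged = flag_collinear(F)   # the two near-identical sources are flagged
```

Sources flagged in this way are the ones for which a least-squares fit cannot assign mass reliably, which is consistent with the large negative estimates reported for CMB Solution 1 and the single combined aromatics factor from PMF.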
6. Conclusions

In this study, the PMF model appeared best able to extract the major source profiles used to create the simulated data set. It was difficult, however, for PMF to extract sources that are similar; in this case, the aromatic sources auto exhaust, ETS, and gasoline vapors. It appears that PMF extracted one factor to represent the sum of these three sources. The CMB model similarly had problems with collinearity, tending to produce large, negative source contribution estimates if profiles for more than one of these three sources were included. Applied to the simulated data, PCA/APCS and UNMIX identified similar factors to each other, with the addition of a separate toluene factor from UNMIX. UNMIX was apparently responding to the high variability of toluene by extracting it as a single source from the data. Both of these models are sensitive to correlation in the simulated data introduced by common meteorological effects on outdoor sources. None of the models identified the sources that contributed <5%, on average, to the total exposure concentrations.
Acknowledgements

This research was supported by US EPA grant number R82-6788-010. The PMF model was used under a licensing agreement with Pentti Paatero of the University of Helsinki. We are grateful to Dr. Ronald Henry for his assistance with UNMIX and to Dr. Pentti Paatero for his assistance with PMF. Although the US EPA funded the research described in this article, it has not been subjected to any EPA review and therefore does not necessarily reflect the views of the Agency and no official endorsement should be inferred.
References

Anderson, M.J., 2001. Source apportionment of toxic volatile organic compounds. M.S. Thesis, Department of Civil, Environmental, and Architectural Engineering, University of Colorado, Boulder, CO.
Anderson, M.J., Miller, S.L., Milford, J.B., 2001. Source apportionment of exposure to toxic volatile organic compounds using positive matrix factorization. Journal of Exposure Analysis and Environmental Epidemiology 11, 295–307.
Anderson, M.J., Daly, E.P., Miller, S.L., Milford, J.B., 2002. Source apportionment of exposure to volatile organic compounds. II. Evaluation of receptor models using TEAM study data. Atmospheric Environment 36, 3629–3641.
Anttila, P., Paatero, P., Tapper, U., Järvinen, O., 1995. Source identification of bulk wet deposition in Finland by positive matrix factorization. Atmospheric Environment 29, 1705–1718.
Britt, H.I., Luecke, R.H., 1973. The estimation of parameters in nonlinear, implicit models. Technometrics 15, 233–247.
Clayton, C.A., Perritt, R.L., 1993. Data base development and data analysis for California Indoor Exposure Studies. California Air Resources Board, Report No. A133-187.
Coultas, D.B., Samet, J.M., McCarthy, J.F., Spengler, J.D., 1990. Variability of measures of exposure to environmental tobacco smoke in the home. American Review of Respiratory Disease 142, 602–606.
Currie, L.A., Gerlach, R.W., Lewis, C.W., Balfour, W.D., Cooper, J.A., Dattner, S.L., DeCesar, R.T., Gordon, G.E., Heisler, S.L., Hopke, P.K., Shah, J.J., Thurston, G.D., Williamson, H.J., 1984. Interlaboratory comparison of source apportionment procedures: results for simulated data sets. Atmospheric Environment 18, 1517–1537.
Daisey, J.M., Mahanama, K.R.R., Hodgson, A.T., 1994. Toxic volatile organic compounds in environmental tobacco smoke: emissions factors for modeling exposures of California populations. California Air Resources Board, Report No. A133-186.
Harkov, R., Jenks, J., Ruggeri, C., 1987. Volatile organic compounds in the ambient air near a large, regional sewage plant in New Jersey, Paper 87-95.1. 80th Annual Meeting of the Air Pollution Control Association, New York.
Henry, R.C., 1987. Current factor analysis receptor models are ill-posed. Atmospheric Environment 21, 1815–1820.
Henry, R.C., 2000. UNMIX Version 2 Manual. US Environmental Protection Agency, March.
Henry, R.C., Kim, B.M., 1990. Extensions of self-modeling curve resolution to mixtures of more than three components - finding the basic feasible region. Chemometrics and Intelligent Laboratory Systems 8, 205–216.
Henry, R.C., Lewis, C.W., Collins, J.F., 1994. Vehicle-related hydrocarbon source compositions from ambient data - the GRACE-SAFER method. Environmental Science and Technology 28, 823–832.
Henry, R.C., Park, E.S., Spiegelman, C.H., 1999. Comparing a new algorithm with the classic methods for estimating the number of factors. Chemometrics and Intelligent Laboratory Systems 48, 91–97.
Hopke, P.K., 1985. Receptor Modeling in Environmental Chemistry. Wiley, New York.
Jenkins, P.L., Phillips, T.J., Mulberg, E.J., Hui, S.P., 1992. Activity patterns of Californians - use of and proximity to indoor pollutant sources. Atmospheric Environment 26A, 2141–2148.
Jeon, S.J., Meuzelaar, H.L.C., Sheya, S.A.N., Lighty, J.S., Jarman, W.M., Kasteler, C., Sarofim, A.F., Simoneit, B.R.T., 2001. Exploratory studies of PM10 receptor and source profiling by GC/MS and principal component analysis of temporally and spatially resolved ambient samples. Journal of the Air and Waste Management Association 51, 766–784.
Kim, B.M., Henry, R.C., 2000. Application of SAFER model to the Los Angeles PM10 data. Atmospheric Environment 34, 1747–1759.
Lawton, C.L., Sylvestre, E.A., 1971. Elimination of linear parameters in nonlinear regression. Technometrics 13, 461–467.
Lin, C., Milford, J.B., 1994. Decay-adjusted chemical mass balance receptor modeling for volatile organic compounds. Atmospheric Environment 28, 3261–3276.
Mathworks, 2001. http://www.mathworks.com/ accessed September 2001.
Miller, S.L., Branoff, S., Lim, Y., Liu, D., Van Loy, M.D., Nazaroff, W.W., 1998. Assessing exposures to air toxicants from environmental tobacco smoke. California Air Resources Board, Report No. 94-344.
National Climatic Data Center, 1981. National Weather Service data for Denver, CO. http://www.epa.gov/scram001/ accessed August 2001, from the Support Center for Regulatory Air Models, US Environmental Protection Agency.
Norbäck, D., Wieslander, G., Edling, C., 1995. Occupational exposure to volatile organic compounds (VOCs), and other air pollutants from the indoor application of water-based paints. Annals of Occupational Hygiene 39, 783–794.
Paatero, P., Tapper, U., 1994. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126.
Proctor, C.J., Warren, N.D., Bevan, M.A.J., Baker-Rogers, J., 1991. Comparison of methods of assessing exposure to environmental tobacco smoke in non-smoking British women. Environment International 17, 287–297.
Ramadan, Z., Song, X.H., Hopke, P.K., 2000. Identification of sources of Phoenix aerosol by positive matrix factorization. Journal of the Air and Waste Management Association 50, 1308–1320.
Rappaport, S.M., Selvin, S., Waters, M.A., 1987. Exposures to hydrocarbon components of gasoline in the petroleum industry. Applied Industrial Hygiene 2, 148–154.
Sweet, C.W., Vermette, S.J., 1992. Toxic volatile organic compounds in urban air in Illinois. Environmental Science and Technology 26, 165–173.
Thurston, G.D., Spengler, J.D., 1985. A quantitative assessment of source contributions to inhalable particulate matter pollution in metropolitan Boston. Atmospheric Environment 19, 9–25.
Vega, E., Mugica, V., Carmona, R., Valencia, E., 2000. Hydrocarbon source apportionment in Mexico City using the chemical mass balance receptor model. Atmospheric Environment 34, 4121–4129.
Wallace, L.A., Pellizari, E., Leaderer, B., Zelon, H., Sheldon, L., 1987. Emissions of volatile organic compounds from building materials and consumer products. Atmospheric Environment 21, 385–393.
Watson, J.G., Cooper, J.A., Huntzicker, J.J., 1984. The effective variance weighting for least squares calculations applied to the mass balance receptor model. Atmospheric Environment 18, 1347–1355.
Watson, J.G., Robinson, N.F., Chow, J.C., Henry, R.C., Kim, B.M., Nguyen, Q., Meyer, E.L., Pace, T.G., 1990a. Receptor model technical series, Vol. III, CMB7 user's manual. US Environmental Protection Agency, Report No. EPA-450/4-90-004.
Watson, J.G., Robinson, N.F., Chow, J.C., Henry, R.C., Kim, B.M., Pace, T.G., Meyer, E.L., Nguyen, Q., 1990b. The USEPA/DRI chemical mass balance receptor model. Environmental Software 5, 38–49.
Watson, J.G., Chow, J.C., Fujita, E.M., 2001. Review of volatile organic compound source apportionment by chemical mass balance. Atmospheric Environment 35, 1567–1584.
Wiley, J.A., Robinson, J.P., Piazza, T., Garrett, K., Cirksena, K., Cheng, Y.T., Martin, G., 1991. Activity patterns of California residents. California Air Resources Board, Report No. A733–149.
Yakovleva, E., Hopke, P.K., Wallace, L.A., 1999. Receptor modeling assessment of Particle Total Exposure Assessment Methodology data. Environmental Science and Technology 33, 3645–3652.
Yu, T.Y., Chang, L.F.W., 2000. Selection of the scenarios of ozone pollution at southern Taiwan area utilizing principal component analysis. Atmospheric Environment 34, 4499–4509.
Zappoli, S., Andracchio, A., Fuzzi, S., Facchini, M.C., Gelencsér, A., Kiss, G., Krivácsy, Z., Molnár, Á., Barcza, T., Mészáros, E., Hansson, H.C., Rosman, K., Zebuhr, Y., 1998. Organic components and chemical mass balance of fine aerosol in different areas of Europe. Joint International Symposium on Global Atmospheric Chemistry, Seattle, WA.