Chemometrics Applied to Environmental Systems
Philip K. Hopke
PII: S0169-7439(15)00190-2
DOI: 10.1016/j.chemolab.2015.07.015
Reference: CHEMOM 3064
To appear in: Chemometrics and Intelligent Laboratory Systems
Received date: 1 June 2015
Revised date: 30 July 2015
Accepted date: 31 July 2015
Please cite this article as: Philip K. Hopke, Chemometrics Applied to Environmental Systems, Chemometrics and Intelligent Laboratory Systems (2015), doi: 10.1016/j.chemolab.2015.07.015
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT Chemometrics Applied to Environmental Systems
Philip K. Hopke* Institute for a Sustainable Environment and Department of Chemical and Biomolecular Engineering Clarkson University Potsdam, NY 13699 USA
Abstract
To understand the processes that give rise to the observed concentrations of chemical species in environmental samples, it is often useful to apply chemometric methods to help convert chemical data into environmental information. In this review, various chemometric methods, including pattern recognition, mixture resolution, and probability estimation, are described and their applications to a variety of environmental systems are presented. However, there are important differences between applying chemometric methods to laboratory data and to environmental data. The environmental system can be monitored, but in general it is not possible to plan experiments or to rely on known distributional properties of the system under study. Given the complexity of these systems, large data sets are typically needed to provide the basis for a good understanding of their functioning. Expanding computer capabilities coupled with improved analytical methods have allowed the production and analysis of ever larger data sets with increasingly sophisticated and computationally intensive algorithms that offer the promise of even greater information retrieval in the future.
* E-mail: [email protected]
Introduction
Environmental systems function with complex interactions between biotic and abiotic components. To understand the functioning of environmental systems, it is often necessary to chemically characterize them. Samples may be collected and analyzed for their composition, or chemical species can be measured directly in the environment using in situ monitors or sensors. Thus, as for any chemical system, it is necessary to use appropriate methods to relate the analytical signals to the species concentrations (analytical chemistry) and then to examine the relationships among the measured species in various ways to describe the chemical processes that gave rise to the observed values. Chemometric methods can therefore provide valuable tools to extract information from the often large sets of environmental data.
Environmental applications have been published in Chemometrics and Intelligent Laboratory Systems beginning with the first issue [1] and continuing with the papers derived from several of the workshops organized by the U.S. Environmental Protection Agency (Vol. 3, No. 1, 1-158, 1988; Vol. 37, No. 1, 1-214, 1997; Vol. 60, No. 1, 1-281, 2002). There has also been a parallel development of environmetrics, with its own journal starting in 1990, that focuses more on ecological than on chemical data analyses. In this paper, the application of major chemometric methods to environmental chemical data is reviewed. The examples illustrate the methods and the problems to which chemometric methods can be applied.
Pattern Recognition
There are various ways that pattern recognition methods [2] can be applied to explore environmental multivariate data. For these problems, unsupervised methods are needed since there are rarely exemplars of specific properties that can serve as the basis of a supervised analysis. Methods like cluster analysis have been widely applied to a variety of data. Hopke [3] reports the application of cluster analysis to the composition of surficial sediments collected from Chautauqua Lake. Chautauqua Lake is a narrow, 24 kilometer long lake in southwestern New York State. The abundances of fifteen elements were determined by neutron activation analysis for grab samples of the bottom sediments [4] that had been taken for sediment particle size analysis [5]. Ruppert et al. [6] reported the particle size distribution as characterized by percent sand, percent silt, and percent clay, as well as the percent organic matter and the water depth above the sample. In addition, parameters describing the particle size distribution were determined, including measures of the average grain size (mean grain size and median grain size) and parameters describing the shape of the distribution: sorting (standard deviation), skewness, kurtosis, and normalized kurtosis. Figure 1 presents the cluster output using squared Euclidean distance as the dissimilarity measure and the mean within-cluster distance over all pairs of points as the clustering criterion. It can be seen that the sites fell into four main groups.
Figure 1. Dendrogram of the hierarchical cluster analysis of sediment composition data from Chautauqua Lake, NY.
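A hierarchical cluster analysis of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the data are synthetic, the autoscaling step is an assumption that the text does not mention, and SciPy's "average" linkage (mean distance between pairs of points in two clusters) is used as a readily available stand-in that may differ from the mean within-cluster criterion used for Figure 1.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cluster_sites(data, n_clusters=4):
    """Group samples (rows) by agglomerative clustering on autoscaled variables."""
    data = np.asarray(data, dtype=float)
    # Autoscale each variable so that no single variable dominates the distance
    # (an assumption of this sketch, not a step stated in the text).
    scaled = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Squared Euclidean dissimilarity with average linkage (stand-in criterion).
    tree = linkage(scaled, method="average", metric="sqeuclidean")
    # Cut the dendrogram so that at most n_clusters groups remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic "sites": four well-separated groups of five samples, three variables.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
sites = np.vstack([c + rng.normal(scale=0.3, size=(5, 3)) for c in centers])
labels = cluster_sites(sites)
```

The returned labels can then be plotted on a site map, which is the step that Figure 2 performs for the lake data.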
Figure 2. Map of Chautauqua Lake with site locations indicated to show to which cluster they belong. The symbols are defined as to for cluster A, X for cluster B, 0 for cluster C and + for cluster D.
In order to interpret these clusters, it is useful to indicate which sites belong to each cluster. Figure 2 presents this view by giving each of the four major clusters a distinct symbol. It can be seen that the top cluster (A) in Figure 1 contains only sites in the center of the lake. The other three clusters contain all of the near-shore sites, and the three near-shore clusters come together before merging with the basin sites. There are differences among the three near-shore clusters. The bottom cluster (D) is very sandy and moderately well sorted. Most of the sites in this group fall in the northern end of the lake. There seems to be little active sedimentation at these sites and only very limited current or wave action. These sites have an above-average content of coarse-grained material. The second cluster (B) represents areas of active sedimentation and high wave action. All of the active delta areas fall in this cluster, as do the sites through the narrows at the center of the lake. Although the currents have not been measured, this region would be expected to have the strongest flow as the lake reaches its narrowest point. These high-energy environments are where there is active transport of sediments. In the deltas, the silty material is being deposited at a sufficiently high rate that it cannot be sorted as it is deposited. The largest value of organic carbon is found in this cluster. This result agrees with the earlier observation that there is rampant weed growth in these deltaic areas. The third cluster (C) represents an intermediate-energy environment. In these areas there is sufficient energy to sort the sediments and remove the fine-grained material, so the material is moderately well sorted. Thus, the three near-shore clusters do represent reasonably different environments and source material. The cluster pattern of these sites has helped to confirm the interpretation of the causal factors in this study.
Another application of pattern recognition has been to provide a basis for developing quantitative results from qualitative or semi-quantitative data [7]. Several techniques are available that can provide only qualitative or semi-quantitative characterization of samples. Of particular interest is the chemical and/or physical characterization of individual airborne particles. Such data provide information on the nature of the mixtures in the particle compositions. If the aerosol is an external mixture, then individual particles have distinct chemical and physical characteristics determined by their sources and the processes by which they were formed. An internally mixed aerosol is one in which all of the particles have uniform compositions that are the same as the bulk particle composition. In reality, the ambient aerosol is rarely either a true internal or a true external mixture. Typically, particles of distinct composition are modified as they are transported from the source to the sampling location. In particular, cloud or fog processing will result in much more uniform particle compositions. Thus, individual particle characteristics can provide information on both the particle sources and atmospheric transformation processes. Techniques for characterizing individual particles include Computer-Controlled Scanning Electron Microscopy (CCSEM) and Aerosol Time-of-Flight Mass Spectrometry (ATOFMS). CCSEM is described in detail by Casuccio and Hopke [8]. Particles collected on a filter can be characterized for their physical size, shape, and elemental composition. However, the fluoresced x-rays used for the compositional measurements can rarely be used to provide accurate elemental concentrations in individual particles.
The ATOFMS was developed by Prather and coworkers [9,10,11]. In this field instrument, particles are accelerated from the ambient atmosphere into a vacuum such that their velocity is a function of the aerodynamic diameter. The velocity of a particle is determined from the time it takes to traverse the distance between two continuous laser beams. The particle is then bombarded with a high-power laser to ablate and ionize the particle surface. The resulting ions are extracted into positive and negative ion time-of-flight mass spectrometers. Thus, the instrument provides two mass spectra and the particle size for each particle.
The pattern of ion intensities is generally a good qualitative indicator of the particle composition. However, there is composition-dependent variability in the coupling of the photon energy into the particle surface, and there is typically shot-to-shot variation in the laser intensity, so again the mass spectra do not provide quantitative measures of particle composition. In both of these cases, it is possible to use the qualitative or semi-quantitative data to obtain new quantitative variables based on the classification of the particles into homogeneous particle types.
Early work by Kim and Hopke [12,13] showed that individual particles could be classified using various unsupervised pattern recognition methods including hierarchical and nonhierarchical cluster analysis and rule-building expert systems. However, these static classification methods can only be applied to the whole data set and are not necessarily adaptable to other problems. Thus, dynamic classification approaches were needed and have been developed through the use of neural networks. Artificial neural networks are models that simulate the human pattern recognition system and perform pattern recognition for multiple complex signals or data. Adaptive resonance theory (ART) neural networks can be applied to perform particle classification. ART is a type of neural network that can perform fast category learning and recognition. ART2 was first described by Grossberg [14,15]. A series of further developments was achieved by Carpenter, Grossberg, and coworkers [16,17,18,19,20]. The ART2A algorithm, a variation with only one weight matrix, performs as well as the ART2 method but runs two to three times faster. Xie et al. [21] and Wienke et al. [22] applied ART2A for the rapid identification of airborne particles from their chemical compositions and particle shape indices. ART2A performs recognition well and, most importantly, can generate new clusters for particles with unfamiliar patterns and incorporate the rules for each new cluster in its knowledge base for further use.
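The key behavior described above, assigning a particle to an existing class when it resembles a stored pattern and opening a new class otherwise, can be illustrated with a much simplified ART2A-style sketch. This is not the published algorithm: the noise-suppression (contrast enhancement) step is omitted, and the function name, the vigilance value, and the learning rate are assumptions chosen only for illustration.

```python
import numpy as np


def art2a_like_clustering(spectra, vigilance=0.9, learning_rate=0.1):
    """Simplified ART2A-style classification (illustrative sketch only)."""
    x_all = np.asarray(spectra, dtype=float)
    norms = np.linalg.norm(x_all, axis=1, keepdims=True)
    x_all = x_all / np.where(norms == 0.0, 1.0, norms)  # scale inputs to unit length

    prototypes = []  # a single weight vector per class
    labels = np.empty(len(x_all), dtype=int)
    for i, x in enumerate(x_all):
        # Find the stored class whose weight vector is most similar (dot product).
        winner, best = -1, -np.inf
        for k, w in enumerate(prototypes):
            similarity = float(w @ x)
            if similarity > best:
                winner, best = k, similarity
        if winner >= 0 and best >= vigilance:
            # Resonance: move the winning weight vector slightly toward the input.
            w = (1.0 - learning_rate) * prototypes[winner] + learning_rate * x
            prototypes[winner] = w / np.linalg.norm(w)
            labels[i] = winner
        else:
            # Unfamiliar pattern: open a new class initialized with this input.
            prototypes.append(x.copy())
            labels[i] = len(prototypes) - 1
    return labels, np.array(prototypes)


# Hypothetical three-channel "spectra" for four particles of two types.
particles = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels, prototypes = art2a_like_clustering(particles)
```

Raising the vigilance parameter makes the classes narrower and more numerous; lowering it merges similar particle types.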
Quantitative measures can now be derived from the class memberships. The mass of each particle can be estimated from its aerodynamic diameter, so either the particle mass in each class or the number of particles in each class can be used in subsequent analyses. Zhao et al. [23] used such mass concentrations in a multivariate calibration model with size-segregated composition data to permit the prediction of the bulk aerosol composition from the single-particle ATOFMS data. Alternatively, Owega et al. [24] used particle number concentrations in a factor analysis model to obtain a quantitative source apportionment.
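As a rough illustration of how class memberships become quantitative variables, the sketch below converts aerodynamic diameters to particle masses and totals the number and mass in each class. It assumes spherical particles of unit density (1 g/cm3), which is the usual convention behind the aerodynamic diameter but is an assumption here; the function and variable names are hypothetical.

```python
import numpy as np


def class_totals(diameters_um, labels, density_g_cm3=1.0):
    """Return class ids, particle counts, and summed particle mass (g) per class."""
    d_cm = np.asarray(diameters_um, dtype=float) * 1.0e-4  # micrometers to centimeters
    # Mass of a sphere: density * (pi / 6) * d^3 (spherical shape is an assumption).
    mass_g = density_g_cm3 * np.pi / 6.0 * d_cm**3
    labels = np.asarray(labels)
    classes = np.unique(labels)
    counts = np.array([(labels == c).sum() for c in classes])
    masses = np.array([mass_g[labels == c].sum() for c in classes])
    return classes, counts, masses


# Three hypothetical particles: two 1 um particles in class 0, one 2 um particle in class 1.
classes, counts, masses = class_totals([1.0, 1.0, 2.0], [0, 0, 1])
```

Because mass scales with the cube of the diameter, the single 2 µm particle carries four times the mass of the two 1 µm particles combined, which is why number-based and mass-based variables can lead to quite different apportionments.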
Mixture Resolution Problem
There are many situations where environmental samples represent a combination of materials with characteristic compositions such that the measured constituent concentrations can be written as:
$$x_j = \sum_{k=1}^{p} g_k f_{kj} + e_j \qquad (1)$$
where x_j is the concentration of chemical species j measured in the environmental sample, f_kj is the concentration of chemical species j in material from "source" k, g_k is the mass contribution of "source" k to the sample of interest, e_j is the unmodeled portion of the variation, and k has values from 1 to p "sources." For air pollution studies, the separation of the composition into source contributions is known as receptor modeling [25, 26]. These methods are being widely used in many locations [27, 28, 29]. These models have also been applied to other environmental systems such as sediments [30] or water quality [31].
There are some important differences in the application of mixture resolution methods to environmental problems because of the lack of fixed "spectral" characteristics. In the spectrochemical problem, each compound has a fixed spectrum. However, in the environment, sources like a coal-fired power plant emit particles of varying composition depending on the nature of the coal being consumed and the combustion conditions. Coals from different locations will have different mineral species laid down with the carbonaceous material from which the coal has formed. Thus, as a plant burns through a supply of coal, there will be a variable input of the various mineral phases [32]. Airborne particle size distributions have been analyzed to identify the origins (sources) of the observed particles since different types of combustion or particle formation processes give rise to different "spectra" of particle sizes [33]. However, aerosol dynamics in the atmosphere, including coagulation, dry deposition, and condensation of low vapor pressure species formed through atmospheric oxidation, can change the size distributions as a function of the time since the particles were emitted. Thus, any analysis has to take variability in the characteristics of the "sources" into account.
In general, the analysis of environmental samples produces data with significantly more uncertainty than is typically obtained with spectroscopic methods. Samples tend to have complex matrices, and there can often be a variety of analytical problems in obtaining good compositional data. There can be both positive and negative sampling artifacts, such as the adsorption of semivolatile organic species on a fibrous filter or the loss of collected ammonium nitrate through dissociation to ammonia and gas-phase nitric acid. It is possible to overcome these problems with carefully designed sampling systems [34,35], but there can be difficulties in the data analysis when such sampling problems were not carefully considered in the design of the sampling program. Even with good samples, the complexities of analyzing for multiple species, with the associated interferences and limits on the detectability of low-concentration species, lead to analytical uncertainties that are generally at least 5% for species well above detection limits and 20% or more as the concentrations approach the limit of detection for the species. Thus, in general, receptor modeling has to deal with at least an order of magnitude greater uncertainty in the analytical results that provide the input data to the mixture resolution algorithm. The modeling output then typically reflects this higher level of uncertainty. To date, most receptor modeling has focused on particulate matter because the composition of particles was thought to be more conserved than that of gaseous species. There have been some efforts to examine gaseous materials, but those studies have generally been limited to less reactive organic compounds.
Multivariate Regression
For airborne particle composition data, the regression solution where the sources of the chemical constituents are known is commonly called the Chemical Mass Balance problem. It solves equation 1 using the environmental concentrations measured in a sample and the known source profiles, f_kj, for the area in which the samples were collected. If the sources in an area are known and there has been an opportunity to sample them, then the number of sources, p, is known, as are the source composition profile vectors, f_k. Thus, the problem becomes a multiple regression problem with only the vector of source contributions, g_k, to be estimated. This multiple regression problem is termed a chemical mass balance (CMB) analysis. The concept of an atmospheric mass balance model was suggested independently by Miller et al. [36] and by Winchester and Nifong [37]. In these initial models, specific elements were associated with particular source types to develop a mass balance for airborne particles. Subsequently, more chemical species than sources were used in a least-squares fit to provide estimates of the mass contributions of the sources [38].
There were a number of these early applications of the mass balance analysis, including Gent, Belgium [39], Heidelberg, Germany [40], and Chicago, Illinois [41]. Several major research efforts have subsequently resulted in substantially better source data. The source emission studies led to much improved resolution of the particle sources in Washington, D.C. [42,43]. In the first of these studies, Kowalczyk et al. [40] introduced the use of weighted least-squares regression to fit six sources with eight elements for ten ambient samples. Subsequently, Kowalczyk et al. [41] examined 130 samples using 7 sources with 28 elements included in the fit. They obtained an excellent fit of the ambient concentration data and a quite good understanding of the major sources of airborne particles in the Washington, D.C. area.
In 1979, Watson [44] and Dunker [45] independently suggested a mathematical formalism called effective variance weighting that included the uncertainty in the measurement of the source composition profiles as well as the uncertainties in the ambient concentrations. As part of this analysis, a method was also developed to permit the calculation of the uncertainties in the mass contributions. Effective-variance least squares has been incorporated into the standard personal computer software developed by the U.S. EPA for receptor modeling. The most extensive use of effective-variance fitting has been made by Watson and colleagues in their work on data from Portland, OR [46]. Since that study, a number of other applications of this approach have been made in a wide variety of locations, and extensive libraries of compositional profiles of emission sources have been developed for use in the mass balance models. These models are described in detail by Watson et al. [47]. The CMB model is available from the U.S. Environmental Protection Agency [48], as is the library of source profiles [49]. However, many of the stationary source profiles were measured prior to 1993. There have been many recent measurements of mobile source profiles, but there is considerable uncertainty in how to aggregate them to provide appropriate fleet averages that represent the mix of emissions observed in major cities. Chow and Watson [50] examined how the Chemical Mass Balance (CMB) receptor model has been used to quantify source contributions from fossil fuel combustion and other sources to ambient concentrations of PM2.5 and PM10 at urban and regional scales. Non-fossil fuel sources, such as fugitive dust, cooking, vegetative burning, and natural or human-caused biogenic compounds, must be considered together with fossil-fuel sources in a CMB analysis to obtain closure for PM2.5 and PM10 mass. CMB analyses in 22 different studies have found fossil fuel combustion to be a large contributor to PM2.5 and PM10 concentrations, with most of the primary contributions originating from diesel- and gasoline-powered vehicle exhaust.
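The effective-variance weighting described above can be sketched as an iteratively reweighted least-squares fit. The sketch assumes the commonly quoted form of the effective variance for species j, namely the ambient measurement variance plus the sum over sources of g_k squared times the profile variance; it is a minimal illustration with hypothetical profiles and is not the EPA CMB software. No non-negativity constraint or collinearity diagnostic is included.

```python
import numpy as np


def cmb_effective_variance(x, sigma_x, profiles, sigma_profiles, n_iter=20):
    """Estimate source contributions g for one sample by effective-variance least squares.

    x, sigma_x               : (m,) ambient concentrations and their uncertainties
    profiles, sigma_profiles : (p, m) source profiles f_kj and their uncertainties
    """
    A = profiles.T  # (m, p) design matrix, one column per source
    g = np.linalg.lstsq(A, x, rcond=None)[0]  # unweighted starting estimate
    for _ in range(n_iter):
        # Effective variance: ambient variance plus profile variance scaled by g^2.
        v_eff = sigma_x**2 + (sigma_profiles**2).T @ g**2
        weighted_At = A.T / v_eff  # (p, m), each species weighted by 1 / v_eff
        cov = np.linalg.inv(weighted_At @ A)  # covariance of the estimates
        g = cov @ (weighted_At @ x)
    return g, np.sqrt(np.diag(cov))


# Hypothetical example: two sources, four species, noise-free ambient data.
profiles = np.array([[0.30, 0.10, 0.00, 0.05],
                     [0.00, 0.20, 0.40, 0.10]])
g_true = np.array([10.0, 5.0])
x = g_true @ profiles
g, g_sigma = cmb_effective_variance(x, 0.05 * x + 0.01, profiles, 0.1 * profiles + 0.001)
```

Because the weights depend on the contributions being estimated, the fit is repeated until the estimates stabilize; the diagonal of the final covariance matrix gives the uncertainties in the mass contributions.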
One of the more extensive uses of the CMB model has been in the source apportionment of organic matter, particularly that associated with fine particles (aerodynamic diameter < 2.5 µm). Beginning in the 1980s and continuing through the mid-1990s, Glen Cass and his research group developed source sampling and analysis methods for specific organic compounds and reported composition data for a series of source samples collected in the Los Angeles area [51, 52, 53, 54, 55]. Archived samples from Los Angeles and the southeastern US were analyzed in an equivalent manner, leading to the Chemical Mass Balance model results [56,57]. Similar procedures have been used by a number of groups [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72].
Subramanian et al. [73] identified some of the strengths and challenges associated with performing CMB analysis with molecular markers in the context of apportioning motor vehicle emissions. A significant strength of the approach is the strong correlations in the ambient molecular marker concentrations. These correlations reflect the high source specificity of certain organic species and imply the existence of well-defined source profiles, even in Pittsburgh, a location strongly influenced by regional transport. However, a major challenge is the variability in the source profiles. The motor vehicle profiles are much more variable than the ambient data, which creates significant uncertainty in the CMB estimates. Hennigan et al. [74] suggest that there is also a problem with the reactivity of the organic markers, which will produce changes in the source profiles as the particles move from the source to the receptor site. Smog chamber experiments with levoglucosan suggest that it has a lifetime of 0.7 to 2.2 days. However, those experiments examined conditions at room temperature and moderate concentrations of hydroxyl radicals. Such conditions may not be representative of the environment when residential wood smoke is an important component of the ambient aerosol.
The issue of reactivity was originally raised by Miller et al. [36], where the mass balance equation was written as
$$x_j = \sum_{k=1}^{p} \alpha_j g_k f_{kj} + e_j \qquad (2)$$
where α_j is the coefficient of fractionation for species j that indicates the fraction of that species remaining in the aerosol at the sampling site. Although it was incorporated in the earliest formulations of the CMB model, it has not been used and has been assumed to be unity.
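To give a feel for how far a fractionation coefficient might depart from unity, the following sketch treats the loss of a reactive marker as a first-order process, so that the fraction remaining after a transport time t is exp(-t / lifetime). The first-order form, the one-day transport time, and the profile values are assumptions made only for illustration; they are not taken from the cited studies.

```python
import math


def fraction_remaining(transport_days, lifetime_days):
    """Fraction of a species left after transport, assuming first-order loss."""
    return math.exp(-transport_days / lifetime_days)


def apply_fractionation(profile, alpha):
    """Scale each species in a source profile by its coefficient (default 1.0)."""
    return {species: alpha.get(species, 1.0) * value for species, value in profile.items()}


# Levoglucosan lifetimes of 0.7 and 2.2 days, with a hypothetical 1-day transport time.
alpha_short = fraction_remaining(1.0, 0.7)  # roughly 0.24
alpha_long = fraction_remaining(1.0, 2.2)   # roughly 0.63

# Hypothetical wood smoke profile (mass fractions), aged with the shorter lifetime.
wood_smoke = {"levoglucosan": 0.10, "potassium": 0.01}
aged_profile = apply_fractionation(wood_smoke, {"levoglucosan": alpha_short})
```

Under these assumptions, between roughly one quarter and two thirds of the marker would survive a day of transport, so assuming a coefficient of unity could noticeably bias the apportioned contribution of the source.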
Multivariate Curve Resolution
A detailed review of multivariate curve resolution methods has been provided by Hopke [75]. Alternative approaches that utilize only the ambient concentration data have been developed in terms of multiple forms of factor analysis that are actually trying to solve the self-modeling curve resolution or mixture resolution problem. To solve this problem, it is necessary to solve equation (1) using data from multiple samples, so that the model being fit is now given as:
$$x_{ij} = \sum_{k=1}^{p} g_{ik} f_{kj} + e_{ij} \qquad (3)$$
where there are now multiple samples denoted by i, the mass contributions g_ik now form a matrix, and the residuals, e_ij, account for the part of the variation in the data that cannot be fit by the model.
The first receptor modeling analyses reported in the literature were factor analyses using eigenvector methods that had been developed in the social sciences for interpreting large data sets. Blifford and Meeker [76] used a principal component analysis with several types of axis rotations to examine particle composition data collected by the National Air Sampling Network (NASN) during 1957-61 in 30 U.S. cities. Prinz and Stratmann [77] examined both the aromatic hydrocarbon content of the air in 12 West German cities and data on the air quality of Detroit using factor analysis methods. In both cases, they found solutions that yielded readily interpretable results. There was no further use of factor analysis until it was reintroduced in the mid-1970s by Hopke et al. [78] and Gaarenstroom et al. [79] in their analyses of particle composition data from Boston, MA and Tucson, AZ, respectively. A problem with these forms of factor analysis is that they do not permit quantitative source apportionment of particle mass or of specific elemental concentrations. In an effort to find alternative methods that would provide information on source contributions when only the ambient particulate analytical results are available, other approaches were employed. Hopke and coworkers used Target Transformation Factor Analysis [80], originally developed by Malinowski [81]. Henry and coworkers [82, 83, 84] have developed alternative methods based on eigenvector methods. The initial SAFER model evolved into Unmix [85]. These concepts provide the basis for a geometrical interpretation using "edges" as outlined by Henry [86]. However, Unmix is still based on an eigenvector analysis, and thus it represents an unweighted least-squares fit to the data [81, 87].
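The statement that an eigenvector analysis is an unweighted least-squares fit can be checked numerically. In the sketch below, which uses random data purely for illustration, a rank-p reconstruction is built from the singular value decomposition and its residual sum of squares is compared with the sum of the squared discarded singular values. Every data point enters the fit with equal weight, regardless of its measurement uncertainty.

```python
import numpy as np


def truncated_svd_fit(X, p):
    """Rank-p reconstruction of X and its unweighted residual sum of squares."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the p largest singular values and their singular vectors.
    X_hat = (U[:, :p] * s[:p]) @ Vt[:p]
    rss = float(((X - X_hat) ** 2).sum())
    return X_hat, rss, s


# Random 30 x 8 "samples by species" matrix (illustrative only).
rng = np.random.default_rng(1)
X = rng.random((30, 8))
X_hat, rss, s = truncated_svd_fit(X, p=3)
discarded = float((s[3:] ** 2).sum())  # equals rss up to rounding
```

No other rank-p model gives a smaller unweighted residual, which is exactly why points with large uncertainties can distort an eigenvector-based solution.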
Methods have been developed that use an explicit least-squares formalism such that individual data points can be appropriately weighted by their estimated uncertainties. This approach also allows easy application of natural non-negativity constraints [88] and allows more complicated models to be developed where they are appropriate to fit the conceptual model built for the given data set. There are two approaches that allow properly weighted analysis: positive matrix factorization (PMF) [89, 90] and multivariate curve resolution-weighted alternating least squares (MCR-ALS) [91, 92]. These methods have been compared by Tauler et al. [93]. Both approaches can achieve similar results, but PMF has been developed into software that is freely downloadable [94]. Thus, PMF has been much more widely applied to environmental data. Most of the applications have been to airborne particle composition data [e.g., 95, 96, 97]. It has been applied to VOC data [98], to Aerosol Mass Spectrometry (AMS) data [99, 100, and many others], and to particle number size distributions [e.g., 101, 102]. PMF has also been applied to sediment analysis [103] and recently to water quality data [104].
Surface water monitoring networks provide a time series of data. Li et al. [104] applied PMF to monitoring data to apportion water pollution sources in the Daliao River (DLR) basin. The DLR basin includes the Hun and Taizi River catchments in northeast China. This basin is densely populated and heavily industrialized. Fourteen monitoring stations located on the two rivers were used to monitor 13 physical and chemical parameters from 1990 to 2002. Five sources/processes were identified in the Hun River and four in the Taizi River based on marker species and the spatial-temporal variations of the resolved factors. These factors included point and nonpoint sources for both rivers. In addition, an industrial pollution source emission inventory was compared with the resolved industrial sources. The results reveal that chemical transformations have influenced some chemical species. However, this influence is small compared with the observed seasonal variations. Therefore, identification of point and nonpoint pollution sources by their seasonal variations is possible, which will also aid in water quality management. The spatial variation of the industrial pollutants typically corresponded with the urban industrial pollution source inventories.
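The idea of an uncertainty-weighted, non-negative factorization can be illustrated with a short sketch. It minimizes the sum over all data points of the squared residual divided by the squared uncertainty, using multiplicative updates from the weighted non-negative matrix factorization literature. This is a stand-in chosen for brevity: it is not the algorithm used in the PMF programs or the multilinear engine, and the synthetic data, uncertainty model, and function name are assumptions.

```python
import numpy as np


def weighted_nonneg_factorization(X, sigma, p, n_iter=500, seed=0):
    """Fit X ~ G @ F with G, F >= 0, weighting each residual by 1 / sigma**2."""
    rng = np.random.default_rng(seed)
    W = 1.0 / sigma**2
    n, m = X.shape
    G = rng.random((n, p)) + 0.1  # positive starting values
    F = rng.random((p, m)) + 0.1
    eps = 1e-12  # guards against division by zero
    q_history = []
    for _ in range(n_iter):
        # Multiplicative updates keep every element non-negative.
        G *= ((W * X) @ F.T) / ((W * (G @ F)) @ F.T + eps)
        F *= (G.T @ (W * X)) / (G.T @ (W * (G @ F)) + eps)
        # Weighted sum of squared residuals after this iteration.
        q_history.append(float((W * (X - G @ F) ** 2).sum()))
    return G, F, q_history


# Synthetic data: 40 samples, 10 species, 3 non-negative sources.
rng = np.random.default_rng(42)
true_G = rng.random((40, 3))
true_F = rng.random((3, 10))
X = true_G @ true_F
sigma = 0.05 * X + 0.01  # hypothetical uncertainty model
G, F, q_history = weighted_nonneg_factorization(X, sigma, p=3)
```

As with any factor analytic model, the resolved factors are subject to scaling and rotational ambiguity, so the columns of G and rows of F need not match the generating sources one to one without further constraints.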
With PMF being applied through the use of the multilinear engine [105], it is possible to build constraints into the model, as was done in studies where there was an effort to separate multiple sources of similar composition [106, 107, 108]. Amato et al. [106] applied the multilinear engine to data from an urban background site in Barcelona (Spain) to quantify the contribution of road dust resuspension to PM10 and PM2.5 concentrations. An emission profile of local resuspended road dust had previously been obtained [109]. This a priori information was introduced into the model as auxiliary terms in the object function to be minimized through the implementation of so-called "pulling equations" [110]. Escrig et al. [107] applied a similar approach to speciated PM10 data obtained at three air quality monitoring sites between 2002 and 2007 in a highly industrialized area in Spain. The source apportionment of PM in this area is an especially difficult task. There are industrial mineral dust emissions that need to be quantified separately from the natural sources of mineral PM. Amato and Hopke [108] applied constraints to combine the analysis of the three sites in the St. Louis area into a single analysis such that known source profiles could be worked into the analysis. The latest version of EPA-PMF has the capability of applying some constraints in the analysis.
Not all environmental data fit naturally into bilinear models as outlined in equation 3. For example, size-segregated particulate matter can be collected using cascade impactors that aerodynamically separate the particles. Chemical characterization of these size-segregated samples provides a data set that has composition as a function of size and time. Different sources produce particles of different sizes, so the "spectral" characteristics of a source are a matrix of composition as a function of size instead of a vector of compositions. Pere-Trepat et al. [111] developed a model given as:
$$X = \sum_{r=1}^{R} A^{(r)} \otimes b^{(r)} + E \qquad (4)$$
where X is the third-order tensor of observed data, A^(r) is the rth source profile array, and b^(r) is the corresponding rth contribution vector. The tensor E, having the same size as X, contains the residuals. Pere-Trepat et al. [111] report the results of the analysis of samples collected in Detroit, MI over a late winter period. Li et al. [112] analyzed data from multiple samplers deployed at Dulles International Airport in the Washington, DC area. Five major emission sources were identified: soil, road salt, aircraft landings, transported secondary sulfate, and local sulfate/construction. Aircraft landings, with the ablation of tires and volatilization of bearing grease, were the largest source of PM2.5. The study shows that time- and size-resolved particulate matter compositional data can assist in the identification of the airport emission sources and the atmospheric processes leading to the observed ambient concentrations. Thus, there are readily available, effective methods for solving the mixture resolution problem in a variety of environmental settings.
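The structure of the three-way model in equation 4 can be made concrete with a small sketch of the forward calculation: each source has a profile matrix (size by species) and a contribution vector (one value per sample), and the modeled tensor is the sum over sources of their outer products. The axis ordering (sample, size, species) and the array names are assumptions of this sketch, and it builds the modeled data only; it does not fit the model.

```python
import numpy as np


def three_way_model(profile_arrays, contributions):
    """Sum over sources of profile matrix (size x species) times contribution per sample.

    profile_arrays : (R, n_sizes, n_species), one profile matrix per source
    contributions  : (R, n_samples), one contribution vector per source
    returns        : (n_samples, n_sizes, n_species) modeled data tensor
    """
    return np.einsum("rjk,ri->ijk", profile_arrays, contributions)


# Hypothetical dimensions: 2 sources, 3 size stages, 4 species, 5 samples.
rng = np.random.default_rng(7)
A = rng.random((2, 3, 4))
B = rng.random((2, 5))
X_model = three_way_model(A, B)
```

Each sample's size-by-species slice is therefore a linear combination of the source profile matrices, with the contributions for that sample as coefficients.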
Regulatory Uses
Probability Estimates
There are a number of areas in which estimates of probabilities play a role in environmental decision making. A major case in point is the form of National Ambient Air Quality Standards (NAAQS). These standards set requirements for air quality that are defined in terms of an indicator, an averaging time for the measurement, a concentration and a form. For example, the NAAQS for carbon monoxide (CO) is defined as 9 ppm (10 mg/m3) for an 8 hour average not to be exceeded more than one time per year and 35 ppm (40 mg/m3) for 1 hour again not to be exceeded more than one time per year. Particulate matter standards are defined for particle sizes less than 10 µm (PM10) and less than 2.5 µm (PM2.5). For each of these standards, there is a 24-hour standard based on integrated 24-hour collections of particulate matter on filters using a specified sampling system. The major problem with PM is that it is expensive to measure and thus, for PM10, samples only have to be collected every 6th day while PM2.5 samples have to be collected every 3rd day. Thus, there is incomplete data for PM as compared to the gaseous species measurements that are made with continuous instruments that can provide complete data. No data set is fully complete because of
instrument downtime for calibration, maintenance, and repairs, but with PM, 66.7% to 83.3% of the data are missing by design. These two standards have different forms, in part to compensate for the incomplete nature of
the measurement data. For PM10 the standard level must not be expected to be exceeded more than once per year based on a three-year average. Davidson and Hopke [113] explored the
probability issues with expected exceedance and percentile forms of PM NAAQS to determine
the influence of the incomplete data. They show that a percentile standard has much greater power to identify sites that are actually in non-attainment of the standard with a lower probability
of a Type 1 error of misclassifying attainment areas as non-attainment. The form of the PM2.5 NAAQS is a percentile standard also based on three years of data.
However, instead of taking the 98th percentile value of the distribution of the required 3 years of measurements, PM2.5 values for the selected percentile are calculated separately for each year and then the average of these 3 values is used as the NAAQS indicator value. Salako and
Hopke [114] compared this average-of-three-annual-values method with the standard
percentile in the multiyear data. The relationships between the values obtained using these two approaches have been explored. PM data measured from January 2004 to December 2008 at 20
sites in 20 different states in the United States were selected from the US EPA Aerometric Information Retrieval System (AIRS) website. PM samples were collected for 24-hour periods from midnight to midnight every third day for PM2.5 and every sixth day for PM10. At some sites, continuous measurements of PM2.5 were made and averaged to provide 24-hr values. Using these data, the NAAQS percentile values were compared with the actual 98th percentile values of the three years of data. Regression and t-test analyses were used to compare these two methods. High correlation coefficients and no significant differences were observed in most cases. Overall, the two methods showed substantial agreement such that either of the two approaches could serve as the statistical form of the 24-hour standard.
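The two forms being compared can be written down in a few lines. The sketch below computes the average of three annual 98th percentiles and the 98th percentile of the pooled three-year record for synthetic data. The function names and the lognormal test data are hypothetical, and `numpy.percentile` with its default linear interpolation is used as a stand-in; it is not the regulatory procedure for selecting the percentile value.

```python
import numpy as np


def average_of_annual_percentiles(values_by_year, q=98.0):
    """Mean of the q-th percentile computed separately for each year."""
    return float(np.mean([np.percentile(v, q) for v in values_by_year]))


def pooled_percentile(values_by_year, q=98.0):
    """q-th percentile of all years pooled into a single distribution."""
    return float(np.percentile(np.concatenate(values_by_year), q))


rng = np.random.default_rng(1)
# Synthetic 24-hour PM2.5 values; every-third-day sampling gives
# roughly 122 samples per year (an illustrative choice).
years = [rng.lognormal(mean=2.5, sigma=0.5, size=122) for _ in range(3)]

print("average of annual 98th percentiles:", average_of_annual_percentiles(years))
print("98th percentile of pooled data:    ", pooled_percentile(years))
```

With roughly 122 samples per year, the annual 98th percentile is determined by the two or three highest values, so the two forms can differ from site to site even when they agree on average.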
Conditional Probability Measures
Several data analysis tools have been developed to provide additional information on the nature of the system under study by calculating conditional probabilities that various
source/receptor relationships exist. For example, the influence of local sources on the concentrations measured at a given location can be assessed using the Conditional Probability
Function (CPF) [115]. The CPF analyzes point source impacts
from varying wind directions using the source contribution estimates from PMF coupled with the wind direction values measured on site. The CPF estimates the probability that a given source
contribution from a given wind direction will exceed a predetermined threshold criterion. Generally, the same daily contribution is assigned to each hour of a given day to match the hourly wind data. The CPF is defined as

$$\mathrm{CPF}_{\Delta\theta} = \frac{m_{\Delta\theta}}{n_{\Delta\theta}} \qquad (5)$$

where mΔθ is the number of occurrences from wind sector Δθ that exceeded the threshold criterion, and nΔθ is the total number of data from the same wind sector. Generally, Δθ is chosen so that there are reasonable numbers of events assigned to each wind sector. Calm wind (< 1 m/sec) periods are excluded from the analysis due to the isotropic behavior of the wind vane under calm winds. A threshold criterion (e.g., upper 25th percentile) is chosen from tests with several different percentiles of the source contribution from each source to define the directionality of the sources. The sources are likely to be located in the directions that have high conditional probability values.
As an example of a CPF analysis, Figure 3 presents the results for lead data measured in East St. Louis, IL [116]. The plots point to the southwest, which is consistent with the location of a primary lead smelter in Herculaneum, Missouri, about 40 km to the south-south-west of the site. Furthermore, the zinc smelter in the southwest direction also emits high Pb. The small peak of Pb in the northerly direction is identified as the steel processing plant in Granite City, IL. This approach has proven useful in many studies to help identify likely sources of the estimated source contributions.
Figure 3. CPF plot for Pb measured at the Midwest Supersite in East St. Louis, IL.
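The CPF calculation described above can be sketched in a few lines. The code below bins hourly wind directions into sectors, drops calm periods, and returns the fraction of records in each sector whose source contribution exceeds the threshold. The function name `cpf`, the 30-degree default sector width, and the choice to compute the threshold from all records before calm periods are removed are assumptions made for illustration.

```python
import numpy as np


def cpf(wind_dir_deg, wind_speed, contribution,
        sector_width=30.0, threshold_percentile=75.0, calm_speed=1.0):
    """Return m/n for each wind sector (NaN where a sector has no data)."""
    wind_dir_deg = np.asarray(wind_dir_deg, dtype=float)
    wind_speed = np.asarray(wind_speed, dtype=float)
    contribution = np.asarray(contribution, dtype=float)

    # Threshold criterion, e.g. the upper 25% of contributions (75th percentile).
    threshold = np.percentile(contribution, threshold_percentile)

    # Exclude calm periods (wind speed below calm_speed, in m/s).
    keep = wind_speed >= calm_speed

    # Assign each remaining record to a wind sector.
    n_sectors = int(round(360.0 / sector_width))
    sector = np.floor((wind_dir_deg[keep] % 360.0) / sector_width).astype(int)
    sector = np.minimum(sector, n_sectors - 1)
    exceeds = contribution[keep] > threshold

    n = np.bincount(sector, minlength=n_sectors)            # all records per sector
    m = np.bincount(sector[exceeds], minlength=n_sectors)   # exceedances per sector

    result = np.full(n_sectors, np.nan)
    populated = n > 0
    result[populated] = m[populated] / n[populated]
    return result
```

Sectors with very few records give unstable ratios, which is why the sector width is chosen so that each sector contains a reasonable number of events.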
This conceptual framework can be extended to larger scale source/receptor relationships through the use of trajectory ensemble methods. It is possible to calculate air parcel back
trajectories that provide information on the pathway the air took reaching a sampling site during
the period when a sample is being collected. However, there is uncertainty in these calculations that increases as the calculation is extended further backward in time and space. Ensemble
methods use a collection of a large number of trajectories to look at statistical variables where the noise in the trajectory locations can be averaged out.
Trajectories are represented by segment endpoints. Each endpoint has two coordinates (e.g., latitude, longitude) representing the central location of an air parcel at a particular time. To calculate the PSCF, the whole geographic region covered by the trajectories is divided into an
array of grid cells whose size is dependent on the geographical scale of the problem so that the
PSCF will be a function of locations as defined by the cell indices i and j. Let N be the total number of trajectory segment endpoints during the whole study period, T.
If nij trajectory segment endpoints fall into the ij-th cell, the probability of this event, Aij, is given by

$$P[A_{ij}] = \frac{n_{ij}}{N} \qquad (6)$$

where P[Aij] is a measure of the residence time of a randomly selected air parcel in the ij-th cell relative to the time period T. Suppose in the same ij-th cell there is a subset of mij segment endpoints for which the corresponding trajectories arrive at a receptor site at the time when the measured concentrations are higher than a pre-specified criterion value. In this study, the criterion values were the calculated mean values for each species at each site. The probability of this high concentration event, Bij, is given by

$$P[B_{ij}] = \frac{m_{ij}}{N} \qquad (7)$$

Like P[Aij], this subset probability is related to the residence time of air parcels in the ij-th cell, but P[Bij] applies only to the contaminated air parcels. The potential source contribution function (PSCF) is defined as

$$P_{ij} = \frac{P[B_{ij}]}{P[A_{ij}]} = \frac{m_{ij}}{n_{ij}} \qquad (8)$$

Pij is the conditional probability that an air parcel which passed through the ij-th cell had a high concentration upon arrival at the trajectory endpoint. There are several problems with the PSCF
analysis approach. Near the edge of the spatial domain of the back trajectories, there are relatively few trajectories in any given grid cell. In many of the studies [117,118, 119], an arbitrary weight function is used to reduce the values in cells with few endpoints. The weighting
procedure outlined by Han et al. [120] has been found to be a useful general approach to this problem. In this approach, the average number of trajectory endpoints per cell, n̄, is calculated by dividing N by the total number of grid cells. The arbitrary weights W(nij) are as follows:
1.00 when nij ≥ 3n̄,
0.72 when 1.5n̄ ≤ nij < 3n̄,
0.42 when 0.75n̄ ≤ nij < 1.5n̄, and
0.17 when nij < 0.75n̄.
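The PSCF calculation and the endpoint-count weighting can be sketched as follows. The code counts all trajectory endpoints and the "high concentration" endpoints in each grid cell, takes their ratio, and multiplies by the weights listed above. The function names `pscf` and `han_weights`, the use of the in-grid endpoint total for N, and the treatment of values that fall exactly on a bin boundary are assumptions made for illustration.

```python
import numpy as np


def han_weights(n_ij, n_bar):
    """Weights based on endpoint counts relative to the mean count per cell."""
    return np.select(
        [n_ij >= 3.0 * n_bar, n_ij >= 1.5 * n_bar, n_ij >= 0.75 * n_bar],
        [1.00, 0.72, 0.42],
        default=0.17,
    )


def pscf(lat, lon, is_high, lat_edges, lon_edges, weighted=True):
    """Return the (optionally weighted) PSCF grid; NaN where no endpoints fall."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    is_high = np.asarray(is_high, dtype=bool)
    bins = [lat_edges, lon_edges]

    # n_ij: all endpoints per cell; m_ij: endpoints of trajectories that
    # arrived when the measured concentration exceeded the criterion value.
    n_ij, _, _ = np.histogram2d(lat, lon, bins=bins)
    m_ij, _, _ = np.histogram2d(lat[is_high], lon[is_high], bins=bins)

    p_ij = np.full(n_ij.shape, np.nan)
    visited = n_ij > 0
    p_ij[visited] = m_ij[visited] / n_ij[visited]

    if weighted:
        # Assumption: N is taken as the number of endpoints inside the grid.
        n_bar = n_ij.sum() / n_ij.size
        p_ij = p_ij * han_weights(n_ij, n_bar)
    return p_ij
```

Each endpoint carries the flag of the trajectory it belongs to, so a trajectory that arrives during a high-concentration sample contributes to mij in every cell it crosses.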
Another problem that has typically been found is the "trailing" effect. In some cases, there are barriers that prevent trajectories from areas upwind of a source from reaching the receptor site without passing through the grid cell containing the source. Mountain/valley terrain such as exists in Los Angeles, CA is one such example [121,122]. The trajectories passing through the source cell will transport pollutants to the receptor site, so these trajectories will be classified into the above-criterion group ("contaminated" trajectories). Thus, the area upwind of the source in which these "contaminated" trajectories originate will also tend to be assigned as a source area, and a "trailing" effect of contamination backward from the source cells is often observed. PSCF has been applied to both particle and precipitation compositional data in a series of studies over a variety of geographical scales [123, 124, 125, 126].
To illustrate the utility of the PSCF method, results from St. Louis, MO for secondary sulfate and secondary nitrate particles are presented [127]. Figure 4 presents the PSCF map for secondary sulfate particles. The areas of highest probability are in the upper Ohio River Valley and the Tennessee Valley. There are many coal-fired power plants in these areas, which are known major SO2 source regions. This result suggests that transformation of the emitted SO2 to particulate
sulfate and the subsequent transportation of these particles from coal-fired power plants results in
high values of PM2.5 sulfate in St. Louis.
Figure 4. PSCF Map for sulfate measured in downtown St. Louis, MO between 2001 and 2003.
Figure 5 presents the map for nitrate-bearing particles. High potential source regions for the nitrate-rich secondary aerosol factor appeared in areas of Iowa and Kansas that have been linked to high ammonia emissions, primarily from extensive animal confinement facilities in northwestern Iowa and fertilizer application to agricultural croplands in locations like Kansas [128, 129, 130]. Manure and anhydrous ammonia fertilizer are typically applied to the croplands in fall and spring, which leads to evaporation of ammonia into the air. Gaseous ammonia reacts with nitric acid in the particle/gas nitrate equilibrium where the particle phase is favored at relatively cool temperatures such as occur in the late fall and early spring. High ammonium nitrate concentrations observed in the St. Louis area are likely influenced by transport of ammonium nitrate formed over the high ammonia source region or elsewhere during transport.
Figure 5. PSCF Map for nitrate measured in downtown St. Louis, MO between 2001 and 2003.
There are other trajectory ensemble methods that are based on the same premise of averaging the properties of a large number of individual trajectories. Zhao et al. [131] show the utility of two
back trajectory analysis methods designed to be used with multiple site data: simplified
quantitative transport bias analysis (SQTBA) and residence time weighted concentration (RTWC). These techniques were applied to nitrate and sulfate concentration data from two rural
sites (the Mammoth Cave National Park and the Great Smoky Mountain National Park) and five urban sites (Chicago, Cleveland, Detroit, Indianapolis, St. Louis) for an intensive investigation on the spatial patterns of origins for these two species in the upper midwestern area. A general conclusion was that the origins of the nitrate in these seven sites were mainly in the upper midwestern areas while the sulfate concentrations in these seven sites were mainly from the Ohio River Valley and Tennessee River Valley areas. The upper midwestern areas are regions of high ammonia emissions rather than high NOx emissions. In the winter, metropolitan areas showed the highest nitrate emission potential suggesting the importance of local NOx emissions. In the summer, ammonia emissions from fertilizer application in the lower midwestern area made a significant contribution to nitrate in the rural sites of this study. The impact of the wind direction prevalence on the source spatial patterns was observed by comparing the urban and rural patterns of the summer. A number of other methods are available for describing the probability distribution for source locations, including residence time analysis [132], simplified quantitative transport bias analysis (SQTBA) [133, 134], and residence time weighted concentrations [133, 135].
SUMMARY
This review has illustrated the application of a variety of chemometric techniques to
environmental data. These examples illustrate the variety of data analysis problems that need to be addressed. It is not possible to be comprehensive in this review as there is too much work to
be able to summarize. There remain many opportunities to apply data analysis tools to improve
the interpretation of complex environmental data. Because of the uncontrolled nature of the environment, it is necessary to collect a large number of samples and characterize them as
thoroughly as possible for their chemical and physical characteristics. The resulting large data sets then provide difficulties in interpretation without the application of a suite of multivariate
tools. However, modern data analysis tools offer the opportunity to extract considerable information on the underlying chemical and physical processes that give rise to the measured values and an understanding of the functioning of complex environmental systems.
Newer, more computationally intensive methods such as resampling methods (jackknifing and
bootstrapping) and Bayesian approaches have come into more frequent use as it has become practical to apply them even to large data sets. However, the environmental community still
requires methods to be packaged in easily used programs before their use becomes widespread. For example, there has been rapidly increasing use of PMF since the US EPA made a version freely available [94]. There are other interfaces to the multilinear engine that have been developed for specific applications such as the analysis of Aerosol Mass Spectrometer data and these have also gained wide acceptance and use. Given that there is a freely available statistics engine, R [136], it would be good to see many of the environmental chemometrics methods developed into R scripts. The scripts can be provided as supplemental material files that most publishers have established for just this purpose. We are not developing these methods just for our own use and we will enhance both the chemometrics and environmental science fields by making state-of-the-art methods readily accessible to the whole community.
REFERENCES
1. M.D. Cheng and P.K. Hopke, Investigation on the Use of Chemical Mass Balance Receptor Model: Numerical Computations, Chemom. Intel. Lab Syst. 1 (1986) 33-44. 2. R.G. Brereton, ed., Multivariate pattern recognition in chemometrics, Amsterdam, Elsevier Science Publishers, B.V., 1992. 3. P.K. Hopke, Application of Multivariate Analysis to the Interpretation of the Chemical and Physical Analysis of Lake Sediments, J. Environ. Sci. Health A11 (1976) 367-383. 4. P.K. Hopke, D.G. Ruppert, P.R. Clute, W.J. Metzger, and D.J. Crowley, Geochemical Profile of Chautauqua Lake Sediments, J. Radioanal. Chem. 29 (1976) 39-59. 5. Clute, P. R., 1973, Chautauqua Lake Sediments. M. S. Thesis, State University College at Fredonia, N.Y., 127 p. 6. D.F. Ruppert, P.K. Hopke, P.R. Clute, W.J. Metzger and D.J. Crowley, Arsenic Concentrations and Distributions in Chautauqua Lake Sediments, J. Radioanal. Chem. 22 (1974)159-169. 7. P.K. Hopke, Quantitative Results from Single Particle Characterization Data, J. Chemometrics 22 (2008) 528–532. 8. G.S. Casuccio, P.K. Hopke, Scanning Electron Microscopy, in Receptor Modeling for Air Quality Management, P.K. Hopke, Ed., Elsevier, Amsterdam, 1991. 9. K. A. Prather, T. Nordmeyer, K. Salt, Real-time characterization of individual aerosol particles using time-of-flight mass spectrometry, Anal. Chem. 66 (1994) 1403. 10. L. S. Hughes, J. O. Allen, M. J. Kleeman, R.J. Johnson, G.R. Cass, D.S. Gross, E. E. Gard, M. E. Gälli, B. Morrical, D. P. Fergenson, T. Dienes, C. A. Noble, D.-Y. Liu, P. J. Silva, K. A. Prather, Size and composition distribution of atmospheric particles in southern California, Environ. Sci. Technol. 33 (1999) 3506. 11. X.-H. Song, N.M. Faber, P.K. Hopke, D.T. Suess, K.A. Prather, J.J. Schauer, G.R. Cass, Source Apportionment of Gasoline and Diesel by Multivariate Calibration Based on Single Particle Mass Spectral Data, Anal. Chim. Acta. 446 (2001) 329. 12 D.S. Kim, P.K. Hopke, D.L. Massart, L. Kaufman, G.S. 
Casuccio, Multivariate Analysis of CCSEM Auto Emission Data, Sci. Total Environ. 59 (1987) 141-155. 13 D.S. Kim, P.K. Hopke, The Classification of Individual Particles Based on ComputerControlled Scanning Electron Microscopy Data, Aerosol Sci. Technol. 9 (1988) 133-151. 14 S. Grossberg, Adaptive pattern recognition and universal recoding, I: Parallel development and coding of neural feature, Biological Cybernetics 23 (1976) 121-134. 15 S. Grossberg, Adaptive pattern recognition and universal recoding, II: Feedback, expectation, olfaction, and illusion, Biological Cybernetics 23: 187-202 (1976). 16 G.A. Carpenter, S. Grossberg, Pattern Recognition by Self-Organizing Neural Networks, Cambridge, MA, MIT Press, 1991 17 G.A. Carpenter, S. Grossberg, ART2: Self-organization of stable category recognition codes for analog input patterns, Appl. Opt., 12 (1987) 4919-4930. 18 G.A. Carpenter, S. Grossberg, ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures, Neural Networks, 3 (1990) 129-152. 19 G.A. Carpenter, S. Grossberg, D.B. Rosen, Art 2-a - an adaptive resonance algorithm for rapid category learning and recognition, Neural Networks, 4 (1991) 493-504 20 G.A. Carpenter, S. Grossberg and D.B. Rosen, Neural Networks, 4 (1991) 759-771.
21 Y. Xie, P.K. Hopke and D. Wienke, Airborne Particle Classification with a Combination of Chemical Composition and Shape Index Utilizing an Adaptive Resonance Artificial Neural Network, Environ. Sci. Technol., 28 (1994) 1921-1928. 22 D. Wienke, Y. Xie, P.K. Hopke, An Adaptive Resonance Theory Based Artificial Neural Network (ART-2a) for Rapid Identification of Airborne Particle Shapes from Their Scanning Electron Microscopy Images, Chemom. Intell. Lab. Syst., 25 (1994) 367-38 7 23 W. Zhao, P.K. Hopke, X. Qin, K.A. Prather, Predicting Bulk Ambient Aerosol Compositions from ATOFMS Data with ART-2a and Multivariate Analysis, Anal. Chim. Acta 549 (2005) 179–187 24 S. Owega, B.U.Z. Khan, R. D'Souza, G.J. Evans, M. Fila, R.E. Jervis, Receptor modeling of Toronto PM2.5 characterized by aerosol laser ablation mass spectrometry, Environ. Sci. Technol. 38 (2004) 5712-5720 25. P.K. Hopke, Receptor Modeling in Environmental Chemistry, New York, J. Wiley & Sons, 1985. 26. Hopke, P.K., Receptor Modeling for Air Quality Management, Amsterdam, Elsevier Science, 1991. 27. A. Reff, S. Eberly, P. Bhave, Receptor modeling of ambient particulate matter data using positive matrix factorization: Review of existing methods, J. Air Waste Manage. Assoc. 2007; 57, 146-154. 28. M. Viana, T.A.J. Kuhlbusch, X. Querol, A. Alastuey, R.M. Harrison, P.K. Hopke, W. Winiwarter, M. Vallius, S. Szidat, A.S.H. Prévôt, C. Hueglin, H, Bloemen, P. Wåhlin, R. Vecchi, A.I. Miranda, A. Kasper-Giebl, W. Maenhaut, R. Hitzenberger, Source apportionment of particulate matter in Europe: a review of methods and results. J. Aerosol Sci. 2008; 39, 827-849 29. C.A. Belis, F. Karagulian, B.R. Larsen, P.K. Hopke, Critical review and meta-analysis of ambient particulate matter source apportionment using receptor models in Europe. Atmospheric Environ. 2013; 69, 94–108. 30. K.L. Sundqvist, M. Tysklind, P. Geladi, P.K. Hopke, K. 
Wiberg, PCDD/F source apportionment in the Baltic Sea using Positive Matrix Factorization, Environ. Sci. Technol. 44 (2010) 1690–1697. 31. H. Li, P.K. Hopke, X. Liu, X. Du, F. Li, Application of positive matrix factorization to sources apportionment of surface water quality of the Daliao River Basin, Northeast China, Environ. Monitor. Assess. 187 (2015) 80-92. 32. Roscoe, B.A., Chen C.Y., Hopke, P.K.., Comparison of the Target Transformation Factor Analysis of Coal Composition Data with X-Ray Diffraction Data, Anal. Chim. Acta 1984; 160, 121-134. 33. Kim, E., Hopke, P.K., Larson, T.V., Covert, D.S., Analysis of ambient particle size distributions using unmix and positive matrix factorization, Environ. Sci. Technol. 2004; 38, 202-209. 34. T.G. Dzubay, R.K. Stevens, Sampling and Analysis Methods for Ambient PM-10 Aerosol, in Receptor Modeling for Air Quality Management, Hopke, P.K., ed., Elsevier Science Publishers, Amsterdam, 11-44 (1991). 35. G.T. Wolff, M.S. Ruthkosky, D.P. Stroup, P.E. Korsog, A characterization of the principal PM-10 species in Claremont (summer) and Long Beach (fall) during SCAQS, Atmospheric Environ.25A: 2173-2186 (1991).
36. M.S. Miller, S.K. Friedlander, G.M. Hidy, A Chemical element balance for the Pasadena aerosol, J. Colloid Interface Sci. 39: 65-176 (1972). 37. J.W. Winchester, G.D. Nifong, Water pollution in Lake Michigan by trace elements from pollution aerosol fallout, Water, Air, Soil Pollut. 1:50-64 (1971). 38. Friedlander, S.K., Chemical Element Balances and Identification of Air Pollution Sources, Environ. Sci Technol. 7:235-240 (1973). 39. R. Heindryckx, R. Dams, Continental, marine and anthropogenic contributions to the inorganic composition of the aerosol of an industrial zone, J. Radioanal. Chem. 19: 339-349 (1974). 40. Bogen, J. Trace Elements in Atmospheric Aerosol in the Heidelberg Area Measured by Instrumental Neutron Activation Analysis, Atmospheric Environ. 7:1117-1125 (1973). 41. D.F. Gatz, Relative contributions of different sources of urban aerosols: application of a new estimation method to multiple sites in Chicago, Atmospheric Environ. 9: 1-18 (1975). 42. G.S. Kowalczyk, C.E. Choquette, G.E. Gordon, Chemical element balances and identification of air pollution sources in Washington, DC, Atmospheric Environ., 12: 1143-1153 (1978). 43. G.S. Kowalczyk, G.E. Gordon, S.W. Rheingrover, Identification of atmospheric particulate sources in Washington, DC using chemical element balances, Environ. Sci. Technol., 16: 79-90 (1982). 44. J.G. Watson, Chemical Element Balance Receptor Model Methodology for Assessing the Source of Fine and Total Suspended Particulate Matter in Portland, Oregon, Ph.D. Thesis, Oregon Graduate Center, Beaverton, OR, 1979. 45. A.M. Dunker, A Method for Analyzing Data on the Elemental Composition of Aerosols, General Motors Research Laboratories Report MR-3074 ENV-67, Warren, MI, 1979. 46. J.G. Watson, J.A. Cooper, J.J. Huntzicker, The Effective Variance Weighting for Least Squares Calculations Applied to the Mass Balance Receptor Model, Atmospheric Environ., 18:1347-1355 (1984). 47 J.G. Watson, J.C. Chow, T.G. 
Pace, Chemical Mass Balance, in Receptor Modeling for Air Quality Management, Hopke, P.K., ed., Elsevier Science Publishers, Amsterdam, 1991; 83-116. 48. http://www.epa.gov/scram001/receptor_cmb.htm 49. http://www.epa.gov/ttn/chief/software/speciate/index.html 50. J.C. Chow, J.G. Watson, Review of PM2.5 and PM10 apportionment for fossil fuel combustion and other sources by the chemical mass balance receptor model, Energy & Fuels 16, 222-260 (2002). 51. W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Sources of fine organic aerosol. 3. Road dust, tire debris, and organometallic brake lining dust: roads as sources and sinks, Environ. Sci. Technol., 27:1892-1904 (1993). 52. B.R.T. Simoneit, M.A. Mazurek, T.A. Cahill, Contamination of the Lake Tahoe air basin by high molecular weight petroleum residues, Journal Air Pollution Control Association, 30:387390 (1980). 53. B.R.T. Simoneit, M.A. Mazurek, Air pollution: The organic components, Critical Reviews in Environmental Control, 11, 219-276 (1981). 54. B.R.T. Simoneit, M.A. Mazurek, Organic matter of the troposphere_II. Natural background of biogenic lipid matter in aerosols over the rural western United States, Atmos. Environ., 16: 2139-2159 (1982).
55. B.R.T. Simoneit, W.F. Rogge, M.A. Mazurek, L.J. Standley, L.M. Hildemann, G.R. Cass, Lignin pyrolysis products, lignans, and resin acids as specific tracers of plant classes in emissions from biomass combustion, Environ. Sci. Technol., 27:2533-2541 (1993). 56. J.J. Schauer, W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Source apportionment of airborne particulate matter using organic compounds as tracers, Atmos. Environ. 30:3837-3855. 57. Zheng, M., Cass, G.R., Schauer, J.J., Edgerton, E.S., 2002. Source apportionment of PM2.5 in the southeastern United States using solvent-extractable organic compounds as tracers, Environ. Sci. Technol., 36:2361-2371. 58. R.E. Cox, M.A. Mazurek, B.R.T. Simoneit, Lipids in Harmattan aerosols of Nigeria, Nature, 296(5860), 848-849 (1982) 59. M.A. Mazurek, M. Mason-Jones, H. Mason-Jones, L.G. Salmon, G.R. Cass, K.A. Hallock, M. Leach, Visibility-reducing organic aerosols in the vicinity of Grand Canyon National Park: Properties observed by high resolution gas chromatography, Journal of Geophysical Research, 102:3779-3793 (1997). 60. M.A. Mazurek B.R.T. Simoneit, G.R. Cass, H.A. Gray, HA. Quantitative high-resolution gas chromatography and high-resolution gas chromatography/mass spectrometry analyses of carbonaceous fine aerosol particles, Int. J. Anal. Chem. 29:119-139 (1987). 61. M A. Mazurek, G.R. Cass, B.R.T. Simoneit , Interpretation of high-resolution gas chromatography and high-resolution gas chromatography/mass spectrometry data acquired from atmospheric organic aerosol, Aerosol Sci. Technol., 10, 408-420 (1989). 62. M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Biological input to visibility-reducing aerosol particles in the remote arid southwestern United States, Environ. Sci. Technol. 25: 684-694 (1991) 63. M A. Mazurek, B.R.T. 
Simoneit, Characterization of biogenic and petroleum-derived organic matter in aerosols over remote, rural and urban areas, in Identification and Analysis of Organic Pollutants in Air, edited by L. H. Keith, pp. 353-370, Ann Arbor Science/Butterworth, Boston, MA (1984). 64. L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Mathematical modeling of urban organic aerosol: properties measured by high-resolution gas chromatography, Environ. Sci. Technol., 27:2045-2055 (1993). 65. L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Seasonal trends in Los Angeles ambient organic aerosol observed by high-resolution gas chromatography, Aerosol Science and Technology, 20: 303-317 (1994). 66 W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Contribution of primary aerosol emissions from vegetation-derived sources to fine particle concentrations in Los Angeles, Journal of Geophysical Research 101: 19541-19549 (1996) 67. W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Sources of fine organic aerosol. 1. Charbroilers and meat cooking operations, Environ. Sci. Technol., 25: 11121125 (1991). 68. W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Sources of fine organic aerosol. 2. Noncatalyst and catalyst-equipped automobiles and heavy-duty diesel trucks, Environ. Sci. Technol., 27: 636-651 (1993).
69. W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, B.R.T. Simoneit, Sources of fine organic aerosol. 4. Particulate abrasion products from leaf surfaces of urban plants, Environmental Science & Technology, 27: 2700-2711 (1993). 70. W.F. Rogge, L.M. Hildemann, M A. Mazurek, G.R. Cass, Sources of fine organic aerosol. 6. Cigarette smoke in the urban atmosphere, Environmental Science & Technology, 28: 1375-1388 (1994). 71. B.R.T. Simoneit, J.J. Schauer, C.G. Nolte, D.R. Oros, V.O. Elias, M.P. Fraser, W.F. Rogge, G.R. Cass, Levoglucosan, a tracer for cellulose in biomass burning and atmospheric particles, Atmospheric Environment 33: 173-182. 72. M.A. Mazurek, Molecular identification of organic compounds in atmospheric complex mixtures and relationship to atmospheric chemistry and sources, Environ. Health Perspect. 2002; 110, 995-1003. 73. R. Subramanian, N.M. Donahue, A. Bernardo-Bricker, W.F. Rogge, A.L. Robinson, Contribution of motor vehicle emissions to organic carbon and fine particle mass in Pittsburgh, Pennsylvania: Effects of varying source profiles and seasonal, Atmospheric Environment 2006; 40, 8002-8019. 74. C.J. Hennigan, C. J., A. P. Sullivan, J. L. Collett Jr., A. L. Robinson, Levoglucosan stability in biomass burning particles exposed to hydroxyl radicals, Geophys. Res. Lett. 2010; 37, L09806, doi:10.1029/2010GL043088. 75. P.K. Hopke, Applying Multivariate Curve Resolution to Source Apportionment of the Atmospheric Aerosol, in 40 Years of Chemometrics, B.K. Lavine, ed., Symposium Series, Washington, DC. American Chemical Society, 2015. 76. Blifford, I.H., Meeker, G.O., A Factor Analysis Model of Large Scale Pollution, Atmospheric Environment 1967 1, 147-157. 77. Prinz. B., Stratmann, H., The Possible Use of Factor Analysis in Investigating Air Quality, Staub-Reinhalt Luft 1968; 28, 33-39. 78. 
Hopke, P.K., Gladney, E.S., Gordon, G.E., Zoller, W.H., Jones, A.G., The Use of Multivariate Analysis to Identify Sources of Selected Elements in the Boston Urban Aerosol, Atmospheric Environment 1976; 10, 1015-1025 79. Gaarenstroom, P.D., Perone, S.P. , Moyers, J.P., Application of Pattern Recognition and Factor Analysis for Characterization of Atmospheric Particulate Composition in Southwest Desert Atmosphere, Environ. Sci. Technol. 1977; 11, 795-800. 80. Hopke, P.K., Target Transformation Factor Analysis As An Aerosol Mass Apportionment Method: A Review And Sensitivity Study, Atmos. Environ. 1988; 22, 1777-1792. 81. Malinowski, E., Factor Analysis in Chemistry, 2nd ed., Wiley, New York, 1991. 82. Henry, R.C. Kim, B.-M., Extension of Self-Modeling Curve Resolution Mixtures of More Than Three Components. Part 1. Finding the Basic Feasible Region, Chemom. Intell. Lab. Sys. 1989; 8, 205-216 83. Kim, B.-M., Henry, R.C., Extension of self-modeling curve resolution to mixtures of more than three components: Part II. Finding the complete solution, Chemom. Intell. Lab. Syst. 1999; 49, 67– 77. 84. Kim, B.-M., Henry, R.C., Application of the SAFER model to Los Angeles PM10 data, Atmos. Environ. 2000; 34, 1747–1759.
85. U.S. Environmental Protection Agency, Receptor Modeling, http://www.epa.gov/ttn/scram/receptorindex.htm
86. Henry, R.C., Multivariate receptor modeling by N-dimensional edge detection. Chemom. Intell. Lab. Syst. 2003; 65, 179-189.
87. C.L. Lawson, R.J. Hanson, Solving Least Squares Problems, Classics in Applied Mathematics (Book 15), Philadelphia, Society for Industrial and Applied Mathematics (1987).
88. Henry, R.C., Multivariate Receptor Models. In: Hopke, P.K., ed. Receptor Modeling for Air Quality Management, Elsevier, Amsterdam, pp. 117-147, 1991.
89. Paatero, P., Tapper, U., Analysis of different modes of factor analysis as least squares fit problems. Chemom. Intell. Lab. Syst. 1993; 18, 183–194.
90. Paatero, P., Tapper, U., Positive Matrix Factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994; 5, 111–126.
91. M. Schuermans, I. Markovsky, P.D. Wentzell, S. Van Huffel, On the equivalence between total least squares and maximum likelihood PCA. Anal. Chim. Acta 2005; 544, 254–267.
92. P.D. Wentzell, T.K. Karakach, S. Roy, M.J. Martinez, C.P. Allen, M. Werner-Washburne, Multivariate curve resolution of time course microarray data. BMC Bioinformatics 2006; 7, 343.
93. R. Tauler, M. Viana, X. Querol, A. Alastuey, R.M. Flight, P.D. Wentzell, P.K. Hopke, Comparison of the results obtained by four receptor modeling methods in aerosol source apportionment studies, Atmospheric Environ. 2009; 43, 3989-3997.
94. U.S. Environmental Protection Agency, http://www.epa.gov/heasd/research/pmf.html
95. Polissar, A.V., Hopke, P.K., Paatero, P., Malm, W.C., Sisler, J.F., Atmospheric aerosol over Alaska 1. Elemental composition and sources. J. Geophys. Res. 1998; 103, 19045-19057.
96. Ramadan, Z., Song, X.-H., Hopke, P.K., Identification of Sources of Phoenix Aerosol by Positive Matrix Factorization. J. Air Waste Manage. Assoc. 2000; 50, 1308-1320.
97. Ramadan, Z., Eickhout, B., Song, X.-H., Buydens, L.M.C., Hopke, P.K., Comparison of Positive Matrix Factorization (PMF) and Multilinear Engine (ME-2) for the Source Apportionment of Particulate Pollutants. Chemom. Intell. Lab. Syst. 2003; 66, 15-28.
98. Kim, E., Brown, S.G., Hafner, H.R., Hopke, P.K., Characterization of Non-Methane Volatile Organic Compounds Sources in Houston during 2001 using Positive Matrix Factorization, Atmos. Environ. 2005; 39, 5934-5946.
99. Lanz, V.A., Alfarra, M.R., Baltensperger, U., Buchmann, B., Hueglin, C., Prevot, A.S.H., Source apportionment of submicron organic aerosol at an urban site by factor analytical modeling of aerosol mass spectra, Atmos. Chem. Phys. 2007; 7, 1503-1522.
100. Ulbrich, I.M., Canagaratna, M.R., Zhang, Q., Worsnop, D.R., Jimenez, J.L., Interpretation of organic components from Positive Matrix Factorization of aerosol mass spectrometric data, Atmos. Chem. Phys. 2009; 9, 2891-2918.
101. Kim, E., Hopke, P.K., Larson, T.V., Covert, D.S., Analysis of ambient particle size distributions using UNMIX and Positive Matrix Factorization. Environ. Sci. Technol. 2004; 38, 202-209.
102. Kasumba, J., Hopke, P.K., Chalupa, D.C., Utell, M.J., Comparison of sources of submicron particle number concentrations measured at two sites in Rochester, NY, Sci. Total Environ. 2009; 407, 5071–5084.
103. K.L. Sundqvist, M. Tysklind, P. Geladi, P.K. Hopke, K. Wiberg, PCDD/F source apportionment in the Baltic Sea using Positive Matrix Factorization, Environ. Sci. Technol. 2010; 44, 1690–1697.
104. H. Li, P.K. Hopke, X. Liu, X. Du, F. Li, Application of positive matrix factorization to sources apportionment of surface water quality of the Daliao River Basin, Northeast China, Environmental Monitoring and Assessment 2015; 187, 80-92.
105. Paatero, P., The Multilinear Engine - a Table-Driven Least Squares Program for Solving Multilinear Problems, Including the n-Way Parallel Factor Analysis Model. Comput. Graphic. Stats. 1999; 8, 1-35.
106. Amato, F., Pandolfi, M., Escrig, A., Querol, X., Alastuey, A., Pey, J., Perez, N., Hopke, P.K., Quantifying road dust resuspension in urban environment by multilinear engine: A comparison with PMF2. Atmos. Environ. 2009; 43, 2770-2780.
107. Escrig, A., Monfort, E., Celades, I., Querol, X., Amato, F., Minguillón, M.C., Hopke, P.K., Application of optimally scaled target factor analysis for assessing source contribution of ambient PM10. J. Air Waste Manage. Assoc. 2009; 59, 1296-1307.
108. Amato, F., Hopke, P.K., Source apportionment of the ambient PM2.5 across St. Louis using constrained positive matrix factorization. Atmos. Environ. 2012; 46, 329-337.
109. Amato, F., Pandolfi, M., Viana, M., Querol, X., Alastuey, A., Moreno, T., Spatial and chemical patterns of PM10 in road dust deposited in urban environment. Atmos. Environ. 2009; 43, 1650-1659.
110. Paatero, P., Hopke, P.K., Rotational tools for factor analytic models. J. Chemom. 2009; 23, 91-100.
111. E. Pere-Trepat, P.K. Hopke, P. Paatero, Source Apportionment of Time and Size Resolved Ambient Particulate Matter Measured with a Rotating DRUM Impactor, Atmospheric Environ. 2007; 41, 5921–5933.
112. N. Li, P.K. Hopke, P. Kumar, S.S. Cliff, Y. Zhao, C. Navasca, Source Apportionment of Time and Size Resolved Ambient Particulate Matter, Chemom. Intell. Lab. Syst. 2013; 129, 15–20.
113. J.E. Davidson, P.K. Hopke, Implications of Incomplete Sampling on a Statistical Form of the Ambient Air Quality Standard for Particulate Matter, Environ. Sci. Technol. 1984; 18, 571-580.
114. G.O. Salako, P.K. Hopke, Impact of Percentile Computation Method on PM 24-hour Air Quality Standard, J. Environ. Manage. 2012; 107, 110-113.
115. E. Kim, P.K. Hopke, E. Edgerton, Source Identification of Atlanta Aerosol by Positive Matrix Factorization, J. Air Waste Manage. Assoc. 2003; 53, 731-739.
116. G. Wang, P.K. Hopke, J.R. Turner, Using Highly Time Resolved Fine Particulate Compositions to Find Particle Sources in St. Louis, MO, Atmospheric Pollution Research 2011; 2, 219-230.
117. Y. Zeng, P.K. Hopke, A Study on the Sources of Acid Precipitation in Ontario, Canada, Atmospheric Environ. 1989; 23, 1499-1509.
118. M.D. Cheng, P.K. Hopke, L. Barrie, A. Rippe, M. Olson, S. Landsberger, Qualitative Determination of Source Regions of Long-Range Transported Aerosol Using Data Collected at Canadian High Arctic, Environ. Sci. Technol. 1993; 27, 2063-2071.
119. M.D. Cheng, P.K. Hopke, Y. Zeng, A Receptor-Oriented Methodology for Determining Source Regions of Particle Sulfate Composition Observed at Dorset, Ontario, J. Geophys. Res. 1993; 98, 16,839-16,849.
120. Y.-J. Han, T.M. Holsen, P.K. Hopke, S.-M. Yi, Comparison between Back-Trajectory Based Modeling and Lagrangian Backward Dispersion Modeling For Locating Sources of Reactive Gaseous Mercury, Environ. Sci. Technol. 2005; 39, 1715-1723.
121. N. Gao, M.D. Cheng, P.K. Hopke, Potential Source Contribution Function Analysis and Source Apportionment of Sulfur Species Measured at Rubidoux, CA during the Southern California Air Quality Study, 1987, Anal. Chim. Acta 1993; 277, 369-380.
122. N. Gao, M.D. Cheng, P.K. Hopke, Receptor Modeling of Airborne Ionic Species Collected in SCAQS, 1987, Atmospheric Environ. 1993; 28, 1447-1470.
123. P.K. Hopke, N. Gao, M.D. Cheng, Combining Chemical and Meteorological Data to Infer Source Areas of Airborne Pollutants, Chemom. Intell. Lab. Syst. 1993; 19, 187-199.
124. A. Fan, P.K. Hopke, T. Raunemaa, M. Oblad, J.M. Pacyna, A Study on the Potential Sources of Air Pollutants Observed at Tjorn, Sweden, Environ. Sci. Poll. Res. 1995; 2, 107-115.
125. N. Gao, P.K. Hopke, N.W. Reid, Possible Sources of Some Trace Elements Found in Airborne Particles and Precipitation in Dorset, Ontario, J. Air Waste Manage. Assoc. 1996; 46, 1035-1047.
126. M.D. Cheng, N. Gao, P.K. Hopke, Source Apportionment Study of Nitrogen Species Measured in Southern California in 1987, J. Environ. Engineer. 1996; 122, 183-190.
127. J.H. Lee, P.K. Hopke, Apportioning Sources of PM2.5 in St. Louis, MO using Speciation Trends Network Data, Atmospheric Environ. 2006; 40, S360-S477.
128. D.L. Coe, S.B. Reid, Research and Development of Ammonia Emission Inventories for the Central States Regional Air Planning Association. Final Report. STI Report # STI-902501-2241FR (2003).
129. M.D. Goebes, R. Strader, C.I. Davidson, An ammonia emission inventory for fertilizer application in the United States, Atmospheric Environment 2003; 37, 2539-2550.
130. D.M. Kenski, D. Gay, S. Fitzsimmons, Ammonia and Its Role in Midwestern Haze, Regional and Global Perspectives on Haze: Causes, Consequences and Controversies Visibility Specialty Conference, Air & Waste Management Association, Asheville, NC, October 25-29, 2004.
131. W. Zhao, P.K. Hopke, L. Zhou, Spatial Distribution of Source Locations for Particulate Nitrate and Sulfate in the Upper-Midwestern United States, Atmospheric Environ. 2007; 41, 1831-1847.
132. R.L. Poirot, P.R. Wishinski, Visibility, sulfate, and air mass history associated with the summertime aerosol in North Vermont. Atmospheric Environ. 1986; 20, 1457-1469.
133. L. Zhou, P.K. Hopke, W. Liu, Comparison of two trajectory-based models for locating particle sources for two rural New York sites. Atmospheric Environ. 2004; 38, 1955-1963.
134. J.R. Brook, D. Johnson, A. Mamedov, Determination of the source areas contributing to regionally high warm season PM2.5 in eastern North America. J. Air Waste Manage. Assoc. 2004; 54, 1162-1169.
135. Stohl, A., Trajectory statistics - a new method to establish source–receptor relationships of air pollutants and its application to the transport of particulate sulfate in Europe. Atmospheric Environment 1996; 32, 947–966.
136. The R Project for Statistical Computing, https://www.r-project.org/
Figure Captions
Figure 1. Dendrogram of the hierarchical cluster analysis of sediment composition data from Chautauqua Lake, NY.
Figure 2. Map of Chautauqua Lake with site locations indicated to show to which cluster they belong. The symbols are defined as to for cluster A, X for cluster B, 0 for cluster C and + for cluster D.
Figure 3. CPF plot for Pb measured at the Midwest Supersite in East St. Louis, MO.
Figure 4. PSCF Map for sulfate measured in downtown St. Louis, MO between 2001 and 2003.
Figure 5. PSCF Map for nitrate measured in downtown St. Louis, MO between 2001 and 2003.