Environmental Forensics (2002) 3, 323–329 doi:10.1006/enfo.2002.0102, available online at http://www.idealibrary.com
Aspects of Hydrocarbon Fingerprinting Using PLS – New Data From Prince William Sound

Stephen M. Mudge*
School of Ocean Sciences, University of Wales – Bangor, Menai Bridge, Anglesey LL59 5EY, U.K.

(Received May 2002, Revised manuscript accepted July 2002)

Partial least squares (PLS) techniques are used in the re-analysis of NOAA hydrocarbon data previously investigated in Mudge (2002). New data have been provided for coal and oil signatures and these have been investigated further. The effects of zeros (less than the limit of detection) in the dataset can be overcome by addition of small values at approximately half of this limit; this then enables logarithms to be taken of the entire dataset, which greatly improved the usefulness of principal component analysis (PCA). Source samples collected close to each other had different signatures, probably due to their environmental histories; this was also seen when aliphatic hydrocarbons were included in the signatures. Key compounds describing each could be seen in Coomans' Plots. Signatures developed from formation oils, riparian oils and coals from the eastern Gulf of Alaska (GoA) provided mean fits to subtidal samples within PWS of 22, 19 and 38% respectively. This suggests mixed and variable sources across the sampling area. The overall conclusion must be a question regarding the partitioning between oil and coal source materials as they look very similar in this particular location.

Keywords: PLS; hydrocarbon; fingerprinting; Gulf of Alaska.

© 2002 AEHS. Published by Elsevier Science Ltd. All rights reserved.
Introduction
Within the last few years there have been several advances in the use of multivariate statistical techniques applied to large chemical datasets. This has been helped by the widespread use of personal computers and of statistical and mathematical modelling packages. However, simply having access to a tool does not mean it can be applied to all data and in all cases. In the recent past, Partial Least Squares (PLS) has been applied to several datasets including that of the ExxonMobil and NOAA teams debating the issue of background hydrocarbons in Prince William Sound (PWS) and the Gulf of Alaska (GoA) (Mudge, 2002).

*Tel: 44-1248-351151; Fax: 44-1248-716367; E-mail: s.m.mudge@bangor.ac.uk

The PLS technique was developed by Wold et al. (1984) and has evolved into a powerful tool in environmental forensics (e.g. Yunker et al., 1995; Mudge and Seguel, 1999). In essence, PLS performs PCA on data which are defined as the signature (Geladi and Kowalski, 1986). This dataset, which can be chemical, physical or biological in nature, is called the X-Block and ideally will be a pure source sample but could be made up of environmental samples which have a high proportion of a single source such as coal deposits or oiled areas. Since the samples come from the same source, although the concentrations may vary, PCA will generate a principal component 1 (PC1) that explains most of the variance in the analytical data, typically >90%. This projection, or vector in n-dimensional space where n is the number of chemical compounds analysed, can be described by a series of loading factors on each compound; those compounds which have a major impact on PC1 will have high loadings (either positive or negative) whereas those compounds which are relatively unimportant and, therefore, do not have a major influence on the data, will have values close to zero. PC2 is fitted orthogonal to the first component so there is no component of PC1 influencing PC2.

Once the first two PCs have been elucidated, their projection can be described in terms of the two sets of loadings. These projections, which represent the signature defined in terms of the chemical compounds used, can now be applied to the environmental data (Y-Block). The amount of variance explained by each X-Block signature can be quantified. This can be shown graphically either through a scatter plot of the weightings on each sample or as the total variance explained. If the signature is similar to that of the environmental data, a high value for the explained variance is produced. Conversely, if a poor fit is produced, the explained variance is small. Each signature can be fitted in turn and all are fitted independently of each other. If none of them explains the variation seen in the data, the fits will be small in every case. A fuller treatment of the PLS methodology, including the matrix manipulations used, can be found in Geladi and Kowalski (1986).

The advantage of PLS over other methods (e.g. simple ratios) is the way it uses all compounds and develops a signature based on the internal relationships between each one. In general, the more compounds that are used, the better the specificity of the signature.

PCA can be used independently of the PLS technique to determine the number of potential sources that may be present in the Y-Block. The scores plot from such an analysis will group sites according to their chemical composition; those that co-vary are likely to have the
same or a similar source. Inspection of the groupings may provide an insight into the number of sources, although care must be exercised when dealing with mixtures of variable composition. This PCA technique may also be used to explore the source data and determine the groupings within the possible source materials.

The debate over the PWS and GoA background hydrocarbons is as follows. Short and co-workers from the NOAA Auke Bay Laboratory and USGS cite evidence, based on the presence of coal on nearby beaches and the extensive coal measures between the Bering River and Ice Bay, that the background hydrocarbons come from such deposits (Short et al., 1999). In contrast, Page and co-workers from Bowdoin College, Arthur D. Little and the ExxonMobil Corporation believe they originate from natural oil seeps in the same regions around the eastern shores of the GoA (Page et al., 1999; Boehm et al., 2001), although both groups recognise the importance of eroding source rocks. The motivation for their work revolves around the potential toxicity of the pre-spill sediments based on their PAH content.

A previous paper by Mudge (2002) using PLS indicated a mixed source which varied geographically across the region. This work also indicated significant overlap between signatures and suggested that some rivers are "coal" sources and some are "oil" sources. Recently, Van Kooten (2000) has indicated that several of the Alaskan coals are thermally immature and show very strong oil-prone characteristics. For instance, 25% of one coal could be converted into an oil upon heating.

This paper has two principal aims. The first briefly explores aspects of the use of PLS in developing signatures from environmental samples, including zeros in the dataset and non-normal distributions. The second is to investigate the use of some new source data from the GoA area from Short (unpublished data), which complement the previous work (Mudge, 2002). This previous work used data from both ExxonMobil and NOAA although source allocation was not really possible with the latter data.
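The signature-fitting idea outlined earlier in this Introduction can be illustrated with a minimal sketch. It assumes a simplified reading of the method: PC1 of a mean-centred X-Block is taken as the signature, and the fit of a sample is the fraction of its centred sum of squares that lies along that loading vector. The function names, the power-iteration routine and the four-compound profile are hypothetical; they are not taken from SIMCA-P or from the datasets analysed here.

```python
import math


def column_means(rows):
    """Mean of each compound (column) across the samples (rows)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def first_loading(rows, iterations=100):
    """Approximate the PC1 loading vector of a sample-by-compound table.

    Uses power iteration on the mean-centred data; returns (means, loading).
    """
    means = column_means(rows)
    centred = [[v - m for v, m in zip(row, means)] for row in rows]
    n_compounds = len(means)
    # Arbitrary unit-length start; it must not be orthogonal to PC1.
    loading = [1.0 / math.sqrt(n_compounds)] * n_compounds
    for _ in range(iterations):
        scores = [sum(v * w for v, w in zip(row, loading)) for row in centred]
        update = [sum(s * row[j] for s, row in zip(scores, centred))
                  for j in range(n_compounds)]
        norm = math.sqrt(sum(u * u for u in update))
        if norm == 0.0:  # no variance left to describe
            break
        loading = [u / norm for u in update]
    return means, loading


def explained_fraction(sample, means, loading):
    """Fraction of a sample's centred sum of squares lying along the loading."""
    centred = [v - m for v, m in zip(sample, means)]
    total = sum(c * c for c in centred)
    if total == 0.0:
        return 0.0
    score = sum(c * w for c, w in zip(centred, loading))
    return score * score / total


# Hypothetical X-Block: one source profile at four different concentrations.
profile = [4.0, 2.0, 1.0, 0.5]
x_block = [[c * p for p in profile] for c in (1.0, 2.0, 3.0, 4.0)]
means, loading = first_loading(x_block)

same_source = [5.0 * p for p in profile]  # same profile, higher level
other_source = [m + d for m, d in zip(means, [1.0, -2.0, 0.0, 0.0])]
print(explained_fraction(same_source, means, loading))   # close to 1
print(explained_fraction(other_source, means, loading))  # close to 0
```

A sample that shares the source profile is almost fully explained whatever its concentration, while a sample whose deviation is orthogonal to the signature is not explained at all; mixtures fall in between.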
Methods

Datasets
This paper presents the results of statistical analyses on data solely from the NOAA group. The new unpublished data provided by Short comprised 37 samples from four sources and 207 environmental samples analysed for 66 PAHs and aliphatic hydrocarbons. These data were specifically sought as they provide the source signatures that were unavailable for the previous work (Mudge, 2002). A second dataset, extracted from the Exxon Valdez total hydrocarbon database (EVTHD, from NOAA), was also used with the new source samples above. All samples were from depths >100 m and were from the east and north of the GoA and PWS.

Zeros
In reality, zeros in datasets usually mean one of two things: a value less than the detection limit of the method being used, or missing data. In the latter case, it is better to leave gaps in any dataset and let any software set these to the missing data code. That way, they are removed from the calculations and cannot influence any outcome. How one treats the data in the first case, however, may have consequences with regard to the results. PCA was performed on environmental data with the zeros present and after the addition of a small factor, in this case 0.01 ng/g wet weight, to simulate a value below the limit of detection.

Non-normal distributions
Most quantitative statistical methods require the data or the residuals to be normally distributed. In most environmental data, there is a spread of values from less than the limit of detection to relatively high values in source or near-source samples. An example of the distribution of naphthalene from the NOAA dataset can be seen in Figure 1.
Figure 1. Frequency histogram of the naphthalene data (ng/g) from the EVTHD dataset. The curve represents a normal distribution fitted to the data mean and standard deviation. [Axes: naphthalene concentration (horizontal); frequency and probability (vertical).]
This distribution is typical of most environmental data. One mechanism of overcoming this problem is to take the logarithmic value of the data, although this introduces problems of its own. These include the loss of zeros in the data (although this can be overcome) and, more seriously, the production of asymmetric values either side of any fitted line. The data from Figure 1 are replotted as a logarithmic frequency plot in Figure 2. To overcome the zeros when taking logarithms, a small value less than the limit of detection can be added as suggested above.
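The pre-treatment described above can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the use of None for a missing result is an assumption standing in for a package's missing data code, and the example concentrations are invented. Only the 0.01 ng/g offset comes from the text.

```python
import math

OFFSET_NG_G = 0.01  # small value added in this study (ng/g wet weight)


def log_transform(concentrations, offset=OFFSET_NG_G):
    """Return log10(x + offset) per value; None (missing data) stays None."""
    transformed = []
    for x in concentrations:
        if x is None:
            transformed.append(None)  # a gap stays a gap, it is not a zero
        else:
            transformed.append(math.log10(x + offset))
    return transformed


# Hypothetical values: one non-detect (0.0) and one missing result (None).
raw = [0.0, 0.99, 99.99, None]
print(log_transform(raw))
```

Keeping non-detects and missing results distinct matters here: the non-detect receives the offset and enters the analysis, whereas the missing result is left out of the calculations.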
Data range
Since the data can range over several orders of magnitude, this has a strong effect on the weighting given to those samples. Therefore, the data can be mean centred and scaled to unit variance to enable the compounds to be directly compared with each other.

PCA and PLS analyses
The data from the PWS and GoA areas were explored using these multivariate methods. To facilitate the whole statistical process, a PCA and PLS computer package from Umetrics (www.umetrics.com) was employed (SIMCA-P). During the exploratory phase of the data analysis, the effects of zeros and log transformations were investigated and the most appropriate methods used (log10(X + 0.01), where X is the concentration of each compound). Signatures were developed from coals, shales, riparian and formation oils using data supplied by Short. Environmental data came from the GoA and PWS, mainly extracted from the EVTHD.

Results
The data pre-treatments are considered initially, followed by an analysis of the new signature data from Short (unpublished data) using the most appropriate transformations.

Zeros
Addition of the small values to the datasets made no discernible difference to the PCA results. The explained variance, scores and loadings were identical and no detrimental effect could be seen from this approach. Problems may arise, however, if a sample has many compounds close to the limit of detection, when the added values may dominate and unduly affect the signature. Fortunately, no samples fell into this category with these data.

Non-normal distributions
Figures 1 and 2 demonstrate the usefulness of the logarithmic approach to normalizing datasets. The addition of the 0.01 ng/g also had no discernible effect on the results. However, when PCA is performed on these data with and without taking logarithms, obvious effects can be seen (Figures 3 and 4). The loadings for the two plots are similar, with naphthalene and alkyl naphthalenes from oil sources predominant towards the lower right of the figures, but the spread of the data in the log transform case makes it much easier to see groupings and clusters. This plot also indicates that a number of potential sources are needed to explain the variance in the data. The loadings suggest a main axis comprising a "coal to oil gradient" but with different types of each dictated by the spread about that axis.

PCA and PLS analyses
PCA was performed on new data for shales collected from Kayak Island (Short, unpublished data) which had been separated into size fractions prior to analysis. Two sites (A and B) within 8 m of each other were analysed separately, although it was anticipated that they would be used as a single signature. The scores from the PCA are shown in Figure 5. Several things are apparent from these results:
(1) Although they are only 8 m apart, the two sites have sufficiently different PAH signatures to be completely separated in a PCA.
Figure 2. Logarithmic frequency histogram of the data from Figure 1 with associated normal distribution. [Axes: log naphthalene concentration (horizontal); frequency and probability (vertical).]
Figure 3. The PCA scores and loadings of 217 PWS and GoA samples after addition of 0.01 ng/g to each concentration. [Panels: scores plot, t(1) against t(2); loadings plot, loadings on PC1 against loadings on PC2.]
Figure 4. The same data as Figure 3 but after log transformation. [Panels: scores plot, t(1) against t(2); loadings plot, loadings on PC1 against loadings on PC2.]
(2) All the size fractions within the B site have the same PAH signature, but for the A site there is a wider spread, with the <2 µm fraction significantly different from the rest.
Further analysis of these data by PLS indicates that these two shales have compounds that provide an insight into their geochemistry. When a Coomans' Plot (Coomans et al., 1984; a plot of the class distances for two separate signatures as a scatter plot) is produced, it indicates those compounds that are common to both signatures and are, therefore, not diagnostic. These lie in the lower left quadrant of Figure 6. Those compounds indicative of Shales A are in the top left and those indicative of Shales B are in the lower right. Compounds that do not fit into either would be in the upper right.
Figure 5. Score plot for two shales (A and B) collected 8 m apart on Kayak Island separated into size fractions (µm). [Axes: t(1) against t(2); points are labelled AShale or BShale with their size fraction.]
Figure 6. A Coomans' Plot indicating the compounds significant in each signature. [Annotations on the plot: "more coaly" and "more oily"; one axis is labelled Shales B.]
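The way the quadrants of a Coomans' Plot are read can be written out as a short sketch. It assumes that the horizontal axis is the distance to the Shales A model and the vertical axis the distance to the Shales B model, and that a critical distance is available for each class; the function name, the limits and the distances are hypothetical and do not come from the data behind Figure 6.

```python
def coomans_quadrant(dist_a, dist_b, limit_a, limit_b):
    """Name the Coomans' Plot quadrant for one point.

    dist_a, dist_b: distances to the class A and class B models.
    limit_a, limit_b: critical distances for each class (assumed inputs).
    """
    near_a = dist_a <= limit_a
    near_b = dist_b <= limit_b
    if near_a and near_b:
        return "lower left: common to both, not diagnostic"
    if near_a:
        return "upper left: diagnostic of A"
    if near_b:
        return "lower right: diagnostic of B"
    return "upper right: fits neither"


# Hypothetical distances for three compounds, with critical limits of 1.0.
for name, d_a, d_b in [("X1", 0.4, 0.5), ("X2", 0.3, 2.8), ("X3", 2.1, 0.6)]:
    print(name, coomans_quadrant(d_a, d_b, 1.0, 1.0))
```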
Coal signature with and without aliphatic hydrocarbons
As well as the PAHs, the new data include the concentrations of aliphatic hydrocarbons (C10–C34), including pristane and phytane. Since the shorter chain components are more volatile or more easily degraded than the majority of the PAHs, it may be possible to observe a change in signature based on the loss of these components. In this case, the EVTHD from NOAA was used with 3372 samples. The signature was developed from coals from Kosakuts River, Samovar Hills and Kayak Island. The results are displayed geographically in Figure 7. The majority of the best fit samples (0.7–0.9) fall outside PWS along the route of the GoA current, which is likely to be the principal transport mechanism of suspended materials. There is little change in the explained fit (76–80%) while in that region.

When the aliphatic hydrocarbons are excluded (Figure 8), 1.8% of the samples fall into the top class (70–90% of the variance explained) compared to only 0.6% when including the aliphatic hydrocarbons. Unfortunately, they are not as obvious on the diagram as they fall within PWS, where a wide diversity of fits can be seen even at closely related points. The high explained variances seen in the Cook Inlet disappear when the aliphatic compounds are excluded from the signature, suggesting a relatively "fresh" aliphatic-rich component to these samples.

Figure 7. The proportion of the variance explained using coal as a signature. The data include the aliphatic hydrocarbon component. [Map: longitude against latitude, with Cook Inlet, PWS and GoA labelled; symbol classes 0–0.1, 0.1–0.3, 0.3–0.5, 0.5–0.7 and 0.7–0.9.]

Figure 8. The proportion of the variance explained using coal as a signature. The data do not include the aliphatic hydrocarbon components. [Map: longitude against latitude, with Cook Inlet, PWS and GoA labelled; symbol classes 0–0.1, 0.1–0.3, 0.3–0.5, 0.5–0.7 and 0.7–0.9.]

Oil versus coal signatures
Three separate signatures were provided as possible sources of hydrocarbons to the region by Short (unpublished data). They were riparian oils, formation oils and coals. These three were used as separate X-Block signatures for a subtidal dataset from PWS. A wide diversity of fits was obtained and the results are summarized in the following table (Table 1).

Table 1. Percentage of samples in each explained variance class for the three potential sources

Explained variance   Riparian oils   Formation oils   Coals
0.0–0.2              43.9            32.5             25.3
0.2–0.4              40.9            25.4             27.2
0.4–0.6              14.4            39.7             27.8
0.6–0.8               0.8             2.4             19.1
0.8–1.0               0               0                0.6

There were clear differences between the fits but there appeared to be no geographical pattern that was readily explainable. In general, however, the coals had a larger number of good fits (>60%) compared to the oils and the mean fits were 38% for coal, 22% for formation oils and 19% for riparian oils. These data suggest that individually, the unaltered signatures for the oils and coals did not, on the whole, explain all the variance in samples; a possible exception could be made in the case of some samples with the coal signature. However, when added together, these signatures did explain most of the variance. In some cases, the summed fits were greater than one, again highlighting the overlap between signatures (e.g. Mudge, 2002).
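Why independently fitted signatures can give summed fits greater than one is easiest to see geometrically. The sketch below is a two-compound toy case, not the PLS calculation itself: each fit is taken as the squared cosine between an uncentred sample vector and a signature vector, and all vectors are hypothetical.

```python
def fit_fraction(sample, signature):
    """Squared cosine between a sample vector and a signature vector."""
    dot = sum(s * g for s, g in zip(sample, signature))
    sample_ss = sum(s * s for s in sample)
    signature_ss = sum(g * g for g in signature)
    return dot * dot / (sample_ss * signature_ss)


sample = [0.9, 0.3]

# Two overlapping (non-orthogonal) hypothetical signatures.
coal_like = [1.0, 0.0]
oil_like = [0.8, 0.6]
overlapping_sum = fit_fraction(sample, coal_like) + fit_fraction(sample, oil_like)

# Two orthogonal signatures for comparison.
orthogonal_sum = fit_fraction(sample, [1.0, 0.0]) + fit_fraction(sample, [0.0, 1.0])

print(overlapping_sum, orthogonal_sum)
```

With orthogonal signatures the two fractions partition the sample and sum to one; with overlapping signatures the shared direction is counted twice, so the sum exceeds one.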
Discussion
These results indicate the importance of data exploration and transformation in order to obtain the most information regarding source partitioning. This cannot be done without knowledge of the geochemical origins of the compounds in the samples.

When the concentrations of any component after analysis are less than the detection limit, they should be left as zeros in the raw data. This enables the person conducting the statistical analysis to add a small value to every concentration representing half the detection limit. This has no discernible effect on untransformed PCA. However, when logarithms are taken of the data, this addition allows those compounds to be used as part of the descriptor.

The distribution of the data should be investigated, possibly visually, prior to analysis and appropriate action taken if the values display a non-normal distribution. Several transformations are possible but the log10 is most widely used and normalises the data adequately in most cases. The advantage of this transformation can clearly be seen in Figure 4, where the same overall distribution as in Figure 3 is seen but at a wider scale, allowing easier interpretation.

The new data from Short contain several interesting insights. The PCA of the Kayak Island shales (Figure 5) highlights the different environmental histories that materials may have undergone even though they are located physically close. In this case, the Coomans' Plot (Figure 6) demonstrates that Shales A are more coaly in that they have large ring PAHs and anthracene as diagnostic compounds, whereas Shales B are more oily as they have the naphthalenes and their alkylated homologues as diagnostic compounds. This may result from different weathering patterns or residence times in the environment. Alternatively, it may be due to a different source of material. The results presented for the EVTHD by Mudge (2002) also indicated a wide range of possible fits for signatures even at the same site, dependent in some respects on sample matrix. Due to these different weathering and/or source processes, apportioning a single or restricted number of sources to the entirety of PWS seems inappropriate, whether it be coal or oil.

The weathering processes can be seen in part by the change in fit across the GoA into PWS when using signatures including aliphatic hydrocarbons. Van Kooten (2000) has highlighted the high hydrogen content of several coals in this area of Alaska, and including the aliphatic compounds in the coal signature seems appropriate. The PLS model demonstrates a gradient away from the eastern GoA towards PWS with few good fits in the Sound itself. When the aliphatic hydrocarbons are excluded from the signature, the best fits tend to be inside the Sound, although this is masked by the large number of poorer fits from other sources. On the whole, however, these coals provide a better description of the variance in the PWS data than either the riparian or formation oils (Table 1).

As before (Mudge, 2002), significant overlap still exists between signatures due to commonality between coal and oil and the fact that coals in this area can yield oil upon heating (Van Kooten, 2000). The question therefore becomes "when is a coal an oil?"
Conclusions
It may be concluded from this extension to the previous analysis that:
(1) There is a mixed source of hydrocarbons that contributes to the background in PWS.
(2) This mixed source includes coals and oils which exhibit a range of properties themselves, i.e. there is more than one type of coal and more than one type of oil. The signatures demonstrate an overlap between these two general sources and absolute partitioning may be impossible due to commonality between sources.
(3) Samples collected close to each other can exhibit different characteristics which may in part be due to different environmental histories and weathering processes. This can be demonstrated by the use of the more labile compounds in the source materials such as the aliphatic hydrocarbons and small (naphthalene) PAHs.
References
Boehm, P.D., Page, D.S., Burns, W.A., Bence, A.E., Mankiewicz, P.J. and Brown, J.S. 2001. Resolving the origin of the petrogenic hydrocarbon background in Prince William Sound, Alaska. Environ. Sci. Technol. 35, 471–479.
Coomans, D., Broeckaert, I., Derde, M.P., Tassin, A., Massart, D.L. and Wold, S. 1984. Use of a microcomputer for the definition of multivariate confidence regions in medical diagnosis based on clinical laboratory profiles. Comp. Biomed. Res. 17, 1–14.
Geladi, P. and Kowalski, B.R. 1986. Partial least squares regression: a tutorial. Anal. Chim. Acta 185, 1–17.
Mudge, S.M. 2002. Reassessment of the hydrocarbons in the Gulf of Alaska: identifying the source using Partial Least Squares. Environ. Sci. Technol. 36, 2354–2360.
Mudge, S.M. and Seguel, C.G. 1999. Organic contamination of San Vicente Bay, Chile. Mar. Poll. Bull. 38, 1011–1021.
Page, D.S., Boehm, P.D., Douglas, G.S., Bence, A.E., Burns, W.A. and Mankiewicz, P.J. 1999. Pyrogenic polycyclic aromatic hydrocarbons in sediments record past human activity: a case study in Prince William Sound, Alaska. Mar. Poll. Bull. 38, 247–260.
Short, J.W., Kvenvolden, K.A., Carlson, P.R., Hostettler, F.D., Rosenbauer, R.J. and Wright, B.A. 1999. Natural hydrocarbon background in benthic sediments of Prince William Sound, Alaska: oil vs coal. Environ. Sci. Technol. 33, 34–42.
Van Kooten, G.K. 2000. In: Alaska Geological Society & Geophysical Society of Alaska, Science and Technology Conference. (Swenson, R.F., Ed.).
Wold, S., Albano, C., Dunn, W.J., Edlund, U., Esbensen, K., Geladi, P., Hellberg, S., Johansson, E., Lindberg, W. and Sjöström, M. 1984. Multivariate data analysis in chemistry. In: Chemometrics: Mathematics and Statistics in Chemistry, (Kowalski, B.R., Ed.). Dordrecht, Holland, D. Reidel Publishing Company.
Yunker, M.B., Macdonald, R.W., Veltkamp, D.J. and Cretney, W.J. 1995. Terrestrial and marine biomarkers in a seasonally ice-covered Arctic estuary – integration of multivariate and biomarker approaches. Mar. Chem. 49, 1–50.