Analytica Chimica Acta, 191 (1986) 309-317 Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands
SOURCE ALLOCATION OF ORGANIC AIR POLLUTANTS BY APPLICATION OF FUZZY C-VARIETIES PATTERN RECOGNITION
K. E. THRANE*
Norwegian Institute for Air Research, N-2001 Lillestr#m (Norway) R. W. GUNDERSON
Utah State University, Logan, Utah (U.S.A.) (Received 27th March 1986)
SUMMARY Data for polynuclear aromatic hydrocarbons (PAH) from 44 air samples are processed by unsupervised clustering techniques in order to resolve the contributions from two sources (domestic heating and motor vehicles). The fuzzy c-varieties (FCV) clustering algorithms are applied. The cluster configuration which best describes the characteristic properties of the samples is selected by computation of validity discriminant coefficients. The FCV method permits the data samples to belong partially to different clusters, and source apportionments are estimated by multiplying the membership values by the PAH concentrations of the individual samples. The results are compared to those obtained by other methods of dispersion or receptor modelling in the same areas. The FCV method is valuable for estimating contributions from two types of emission sources.
Numerous air-quality studies have shown that polynuclear aromatic hydrocarbons (PAH) are one of the most important groups of pollutants in urban and residential areas. Such hydrocarbons are produced by incomplete combustion of organic material, the main sources being car engines and domestic heating (wood, oil, coke, etc.). Industrial sources may also contribute. Some PAH compounds are carcinogens, so that identification of sources and reduction of emissions are considered most important. However, estimation of the relative influence of different types of combustion source in an airshed has proved to be difficult. The problems involved are well known [l--8]. Different methods for source allocation of combustion products in air have been presented. Although several workers have stated that the relative amounts of the individual compounds vary depending on the source of the emission, and that the composition of PAH in an air sample may carry information about its origin, cluster analysis appears not to have been used previously to study apportionment of PAH from the two most important source types. Some studies have involved clustering techniques to estimate the contribution of PAH from an industrial source where the PAH are released by an electrolytic process [ 9-111. This paper demonstrates the application of unsupervised pattern recognition and cluster analysis in order to distinguish between the contributions of 0003-2670/86/$03.50
o 1986 Elsevier Science Publishers B.V.
310
PAH from heating and traffic. The relative contributions from the two sources, are estimated from the cluster analysis, and compared with results obtained by independent methods. THE CLUSTERING
METHOD
The fuzzy c-varieties (FCV) clustering method was applied. The FCV algorithms were first published by Bezdek et al. [ 121. Since then, they have been used in many cluster applications. In air pollution studies, the method was first used to estimate the contribution of PAH from aluminum industries relative to other sources in the regions near the factories, and proved to be a useful tool [ 9-111. The FCV program can be applied to unsupervised pattern recognition, i.e., it automatically searches for patterns in the data and clusters the data according to these patterns. The program should therefore be most useful when pre-specified structures are difficult to obtain, as in air pollution studies of combustion products. There is a basic difference between the FCV and most of the better-known methods for cluster analysis. Traditional methods require that every sample is assigned to only one of the clusters, whereas the FCV technique permits data samples to possess partial memberships in different clusters. In most cases, ambient air samples contain a mixture of pollutants from a number of sources so that the FCV technique is well suited to such samples. If the samples can be clustered according to the origin of the pollutants, the contribution from each source can be calculated from the membership values and the concentrations of PAH in each sample. Briefly, the FCV program is as follows. The measurement matrix X = (xl, . . ., x,) consists of n samples xk, each with d variables xk = (3tki. . . ix&, and c (c > 2) is the number of clusters sought. The membership value of object xk with respect to cluster i (i = 1, 2, . . ., c) is uik. The membership values are subject to the conditions 0 =Gu& G 1 and YE:, 1uik = 1. The computation consists of looping through the two equations
0, =
f k=l
(Uik)mxk/ f k=l
Ug
(2)
where Pi is the cluster centre vector for cluster i and 111is a fixed weighting exponent. The iteration proceeds until a pre-specified minimum change criterion is met for either the membership values or the cluster center vectors. Usually one chooses m = 2, but by increasing m, less weight is attached to the importance of small membership values. This means that the higher the value of m, the “fuzzier” the algorithms become. The choice m = 1 means that each sample is a member of only one cluster (hard clustering).
311
One of the advantages of fuzzy clustering is the possibility of using the fuzziness of a given configuration as an indicator of its quality. This can be achieved by computing the cluster validity coefficient (CVC) of the clusters, which is a pairwise measure of the separation, for increasing values of m [lo]. The CVC is the maximum value of the ratio of the distance between the two cluster centres to the weighted scatter of the two clusters. The algorithms are given in [lo]. This means that the greater the coefficient, the better the separation between data clusters. The membership weighted values in the definition of the cluster validity coefficient are also raised to the power of m. By introducing successively increasing values of m, the effect of samples with poor membership is increasingly filtered out. An indication of the cluster quality can therefore be obtained by comparing the values of the cluster discriminant from computations where m is increased stepwise. If there is little change in the values of CVC, the conclusion is that most of the data have shared membership values close to either zero or unity, which means a good quality. A marked increase in the values as m increases would be taken as an indication of substantial sharing between classes. Such a configuration would be considered of poor quality and rejected, Examples of good and poor quality of cluster configurations and the effect on the cluster validity coefficient when m is increased are illustrated in Fig. 1. MONITORING
STATIONS
AND METHODS
Locations and data collection The PAH data used for the cluster analysis were obtained from three different air pollution studies. The samples had been collected at four stations in South Norway (see Table 1) exposed to PAH mainly from domestic heating and car traffic. The relative contribution from the two source types varied from one station to another (Table 1); it was assumed that other sources were of minor importance in the areas near the sampling stations. One station was located in Elverum (population 10 000) where some wood is used for domestic heating, but oil and electricity are also used. These samples were collected in order to obtain information about the influence of residential wood combustion on the air quality in an inland town where the winter season is cold and inversions occur frequently in winter. The second station was situated in Aurskog (population 500) where wood is mostly used for domestic heating. This is a residential, sparsely built-up area and the only traffic is private cars. This site was chosen to obtain a “fingerprint” of the PAH in air emitted mainly from residential wood burning. The meteorological conditions in this area are quite similar to those in Elverum. Two stations were located in the same neighbourhood in Oslo (population 450 000). One was set up in a street with dense traffic during daytime (St. Olavsgate) and the other one in Nordahl Brunsgate, on top of a roof and facing a backyard. All samples were collected during January and February but not simultaneously (Table 1).
312
Fig. 1. Two examples illustrating cluster configurations of good and poor quality. Example 1 shows two well separated clusters. On increasing the weighting exponent m, few samples will be ignored; the changes of the CVC are minor. In example 2, there are samples with shared or low memberships that will be filtered out with increasing m, and the scatter S, and S, will decrease; this will result in a marked increase in the CVC. TABLE 1 Number of samples(n) collected at each station and information on the main contributors of PAH Station
n
PAH exposure
Elveruma Aurskogb Nordahl BrunsgateC St. OlavsgateC
15 14 7 8
Mainly domestic heating, but also motor vehicles Mainly domestic heating by wood burning Domestic heating and motor vehicles Mainly motor vehicles, but also domestic heating
a129 km north-east of Oslo, collected in 1983. b45 km due east of Oslo, collected 1984. “Downtown Oslo, collected in 1985 (see text).
in
Some of the studies included inorganic components such as lead and other metals, sulfur dioxide and nitrogen oxides, and the isotopic carbon ratio 14C/i2C. Meteorological observations such as ambient air temperature, wind speed and wind directions were obtained from nearby weather stations. In all the studies, the same methods for sampling and quantitation were applied; to obtain data of good quality, the methods had been carefully selected and tested [ 131. Sampling time was 24 h. Particulate PAH were collected on glass-fibre filters while the volatile portion was trapped on polyurethane foam plugs. Individual PAH components were quantified by high-resolution glass-capillary gas chromatography, integration of chromatographic peaks and standard calibration techniques.
313
Input data The selection of PAH compounds (Table 2) was based on the quality of the chemical data (i.e., consistency and minimum risk of interferences, degradations and losses) and on the importance of the individual PAH for source identification. For example, cyclopenta(cd)pyrene and coronene are associated with traffic while the main source of retene (1-methyl-7-isopropylphenanthrene) is wood combustion [ 141. The data were standardized by centering each column of features to the mean of the column and then normalizing to the standard deviation of the column. RESULTS
AND DISCUSSION
Clustering of PAH samples The FCV algorithm was used to partition the data into c = 2, c = 3 and c = 4 classes. The quality of the cluster configurations was determined by computing the cluster validity discriminant coefficient (CVC) for increasing weighting exponents (m). The CVC values (Table 3) were computed for successively increasing values of the membership weighting exponent (m). When the data were partitioned into two clusters (c = 2), only minor changes of the CVC were observed with increasing values of m, i.e., increasing m has little effect on the class models, which means that the configuration is of acceptable quality. Subclustering was tested for samples with a membership value >0.5 in cluster 1; the CVC increased slightly as m increased. When the data were partitioned into three clusters (c = 3), the discriminant values indicate good separation between clusters 1 and 2, but not between cluster 3 and the other two; thus partitioning into three clusters was not satisfactory. The CVC also increased markedly with increasing m when the computation was tested for four clusters; this configuration is therefore also of poor quality. On the basis of these results, partitioning of the data into two clusters followed by subclustering is considered superior to the other configurations. The memberships of the samples in the different clusters were therefore studied for these configurations. With c = 2, all the samples from Aurskog (mainly wood combustion) were assigned to one cluster. The samples from Elverum and from the Oslo stations were assigned to the other, i.e., samples exposed to pollution from both traffic and domestic heating were clustered together. The subclustering, however, separated street samples exposed mainly TABLE 2 The PAH compounds included in the cluster analysis NO.
Compound
NO.
Compound
NO.
Compound
NO.
Compound
1 2 3 4
Biphenyl Fluorene Phenanthrene Anthracene
5 6 ‘7 3
Fluoranthene Pyrene Retene Benzo(b)fluorene
9 10 11 12
Cyclopenta(cd)pyrene Benzo(a)anthracene Chrysene/triphenylene Benzo(e)pyrene
13 14 15 16
Benzo(a)pyrene Indeno(l.2.3-cd)pJ Benzo(ghi)perylenc Coronene
314 TABLE 3 Results for cluster validity coefficients with different weighting exponents
(m)
Validity coefficient
Clusters
m=2
m=4
2 clusters, 1 and 2 2 subclusters of 1
3.26 2.90
6.18 7.18
3 clusters,
1 and 2 1 and 3 2 and 3
6.33 2.99 5.37
4 clusters,
1 1 1 2 2 3
20 6.13 9.26 4.8 7.11 2.39
and and and and and and
2 3 4 3 4 4
m=6
m=8
6.92 10.1
6.97 11.7
12.7 8.12 12.5
12.5 12.4 18.9
11.8 16.7 23.4
123 36.3 59.2 13.6 19.1 8.04
145 83.4 130 22.3 31 15.9
151 384 382 28.9 43.3 28.2
to car traffic from the samples collected in Nordahl Brunsgate and Elverum where the main source was domestic heating. For c = 3, most samples exposed mainly to traffic were assigned to one cluster, but their membership values were low. This indicates that the centre (prototype) of this cluster was not similar to the typical PAH pattern in air samples from traffic exhaust. The second and third clusters consisted of samples from Aurskog, and from Elverum and Nordahl Brunsgate, respectively. Cluster analysis evaluated for c = 4 gave almost the same result as for three clusters; the only difference was that the cluster containing the samples from St. Olavsgate was broken into two clusters. Two samples from this street station formed one cluster where the membership values for all the other samples were close to zero. The other samples from St. Olavsgate were assigned to the other cluster, but their membership values were low. The samples from Aurskog on the one hand and from Elverum and Nordahl Brunsgate on the other were grouped into the two remaining clusters. These results show that c = 2 and subclustering is the best way of separating the contribution from the two source types in this particular data set. It should, however, be noted that the street samples contained pollutants from both types of source, although most of the PAH came from car exhaust. The above discussion suggests that a more representative set of samples for the traffic exposure would have made the clustering less fuzzy and the interpretation of the clustering results easier. Also it would have been a great advantage if all the samples had come from the same area or sampling station. Source apportionment The relative contribution of PAH from the two sources, domestic heating and motor vehicles, was estimated by multiplying the membership values for
315
each sample in the clusters by the total PAH concentration measured in the sample. The sum of the weighted PAH estimates for each station was divided by the total measured PAH concentration at the site. The results are given in Table 4 together with results obtained by several independent methods in the same areas. Information on the fuel consumptions was available in Elverum and in Oslo and the relative emissions of PAH in the two areas were estimated on the basis of these data and emission factors [ 15, 161. Lahmann et al. [ 51 have used the ratio between two of the more stable PAH, indeno(1,2,3cd)pyrene and benzo(ghi)perylene (IP/BghiPe) to obtain a first impression of the relative contribution from the two types of source. This ratio should be 0.35-0.39 for PAH emitted from car exhaust only and 0.90 for PAH from domestic heating by coal and oil combustion [5]. Although the conditions in Norway are quite different from those in West Germany, the ratios were calculated for our samples. The relative contribution of particulate carbon in air from wood combustion was determined in 7 samples from Elverum by the isotopic ratio 14C/‘2C; the average result is given. Comparison of the results in Table 4 shows that there is satisfactory agreement between three of the estimated results in Elverum. The result obtained by using the IP/PghiPe ratio is obviously too high. The average ratio in air samples collected at this station was 1.15. Lahmann et al. [5] also found that the ratio exceeded 0.90 in winter and suggested that the types of fuel used and sources other than heating and traffic influenced the PAH profiles as well as the ratio of the two compounds. There is good agreement between the estimated results for Aurskog given the probable uncertainties in the initial data. The calculations based on consumption and emission factors were made for the entire Oslo area [ 161; as expected, the relative contribution from domestic heating at the street station was less than the influence from heating averaged over the Oslo area (Table 4). At Nordahl Brunsgate, the ratio IP/BghiPe yielded a result that was only 10% lower than the one obtained by cluster analysis. This ratio at St. Olavsgate was 0.35, indicating no PAH from heating although this station did receive pollutants from TABLE 4 Relative contribution Station
Elverum Aurskog Nordahl Brunsgate St. Olavsgate
of PAH from domestic beating at four monitoring stations Relative contribution
(%)
AB
Bh
CC
Dd
88 97 60 30
95
>lOO 100 50 0
90
35-55
‘Estimated by FCV. bEstimates based on consumption of fuel and emission factors [ 15, 161. CEstimated from the ratio indeno( 1,2,3cd)pyrene/benzo(ghi)perylene [ 51. dContribution of wood carbon estimated by the ratio 14C/12C.
316
domestic heating; accordingly, the contribution from heating obtained by this ratio seems to be too low. Lead is a good tracer for motor vehicles, and the concentrations of this pollutant were measured in Elverum and Oslo. The results from a regression analysis for lead and total PAH showed little correlation in Elverum (correlation coefficient r = 0.57); r was 0.83 and 0.82 in Nordahl Brunsgate and St. Olavsgate, respectively. The average ratios between lead and total PAH indicated that the contribution of PAH from traffic relative to the total contribution in Elverum was about 25% of the relative contribution in Oslo; the relative contributions from traffic at the two Oslo stations could not be distinguished by using these ratios [ 161. CONCLUSIONS
The control of hazardous pollutants in urban air depends on a good understanding of the origins and fate of the compounds. Reliable source identification is therefore essential, and several dispersion and receptor models have been developed. Dispersion models rely on information about emission rates, stack parameters and meteorological conditions, and are commonly used when such information is available. In many cases, emission data are difficult to obtain and receptor models based on air quality measurements have become alternatives for source allocation. In the case of organic pollutants, source allocation has proved to be particularly difficult. The main aim of this study was to find a practical solution to this problem. The results show that the FCV algorithms can be useful in a receptor model which identifies the major combustion sources of PAH in air and estimates their contributions to air pollution. These investigations demonstrate that this method leads to valid results. The FCV algorithms have several advantages over other methods. First, the computation can be done on the basis of the monitoring data only, without knowledge of emission inventories. Secondly, FCV can be used for unsupervised pattern recognition. In this mode, the compositions of the variables in air samples exposed to the individual sources need not be known prior to data processing. Unsupervised pattern recognition is therefore useful when the composition of the pollutants emitted by a source is unknown and when the individual compounds are unstable during transport from the source to the sampler. In air pollution studies of PAH, these problems are acute, because the PAH composition varies from one source to another within the same source type, depending on the combustion conditions, and decomposition of PAH can occur on exposure to light and air. Provided that there is a difference in the compositions (patterns) of air pollutants from the sources and that some of the samples selected for the cluster analysis contain pollutants typical for these sources, the FCV program can find these patterns and divide the data samples into clusters. These clusters can be traced back to the appropriate sources from knowledge of position of the source, meteorological conditions, topography of the environment or the concentration of tracer
317
elements [ 9, 111. Thirdly, when the FCV technique yields clusters that can be associated with sources, the prototype (centre) of each cluster will indicate a composition that is typical for the source related to that cluster. Thus information on the PAH composition in air samples from the various sources can be obtained and the largest emitters of the most hazardous PAH components can be identified. The FCV algorithms are powerful tools in the difficult studies of organic air pollutants, and their use can supplement, support and in some cases replace other methods. The authors are grateful to Mr. S. Larssen and Mr. J. Schjoldager for making the data available for this study. REFERENCES 1 J. M. Daisey, M. A. Leyko and T. J. Kneip, in P. W. Jones and P. Leber (Eds.), Polynuclear Aromatic Hydrocarbons, Ann Arbor Science Publishers, Ann Arbor, 1979, p. 201. 2P. S. Pedersen, J. Ingwersen, T. Nielsen and E. Larsen, Environ. Sci. Technol., 14 (1980) 71. 3 J. M. Daisey, in Receptor Models Applied to Contemporary Pollution Problems, Air Pollution Control, Pittsburg, 1982, p. 348. 4 S. K. Friedlander, in Proceedings of the 2nd International Conference on Carbonaceous Particles in the Atmosphere, Linz, Sept., 1983, Paper No. 2, Institut fir Analytische Chemie, Wien, 1983. 5 E. Lahmann, B. Seifert, L. Zhao and D. Bake, Staub-Reinhalt. Luft, 44 (1984) 149. 6 J. R. Cretney, H. K. Lee, G. J. Wright, W. H. Swallow and M. C. Taylor, Environ. Sci. Technol., 19 (1985) 397. 7 A. Greenberg, F. Darack, R. Harkov, P. Lioy and J. Daisey, Atmos. Environ., 19 (1985) 1325. 8 P. Masclet, G. Mouvier and K. Nikolaou, Atmos. Environ., 20 (1986) 439. 9 K. E. Thrane and L. WikstrSm, in M. Cook and A. J. Dennis (Eds.), Polynuclear Aromatic Hydrocarbons: Mechanisms, Methods and Metabolism, Columbus, Ohio, 1984, p. 1299. 10 R. W. Gunderson and K. E. Thrane, in J. J. Breen and P. E. Robinson (Eds.), Environmental Applications of Chemometrics, ACS symposium series 292, Washington, DC, 1985, p. 130. 11 K. E. Thrane, Atmos. Environ., (1987) in press. 12 J. C. Bezdek, C. Coray, R. W. Gunderson and J. D. Watson, SIAM J. Appl. Math., 40 (1981) 339,358. 13 K. E. Thrane, A. Mikalsen and H. Stray, Int. J. Environ. Anal. Chem., 23 (1985) 111. 14T. Ramdahl, Nature, 306 (1983) 580. 15 M. MCller, T. Ramdahl, I. Alfheim and J. Schjoldager, Environ. Int., 11 (1985) 189. 16 B. Sivertsen, The Relative Contribution of Air Pollutants from Various Sources to Man and Environment, Final Report - MIL 4. Norw. Inst. Air Research OR 41/85, Lillestrgm, 1985.