Journal of Environmental Radioactivity 89 (2006) 150–158 www.elsevier.com/locate/jenvrad
Classification of soil samples according to their geographic origin using gamma-ray spectrometry and principal component analysis
Snezana Dragović a,*, Antonije Onjia b
a Institute for the Application of Nuclear Energy – INEP, Radioecology, Banatska 31b, 11080 Belgrade, Serbia and Montenegro
b Vinca Institute of Nuclear Sciences, PO Box 522, 11001 Belgrade, Serbia and Montenegro
Received 20 September 2005; received in revised form 20 March 2006; accepted 4 May 2006
Available online 21 June 2006
Abstract
A principal component analysis (PCA) was used for classification of soil samples from different locations in Serbia and Montenegro. Based on the activities of radionuclides (226Ra, 238U, 235U, 40K, 134Cs, 137Cs, 232Th and 7Be) detected by gamma-ray spectrometry, the soils were classified according to their geographical origin. Application of PCA to our experimental data resulted in a satisfactory classification rate (86.0% correctly classified samples). The obtained results indicate that gamma-ray spectrometry in conjunction with PCA is a viable tool for soil classification.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Principal component analysis; Radionuclides; Soils; Classification
* Corresponding author. Tel.: +381 11 199 242; fax: +381 11 618 724. E-mail address: [email protected] (S. Dragović).
0265-931X/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.jenvrad.2006.05.002
1. Introduction
Owing to advances in analytical instrumentation, it is now possible to generate relatively large data sets in a cost-effective manner, a situation that may not have been feasible a number of years ago. These data sets are difficult to evaluate using simple univariate statistical methods,
especially due to their complexity and multivariate nature. Multivariate methods have been widely applied to evaluate and interpret the large amounts of data generated by current analytical methods (Brereton, 2003). Moreover, multivariate data analysis searches for relationships among objects (samples), among variables, and between objects and variables. Principal component analysis (PCA) is one of the multivariate statistical methods frequently used in exploratory data analysis (Joliffe, 1986; Wold et al., 1987). PCA allows the transformation and visualization of complex data sets into a new perspective in which the most relevant information is made more obvious. Using PCA, contamination sources may be identified and their geographic and temporal distributions estimated. PCA is also frequently employed for analysis and interpretation of the data sets obtained in monitoring programmes.
It is of considerable interest to be able to determine the geographic origin of soils analytically. The geographic origin can be recognized with minimum effort if the relevant constituents are analyzed and the results included in data analysis algorithms. A number of analytical techniques and methods may be used in the classification of soil samples of different taxonomy and/or geographic origin (Eswaran et al., 2002). In the last few years, concentration profiles of chemical substances in conjunction with pattern recognition methods have been widely employed to classify soils (Ramadan et al., 2001; Antonić et al., 2003; McBratney et al., 2003; Slavković et al., 2004).
The specific problem in this study was to discriminate soil samples from Serbia and Montenegro according to geographic origin. Soil samples were collected from different regions and analyzed by gamma-ray spectrometry. The samples were then classified according to their origin based on their radionuclide (226Ra, 238U, 235U, 40K, 134Cs, 137Cs, 232Th and 7Be) activities.
Natural environmental radioactivity depends on the geological and geographic conditions, and appears at different levels in the soils of each geological region (UNSCEAR, 2000). The specific levels of terrestrial environmental radiation are related to the geological composition of each lithologically separated area, and to the content of natural radionuclides in the rocks from which the soils in each area originate. Geologically, the territory of Serbia and Montenegro includes a great number of rock complexes (magmatic, sedimentary and metamorphic rocks) which are markedly different with respect to age, genesis, mineral content and petrochemical and geochemical characteristics. Outstanding differences in natural radioactivity of soils can exist in relation to their geological origin (Dimitrijević, 1995). Therefore, the set of natural radionuclides used in this work to differentiate the investigated areas is based on geology. The concentrations of cesium isotopes are influenced by the altitudes of the sampling areas. Ecosystems at higher altitudes are predisposed to receive more fallout and therefore higher concentrations of these radionuclides (Howard et al., 1991; Dragović et al., 2004). The concentrations of beryllium on the ground are higher in areas of high rainfall. Since precipitation is generally higher in upland regions, an increase in the concentration of beryllium with altitude is to be expected (Salisbury and Cartwright, 2005). For these reasons, cesium isotopes and beryllium were used in our work to differentiate areas according to their susceptibility to fallout.
2. Materials and methods
2.1. Samples
Samples of undisturbed soils were taken from 15 locations in Serbia and Montenegro (Fig. 1) during 2003: 1 – Slatina (n = 7), 2 – Beljanica (n = 6), 3 – Željevica (n = 6), 4 – Kopaonik (n = 10), 5 – Avala (n = 6), 6 – Devojacki Bunar (n = 8), 7 – Bukulja (n = 8), 8 – Kosmaj (n = 7), 9 – Stara Planina (n = 5), 10 – Surdulica (n = 5), 11 – Bogićevica (n = 6), 12 – Durmitor (n = 8), 13 – Kosovska Kamenica (n = 7), 14 – Kukavica (n = 5) and 15 – Loznica (n = 9). A total of 103 samples, including six to eight subsamples, were collected down to a depth of 5 cm from each location. The vegetation and organic debris were removed from the samples. The samples were dried at 105 °C to constant weight and then powdered and passed through a 2 mm mesh sieve to homogenize them. The homogenized samples were placed in 1 L Marinelli beakers, which were sealed hermetically and kept aside for about a month to ensure equilibrium between 226Ra and its daughters before being taken for gamma spectrometric analysis.
Fig. 1. Simplified geological map of Serbia and Montenegro with sampling locations (towns are marked by dots and mountains by rectangles).
2.2. Measuring setup
Measurements were performed using an HPGe gamma-ray spectrometer (ORTEC-AMETEK model GEM 25, 34% relative efficiency and 1.65 keV FWHM for 60Co at 1.33 MeV, 8192 channels). All samples were measured for 60 ks. The spectra obtained were processed using Gamma Vision 32 software (ORTEC, 2001). The 238U activity was evaluated through the gamma-ray emission at 63.3 keV (branching 4.8%) of its daughter 234Th, neglecting the 63.8 keV gamma ray from 232Th, which has a branching as low as 0.27%. For the determination of the 235U activity the gamma-ray line at 143.8 keV was used. The 226Ra activity was determined through the gamma-ray energies at 295.2 and 351.9 keV of 214Pb and those at 609.3, 1120.3 and 1764.5 keV of 214Bi. For the measurements of the 232Th activity, the gamma-ray lines at 911.1 and 969.1 keV of 228Ac were used. The 137Cs, 134Cs, 40K and 7Be isotopes were directly measured at 661.7, 795.8, 1460.8 and 477.6 keV, respectively. In the case of 134Cs, first- and second-order coincidence summing corrections were calculated, taking into account only pair coincidences and pair and triple coincidences, respectively (Debertin and Schötzig, 1979). Prior to sample measurement, the environmental background at the laboratory site was determined with an empty Marinelli beaker under identical measurement conditions. Background spectral intensities were later subtracted from the corresponding sample intensities. Uncertainties of all measurements were calculated taking into consideration the counting statistical error and weighted systematic errors that include the uncertainty in the efficiency calibration, the uncertainty of the calibration source activity and uncertainties in the nuclide library used.
2.3. Principal component analysis
Principal component analysis is a powerful, linear, unsupervised pattern recognition technique used as a mathematical tool for analyzing, classifying and reducing the dimensionality of numerical data sets in a multivariate problem (Brereton, 2003; Joliffe, 1986; Antonić et al., 2003; Hopke, 2003). In unsupervised pattern recognition, there is no a priori information on the classification of any of the objects, and the pattern recognition method is used to find the group structure in the data, i.e. the groups of similar objects. Generally, the k-th principal component $\mathrm{PC}_k$ is a linear combination of the n response vectors $X_{ij}$ for the analyte under study:
$$\mathrm{PC}_k = a_{1k} X_{1j} + a_{2k} X_{2j} + a_{3k} X_{3j} + \cdots + a_{nk} X_{nj} \qquad (1)$$
whilst the sum of the squared coefficients $a_{ik}$, called loadings, is set to unity. The loading of a single variable indicates how much that variable participates in defining the PC (the squares of the loadings indicate their percentage in the PC). Typically, PCA decomposes the primary data matrix by projecting the multidimensional data set onto a new coordinate base formed by the orthogonal directions with maximum variance in the data. The eigenvectors of the data matrix are called principal components and they are mutually uncorrelated. The principal components $\mathrm{PC}_k$ are ordered so that PC1 displays the greatest amount of variance, followed by the next greatest, PC2, and so on. The magnitude of each eigenvector is expressed by its own eigenvalue, which gives a measure of the variance related to that principal component. As a result of the coordinate change, it is possible to reduce the data dimensionality to the most significant principal components and to eliminate the less important ones without any considerable loss of information. In mathematical terms, n correlated random variables are transformed into a set of d ≤ n uncorrelated variables. These uncorrelated variables are linear combinations of the original variables and can be used to express the data in a reduced form. The main features of PCA are the coordinates of the data in the new base (score plot) and the contribution of each variable to each component (loading plot). The score plot is usually used for studying the classification of the data clusters, while the loading plots give information on the relative importance of the variables to each principal component and on their mutual correlation (Penza et al., 2002). In other terms, the PCA procedure can be applied by finding the eigenvectors of the primary
matrix of the data points and forming a transformation matrix from these eigenvectors, ordered so that the corresponding eigenvalues are in decreasing order. The number of significant components was estimated from the eigenvalues. The technique has three effects: it orthogonalizes the components of the input vectors (so that they are uncorrelated with each other), it orders the resulting orthogonal components (principal components) so that those with the largest variation come first, and it eliminates those components which contribute least to the variation in the data set (Wu and Massart, 1996). The software package SPSS 10.0 was used in this study (SPSS 10.0 for Windows).
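The procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the SPSS 10.0 routine used in the study: the function name `pca_correlation` and the synthetic 103 × 8 matrix are hypothetical and introduced only for the example, and the sketch assumes that the eigen analysis is carried out on the correlation matrix of the variables.

```python
import numpy as np


def pca_correlation(X):
    """PCA by eigendecomposition of the correlation matrix (rows = samples)."""
    # Standardize each variable to zero mean and unit sample variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the original variables.
    R = np.corrcoef(X, rowvar=False)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(R)
    # Reorder so that the component with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scores: coordinates of the samples in the new orthogonal base.
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores


# Hypothetical stand-in data (NOT the measured activities): 103 samples,
# 8 correlated variables generated from 3 latent sources plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(103, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.5 * rng.normal(size=(103, 8))

eigvals, eigvecs, scores = pca_correlation(X)
explained = 100 * eigvals / eigvals.sum()
n_retained = int(np.sum(eigvals > 1.0))  # components with eigenvalue above one

print("eigenvalues:", np.round(eigvals, 3))
print("explained variance (%):", np.round(explained, 1))
print("components with eigenvalue > 1:", n_retained)
```

Each column of `eigvecs` holds the coefficients of one component and has unit sum of squares, and the columns of `scores` are mutually uncorrelated with variances equal to the eigenvalues, which mirrors the three effects listed above.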
3. Results and discussion
Descriptive statistics of radionuclide activities in the soil samples are given in Table 1. Radionuclide activities in these samples varied by factors of up to 3–5 for 226Ra, 238U, 235U, 40K and 232Th, 7 for 7Be, 13 for 134Cs, and 21 for 137Cs. The standard deviation was the greatest for 40K and the smallest for 134Cs. The arithmetic mean and the standard deviation of the activities for all samples and locations were used to describe the central tendency and variation of the data. Grubbs' test, an important procedure for removing outliers from the data set, was applied prior to PCA (Grubb, 1969). No outliers were detected in the data set. The Shapiro–Wilk test (significance level α = 0.05) (Shapiro and Wilk, 1965) for normality of the activity distribution within each radionuclide was applied first (Table 1) and revealed normal distribution of the data.

Table 1
Descriptive statistics and results of the Shapiro–Wilk test of normality of radionuclide activities (Bq/kg) in soil samples collected from Serbia and Montenegro

| Parameter | 226Ra | 238U | 235U | 40K | 232Th | 134Cs | 137Cs | 7Be |
|---|---|---|---|---|---|---|---|---|
| Descriptive statistics | | | | | | | | |
| Mean activity | 30.8 | 29.7 | 1.36 | 567 | 40.7 | 0.1 | 48.3 | 1.79 |
| Uncertainty | 1.4 | 1.2 | 0.2 | 9.8 | 1.1 | 0.1 | 0.4 | 0.5 |
| Median | 29.9 | 30.2 | 1.38 | 593 | 39.7 | 0.05 | 42.9 | 1.50 |
| Mode | 32.7 | 15.4 | 1.14 | 728 | 45.3 | 0.02 | 31.2 | 0.70 |
| Standard deviation | 9.14 | 9.42 | 0.462 | 163.6 | 13.54 | 0.074 | 26.19 | 1.045 |
| Skewness | 0.42 | 0.44 | 0.53 | 0.17 | 1.20 | 0.91 | 0.44 | 0.66 |
| Kurtosis | 0.45 | 0.33 | 0.22 | 0.73 | 2.18 | 0.56 | 0.39 | 0.91 |
| Range | 41.3 | 38.8 | 2.10 | 648 | 65.1 | 0.24 | 106.8 | 3.40 |
| Minimum | 13.6 | 14.6 | 0.51 | 271 | 18.3 | 0.02 | 5.25 | 0.54 |
| Maximum | 54.9 | 53.4 | 2.61 | 919 | 83.4 | 0.26 | 112 | 3.94 |
| Shapiro–Wilk statistics | | | | | | | | |
| R | 0.989 | 0.986 | 0.986 | 0.978 | 0.940 | 0.947 | 0.986 | 0.946 |
| P-value | 0.083 | 0.043 | 0.037 | <0.010 | <0.010 | <0.010 | 0.035 | <0.010 |

Fig. 2. Eigen analysis of the correlation matrix (scree plot: eigenvalue versus component number).

The multivariate data set consists of eight radionuclides (variables) measured over 103 soil samples (observations), yielding a 103 × 8 (rows by columns) primary data matrix. The number of significant PCs, which represent linear combinations of the original radionuclide activities, was selected on the basis of the Kaiser criterion (Kaiser, 1960). This criterion retains only PCs with eigenvalues that exceed one. The scree plot test, which consists of plotting the eigenvalues against the number of the extracted components and finding the point where the smooth decrease of eigenvalues appears to level off to the right of the plot, eliminates components that contribute to factorial scree only. The scree plot for the given data set (Fig. 2) showed
that only the first three components complied with the Kaiser criterion. Hence, the reduced dimensionality of the descriptor space is three. Table 2 shows the eigenvalues and percentage variance for the principal components extracted. The first three PCs explained 81.7% of the total variance among the eight variables, where the first component (PC1) contributed 51.5%, the second component (PC2) 16.0%, and the third (PC3) 14.2% of the total variance. The remaining 18.3% of the variance present in the data is regarded as unexplained, and the factors belonging to this variance can readily be omitted.

Table 2
Eigen analysis of the correlation matrix

| PC | Eigenvalue | Proportion (%) | Cumulative (%) |
|---|---|---|---|
| 1 | 4.120 | 51.5 | 51.5 |
| 2 | 1.283 | 16.0 | 67.5 |
| 3 | 1.139 | 14.2 | 81.7 |
| 4 | 0.720 | 9.0 | 90.8 |
| 5 | 0.523 | 6.5 | 97.3 |
| 6 | 0.127 | 1.6 | 98.9 |
| 7 | 0.065 | 0.8 | 99.7 |
| 8 | 0.023 | 0.3 | 100 |

To gain a better insight into the latent structure (hidden regularities) of the data and to ensure that the resulting factors were uncorrelated, the PCA extracted correlation matrix was subjected to varimax (variance maximizing) orthogonal rotation. The application of varimax rotation to the standardized component loadings enabled us to obtain a clear system as a result of the maximization of the component loadings variance and the elimination of invalid components, i.e. after varimax orthogonal rotation was applied, factors were obtained comprising the radionuclide loadings. Table 3 shows the three significant factors obtained. The factors comprise the activities, and the loadings report how each radionuclide is related to these factors. The first factor (48.8% of variance) comprises 226Ra, 238U and 232Th with high loadings. All radionuclides have positive loadings in this factor. No significant loadings were obtained for any variable of factors 2 and 3, which are responsible for 17.6% and 15.3% of the total variance, respectively. Such a distribution of the total variance means that it is possible to compress the information provided in the data set onto the first three PCs (i.e. to project the information carried by the original variables onto the three PCs) without losing any substantial information. The radionuclides with high loadings (226Ra, 238U and 232Th) in the first factor are the most useful for differentiating soil samples based on geography, i.e. the classification could be done if only these radionuclides were measured. The reason lies in the large variability in the data, as discussed above. There are other contributions to the classification described in high-order factors, and there is some variance due to uncertainty in the measurements, but these are negligible.

Table 3
Varimax rotated factor loadings of radionuclide activities in soil samples

| Variable | Factor 1 | Factor 2 | Factor 3 | Communality |
|---|---|---|---|---|
| 226Ra | 0.943 | 0.228 | 0.089 | 0.950 |
| 238U | 0.926 | 0.250 | 0.170 | 0.950 |
| 235U | 0.866 | 0.389 | 0.191 | 0.937 |
| 40K | 0.181 | 0.866 | 0.125 | 0.798 |
| 232Th | 0.920 | 0.091 | 0.124 | 0.871 |
| 134Cs | 0.019 | 0.146 | 0.884 | 0.803 |
| 137Cs | 0.144 | 0.565 | 0.582 | 0.678 |
| 7Be | 0.713 | 0.216 | 0.028 | 0.556 |
| Variance | 3.9068 | 1.4103 | 1.2251 | 6.5422 |
| % Var | 48.8 | 17.6 | 15.3 | 81.8 |

The PCA, applied to the given data set, has shown a differentiation between the soil samples according to their geographic origin (Fig. 3). A quantitative estimate of the classification efficiency can be obtained from the classification matrix (Table 4), which shows the number of samples in each class and the percentage of correctly classified samples. An overall 86.0% correct classification was achieved. Samples from nine locations were accurately classified (100%). Some samples from the other locations were misclassified, i.e. the classification rate ranged from 50 to 85.7%. Some samples were incorrectly assigned as they fell on the borders between their own class and another one.
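The arithmetic behind Tables 2 and 3 can be checked with a short script. The sketch below is only an illustration: it takes the eigenvalues and loading magnitudes exactly as printed in the tables, recomputes the proportion of variance (eigenvalue divided by the number of variables), applies the Kaiser criterion, and recomputes communalities and per-factor variances as sums of squared loadings. The variable names are assumptions made for the example, and small differences from the printed values are expected because the inputs are already rounded.

```python
# Eigenvalues as printed in Table 2 (correlation matrix of eight variables).
eigenvalues = [4.120, 1.283, 1.139, 0.720, 0.523, 0.127, 0.065, 0.023]

total = sum(eigenvalues)  # equals the number of variables for a correlation matrix
proportion = [100 * ev / total for ev in eigenvalues]

cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

# Kaiser criterion: keep components whose eigenvalue exceeds one.
retained = [k + 1 for k, ev in enumerate(eigenvalues) if ev > 1.0]

# Loading magnitudes as printed in Table 3 (factor 1, factor 2, factor 3).
loadings = {
    "226Ra": (0.943, 0.228, 0.089),
    "238U": (0.926, 0.250, 0.170),
    "235U": (0.866, 0.389, 0.191),
    "40K": (0.181, 0.866, 0.125),
    "232Th": (0.920, 0.091, 0.124),
    "134Cs": (0.019, 0.146, 0.884),
    "137Cs": (0.144, 0.565, 0.582),
    "7Be": (0.713, 0.216, 0.028),
}

# Communality of a variable: sum of its squared loadings over the retained factors.
communality = {name: sum(l * l for l in row) for name, row in loadings.items()}

# Variance carried by a factor: sum of squared loadings over all variables.
factor_variance = [sum(row[j] ** 2 for row in loadings.values()) for j in range(3)]

print("retained components:", retained)
print("proportion (%):", [round(p, 1) for p in proportion])
print("cumulative (%):", [round(c, 1) for c in cumulative])
print("communalities:", {k: round(v, 3) for k, v in communality.items()})
print("factor variances:", [round(v, 4) for v in factor_variance])
```

Because an orthogonal rotation only redistributes variance among the retained factors, the three factor variances recomputed here should add up to roughly the same total as the first three eigenvalues in Table 2.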
Fig. 3. Score plot of PC1 and PC2 illustrating the differentiation of soil samples according to their geographic origin (sampling locations 1–15).
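The varimax rotation applied to the component loadings before interpretation can be illustrated with the widely used SVD-based iteration. This is a hedged sketch and not the SPSS routine used in the study: the function `varimax`, its parameters `max_iter` and `tol`, and the random 8 × 3 loading matrix are assumptions introduced for the example, and SPSS may additionally apply Kaiser normalization, which is omitted here.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        previous = objective
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion.
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p
        u, s, vh = np.linalg.svd(loadings.T @ target)
        rotation = u @ vh  # closest orthogonal matrix to the target direction
        objective = s.sum()
        if previous != 0.0 and objective / previous < 1.0 + tol:
            break
    return loadings @ rotation, rotation


# Hypothetical unrotated loadings for 8 variables on 3 factors (illustration only).
rng = np.random.default_rng(1)
unrotated = rng.normal(scale=0.5, size=(8, 3))

rotated, rotation = varimax(unrotated)

print("rotation matrix:\n", np.round(rotation, 3))
print("rotated loadings:\n", np.round(rotated, 3))
```

The rotation matrix is orthogonal, so the communality of every variable (the row sum of squared loadings) is unchanged; only the distribution of variance among the factors is altered.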
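Table 4
Classification matrix

| Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No. of samples | 7 | 6 | 6 | 10 | 6 | 8 | 8 | 7 | 5 | 5 | 6 | 8 | 7 | 5 | 9 |
| Corr. class. (%) | 100 | 66.7 | 100 | 50.0 | 50.0 | 100 | 100 | 85.7 | 100 | 100 | 100 | 100 | 71.4 | 100 | 66.7 |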
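The text does not state how the overall classification rate was aggregated from the per-class rates in Table 4. The short check below, a sketch using only the numbers printed in the table (the variable names are assumptions for the example), compares two plausible aggregations: the unweighted mean of the 15 per-class percentages and the fraction of all 103 samples that were correctly classified.

```python
# Values as printed in Table 4, classes 1 to 15.
n_samples = [7, 6, 6, 10, 6, 8, 8, 7, 5, 5, 6, 8, 7, 5, 9]
pct_correct = [100, 66.7, 100, 50.0, 50.0, 100, 100, 85.7,
               100, 100, 100, 100, 71.4, 100, 66.7]

# Number of correctly classified samples per class, recovered from the percentages.
n_correct = [round(n * p / 100) for n, p in zip(n_samples, pct_correct)]

# Aggregation 1: every location counts equally.
unweighted_rate = sum(pct_correct) / len(pct_correct)

# Aggregation 2: every sample counts equally.
weighted_rate = 100 * sum(n_correct) / sum(n_samples)

fully_correct_classes = sum(1 for p in pct_correct if p == 100)

print("total samples:", sum(n_samples))                    # 103
print("correctly classified samples:", sum(n_correct))     # 87
print("classes at 100%:", fully_correct_classes)           # 9
print("unweighted mean of class rates: %.1f%%" % unweighted_rate)  # 86.0%
print("sample-weighted rate: %.1f%%" % weighted_rate)              # 84.5%
```

On these numbers the reported 86.0% coincides with the unweighted mean of the per-class rates, whereas counting samples gives about 84.5%; this is an observation from the table, not a statement of the authors' procedure.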
The results obtained in this study are in good agreement with those obtained in many other studies dealing with soil classification. PCA has been shown to be an efficient tool for soil classification based on the content of trace elements (Slavković et al., 2004) or polycyclic aromatic hydrocarbons (Golobocanin et al., 2004). The application of PCA in forensic classification of soils of different origins (Thanasoulias et al., 2002) gave a classification efficiency of 85%, which is close to the efficiency obtained in this study.
From the study presented here, it was concluded that adequate classification (86% correct classification of the data set) of soil samples of different origins could be achieved by analyzing their gamma-ray spectra and applying principal component analysis.
Acknowledgements
This work was supported by the Ministry of Science and Environmental Protection of the Republic of Serbia (Contract No. ON142039).
References
Antonić, O., Pernar, N., Jelaska, S.D., 2003. Spatial distribution of main forest soil groups in Croatia as a function of basic pedogenetic factors. Ecol. Modell. 170, 363–371.
Brereton, R.G., 2003. Data Analysis for the Laboratory and Chemical Plant. John Wiley & Sons, Ltd, West Sussex.
Debertin, K., Schötzig, U., 1979. Coincidence summing corrections in Ge(Li)-spectrometry at low source-to-detector distances. Nucl. Instrum. Methods 158, 471–477.
Dragović, S., Nedić, O., Stanković, S., Bacić, G., 2004. Radiocesium accumulation in mosses from highlands of Serbia and Montenegro: chemical and physiological aspects. J. Environ. Radioact. 77, 381–388.
Dimitrijević, M., 1995. Geology of Yugoslavia. Geoinstitute, Belgrade (in Serbian).
Eswaran, H., Rice, T., Ahrens, R., Stewart, B.A., 2002. Soil Classification: A Global Desk Reference. CRC Press.
Golobocanin, D.D., Škrbić, B.D., Miljević, N., 2004. Principal component analysis for soil contamination with PAHs. Chemom. Intell. Lab. Syst. 72, 219–223.
Grubb, F., 1969. Procedures for detecting outlying observations in samples. Technometrics 11, 1–21.
Howard, B.J., Beresford, N.A., Hove, K., 1991. Transfer of radiocaesium to ruminants in unimproved natural and seminatural ecosystem and appropriate countermeasures. Health Phys. 61, 715–725.
Hopke, P.K., 2003. The evolution of chemometrics. Anal. Chim. Acta 500, 365–377.
Joliffe, I.T., 1986. Principal Component Analysis. Springer, New York.
Kaiser, H.F., 1960. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 20, 141–151.
McBratney, A.B., Mendonça Santos, M.L., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3–52.
ORTEC, 2001. Gamma Vision 32, Gamma-ray spectrum analysis and MCA emulation, Version 5.3, Oak Ridge, USA.
Penza, M., Cassano, G., Tortorella, F., 2002. Identification and quantification of individual volatile organic compounds in a binary mixture by SAW multisensor array and pattern recognition analysis. Meas. Sci. Technol. 13, 846–858.
Ramadan, Z., Song, X.H., Hopke, P.K., Johnson, M.J., Scow, K.M., 2001. Variable selection in classification of environmental soil samples for partial least square and neural networks models. Anal. Chim. Acta 446, 233–244.
Salisbury, R.T., Cartwright, J., 2005. Cosmogenic 7Be deposition in North Wales: 7Be concentrations in sheep faeces in relation to altitude and precipitation. J. Environ. Radioact. 78, 353–361.
Shapiro, S.S., Wilk, M.B., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591–611.
Slavković, L., Škrbić, B., Miljević, N., Onjia, A., 2004. Principal component analysis of trace elements in industrial soils. Environ. Chem. Lett. 2, 105–108.
SPSS 10.0 for Windows.
Thanasoulias, N.C., Piliouris, E.T., Kotti, M.S.E., Ermiridis, N.P., 2002. Application of multivariate chemometrics in forensic soil discrimination based on the UV–Vis spectrum of the acid fraction of humus. Forensic Sci. Int. 130, 73–82.
UNSCEAR, 2000. Sources and effects of ionizing radiation. Report to General Assembly, with Scientific Annexes, United Nations, New York.
Wold, S., Esbensen, K., Geladi, P., 1987. Principal component analysis. Chemom. Intell. Lab. Syst. 2, 37–52.
Wu, W., Massart, D.L., 1996. Artificial neural networks in classification of NIR spectral data: selection of the input. Chemom. Intell. Lab. Syst. 35, 127–135.