Applied Geochemistry 20 (2005) 341–352 www.elsevier.com/locate/apgeochem
Dealing with outliers and censored values in multi-element geochemical data – a visualization approach using XmdvTool
Katrin Grünfeld *
Royal Institute of Technology, Department of Land- and Water Resources Engineering, Teknikringen 72 bv, 100 44 Stockholm, Sweden Received 5 December 2003; accepted 30 August 2004 Editorial handling by A. Danielsson
Abstract Dealing with geochemical data also means coping with their underlying limitations that are related to sampling, analytical techniques, and other characteristics of the data. This paper discusses the issue of data cleaning, using a regional geochemical dataset of 6 heavy metals in glacial till. Interactive data manipulation techniques provided in the freeware visualization system XmdvTool were used for exploring both metal concentrations reported as under the detection limit, and high or extreme values (outliers) in the dataset. The proposed integrated visual evaluation (IVE) approach for selective removal of outliers outperformed simple removal of the highest concentrations of metals, showing that existing spatial multi-element fingerprints in data could be recognized and preserved by IVE. The uniqueness of visualization is in simultaneous display of both multivariate and spatial information. Being simple and interactive, integrated visual evaluation can be recommended as a valuable complementary tool in cleaning and analysing multi-element geochemical data. © 2004 Elsevier Ltd. All rights reserved.
1. Introduction Geochemical data are widely used for exploration purposes and environmental investigations. For describing and analysing the data, tools from traditional univariate statistics to complex spatial statistical and multivariate statistical techniques are employed. However, such data pose specific problems due to the fact that they are related to sample weights, sample spacing, sampling scheme, and the analytical techniques applied. The data may thus contain both sampling bias (introduced during the sampling process), and/or measurement bias (introduced as a part of the measurement or preparation and analytical process). Missing values as well as values under the analytical detection limit (censored values) may be present in geochemical sampling data. Moreover, most geochemical data do not follow a normal distribution, and common data sets contain an abundance of rather small values along with a few very large ones, so-called outliers (Reimann and Filzmoser, 2000). Geochemical data are always complex and contain many variables, and signals from geological and other factors that influence the material from which the geochemical samples are collected appear as multielement patterns and anomalies. An important feature of geochemical data is that the sampled values are often spatially autocorrelated.
* Tel.: +46 87906810; fax: +46 87907030. E-mail address: [email protected].
0883-2927/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.apgeochem.2004.08.006
K. Grünfeld / Applied Geochemistry 20 (2005) 341–352
All the above-mentioned properties of geochemical data have an influence on the choice of suitable analysis methods. Techniques used for data description should be relevant for the data at hand, and the importance of a systematic approach to evaluate multi-element geochemistry has been emphasized (Grunsky and Smee, 1999). Reimann and Filzmoser (2000) suggested that in order to get one-dimensional insight, a careful description of each element regarding the distribution, censored values and outliers should be carried out (most often by means of univariate statistics) as the first step in analysing multi-element geochemical data. This first step is a prerequisite to successful data cleaning and transformation before multivariate statistical analysis is approached (Reimann et al., 2002). Depending on the data properties, raw data may or may not be valuable to explore. Raw geochemical data are not widely used, because the commonly applied statistical and geostatistical techniques available in standard software packages have an assumption that data distributions are close to normal or Gaussian distribution. As a general rule, almost all measured variables in regional geochemical data have neither normal nor lognormal data distributions, and it has been demonstrated by Reimann and Filzmoser (2000) that neglecting this fact will lead to biased or faulty results when parametric statistical methods are used. Any automatic data treatment or preprocessing may either increase the bias, or some useful information will be lost. One can also argue that raw geochemical data are most meaningful, especially when the results are to be interpreted. In this study, the focus is on exploring and removing censored and outlying values from multi-element regional geochemical data. The decisions to be made about censored and outlying values may be either removal or replacement. 
Removal is most often absolute and regarding outliers a certain percentage of the samples from the tail of the frequency distribution of element concentrations are removed. Replacement is a compromise and the outlying values (and/or censored values) can be replaced with more suitable concentrations, to improve the distribution without any loss of sample support. One should keep in mind there is a difference between values and samples. In multi-element data, values (concentrations or contents) are related to a univariate distribution of a chemical element while samples contain values of a number of measured chemical elements. A value reported at, or less than, the detection limit is likely an overestimate of the true value and as a result, the estimate of the mean and variance of the sample population will be positively biased (Grunsky and Smee, 1999). Rawlins et al. (2002) suggested that when the number of samples with values less than the lower detection limit become significant, the detection limit is considered too high to provide reliable data and those
chemical elements should be left out from the analysis. A significance limit being 1% has been proposed by Grunsky and Smee (1999), and only if the number of censored values is less than 1%, a simple replacement of a censored value by a value of 0.33–0.5 times the lower limit of detection is acceptable. Unfortunately, as mentioned by Reimann et al. (2002), the rare or other elements with considerable number of concentrations under detection limits may often be the most interesting ones to study. To aid analysis and interpretation, geochemical data should be cleaned of strange or erroneous values, and outliers should be dealt with in one way or another, like removing anomalous samples or changing their values. Whether removed or not, outlying samples can contain valuable information on mineralization or pollution, so their recognition and correct interpretation is important. This has never been an easy task for a data analyst as several well-grounded decisions have to be made. The first question is how to differentiate between strange outliers and the ones, which belong to geochemical fingerprints, and the second question is how many outliers should be removed from the data. Different approaches can be found in the literature, for example, Rawlins et al. (2002) suggested that in regional geochemical data values above 97.5% do not belong to the natural geochemical baseline and are most often explained by natural mineralization or contamination. The traditional approaches of outlier removal are commonly applying univariate statistics, to identify a number of samples in the highest range of the concentration to be deleted as outliers. The disadvantage of the univariate approaches is in ignoring the relationships between the concentrations of the different chemical elements in the dataset. The standard parametric multivariate statistical techniques for outlier detection are not robust to the presence of outliers and cannot thus be applied to raw geochemical data. 
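The replacement rule attributed above to Grunsky and Smee (1999) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `replace_censored`, its parameters and the example concentrations are hypothetical and do not come from the paper, and a value is treated as censored when it is at or below the detection limit.

```python
def replace_censored(values, detection_limit, factor=0.5, max_fraction=0.01):
    """Substitute censored values with factor * detection_limit.

    Returns a new list, or None when the share of censored values is not
    below max_fraction (simple substitution is then not considered acceptable).
    """
    if not 0.33 <= factor <= 0.5:
        raise ValueError("factor should lie in the 0.33-0.5 range quoted in the text")
    censored = [v <= detection_limit for v in values]
    if sum(censored) / len(values) >= max_fraction:
        return None
    return [factor * detection_limit if c else v for v, c in zip(values, censored)]


# Hypothetical Ni concentrations in ppm with a 2 ppm detection limit.
few_censored = [5] * 199 + [2]    # 1 of 200 values (0.5%) is censored
many_censored = [5] * 9 + [2]     # 1 of 10 values (10%) is censored

print(replace_censored(few_censored, 2)[-1])   # 1.0
print(replace_censored(many_censored, 2))      # None
```

Returning `None` for the second case is a design choice of this sketch; it only signals that the element would need a different treatment, such as being left out of the analysis.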
Outlier detection in multivariate data has traditionally been studied by the statistical community, and the performance of several tests has been reported in Becker and Gather (2001), but robust techniques may not be familiar or available to a non-expert, and have most often not been tested with geoscientific datasets. For most of the estimators, like projection pursuit techniques (Pan et al., 2000), a multivariate normal distribution is required. The relative efficiency of statistical tests for detecting outliers in normally distributed geochemical samples has been reported by Velasco et al. (2000). The detection tools may differ depending on the number of expected outliers in the data (few or multiple outliers), and often models are fitted to data. Subjecting raw geochemical data to automatic transformation in order to achieve a normal distribution is not recommended because the presence of extreme outliers can substantially affect the outcome. According to Atkinson and Riani (1997), different chemical elements
may need different transformations, and graphical plots of data provide clear and informative indications of the importance of individual observations. Interactive tools that encourage exploration of data are becoming more popular, and recently the use of simple exploratory techniques has been suggested for outlier detection in the first stages of the analysis of geochemical and environmental data (Reimann et al., 2002). Unwin (2000) argues that to some extent graphical methods are the only practical alternative for analysing spatial data. Unfortunately, most commonly available uni- and multi-variate statistical approaches cannot account for the spatial component that is always present in geochemical survey data. At the same time, geostatistical tools, like experimental variograms, can be used to detect the presence of errors and strange outliers, but only in univariate data. As the traditional data analysis techniques are mostly numerical, there is a need for tools that aid visualization of individual multi-element samples in raw geochemical data regarding the distribution of the concentrations of each element, relations between the elements, and spatial reference. This study explores a regional geochemical dataset in multidimensional feature space, using the visualization freeware XmdvTool. The paper demonstrates how individual censored/outlying samples can be visualized and explored in multi-element and spatial space, thus facilitating the decision-making process and selective removal. The illustrations show to what degree the univariate removal of high-valued outliers may discard information on existing spatial patterns, for example mineralization in a regional geochemical dataset.
2. Materials and methods The data used in the present study consist of 1411 samples, collected over an area of 100 × 100 km in southern Sweden in a regional geochemical survey of glacial till. The sampling scheme was irregular (1 sample per 6 km²), and samples were taken from below the zone of weathering. Element concentrations thus indicate the natural (or geogenic) patterns derived from underlying geology. The fine fraction (<0.06 mm) of till samples was analysed by XRF for the total contents of elements. The present study concerns 6 heavy metals: Cu, Co, Pb, Ni, V and Zn. Precision of the data (the smallest concentration step) is 1 ppm and the detection limits of the metals are between 2 and 10 ppm. The area has been studied before and there are known relations of high levels of Pb with acid (felsic) volcanic rocks, and of the remaining 5 metals with basic (mafic) rocks (Zhang et al., 1998). The glacial movement is known to be in the direction from NNW to SSE. A simplified geological map of the area is presented in Fig. 1. The overall method of the study is presented in Fig. 2 as a data flow diagram.
Visualization freeware XmdvTool was created by Professor M. Ward, Worcester Polytechnic Institute, Massachusetts, USA, and is available from http://davis.wpi.edu/~xmdv/. From the provided interactive display techniques, scatterplot matrices and parallel coordinates were used for visualization of the geochemical data. The datasets cleaned from censored and outlying values were interpolated and analysed in the raster geographic information system Idrisi32. In XmdvTool each variable (or dimension) may be independent of or interdependent with one or more of the other variables. Variables may be discrete or continuous in nature, or take on symbolic (nominal) values. Each dimension corresponds to an axis, and in a parallel coordinate display the N axes are organized as uniformly spaced vertical lines. A data element in N-dimensional space manifests itself as a connected set of points (one on each axis), forming a polyline. In a scatterplot matrix, which can be opened as an auxiliary display window, two-dimensional scatterplots of all pairs of variables are plotted (the scatterplot of variables X and Y indicates the spatial location of the samples within the defined study area). Brushing is a process in which a user can highlight (select), or delete (hide) a subset of data being graphically displayed. In situations where multiple views of the data are being shown simultaneously, brushing is associated with linking, in which brushing data elements in one view affects the same data in all other views. In XmdvTool, the shape of the brush is that of an N-dimensional hyperbox. The user simply needs to specify N brush dimensions, and the mechanism used to perform this is a set of N slider bars. Brushes are displayed as shaded regions, with data points which fall within the brush highlighted in a different color. Four separate brushes can be defined in XmdvTool.
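The idea of an N-dimensional hyperbox brush with linked views can be illustrated with a minimal Python sketch. This is not XmdvTool code or its API: the function names, the dictionary layout of the samples and the example values are all hypothetical. A brush is represented as one (low, high) range per brushed dimension, and a dimension that is not listed is treated as unconstrained, which corresponds to a slider covering the whole axis.

```python
def in_brush(sample, brush):
    """True when the sample lies inside the hyperbox on every brushed dimension."""
    return all(low <= sample[dim] <= high for dim, (low, high) in brush.items())


def brushed_indices(samples, brush):
    """Indices of the brushed samples; reusing them in every view gives linking."""
    return [i for i, s in enumerate(samples) if in_brush(s, brush)]


# Hypothetical samples: coordinates in km, concentrations in ppm.
samples = [
    {"X": 10.0, "Y": 20.0, "Ni": 3, "Pb": 60},
    {"X": 55.0, "Y": 80.0, "Ni": 25, "Pb": 18},
    {"X": 90.0, "Y": 15.0, "Ni": 4, "Pb": 20},
]

low_ni = brushed_indices(samples, {"Ni": (0, 5)})
low_ni_low_pb = brushed_indices(samples, {"Ni": (0, 5), "Pb": (0, 30)})

# The same indices select the points to highlight in the X-Y scatterplot.
highlighted_xy = [(samples[i]["X"], samples[i]["Y"]) for i in low_ni]
print(low_ni, low_ni_low_pb, highlighted_xy)
```

Because the brush returns sample indices rather than values, a selection made on one axis can be shown both as polylines in parallel coordinates and as points in the X–Y scatterplot.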
From the original data (Dataset 1), percentile values were calculated for each element in order to provide a reference during a visual inspection of the data. Frequency distribution of the element concentrations was assessed using frequency histograms, applying class widths equal or close to original data precision of 1 ppm. Next the original multi-element dataset (including spatial coordinates of samples) was displayed in parallel coordinates and as a scatterplot matrix. During visual exploration the metal concentrations lower than detection limits were selected (brushed) for one metal at a time and the samples (visualized by polylines) studied for their multi-element contents and spatial location within the area (seen in scatterplot display window). To aid visual comprehension, interactive zooming of the display area to aid sample selection, and reordering of the dimensions to reveal relations of metals in selected samples were used. The concentrations of all metals in brushed data samples were printed and compared to percentile values. This iterative procedure of IVE (integrated visual evaluation) was repeated for
Fig. 1. Generalized geological map of the study area (in south of Sweden).
all values under the detection limit. All samples that both contained at least one censored value and were not considered informative regarding their multi-element pattern were deleted. The remaining data (Dataset 2) were saved as two identical files for the following outlier removal. The percentiles were calculated again to detect any changes. The next step involved dealing with multivariate outliers. Outlying values in multi-element samples of the whole dataset were selected considering one metal at a time, and selectively removed after studying the multi-element character and spatial location of the sample. The geological map was consulted and frequent reference was made to percentile values, in order to remove all very high concentrations as well as samples with high values not related to any visible structures in multi-element and spatial space. The IVE process was terminated when the distribution of polylines on the axes of parallel coordinates was considered to be acceptable in terms of
visual separability (that means the outliers clearly separated from the main body of data were removed, resulting in substantial decrease of the concentration ranges). The removed outlying samples were compiled into a separate file. The remaining cleaned dataset (Dataset 3) was imported into Idrisi32, and interpolated into surfaces of a regular grid with 50 m resolution, using an inverse distance-weighted average algorithm for exact interpolation between existing values (Map 3). The univariate removal of the outliers consisted of deleting high-value samples from the tail of the distribution of each metal, the number of samples being equal to the number of outliers removed by IVE approach. The new data (Dataset 4) were imported into GIS and interpolated as described above (Map 4). As pixel values in raster images Map 3 and Map 4 for the same metal are expected to be almost identical, except for sample locations connected to removed values, a simple division of images was chosen to visualize the differences.
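The univariate removal described above amounts to ranking the samples by one metal at a time and dropping the top k. The sketch below is a hypothetical illustration, not the procedure's original implementation: the function name, the sample layout and the example values are assumptions, and ties are broken by original sample order, which is an arbitrary choice.

```python
def remove_highest(samples, metal, k):
    """Drop the k samples with the highest concentration of one metal."""
    ranked = sorted(range(len(samples)), key=lambda i: samples[i][metal], reverse=True)
    dropped = set(ranked[:k])
    return [s for i, s in enumerate(samples) if i not in dropped]


# Hypothetical samples (ppm); the study removed k = 51 samples per metal.
samples = [
    {"id": 1, "Pb": 60, "Ni": 5},
    {"id": 2, "Pb": 20, "Ni": 90},
    {"id": 3, "Pb": 18, "Ni": 12},
    {"id": 4, "Pb": 25, "Ni": 10},
]

per_metal = {m: remove_highest(samples, m, k=1) for m in ("Pb", "Ni")}
kept_ids = {m: [s["id"] for s in rows] for m, rows in per_metal.items()}
print(kept_ids)   # {'Pb': [2, 3, 4], 'Ni': [1, 3, 4]}
```

In this toy example each metal loses a different sample, so the procedure produces one cleaned dataset per metal rather than a single common one.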
[Fig. 2 diagram: Dataset 1 → remove censored values → Dataset 2 (saved as two identical files). First copy of Dataset 2 → remove outliers using IVE → Dataset 3 (removed samples stored as Outliers) → interpolation to surface → Map 3. Second copy of Dataset 2 → remove 51 highest values → Dataset 4 → interpolation to surface → Map 4. Division Map 3/Map 4 → ratio map.]
Fig. 2. Method of study (integrated visual evaluation) presented as a data flow diagram.
Assuming that Map 3 images may have higher pixel values, ratio image of Map 3/Map 4 (pixels from the first image are divided by corresponding pixels from the second) was calculated for each element. The outlier dataset containing samples removed during IVE was studied in XmdvTool regarding element associations and spatial features.
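The interpolation and image division steps can be sketched as follows. This is a simplified stand-in for the GIS workflow, not a reproduction of Idrisi32: the distance exponent of 2, the use of all samples for every pixel (no search radius), and the example coordinates and concentrations are assumptions made for illustration.

```python
def idw(x, y, points, power=2.0):
    """Inverse distance-weighted estimate at (x, y); exact at sample locations."""
    num = den = 0.0
    for px, py, value in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        if d2 == 0.0:
            return float(value)          # exact interpolation: honour the sample
        weight = d2 ** (-power / 2.0)    # 1 / distance**power
        num += weight * value
        den += weight
    return num / den


def ratio_at(x, y, points_a, points_b):
    """Pixel value of the ratio image (map A divided by map B)."""
    return idw(x, y, points_a) / idw(x, y, points_b)


# Hypothetical (x, y, ppm) samples: dataset_3 keeps one high value that
# dataset_4 has lost.
dataset_4 = [(0, 0, 10), (10, 0, 10), (0, 10, 10), (10, 10, 10)]
dataset_3 = dataset_4 + [(5, 5, 100)]

for x, y in [(0, 0), (5, 2), (5, 5)]:
    print((x, y), round(ratio_at(x, y, dataset_3, dataset_4), 2))
```

At a location sampled in both datasets the ratio is 1, while pixels near the sample that survives in only one dataset depart from 1, which is what makes such locations visible in a ratio image.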
3. Results Histograms of the original dataset (Fig. 3) show the frequency distribution of metal values, and different levels of outlier contamination. Class width or interval was 1 ppm for Co, 2 ppm for Cu, Ni and V, and 3 ppm for Pb and Zn. The elements Pb, Ni and Cu exhibit the most skewed distributions and extreme outliers. Parallel coordinate display of the original Dataset 1 is shown in Fig. 4. The extreme outliers, having much higher concentrations, force the rest of the data values into a clutter of overlapping points, thus making it difficult to perceive the details in the distribution of lower concentrations. Scatterplot displays of variables X, Y and V are shown in Fig. 5, illustrating the importance of spatial reference for interactive exploration of geochemical data. The spatial trends present in the data can easily be discovered and visualized. The metals Cu, Co, Ni, V and Zn show similar trends in spatial distribution of their concentrations while Pb follows a different trend. The number of values under, or equal to, the detection limit was as follows: Pb 1 value, Co 2 values, Cu 14 values, and Ni 50 values. Thus, at most 67 samples could contain a censored value. Concerning the multi-element properties of those samples, the values under the detection limit for one element were in many samples coincident with the censored or low values of the other elements, but in some cases with the extremely high values of the other elements. For
[Fig. 3 histogram panels: Co, Cu, Ni, Pb, Zn, V; vertical axes show frequency (Freq), horizontal axes show concentration.]
Fig. 3. Original data (Dataset 1) presented as frequency histograms, with class intervals 1 ppm (Co), 2 ppm (Cu, Ni, V) and 3 ppm (Pb, Zn).
Fig. 4. Parallel coordinates display of the original dataset (Dataset 1), showing maximum concentrations and spatial coordinates on the vertical axes, and individual samples as polylines across axes. Brushed (in black) are all samples containing censored values of Ni.
example, censored values of Ni are related to values lower than the 30th percentile of Co and V, while single values of Cu, Zn and Pb reach the 80th to 95th percentiles, respectively (see Fig. 4). All censored values, which were contained in a total of 54 samples (about 3.8% of the original data), were removed during the cleaning process. The influence of the removal on the percentile values was found to be relatively small, as the percentiles changed by a maximum of 2 ppm (compared to the original data precision of 1 ppm). After dealing with censored values, the number of samples decreased from 1411 in Dataset 1 to 1357 in Dataset 2. There were several outliers in Dataset 2 that were identified as strange or atypical, for example, the maximum or highest values of all metals (see Figs. 3 and 4). A total of 51 samples (about 3.8% of Dataset 2) were removed using the IVE approach described above, resulting in Dataset 3 (Fig. 6). After univariate outlier removal, 6 separate datasets (Dataset 4) were compiled, because the samples removed during data cleaning were different for each
metal. Maximum values differ for the same elements in two datasets, with the data cleaned through the IVE approach having 10–36 ppm higher concentrations of the metals. With the exception of the relatively higher concentrations and number of outliers being larger in Dataset 3, there are no other considerable differences. After interpolation of metal concentrations in Datasets 3 and 4 the resulting maps (Map 3 and 4, not illustrated here) look quite similar, and differences can only be revealed using color manipulation. Division of images (respective pixel values in Map 3 and 4) resulted in pixel values between 0.48 and 3.77, indicating up to 3.77 times higher pixel values in Map 3. The values less than 1 (dark-colored) in the image (see Fig. 7) indicate pixels, where the interpolated Dataset 4 had higher values than interpolated Dataset 3 (that means, the location of outliers present only in Dataset 4), and the values greater than 1 (light-colored) mean the opposite – those are the locations of outliers present only in Dataset 3.
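The sample counts and percentages reported in this section can be checked with a few lines of arithmetic. All input numbers below are taken from the text; the variable names are illustrative, and the size of Dataset 3 is derived here rather than quoted.

```python
# Counts reported in the text.
censored_values = {"Pb": 1, "Co": 2, "Cu": 14, "Ni": 50}
n_dataset_1 = 1411
samples_with_censored = 54
outliers_removed = 51

# Upper bound on affected samples: reached only if no sample holds two censored values.
max_censored_samples = sum(censored_values.values())
n_dataset_2 = n_dataset_1 - samples_with_censored
n_dataset_3 = n_dataset_2 - outliers_removed   # derived here, not quoted in the text

share_censored = round(100 * samples_with_censored / n_dataset_1, 1)
share_outliers = round(100 * outliers_removed / n_dataset_2, 1)

print(max_censored_samples, n_dataset_2, n_dataset_3, share_censored, share_outliers)
# 67 1357 1306 3.8 3.8
```

The gap between the upper bound of 67 and the 54 samples actually removed is consistent with some samples holding censored values for more than one metal.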
Fig. 5. Scatterplot matrix of original data (Dataset 1), displaying variables X, Y and V. Brushed (in black) are samples with low concentrations of V, visualizing a presence of spatial trends in the distribution of V within the study area.
In the outlier data, most of the removed samples containing outlying values were located within, or in the vicinity of, areas with known elevated element contents (due to geological features) in the eastern and southeastern part of the study area (Fig. 8). Data cleaned from outliers using the univariate approach (Dataset 4) were not studied separately.
4. Discussion and conclusions In the initial stages of analysis, raw geochemical data are often characterized by frequency histograms. Histograms summarize data distribution and may have variable class width due to different levels of outlier contamination, and the extreme outliers may even be left out. In case the concentration range of an element is large and/or analytical measurement precision is very good, the whole range of the histogram cannot be plotted using original measurement precision as class width, because the maximum number of classes is limited. For example, only Co in the present dataset had a range of values that allowed a class width of 1 ppm (the original precision) while for other metals 2 or 3 ppm were used (see Fig. 3). Increasing the class interval (and decreasing the number of classes) means generalizing the frequency distribution, which may affect the visual appearance, like location of mode. However, histograms are still useful when the number and extremity of outliers are being assessed in raw data.
Compared to a histogram (see Fig. 3), a parallel coordinates display of univariate data (Fig. 4) preserves all the original details in data and may be linked to spatial coordinates, to provide quantitative output information about the number and concentrations of selected (brushed) values. Parallel coordinate display, similar to histograms, also has visual limitations related to ranges of concentrations reaching several hundreds or even thousands of measurement units. Clutter due to large numbers of samples may also affect the visual impression, and there is a limit to the size of a dataset that can be visually explored without extensive use of zooming. The same applies to the maximum number of dimensions to be displayed simultaneously. However, the dimension-reordering tool allows the user to interactively select and deselect any of the variables to be displayed, and change their display order, allowing fast and simple visual exploration of multi-dimensional data. The brushing tool allows for definition of several brushes so the observations or concentration ranges of interest can be either highlighted or hidden. Another advantage is that the extreme values can easily be removed (or replaced), and those decisions are based on the comprehensive information on multi-element data displayed and explored simultaneously (see Fig. 4). XmdvTool provides simple, fast and interactive visualization tools for studying multi-element composition in every geochemical sample, which is essential for decisions about the importance of and fate of censored
Fig. 6. Parallel coordinate display of Dataset 3 (cleaned using IVE). Brushed (in black) are all samples containing concentrations of Ni higher than 21 ppm (the 90th percentile).
values often present in data. In cases of removal of censored values, the first step of data cleaning, purely visual evaluation of those multi-element samples may be very subjective, due to very extreme outliers present in the data. Using numerical output of brushed sample values avoids decisions being based only on visual impression. Brushed data can also be saved as a new data file. In Fig. 4 censored values of Ni are highlighted, and without comparison to percentile values it is not easy to draw conclusions about the ranges of other metals, which actually reach their 80th–95th percentiles in the same samples. The scatterplot is a helpful tool for discovering features in data distribution, including spatial features. Clustering of low values of a metal in X–Y space (Fig. 5) indicates that these are related to underlying geology. The strong spatial trends visible in the scatterplot refer to the influences of different rocks having substantially different chemical composition. The size of the study area together with the number of samples and sampling density define the readability of the X–Y scatterplot, and one could assume that there may be both advantages and disadvantages related to visual exploration of very densely sampled or sparse geochemical data.
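Comparing the numerical output of brushed samples with percentile values, as described above, can be sketched as a percentile-rank lookup. The sketch is hypothetical: the function names and the ten-sample dataset are invented for illustration, the 2 ppm detection limit is an assumption, and the percentile rank is defined here as the share of values less than or equal to the value in question, which is only one of several conventions.

```python
def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100.0 * sum(v <= x for v in values) / len(values)


def rank_brushed(samples, brushed, metals):
    """Percentile rank of every metal in each brushed sample."""
    columns = {m: [s[m] for s in samples] for m in metals}
    return {i: {m: percentile_rank(columns[m], samples[i][m]) for m in metals}
            for i in brushed}


# Hypothetical dataset of ten samples (ppm); sample 8 has censored Ni
# (detection limit assumed to be 2 ppm) but a high Cu content.
samples = [{"Ni": ni, "Cu": cu} for ni, cu in
           [(8, 5), (9, 6), (7, 7), (12, 8), (10, 9),
            (11, 10), (6, 11), (15, 12), (2, 30), (14, 13)]]

print(rank_brushed(samples, brushed=[8], metals=("Ni", "Cu")))
# {8: {'Ni': 10.0, 'Cu': 100.0}}
```

A table of this kind makes it explicit when a sample with a censored value of one metal carries an unusually high content of another, which is difficult to judge from a cluttered display alone.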
Fig. 7. Ratio map (Map 3/Map 4) of interpolated Pb content in glacial till. Light-colored locations indicate the locations of samples removed by the univariate approach (present only in Dataset 3) and dark-colored locations show the locations of samples removed by IVE (present only in Dataset 4). Square symbols show the locations of all (51) samples removed as outliers during IVE.
In the present study, the number of censored values (3.8% of data) was not considered to be large enough to affect the quality of data. How useful IVE is with a larger number of censored values is not clear, but even with a few it would not be reasonable to remove or replace them without considering the other concentrations in the same samples, as well as spatial location. The presence of samples that contained both values under the detection limit for some of the elements and extreme outlying values for other elements illustrates the power of IVE in dealing with raw multivariate geochemical data. Using traditional numerical analysis techniques for data analysis, those samples may not be as easily studied and dealt with. Regarding the censored values, one could have replaced some or all instead of deleting them, but the focus of this study was mainly on presenting the visual approach rather than interpretation of the dataset. Considering the abundance of data in low concentration ranges, even more samples could have been deleted, but for data interpretation, replacement based on multi-element characteristics of the samples would be more relevant. The dataset cleaned using IVE (Dataset 3) shows a considerable decrease in maximum values for all metal concentrations (Fig. 6), compared to original data (Fig. 4). Even if the decrease is more moderate than for Dataset 4 with 10–36 ppm lower maximum concentrations, the visual separability of polylines has improved considerably after outlier removal. As seen in Fig. 6, the brushed Ni concentrations over the 90th percentile are clearly associated with Cu, Co, V and Zn, but not Pb. Knowing the composition of the rock in the area, one can relate this multi-element signature to mafic rocks. Outlier removal is the most important step in improving a positively skewed frequency distribution of an element. Removal of disturbing noise (especially errors) from geochemical data is important for successful multivariate statistical analysis. It is also the most subjective step in a study and is often influenced by non-quantitative decisions based on background knowledge or auxiliary data. The question is whether this subjectivity impairs the reliability of the results, and to what degree outcomes obtained by different analysts using the same data would differ. As glacial till in the present study was assumed to reflect the natural or background levels of studied metals, samples that belong to a multi-element geochemical fingerprint may indicate mineralization related to a certain rock type. It is however difficult to decide about the inconsistency of an outlier, and whether one value should be replaced with a more moderate concentration instead of the whole sample being removed. Of all possible approaches, this study concentrated on data cleaning, and revealing differences between a traditional and an alternative technique for outlier removal. Modification of the extremely high values is not considered in this study, as removal of outliers was not final (outliers were compiled into a separate dataset). Normally there is a compromise between decreasing the skewness of the data distribution and keeping valuable information. The number of outliers to be removed or replaced may be related not only to the skewness of the distribution, but also to multi-element fingerprints in the data and spatial location of the samples. Moreover, sample media, sampling scale, and other information available may have a relation to data cleaning, thus making the decision about removal or replacement
Fig. 8. Scatterplot matrix display of outliers (51 samples) removed by IVE (Dataset 2 – Dataset 3). Brushed (in black) are all samples with Pb concentration >44 ppm. The scatterplot in first column and second row shows spatial location of samples within the study area.
better suited to actual data and purpose of the study. To constrain the removal or replacement, statistical or other numerical techniques should be applied in parallel with visual analysis tools. During IVE the analyst gets valuable information about the inner structure of the multivariate data, and that information is essential for the later stages of data analysis, for selection of multivariate statistical techniques, and interpretation of their results. Disadvantages of visual, compared to numerical data assessment are related to relatively poor documenting and reporting possibilities. There are limitations in space as well as reproducing colors for printed illustrations, and extensive documentation of intermediate results in numerical form is needed for making reproducible IVE, which is an iterative part of the exploration process. To compare spatial location of the highest metal concentrations in cleaned Dataset 3 and 4, an exact interpolation and image overlay were used. The main reason for interpolation was to enhance the visual outcome, because changes in one pixel would be impossible to recognize in original point data. Once a sample has been removed, only neighbouring samples contribute to interpolation of pixel values on that location. In case the same sample has not been removed from the second dataset, it has influence on the surrounding pixel values,
and a circular feature will be formed in the ratio image. That makes it easier to recognize the differences between the two outlier-cleaned datasets. The higher the concentration of the removed outlier compared to the surrounding samples, the larger the difference from 1 in the ratio image; a value of 1 indicates areas that are identical in the two interpolated maps (Map 3 and 4). Low values (dark locations) correspond to moderate to high concentrations that were removed during IVE because of outlying values of other metals in the same samples. High values (over 1) show the location of high concentrations that belong to multi-element fingerprints and were preserved during IVE. As the highest removed concentrations are not visualized in the ratio images, the locations of all outliers removed during IVE are also indicated using symbols (Fig. 7). In Fig. 7, showing Pb, there are about 30 light-colored locations marking samples with Pb content over 44 ppm in Dataset 3 (deleted during univariate removal), and 8 dark-colored locations (coinciding with symbols). One has to keep in mind that, of a total of 51 samples, the number of samples removed by both methods may vary between metals. However, in Fig. 7 the light-colored locations are considerably more numerous than the dark-colored ones, and light-colored rather than dark-colored locations tend to form spatial clusters. The same trend is observed for the other metals, which means that for each metal, informative outliers related to mineralization are still present after data cleaning with IVE but are lost with the univariate approach. Univariate outlier removal resulted in 6 different datasets, which will cause a loss of information if a common database is to be compiled from the cleaned data.
The 51 outliers removed from Dataset 2 using IVE reveal interesting patterns in scatterplots and parallel coordinates for the element Pb. Highlighted in Fig. 8 are all removed outliers with Pb concentrations higher than 44 ppm, 17 samples in total, which form a separate cluster in many of the scatterplots. The maximum concentration of Pb in Dataset 4 is 44 ppm (for comparison, the 95th percentile is 41 ppm), so the same 17 samples were evidently also removed by the univariate approach and are not included in the subsequent interpolation and comparison steps (Fig. 7). Located in the SSE of the study area (Fig. 8), the highlighted Pb values indicate the location of known mineralization related to acid volcanic rocks, and only a few of these values are associated with contents of other metals above their 50th percentile. The remaining 34 outliers removed during IVE, with Pb concentrations below 44 ppm, have a completely different spatial and multi-element character, and include both mafic mineralization fingerprint samples and several samples containing single extreme concentrations of metals. This means that the two known mineralizations can easily be separated in the outlier data in both multi-element and geographic space (see scatterplots in the 6th column or row in Fig. 8). Compared to the original data, the outlier dataset is easier to explore because it contains few samples.
In conclusion, the image ratios confirmed that the IVE approach performed better than the univariate one for outlier removal.
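The ratio-image comparison described above can be illustrated with a minimal sketch. The interpolator used in the study is not named beyond being exact, so inverse-distance weighting (which honours sample values at sample locations) is assumed here; the function `idw_grid`, the coordinates, and the Pb values are hypothetical and are not data from the study. The ratio is taken as the IVE-cleaned map divided by the univariate-cleaned map, a direction inferred from the statement that values over 1 mark concentrations preserved during IVE.

```python
import numpy as np

def idw_grid(xy, values, grid_x, grid_y, power=2.0):
    """Inverse-distance-weighted interpolation onto a regular grid.

    Assumed stand-in for the unspecified exact interpolator: a pixel that
    coincides with a sample location takes that sample's value.
    """
    gx, gy = np.meshgrid(grid_x, grid_y)
    pixels = np.column_stack([gx.ravel(), gy.ravel()])
    # Distance from every pixel to every sample: shape (n_pixels, n_samples)
    d = np.linalg.norm(pixels[:, None, :] - xy[None, :, :], axis=2)
    hit = d == 0.0
    w = 1.0 / np.where(hit, 1.0, d) ** power
    est = (w * values).sum(axis=1) / w.sum(axis=1)
    # Enforce exactness where a pixel sits on a sample
    on_sample = hit.any(axis=1)
    est[on_sample] = values[hit.argmax(axis=1)[on_sample]]
    return est.reshape(gy.shape)

# Hypothetical samples: x, y (km) and Pb (ppm)
xy = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0], [2.0, 2.0]])
pb = np.array([10.0, 12.0, 11.0, 13.0, 40.0])

kept_ive = np.array([True, True, True, True, True])          # high-Pb sample preserved
kept_univariate = np.array([True, True, True, True, False])  # high-Pb sample removed

grid = np.linspace(0.0, 4.0, 5)
map_ive = idw_grid(xy[kept_ive], pb[kept_ive], grid, grid)
map_uni = idw_grid(xy[kept_univariate], pb[kept_univariate], grid, grid)

# Ratio image: 1 where the maps agree, above 1 around the preserved sample
ratio = map_ive / map_uni
```

In this toy case the ratio equals 1 at the four corner samples shared by both datasets and rises towards the centre, where the preserved high-Pb sample is located, which is the circular feature described in the text.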
The differences are due to existing element associations in the high-valued samples, and those multivariate fingerprints are unique to each dataset. Based on the results, it can be assumed that the stronger the spatial patterns and multi-element associations in the original geochemical data, the more the results of selective and univariate removal of outliers will differ. In the test dataset, spatial clustering due to mineralization was quite strong, and univariate removal of high values was therefore not desirable. The results of the study suggest that removal of outliers should be kept to a minimum, and that replacement may often be a better alternative for cleaning geochemical survey data of censored values and extreme outliers. XmdvTool offers powerful visual tools for getting an overview of both the univariate distributions of samples and the multivariate and spatial structures inherent in raw geochemical data. IVE allows interactive, detailed inspection of the composition of individual multi-element samples, which has not been possible
with any of the traditional numerical analysis tools. At the same time, the contribution of IVE to data analysis depends on the number of variables, the ranges of concentrations, the sample size, etc. IVE is not a stand-alone alternative to statistical methods of data analysis; rather, it offers a way to include visual insight, justify decisions, increase the reliability of analysis results, and test hypotheses. Integrated visual evaluation proved helpful in identifying noise and unusual outlying samples, so the approach has potential for describing raw geochemical data.
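The loss of information noted earlier, when per-metal cleaned datasets are merged into a common database, can be shown with a small numerical sketch. Everything in it is hypothetical: the concentration matrix, the three unnamed metals, and the fixed cutoffs are invented for illustration. Counting how many metals exceed their cutoff in each sample is only a crude numerical hint at multi-element character; it is not IVE, which relies on visual inspection.

```python
import numpy as np

# Hypothetical concentrations (ppm): rows are samples, columns are three metals
conc = np.array([
    [12.0, 30.0,  5.0],
    [15.0, 28.0,  6.0],
    [90.0, 25.0,  4.0],   # single extreme value in metal 0
    [14.0, 95.0,  5.0],   # single extreme value in metal 1
    [13.0, 27.0, 60.0],   # single extreme value in metal 2
    [80.0, 88.0, 55.0],   # high in all metals: a multi-element pattern
    [11.0, 26.0,  6.0],
    [16.0, 29.0,  5.0],
])

# Hypothetical per-metal upper cutoffs for univariate removal
cutoff = np.array([50.0, 50.0, 20.0])

# Univariate removal: each metal keeps its own subset of samples
keep_per_metal = conc <= cutoff
kept_counts = keep_per_metal.sum(axis=0)

# A common database can only hold samples that survive for every metal
common = keep_per_metal.all(axis=1)

# Number of metals above cutoff per sample (separates single extremes
# from samples that are high in several metals at once)
n_high = (~keep_per_metal).sum(axis=1)

print("samples kept per metal:", kept_counts.tolist())
print("samples in common database:", int(common.sum()))
print("metals above cutoff per sample:", n_high.tolist())
```

Each metal loses only two samples here, yet the common database shrinks from eight samples to four, and the sample that is high in all three metals is discarded together with the isolated extremes.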
Acknowledgements The study was financed by a grant from the Geological Survey of Sweden. The author thanks H. Henkel (KTH) for discussions and M. Ward (WPI) for help with the visualization software. The comments and suggestions of two anonymous reviewers and B.L. Sim (University of Ottawa) greatly improved the manuscript.