Original Research Paper
Chemometrics and Intelligent Laboratory Systems, 12 (1991) 181-187
Elsevier Science Publishers B.V., Amsterdam
A new procedure for the visual inspection of multivariate data of different geographic origins
R. Leardi *
Istituto Analisi e Tecnologie Farmaceutiche ed Alimentari, Via Brigata Salerno (Ponte), 16147 Genova (Italy)
E. Marengo
Via Druento 115, 10151 Torino (Italy)
and R. Todeschini
Dipartimento di Chimica Fisica ed Elettrochimica, Via Golgi 19, 20133 Milano (Italy)
(Received 28 February 1991; accepted 6 July 1991)
Leardi, R., Marengo, E. and Todeschini, R., 1991. A new procedure for the visual inspection of multivariate data of different geographic origins. Chemometrics and Intelligent Laboratory Systems, 12: 181-187.
A new display method to be used for data of different geographic origin is presented. A geographic map of the region is shown, the zone corresponding to each sample being in a colour which is derived from the sum of the fundamental colours which have been associated with the variables (original or transformed). Three real examples show that the interpretation of the resulting plots is much easier than with classical loading and score plots, with regard both to the differences between the objects and to the relative importance of the variables.
INTRODUCTION
The graphical analysis of data is a powerful tool to extract useful information from complex data and is widely applied in several different forms [1-3]. However, the graphical analysis of multivariate data is usually difficult because data with a dimensionality greater than three cannot be visualized directly. Principal component
analysis (PCA) provides a powerful technique to reduce space dimensionality by eliminating collinearity, redundant information and noise [4]. Unfortunately in several cases, in spite of the use of PCA, the intrinsic dimensionality of the significant pattern space still remains too large. In some studies of the correlation of sample descriptive features and their geographic origin the problem is much worse. In fact, although
PCA can be applied to reduce the dimensionality of the descriptors, in principal component score plots the spatial information of the sample is only related to the variable values and is totally unrelated to any existing geographic dependency. A possible solution involving the use of differently coloured regions has been suggested and employed [5]. The use of graphical packages is limited by the relatively small number of predefined colours. In the present paper a new algorithm for the visualization and the analysis of the dependence on geography of up to three-dimensional data is presented. The variables can be either the original descriptors or transformed variables.
THEORY
Any colour can be described as a mixture of the three fundamental ones: red, green and blue (RGB). In this mixture, red, green and blue enter in the appropriate proportion to give the final colour tone and lightness, the contribution of each ranging from 0 to 1. This combination of colours also applies to a computer screen, where any colour is obtained by a mixture of red, green and blue. On this basis, in the RGB scale, (0, 0, 0) corresponds to black and (1, 1, 1) to white.
In a study, each variable can be associated with one fundamental colour so that every object of the data set under investigation, described by the values of these variables, is exactly defined by the colour obtained by combining the fundamental ones in the proper proportion. Since there are three fundamental colours it is possible to visualize at most three variables at a time. The colour characterizing every sample will depend on the selected association between variables and fundamental colours and on the scaling technique employed for every variable, which must be reduced to the range (0-1). In any case the importance of every variable considered for the specific object determines the contribution of every fundamental colour to the final colour. It is then possible to plot a coloured area (box or circle), with the specific colour of any object, on a geographic map, in the position
corresponding to the geographic coordinates of the considered sample. This procedure allows the visual determination of trends of geographical variation of the variables as well as geographic clustering of the collected samples or other complex patterns. This method is very sensitive, taking advantage of the high sensitivity of the human eye to colour changes. A valuable extension of this method can be made by substituting, for the classical geographic coordinates, specific internal spatial coordinates associated with the different parts of an industrial plant, thus inspecting the flow of the whole process.
Since experimenters usually have to deal with high-dimensional multivariate spaces, it seems natural to apply this technique of data exploration using principal components instead of the original variables. The principal components are new variables, orthogonal to each other, that are linear combinations of the original variables, calculated so that the maximum possible amount of variance of the original data is explained by the smallest possible number of principal components. This compression of the original information into a smaller number of variables often allows the recognition of patterns of interest hidden in the multidimensional structure of the original data.
Four main procedures can be adopted to obtain better visual results: factor rotations, scaling, colour enhancement and colour scale inversion. A factor rotation can be applied to the principal components to obtain a simplified distribution of the original variables in each component, thus making the components easier to interpret.
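The colour assignment described at the start of this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' QuickBasic or Fortran code: the function names `object_colour` and `to_hex`, the sample coordinates and the sample values are all hypothetical, and the variables are assumed to be already scaled to the range 0-1.

```python
def object_colour(values, channels=("red", "green", "blue")):
    """Combine up to three scaled variables into one (r, g, b) triple.

    values   -- variable values for one object, each already in 0-1
    channels -- fundamental colour associated with each variable
    """
    rgb = {"red": 0.0, "green": 0.0, "blue": 0.0}
    for value, channel in zip(values, channels):
        if not 0.0 <= value <= 1.0:
            raise ValueError("variables must be scaled to the range 0-1")
        rgb[channel] = value
    return rgb["red"], rgb["green"], rgb["blue"]


def to_hex(rgb):
    """Convert an (r, g, b) triple in 0-1 to a '#rrggbb' string."""
    return "#" + "".join(f"{round(c * 255):02x}" for c in rgb)


# Hypothetical samples: (x coordinate, y coordinate, three scaled variables).
samples = [
    (-7.5, 41.2, (1.0, 0.0, 0.0)),
    (-8.0, 37.1, (0.0, 1.0, 0.0)),
]

# One coloured marker per object, placed at its geographic position.
# Here variable 1 = red, variable 2 = blue, variable 3 = green (an arbitrary choice).
markers = [
    (x, y, to_hex(object_colour(v, ("red", "blue", "green"))))
    for x, y, v in samples
]
```

Changing the `channels` argument changes the variable/colour coupling, which is one of the choices the user may want to vary when looking for the most readable map.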
Moreover, two scaling procedures can be suggested: independent range scaling of each variable (each variable assumes values between 0 and 1) and a global range scaling based on the largest range of the variables (only the variable with the largest range assumes values between 0 and 1; the remaining variables assume values between 0 and the ratio range(var)/largest range). The former procedure balances the descriptive power of the variables since it considers them
Fig. 1. Three-dimensional colour representation in the RGB space. Red box: colour subspace before translation; violet box: colour subspace after translation for colour enhancement.
potentially all equally effective; it should be used when the importance of the variables is not related to their range. In this case the colour space spanned by the objects is the whole colour space, representable as a cube in the RGB scale, each axis ranging between 0 and 1 (Fig. 1). The latter procedure preserves the difference between the importance of the variables and is advisable when the variables are principal components, each explaining a decreasing part of the total variability of the data. In this case the colour space spanned by the objects is a subset of the whole RGB space, corresponding to a box in which only the edge associated with the most varying variable ranges between 0 and 1. If one or two variables have small ranges, all objects located near the lowest region of their ranges (i.e., near the black vertex; see Fig. 1) will have quite dark colours; only objects with high values of the variable with the largest range maintain bright colours. A simple translation of the range center of the narrow-ranging variables, and consequently of the colour box (red box in Fig. 1), can be made
towards higher values of the RGB axes (for example, violet box in Fig. 1, where the center of the box is moved to the central gray point 0.5, 0.5, 0.5). In this way the projection of the center of the box on the diagonal of the RGB cube, which represents the global brightness of the box itself, moves to positions corresponding to brighter colours. The resulting general increase in the brightness of the whole graphic representation makes it easier to discriminate visually between the different objects, without modifying their distances in the colour space.
The last procedure to enhance the visual results consists of the possible inversion of the colour scales so that, for the variables undergoing this transformation, objects with low values have high colour values. In any case, several combinations of the different parameters (the best selection of the fundamental colour/variable coupling and the possible colour scale inversion) must be evaluated experimentally, since the optimum depends on the particular visual sensitivity of the user’s eye and on the particular structure of each data set. From this point of view, the background colour can also be of great importance.
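The two scaling procedures, the colour enhancement by translation and the colour scale inversion can be illustrated with the following Python sketch. It is written under stated assumptions and is not taken from the original programs: all function names are hypothetical, each variable is held as a plain list of values, the enhancement moves the range center of a variable to 0.5 (the example given in the text), and the inversion is implemented as a reflection within the range the variable already occupies, which is one possible reading of the description.

```python
def independent_range_scaling(columns):
    """Scale every variable to 0-1 using its own range."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # constant variable: avoid division by zero
        scaled.append([(v - lo) / span for v in col])
    return scaled


def global_range_scaling(columns):
    """Scale every variable by the largest range found among the variables.

    Only the widest variable spans 0-1; variable k spans
    0 .. range(k) / largest range.
    """
    largest = max(max(col) - min(col) for col in columns) or 1.0
    scaled = []
    for col in columns:
        lo = min(col)
        scaled.append([(v - lo) / largest for v in col])
    return scaled


def enhance(column, center=0.5):
    """Translate a scaled variable so that its range center sits at `center`.

    A pure shift: distances between objects are unchanged.
    """
    shift = center - (min(column) + max(column)) / 2
    return [v + shift for v in column]


def invert(column):
    """Invert the colour scale: low values receive high colour values.

    Assumption: reflection within the interval the variable occupies.
    """
    lo, hi = min(column), max(column)
    return [lo + hi - v for v in column]


# Hypothetical scores of two components with very different ranges.
pcs = [[0.0, 5.0, 10.0], [2.0, 3.0, 4.0]]
scaled = global_range_scaling(pcs)   # second component spans only 0 .. 0.2
brighter = enhance(scaled[1])        # same spread, now centred on 0.5
```

With global range scaling the second component would contribute only dark tones; after `enhance` it occupies the same width of the colour axis but around the middle of the scale, which is the effect shown by the violet box in Fig. 1.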
SOFTWARE
Four different versions are available: two stand-alone versions in the QuickBasic and Fortran languages, one to be implemented in PARVUS [6], and one in the SCAN [7] package for chemometric analysis. The Fortran version makes use of the Graphical Kernel System (GKS) as the graphic language and interface. The programs can be run on any PC with a graphics adapter (PGA, VGA) capable of displaying 256 colours simultaneously on the screen; in this case it is possible to visualize up to 255 objects at the same time (the 256th colour being used for the background).
The input consists of two data files. One contains the geographical coordinates of the objects (for example, latitude and longitude), defined as coordinates of class 1. This data file may optionally contain four other classes of coordinates (borders, class 2; rivers, lakes, seas, class 3; mountains, class 4; human settlements, class 5). For classes 2-5 the points of each class are joined together until a class number with a minus sign is found; a new sequence of the same class can start from the first new point. The second data file contains the values of the variables for every object; they can be original variables as well as principal components or other transformed variables.
The programs also allow the visualization of a single variable, in which case the colour scale ranges between blue and red. This permits a straightforward interpretation of the relationship between the geographical distribution and the variable distribution.
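For the single-variable display, a blue-to-red scale could be produced as in the sketch below. The text only states that the scale ranges between blue and red; the linear blend between the two colours, the function name `blue_red` and the clipping of out-of-range values are assumptions made here for illustration.

```python
def blue_red(value, lo, hi):
    """Return an (r, g, b) triple: pure blue at `lo`, pure red at `hi`.

    Assumption: intermediate values are a linear blend of blue and red.
    Values outside lo..hi are clipped to the ends of the scale.
    """
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))
    return (t, 0.0, 1.0 - t)


# Hypothetical scores of one variable for four objects.
scores = [-2.0, 0.0, 1.0, 2.0]
colours = [blue_red(s, min(scores), max(scores)) for s in scores]
```

The lowest score is drawn in blue and the highest in red, matching the convention used in the captions of the single-component maps.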
EXAMPLES
Portuguese olive oil [8,9]
This data set was used to study the classification of 70 Portuguese olive oils of different geographic origin on the basis of the chemical analysis of their acidic and sterolic fractions (13 variables). Three components are significant and the score plot obtained from an orthogonal rotation (raw varimax rotation of loadings [10]) is shown in Fig. 2. Two clusters and a singleton can be easily identified. However, this representation does not show the information related to the geographical origin of each object, so that it is rather cumbersome to determine whether the structure in the variable space is somehow connected with the origin of the samples; in other words, it is not easy to understand whether two oils with similar composition (i.e., with similar locations in the variable space) derive from contiguous or somehow similar zones.
Fig. 2. Three-dimensional score plot of the Portuguese olive oil samples.
Figs. 3, 4 and 5 show the three rotated principal components (PC1, PC2 and PC3, with explained variances of 23.6, 19.7 and 13.9%, respectively), one at a time. It can be seen immediately that PC1 differentiates the oils of the northern part of Portugal (Douro Valley) from all other samples, while on PC2 the southernmost sample (from Algarve) has a completely different value from all other samples. PC3 has a very interesting behaviour: in this case no groups can be detected, but it is clear that its values strongly depend on the distance from the sea.
Fig. 3. PC1 map of the Portuguese olive oil samples. Blue for the lowest values of scores and red for the highest.
Fig. 4. PC2 map of the Portuguese olive oil samples. Blue for the lowest values of scores and red for the highest.
Fig. 5. PC3 map of the Portuguese olive oil samples. Blue for the lowest values of scores and red for the highest.
Fig. 6 is obtained by taking into account the three components at the same time (PC1 = red, PC2 = blue, PC3 = green). Its interpretation is easy, since the previously defined groups (Douro Valley and Algarve) appear very different, having colours near the red and the blue, respectively. All other samples have colours formed by the different contributions of the three fundamental colours, but it is evident that the green contribution decreases with the distance from the sea.
Fig. 6. Map of scores of the first three principal components of the Portuguese olive oil samples: PC1 (North Portugal): red for the highest values; PC2 (Algarve region): blue for the highest values; PC3 (distance from the sea): green for the highest values. The Douro and Tagus rivers are also shown (blue, from north to south).
Analysis of water wells in the Crema region [11]
This data set was used to study the relationships between the chemical composition of water from 95 wells in the surroundings of the city of Crema (Italy, yellow colour) and the region's geomorphological features. The region (Fig. 7) is characterized by a fundamentally flat plain crossed by three fluvial valleys (the Adda, Serio and Oglio rivers from west to east, in blue) running in a north-south direction, and by a residual zone of the ancient Po Plain (Pleistocene Terrace, in brown) between the Serio and Oglio rivers. Each water sample is described by nine chemical and physico-chemical variables (pH, conductivity, Cl⁻, NO₃⁻, hardness, Ca, Fe, alkalinity, sulphate). The first principal component represents the salt content (green for the highest values) while the second principal component is mainly determined by the difference between nitrates and iron (red for the highest values) (Fig. 7). The differently coloured samples confirm the known geomorphological trend of the territory and also suggest some new hypotheses about the old fluvial valley on the left side of the Serio river, which had not been previously recognized.
Fig. 7. Map of scores of the first two principal components of the water well samples of the Crema region. PC1 (salt content): green for the highest values; PC2 (iron): red for the highest values. In the map the city of Crema (yellow), the Adda, Serio and Oglio rivers (blue, from west to east) and the residual zone of the ancient Po plain (brown, between the Serio and Oglio rivers) are also shown.
Pollution of the Po river [12]
In 19 sampling sites along the Po river, 19 variables describing the water pollution were measured. Three PCs explaining 74.6% of the total variance were found to be significant. After orthogonal rotation (raw varimax rotation of loadings) their interpretation was: PC1 (red) is related to industrial pollution (Cr, Mn, Ni, Zn); PC2 (green) and PC3 (blue) are related to urban pollution (Pb, NO;, surfactants and coliforms for PC2; conductance, SO₄²⁻, Cl⁻, NO; for PC3). A map (Fig. 8) clearly shows a general trend of increasing pollution (from dark to light colours), particularly for urban pollution. Along this trend some particular zones are evident: the first source of urban pollution corresponds to the town of Moncalieri; the sharp decrease is due to the presence of a wastewater plant; a new source of pollution arises in Turin, with a sharp increase at the confluence with the Dora Riparia river, which flows through the town; near Brandizzo the presence of an industrial plant greatly increases the industrial pollution, which is immediately diluted by the confluence of the Malone river. A few kilometers further downstream there is another increase of urban pollution due to the town of Chivasso.
Fig. 8. Map of scores of the first three principal components of the Po river pollution samples. PC1 (industrial pollution): red for the highest values; PC2 (urban pollution): green for the highest values; PC3 (urban pollution): blue for the highest values. In the map Moncalieri, Torino and Chivasso (yellow, along the Po river from west to east) and the Dora Riparia and Malone rivers (blue) are also shown.
CONCLUSIONS
The display method proposed here represents a useful complementary tool for the interpretation of multivariate problems related to spatial coordinates. It can be applied not only to typical geographical problems and environmental and ecological studies but also to other projections of space-related variables: (a) processing variables checked in industrial plants; (b) geological variables projected in planes defined by positions along the section line and at the depth of the samples; (c) marker variables in human, or animal, body locations.
In abstract applications the display method can be utilized by using a formal representation of the data space as the coordinates, e.g., the scores of the first two principal components of the descriptors. In regression diagnostics the variables can be residuals or responses calculated by different methods; in classification problems, the variables can be similarities or probabilities of belonging to up to three classes.
REFERENCES
1 H.J. Birks, Multivariate analysis in geology and geochemistry: An introduction, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 15-28.
2 R.A. Reyment, Multivariate analysis in geoscience: Fads, fallacies and the future, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 79-91.
3 K. Esbensen, L. Lindquist, I. Lundholm, D. Nisca and S. Wold, Multivariate modelling of geochemical and geophysical exploration data, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 161-175.
4 J. Davis, Statistics and Data Analysis in Geology, Wiley, New York, 1986.
5 C. Armanino, R. Leardi, S. Lanteri and G. Modi, Chemometric analysis of Tuscan olive oils, Chemometrics and Intelligent Laboratory Systems, 5 (1989) 343-354.
6 M. Forina, R. Leardi, C. Armanino and S. Lanteri, PARVUS: an extendable package of programs for data exploration, classification and correlation, Elsevier Scientific Software, Amsterdam, 1988.
7 R. Todeschini, V. Cosentino, I.E. Frank and G. Moro, SCAN: software for chemometric analysis, Jerle Inc., Stanford, CA, 1991.
8 M. Forina, C. Armanino, S. Lanteri, C. Calcagno and E. Tiscornia, Valutazione caratteristiche dell’olio di oliva in funzione dell’annata di produzione mediante metodi di classificazione multivariati, La Rivista Italiana delle Sostanze Grasse, 60 (1983) 607-613.
9 M.S. Leitas Ferreira Diaz, Delimitação de zonas oleicolas Portuguesas por analyse en componentes principais, Bulletim do Instituto do Azeite e Productos Oleaginosos, 1 (6) (1985) 90-117.
10 R.J. Rummel, Applied Factor Analysis, Northwestern University Press, Evanston, IL, 1970.
11 R. Todeschini, D. Pitea, L. Aloisi and G. Bassi, in preparation.
12 R. Aruga, G. Negro and G. Ostacoli, Unsupervised pattern recognition of surface water pollution data. I. The Po river, Annali di Chimica, 80 (1990) 341-355.