ARTICLE IN PRESS
Computers & Geosciences 34 (2008) 320–338 www.elsevier.com/locate/cageo
‘‘compositions’’: A unified R package to analyze compositional data K. Gerald van den Boogaarta,, R. Tolosana-Delgadob a
Institut fu¨r Mathematik und Informatik, Ernst-Moritz-Arndt-Universita¨t Greifswald, D-17487 Greifswald, Germany Geowissentschaftliches Zentrum, Sedimentologie & Umweltgeologie, Georg-August-Universita¨t Go¨ttingen, D-37077 Go¨ttingen, Germany
b
Received 5 October 2006; received in revised form 21 November 2006; accepted 23 November 2006
Abstract This contribution presents a new R package, called ‘‘compositions’’. It provides tools to analyze amount or compositional data sets in four different geometries, each one associated with an R class: rplus (for amounts, or open compositions, in a real, classical geometry), aplus (for amounts in a logarithmic geometry), rcomp (for closed compositions in a real geometry) and acomp (for closed compositions in a logistic geometry, following a log-ratio approach). The package allows to compare results obtained with these four approaches, since an analogous analysis can be performed according to each geometry, with minimal and straightforward modifications of the instructions. Beside these grounding classes, the package also includes: the most-basic features such as data transformations (e.g. logarithm, or additive logistic transform), basic statistics (both the classical ones, and those developed in the log-ratio framework of compositional analysis), high-level graphics (like ternary diagram matrix and scatter-plots) and high-level analysis (e.g. principal components or cluster analysis). Results of these functions and analysis are also provided in a consistent way among the four geometries, to ease their comparison. r 2007 Elsevier Ltd. All rights reserved. Keywords: Aitchison geometry; Compositional data; Euclidean space; Multivariate data analysis; Software
1. Introduction This paper aims at presenting a new package for R, the open-source statistical environment, devised to analyze compositional data. This paper describes its structure, and shows many of the available functions, but it is not intended as a guide to its use. Nevertheless, the appendix includes a recommended draft sequence of analysis. If more guidance on its basic usage is wanted, the package itself carries a hands-on instruction, UsingCompositions.pdf: it is installed in the /doc subdirectory of the package (available through the html help of R), and can be downloaded from the package home page, http://stat.boogaart.de/compositions/. Also, van
Corresponding author.
E-mail addresses:
[email protected] (K.G. van den Boogaart),
[email protected] (R. Tolosana-Delgado). URL: http://www.iamg.org/CGEditor/index.htm 0098-3004/$ - see front matter r 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2006.11.017
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
321
den Boogaart and Tolosana-Delgado (2006) present a detailed guide on the use of the package ‘‘compositions’’ within the scope of Aitchison (1986) analysis of compositions. In the mathematical geology community, there is a strong disagreement on how to do a statistical analysis of compositional data. Though several warnings on the spurious effects of the so-called closure operation on the covariance matrix (Chayes, 1960), most users of statistics in geosciences simply ignore the problem, and continue using classical statistical methods. Aitchison (1982) put forward several considerations on compositions and their sample space (the so-called Simplex), which he argued that should be honored by any statistical method for compositions: results should be independent of the measurement units, as well as of permutations of the parts involved, and subcompositions should behave as marginals in classical statistics. Attending to these considerations, he suggested a methodology, which avoided the closure effect by taking logratios: shortly, he proposed to transform the data, apply standard statistical procedures on the transformed scores and back-transform the results, when it was sensible. His approach was complemented with a series of operations to measure change and distance between compositions. Afterwards, the Simplex (equipped with these operations) was identified as an Euclidean space on itself (Billheimer et al., 2001; Pawlowsky-Glahn and Egozcue, 2001). This opened the door to the concept of coordinates, and allowed Pawlowsky-Glahn (2003) to reformulate Aitchison’s recipe in algebraic terms: take coordinates of your observations with respect to a chosen basis, use standard statistical techniques on them (being real unbounded numbers), and apply the results to the basis used. This is what is called the principle of working on coordinates: statistical analysis and probabilistic theory are applied to the coordinates of the observations in an Euclidean space structure. On the other side of the spectrum, Shurtz (2003) suggests that the most important consideration on compositions is mass balance, and that any statistical analysis should honour this fundamental law: consequently, the closure effect on correlation should not be considered spurious anymore. In the same direction, Rehder and Zier (2002) interpret compositions as standardized (closed) mixtures of components, and put forward the convex mixture as the fundamental operation on compositions. These two considerations are mutually consistent, and conceptually support the sense of a classical Euclidean geometry and of linear statistical techniques in compositional data. It is nevertheless worth mentioning that these arguments to justify the suitability of an Euclidean approach have arisen as an argument against Aitchison’s solution. The number of applications found in the references using a classical approach is enormously higher than those using Aitchison’s, although no argument or conscious decision is done on most of them to support their choice. In the authors’ opinion, two reasons explain this. First, there is a lack of available software to analyze a data set under Aitchison’s postulates: to the authors’ knowledge, there are: a set of Matlab routines (CoDa, programmed by Aitchison, available on request), an MSExcel-based package (Thio´-Henestrosa and Martı´ nFerna´ndez, 2005, CoDaPack) and two sets of R routines (Beardah and Baxter, 2005; Bren and Batagelj, 2005). The second reason is the inertia of analysts, who classically consider that raw data statistics are ‘‘purely descriptive’’, i.e. there was no prior choice of a model. This idea is rather naive, as applying no transformation carries an implicit choice of a distance model. In this context, the presented package is essentially built with the twofold aim of: (a) providing an easy way to analyze compositions using a classical approach and using the Aitchison one, so that (b) the analyst can compare results and ground a decision on which geometry to choose. The fundamental structure of compositions follows Pawlowsky’s (2003) principle of working on coordinates: the user chooses a geometry to represent the compositional data set, and then the package automatically computes its coordinates in an adequate reference system, conducts any desired analysis on the coordinates, and (when needed) applies the obtained results to the system basis before representing or returning results to the user. The procedure may be summarized as
ARTICLE IN PRESS 322
K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
HCO3
Cl
SO4
Fig. 1. Single ternary diagram obtained plotting MyComposition ¼ acomp(Hydrochem[,c("Cl", "SO4", "HCO3")]).
A new analysis can be conducted by simply choosing another geometry for the same data set, and results are returned in a comparable fashion. As a quick example, the reader might find useful to type these instructions (explained in detail afterwards) in an R command window (Fig. 1), 4 4 4 4 4
library("compositions") # load library data(Hydrochem) # load data set MyData ¼ Hydrochem[,c("Cl","SO4","HCO3")] # select variables MyComposition ¼ acomp(MyData) # declare them compositional plot(MyComposition) # plot them (figure 1)
which produces a ternary diagram. Otero et al. (2005) present a practical example of such a comparison of several geometries, applied to biplots (Gabriel, 1971). They also present a description of the data set used in the examples of this paper. 2. Working paradigms Four different geometries have been considered in compositions, arising from asking two questions: (1) Is the size (total sum) of the composition relevant, or is it an artifact of the sampling procedure? (2) Is the scale of distance between compositions an absolute (e.g. 1 and 2 are so far as 51 and 52) or a relative (e.g. 1 and 2 are so far as 10 and 20) one? Answering the first question, the user chooses whether the numbers of the data set represent amounts or proportions of each part: in the first case, data are open, only considered to be non-negative, whereas in the second case they are also considered (and forced) to be closed. The closure operation takes a positive vector x ¼ ðx1 ; x2 ; . . . ; xD Þ of D components and divides it by the sum, effectively making the components sum up to one: 1 cloðxÞ ¼ PN
i¼1 xi
ðx1 ; x2 ; . . . ; xD Þ.
(1)
This is the classical definition of a composition: a vector of non-negative components summing up to a constant, which (in this case) is fixed to 1. The set of points in the D-dimensional real space RD which fulfils these conditions is called the D-part Simplex (SD ).
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
323
Answering the second question, the user chooses whether the distance between samples should be computed using a difference or a quotient: in the first case, data are left unchanged or linearly transformed, whereas in the second case a logarithmic transformation will be applied. This implies that under a relative geometry, all components must be strictly positive. By combining the four answers to the two fundamental questions, four approaches are obtained. A summary of these approaches and their associated operations is included in Table 1, whereas Table 2 contains the several transformations used to compute coordinates in each geometry. The names of the available geometries in the package and their descriptions are the following: rplus: An absolute distance and a significant total amount yields an Euclidean geometry on the positive D orthant RD þ as a subset of the real space R , using the classical operations of sum, product by a scalar and scalar product. Canonical coordinates of the analyzed object are computed with the identity transformation (called isometric identity transform, iit, to highlight the parallelism with the coordinate functions of the other geometries). aplus: A relative distance and a significant total amount yields an own Euclidean geometry on the positive orthant RD þ as a vector space on itself, with operations defined as the component-wise product (a.k.a. direct product or perturbation), power by a scalar, and a logarithmic scalar product. The coordinates are obtained by the component-wise logarithmic transformation (called isometric logarithmic transform, ilt). rcomp: An absolute distance and an irrelevant total amount yields an Euclidean geometry on the Simplex SD as a subset of the real space RD , using the classical sum and product, and a scalar product similar to the classical one (see Table 1). Note that due to the closure effect, the data set is confined to (a section) of a ðD 1Þ-dimensional affine hyperplane, which implies that these operations might be undefined within SD . To avoid working with singular covariances derived from the closure and still keep the distances unchanged, coordinates are obtained by the isometric planar transform (ipt), which rotates the reference system with a Helmert-like matrix 1 0 1 1 p ffiffi ffi p ffiffi ffi 0 0 C B 2 2 C B C B 1 1 2 C B pffiffiffi pffiffiffi pffiffiffi 0 C B C B 6 6 6 (2) V¼B C C B . . . . .. .. .. .. C B 0 C B B 1 1 1 D 1 C A @ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi DðD 1Þ DðD 1Þ DðD 1Þ DðD 1Þ around the barycenter of the Simplex (Table 2). Table 1 Summary table of definition of the geometries available Significant size? Natural scale by Class
Yes Difference rplus
Yes Quotient aplus
No Difference rcomp
No Quotient acomp
Inner sum
Sum zi ¼ xi þ yi
Perturbation zi ¼ xi yi
Sum zi ¼ xi þ yi (*)
Product by a scalar Scalar product
Product zi ¼ l xi PD i¼1 xi yi
Power zi ¼ xli PD i¼1 ln xi ln yi
Product zi ¼ l xi (*) PD 1 i¼1 xi yi D
Closed perturbation zi ¼ cloðxi yi Þ Closed power zi ¼ cloðxli Þ PD xi yi i¼1 ln gðxÞ ln gðyÞ
Neutral element Form of linear combinations
0 (origin) Weighted sums
1 (Unbalanced) Mass action equations
– End-member mixtures
cloð1Þ ¼ D1 1 (barycenter) Balanced mass action equations
Related references
–
Pawlowsky-Glahn (2003)
Rehder and Zier (2002)
Aitchison (1986, 2002)
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Notation: x ¼ ðx1 ; x2 ; . . . ; xD Þ is a vector of observations, gðxÞ ¼ D x1 x2 xD is the geometric mean of its parts, 0 and 1 are, respectively, vectors of D zeroes and D ones, and cloðÞ is the closure operation (Eq. (1)). (*) These operations return an rmult object.
ARTICLE IN PRESS 324
K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
Table 2 Summary table of names and definition of the transformations Class Basic transformation
rplus Identity
aplus Logarithm
rcomp Difference
acomp Log-quotient
Centered transform cdtðxÞ Treats all parts equally Inverse
Isometric identity t iitðxÞ n¼x x¼n
Isometric log t iltðxÞ n ¼ lnðxÞ x ¼ expðnÞ
Centered planar t cptðxÞ n ¼ x D1 1 x ¼ n þ D1 1
Centered log-ratio t clrðxÞ x n ¼ ln gðxÞ x ¼ cloðexpðnÞÞ
iitðÞ iitðÞ
iltðÞ iltðÞ
aptðÞ n ¼ xD ¼ ðx1 ; . . . ; xD1 Þ
alrðÞ x n ¼ ln xD
Additive transform Provides full-rank covariances Inverse
PD1
D
x ¼ cloðexpðn; 1ÞÞ
iitðÞ
iltðÞ
x ¼ ðn; 1
Isometric transform idtðÞ Full-rank isometry Inverse
Isometric identity t iitðÞ iitðÞ iitðÞ
Isometric logt iltðÞ iltðÞ iltðÞ
Isometric planar t iptðÞ n ¼ V cptðxÞ cptðxÞ ¼ Vt n
Isometric log-ratio t ilrðÞ n ¼ V clrðxÞ clrðxÞ ¼ Vt n
Reference
–
–
This contribution
Aitchison et al. (2002)
i¼1
xi Þ
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Notation: x is a vector of observations, n is a transformed vector, gðxÞ ¼ D x1 x2 xD is the geometric mean of the parts, 1 is a vector of D ones, cloðÞ is the closure operation (Eq. (1)), and V is ðD 1Þ D-Helmert matrix (Eq. (2)). Generic transformations cdtðÞ and idtðÞ, respectively, mean centered and isometric default transform (see Section 4) for explanations.
acomp: A relative distance and an irrelevant total amount yields an own Euclidean geometry on the Simplex SD as a vector space on itself, characterized by the closed perturbation, or closed component-wise product, and the closed power by a scalar (Aitchison, 1986), and using the centered log-ratio scalar product (Aitchison, 2002), based on the clr transformation (Aitchison, 1982). Coordinates are computed using the ilr transformation (Egozcue et al., 2003) (see Tables 1–2 for their definitions), which also uses the Helmert matrix (Eq. (2)). A complete theoretical description of this geometry can be found in Aitchison et al. (2002), whereas Tolosana-Delgado et al. (2005a) gives some worked examples of how to compute these algebraic operations. rmult: This is the classical multivariate real geometry, and it is used to keep consistency within the code. Note that there is no way to avoid answering these two questions: if the analyst is not conscious of its importance and applies a standard analysis to untransformed data, he/she is implicitly (and inadvertently) using an absolute distance, and forcing the open character of the compositional system. 3. Structure of the package As mentioned, the package aims to provide maximum analogy between the four different approaches. This is achieved by always using the principle of working on coordinates (Pawlowsky-Glahn, 2003). Answering the two questions of Section 2, the user chooses a geometry to represent the sample set (either a whole space or a subset) of the data set. Then the package computes the coordinates of observations in this geometry (with respect to a given internal basis), and does whatever operation, analysis or plot needed with the ‘‘coordinated’’ observations; results are finally applied to the basis to recover them as objects of the original sample set. Most steps (with the obvious exception of the geometry choice) are automated within the package. Most usually, the user does not need to worry about the procedure, but only to interpret adequately the obtained results, with regard to the chosen geometry. The package contains six levels of functions, organized in two blocks: algebraic functions: 1: class definitions, 2: algebraic operations for the several geometries, 3: transformations and coordinate computation procedures,
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
325
statistical and probabilistic functions: 4: descriptive and multivariate statistics, 5: high-level plots, and auxiliary plotting subroutines, 6: simulation and distribution density computation functions. 4. Classes (layer 1) The very structure of R (R Development Core Team, 2004) allows an easy implementation of the principle of working on coordinates. Each object of R (including vectors, matrices and arrays, data frames, categorical variables a.k.a. factors, lists of data, and even functions) may have a class: an attribute which describes the type of the object. For instance, standard classes of R are data.frame (a matrix with column names), factor (a categorical variable, stored as a vector of integer values, together with the equivalence between these integers and the original categories), or lm (the result of a linear model, stored as a list containing the matrix of coefficients, the data frame of residuals, or the data frame of fitted values, among others). Some instructions of R (those called generic) are able to recognize the class of the provided data, and react consequently. For instance, the function plot(X) produces a plot adapted to the class of X: if X is a data frame, result is a matrix of scatter-plots of all pairs of variables in X; if X is a factor, result is a bar plot; and if X has class lm, results are four common diagnostic scatter-plots of linear regression (see R help for details). To use this functionality, the basic level of ‘‘compositions’’ contains the definitions of four classes, corresponding to the four geometries already outlined. Moreover, another neutral class, called rmult (for multivariate real) is also defined. These five classes are particularly useful for vectors, and data frames. A call to any of these functions, for instance aplus(X), checks that X contains positive values, and assigns it a class. In the case of acomp and rcomp classes, X is previously closed through a call to the function clo, an immediate implementation of Eq. (1). Example 1. For instance, the instruction sequence 4 x ¼ c(15,60,25) # c() declares a vector 4 acomp(x) produces the result [1] 0.15 0.60 0.25 attr(,"class") [1] "acomp" showing that the vector has been closed to one, and assigned a class. The fifth class, rmult, simply assigns the class without further checking, since it is solely used to treat conventional multivariate vectors consistently within the code. 5. Transformations and vector space operations (layers 2 and 3) Over this class layer, those transformations defined in Table 2 are built, together with all their inverses. This included the generic transformations cdt and idt, called default centered and isometric transformations. They are the simplest examples of how the generic function capabilities of R are used to implement the principle of working on coordinates: a call to cdt(X) or idt(X) is actually processed as (1) (2) (3) (4) (5)
iit if X has rplus class, ilt if X has aplus class, cpt or ipt if X has rcomp class, clr or ilr if X has acomp class, an identity transformation if X has rmult class.
Results of all these transformations are represented by real vectors of class rmult.
ARTICLE IN PRESS 326
K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
Example 2. Using the Hydrochemical data set of Otero et al. (2005) loaded already in the introductory example, if the reader types in the R console window 4 MyComposition will obtain Cl 1 0.24259538 2 0.20486149 (...) 484 0.24423992 485 0.31239697 attr(,"class") [1] "acomp"
SO4 0.51232085 0.56518294
HCO3 0.2450838 0.2299556
0.33496954 0.40887512
0.4207905 0.2787279
Typing any of the following commands 4 clr(MyComposition) 4 cdt(MyComposition) will result in Cl 1 0.252587099 2 0.376789156 (...) 484 0.286624259 485 0.051698805 attr(,"class") [1] "rmult"
SO4 4.949691e-01 6.380262e-01
HCO3 0.2423819545 0.2612370811
2.926432e-02 2.174362e-01
0.2573599356 0.1657374392
Apart from these, transformations alr and apt are also implemented (although they do not have an associated generic function). Finally, the inverse of a given transformation (say clr) can be accessed with a call to function *.inv (thus, clr.inv). These inverse functions take as data a rmult object, and return a data frame (or vector) with the corresponding class, e.g. a closed acomp composition for clr.inv. Aitchison (1986) introduced some operations to deal directly with compositions in the acomp framework, intended to measure and assess compositional change without relying on these transformations. These operations are also defined in functions perturbe and power, direct implementations of their definitions according to Table 1. Note that these two functions admit either acomp or aplus objects (with the only difference that the first is afterwards closed). Furthermore, given that perturbation and power operation replace, respectively, inner sum and outer product in the space structure, the definition of the classical algebraic operations þ and (and its inverses and /) is also adapted in the package. In the same way that the default transformation idt calls the adequate isometric transform given the class of its argument, the algebraic operators call now directly (1) conventional þ and * for rplus and rmult classes; (2) for rcomp, also þ and * are applied, although they return always an rmult object, because the result can be out of the simplex; (3) perturbe and power for aplus and acomp class (with closure in the second case). With these operations, a linear combination of rcomp vectors with weights summing up to one has a special meaning: it is the convex mixture of the original vectors, usually called end-members. Function convex is a convenient shortcut to compute such mixture for the case of two end-members.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
327
Also, scalar product (function scalar), norm (function norm) and distance (function dist) operations are included. Their computation follow always the same structured steps: first the coefficients of the object are computed in accordance with the class using cdt, and then the classical operation is computed with the real values. Example 3. For instance, the Aitchison distance between two closed compositions x and y is obtained as dist(cdt(acomp(x)),cdt(acomp(y))) (declare the vectors as acomp type, apply the adequate centered default transform, and compute their Euclidean distance), as dist(clr(x),clr(y)) (compute the Euclidean distance on the clr-transformed values), or even directly as dist(x,y), if the vectors have been declared first of acomp type. Back-transformation has no sense in this case, since the result of these operations is always a real number. Example 4. In an Euclidean space, coordinates can be computed by using projections onto a set of orthonormal vectors. This example shows how to compute one of these projections: the hydrochemical data set of the last examples will be projected onto the direction which balances anions against cations, following an Aitchison compositional geometry 4 MyComposition ¼ Hydrochem[,c(6:18)] # select the variables 4 MyComposition ¼ acomp(MyComposition) # select the class 4 vd ¼ c(rep(1,8),rep(-1,5)) # cations vs. anions 4 vd ¼ clr.inv(vd) # compositional vector splitting them 4 vd ¼ vd /norm(vd) # make it unitary vector 4 MyBalance ¼ acomp(MyComposition) %*% vd # project the sample # is there any difference on the balance among the rivers? 4 ByRivers ¼ split(MyBalance, Hydrochem[,"River"]) 4 boxplot(ByRivers) # in this balance, no difference The set of available transformations and operations is completed with two amalgamation procedures. Function rcompmargin takes an rcomp object and a vector containing those parts to be left unchanged by the function. The non-selected parts are erased, and their amalgamation is added to the rcomp object. Function acompmargin works similarly, but it takes an acomp object, and instead of computing the amalgamation of the non-selected parts, it computes their geometric mean. 6. Statistics (layer 4) The fourth group of functions contains a set of basic statistical analysis, including principal component analysis (PCA) and descriptive statistics (the classical mean, variance–covariance matrix and correlation), scaling, centering and standardizing procedures. It contains also the metric variance and cluster analysis, both dependent on the distance criterion. Multivariate analysis are also treated in this section. The functions mean, var, cov and cor compute, respectively, the mean, variance matrix, covariance matrix (maybe between two different data sets of the same class) and correlation matrix of a compositional data.frame. They all act in the same way, calling the cdt transform to compute coordinates and computing the desired statistics. In the case of the mean, the result is adequately back-transformed using cdt.inv. Therefore, these statistics return: (a) the classical statistics for rplus and rcomp geometries, (b) the geometric mean and (co)variance of the log-transformed data set for aplus geometry and (c) the closed geometric mean and (co)variance of the clr-transformed data set in the acomp geometry. Apart from these classical statistics, the package contains also functions to compute several advanced spread measures defined in the scope of Aitchison’s approach to compositions: the variation and covariation matrices (Aitchison, 1986), and the metric variance, covariance and standard deviation (Pawlowsky-Glahn and Egozcue, 2001). All of them are computed through the coordinate approach (using cdt), thus they are implicitly generalized to the other geometries.
ARTICLE IN PRESS 328
K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
With command variation, one gets a D D matrix T ¼ ðtij Þ with tij ¼ Var½cdti ðxÞ cdtj ðxÞ: this is the variation matrix defined by Aitchison (1986), for acomp and aplus objects, and the variance of pairwise differences for rcomp and rplus objects. The command covariation returning a DðD1Þ DðD1Þ matrix 2 2 Sðx; yÞ ¼ ðsij;kl Þ with sij;kl ¼ Cov½cdti ðxÞ cdtj ðxÞ; cdtk ðyÞ cdtl ðyÞ generalizes this to covariance. Aitchison (1986, 1997) gives some practical examples on the interpretation of these matrices. Metric moments are scalar values representing the dispersion of a multivariate data set (obtained with mvar and msd) or the degree of linear codispersion of two data sets (with mcov) in a single number: mvar (metric variance) and mcov (metric covariance) both return the sum of the eigenvalues of the covariance matrix of the idt-transformed data set(s), whereas msd (metric standard deviation) is the square root of the mean of these eigenvalues. Note that all these statistics have particular interpretations: a look at the works by PawlowskyGlahn and Egozcue (2001), and Daunis-i-Estadella et al. (2002) may help in understanding them. This basic statistics functions are complemented by functions scale (to center and/or standardize a data set), and summary. Function scale represents the data set in coordinates, does classical standardization, scaling or centering procedures on the coordinates, and back-transforms the results. Function summary computes those interesting basic descriptive statistics in a compatible fashion with the geometry of the class of the data set. Regarding how to proceed with high-level statistics, we consider an example: Example 5. A cluster analysis (standard R function hclust) takes as input a distance matrix between the observations, which is directly obtained by dist(X). The fact that the data set X has one of the compositional classes will automatically lead dist to compute the adequate Euclidean distance, as was explained in Example 3. 4 4 4 4 4
MyComp ¼ acomp(Hydrochem[,6:19]) MyTree ¼ hclust(dist(MyComp),method ¼ "ward") plot(MyTree) MyGroups ¼ cutree(MyTree,k ¼ 3) plot(acomp(MyComp,c("NH4","Na","Ca")),col ¼ MyGroups,center ¼ T)
The achieved ternary diagram shows three clear groups that can be assigned to urban influence, potash mining influence and unpolluted water, following the conclusions in Tolosana-Delgado et al. (2005b). Other distance-based statistical analysis may be straightforwardly adapted following the same strategy. Those based on the covariance matrix should be better built upon the idt-transformed data set. In general, it will be interesting to use cdt-transformed data sets for those methods which must track each part separately, while not being influenced by the singularity of the covariance matrix: for instance, PCA needs a clear identification of each part, and was shown by Aitchison (2002) to be uninfluenced by the covariance singularity. Example 6. In this way, the instruction pca ¼ princomp(acomp(X)) computes the clr transformation of the data set X, computes standard PCA on them, and stores the whole result in object pca. Afterwards, biplot(pca) can be used to retrieve a biplot, or screeplot(pca) to obtain a scree plot, whereas summary(pca) will generate a human-readable numeric output. In the same way, declaring the data set to be of any other of the classes will produce an analogue pca object-based on the adequate cdt transform. As an example, readers may follow this sequence of instructions 4 4 4 4 4
x ¼ Hydrochem[,c(6:19)] y ¼ aplus(x) pca ¼ princomp(y) screeplot(pca) biplot(pca)
# # # # #
select the compositional variables take as an open relative composition compute principal component analysis produces figure 2 produces figure 3
If the user looks for a comparison of geometries, the same sequence of commands may produce it by simply changing the fourth line to, e.g. an open, absolute geometry through y ¼ rplus(x). Instead of getting Figs. 2 and 3, results will be Figs. 4 and 5.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
329
pca 8
Variances
6
4
2
0 Comp.1
Comp.3
Comp.5
Comp.7
Comp.9
Fig. 2. Scree plot of example hydrochemical data set, obtained using an aplus geometry.
. . NH4 .
Comp.2
.
. .
.
. .
.
. . . . . . . . . . .. . . . . . . . . . . . .. . . . . . . .. . . .. . . . . . . . . . . . . . . . . . . . . . . . .. . . . .. .... . . .. . ... . .. . . . . . . . .H .. . . . . . . . . . .. . . . . . . . . . .. . . . .... .. . ..... ....... . .. . .. . . . . . . . .. .. ... .. . . . . . . . . . ... . ..... . .. . . . . .. . . . . . ... TOC . .. ..... ...... . . ..... . . . . . . .. . . . . . . . .. . . HCO3 PO4 . .. . . .. . . . . . .. ... . Ba . .Ca ... .. . . ... . . . . . . . . .. .. .. .. . . . . . SO4 .. . . . . . . .. . . . . . . . . . . . .. . . . ...... . . .. Mg. . . K .. . . Sr. . . .. . ...... . .... . Na . . .. ... .. . . . . . NO3 .. . Cl . ... .. . .. . . .. . . . . . . .. . . .. ... .. . . . . . . . . . . . ....... .. . . . . .. . . . . . . . . . . . . .
.
. . .
Comp.1 Fig. 3. Biplot of example hydrochemical data set, obtained using an aplus geometry.
7. Plots (layer 5) The fifth layer contains low- and high-level plotting functions, all of them defined as generic and reacting to the class of its argument.
ARTICLE IN PRESS 330
K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338 pca 120000
100000
Variances
80000
60000
40000
20000
0 Comp.1
Comp.3
Comp.5
Comp.7 Comp.9
Fig. 4. Scree plot of example hydrochemical data set, obtained using an rplus geometry.
4000
2000
0
2000
4000
6000 6000
. 0.2
4000 .
SO4 .
Comp.2
0.1 . .
0.0 . 0.1
. . . .. . . Na . .
Cl
. . .. . . . . . .. . . .. .. .. . . . ........ .... . .. .. .. ... .......... . .. . ....Ca .. .. .... ...... . .... . . . . .HCO3 .. ...... .... . . . . . . .. ......... . . . . . . ..Mg .. ..TOC ......................... . . . . . . Sr NO3 Ba H PO4 . NH4 . .. . .. . . . . .. ................ . ......... . . . ... . K . ......................... . . . ......... . ......... . .. . ...... .. . . .. .. . .
2000
0
2000
4000
. .
0.2 . 0.2
0.1
0.0 Comp.1
0.1
0.2
Fig. 5. Biplot of example hydrochemical data set, obtained using an rplus geometry. Note that only most abundant ions (those controlled by geological background, see Otero et al., 2005, for details) are displayed in this biplot, and compare it with Fig. 3.
The high-level plots include functions boxplot, qqnorm, plot and barplot. Each of them is able to represent an object in any of the four geometries. Functions plot, boxplot and qqnorm may generate a panel plot (obtained with standard pairs R function), with a marginalization process depending on the class.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
331
Example 7. As an example, using plot on a composition (from acomp or rcomp classes) with 3 parts, ones obtains a single ternary diagram. For instance, the sequence 4 y ¼ acomp(Hydrochem[,c("Cl","SO4","HCO3")]) 4 plot(y) selects the anion subcomposition of a classical Piper plot and plots it in an adequate way to the acomp geometry, thus as a ternary diagram (Fig. 1). But if one plots a subcomposition with 4 classes 4 y ¼ acomp(Hydrochem[,c("K","Na","Mg","Ca")]) 4 plot(y,pca ¼ T,col.pca ¼ "red",pch ¼ 19,cex ¼ 0.4) the result will be a panel plot of ternary diagrams. Figs. 6–9 show the results obtained with each geometry. Note that in all these figures the first principal component is also drawn as a straight line (obtained thanks to the optional argument pca ¼ T). Note in passing that arguments col.pca ¼ "red", pch ¼ 19, cex ¼ 0.4 are just ‘‘cosmetic’’ (see help of par and plot.acomp for details). As said in the example, instruction plot generates a panel plot, when applied to a data set with four or more parts. For rplus (Fig. 6) or a aplus (Fig. 7) object, this is a standard R pairs plot, although in the second case, both axis are represented in a logarithmic scale by default. For closed compositional geometries, 0 200 400 600 800
0
100 200 300 400 200 150
K
100 50 0
800 600
Na
400 200 0
150 100
Mg
50 0 400 300 Ca
200 100 0 0
50 100 150 200
0
50
100 150
Fig. 6. Panel plot of scatter diagrams obtained with call plot (rplus(x[,c("K", "Na", "Mg", "Ca")])). Line marks direction of first principal component, and fit usual intuition of straight lines.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
332
1.6 1.8 2.0 2.2 2.4
3
4
5
6 7 5.0 2.0
K 0.5 0.1 2.4 2.2 2.0
Na
1.8 1.6
4.0 3.5 3.0
Mg
2.5 2.0 7 6 5 4
Ca
3
0.1
0.5
2.0 5.0
2.0 2.5 3.0 3.5 4.0
Fig. 7. Panel plot of scatter diagrams in log scale obtained plotting y ¼ aplus(x[,c("K", "Na", "Mg", "Ca")]). Line marks the direction of first principal component. In this case, straight lines in logistic geometry look as conventional lines here, since they are plotted in a logarithmic scale.
the presented plot is a matrix of ternary diagrams: in each cell, two plotted parts are determined by row and column position, whereas the third part may be:
computed amalgamating all the rest of the parts (the default behavior for rcomp objects, Fig. 8), computed as the geometric mean of the rest of the parts (the default behavior for acomp objects, Fig. 9), defined by the user, through the optional argument margin. This parameter may be one of: "rcomp", "acomp" (produce, respectively, amalgamation and geometric mean), or the name (as a character string) or the position (as an integer) of a single part of the composition (see Fig. 10).
To produce 4-part plots, the package includes function kingTetrahedron, contributed by Bren and Batagelj (2005). This function must be given a 4-part composition, and it generates an auxiliary file. If this file is opened with the adequate viewer (kING, an open source Java application), it displays a 3D tetrahedron plot. Functions boxplot and qqnorm used on closed compositional objects produce a matrix of plots (respectively, box-plots and QQ normal plots), with all possible pairwise differences or log-ratios.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338 0.0 0.2 0.4 0.6 0.8 1.0
333
0.0 0.2 0.4 0.6 0.8 1.0
+
+
1.0
+
0.8 0.6 K
0.4 K
Na
Mg
K
K
Ca
0.2 0.0
1.0
+
+
+
0.8 0.6 Na
0.4 0.2
Na
K
Na
Mg
Na
Ca
0.0 +
+
1.0
+
0.8 0.6 Mg Mg
K
Mg
0.4 Mg
Na
Ca
0.2 0.0
1.0
+
+
+
0.8 0.6 Ca
0.4 0.2
Ca
K
Ca
Na
Ca
Mg
0.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
Fig. 8. Panel plot of scatter diagrams obtained plotting y ¼ rcomp(x[,c("K", "Na", "Mg", "Ca")]). Line marks direction of first principal component. A straight line within a high-dimensional Simplex (in this case, a tetrahedron), does not necessarily end at the edge of ternary diagrams, once projected onto any of its faces.
To obtain a series of QQ or box-plots for a raw compositional vector, one must use instead an rplus object. Example 8. Running the code 4 y ¼ acomp(x[,c("Cl","SO4","HCO3")]) 4 qqnorm(y) generates Fig. 11. Note its symmetry, since, e.g. the plot of ðrow; columnÞ ¼ ð1; 2Þ represents a QQ plot Cl for the variable ln SO4 , whereas plot ð2; 1Þ works with variable ln SO4 Cl (identical to the former, but with changed sign). The low-level functions include straight (to plot lines characterized by a point and a director vector), ellipses (to plot ellipses around a point, according to a quadratic form and a scaling), segments (to plot segments between two sets of points) and lines (to plot a piece-wise line along a set of points). All of them take the given objects in the corresponding geometry, apply the default isometric transform, interpolate densely the object to describe (segments, lines or ellipses), back-transform all the points interpolated and plot them using the standard R function lines.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
334
0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0 *
*
1.0
*
0.8 0.6
K
0.4 K
Na
Mg
K
Ca
K
0.2 0.0
1.0
*
0.8
*
*
0.6 Na
0.4 0.2
Na
K
Mg
Na
Na
Ca
0.0 *
*
1.0
*
0.8 0.6
Mg Mg
K
Mg
0.4
Na
Mg
Ca
0.2 0.0
1.0
*
0.8
*
*
0.6 Ca
0.4 0.2
Ca
K
Ca
Na
Ca
Mg
0.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0 0.2 0.4 0.6 0.8 1.0
Fig. 9. Panel plot of ternary diagrams obtained plotting y ¼ acomp(x[,c("K", "Na", "Mg", "Ca")]). Line marks the direction of the first principal component. Straight lines in Aitchison geometry do not usually look like conventional lines.
8. Distributions (layer 6) The sixth layer contains some useful functions to characterize random compositions. Their names follow the standard rule in R: if X is the name of the probability model, the density function may be computed with dX, and a simulated random sample is obtained with rX. The following distributions are included for each geometry: rplus: (censored) normal (norm), lognormal (lnorm), aplus: normal (Mateu-Figueras et al., 2002; Pawlowsky-Glahn et al., 2003), rcomp: Dirichlet(Dirichlet), uniform, (censored) normal, acomp: normal (Mateu-Figueras et al., 2003). Example 9. Note that the normal distributions are defined on the coordinates on each geometry (PawlowskyGlahn, 2003). To generate, for instance, a random sample of size n ¼ 100 from a (truncated) standard normal distribution on rcomp geometry, one has to follow the sequence: 4 MyData ¼ rcomp(Hydrochem[,c("Cl","SO4","HCO3")]) 4 mn ¼ mean(MyData) 4 vr ¼ var(MyData)
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338 0.0
0.2
0.4
0.6
0.8
335
1.0 1.0
Ca
Ca
0.8 0.6 K 0.4 K
Na
K
Mg
0.2 0.0
1.0 Ca
Ca
0.8 0.6 Na 0.4 0.2
K
Na
Na
Mg
0.0 1.0 Ca
Ca
0.8 0.6 Mg 0.4 K
Mg
Mg
0.2
Na
0.0 0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Fig. 10. Panel plot of ternary diagrams obtained with call plot(y,margin ¼ "Ca"), where y ¼ acomp(x[,c("K", "Na", "Mg", "Ca")]). Compare with plots 8 and 9, with respect to the choice of third vertex.
4 x ¼ rnorm.rcomp(n ¼ 100, mean ¼ mn, var ¼ vr) 4 plot(x) To generate a random data set in any of the other classes, one has to replace the two occurrences of the class name (first and fourth lines). 9. Source code The package is available from CRAN (R Development Core Team, 2004) at http://www.cran.r-project.org, or http://stat.boogaart.de/compositions/, its homepage. The source code is contained there, but the user should install the package as normal to access it, in the corresponding R system tree. Acknowledgments This package was partly programmed during a research stage funded by the German Academic Exchange Service (DAAD Project Ref. 314/A/04/33586) and the Catalan Agency of University and research grants (AGAUR Project Ref. 2004-BE-00147). The authors acknowledge the constructive criticisms of H. von Eynatten, and those of an anonymous reviewer.
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
336
0.0
0.2
0.4
0.6
0.8
1.0 1.0 0.8 0.6
Cl 0.4 0.2 0.0 1.0 0.8 0.6 SO4 0.4 0.2 0.0 1.0 0.8 0.6 HCO3 0.4 0.2 0.0 0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
0.4
0.6
0.8
1.0
Fig. 11. Panel plot of QQ plots, generated in Example 8. See example for further explanations.
Appendix A. Recipe for an analysis of compositions with R What follows is only an ordered bunch of suggestions on how to analyze a compositional data set with compositions in R. It is strongly recommended to use R scripts, since in this way repeating an analysis with another geometry only implies slight changes in step (1) and (5). (1) Answer the two questions of Section 2, and declare your data set to be of the adequate type, e.g. if your data is stored in data frame X, and you chose to view it as a closed composition with a relative geometry, then just declare it with the instruction Y ¼ acomp(X). (2) State the question to be solved by the analysis in compatible terms with the geometry you chose for the data set (if your data is a closed data set from a Piper plot do not try to get the degree of dilution!). (3) Decide which conventional statistical technique will answer your question. (4) If this technique . . . is a cluster analysis, a principal component or just a descriptive statistical analysis, call the adequate function, be it hclust(dist(Y)), princomp(Y) or summary(Y) (since they are included in the package); may deal with deficient-rank covariance matrices, or it just needs a concept of distance, call the corresponding function, giving as data cdt(Y); needs a full-rank covariance matrix, call the corresponding function, giving as data idt(Y).
ARTICLE IN PRESS K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
337
(5) Apply the inverse transformation in your chosen geometry (e.g. clr.inv or ilr.inv in an acomp framework, see Table 2) to those results of your statistical analysis which you want to interpret in terms of the original geometry (see next step for ideas, and be careful when applying these transformations to matrices, by rows or by columns). (6) Represent your results (a look at Section 7 may help): confidence or prediction regions (e.g. ellipses in normal distributions) may be added to existing plots by using the function ellipses or lines; lines (principal component, discriminant, regression analysis) may be added to existing plots by using the function straight; characteristic points (means, modes, regression line intercepts) may be added to existing plots by using the function plot with the optional argument add ¼ TRUE; characteristic points (means, modes, regression line intercepts) and vectors of change (director vector of a regression, difference between the means of two groups) may be plotted as bar plots, calling barplot. (7) Answer your question, taking into account the limitations of your chosen geometry (8) If this is not satisfactory, go back to step 1, change the geometry and be careful to invert your results in step 5 adequately.
References Aitchison, J., 1982. The statistical analysis of compositional data (with discussion). Journal of the Royal Statistical Society, Series B (Statistical Methodology) 44 (2), 139–177. Aitchison, J., 1986. The Statistical Analysis of Compositional Data, Monographs on Statistics and Applied Probability. Chapman & Hall Ltd., London, UK, (Reprinted in 2003 with additional material by The Blackburn Press), 416pp. Aitchison, J., 1997. The one-hour course in compositional data analysis or compositional data analysis is simple. In: Pawlowsky-Glahn, V. (Ed.), Proceedings of IAMG’97—The Third Annual Conference of the International Association for Mathematical Geology, International Center for Numerical Methods in Engineering (CIMNE), Barcelona (E), pp. 3–35. Aitchison, J., 2002. Simplicial inference. In: Viana, M.A.G., Richards, D.S.P. (Eds.), Algebraic Methods in Statistics and Probability, vol. 287. Contemporary Mathematics Series. American Mathematical Society, Providence, Rhode Island, USA, 340pp, pp. 1–22. Aitchison, J., Barcelo´-Vidal, C., Egozcue, J.J., Pawlowsky-Glahn, V., 2002. A concise guide for the algebraic–geometric structure of the simplex, the sample space for compositional data analysis. In: Bayer, U., Burger, H., Skala, W. (Eds.), Proceedings of IAMG’02—The Eighth Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, pp. 387–392, ISSN 09468978. Beardah, C.C., Baxter, M.J., 2005. An R library for compositional data analysis in Archaeometry. In: Mateu-Figueras, G., Barcelo´-Vidal, C. (Eds.), Second Compositional Data Analysis Workshop—CoDaWork’05, CD Proceedings. Universitat de Girona, ISBN 84-8458222-1. hhttp://ima.udg.edu/Activitats/CoDaWork05/CD/Session5/Beardah-Baxter.pdfi. Billheimer, D., Guttorp, P., Fagan, W., 2001. Statistical interpretation of species composition. Journal of the American Statistical Association 96 (456), 1205–1214. van den Boogaart, K.G., Tolosana-Delgado, R., 2006. Compositional data analysis with the R. In: Buccianti, A., Mateu-Figueras, G., Pawlowsky-Glahn, V. (Eds.), Compositional Data Analysis: From Theory to Practice. Special Publication of the Geological Society of London, 264, pp. 119–127. ISBN 1-86239-205-6. Bren, M., Batagelj, V., 2005. Compositional data analysis with R. In: Mateu-Figueras, G., Barcelo´-Vidal, C. (Eds.), Second Compositional Data Analysis Workshop—CoDaWork’05, CD Proceedings. Universitat de Girona. ISBN 84-8458-222-1. hhttp:// ima.udg.es/Activitats/CoDaWork05/i. Chayes, F., 1960. On correlation between variables of constant sum. Journal of Geophysical Research 65 (12), 4185–4193. Daunis-i-Estadella, J., Egozcue, J.J., Pawlowsky-Glahn, V., 2002. Least squares regression in the simplex. In: Bayer, U., Burger, H., Skala, W. (Eds.), Proceedings of IAMG’02—The Eighth Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, pp. 411–416. ISSN 0946-8978. Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., Barcelo´-Vidal, C., 2003. Isometric logratio transformations for compositional data analysis. Mathematical Geology 35 (3), 279–300. Gabriel, K.R., 1971. The biplot—graphic display of matrices with application to principal component analysis. Biometrika 58 (3), 453–467. Mateu-Figueras, G., Pawlowsky-Glahn, V., Martı´ n-Ferna´ndez, J.A., 2002. Normal in Rþ vs lognormal in R. In: Bayer, U., Burger, H., Skala, W. (Eds.), Proceedings of IAMG’02—The Eighth Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, pp. 305–310. ISSN 0946-8978.
ARTICLE IN PRESS 338
K.G. van den Boogaart, R. Tolosana-Delgado / Computers & Geosciences 34 (2008) 320–338
Mateu-Figueras, G., Pawlowsky-Glahn, V., Barcelo´-Vidal, C., 2003. Distributions on the simplex. In: Thio´-Henestrosa, S., Martı´ nFerna´ndez, J.A. (Eds.), Compositional Data Analysis Workshop—CoDaWork’03, CD Proceedings, Universitat de Girona. ISBN 84-8458-111-X. hhttp://ima.udg.es/Activitats/CoDaWork03/i. Otero, N., Tolosana-Delgado, R., Soler, A., Pawlowsky-Glahn, V., Canals, A., 2005. Relative vs absolute analysis of compositions: a comparative analysis in surface waters of a Mediterranean river. Water Research 39, 1404–1414. Pawlowsky-Glahn, V., 2003. Statistical modelling on coordinates. In: Thio´-Henestrosa, S., Martı´ n-Ferna´ndez, J.A. (Eds.), Compositional Data Analysis Workshop—CoDaWork’03, CD Proceedings, Universitat de Girona. ISBN 84-8458-111-X. hhttp://ima.udg.es/ Activitats/CoDaWork03/i. Pawlowsky-Glahn, V., Egozcue, J.J., 2001. Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA) 15 (5), 384–398. Pawlowsky-Glahn, V., Egozcue, J.J., Burger, H., 2003. An alternative model for the statistical analysis of bivariate positive measurements. In: Cubitt, J. (Ed.), CD Proceedings of IAMG’03—The Ninth Annual Conference of the International Association for Mathematical Geology, University of Portsmouth, Portsmouth, UK. R Development Core Team, 2004. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3. URL hhttp://www.R-project.orgi. Rehder, S., Zier, U., 2002. Some remarks about transformations. In: Bayer, U., Burger, H., Skala, W. (Eds.), Proceedings of IAMG’02— The Eighth Annual Conference of the International Association for Mathematical Geology, Berlin, Germany, pp. 423–428. ISSN 0946-8978. Shurtz, R.F., 2003. Compositional geometry and mass conservation. Mathematical Geology 35 (8), 927–937. Thio´-Henestrosa, S., Martı´ n-Ferna´ndez, J.A., 2005. Dealing with compositional data: the freeware CoDaPack. Mathematical Geology 37 (7), 773–793. Tolosana-Delgado, R., Otero, N., Pawlowsky-Glahn, V., 2005a. Some basic concepts of compositional geometry. Mathematical Geology 37 (7), 673–680. Tolosana-Delgado, R., Otero, N., Pawlowsky-Glahn, V., Soler, A., 2005b. Some basic concepts of compositional geometry. Mathematical Geology 37 (7), 681–702.